[Data Mining for Beginners] Quickly Building a Competition Baseline with Tree Models, Plus Tips for Going Further

2025-07-30
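This article is an introductory tutorial for data mining competitions. Using the Vehicle Loan Default Prediction Challenge as an example, it shows how to build a quick baseline with the LightGBM tree model. It covers loading the data and reducing memory usage, exploratory data analysis (EDA), and feature filtering, then trains the model with 5-fold cross-validation and writes out the predictions. It also shares ideas for going further, to help beginners form a systematic picture of competitions and get started.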


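Project overview:

This project is an introductory tutorial for competitions. It demonstrates how to build a competition baseline quickly with a tree model and shares ideas for improving on it. The hope is to help beginners form a systematic understanding of competitions, get started more easily, and achieve good results.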

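About the LightGBM tree model:

LightGBM is a gradient boosting framework built on decision trees, developed by Microsoft. Like XGBoost it implements gradient-boosted decision trees (GBDT), but it is a separate project rather than something built on top of XGBoost. It supports parallel training, and its histogram-based split finding and leaf-wise tree growth typically give lower memory usage and faster training than XGBoost's original exact split search.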

LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/

Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html

Usage guide: "The LightGBM operations you should know" (你应该知道的LightGBM各种操作!)

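Advantages of tree models: tree models are a powerful tool for generating rules. From a set of data with features and labels they derive decision rules and present them as a tree structure, which can be used for both classification and regression problems.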
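To make "deriving decision rules from labeled data" concrete, here is a minimal sketch of the smallest possible tree, a one-split decision stump. It is not how LightGBM finds splits; the function name `best_stump` and the toy credit-score data are made up for illustration.

```python
def best_stump(values, labels):
    """Find the threshold on one numeric feature that misclassifies the fewest rows.

    Returns (threshold, label_if_at_or_below, label_if_above, errors).
    """
    best = None
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        # Try both ways of assigning labels to the two sides of the split.
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if v <= threshold else right) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[3]:
                best = (threshold, left, right, errors)
    return best


# Toy data (made up): credit scores and whether the loan defaulted.
scores = [300, 420, 510, 640, 700, 780]
defaulted = [1, 1, 1, 0, 0, 0]
threshold, left, right, errors = best_stump(scores, defaulted)
print(f"if score <= {threshold}: predict {left}, else predict {right} ({errors} errors)")
```

A full decision tree repeats this search recursively on each side of the split, and a boosted model adds many such trees together.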

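Tasks on tabular data are largely home turf for decision tree models, and boosted tree models such as XGBoost and LightGBM have become standard equipment in data mining competitions.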

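In [1]
# Install LightGBM
# Default (CPU) build
!pip install lightgbm
# GPU build, trains faster
# (this flag belongs to the LightGBM 3.x era; newer releases document a different GPU install procedure)
# !pip install lightgbm --install-option=--gpu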
       
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.16.4)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.33.6)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.22.1)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.3.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
       

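This walkthrough uses an iFLYTEK competition, the Vehicle Loan Default Prediction Challenge, as its example and builds the baseline with a tree model.

Competition page: http://challenge.xfyun.cn/topic/info?type=car-loan

Task: train a model on the training set and predict the value of the loan_default field in the test set, that is, whether the borrower will default on payments. A value of 1 means the customer is overdue and 0 means the customer is not overdue.

Runtime requirements: the hardware requirements are modest, and the CPU version is enough to run this project. Tree models generally only become memory-hungry when there are many features or the data is high-dimensional.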

In [2]
# Unzip the competition dataset
%cd /home/aistudio/data/data101719/
!unzip data.zip
       
/home/aistudio/data/data101719
Archive:  data.zip
  inflating: sample_submit.csv       
  inflating: test.csv                
  inflating: train.csv
In [3]
# Import dependencies
import gc
import time
import warnings

import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import KFold
from tqdm import tqdm

warnings.filterwarnings('ignore')

In [4]
# Memory optimization helper, to avoid running out of memory
def reduce_mem(df, cols):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in tqdm(cols):
        col_type = df[col].dtypes
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Downcast to the narrowest integer type whose range holds the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Floats are downcast by range only; float16 keeps roughly 3 significant digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    # Memory before, memory after, and percentage saved
    print('{:.2f} Mb, {:.2f} Mb ({:.2f} %)'.format(start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    gc.collect()
    return df
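One caveat about the float branch of `reduce_mem`: it checks only whether the column's range fits in float16, not whether the values survive the cast unchanged. The sketch below uses only the standard library (`struct` format `"e"` is IEEE 754 half precision) to show the rounding involved. The helper name `to_float16` and the sample values are made up for illustration.

```python
import struct


def to_float16(x):
    """Round-trip a Python float through IEEE 754 half precision (float16)."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Sample values (made up): a small fraction, a ratio, and an integer above 2048.
for value in (0.1, 0.7325, 2049.0):
    print(value, "->", to_float16(value))
```

Whether this loss matters depends on the feature. For a ratio column it is often acceptable, but for identifiers or large amounts stored as floats it may merge distinct values, in which case float32 is the safer lower bound.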
In [5]
# Load the competition data
train = pd.read_csv('./train.csv')  # training set
test = pd.read_csv('./test.csv')    # test set
# Reduce the memory footprint of both frames
train = reduce_mem(train, list(train.columns))
test = reduce_mem(test, list(test.columns))
       
100%|██████████| 53/53 [00:01<00:00, 42.04it/s]
100%|██████████| 52/52 [00:00<00:00, 559.02it/s]
       
60.65 Mb, 18.02 Mb (70.28 %)
11.90 Mb, 3.55 Mb (70.19 %)

In [6]
# Submission format required by the competition: 'customer_id', 'loan_default'
# 'loan_default' is the label to predict on the test set: 1 = overdue, 0 = not overdue.
sample_submit = pd.DataFrame(columns=['customer_id', 'loan_default'])
sample_submit['customer_id'] = test['customer_id']
   

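Data analysis (EDA):

Global analysis: the overall state of the data, including data types, size, and quality.

Univariate analysis: exploratory analysis of each variable, including categorical, continuous, and text variables.

Cross-feature analysis: how features relate to the label and how features relate to one another.

Train/test distribution analysis: a mismatch between the training and test distributions is a major cause of offline (local validation) scores disagreeing with online (leaderboard) scores.

Reference: Beginner's Competition Study Handbook (初学者竞赛学习手册)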
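As one possible way to carry out the train/test distribution check listed above, the sketch below compares the relative frequency of each category in a column across the two sets. The function name `share_gap` and the sample values are made up; with real data the two lists would come from the same column of `train` and `test`.

```python
from collections import Counter


def share_gap(train_values, test_values):
    """Per-category difference in relative frequency (test share minus train share)."""
    train_counts = Counter(train_values)
    test_counts = Counter(test_values)
    gaps = {}
    for category in sorted(set(train_counts) | set(test_counts)):
        train_share = train_counts[category] / len(train_values)
        test_share = test_counts[category] / len(test_values)
        gaps[category] = round(test_share - train_share, 4)
    return gaps


# Made-up values for a categorical column such as employment_type.
train_col = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
test_col = [0, 1, 1, 2, 2, 2, 2, 2, 2, 2]
print(share_gap(train_col, test_col))
```

Large gaps, or categories that appear in only one of the two sets, are a hint that a feature may behave differently online than in local validation.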

In [7]
# Overview of the data. The dataset has many fields, so making good use of the features is one of the main challenges.
train.info()
       
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   customer_id                    150000 non-null  int32  
 1   main_account_loan_no           150000 non-null  int16  
 2   main_account_active_loan_no    150000 non-null  int16  
 3   main_account_overdue_no        150000 non-null  int8   
 4   main_account_outstanding_loan  150000 non-null  int32  
 5   main_account_sanction_loan     150000 non-null  int32  
 6   main_account_disbursed_loan    150000 non-null  int32  
 7   sub_account_loan_no            150000 non-null  int8   
 8   sub_account_active_loan_no     150000 non-null  int8   
 9   sub_account_overdue_no         150000 non-null  int8   
 10  sub_account_outstanding_loan   150000 non-null  int32  
 11  sub_account_sanction_loan      150000 non-null  int32  
 12  sub_account_disbursed_loan     150000 non-null  int32  
 13  disbursed_amount               150000 non-null  int32  
 14  asset_cost                     150000 non-null  int32  
 15  branch_id                      150000 non-null  int8   
 16  supplier_id                    150000 non-null  int16  
 17  manufacturer_id                150000 non-null  int8   
 18  area_id                        150000 non-null  int8   
 19  employee_code_id               150000 non-null  int16  
 20  mobileno_flag                  150000 non-null  int8   
 21  idcard_flag                    150000 non-null  int8   
 22  Driving_flag                   150000 non-null  int8   
 23  passport_flag                  150000 non-null  int8   
 24  credit_score                   150000 non-null  int16  
 25  main_account_monthly_payment   150000 non-null  int32  
 26  sub_account_monthly_payment    150000 non-null  int32  
 27  last_six_month_new_loan_no     150000 non-null  int8   
 28  last_six_month_defaulted_no    150000 non-null  int8   
 29  average_age                    150000 non-null  int8   
 30  credit_history                 150000 non-null  int8   
 31  enquirie_no                    150000 non-null  int8   
 32  loan_to_asset_ratio            150000 non-null  float16
 33  total_account_loan_no          150000 non-null  int16  
 34  sub_account_inactive_loan_no   150000 non-null  int16  
 35  total_inactive_loan_no         150000 non-null  int8   
 36  main_account_inactive_loan_no  150000 non-null  int16  
 37  total_overdue_no               150000 non-null  int8   
 38  total_outstanding_loan         150000 non-null  int32  
 39  total_sanction_loan            150000 non-null  int32  
 40  total_disbursed_loan           150000 non-null  int32  
 41  total_monthly_payment          150000 non-null  int32  
 42  outstanding_disburse_ratio     150000 non-null  float64
 43  main_account_tenure            150000 non-null  int32  
 44  sub_account_tenure             150000 non-null  int32  
 45  disburse_to_sactioned_ratio    150000 non-null  float32
 46  active_to_inactive_act_ratio   150000 non-null  float16
 47  year_of_birth                  150000 non-null  int16  
 48  disbursed_date                 150000 non-null  int16  
 49  Credit_level                   150000 non-null  int8   
 50  employment_type                150000 non-null  int8   
 51  age                            150000 non-null  int8   
 52  loan_default                   150000 non-null  int8   
dtypes: float16(2), float32(1), float64(1), int16(10), int32(17), int8(22)
memory usage: 18.0 MB
In [8]
# Count the distinct values in each field; fields with a nunique of 1 carry no information and are dropped.
train.nunique()
       
customer_id                      150000
main_account_loan_no                104
main_account_active_loan_no          35
main_account_overdue_no              19
main_account_outstanding_loan     48609
main_account_sanction_loan        30564
main_account_disbursed_loan       32862
sub_account_loan_no                  36
sub_account_active_loan_no           21
sub_account_overdue_no                8
sub_account_outstanding_loan       2108
sub_account_sanction_loan          1519
sub_account_disbursed_loan         1725
disbursed_amount                  19235
asset_cost                        38902
branch_id                            82
supplier_id                        2888
manufacturer_id                      10
area_id                              22
employee_code_id                   3241
mobileno_flag                         1
idcard_flag                           1
Driving_flag                          2
passport_flag                         2
credit_score                        570
main_account_monthly_payment      21499
sub_account_monthly_payment        1304
last_six_month_new_loan_no           24
last_six_month_defaulted_no          14
average_age                         100
credit_history                      100
enquirie_no                          23
loan_to_asset_ratio                1994
total_account_loan_no               103
sub_account_inactive_loan_no         90
total_inactive_loan_no               27
main_account_inactive_loan_no        91
total_overdue_no                     19
total_outstanding_loan            49406
total_sanction_loan               31216
total_disbursed_loan              33557
total_monthly_payment             21843
outstanding_disburse_ratio         4391
main_account_tenure               12816
sub_account_tenure                 1230
disburse_to_sactioned_ratio         375
active_to_inactive_act_ratio        211
year_of_birth                        48
disbursed_date                        1
Credit_level                         14
employment_type                       3
age                                  48
loan_default                          2
dtype: int64
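The rule "drop fields with a single distinct value" can be sketched without pandas as follows. The function name `constant_columns` and the miniature table are made up; only the three column names are taken from the listing above.

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


# Made-up miniature table; the values are illustrative only.
table = {
    "mobileno_flag": [1, 1, 1, 1],
    "Driving_flag": [0, 1, 0, 0],
    "disbursed_date": [2019, 2019, 2019, 2019],
}
print(constant_columns(table))
```

On the real DataFrame the same selection can be written as `train.columns[train.nunique() == 1]`, which for the counts listed above would pick out mobileno_flag, idcard_flag, and disbursed_date.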
               

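Feature engineering (important!):

1. Feature interaction: combining features with one another and deriving new features from existing ones.

2. Feature encoding: one-hot encoding, label encoding, and so on.

3. Feature selection: analyzing feature importance and correlation to prune useless features.

Feature engineering is largely about helping the model learn. Where the model learns poorly or struggles to learn at all, hand-picked features and hand-built feature combinations make those patterns easier to pick up, which in turn leads to better results.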
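The first two items in the list above can be illustrated with a small dependency-free sketch: a label encoder and a ratio feature derived from two columns. The helper names `label_encode` and `safe_ratio`, the string categories, and the numbers are all made up for illustration and are not taken from the competition data.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    codes = {value: code for code, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


def safe_ratio(numerators, denominators):
    """Element-wise ratio feature; 0.0 where the denominator is zero."""
    return [n / d if d else 0.0 for n, d in zip(numerators, denominators)]


# Made-up rows: a categorical column and two count columns.
employment = ["salaried", "self_employed", "salaried", "unknown"]
overdue_count = [0, 2, 1, 0]
loan_count = [4, 5, 0, 2]

encoded, mapping = label_encode(employment)
overdue_rate = safe_ratio(overdue_count, loan_count)
print(encoded, mapping)
print(overdue_rate)
```

Tree models cope well with integer-coded categories, so label encoding is usually a cheaper first step than one-hot encoding for a baseline like this one.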

In [9]
# Drop useless columns: the ID, the label, and the fields with a single distinct value
all_cols = [f for f in train.columns if f not in ['customer_id', 'loan_default', 'mobileno_flag', 'idcard_flag', 'disbursed_date']]
   

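Building the baseline model:

This section mainly demonstrates how to build a competition baseline quickly with a tree model. Feature engineering and model tuning still need to be tailored to the specific competition.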

In [10]
# Training features
x_train = train[all_cols]
# Training labels
y_train = train['loan_default']
# Test set to predict on
x_test = test[all_cols]
In [11]
# Define the training and prediction function
def train_predict(clf, train_x, train_y, test_x, clf_name='lgb'):
    # 5-fold cross-validation
    folds = 5
    seed = 2025
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])  # out-of-fold predictions for the training set
    test = np.zeros(test_x.shape[0])    # test predictions averaged over the folds
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]

        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        # Tree model parameters
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2 ** 7,
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.01,
            'seed': 2025,
            'n_jobs': -1,
            'verbose': -1,
        }
        # Early stopping and the logging interval should be tuned case by case.
        # verbose_eval / early_stopping_rounds work with the LightGBM 3.x installed above;
        # LightGBM 4.x expects the lgb.early_stopping and lgb.log_evaluation callbacks instead.
        model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=500, early_stopping_rounds=200)
        # Predict on the validation fold
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        # Predict on the test set
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        # Print the validation scores so far
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # After training, print the importance of each feature.
    # `model` is the last fold's model, so this is its split count divided by 5, not an average over folds.
    print(pd.DataFrame({
        'column': all_cols,
        'importance': model.feature_importance()/5,
    }).sort_values(by='importance', ascending=False))
    return train, test
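The 5-fold scheme used in `train_predict` can be sketched in pure Python as follows: the rows are shuffled once, cut into five folds, and each fold serves as the validation set exactly once, which is why the function can fill in a prediction for every training row. This is an illustrative sketch only. The name `kfold_indices` is made up, and it does not reproduce the exact shuffling of scikit-learn's `KFold`.

```python
import random


def kfold_indices(n_rows, n_splits=5, seed=2025):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


oof_seen = []  # rows that have received an out-of-fold prediction
for fold, (train_idx, valid_idx) in enumerate(kfold_indices(12, n_splits=5), start=1):
    oof_seen.extend(valid_idx)
    print(fold, len(train_idx), len(valid_idx))
```

Because each validation fold is predicted by a model that never saw it, the out-of-fold predictions give an honest estimate of performance and can be reused later, for example to tune a decision threshold.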
In [12]
# Train the model and predict
lgb_train, lgb_test = train_predict(lgb, x_train, y_train, x_test)
       
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757221	valid_1's auc: 0.665608
Early stopping, best iteration is:
[648]	training's auc: 0.774819	valid_1's auc: 0.666395
[0.6663954692558639]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.756217	valid_1's auc: 0.6646
Early stopping, best iteration is:
[774]	training's auc: 0.786664	valid_1's auc: 0.665809
[0.6663954692558639, 0.6658088579217993]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757318	valid_1's auc: 0.664588
[1000]	training's auc: 0.809107	valid_1's auc: 0.665196
Early stopping, best iteration is:
[840]	training's auc: 0.794933	valid_1's auc: 0.665534
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.758371	valid_1's auc: 0.650627
[1000]	training's auc: 0.809869	valid_1's auc: 0.652059
Early stopping, best iteration is:
[996]	training's auc: 0.809559	valid_1's auc: 0.652149
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757135	valid_1's auc: 0.662366
Early stopping, best iteration is:
[692]	training's auc: 0.779432	valid_1's auc: 0.662648
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_scotrainre_list: [0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_score_mean: 0.6625071824914504
lgb_score_std: 0.005338481209206612
                           column  importance
18               employee_code_id      1421.0
15                    supplier_id      1374.6
14                      branch_id      1341.0
29            loan_to_asset_ratio      1307.0
12               disbursed_amount      1150.2
13                     asset_cost      1089.6
44                  year_of_birth       995.4
21                   credit_score       781.6
17                        area_id       760.6
39     outstanding_disburse_ratio       635.6
27                 credit_history       565.6
40            main_account_tenure       560.8
26                    average_age       560.8
22   main_account_monthly_payment       445.8
16                manufacturer_id       434.6
38          total_monthly_payment       371.6
3   main_account_outstanding_loan       339.4
43   active_to_inactive_act_ratio       304.6
35         total_outstanding_loan       264.8
36            total_sanction_loan       233.0
46                employment_type       228.6
4      main_account_sanction_loan       213.2
37           total_disbursed_loan       205.6
28                    enquirie_no       188.8
5     main_account_disbursed_loan       182.2
31   sub_account_inactive_loan_no       155.4
0            main_account_loan_no       155.4
25    last_six_month_defaulted_no       155.2
30          total_account_loan_no       152.6
42    disburse_to_sactioned_ratio       141.6
33  main_account_inactive_loan_no       134.4
1     main_account_active_loan_no       126.4
2         main_account_overdue_no       126.4
24     last_six_month_new_loan_no       122.8
47                            age       117.6
34               total_overdue_no        87.4
45                   Credit_level        53.4
19                   Driving_flag        27.0
23    sub_account_monthly_payment        12.4
41             sub_account_tenure        12.4
6             sub_account_loan_no        10.8
9    sub_account_outstanding_loan         8.0
20                  passport_flag         7.0
32         total_inactive_loan_no         5.6
10      sub_account_sanction_loan         5.6
11     sub_account_disbursed_loan         3.0
8          sub_account_overdue_no         0.2
7      sub_account_active_loan_no         0.2
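The validation metric in the training log above is AUC. A minimal sketch of what it measures is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function below is a slow quadratic illustration with made-up data, not a replacement for `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.40]
print(roc_auc(y_true, y_prob))
```

Read this way, a mean validation AUC of about 0.66 means a randomly chosen defaulter is scored above a randomly chosen non-defaulter roughly two times out of three, which leaves plenty of room for feature engineering.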
In [13]
# Save the predictions
sample_submit['loan_default'] = lgb_test
# The competition expects 0 or 1, so the predicted probabilities must be converted.
# Here values above 0.25 become 1 and values at or below 0.25 become 0.
sample_submit['loan_default'] = sample_submit['loan_default'].apply(lambda x: 1 if x > 0.25 else 0).values
# Write the submission file
sample_submit.to_csv('result.csv', index=False)
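The 0.25 cutoff above is a fixed choice. One possible refinement, assuming the leaderboard is scored with F1 (the notebook imports `f1_score` but never uses it), is to pick the cutoff that maximizes F1 on the out-of-fold predictions. The sketch below uses made-up data, and the helper names `f1_at_threshold` and `best_threshold` are made up as well.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 score of the positive class after cutting probabilities at `threshold`."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, probabilities, candidates):
    """Return the candidate threshold with the highest F1 (the first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, probabilities, t))


# Made-up data standing in for true labels and out-of-fold probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.18, 0.22, 0.27, 0.30, 0.41, 0.60]
grid = [0.15, 0.25, 0.35, 0.50]
print(best_threshold(y_true, y_prob, grid))
```

With the notebook's variables, `y_train` and the out-of-fold array `lgb_train` returned by `train_predict` would take the place of the made-up lists, and the chosen cutoff would replace 0.25 when building the submission.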
   
