[Data Mining] Used Car Transaction Price Prediction

Alex_Shen
2021-11-30 (last updated 2022-04-06)

GitHub source code

I. Data Exploration

The dataset has the following format:
[figure: dataset format]

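The features fall into three groups:
1. Date features: regDate, creatDate
2. Categorical features: name, model, brand, bodyType, fuelType, gearbox, notRepairedDamage, regionCode, seller, offerType
3. Numeric features: power, kilometer, and 15 anonymous features
The exploration focuses on each feature's missing rate and nunique (number of distinct values), to spot features with too many missing values or too few distinct values; such features generally contribute little to model learning. The numeric features power and kilometer have relatively few distinct values, possibly because the data was preprocessed and some precision was removed. seller and offerType have only two, or even a single, distinct value, so they can be dropped: they do not help the model, and their feature importance is 0.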
[figure]
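
The missing-rate and nunique check described above can be sketched as follows. This is a minimal illustration on a made-up toy frame; the helper name summarize_features and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing rate and number of distinct values per column."""
    summary = pd.DataFrame({
        'missing_rate': df.isna().mean(),
        'nunique': df.nunique(dropna=True),
    })
    return summary.sort_values('missing_rate', ascending=False)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    'bodyType': [0.0, np.nan, 2.0, np.nan],
    'seller': [0, 0, 0, 0],
    'power': [60, 75, 75, 110],
})
summary = summarize_features(toy)
print(summary)

# Columns with a single distinct value carry no information for the model.
constant_cols = summary.index[summary['nunique'] <= 1].tolist()
print(constant_cols)
```

Run on the real df_feature, the same summary would surface candidates such as seller and offerType for removal.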

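The distributions of the anonymous features are shown in the figure below. These features all rank fairly high in the final model's feature importance, so they are worth exploring.
[figure: distributions of the anonymous features]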

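II. Data Processing

(1) Missing-value handling

Missing values are concentrated in bodyType, fuelType and gearbox. A car's specifications usually depend strongly on its brand and model, so each missing value is filled with the mode of that field within the same brand and model.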

import numpy as np

cols = ['bodyType', 'fuelType', 'gearbox']
# Group key: brand + model. The separator keeps e.g. brand 1 / model 23
# from colliding with brand 12 / model 3.
gp = df_feature['brand'].astype(str) + '_' + df_feature['model'].astype(str)


def group_mode(s):
    # Series.mode() skips NaN; a group with no observed value stays NaN.
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan


for col in cols:
    # Most frequent value within each brand/model group, aligned to the rows.
    fill = df_feature.groupby(gp)[col].transform(group_mode)
    df_feature[col] = df_feature[col].fillna(fill)

# Remaining missing values per column
df_feature[cols].isnull().sum()

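(2) Dropping uninformative features

seller and offerType have only two, or even one, distinct value, so they can be dropped; they contribute nothing to model learning and their feature importance is 0.

del df_feature['seller']
del df_feature['offerType']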

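(3) Target distribution transform

For regression problems, bringing the target closer to a normal distribution generally helps prediction, so log1p is applied to the price.
[figure: price distribution before and after log1p]

import numpy as np

df_feature['price'] = np.log1p(df_feature['price'])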
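
A model fit on the transformed column predicts on the log scale, so predictions have to be mapped back with the inverse transform before they are compared with real prices. The round trip is sketched below on made-up prices; it is an illustration, not a step shown in the original post.

```python
import numpy as np

# Made-up prices, for illustration only.
price = np.array([500.0, 3000.0, 15000.0, 99999.0])

log_price = np.log1p(price)      # scale the model is trained on
restored = np.expm1(log_price)   # inverse transform, back to price units

print(log_price)
print(restored)
```

np.expm1 is the exact inverse of np.log1p, so an MAE computed after the inverse transform is in price units, while an MAE computed before it is in log units and the two are not comparable.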

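III. Feature Engineering

(1) Basic features

The two date features, the car's registration date (regDate) and the listing date (creatDate), can be subtracted to get the car's age at the time of sale; here the age is expressed in both days and years. The registration year itself also has a sizeable effect on price. Some dates in the data are invalid, with a month of 0; these are handled by setting the month to 1.

# Assumes regDate and creatDate have already been parsed to datetime64.
df_feature['car_age_day'] = (df_feature['creatDate'] - df_feature['regDate']).dt.days
df_feature['car_age_year'] = (df_feature['car_age_day'] / 365).round(1)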
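
The month-0 repair mentioned above is not shown in the post. A minimal sketch follows, assuming the raw dates are stored as YYYYMMDD integers; that storage format, the helper name parse_yyyymmdd, and the toy values are all assumptions.

```python
import pandas as pd


def parse_yyyymmdd(col: pd.Series) -> pd.Series:
    """Parse YYYYMMDD values, replacing an invalid month 00 with 01."""
    s = col.astype(str).str.zfill(8)
    year, month, day = s.str[:4], s.str[4:6], s.str[6:8]
    month = month.replace('00', '01')
    # Anything still unparseable becomes NaT instead of raising.
    return pd.to_datetime(year + month + day, format='%Y%m%d', errors='coerce')


# Toy rows; the second regDate has month 00.
toy = pd.DataFrame({'regDate': [20040402, 20070009],
                    'creatDate': [20160404, 20160309]})
reg = parse_yyyymmdd(toy['regDate'])
creat = parse_yyyymmdd(toy['creatDate'])
car_age_day = (creat - reg).dt.days
print(reg.tolist())
print(car_age_day.tolist())
```

With errors='coerce', any other malformed date ends up as NaT, which keeps the subtraction from failing and leaves the gap visible as a missing value.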

For categorical features, a count feature can be computed to reflect how popular an item is on the market.

df_feature['name_count'] = df_feature.groupby('name')['SaleID'].transform('count')

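Numeric features are usually aggregated within categorical features. For example, the mean, std, max and min of each anonymous feature can be computed per car brand.

from tqdm import tqdm

# v_cols (the 15 anonymous feature names) is defined in section (2) below.
# stat is the author's helper and is not shown in this post; judging from the
# call, it adds per-group statistics of f2 as new columns.
cat_cols = ['name', 'model', 'brand', 'bodyType']
for f1 in tqdm(cat_cols):
    for f2 in v_cols:
        df_feature = stat(df_feature, df_feature, [f1], {
            f2: ['mean', 'max', 'min', 'std']})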
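
Since the body of stat is not given, the following is only a guess at what such a helper might look like: it aggregates one frame by the key columns and left-joins the statistics onto another. The name group_stat, the output column naming, and the toy data are hypothetical and may differ from the author's implementation.

```python
import pandas as pd


def group_stat(df_target, df_source, keys, aggs):
    """Aggregate df_source by `keys` and left-join the result onto df_target.

    Hypothetical stand-in for the post's `stat` helper, whose code is not shown.
    """
    df_agg = df_source.groupby(keys).agg(aggs)
    # Flatten the (column, statistic) MultiIndex into names like brand_v_0_mean.
    prefix = '_'.join(keys)
    df_agg.columns = [f'{prefix}_{col}_{fn}' for col, fn in df_agg.columns]
    return df_target.merge(df_agg.reset_index(), on=keys, how='left')


toy = pd.DataFrame({'brand': [1, 1, 2], 'v_0': [1.0, 3.0, 5.0]})
out = group_stat(toy, toy, ['brand'], {'v_0': ['mean', 'max']})
print(out)
```

Keeping the source and target frames as separate arguments lets the statistics come from one subset (for example, training rows only) while being attached to another.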

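The target price is also numeric, so it can be aggregated by category in the same way, for example as the average transaction price of a brand or a model. This is known as target encoding. Computing these statistics with labels from the whole training set leaks the label, however, so a five-fold scheme is typically used: the labels of four folds produce the statistics that serve as features for the remaining fold.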
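
The five-fold scheme can be sketched as below. This is a minimal illustration, not the author's code: the function name kfold_target_mean, the default seed, and the toy data are assumptions, and only the mean is computed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def kfold_target_mean(df_train, df_test, key, target, n_splits=5, seed=2020):
    """Out-of-fold mean of `target` per `key`; returns (train_feature, test_feature)."""
    train_feat = pd.Series(np.nan, index=df_train.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for trn_idx, val_idx in kfold.split(df_train):
        # Statistics come from the other folds only, so a row never sees its own label.
        means = df_train.iloc[trn_idx].groupby(key)[target].mean()
        train_feat.iloc[val_idx] = df_train.iloc[val_idx][key].map(means).to_numpy()
    # Test rows have no labels, so the full training set can be used for them.
    test_feat = df_test[key].map(df_train.groupby(key)[target].mean())
    return train_feat, test_feat


# Toy data: two brands with made-up prices; brand 2 appears only in the test set.
toy_train = pd.DataFrame({'brand': [0] * 5 + [1] * 5,
                          'price': [1., 2., 3., 4., 5., 10., 20., 30., 40., 50.]})
toy_test = pd.DataFrame({'brand': [0, 1, 2]})
tr, te = kfold_target_mean(toy_train, toy_test, 'brand', 'price')
print(tr.tolist())
print(te.tolist())
```

Categories that never occur in the statistics folds (such as brand 2 here) come out as NaN, which tree models can handle directly or which can be filled with a global mean.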

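(2) Anonymous features

As a simple start, compute row-wise statistics over the 15 anonymous features of each car to get v_mean, v_max, v_min and v_std, then aggregate these four features per listing name (name). In this competition, name is itself a very important feature.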

v_cols = [f'v_{i}' for i in range(15)]

# Row-wise statistics over the 15 anonymous features
df_feature['v_mean'] = df_feature[v_cols].mean(axis=1)
df_feature['v_max'] = df_feature[v_cols].max(axis=1)
df_feature['v_min'] = df_feature[v_cols].min(axis=1)
df_feature['v_std'] = df_feature[v_cols].std(axis=1)

# Aggregate each row-wise statistic within the same listing name
for col in ['v_mean', 'v_max', 'v_min', 'v_std']:
    for agg in ['mean', 'std', 'max', 'min']:
        df_feature[f'name_{col}_{agg}'] = df_feature.groupby('name')[
            col].transform(agg)

Because the business meaning of the anonymous features is unknown, the only option is to try second- and third-order combinations of them using sums and differences. After selection, the following features were kept:

# Second-order sums
df_feature['v_0_add_v_4'] = df_feature['v_0'] + df_feature['v_4']
df_feature['v_0_add_v_8'] = df_feature['v_0'] + df_feature['v_8']
df_feature['v_1_add_v_3'] = df_feature['v_1'] + df_feature['v_3']
df_feature['v_1_add_v_4'] = df_feature['v_1'] + df_feature['v_4']
df_feature['v_1_add_v_5'] = df_feature['v_1'] + df_feature['v_5']
df_feature['v_1_add_v_12'] = df_feature['v_1'] + df_feature['v_12']
df_feature['v_2_add_v_3'] = df_feature['v_2'] + df_feature['v_3']
df_feature['v_4_add_v_11'] = df_feature['v_4'] + df_feature['v_11']
df_feature['v_4_add_v_12'] = df_feature['v_4'] + df_feature['v_12']

# Third-order combinations
df_feature['v_0_add_v_12_add_v_14'] = df_feature['v_0'] + \
    df_feature['v_12'] + df_feature['v_14']
df_feature['v_4_add_v_9_minu_v_13'] = df_feature['v_4'] + \
    df_feature['v_9'] - df_feature['v_13']
df_feature['v_2_add_v_4_minu_v_11'] = df_feature['v_2'] + \
    df_feature['v_4'] - df_feature['v_11']
df_feature['v_2_add_v_3_minu_v_11'] = df_feature['v_2'] + \
    df_feature['v_3'] - df_feature['v_11']
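
The post lists only the combinations that survived selection. One way the candidate pool of pairwise sums might be generated is sketched below; the helper name add_pairwise_sums and the toy data are assumptions, and the selection criterion itself is not stated in the post, so none is shown.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def add_pairwise_sums(df, cols):
    """Return a frame holding a + b for every unordered pair of columns in `cols`."""
    new_cols = {f'{a}_add_{b}': df[a] + df[b] for a, b in combinations(cols, 2)}
    return pd.DataFrame(new_cols, index=df.index)


# Toy stand-in for three anonymous features (random values).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(4, 3)), columns=['v_0', 'v_1', 'v_2'])
candidates = add_pairwise_sums(toy, ['v_0', 'v_1', 'v_2'])
print(candidates.columns.tolist())
```

With 15 anonymous features this yields 105 pairwise sums, which is why a filtering step is needed before the features go into the model.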

IV. Model Training

Three tree models, LightGBM, XGBoost and CatBoost, are trained and make predictions separately.

LightGBM:

import gc

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import KFold

ycol = 'price'
drop_cols = [ycol, 'SaleID', 'regDate', 'creatDate', 'creatDate_year', 'creatDate_month']
feature_names = [c for c in df_train.columns if c not in drop_cols]

# seed is assumed to be defined earlier in the notebook.
model = lgb.LGBMRegressor(num_leaves=64,
                          max_depth=8,
                          learning_rate=0.08,
                          n_estimators=10000000,  # upper bound only; early stopping ends training
                          subsample=0.75,  # only takes effect when subsample_freq > 0
                          feature_fraction=0.75,
                          reg_alpha=0.7,
                          reg_lambda=1.2,
                          random_state=seed,
                          metric=None)

prediction = df_test[['SaleID']].copy()  # copy avoids SettingWithCopyWarning
prediction['price'] = 0.0
oof = []  # out-of-fold predictions, needed later for blending

kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
for fold_id, (trn_idx, val_idx) in enumerate(kfold.split(df_train[feature_names])):
    X_train = df_train.iloc[trn_idx][feature_names]
    Y_train = df_train.iloc[trn_idx][ycol]

    X_val = df_train.iloc[val_idx][feature_names]
    Y_val = df_train.iloc[val_idx][ycol]

    print('\nFold_{} Training ================================\n'.format(fold_id+1))

    # LightGBM 4.x takes early stopping and logging as callbacks instead of
    # the fit() arguments early_stopping_rounds / verbose.
    lgb_model = model.fit(X_train,
                          Y_train,
                          eval_names=['train', 'valid'],
                          eval_set=[(X_train, Y_train), (X_val, Y_val)],
                          eval_metric='mae',
                          callbacks=[lgb.early_stopping(500),
                                     lgb.log_evaluation(500)])

    pred_val = lgb_model.predict(
        X_val, num_iteration=lgb_model.best_iteration_)
    df_oof = df_train.iloc[val_idx][['SaleID', ycol]].copy()
    df_oof['lgb_pred'] = pred_val
    oof.append(df_oof)

    pred_test = lgb_model.predict(
        df_test[feature_names], num_iteration=lgb_model.best_iteration_)
    prediction['price'] += pred_test / kfold.n_splits  # average over folds

    del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()

df_oof_lgb = pd.concat(oof)

Single-model score: [figure]

XGBoost:

import gc

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

ycol = 'price'
drop_cols = [ycol, 'SaleID', 'regDate', 'creatDate', 'creatDate_year', 'creatDate_month']
feature_names = [c for c in df_train.columns if c not in drop_cols]

# num_leaves, feature_fraction and metric from the original call are LightGBM
# names that XGBoost does not recognise, so they are dropped here.
# XGBoost >= 2.0 expects eval_metric and early_stopping_rounds in the constructor.
model = xgb.XGBRegressor(max_depth=8,
                         learning_rate=0.08,
                         n_estimators=10000000,  # upper bound only; early stopping ends training
                         subsample=0.75,
                         reg_alpha=0.7,
                         reg_lambda=1.2,
                         random_state=seed,
                         tree_method='hist',
                         eval_metric='mae',
                         early_stopping_rounds=500)

prediction = df_test[['SaleID']].copy()  # copy avoids SettingWithCopyWarning
prediction['price'] = 0.0
oof = []  # out-of-fold predictions, needed later for blending

kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
for fold_id, (trn_idx, val_idx) in enumerate(kfold.split(df_train[feature_names])):
    X_train = df_train.iloc[trn_idx][feature_names]
    Y_train = df_train.iloc[trn_idx][ycol]

    X_val = df_train.iloc[val_idx][feature_names]
    Y_val = df_train.iloc[val_idx][ycol]

    print('\nFold_{} Training ================================\n'.format(fold_id+1))

    # Early stopping monitors the last entry of eval_set (the validation fold).
    xgb_model = model.fit(X_train,
                          Y_train,
                          eval_set=[(X_train, Y_train), (X_val, Y_val)],
                          verbose=1000)

    # After early stopping, predict() uses the best iteration by default.
    pred_val = xgb_model.predict(X_val)
    df_oof = df_train.iloc[val_idx][['SaleID', ycol]].copy()
    df_oof['xgb_pred'] = pred_val
    oof.append(df_oof)

    pred_test = xgb_model.predict(df_test[feature_names])
    prediction['price'] += pred_test / kfold.n_splits  # average over folds

    del xgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()

df_oof_xgb = pd.concat(oof)

Single-model score: [figure]

CatBoost:

import gc

import catboost as ctb
import pandas as pd
from sklearn.model_selection import KFold

ycol = 'price'
drop_cols = [ycol, 'SaleID', 'regDate', 'creatDate', 'creatDate_year', 'creatDate_month']
feature_names = [c for c in df_train.columns if c not in drop_cols]

model = ctb.CatBoostRegressor(
    learning_rate=0.08,
    depth=10,
    subsample=0.75,
    n_estimators=100000,  # upper bound only; early stopping ends training
    loss_function='RMSE',
    random_seed=seed,
)

prediction = df_test[['SaleID']].copy()  # copy avoids SettingWithCopyWarning
prediction['price'] = 0.0
oof = []  # out-of-fold predictions, needed later for blending

kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
for fold_id, (trn_idx, val_idx) in enumerate(kfold.split(df_train[feature_names])):
    X_train = df_train.iloc[trn_idx][feature_names]
    Y_train = df_train.iloc[trn_idx][ycol]

    X_val = df_train.iloc[val_idx][feature_names]
    Y_val = df_train.iloc[val_idx][ycol]

    print('\nFold_{} Training ================================\n'.format(fold_id+1))

    # Early stopping needs a validation set; without eval_set it has no effect.
    ctb_model = model.fit(X_train,
                          Y_train,
                          eval_set=(X_val, Y_val),
                          verbose=1000,
                          early_stopping_rounds=500)

    pred_val = ctb_model.predict(X_val)
    df_oof = df_train.iloc[val_idx][['SaleID', ycol]].copy()
    df_oof['ctb_pred'] = pred_val
    oof.append(df_oof)

    pred_test = ctb_model.predict(df_test[feature_names])
    prediction['price'] += pred_test / kfold.n_splits  # average over folds

    del ctb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()

df_oof_ctb = pd.concat(oof)

Single-model score: [figure]

Model blending:

The three models are blended with fixed weights, and the weights are found by enumeration.

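from sklearn.metrics import mean_absolute_error

# df_oof must hold the true price plus each model's out-of-fold
# predictions in the columns ctb_pred, xgb_pred and lgb_pred.
min_mae = float('inf')
minA, minB, minC = 0, 0, 0
# Weights are integer percentages with a + b + c = 100;
# range(101) lets any weight reach 100 or drop to 0.
for a in range(101):
    for b in range(101 - a):
        c = 100 - a - b
        df_oof['pred'] = (a / 100 * df_oof['ctb_pred']
                          + b / 100 * df_oof['xgb_pred']
                          + c / 100 * df_oof['lgb_pred'])
        mae = mean_absolute_error(df_oof['price'], df_oof['pred'])
        if mae < min_mae:
            minA, minB, minC = a, b, c
            min_mae = mae
            print(min_mae, minA, minB, minC)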
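
The enumeration above relies on a single df_oof frame that carries all three models' out-of-fold predictions, and the post does not show how that frame is built. A minimal sketch, assuming each model's predictions sit in their own frame keyed by SaleID; the frame names and the toy values are hypothetical.

```python
import pandas as pd

# Toy out-of-fold frames; in practice each would come from one model's CV loop.
df_oof_lgb = pd.DataFrame({'SaleID': [0, 1, 2],
                           'price': [8.0, 9.0, 10.0],
                           'lgb_pred': [8.1, 8.8, 10.3]})
df_oof_xgb = pd.DataFrame({'SaleID': [2, 0, 1], 'xgb_pred': [10.1, 7.9, 9.2]})
df_oof_ctb = pd.DataFrame({'SaleID': [1, 2, 0], 'ctb_pred': [9.1, 9.9, 8.0]})

# Join on SaleID so rows line up even if the frames are ordered differently.
df_oof = (df_oof_lgb
          .merge(df_oof_xgb, on='SaleID', how='inner')
          .merge(df_oof_ctb, on='SaleID', how='inner'))
print(df_oof)
```

Merging on the key rather than concatenating columns side by side guards against silent misalignment when the models were validated with different fold orders.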

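[figure]
The final predicted price is then a simple weighted average based on these scores: 0.58 * ctb_pred + 0.28 * xgb_pred + 0.13 * lgb_pred (weights as reported; they sum to 0.99).
The final online score was 425.3107.
[figure]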
