Used car market price evaluation project proposal and eda

Posted by jiayuwu on July 22, 2018
Project proposal

Used-car Market Price Evaluation Based on Big Data - Project Proposal

Jiayu (Alice) Wu

This project aims to analyze how features of a used-car influence its market price and to predict the price by car features with data from on-line listing. The final product of the project is an on-line app where the users are provided market value estimation of a used-car given its features.

In this proposal, we firstly introduce the motivation and the objective for the project. Then, a dataset with 370000 observations of used-car listing on Ebay is introduced. In addition, some first-stage data cleaning and analysis is presented. Finally, the further extension based on the current progress is discussed.

Introduction

For a private buyer or seller who are new to the used-car market, a common confusion is about what to be expected. Therefore, this project attempts to provide a used-car evaluation service based on the on-line listing information, which has great potential in business application.

The objective of the project is to discuss the features of used-car with regard to its influence on the market price. Based on such analysis, a model for price prediction could be constructed in order to provide a real-time used-car evaluation service. The final product takes features (vehicle_type, brand, year_registration, power, gearbox, kilometer, etc.) of a used-car as input and output a prediction of price. Moreover, it should be capable of producing an interval prediction as price range, and it should be able to handle missing features so that the user can make a query even when they are not clear about certain feature. As a more ambitous attempt, a recommender system could be build such that we can not only evaluate the price, but also recommend features for car search by the potential buyers, based on the expected price and desired features.

The data used in the following analysis is rechieved from Kaggle (https://www.kaggle.com/orgesleka/used-cars-database/home). It contains 370000 raw observations of 20 features (29M) scrapted from used-car listing on Ebay-Kleinanzeigen(German). The high quantity and authencity makes this dataset useful for an exploratory analysis to verify the viability of this project. However, for the final product of this project, more recent data in USA should be scraped from web, which is not accomplished in the proposal stage due to the time limit.

In the following section, we are going to experiment with the dataset by data cleaning, visualization and xgboost regression, in order to reach an elementary understanding of each car feature and its influence on the market price.

Exploratory Analysis

Data Cleaning

Firstly, we load the raw data from kaggle, and import common packages for data presentation and visualization.

The raw data contains 371528 observations and 20 columns including the price and other information about the listing on-line.

As we are only interested in the features of the car itself, the irrelavant columns are dropped.

In [1]:
# import
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Data loading
raw_data = pd.read_csv('autos.csv', sep=',',encoding='cp1252')
In [2]:
print(raw_data.shape)
raw_data.sample(5)
(371528, 20)
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model kilometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
266399 2016-04-06 15:36:18 Volkswagen_Touran_1.4_TSI_MATCH privat Angebot 16500 test bus 2012 manuell 140 touran 60000 4 benzin volkswagen nein 2016-04-06 00:00:00 0 83355 2016-04-06 15:36:18
44631 2016-03-26 18:58:44 Audi_A5_Cabrio_2.0_TFSI_quattro_S_tronic privat Angebot 26800 test cabrio 2010 automatik 211 a5 70000 6 benzin audi nein 2016-03-26 00:00:00 0 57399 2016-04-06 08:17:39
232107 2016-03-09 13:54:27 BMW_Z3_Roadster_2.8 privat Angebot 15950 control cabrio 1998 manuell 192 z_reihe 80000 5 benzin bmw nein 2016-03-09 00:00:00 0 79098 2016-03-20 11:17:29
123979 2016-03-14 10:42:25 Ford_Ka_+++SUPER_EINSTIEGS___ODER_STADTAUTO++ privat Angebot 1490 control kleinwagen 2007 manuell 60 ka 100000 1 benzin ford nein 2016-03-14 00:00:00 0 22851 2016-04-07 10:45:00
64691 2016-03-28 22:49:14 Ford_Focus_1_4_75_PS_Tuev_02/2018 privat Angebot 1000 control NaN 2018 manuell 0 NaN 150000 0 benzin ford nein 2016-03-28 00:00:00 0 45665 2016-04-03 05:17:54
In [3]:
# drop irrelavent
raw_data.drop(['dateCrawled','name','seller','offerType','abtest','monthOfRegistration',
                 'dateCreated','nrOfPictures', 'lastSeen', 'postalCode', 'model'], axis='columns', inplace=True)
raw_data.drop_duplicates(['price','vehicleType','yearOfRegistration','gearbox',
                          'powerPS','kilometer','fuelType','brand','notRepairedDamage'])
print(raw_data.shape)
raw_data.sample(5)
(371528, 9)
Out[3]:
price vehicleType yearOfRegistration gearbox powerPS kilometer fuelType brand notRepairedDamage
328442 16499 kombi 2009 manuell 190 150000 diesel audi nein
87344 2750 bus 2005 manuell 69 90000 diesel opel nein
300452 10800 limousine 2008 manuell 184 100000 benzin mercedes_benz nein
46859 850 kleinwagen 1998 NaN 0 125000 benzin opel nein
56006 400 NaN 2016 manuell 90 150000 benzin volkswagen ja

The column 'monthOfRegistration' is dropped as it has little influence on the value of a car.

The column 'model' is specific to each brand, for now we drop it and reserve only the column 'brand' to examine the main features of a car.

The remaining 8 features are:

Categorical: vehicleType, gearbox, fuelType, brand
Numerical: yearOfRegistration(the year first registered), powerPS (power in PS), kilometer (mileage has driven)

Missing Values and Outliers

The missing values in the raw data is summarized below. All missing values apear in the categorical feature. The 'notRepairedDamage' has the most missing values, and the reason might be that this entry is not clearly defined and confusing for many sellers.

In order to reserve such infomation in the following exploration, we substitute null with a category 'missing'. By summarizing all the categorical features, it can be observed that the entries are all valid and without ambiguity now.

In [4]:
# missing data
raw_data.isnull().sum()
Out[4]:
price                     0
vehicleType           37869
yearOfRegistration        0
gearbox               20209
powerPS                   0
kilometer                 0
fuelType              33386
brand                     0
notRepairedDamage     72060
dtype: int64
In [5]:
raw_data['vehicleType'].fillna(value='missing', inplace=True)
raw_data['gearbox'].fillna(value='missing', inplace=True)
raw_data['fuelType'].fillna(value='missing', inplace=True)
raw_data['notRepairedDamage'].fillna(value='missing', inplace=True)
In [6]:
# examine categorical
print(raw_data.groupby('vehicleType').size())
print(raw_data.groupby('gearbox').size())
print(raw_data.groupby('fuelType').size())
print(raw_data.groupby('brand').size())
vehicleType
andere         3357
bus           30201
cabrio        22898
coupe         19015
kleinwagen    80023
kombi         67564
limousine     95894
missing       37869
suv           14707
dtype: int64
gearbox
automatik     77105
manuell      274214
missing       20209
dtype: int64
fuelType
andere        208
benzin     223857
cng           571
diesel     107746
elektro       104
hybrid        278
lpg          5378
missing     33386
dtype: int64
brand
alfa_romeo         2345
audi              32873
bmw               40274
chevrolet          1845
chrysler           1452
citroen            5182
dacia               900
daewoo              542
daihatsu            806
fiat               9676
ford              25573
honda              2836
hyundai            3646
jaguar              621
jeep                807
kia                2555
lada                225
lancia              484
land_rover          770
mazda              5695
mercedes_benz     35309
mini               3394
mitsubishi         3061
nissan             5037
opel              40136
peugeot           11027
porsche            2215
renault           17969
rover               490
saab                530
seat               7022
skoda              5641
smart              5249
sonstige_autos     3982
subaru              779
suzuki             2328
toyota             4694
trabant             591
volkswagen        79640
volvo              3327
dtype: int64

As for the numerical features (including price), we first display the boxplot, where outliers are very obvious. Therefore, we truncate the data by its quantiles below, as a result we discard 20/% or the raw data to obtain a working dataset with 291758 observations and 9 features (price included).

In [7]:
# examine numrical
fig = plt.figure(figsize =(18,5))  
ax1 = fig.add_subplot(141) 
ax1 = raw_data[['price']].boxplot()
ax2 = fig.add_subplot(142) 
ax2 = raw_data[['yearOfRegistration']].boxplot()
ax3 = fig.add_subplot(143) 
ax3 = raw_data[['powerPS']].boxplot()
ax4 = fig.add_subplot(144) 
ax4 = raw_data[['kilometer']].boxplot()
In [8]:
raw_data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].describe()
#raw_data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].quantile([0,0.01,0.05,0.9,0.95,0.999,1])
Out[8]:
price yearOfRegistration powerPS kilometer
count 3.715280e+05 371528.000000 371528.000000 371528.000000
mean 1.729514e+04 2004.577997 115.549477 125618.688228
std 3.587954e+06 92.866598 192.139578 40112.337051
min 0.000000e+00 1000.000000 0.000000 5000.000000
25% 1.150000e+03 1999.000000 70.000000 125000.000000
50% 2.950000e+03 2003.000000 105.000000 150000.000000
75% 7.200000e+03 2008.000000 150.000000 150000.000000
max 2.147484e+09 9999.000000 20000.000000 150000.000000
In [9]:
data = raw_data[(raw_data.price >= 500) & (raw_data.price <= 200000)
                & (raw_data.yearOfRegistration <= 2016) & (raw_data.yearOfRegistration >= 1950) 
                & (raw_data.powerPS >= 50) & (raw_data.powerPS <= 700)]
In [10]:
print(data.shape, data.shape[0]/raw_data.shape[0])
data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].describe()
(291758, 9) 0.7852920910402447
Out[10]:
price yearOfRegistration powerPS kilometer
count 291758.000000 291758.000000 291758.000000 291758.000000
mean 6668.121035 2003.556091 130.300588 124812.413027
std 8640.885020 6.538935 61.811810 39465.704551
min 500.000000 1950.000000 50.000000 5000.000000
25% 1690.000000 2000.000000 88.000000 100000.000000
50% 3850.000000 2004.000000 116.000000 150000.000000
75% 8500.000000 2008.000000 156.000000 150000.000000
max 200000.000000 2016.000000 700.000000 150000.000000

Visualization and Correlation

In order to examine the distribution of data over the categories, several barplots are displayed below. The number of observation in each category differs by large, therefore, one-hot encoding should be used in the prediction model. Besides, since there are too many missing values for 'notRepairedDamage', this feature may behave poor in model fitting.

In [11]:
columns = ['brand', 'vehicleType', 'gearbox', 'fuelType', 'notRepairedDamage', 'yearOfRegistration']
for i, col in enumerate(columns):
    counts = data.groupby(by=col)[col].count().sort_values(ascending=False)
    cat = counts.index
    r = range(len(cat))
    plt.figure()
    plt.title(col)
    plt.bar(r, counts)
    plt.xticks(r, cat)
    plt.show()
    print(cat)
Index(['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'toyota', 'citroen',
       'nissan', 'smart', 'mini', 'hyundai', 'volvo', 'mitsubishi', 'honda',
       'kia', 'sonstige_autos', 'porsche', 'alfa_romeo', 'suzuki', 'chevrolet',
       'chrysler', 'dacia', 'jeep', 'land_rover', 'subaru', 'jaguar',
       'daihatsu', 'saab', 'lancia', 'daewoo', 'rover', 'lada', 'trabant'],
      dtype='object', name='brand')
Index(['limousine', 'kleinwagen', 'kombi', 'bus', 'cabrio', 'coupe', 'suv',
       'missing', 'andere'],
      dtype='object', name='vehicleType')
Index(['manuell', 'automatik', 'missing'], dtype='object', name='gearbox')
Index(['benzin', 'diesel', 'missing', 'lpg', 'cng', 'hybrid', 'andere',
       'elektro'],
      dtype='object', name='fuelType')
Index(['nein', 'missing', 'ja'], dtype='object', name='notRepairedDamage')
Int64Index([2005, 2006, 2004, 2003, 1999, 2001, 2000, 2002, 2007, 2008, 2009,
            1998, 2010, 2011, 1997, 2012, 1996, 2013, 1995, 2016, 2014, 1994,
            2015, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1985, 1986, 1983,
            1984, 1982, 1980, 1979, 1978, 1981, 1972, 1977, 1974, 1973, 1976,
            1970, 1971, 1966, 1969, 1968, 1975, 1967, 1965, 1964, 1963, 1960,
            1961, 1959, 1962, 1957, 1958, 1951, 1955, 1956, 1954, 1953, 1950,
            1952],
           dtype='int64', name='yearOfRegistration')

Moreover, the correlation between features are also of interests, especially how the features of a car is related to its price.

Therefore, we encode the categorical (of 'object' type) features as intergers. Although the categorical variables are transformed to ordinal in this way, it provides a rough idea about the data for now.

In [12]:
for feature in data.columns: # Loop through all columns in the dataframe
    if data[feature].dtype == 'object': # Only apply for columns with categorical strings
        data[feature] = pd.Categorical(data[feature]).codes # Replace strings with an integer
/Users/alice/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
In [13]:
data.describe()
Out[13]:
price vehicleType yearOfRegistration gearbox powerPS kilometer fuelType brand notRepairedDamage
count 291758.000000 291758.000000 291758.000000 291758.000000 291758.000000 291758.000000 291758.000000 291758.000000 291758.000000
mean 6668.121035 4.539718 2003.556091 0.783269 130.300588 124812.413027 1.996830 20.245275 1.716762
std 8640.885020 1.833823 6.538935 0.446520 61.811810 39465.704551 1.510657 13.527708 0.600595
min 500.000000 0.000000 1950.000000 0.000000 50.000000 5000.000000 0.000000 0.000000 0.000000
25% 1690.000000 4.000000 2000.000000 1.000000 88.000000 100000.000000 1.000000 8.000000 2.000000
50% 3850.000000 5.000000 2004.000000 1.000000 116.000000 150000.000000 1.000000 21.000000 2.000000
75% 8500.000000 6.000000 2008.000000 1.000000 156.000000 150000.000000 3.000000 33.000000 2.000000
max 200000.000000 8.000000 2016.000000 2.000000 700.000000 150000.000000 7.000000 39.000000 2.000000
In [14]:
import seaborn as sns
plt.figure(figsize =(8,8))
plt.title('Feature Correlation Heat Map')
sns.heatmap(data.corr(),linewidths=.1,vmax=1.0,
            square=True,linecolor='',annot=True)
plt.savefig('1.png')

From the correlation heat map above, it can be observed that power of the car, the kilometer and the number of years used have a major influence on the price. It is understandable that the price is dependent on the performance of the used-car and how new it is.

Feature Importance by XGboost Regrssion

In this section, XGboost regression is performed on the preprocessed data to compare the importance of each feature.

XGboost is an improvement over classical regression tree method. Different from first-order gradient boosting, this method utilized the hessian information and introduced regularization. Besides, it prunes the tree from a specified 'max_depth' instead of gready search.

By cross-validation, the key parameters 'max_depth' and 'min_child_weight' (the minimum sum of weights of all observations required in a child to control overfitting) are set to be 7 and 1 respectively. Thus, the model achieved a score (rmse) of 0.86 on the validation dataset, which provides a reliable reference on the feature importance measure. Whereas, for actual prediction modeling the presion can be further improved by one-hot encoding and further parameter tuning.

In [15]:
import xgboost as xgb
from sklearn.grid_search import GridSearchCV
train_size = int(.8*len(data))
train, test = np.split(data.sample(frac=1), [train_size])
y_train = train.pop('price')
y_test = test.pop('price')
/Users/alice/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Users/alice/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
In [272]:
# cross-validation
cv_params = {'max_depth': [7,5], 'min_child_weight': [1,3]}
ind_params = {'learning_rate':.1,'n_estimators':1500, 'seed':0, 'subsample':0.8, 'colsample_bytree':0.8, 'objective':'reg:linear'}
optimized_GBM = GridSearchCV(xgb.XGBRegressor(**ind_params), cv_params, scoring = 'neg_mean_squared_error', cv=5, n_jobs = -1)
optimized_GBM.fit(train, y_train)
GridSearchCV(estimator=xgb.XGBRegressor(max_depth=3, learning_rate=0.15, n_estimators=1000, silent=True, 
                                  objective='reg:linear', booster='gbtree', n_jobs=-1, nthread=None, gamma=0, 
                                  min_child_weight=1, max_delta_step=0, subsample=0.8, colsample_bytree=0.8, colsample_bylevel=0.8, 
                                  reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=0),
       param_grid = cv_params, scoring='neg_mean_squared_error',n_jobs=1,iid=False, cv=5)
Out[272]:
GridSearchCV(cv=5, error_score='raise',
       estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
       colsample_bytree=0.8, gamma=0, learning_rate=0.15, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=-1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, silent=True,
       subsample=0.8),
       fit_params={}, iid=False, n_jobs=1,
       param_grid={'max_depth': [7, 5], 'min_child_weight': [1, 3]},
       pre_dispatch='2*n_jobs', refit=True,
       scoring='neg_mean_squared_error', verbose=0)
In [273]:
print(optimized_GBM.grid_scores_)
[mean: -10237369.31810, std: 488905.96840, params: {'max_depth': 7, 'min_child_weight': 1}, mean: -10358005.72493, std: 343735.67644, params: {'max_depth': 7, 'min_child_weight': 3}, mean: -10875220.49527, std: 582351.37739, params: {'max_depth': 5, 'min_child_weight': 1}, mean: -10934536.91494, std: 593701.21865, params: {'max_depth': 5, 'min_child_weight': 3}]
In [16]:
# fit
model = xgb.XGBRegressor(max_depth=7, learning_rate=0.1, n_estimators=1500, silent=True, 
                        objective='reg:linear', booster='gbtree', n_jobs=-1, nthread=None, gamma=0, 
                        min_child_weight=1, max_delta_step=0, subsample=0.8, colsample_bytree=0.8, colsample_bylevel=0.8, 
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=0)
model.fit(train,y_train)
y_pred = model.predict(test)
In [17]:
model.score(test, y_test)
Out[17]:
0.8589745881122444
In [51]:
imp = pd.Series(model.feature_importances_)
sorting = np.argsort(imp)[::-1]
feat_imp = imp[sorting]
names = list(train.columns[sorting])
plt.figure(figsize =(12,8))
feat_imp.plot(kind='barh', title='Feature Importances for XGboost Regression')
plt.yticks(range(train.shape[1]),names)
plt.savefig('2.png')
feat_imp.index = names
print(feat_imp)
powerPS               0.237770
yearOfRegistration    0.201186
kilometer             0.144472
brand                 0.143449
vehicleType           0.122665
gearbox               0.054877
fuelType              0.049146
notRepairedDamage     0.046436
dtype: float32

It can be observed the performance (power) and the depreciation of the used car (used year and mileage) are the most important features, followed by the brand and type for the vehicle, which is close to intuition. Besides, notRepairedDamage, fuel and gearbox are relatively less important. In fact, the distribution of data over these categories are skewed and with a lot of missing, since not every private seller is familiar with such features.

Further Steps

In the previous analysis, the influence of 8 features on used-car market price was discussed. Based on the current progress, this project may go further in the next steps.

For a more precise prediction model, we should employ one-hot encoding to use indicator variables to represent categorical labels. Moreover, the hyperparameters for xgboost regression can be further tuned by cross-validation. Other non-parametric model like neural network or kernel regression can also be tried. In addition, a more flexible price evaluation should return an interval prediction as price range instead of a point estimator, bayesian methods could be considered for the interval estimation.

Furthermore, a recommender system could be built in order to recommend features for car search by the potential buyers. This service would be valuable for users who has little experience with car or with the used-car market. For example, one can input a price range and a vehicle type preference, and we can recommend several brands that he or she may consider when the buyer is not sure about what to look at. This task is possible to be accomplished by matrix factorization.