Used-car Market Price Evaluation Based on Big Data - Project Proposal¶

Jiayu (Alice) Wu

This project aims to analyze how features of a used-car influence its market price and to predict the price by car features with data from on-line listing. The final product of the project is an on-line app where the users are provided market value estimation of a used-car given its features.

In this proposal, we firstly introduce the motivation and the objective for the project. Then, a dataset with 370000 observations of used-car listing on Ebay is introduced. In addition, some first-stage data cleaning and analysis is presented. Finally, the further extension based on the current progress is discussed.

Introduction¶

For a private buyer or seller who are new to the used-car market, a common confusion is about what to be expected. Therefore, this project attempts to provide a used-car evaluation service based on the on-line listing information, which has great potential in business application.

The objective of the project is to discuss the features of used-car with regard to its influence on the market price. Based on such analysis, a model for price prediction could be constructed in order to provide a real-time used-car evaluation service. The final product takes features (vehicle_type, brand, year_registration, power, gearbox, kilometer, etc.) of a used-car as input and output a prediction of price. Moreover, it should be capable of producing an interval prediction as price range, and it should be able to handle missing features so that the user can make a query even when they are not clear about certain feature. As a more ambitous attempt, a recommender system could be build such that we can not only evaluate the price, but also recommend features for car search by the potential buyers, based on the expected price and desired features.

The data used in the following analysis is rechieved from Kaggle (https://www.kaggle.com/orgesleka/used-cars-database/home). It contains 370000 raw observations of 20 features (29M) scrapted from used-car listing on Ebay-Kleinanzeigen(German). The high quantity and authencity makes this dataset useful for an exploratory analysis to verify the viability of this project. However, for the final product of this project, more recent data in USA should be scraped from web, which is not accomplished in the proposal stage due to the time limit.

In the following section, we are going to experiment with the dataset by data cleaning, visualization and xgboost regression, in order to reach an elementary understanding of each car feature and its influence on the market price.

Exploratory Analysis¶

Data Cleaning¶

Firstly, we load the raw data from kaggle, and import common packages for data presentation and visualization.

The raw data contains 371528 observations and 20 columns including the price and other information about the listing on-line.

As we are only interested in the features of the car itself, the irrelavant columns are dropped.

# import
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Data loading
raw_data = pd.read_csv('autos.csv', sep=',',encoding='cp1252')

print(raw_data.shape)
raw_data.sample(5)

(371528, 20)

# drop irrelavent
raw_data.drop(['dateCrawled','name','seller','offerType','abtest','monthOfRegistration',
                 'dateCreated','nrOfPictures', 'lastSeen', 'postalCode', 'model'], axis='columns', inplace=True)
raw_data.drop_duplicates(['price','vehicleType','yearOfRegistration','gearbox',
                          'powerPS','kilometer','fuelType','brand','notRepairedDamage'])
print(raw_data.shape)
raw_data.sample(5)

(371528, 9)

The column 'monthOfRegistration' is dropped as it has little influence on the value of a car.

The column 'model' is specific to each brand, for now we drop it and reserve only the column 'brand' to examine the main features of a car.

The remaining 8 features are:

Categorical: vehicleType, gearbox, fuelType, brand
Numerical: yearOfRegistration(the year first registered), powerPS (power in PS), kilometer (mileage has driven)

Missing Values and Outliers¶

The missing values in the raw data is summarized below. All missing values apear in the categorical feature. The 'notRepairedDamage' has the most missing values, and the reason might be that this entry is not clearly defined and confusing for many sellers.

In order to reserve such infomation in the following exploration, we substitute null with a category 'missing'. By summarizing all the categorical features, it can be observed that the entries are all valid and without ambiguity now.

# missing data
raw_data.isnull().sum()

price                     0
vehicleType           37869
yearOfRegistration        0
gearbox               20209
powerPS                   0
kilometer                 0
fuelType              33386
brand                     0
notRepairedDamage     72060
dtype: int64

raw_data['vehicleType'].fillna(value='missing', inplace=True)
raw_data['gearbox'].fillna(value='missing', inplace=True)
raw_data['fuelType'].fillna(value='missing', inplace=True)
raw_data['notRepairedDamage'].fillna(value='missing', inplace=True)

# examine categorical
print(raw_data.groupby('vehicleType').size())
print(raw_data.groupby('gearbox').size())
print(raw_data.groupby('fuelType').size())
print(raw_data.groupby('brand').size())

vehicleType
andere         3357
bus           30201
cabrio        22898
coupe         19015
kleinwagen    80023
kombi         67564
limousine     95894
missing       37869
suv           14707
dtype: int64
gearbox
automatik     77105
manuell      274214
missing       20209
dtype: int64
fuelType
andere        208
benzin     223857
cng           571
diesel     107746
elektro       104
hybrid        278
lpg          5378
missing     33386
dtype: int64
brand
alfa_romeo         2345
audi              32873
bmw               40274
chevrolet          1845
chrysler           1452
citroen            5182
dacia               900
daewoo              542
daihatsu            806
fiat               9676
ford              25573
honda              2836
hyundai            3646
jaguar              621
jeep                807
kia                2555
lada                225
lancia              484
land_rover          770
mazda              5695
mercedes_benz     35309
mini               3394
mitsubishi         3061
nissan             5037
opel              40136
peugeot           11027
porsche            2215
renault           17969
rover               490
saab                530
seat               7022
skoda              5641
smart              5249
sonstige_autos     3982
subaru              779
suzuki             2328
toyota             4694
trabant             591
volkswagen        79640
volvo              3327
dtype: int64

As for the numerical features (including price), we first display the boxplot, where outliers are very obvious. Therefore, we truncate the data by its quantiles below, as a result we discard 20/% or the raw data to obtain a working dataset with 291758 observations and 9 features (price included).

# examine numrical
fig = plt.figure(figsize =(18,5))  
ax1 = fig.add_subplot(141) 
ax1 = raw_data[['price']].boxplot()
ax2 = fig.add_subplot(142) 
ax2 = raw_data[['yearOfRegistration']].boxplot()
ax3 = fig.add_subplot(143) 
ax3 = raw_data[['powerPS']].boxplot()
ax4 = fig.add_subplot(144) 
ax4 = raw_data[['kilometer']].boxplot()

raw_data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].describe()
#raw_data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].quantile([0,0.01,0.05,0.9,0.95,0.999,1])

data = raw_data[(raw_data.price >= 500) & (raw_data.price <= 200000)
                & (raw_data.yearOfRegistration <= 2016) & (raw_data.yearOfRegistration >= 1950) 
                & (raw_data.powerPS >= 50) & (raw_data.powerPS <= 700)]

print(data.shape, data.shape[0]/raw_data.shape[0])
data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].describe()

(291758, 9) 0.7852920910402447

Visualization and Correlation¶

In order to examine the distribution of data over the categories, several barplots are displayed below. The number of observation in each category differs by large, therefore, one-hot encoding should be used in the prediction model. Besides, since there are too many missing values for 'notRepairedDamage', this feature may behave poor in model fitting.

columns = ['brand', 'vehicleType', 'gearbox', 'fuelType', 'notRepairedDamage', 'yearOfRegistration']
for i, col in enumerate(columns):
    counts = data.groupby(by=col)[col].count().sort_values(ascending=False)
    cat = counts.index
    r = range(len(cat))
    plt.figure()
    plt.title(col)
    plt.bar(r, counts)
    plt.xticks(r, cat)
    plt.show()
    print(cat)

Index(['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'toyota', 'citroen',
       'nissan', 'smart', 'mini', 'hyundai', 'volvo', 'mitsubishi', 'honda',
       'kia', 'sonstige_autos', 'porsche', 'alfa_romeo', 'suzuki', 'chevrolet',
       'chrysler', 'dacia', 'jeep', 'land_rover', 'subaru', 'jaguar',
       'daihatsu', 'saab', 'lancia', 'daewoo', 'rover', 'lada', 'trabant'],
      dtype='object', name='brand')

Index(['limousine', 'kleinwagen', 'kombi', 'bus', 'cabrio', 'coupe', 'suv',
       'missing', 'andere'],
      dtype='object', name='vehicleType')

Index(['manuell', 'automatik', 'missing'], dtype='object', name='gearbox')

Index(['benzin', 'diesel', 'missing', 'lpg', 'cng', 'hybrid', 'andere',
       'elektro'],
      dtype='object', name='fuelType')

Index(['nein', 'missing', 'ja'], dtype='object', name='notRepairedDamage')

Int64Index([2005, 2006, 2004, 2003, 1999, 2001, 2000, 2002, 2007, 2008, 2009,
            1998, 2010, 2011, 1997, 2012, 1996, 2013, 1995, 2016, 2014, 1994,
            2015, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1985, 1986, 1983,
            1984, 1982, 1980, 1979, 1978, 1981, 1972, 1977, 1974, 1973, 1976,
            1970, 1971, 1966, 1969, 1968, 1975, 1967, 1965, 1964, 1963, 1960,
            1961, 1959, 1962, 1957, 1958, 1951, 1955, 1956, 1954, 1953, 1950,
            1952],
           dtype='int64', name='yearOfRegistration')

Moreover, the correlation between features are also of interests, especially how the features of a car is related to its price.

Therefore, we encode the categorical (of 'object' type) features as intergers. Although the categorical variables are transformed to ordinal in this way, it provides a rough idea about the data for now.

for feature in data.columns: # Loop through all columns in the dataframe
    if data[feature].dtype == 'object': # Only apply for columns with categorical strings
        data[feature] = pd.Categorical(data[feature]).codes # Replace strings with an integer

/Users/alice/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until

data.describe()

import seaborn as sns
plt.figure(figsize =(8,8))
plt.title('Feature Correlation Heat Map')
sns.heatmap(data.corr(),linewidths=.1,vmax=1.0,
            square=True,linecolor='',annot=True)
plt.savefig('1.png')

From the correlation heat map above, it can be observed that power of the car, the kilometer and the number of years used have a major influence on the price. It is understandable that the price is dependent on the performance of the used-car and how new it is.

Feature Importance by XGboost Regrssion¶

In this section, XGboost regression is performed on the preprocessed data to compare the importance of each feature.

XGboost is an improvement over classical regression tree method. Different from first-order gradient boosting, this method utilized the hessian information and introduced regularization. Besides, it prunes the tree from a specified 'max_depth' instead of gready search.

By cross-validation, the key parameters 'max_depth' and 'min_child_weight' (the minimum sum of weights of all observations required in a child to control overfitting) are set to be 7 and 1 respectively. Thus, the model achieved a score (rmse) of 0.86 on the validation dataset, which provides a reliable reference on the feature importance measure. Whereas, for actual prediction modeling the presion can be further improved by one-hot encoding and further parameter tuning.

import xgboost as xgb
from sklearn.grid_search import GridSearchCV
train_size = int(.8*len(data))
train, test = np.split(data.sample(frac=1), [train_size])
y_train = train.pop('price')
y_test = test.pop('price')

/Users/alice/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Users/alice/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

# cross-validation
cv_params = {'max_depth': [7,5], 'min_child_weight': [1,3]}
ind_params = {'learning_rate':.1,'n_estimators':1500, 'seed':0, 'subsample':0.8, 'colsample_bytree':0.8, 'objective':'reg:linear'}
optimized_GBM = GridSearchCV(xgb.XGBRegressor(**ind_params), cv_params, scoring = 'neg_mean_squared_error', cv=5, n_jobs = -1)
optimized_GBM.fit(train, y_train)
GridSearchCV(estimator=xgb.XGBRegressor(max_depth=3, learning_rate=0.15, n_estimators=1000, silent=True, 
                                  objective='reg:linear', booster='gbtree', n_jobs=-1, nthread=None, gamma=0, 
                                  min_child_weight=1, max_delta_step=0, subsample=0.8, colsample_bytree=0.8, colsample_bylevel=0.8, 
                                  reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=0),
       param_grid = cv_params, scoring='neg_mean_squared_error',n_jobs=1,iid=False, cv=5)

GridSearchCV(cv=5, error_score='raise',
       estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.8,
       colsample_bytree=0.8, gamma=0, learning_rate=0.15, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=-1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, silent=True,
       subsample=0.8),
       fit_params={}, iid=False, n_jobs=1,
       param_grid={'max_depth': [7, 5], 'min_child_weight': [1, 3]},
       pre_dispatch='2*n_jobs', refit=True,
       scoring='neg_mean_squared_error', verbose=0)

print(optimized_GBM.grid_scores_)

[mean: -10237369.31810, std: 488905.96840, params: {'max_depth': 7, 'min_child_weight': 1}, mean: -10358005.72493, std: 343735.67644, params: {'max_depth': 7, 'min_child_weight': 3}, mean: -10875220.49527, std: 582351.37739, params: {'max_depth': 5, 'min_child_weight': 1}, mean: -10934536.91494, std: 593701.21865, params: {'max_depth': 5, 'min_child_weight': 3}]

# fit
model = xgb.XGBRegressor(max_depth=7, learning_rate=0.1, n_estimators=1500, silent=True, 
                        objective='reg:linear', booster='gbtree', n_jobs=-1, nthread=None, gamma=0, 
                        min_child_weight=1, max_delta_step=0, subsample=0.8, colsample_bytree=0.8, colsample_bylevel=0.8, 
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=0)
model.fit(train,y_train)
y_pred = model.predict(test)

model.score(test, y_test)

0.8589745881122444

imp = pd.Series(model.feature_importances_)
sorting = np.argsort(imp)[::-1]
feat_imp = imp[sorting]
names = list(train.columns[sorting])
plt.figure(figsize =(12,8))
feat_imp.plot(kind='barh', title='Feature Importances for XGboost Regression')
plt.yticks(range(train.shape[1]),names)
plt.savefig('2.png')
feat_imp.index = names
print(feat_imp)

powerPS               0.237770
yearOfRegistration    0.201186
kilometer             0.144472
brand                 0.143449
vehicleType           0.122665
gearbox               0.054877
fuelType              0.049146
notRepairedDamage     0.046436
dtype: float32

It can be observed the performance (power) and the depreciation of the used car (used year and mileage) are the most important features, followed by the brand and type for the vehicle, which is close to intuition. Besides, notRepairedDamage, fuel and gearbox are relatively less important. In fact, the distribution of data over these categories are skewed and with a lot of missing, since not every private seller is familiar with such features.

Further Steps¶

In the previous analysis, the influence of 8 features on used-car market price was discussed. Based on the current progress, this project may go further in the next steps.

For a more precise prediction model, we should employ one-hot encoding to use indicator variables to represent categorical labels. Moreover, the hyperparameters for xgboost regression can be further tuned by cross-validation. Other non-parametric model like neural network or kernel regression can also be tried. In addition, a more flexible price evaluation should return an interval prediction as price range instead of a point estimator, bayesian methods could be considered for the interval estimation.

Furthermore, a recommender system could be built in order to recommend features for car search by the potential buyers. This service would be valuable for users who has little experience with car or with the used-car market. For example, one can input a price range and a vehicle type preference, and we can recommend several brands that he or she may consider when the buyer is not sure about what to look at. This task is possible to be accomplished by matrix factorization.

	price	yearOfRegistration	powerPS	kilometer
count	3.715280e+05	371528.000000	371528.000000	371528.000000
mean	1.729514e+04	2004.577997	115.549477	125618.688228
std	3.587954e+06	92.866598	192.139578	40112.337051
min	0.000000e+00	1000.000000	0.000000	5000.000000
25%	1.150000e+03	1999.000000	70.000000	125000.000000
50%	2.950000e+03	2003.000000	105.000000	150000.000000
75%	7.200000e+03	2008.000000	150.000000	150000.000000
max	2.147484e+09	9999.000000	20000.000000	150000.000000

	price	yearOfRegistration	powerPS	kilometer
count	291758.000000	291758.000000	291758.000000	291758.000000
mean	6668.121035	2003.556091	130.300588	124812.413027
std	8640.885020	6.538935	61.811810	39465.704551
min	500.000000	1950.000000	50.000000	5000.000000
25%	1690.000000	2000.000000	88.000000	100000.000000
50%	3850.000000	2004.000000	116.000000	150000.000000
75%	8500.000000	2008.000000	156.000000	150000.000000
max	200000.000000	2016.000000	700.000000	150000.000000

	price	vehicleType	yearOfRegistration	gearbox	powerPS	kilometer	fuelType	brand	notRepairedDamage
count	291758.000000	291758.000000	291758.000000	291758.000000	291758.000000	291758.000000	291758.000000	291758.000000	291758.000000
mean	6668.121035	4.539718	2003.556091	0.783269	130.300588	124812.413027	1.996830	20.245275	1.716762
std	8640.885020	1.833823	6.538935	0.446520	61.811810	39465.704551	1.510657	13.527708	0.600595
min	500.000000	0.000000	1950.000000	0.000000	50.000000	5000.000000	0.000000	0.000000	0.000000
25%	1690.000000	4.000000	2000.000000	1.000000	88.000000	100000.000000	1.000000	8.000000	2.000000
50%	3850.000000	5.000000	2004.000000	1.000000	116.000000	150000.000000	1.000000	21.000000	2.000000
75%	8500.000000	6.000000	2008.000000	1.000000	156.000000	150000.000000	3.000000	33.000000	2.000000
max	200000.000000	8.000000	2016.000000	2.000000	700.000000	150000.000000	7.000000	39.000000	2.000000

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	kilometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
266399	2016-04-06 15:36:18	Volkswagen_Touran_1.4_TSI_MATCH	privat	Angebot	16500	test	bus	2012	manuell	140	touran	60000	4	benzin	volkswagen	nein	2016-04-06 00:00:00	83355	2016-04-06 15:36:18
44631	2016-03-26 18:58:44	Audi_A5_Cabrio_2.0_TFSI_quattro_S_tronic	privat	Angebot	26800	test	cabrio	2010	automatik	211	a5	70000	6	benzin	audi	nein	2016-03-26 00:00:00	57399	2016-04-06 08:17:39
232107	2016-03-09 13:54:27	BMW_Z3_Roadster_2.8	privat	Angebot	15950	control	cabrio	1998	manuell	192	z_reihe	80000	5	benzin	bmw	nein	2016-03-09 00:00:00	79098	2016-03-20 11:17:29
123979	2016-03-14 10:42:25	Ford_Ka_+++SUPER_EINSTIEGS___ODER_STADTAUTO++	privat	Angebot	1490	control	kleinwagen	2007	manuell	60	ka	100000	1	benzin	ford	nein	2016-03-14 00:00:00	22851	2016-04-07 10:45:00
64691	2016-03-28 22:49:14	Ford_Focus_1_4_75_PS_Tuev_02/2018	privat	Angebot	1000	control	NaN	2018	manuell	0	NaN	150000	0	benzin	ford	nein	2016-03-28 00:00:00	45665	2016-04-03 05:17:54

	price	vehicleType	yearOfRegistration	gearbox	powerPS	kilometer	fuelType	brand	notRepairedDamage
328442	16499	kombi	2009	manuell	190	150000	diesel	audi	nein
87344	2750	bus	2005	manuell	69	90000	diesel	opel	nein
300452	10800	limousine	2008	manuell	184	100000	benzin	mercedes_benz	nein
46859	850	kleinwagen	1998	NaN	0	125000	benzin	opel	nein
56006	400	NaN	2016	manuell	90	150000	benzin	volkswagen	ja