Used-car Market Price Evaluation Based on Big Data - Project Proposal¶
Jiayu (Alice) Wu
This project aims to analyze how the features of a used car influence its market price and to predict the price from those features using data from online listings. The final product is an online app that provides users with a market-value estimate of a used car given its features.
In this proposal, we first introduce the motivation and objective of the project. Then, a dataset with about 370,000 observations of used-car listings on eBay is introduced, and some first-stage data cleaning and analysis is presented. Finally, further extensions based on the current progress are discussed.
Introduction¶
For a private buyer or seller who is new to the used-car market, a common source of confusion is what price to expect. This project therefore attempts to provide a used-car evaluation service based on online listing information, a service with considerable potential for business application.
The objective of the project is to examine the features of a used car with regard to their influence on the market price. Based on such analysis, a price-prediction model can be constructed to provide a real-time used-car evaluation service. The final product takes the features of a used car (vehicle_type, brand, year_registration, power, gearbox, kilometer, etc.) as input and outputs a predicted price. Moreover, it should be capable of producing an interval prediction as a price range, and it should handle missing features so that users can make a query even when they are unsure about a certain feature. As a more ambitious attempt, a recommender system could be built so that we not only evaluate the price but also recommend features for a car search to potential buyers, based on the expected price and desired features.
The data used in the following analysis are retrieved from Kaggle (https://www.kaggle.com/orgesleka/used-cars-database/home). The dataset contains about 370,000 raw observations of 20 features (roughly 29 MB) scraped from used-car listings on eBay-Kleinanzeigen (German). Its size and authenticity make it useful for an exploratory analysis to verify the viability of this project. However, for the final product, more recent data from the USA should be scraped from the web, which has not been done at the proposal stage due to time constraints.
In the following section, we experiment with the dataset through data cleaning, visualization, and XGBoost regression, in order to reach an elementary understanding of each car feature and its influence on the market price.
Exploratory Analysis¶
Data Cleaning¶
First, we load the raw data from Kaggle and import common packages for data presentation and visualization.
The raw data contains 371,528 observations and 20 columns, including the price and other information about the online listing.
As we are only interested in the features of the car itself, the irrelevant columns are dropped.
# import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Data loading
raw_data = pd.read_csv('autos.csv', sep=',',encoding='cp1252')
print(raw_data.shape)
raw_data.sample(5)
# drop irrelevant columns
raw_data.drop(['dateCrawled','name','seller','offerType','abtest','monthOfRegistration',
               'dateCreated','nrOfPictures', 'lastSeen', 'postalCode', 'model'], axis='columns', inplace=True)
# drop duplicate listings (drop_duplicates is not in-place, so keep the result)
raw_data = raw_data.drop_duplicates(['price','vehicleType','yearOfRegistration','gearbox',
                                     'powerPS','kilometer','fuelType','brand','notRepairedDamage'])
print(raw_data.shape)
raw_data.sample(5)
The column 'monthOfRegistration' is dropped as it has little influence on the value of a car.
The column 'model' is specific to each brand; for now we drop it and retain only the column 'brand' when examining the main features of a car.
The remaining 8 features are:
Categorical: vehicleType, gearbox, fuelType, brand, notRepairedDamage
Numerical: yearOfRegistration (the year of first registration), powerPS (power in PS), kilometer (mileage driven)
Missing Values and Outliers¶
The missing values in the raw data are summarized below. All missing values appear in the categorical features. 'notRepairedDamage' has the most missing values, possibly because this entry is not clearly defined and is confusing for many sellers.
In order to preserve this information in the following exploration, we substitute nulls with a category 'missing'. Summarizing all the categorical features shows that the entries are now all valid and unambiguous.
# missing data
raw_data.isnull().sum()
raw_data['vehicleType'].fillna(value='missing', inplace=True)
raw_data['gearbox'].fillna(value='missing', inplace=True)
raw_data['fuelType'].fillna(value='missing', inplace=True)
raw_data['notRepairedDamage'].fillna(value='missing', inplace=True)
# examine categorical
print(raw_data.groupby('vehicleType').size())
print(raw_data.groupby('gearbox').size())
print(raw_data.groupby('fuelType').size())
print(raw_data.groupby('brand').size())
As for the numerical features (including price), we first display boxplots, where outliers are obvious. We therefore truncate the data using the thresholds below (informed by the quantiles); as a result, we discard about 20% of the raw data and obtain a working dataset with 291,758 observations and 9 columns (price included).
# examine numerical features
fig = plt.figure(figsize =(18,5))
ax1 = fig.add_subplot(141)
ax1 = raw_data[['price']].boxplot()
ax2 = fig.add_subplot(142)
ax2 = raw_data[['yearOfRegistration']].boxplot()
ax3 = fig.add_subplot(143)
ax3 = raw_data[['powerPS']].boxplot()
ax4 = fig.add_subplot(144)
ax4 = raw_data[['kilometer']].boxplot()
raw_data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].describe()
#raw_data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].quantile([0,0.01,0.05,0.9,0.95,0.999,1])
data = raw_data[(raw_data.price >= 500) & (raw_data.price <= 200000)
& (raw_data.yearOfRegistration <= 2016) & (raw_data.yearOfRegistration >= 1950)
& (raw_data.powerPS >= 50) & (raw_data.powerPS <= 700)]
print(data.shape, data.shape[0]/raw_data.shape[0])
data[['price', 'yearOfRegistration', 'powerPS', 'kilometer']].describe()
Visualization and Correlation¶
In order to examine the distribution of the data over the categories, several bar plots are displayed below. The number of observations per category varies greatly, so one-hot encoding should be used in the prediction model (a brief sketch follows the plots). Besides, since 'notRepairedDamage' has so many missing values, this feature may perform poorly in model fitting.
columns = ['brand', 'vehicleType', 'gearbox', 'fuelType', 'notRepairedDamage', 'yearOfRegistration']
for col in columns:
    # count observations in each category, largest first
    counts = data.groupby(by=col)[col].count().sort_values(ascending=False)
    cat = counts.index
    r = range(len(cat))
    plt.figure()
    plt.title(col)
    plt.bar(r, counts)
    plt.xticks(r, cat)
    plt.show()
    print(cat)
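As noted above, one-hot encoding is preferable for the prediction model, since these categories are nominal rather than ordinal. A minimal sketch with pandas.get_dummies is given here for illustration; the resulting data_onehot frame is not used in the remainder of this proposal.
# one-hot encode the nominal categorical features; numerical columns pass through unchanged
categorical_cols = ['vehicleType', 'gearbox', 'fuelType', 'brand', 'notRepairedDamage']
data_onehot = pd.get_dummies(data, columns=categorical_cols, prefix=categorical_cols)
print(data_onehot.shape)  # one indicator column per category level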
Moreover, the correlations between features are also of interest, especially how the features of a car relate to its price.
Therefore, we encode the categorical features (of 'object' type) as integers. Although the categorical variables are treated as ordinal in this way, it provides a rough picture of the data for now.
for feature in data.columns:                                   # loop through all columns in the dataframe
    if data[feature].dtype == 'object':                        # only the categorical string columns
        data[feature] = pd.Categorical(data[feature]).codes    # replace strings with integer codes
data.describe()
import seaborn as sns
plt.figure(figsize =(8,8))
plt.title('Feature Correlation Heat Map')
sns.heatmap(data.corr(), linewidths=.1, vmax=1.0,
            square=True, linecolor='white', annot=True)
plt.savefig('1.png')
From the correlation heat map above, it can be observed that the power of the car, the mileage, and the number of years since registration have a major influence on the price. It is understandable that the price depends on the performance of the used car and on how new it is.
Feature Importance by XGBoost Regression¶
In this section, XGBoost regression is performed on the preprocessed data to compare the importance of each feature.
XGBoost is an improvement over the classical gradient-boosted regression tree method. Unlike first-order gradient boosting, it uses second-order (Hessian) information of the loss and introduces regularization on tree complexity. Besides, it grows each tree to a specified 'max_depth' and prunes it back, instead of relying on a purely greedy search.
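For context, the regularized objective that XGBoost approximately minimizes when adding the $t$-th tree can be written in terms of the first- and second-order gradients $g_i$ and $h_i$ of the loss; this is the standard formulation of the algorithm, stated here for reference:
$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\Big[g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2\Big] + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2,$$
where $f_t$ is the new tree, $T$ its number of leaves, and $w_j$ its leaf weights; the $\gamma$ and $\lambda$ terms are the regularization mentioned above.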
By cross-validation, the key parameters 'max_depth' and 'min_child_weight' (the minimum sum of instance weights required in a child, used to control overfitting) are set to 7 and 1 respectively. With these settings the model achieves a score (R², as returned by model.score) of 0.86 on the held-out test set, which provides a reasonable reference for the feature-importance measure. For actual prediction modeling, the precision can be further improved by one-hot encoding and further parameter tuning.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
train_size = int(.8*len(data))
train, test = np.split(data.sample(frac=1), [train_size])
y_train = train.pop('price')
y_test = test.pop('price')
# cross-validation
cv_params = {'max_depth': [7,5], 'min_child_weight': [1,3]}
ind_params = {'learning_rate':.1,'n_estimators':1500, 'seed':0, 'subsample':0.8, 'colsample_bytree':0.8, 'objective':'reg:linear'}
optimized_GBM = GridSearchCV(xgb.XGBRegressor(**ind_params), cv_params, scoring = 'neg_mean_squared_error', cv=5, n_jobs = -1)
optimized_GBM.fit(train, y_train)
print(optimized_GBM.best_params_, optimized_GBM.best_score_)  # summary of the grid search
# fit
model = xgb.XGBRegressor(max_depth=7, learning_rate=0.1, n_estimators=1500, silent=True,
objective='reg:linear', booster='gbtree', n_jobs=-1, nthread=None, gamma=0,
min_child_weight=1, max_delta_step=0, subsample=0.8, colsample_bytree=0.8, colsample_bylevel=0.8,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=0)
model.fit(train,y_train)
y_pred = model.predict(test)
model.score(test, y_test)
# feature importances from the fitted model, with feature names attached and sorted
feat_imp = pd.Series(model.feature_importances_, index=train.columns).sort_values(ascending=False)
plt.figure(figsize=(12, 8))
feat_imp.plot(kind='barh', title='Feature Importances for XGBoost Regression')
plt.savefig('2.png')
print(feat_imp)
It can be observed that the performance (power) and the depreciation of the used car (years in use and mileage) are the most important features, followed by the brand and vehicle type, which matches intuition. Besides, notRepairedDamage, fuelType and gearbox are relatively less important. In fact, the distribution of the data over these categories is skewed and has many missing values, since not every private seller is familiar with such features.
Further Steps¶
In the previous analysis, the influence of 8 features on the used-car market price was discussed. Based on the current progress, the project may proceed with the following next steps.
For a more precise prediction model, we should employ one-hot encoding, using indicator variables to represent the categorical labels. Moreover, the hyperparameters for XGBoost regression can be further tuned by cross-validation, and other non-parametric models such as neural networks or kernel regression can also be tried. In addition, a more flexible price evaluation should return an interval prediction as a price range instead of a point estimate; Bayesian methods could be considered for the interval estimation (one simpler alternative, quantile regression, is sketched below).
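As one concrete, non-Bayesian option for the price range, a pair of quantile regressors can bound the prediction. Below is a minimal sketch with scikit-learn's GradientBoostingRegressor, reusing the train/test split from the previous section; the 10%/90% quantiles and the hyperparameters are illustrative assumptions only.
from sklearn.ensemble import GradientBoostingRegressor
# fit one model per target quantile; together they give a rough 80% prediction interval
lower = GradientBoostingRegressor(loss='quantile', alpha=0.1, n_estimators=300, max_depth=5)
upper = GradientBoostingRegressor(loss='quantile', alpha=0.9, n_estimators=300, max_depth=5)
lower.fit(train, y_train)
upper.fit(train, y_train)
interval = pd.DataFrame({'price_low': lower.predict(test), 'price_high': upper.predict(test)})
print(interval.head())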
Furthermore, a recommender system could be built in order to recommend features for a car search to potential buyers; a rough sketch of this idea follows. Such a service would be valuable for users who have little experience with cars or with the used-car market. For example, a buyer who is unsure what to look for could input a price range and a vehicle-type preference, and the system could recommend several brands to consider. This task could be accomplished by matrix factorization.
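To make the matrix-factorization idea concrete, a rough sketch is given below. The price buckets, the number of latent factors, and the scoring rule are illustrative assumptions, not derived from the data; and since 'brand' and 'vehicleType' were integer-encoded earlier, a real implementation would keep the readable labels instead.
from sklearn.decomposition import NMF
# build a (price bucket, vehicleType) x brand matrix of listing counts
df = data.copy()
df['priceBucket'] = pd.cut(df['price'], bins=[500, 5000, 15000, 50000, 200000])
counts = df.pivot_table(index=['priceBucket', 'vehicleType'], columns='brand',
                        values='price', aggfunc='count', fill_value=0)
# low-rank factorization: segments and brands share a small latent space
nmf = NMF(n_components=5, random_state=0)
W = nmf.fit_transform(counts)  # segment factors
H = nmf.components_            # brand factors
# reconstructed scores for one query segment -> brands to recommend
segment_scores = pd.Series(W[0] @ H, index=counts.columns)
print(segment_scores.sort_values(ascending=False).head(5))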