Predicting House Prices - Part 2: Building the Models

The dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015, and provides the features of each house along with the price at which it was sold, which makes it suitable for modelling house price prediction.

The dataset is available on Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction

Some of the attributes captured in the dataset are:

  1. Number of bedrooms and bathrooms
  2. Total square footage of the living area
  3. Number of floors
  4. Whether it has a basement, and the size of the basement
  5. Grade of the house
  6. Whether it has a waterfront, and the quality of the view
  7. When the house was built, and whether and when it was renovated
  8. Latitude and longitude
  9. Price of the house
  10. When the house was sold

Let's build a model to predict house prices and understand which attributes influence the prices most.

Loading the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [2]:
house_df = pd.read_csv('kc_house_data.csv')
In [3]:
house_df[0:5]
Out[3]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

In [4]:
house_df.columns
Out[4]:
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
     'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
     'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
     'lat', 'long', 'sqft_living15', 'sqft_lot15'],
    dtype='object')

Preparing the dataset

Taking the log of price, as its distribution is skewed

In [5]:
house_df['log_price'] = np.log1p( house_df.price )
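As a quick check, a minimal sketch using pandas' built-in skew() compares the skewness of price before and after the transformation:

# Skewness closer to 0 indicates a more symmetric distribution
print( "skew of price:     ", round( house_df.price.skew(), 2 ) )
print( "skew of log_price: ", round( house_df.log_price.skew(), 2 ) )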
In [6]:
house_df['age'] = house_df.apply( lambda rec: int( rec.date[0:4] ) - rec.yr_built, axis = 1 )
house_df['year'] = house_df.apply( lambda rec: int( rec.date[0:4] ), axis = 1 )
In [7]:
import math
In [9]:
numerical_features = ['bedrooms', 'age', 'year', 'bathrooms', 'sqft_living',
                    'view', 'condition', 'grade', 'floors',
                    'sqft_lot', 'sqft_above', 'sqft_basement',
                    'sqft_living15', 'sqft_lot15']
In [10]:
house_df['is_renovated'] = house_df['yr_renovated'].map( lambda rec: int( rec != 0) )
In [11]:
categorical_features = ['waterfront', 'is_renovated']
In [12]:
all_features = numerical_features + categorical_features
In [13]:
house_final_df = house_df[all_features]
In [14]:
import math

Create dummy variables for the categorical features

In [15]:
def create_dummies( df, colname ):
  col_dummies = pd.get_dummies(df[colname], prefix=colname)
  col_dummies.drop(col_dummies.columns[0], axis=1, inplace=True)
  df = pd.concat([df, col_dummies], axis=1)
  df.drop( colname, axis = 1, inplace = True )
  return df
In [16]:
for c_feature in categorical_features:
  house_final_df = create_dummies( house_final_df, c_feature )

This created two dummy variables: waterfront_1 and is_renovated_1
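A quick sanity check, assuming the dummy columns were created as above, is to count how many houses fall into each category:

# Number of waterfront houses and renovated houses
house_final_df[['waterfront_1', 'is_renovated_1']].sum()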

Let's look at the final dataset before building the model

In [17]:
house_final_df[0:5]
Out[17]:
bedrooms age year bathrooms sqft_living view condition grade floors sqft_lot sqft_above sqft_basement sqft_living15 sqft_lot15 waterfront_1 is_renovated_1
0 3 59 2014 1.00 1180 0 3 7 1.0 5650 1180 0 1340 5650 0.0 0.0
1 3 63 2014 2.25 2570 0 3 7 2.0 7242 2170 400 1690 7639 0.0 1.0
2 2 82 2015 1.00 770 0 3 6 1.0 10000 770 0 2720 8062 0.0 0.0
3 4 49 2014 3.00 1960 0 5 7 1.0 5000 1050 910 1360 5000 0.0 0.0
4 3 28 2015 2.00 1680 0 3 8 1.0 8080 1680 0 1800 7503 0.0 0.0

Feature Re-engineering

In [18]:
sn.regplot( x = house_final_df.bedrooms, y = house_df.price )
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x116b1d940>

Splitting the dataset into train and test dataset

In [19]:
from sklearn.cross_validation import train_test_split
In [20]:
train_X, test_X, train_y, test_y = train_test_split( house_final_df,
                                                  house_df.log_price,
                                                  test_size = 0.2,
                                                  random_state = 1000 )

Building Linear Regression Model

In [21]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics
In [22]:
linreg = LinearRegression( normalize=True )
linreg.fit( train_X, train_y )
Out[22]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
In [23]:
list( zip( house_final_df.columns,
        np.round( linreg.coef_, 2 ) ) )
Out[23]:
[('bedrooms', -0.029999999999999999),
('age', 0.01),
('year', 0.040000000000000001),
('bathrooms', 0.080000000000000002),
('sqft_living', 0.0),
('view', 0.040000000000000001),
('condition', 0.040000000000000001),
('grade', 0.20999999999999999),
('floors', 0.12),
('sqft_lot', 0.0),
('sqft_above', 0.0),
('sqft_basement', 0.0),
('sqft_living15', 0.0),
('sqft_lot15', -0.0),
('waterfront_1', 0.37),
('is_renovated_1', 0.02)]

The model says that the price of a house falls as the number of bedrooms increases, and that the price rises as the age of the house increases. This does not make sense and is contrary to what we observed during the exploratory analysis.

Relationship between price and number of bedrooms

In [24]:
sn.regplot( x = train_X.bedrooms, y = train_y )
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x117f217b8>
In [25]:
sn.regplot( x = train_X.age, y = train_y )
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x118047198>

Model Evaluation

In [26]:
def model_evaluate( model_ ):
  pred_df = pd.DataFrame( { "actual": test_y, "predicted" : model_.predict( test_X ) } )
  pred_df['resid'] = pred_df.actual - pred_df.predicted
  rmse = np.sqrt( metrics.mean_squared_error( pred_df.actual, pred_df.predicted ) )
  r2 = metrics.r2_score( pred_df.actual, pred_df.predicted )
  print( "RMSE: ", rmse, " : ", "R Squared: ", r2 )
  return pred_df, rmse, r2
In [27]:
pred_df, rmse, r2 = model_evaluate( linreg )
RMSE:  0.310350490348  :  R Squared:  0.651829812911
In [28]:
pred_df[0:5]
Out[28]:
actual predicted resid
18099 11.918397 12.483393 -0.564996
2156 12.923915 12.954308 -0.030393
20684 13.353477 13.178587 0.174890
4456 13.579789 13.099432 0.480357
18714 13.188153 12.862238 0.325915

Let's look at the residual pattern

In [29]:
sn.lmplot( 'actual', 'resid', data = pred_df, size = 6 )
Out[29]:
<seaborn.axisgrid.FacetGrid at 0x11805a550>
In [30]:
sn.distplot( pred_df.resid )
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x11805f710>
In [31]:
from scipy import stats
np.round( stats.shapiro( pred_df.resid ), 2 )
Out[31]:
array([ 1.,  0.])

The residual plot shows that the residuals are not random: they increase as the house prices increase.

This might be because of

  • outliers in prices
  • multicollinearity (see the sketch below)
  • a non-linear relationship between price and its predictors
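One way to probe the multicollinearity hypothesis is to compute variance inflation factors (VIF). Below is a minimal sketch, assuming statsmodels is available; sqft_living is left out because it appears to equal sqft_above + sqft_basement, which would make the design matrix exactly collinear:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Drop sqft_living to avoid an exactly collinear design matrix
vif_features = [f for f in numerical_features if f != 'sqft_living']
X_vif = sm.add_constant( house_final_df[vif_features] )

# VIF per feature (skipping the constant); values well above 5-10 suggest multicollinearity
vif_df = pd.DataFrame( { 'feature': X_vif.columns[1:],
                         'vif': [ variance_inflation_factor( X_vif.values, i )
                                  for i in range( 1, X_vif.shape[1] ) ] } )
vif_df.sort_values( 'vif', ascending = False )

Large VIFs for the sqft_* features would support the multicollinearity explanation.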

Feature Engineering

In [32]:
sn.heatmap( pd.concat( [house_final_df, house_df.log_price], axis = 1 ).corr() )
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x11821bc18>
In [102]:
house_final_df['sqft_living'] = np.log1p( house_df['sqft_living'] )
In [48]:
house_final_df.columns
Out[48]:
Index(['bedrooms', 'age', 'year', 'bathrooms', 'sqft_living', 'view',
     'condition', 'grade', 'floors', 'sqft_lot', 'sqft_above',
     'sqft_basement', 'sqft_living15', 'sqft_lot15', 'waterfront_1',
     'is_renovated_1'],
    dtype='object')

Splitting the dataset into train and test datasets

We re-split the dataset so that the log-transformed sqft_living feature is used for both training and testing.

In [49]:
train_X, test_X, train_y, test_y = train_test_split( house_final_df,
                                                  house_df.log_price,
                                                  test_size = 0.2,
                                                  random_state = 1000 )

Applying the linear regression model

In [50]:
linreg = LinearRegression()
linreg.fit( train_X, train_y )
list( zip( house_final_df.columns, linreg.coef_ ) )
Out[50]:
[('bedrooms', -0.036196134415004862),
('age', 0.0056440134435417908),
('year', 0.036355755748121055),
('bathrooms', 0.073161381162300837),
('sqft_living', 0.20847420271911121),
('view', 0.040187036036257069),
('condition', 0.039485486885153333),
('grade', 0.20583414787262977),
('floors', 0.11625753793856329),
('sqft_lot', 2.172347738083058e-07),
('sqft_above', 2.5852432518687352e-05),
('sqft_basement', 0.00011212051148115632),
('sqft_living15', 0.00010390838373267042),
('sqft_lot15', -5.1589746134783621e-07),
('waterfront_1', 0.3752623167094718),
('is_renovated_1', 0.011808899864368203)]
In [51]:
linreg
Out[51]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [52]:
linreg_pred = pd.DataFrame( { "actual": test_y, "predicted" : linreg.predict( test_X ) } )

Validating the model

In [53]:
np.sqrt( metrics.mean_squared_error( linreg_pred.actual, linreg_pred.predicted ) )
Out[53]:
0.31025148474189113
In [54]:
metrics.r2_score( linreg_pred.actual, linreg_pred.predicted )
Out[54]:
0.6520519185817919
In [108]:
linreg_rmse = np.sqrt( metrics.mean_squared_error( np.expm1( linreg_pred.actual ),
                                                np.expm1( linreg_pred.predicted ) ) )

linreg_rmse
Out[108]:
198161.49127088574

Applying L1 and L2 Regularization

In [55]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.grid_search import GridSearchCV

Ridge Regression

In [56]:
ridge_m_ = Ridge()
ridge_params_ = {'alpha':[0.01, 0.1, 1, 5]}

grid_ridge_m = GridSearchCV( ridge_m_,
                          ridge_params_,
                          scoring = "mean_squared_error" ,
                          cv=5)
In [57]:
grid_ridge_m.fit( train_X, train_y )
Out[57]:
GridSearchCV(cv=5, error_score='raise',
     estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
 normalize=False, random_state=None, solver='auto', tol=0.001),
     fit_params={}, iid=True, n_jobs=1,
     param_grid={'alpha': [0.01, 0.1, 1, 5]}, pre_dispatch='2*n_jobs',
     refit=True, scoring='mean_squared_error', verbose=0)
In [58]:
grid_ridge_m.best_score_
Out[58]:
-0.094861977708762327
In [59]:
grid_ridge_m.best_params_
Out[59]:
{'alpha': 1}
In [60]:
ridge_pred, ridge_rmse, ridge_r2 = model_evaluate( grid_ridge_m )
RMSE:  0.310242455756  :  R Squared:  0.652072170363

Lasso Regression

In [61]:
lasso_m_ = Lasso()
lasso_params_ = {'alpha':[0.01, 0.1],
               'max_iter':[30000]}

grid_lasso_m = GridSearchCV( lasso_m_,
                          lasso_params_,
                          scoring = "r2" ,
                          cv=5)
In [62]:
grid_lasso_m.fit( train_X, train_y )
Out[62]:
GridSearchCV(cv=5, error_score='raise',
     estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
 normalize=False, positive=False, precompute=False, random_state=None,
 selection='cyclic', tol=0.0001, warm_start=False),
     fit_params={}, iid=True, n_jobs=1,
     param_grid={'max_iter': [30000], 'alpha': [0.01, 0.1]},
     pre_dispatch='2*n_jobs', refit=True, scoring='r2', verbose=0)
In [63]:
grid_lasso_m.best_score_
Out[63]:
0.64580628737259427
In [64]:
grid_lasso_m.best_params_
Out[64]:
{'alpha': 0.01, 'max_iter': 30000}
In [65]:
metrics.r2_score( test_y, grid_lasso_m.predict( test_X ) )
Out[65]:
0.64262142582081649
In [66]:
lasso_pred, lasso_rmse, lasso_r2 = model_evaluate( grid_lasso_m )
RMSE:  0.314427774187  :  R Squared:  0.642621425821
In [67]:
lasso_m_ = Lasso( alpha = 0.01, max_iter = 10000 )
lasso_m_.fit( train_X, train_y )
Out[67]:
Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=10000,
 normalize=False, positive=False, precompute=False, random_state=None,
 selection='cyclic', tol=0.0001, warm_start=False)
In [68]:
train_X.columns[lasso_m_.coef_ == 0]
Out[68]:
Index(['year', 'sqft_living', 'waterfront_1', 'is_renovated_1'], dtype='object')
In [69]:
selected_features = train_X.columns[lasso_m_.coef_ != 0]
In [70]:
selected_features
Out[70]:
Index(['bedrooms', 'age', 'bathrooms', 'view', 'condition', 'grade', 'floors',
     'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15',
     'sqft_lot15'],
    dtype='object')
In [71]:
def model_evaluate_selected_features( model_, selected_features ):
  pred_df = pd.DataFrame( { "actual": test_y, "predicted" :
                           model_.predict( test_X[selected_features] ) } )
  pred_df['resid'] = pred_df.actual - pred_df.predicted
  rmse = np.sqrt( metrics.mean_squared_error( pred_df.actual, pred_df.predicted ) )
  r2 = metrics.r2_score( pred_df.actual, pred_df.predicted )
  print( "RMSE: ", rmse, " : ", "R Squared: ", r2 )
  return pred_df, rmse, r2
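The helper above is not exercised in the cells that follow; a hypothetical use (the lasso_sel_* names are ours) is to refit a Lasso on just the selected features and evaluate it on the same test split:

# Refit on the Lasso-selected features only and evaluate on the test set
lasso_sel_m_ = Lasso( alpha = 0.01, max_iter = 10000 )
lasso_sel_m_.fit( train_X[selected_features], train_y )
lasso_sel_pred, lasso_sel_rmse, lasso_sel_r2 = model_evaluate_selected_features( lasso_sel_m_, selected_features )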

Elastic Net

In [72]:
from sklearn.linear_model import ElasticNet
In [73]:
enet_m_ = ElasticNet()
enet_params_ = {'alpha':[0.01, 0.1],
           'max_iter':[10000],
           'l1_ratio': [0.3, 0.5, 0.7, 0.9] }
In [74]:
grid_enet_m = GridSearchCV( enet_m_,
                          enet_params_,
                          scoring = "mean_squared_error" ,
                          cv=5)
In [75]:
grid_enet_m.fit( train_X, train_y )
Out[75]:
GridSearchCV(cv=5, error_score='raise',
     estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
    max_iter=1000, normalize=False, positive=False, precompute=False,
    random_state=None, selection='cyclic', tol=0.0001, warm_start=False),
     fit_params={}, iid=True, n_jobs=1,
     param_grid={'max_iter': [10000], 'l1_ratio': [0.3, 0.5, 0.7, 0.9], 'alpha': [0.01, 0.1]},
     pre_dispatch='2*n_jobs', refit=True, scoring='mean_squared_error',
     verbose=0)
In [76]:
grid_enet_m.best_score_
Out[76]:
-0.096574259119442243
In [77]:
grid_enet_m.best_params_
Out[77]:
{'alpha': 0.01, 'l1_ratio': 0.3, 'max_iter': 10000}
In [78]:
metrics.r2_score( test_y, grid_enet_m.predict( test_X ) )
Out[78]:
0.64888153115564062
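For a like-for-like comparison with the other models, the tuned Elastic Net could also be passed through the same model_evaluate helper (a small sketch; the enet_* names are ours):

# Held-out RMSE and R-squared for the tuned Elastic Net
enet_pred, enet_rmse, enet_r2 = model_evaluate( grid_enet_m )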

Decision Tree Regressor

In [79]:
from sklearn.tree import DecisionTreeRegressor

Build a decision tree with the default value of max_depth

In [80]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit( train_X, train_y )
Out[80]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
         max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
         min_weight_fraction_leaf=0.0, presort=False, random_state=None,
         splitter='best')
In [81]:
tree_pred, tree_rmse, tree_r2 = model_evaluate( tree_reg )
RMSE:  0.389256007747  :  R Squared:  0.452281566501

Increase the max depth to 6

In [82]:
tree_reg = DecisionTreeRegressor( max_depth = 6 )
tree_reg.fit( train_X, train_y )
tree_pred, tree_rmse, tree_r2 = model_evaluate( tree_reg )
RMSE:  0.319022947571  :  R Squared:  0.63209934951

Search for optimal max depth

In [83]:
from sklearn.grid_search import GridSearchCV
In [84]:
tree_params_ = { 'max_depth' : [3, 4, 5, 6, 7, 8, 9, 10] }
In [85]:
tree_reg = DecisionTreeRegressor()
tree_grid = GridSearchCV( tree_reg,
                       tree_params_,
                       scoring = "r2" ,
                       cv=5)
In [86]:
tree_grid.fit( train_X, train_y )
Out[86]:
GridSearchCV(cv=5, error_score='raise',
     estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
         max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
         min_weight_fraction_leaf=0.0, presort=False, random_state=None,
         splitter='best'),
     fit_params={}, iid=True, n_jobs=1,
     param_grid={'max_depth': [3, 4, 5, 6, 7, 8, 9, 10]},
     pre_dispatch='2*n_jobs', refit=True, scoring='r2', verbose=0)
In [87]:
tree_grid.best_params_
Out[87]:
{'max_depth': 8}
In [88]:
tree_grid.best_score_
Out[88]:
0.64768813414694082
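The best score above is a cross-validated estimate on the training data. To judge the tuned tree on the held-out test set, the refit grid-search object can be passed to model_evaluate (a small sketch; the tree_grid_* names are ours):

# Evaluate the refit best tree (max_depth = 8) on the test set
tree_grid_pred, tree_grid_rmse, tree_grid_r2 = model_evaluate( tree_grid )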

Ensemble Methods

Random Forest Regressor

In [89]:
from sklearn.ensemble import RandomForestRegressor
In [90]:
rand_f_reg = RandomForestRegressor( n_estimators = 100 )
In [91]:
rand_f_reg.fit( train_X, train_y )
Out[91]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
         max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
         min_samples_split=2, min_weight_fraction_leaf=0.0,
         n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
         verbose=0, warm_start=False)
In [92]:
rand_f_pred, rand_f_rmse, rand_f_r2 = model_evaluate( rand_f_reg )
RMSE:  0.274589559344  :  R Squared:  0.727444626086
In [93]:
indices = np.argsort(rand_f_reg.feature_importances_)[::-1]
In [94]:
feature_rank = pd.DataFrame( columns = ['rank', 'feature', 'importance'] )
for f in range(train_X.shape[1]):
  feature_rank.loc[f] = [f+1,
                         train_X.columns[indices[f]],
                         rand_f_reg.feature_importances_[indices[f]]]
In [95]:
sn.barplot( y = 'feature', x = 'importance', data = feature_rank )
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x117899908>

AdaBoost Regressor

In [96]:
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
In [97]:
dtree_reg = DecisionTreeRegressor( max_depth = 8 )

adaboost_m_ = AdaBoostRegressor( dtree_reg, n_estimators = 50 )
adaboost_m_.fit( train_X, train_y )
Out[97]:
AdaBoostRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
         max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
         min_weight_fraction_leaf=0.0, presort=False, random_state=None,
         splitter='best'),
       learning_rate=1.0, loss='linear', n_estimators=50,
       random_state=None)
In [98]:
ada_pre, ada_rmse, ada_r2 = model_evaluate( adaboost_m_ )
RMSE:  0.287562569834  :  R Squared:  0.701082445043

Gradient Boosting Regressor

In [99]:
gra_boost_m_ = GradientBoostingRegressor( learning_rate = 0.2, n_estimators = 200, max_depth = 5 )
In [100]:
gra_boost_m_.fit( train_X, train_y )
Out[100]:
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.2, loss='ls',
           max_depth=5, max_features=None, max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200,
           presort='auto', random_state=None, subsample=1.0, verbose=0,
           warm_start=False)
In [101]:
gra_pred, gra_rmse, gra_r2 = model_evaluate( gra_boost_m_ )
RMSE:  0.273209466123  :  R Squared:  0.730177479813

Note:

  • Of all the models, the gradient boosting regressor performs best, with an R-squared of about 0.73 on the test set. Predicting house prices is typically a difficult problem, as prices may be influenced by many external macroeconomic factors that are not captured in the dataset. Still, an R-squared of 0.73 is useful, and ranking the features by importance using the random forest model also helps us understand which factors influence prices most. A side-by-side comparison of the metrics is sketched below.
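To wrap up, the held-out metrics computed above can be collected into a single comparison table. A small sketch, assuming the rmse/r2 variables from the earlier evaluation cells are still in scope:

# Side-by-side comparison of held-out RMSE and R-squared (both on the log-price scale)
model_comparison = pd.DataFrame( {
    'model': ['Linear Regression', 'Ridge', 'Lasso', 'Decision Tree',
              'Random Forest', 'AdaBoost', 'Gradient Boosting'],
    'rmse':  [rmse, ridge_rmse, lasso_rmse, tree_rmse, rand_f_rmse, ada_rmse, gra_rmse],
    'r2':    [r2, ridge_r2, lasso_r2, tree_r2, rand_f_r2, ada_r2, gra_r2] } )
model_comparison.sort_values( 'r2', ascending = False )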