Predicting House Price - Part 1: Exploratory Analysis

This dataset contains house sale prices for King County, Washington, which includes Seattle. It covers homes sold between May 2014 and May 2015, and records the features of each house along with the price at which it was sold. It can be used to build models that predict house prices.

The dataset is available on Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction

Some of the attributes captured in the dataset are:

  1. Number of bedrooms and bathrooms
  2. Total square feet of living space
  3. Number of floors
  4. Whether it has a basement, and the size of the basement
  5. Grade of the house
  6. Whether it has a waterfront, and the quality of the view
  7. When the house was built, and whether and when it was renovated
  8. Latitude and longitude
  9. Price of the house
  10. When the house was sold

Let's explore the dataset to understand these attributes and their characteristics in more detail.

Loading the dataset

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [3]:
house_df = pd.read_csv('kc_house_data.csv')
In [4]:
house_df.head( 5 )
Out[4]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

In [4]:
house_df.columns
Out[4]:
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
     'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
     'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
     'lat', 'long', 'sqft_living15', 'sqft_lot15'],
    dtype='object')

Exploratory Analysis

It is important to understand the properties of each variable, including the target variable, before building a predictive model.

Are there any missing values?

In [5]:
house_df.isnull().any().sum()
Out[5]:
0
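
The check above reports how many columns contain at least one missing value. When that number is not zero, a per-column count is usually more informative. The following is a minimal sketch of both checks; the small `sample_df` frame and its values are made up for illustration and are not taken from the King County data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (not the real dataset)
sample_df = pd.DataFrame({
    'price': [221900.0, 538000.0, np.nan],
    'bedrooms': [3, 3, 2],
    'sqft_living': [1180.0, np.nan, 770.0],
})

# Number of missing values in each column
missing_per_column = sample_df.isnull().sum()

# Number of columns that contain at least one missing value
columns_with_missing = sample_df.isnull().any().sum()

print(missing_per_column)
print(columns_with_missing)
```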

Understanding distribution of price - target variable

In [6]:
sn.distplot( house_df.price )
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1173bd8d0>
In [7]:
sn.boxplot( house_df.price )
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1173bd240>

Price is highly right-skewed.

  • Such variables can be transformed using a log transformation, which might make the distribution closer to normal.
In [8]:
sn.distplot( np.log10( house_df.price ) )
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x106631588>
In [9]:
house_df['log_price'] = np.log10( house_df.price )
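
Besides looking at the plots, the effect of a log transformation can be checked numerically by comparing the sample skewness before and after. The sketch below assumes nothing about the real data: `toy_price` is a synthetic, right-skewed series drawn from a log-normal distribution purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed "prices" drawn from a log-normal distribution
rng = np.random.default_rng(0)
toy_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

skew_raw = toy_price.skew()            # sample skewness of the raw values
skew_log = np.log10(toy_price).skew()  # sample skewness after the log10 transform

print('skewness before: %.2f, after: %.2f' % (skew_raw, skew_log))
```

A skewness close to zero after the transformation suggests a roughly symmetric distribution. On the real data the same two calls could be applied to `house_df.price`.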

How is sqft_living distributed?

In [10]:
sn.distplot( house_df.sqft_living )
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x117bf0c88>
In [11]:
sn.distplot( np.log1p( house_df.sqft_living ) )
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab94780>

How do different variables impact the sale price?

In [25]:
sn.jointplot(x="sqft_living", y="price", data=house_df, kind = 'reg', size = 5)
Out[25]:
<seaborn.axisgrid.JointGrid at 0x11b4e4b38>

As expected, sqft_living is highly correlated with price. This should be a good predictor.

Check correlation of all numerical variables with price

In [26]:
house_df.columns
Out[26]:
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
     'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
     'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
     'lat', 'long', 'sqft_living15', 'sqft_lot15', 'log_price'],
    dtype='object')
In [27]:
numerical_vars = [ 'bedrooms', 'bathrooms', 'sqft_living',
     'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15', 'price']
In [28]:
sn.heatmap( house_df[numerical_vars].corr(), annot=True )
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c54eda0>

Almost all variables are correlated with price, except sqft_lot and sqft_lot15.

  • Price is highly correlated with sqft_living and sqft_above.
  • It is moderately correlated with the number of bathrooms, bedrooms and sqft_basement.
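
A heatmap shows every pairwise correlation, but when only the relationship with the target matters it can be easier to read a sorted list. The following is a rough sketch; `corr_with_target` is a hypothetical helper and `toy_df` is synthetic data in which `sqft_lot` is unrelated to `price` by construction.

```python
import numpy as np
import pandas as pd

def corr_with_target(df, columns, target):
    """Pearson correlation of each column with the target, strongest first."""
    corr = df[list(columns) + [target]].corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Hypothetical toy data (not the real dataset)
rng = np.random.default_rng(1)
living = rng.uniform(500, 4000, 200)
toy_df = pd.DataFrame({
    'sqft_living': living,
    'sqft_lot': rng.uniform(1000, 20000, 200),
    'price': 200 * living + rng.normal(0, 50000, 200),
})

ranked = corr_with_target(toy_df, ['sqft_living', 'sqft_lot'], 'price')
print(ranked)
```

On the real data, the call would presumably look like `corr_with_target(house_df, numerical_vars[:-1], 'price')`, since `price` is the last entry of `numerical_vars`.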

How is price impacted by categorical variables?

Does having a waterfront influence the price of the house?

In [43]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'waterfront',
          x = 'price',
          data = house_df,
          width = 0.8,
          orient = 'h',
          showmeans = True,
          fliersize = 2,
          ax = ax)
plt.show()

Calculating the point-biserial correlation to understand the relationship

The point-biserial correlation measures the relationship between a dichotomous variable and a continuous variable.

The relationships between different types of variables are explained here:

https://www.andrews.edu/~calkins/math/edrm611/edrm13.htm

In [44]:
from scipy import stats
In [45]:
r, p = stats.pointbiserialr( house_df['waterfront'],
                           house_df['price'])
print ('point biserial correlation r is %s with p = %s' %(r,p))
point biserial correlation r is 0.266369434031 with p = 0.0

Whether or not a house has a waterfront is not highly correlated with price (r is about 0.27). As a rule of thumb, a correlation above 0.3 is considered meaningful.
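
The point-biserial coefficient is mathematically the same as the Pearson correlation computed with the dichotomous variable coded as 0/1. The sketch below checks this on a small example; the `flag` and `value` arrays are invented for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: a 0/1 flag and a continuous value
flag = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
value = np.array([210., 340., 180., 400., 900., 650., 300., 1200., 260., 310.])

# Point-biserial correlation from SciPy
r_pb, p_pb = stats.pointbiserialr(flag, value)

# Plain Pearson correlation on the same 0/1 coding
r_pearson = np.corrcoef(flag, value)[0, 1]

print(r_pb, r_pearson)
```

Both numbers should agree up to floating-point error, which is why a 0/1 variable such as `waterfront` can also be placed directly in a Pearson correlation matrix.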

How do ordinal variables influence price?

view, grade and condition appear to be rating scales.

Correlation between having a view and price

In [42]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'view',
          x = 'price',
          data = house_df,
          width = 0.8,
          orient = 'h',
          showmeans = True,
          fliersize = 2,
          ax = ax)
plt.show()

It can be observed that the median price shifts to the right as the view rating increases. That is, a better view is associated with a higher house price.

In [50]:
r, p = stats.spearmanr( house_df['view'],
                      house_df['price'])
print ('Spearman rank correlation r is %s with p = %s' %(r,p))
Spearman rank correlation r is 0.29393116417 with p = 0.0
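
The shift in medians seen in the boxplot can also be read off numerically by grouping on the ordinal variable. This is a minimal sketch on invented numbers; `toy_df` is hypothetical and its prices are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: an ordinal rating and a price
toy_df = pd.DataFrame({
    'view':  [0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'price': [300, 350, 400, 420, 500, 480, 560, 600, 700, 900, 1500],
})

# Median price for each rating level
median_by_view = toy_df.groupby('view')['price'].median()
print(median_by_view)
```

On the real data, the same pattern would be `house_df.groupby('view')['price'].median()`, and likewise for `grade` or `condition`.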

Correlation between condition of the house and price

In [71]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'condition',
          x = 'price',
          data = house_df,
          width = 0.8,
          orient = 'h',
          showmeans = True,
          fliersize = 2,
          ax = ax)
plt.show()

Correlation between grade and price

In [72]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'grade',
          x = 'price',
          data = house_df,
          width = 0.8,
          orient = 'h',
          showmeans = True,
          fliersize = 2,
          ax = ax)
plt.show()
In [74]:
sn.jointplot(x="grade", y="price", data=house_df, kind = 'reg', size = 5)
Out[74]:
<seaborn.axisgrid.JointGrid at 0x11e4be748>

The grade of the house is highly correlated with the price.

Correlation between floors and price

In [73]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'floors',
          x = 'price',
          data = house_df,
          width = 0.8,
          orient = 'h',
          showmeans = True,
          fliersize = 2,
          ax = ax)
plt.show()

An interesting observation is that the median price drops when the number of floors goes from 2 to 3 or 3.5. In fact, the mean price for houses with 3.5 floors lies outside the box of the boxplot, as it is greatly influenced by outliers.
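
How a single outlier can push the mean past the upper quartile, while leaving the median almost untouched, can be shown with a few numbers. The `prices` array below is invented for illustration and does not come from the dataset.

```python
import numpy as np

# Hypothetical prices with one extreme value
prices = np.array([400, 450, 500, 520, 550, 600, 5000], dtype=float)

# Quartiles (the box of a boxplot spans q1 to q3) and the mean
q1, median, q3 = np.percentile(prices, [25, 50, 75])
mean = prices.mean()

print('q1=%.1f median=%.1f q3=%.1f mean=%.1f' % (q1, median, q3, mean))
```

Here the mean ends up above the third quartile, which is the situation a boxplot shows when the mean marker falls outside the box.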

Creating new variables

Sometimes the variables present in the dataset can be used as they are, and sometimes we need to derive new variables from them. Take the year in which the house was built. It is not really a continuous variable, and treating it as a categorical variable would create a large number of categories, with little information in each of them. Instead, we can calculate the age of the house when it was sold, as the difference between the year it was sold and the year it was built.

Age of the house

In [57]:
# Year of sale (first four characters of the date string) minus year built
house_df['age'] = house_df['date'].str[0:4].astype(int) - house_df['yr_built']
In [58]:
house_df[['yr_built', 'date', 'age']][0:5]
Out[58]:
yr_built date age
0 1955 20141013T000000 59
1 1951 20141209T000000 63
2 1933 20150225T000000 82
3 1965 20141209T000000 49
4 1987 20150218T000000 28
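
An alternative to slicing the date string is to parse it into a proper datetime, which also gives access to the month or day of sale. The sketch below assumes the date strings follow the `YYYYMMDDTHHMMSS` pattern seen in the first rows; `toy_df` is a hypothetical three-row copy of those rows.

```python
import pandas as pd

# Hypothetical toy frame mirroring the first three rows shown above
toy_df = pd.DataFrame({
    'date': ['20141013T000000', '20141209T000000', '20150225T000000'],
    'yr_built': [1955, 1951, 1933],
})

# Parse the compact date strings into timestamps
sold = pd.to_datetime(toy_df['date'], format='%Y%m%dT%H%M%S')

# Age of the house at the time of sale, in years
toy_df['age'] = sold.dt.year - toy_df['yr_built']
print(toy_df)
```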
In [59]:
sn.jointplot(x="age", y="price", data=house_df, kind = 'reg', size = 5)
Out[59]:
<seaborn.axisgrid.JointGrid at 0x11dc39710>

Price declines with the age of the house, but only slightly. Age is probably not a good predictor.

Does renovation influence the price?

In [63]:
house_df['is_renovated'] = house_df['yr_renovated'].map( lambda rec: int( rec != 0) )
In [65]:
house_df['is_renovated'][0:5]
Out[65]:
0    0
1    1
2    0
3    0
4    0
Name: is_renovated, dtype: int64
In [66]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'is_renovated',
          x = 'price',
          data = house_df,
          width = 0.8,
          orient = 'h',
          showmeans = True,
          fliersize = 2,
          ax = ax)
plt.show()
In [68]:
r, p = stats.spearmanr( house_df['is_renovated'],
                      house_df['price'])
print ('Spearman rank correlation r is %s with p = %s' %(r,p))
Spearman rank correlation r is 0.1010262967 with p = 3.85719951003e-50

Again, not highly correlated with price.

When was the house sold?

In [76]:
house_df['when_sold'] = house_df['date'].map( lambda rec: rec[0:4] )
In [77]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'when_sold',
          x = 'price',
          data = house_df,
          width = 0.8,
          orient = 'h',
          showmeans = True,
          fliersize = 2,
          ax = ax)
plt.show()

The houses were sold in either 2014 or 2015. There is no significant difference between the two price distributions.

Not using the variables

Let's skip zipcode, lat and long for the time being. In future tutorials, we will explore how to use these variables for model building and analysis.