# Predicting House Price - Part 1: Exploratory Analysis

The dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015, and records the features of each house along with the price at which it was sold. It can be used to build a house price prediction model.

The dataset is available on Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction

Some of the attributes captured in the dataset are:

1. Number of bedrooms and bathrooms
2. Total square feet of living space
3. Number of floors
4. Whether it has a basement, and the size of the basement
5. Whether it is on the waterfront, and the quality of the view
6. The year the house was built and, if it was renovated, the year of renovation
7. Latitude and longitude
8. Price of the house
9. When the house was sold

#### Let's explore the dataset to understand these attributes and their characteristics in more detail

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [3]:
house_df = pd.read_csv('kc_house_data.csv')

In [4]:
house_df.head( 5 )

Out[4]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

In [4]:
house_df.columns

Out[4]:
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15'],
dtype='object')

# Exploratory Analysis

It is important to understand the properties of each variable, including the target variable, before building a predictive model.

### Any missing values?

In [5]:
house_df.isnull().any().sum()

Out[5]:
0
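
Note that `isnull().any().sum()` counts the columns that contain at least one missing value, not the missing cells themselves. A minimal sketch of the difference, using a small made-up frame (`toy_df` and its values are assumptions for illustration, not rows from the King County file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for house_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'price': [221900.0, 538000.0, np.nan, 604000.0],
    'bedrooms': [3, 3, 2, 4],
    'sqft_living': [1180, np.nan, 770, np.nan],
})

# Number of missing cells in each column
missing_per_column = toy_df.isnull().sum()

# Number of columns with at least one missing cell
# (this is what isnull().any().sum() reports)
columns_with_missing = toy_df.isnull().any().sum()

print(missing_per_column)
print(columns_with_missing)
```

A result of 0 from the second expression therefore means that no column has any missing value.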

### Understanding the distribution of price - the target variable

In [6]:
sn.distplot( house_df.price )

Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1173bd8d0>
In [7]:
sn.boxplot( house_df.price )

Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1173bd240>

#### Price is highly right-skewed.

• Such variables can be transformed using a log transformation, which may make the distribution closer to normal.
In [8]:
sn.distplot( np.log10( house_df.price ) )

Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x106631588>
In [9]:
house_df['log_price'] = np.log10( house_df.price )
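
The effect of the log transformation can also be checked numerically rather than only by eye. The sketch below compares the sample skewness before and after taking `np.log10`. It uses synthetic log-normal values (`toy_price` is an assumption for illustration), not the King County prices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
toy_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

# Sample skewness: clearly positive for a right-skewed variable,
# close to zero for a roughly symmetric one.
skew_raw = toy_price.skew()
skew_log = np.log10(toy_price).skew()

print(skew_raw, skew_log)
```

On the real data, the same two calls on `house_df.price` and `house_df.log_price` would give a comparable before/after check.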


## How is sqft_living distributed?

In [10]:
sn.distplot( house_df.sqft_living )

Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x117bf0c88>
In [11]:
sn.distplot( np.log1p( house_df.sqft_living ) )

Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab94780>

### How do different variables impact the sale price?

In [25]:
sn.jointplot(x="sqft_living", y="price", data=house_df, kind = 'reg', size = 5)

Out[25]:
<seaborn.axisgrid.JointGrid at 0x11b4e4b38>

### Check correlation of all numerical variables with price

In [26]:
house_df.columns

Out[26]:
Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'lat', 'long', 'sqft_living15', 'sqft_lot15', 'log_price'],
dtype='object')
In [27]:
numerical_vars = [ 'bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15', 'price']

In [28]:
sn.heatmap( house_df[numerical_vars].corr(), annot=True )

Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c54eda0>

#### Almost all variables are correlated with price, except sqft_lot and sqft_lot15.

• Price is highly correlated with sqft_living and sqft_above.
• Price is moderately correlated with the number of bathrooms, the number of bedrooms and sqft_basement.
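
When the heatmap gets crowded, the column of correlations with price can be pulled out and sorted. This is a minimal sketch on synthetic data: `toy_df` and the way its columns are generated are assumptions for illustration, with `sqft_lot` deliberately generated independently of price.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for house_df: price depends on sqft_living only.
sqft_living = rng.normal(2000, 500, n)
toy_df = pd.DataFrame({
    'sqft_living': sqft_living,
    'sqft_lot': rng.normal(8000, 2000, n),  # independent of price by construction
    'price': 250 * sqft_living + rng.normal(0, 50000, n),
})

# Pearson correlation of every variable with price, strongest first
price_corr = (toy_df.corr()['price']
              .drop('price')
              .sort_values(ascending=False))

print(price_corr)
```

Applied to the notebook's data, the equivalent expression would be `house_df[numerical_vars].corr()['price']`.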

### How is price impacted by categorical variables?

Does having a waterfront influence the price of the house?

In [43]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'waterfront',
x = 'price',
data = house_df,
width = 0.8,
orient = 'h',
showmeans = True,
fliersize = 2,
ax = ax)
plt.show()


Calculating point-biserial correlation to understand the relationship

### Calculating the relationship between a dichotomous and a continuous variable

The relationships between different types of variables are explained here:

https://www.andrews.edu/~calkins/math/edrm611/edrm13.htm

In [44]:
from scipy import stats

In [45]:
r, p = stats.pointbiserialr( house_df['waterfront'],
house_df['price'])
print ('point biserial correlation r is %s with p = %s' %(r,p))

point biserial correlation r is 0.266369434031 with p = 0.0
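
The point-biserial coefficient is the Pearson correlation computed with a 0/1 variable on one side. The sketch below checks this on a handful of made-up values (the arrays are assumptions for illustration, not rows from the dataset) and also reproduces the coefficient from the two group means.

```python
import numpy as np
from scipy import stats

# Made-up values: 1 = waterfront, 0 = no waterfront; price in thousands.
waterfront = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
price = np.array([300, 450, 380, 900, 410, 1200, 350, 500, 700, 330], dtype=float)

r_pb, p_pb = stats.pointbiserialr(waterfront, price)
r_pearson, p_pearson = stats.pearsonr(waterfront, price)

# Same value from the group means:
# r = (mean_1 - mean_0) / std(price) * sqrt(p * (1 - p)),
# where p is the share of 1s and std is the population standard deviation.
is_wf = waterfront == 1
p_wf = is_wf.mean()
r_manual = ((price[is_wf].mean() - price[~is_wf].mean())
            / price.std() * np.sqrt(p_wf * (1 - p_wf)))

print(r_pb, r_pearson, r_manual)
```

The formula shows why the coefficient grows with the gap between the two group means and shrinks when one group is very rare.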


## How do ordinal variables influence the price?

view, grade and condition appear to be rating scales.

### Correlation between having a view and price

In [42]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'view',
x = 'price',
data = house_df,
width = 0.8,
orient = 'h',
showmeans = True,
fliersize = 2,
ax = ax)
plt.show()


#### The median price shifts to the right as the view rating increases. That is, houses with a better view tend to sell at higher prices.

In [50]:
r, p = stats.spearmanr( house_df['view'],
house_df['price'])
print ('Spearman rank correlation r is %s with p = %s' %(r,p))

Spearman rank correlation r is 0.29393116417 with p = 0.0
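
`stats.spearmanr` is a rank correlation, which is why it suits an ordinal rating such as view: it is the Pearson correlation of the ranks, so only the ordering of the ratings matters, not the spacing between them. A minimal sketch on made-up values (`toy_df` is an assumption for illustration):

```python
import pandas as pd
from scipy import stats

# Made-up view ratings (0-4) and prices in thousands.
toy_df = pd.DataFrame({
    'view':  [0, 0, 1, 2, 0, 3, 4, 0, 2, 4],
    'price': [300, 320, 400, 520, 310, 480, 950, 280, 610, 800],
})

rho, p = stats.spearmanr(toy_df['view'], toy_df['price'])

# Same coefficient obtained by ranking both columns (ties get the average
# rank) and taking the ordinary Pearson correlation of the ranks.
rho_from_ranks = toy_df['view'].rank().corr(toy_df['price'].rank())

print(rho, rho_from_ranks)
```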


### Correlation between condition of the house and price

In [71]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'condition',
x = 'price',
data = house_df,
width = 0.8,
orient = 'h',
showmeans = True,
fliersize = 2,
ax = ax)
plt.show()


### Correlation between grade and price

In [72]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'grade',
x = 'price',
data = house_df,
width = 0.8,
orient = 'h',
showmeans = True,
fliersize = 2,
ax = ax)
plt.show()

In [74]:
sn.jointplot(x="grade", y="price", data=house_df, kind = 'reg', size = 5)

Out[74]:
<seaborn.axisgrid.JointGrid at 0x11e4be748>

### Correlation between floors and price

In [73]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'floors',
x = 'price',
data = house_df,
width = 0.8,
orient = 'h',
showmeans = True,
fliersize = 2,
ax = ax)
plt.show()


## Creating new variables

Sometimes the variables present in the dataset cannot be used as they are, and we need to derive new variables from them. Take the year in which the house was built. It is not a continuous variable, and if we use it as a categorical variable it will create many categories, some of which might carry little information. What we can do instead is calculate the age of the house at the time of sale, as the difference between the year it was sold and the year it was built.

### Age of the house

In [57]:
house_df['age'] = house_df.apply( lambda rec: int( rec.date[0:4] ) - rec.yr_built, axis = 1 )

In [58]:
house_df[['yr_built', 'date', 'age']][0:5]

Out[58]:
yr_built date age
0 1955 20141013T000000 59
1 1951 20141209T000000 63
2 1933 20150225T000000 82
3 1965 20141209T000000 49
4 1987 20150218T000000 28
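
The same age can be computed without a row-wise `apply`, by parsing the sale date once and subtracting whole columns. The sketch below is one possible alternative, not the notebook's method. It uses the first three rows shown above, and the format string is an assumption based on how the date column appears in the output:

```python
import pandas as pd

# First three rows of the output above (date and yr_built only).
toy_df = pd.DataFrame({
    'date': ['20141013T000000', '20141209T000000', '20150225T000000'],
    'yr_built': [1955, 1951, 1933],
})

# Parse the timestamp strings, then subtract column-wise.
sold_on = pd.to_datetime(toy_df['date'], format='%Y%m%dT%H%M%S')
toy_df['age'] = sold_on.dt.year - toy_df['yr_built']

print(toy_df)
```

Keeping the parsed dates around also makes it easy to derive other features later, such as the month of sale.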
In [59]:
sn.jointplot(x="age", y="price", data=house_df, kind = 'reg', size = 5)

Out[59]:
<seaborn.axisgrid.JointGrid at 0x11dc39710>

Price declines slightly with the age of the house, but the relationship is weak, so age is probably not a strong predictor.

### Does renovation influence the price?

In [63]:
house_df['is_renovated'] = house_df['yr_renovated'].map( lambda rec: int( rec != 0) )

In [65]:
house_df['is_renovated'][0:5]

Out[65]:
0    0
1    1
2    0
3    0
4    0
Name: is_renovated, dtype: int64
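
The flag can also be built with a vectorized comparison instead of `map`, and a quick group summary gives numbers to read alongside the box plot. A minimal sketch using the five rows shown by `head()` earlier (far too few rows to draw any conclusion, so this only illustrates the mechanics):

```python
import pandas as pd

# price and yr_renovated for the five rows shown by house_df.head(5)
toy_df = pd.DataFrame({
    'price': [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    'yr_renovated': [0, 1991, 0, 0, 0],
})

# A yr_renovated of 0 means the house was never renovated.
toy_df['is_renovated'] = (toy_df['yr_renovated'] != 0).astype(int)

# Median price per group
median_by_flag = toy_df.groupby('is_renovated')['price'].median()

print(toy_df['is_renovated'].tolist())
print(median_by_flag)
```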
In [66]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'is_renovated',
x = 'price',
data = house_df,
width = 0.8,
orient = 'h',
showmeans = True,
fliersize = 2,
ax = ax)
plt.show()

In [68]:
r, p = stats.spearmanr( house_df['is_renovated'],
house_df['price'])
print ('Spearman rank correlation r is %s with p = %s' %(r,p))

Spearman rank correlation r is 0.1010262967 with p = 3.85719951003e-50


### When was the house sold?

In [76]:
house_df['when_sold'] = house_df['date'].map( lambda rec: rec[0:4] )

In [77]:
fig, ax = plt.subplots( figsize=( 10, 4 ) )
sn.boxplot(y = 'when_sold',
x = 'price',
data = house_df,
width = 0.8,
orient = 'h',
showmeans = True,
fliersize = 2,
ax = ax)
plt.show()


The houses were sold in either 2014 or 2015. There is no apparent difference between the two price distributions.
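
A visual comparison of two box plots can be backed up with group medians and a two-sample test. The sketch below uses the Mann-Whitney U test, which is one reasonable choice for skewed prices and is not something the notebook itself runs. The data are synthetic (`toy_df` is an assumption for illustration), with both years drawn from the same distribution:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic stand-in: both years share one log-normal price distribution.
toy_df = pd.DataFrame({
    'when_sold': ['2014'] * 300 + ['2015'] * 200,
    'price': rng.lognormal(mean=13.0, sigma=0.5, size=500),
})

# Median price per sale year
medians = toy_df.groupby('when_sold')['price'].median()

# Two-sided Mann-Whitney U test comparing the two groups of prices
price_2014 = toy_df.loc[toy_df['when_sold'] == '2014', 'price']
price_2015 = toy_df.loc[toy_df['when_sold'] == '2015', 'price']
u_stat, p_value = stats.mannwhitneyu(price_2014, price_2015,
                                     alternative='two-sided')

print(medians)
print(u_stat, p_value)
```

With a dataset as large as the real one, even a small shift can produce a small p-value, so the medians are worth reading alongside the test.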