This tutorial explains the concept of linear regression and shows how to use sklearn to build and evaluate a regression model.

Topics covered here are:

  • What is linear regression?
  • Analyzing the advertising dataset.
  • Splitting the dataset and cross-validating models.
  • Building a linear regression model.
  • How to use the sklearn API in Python?
  • Interpreting the coefficients of the model.
  • Making predictions using the model.
  • Finding and analyzing model residuals.
  • Evaluating model performance using RMSE and R-squared values.
In [1]:
import pandas as pd
import numpy as np

Loading the Advertising Dataset

The advertising dataset captures the sales revenue generated with respect to advertisement spend across multiple channels such as radio, TV, and newspaper.

  • The dataset is available here

Attribute Descriptions

  • TV - Spend on TV Advertisements
  • Radio - Spend on radio Advertisements
  • Newspaper - Spend on newspaper Advertisements
  • Sales - Sales revenue generated
  • Note: The amounts are in different units.

Read the data

In [2]:
advt = pd.read_csv( "Advertising.csv" )
In [3]:
advt.head()
Out[3]:
Unnamed: 0 TV Radio Newspaper Sales
0 1 230.1 37.8 69.2 22.1
1 2 44.5 39.3 45.1 10.4
2 3 17.2 45.9 69.3 9.3
3 4 151.5 41.3 58.5 18.5
4 5 180.8 10.8 58.4 12.9
In [4]:
advt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
Unnamed: 0    200 non-null int64
TV            200 non-null float64
Radio         200 non-null float64
Newspaper     200 non-null float64
Sales         200 non-null float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB

Remove the first column

  • The first (unnamed) column is just a row index and is not needed for modeling, so keep only the four relevant columns.
In [5]:
advt = advt[["TV", "Radio", "Newspaper", "Sales"]]
In [6]:
advt.head()
Out[6]:
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9

Building a Simple Linear Regression model

Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. Source: wikipedia

A simple linear regression model is given by

$$Y = mX + b$$

where m is the slope and b is the y-intercept. Y is the dependent variable and X is the explanatory variable.

Very briefly and simplistically, Linear Regression is a class of techniques for fitting a straight line to a set of data points.
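
Before moving to sklearn, here is a minimal sketch of what "fitting a straight line" means in practice, using numpy's polyfit on the TV column of the dataset loaded above; the names m and b mirror the equation:

# Sketch: fit the straight line Sales = m * TV + b with numpy's polyfit.
m, b = np.polyfit( advt.TV, advt.Sales, deg = 1 )
print( "slope m:", round( m, 4 ), "  intercept b:", round( b, 4 ) )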

Splitting into Train and Test Datasets

Typically the model should be built on a training dataset and validated against a test dataset.

  • Let's split the dataset in a 70:30 ratio: 70% for training and 30% for testing.
In [7]:
from sklearn.model_selection import train_test_split
In [8]:
X_train, X_test, y_train, y_test = train_test_split(
  advt[["TV", "Radio", "Newspaper"]],
  advt.Sales,
  test_size=0.3,
  random_state = 42 )
In [9]:
len( X_train )
Out[9]:
140
In [10]:
len( X_test )
Out[10]:
60

Using sklearn library to build the model

The sklearn library has a comprehensive set of APIs to split datasets, build models, test them, and calculate accuracy metrics.

In [11]:
from sklearn.linear_model import LinearRegression

Building the model with training dataset

In [12]:
linreg = LinearRegression()
linreg.fit( X_train, y_train )
Out[12]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Finding the intercept and coefficients

In [13]:
linreg.intercept_
Out[13]:
2.7089490925159119
In [14]:
list( zip( ["TV", "Radio", "Newspaper"], list( linreg.coef_ ) ) )
Out[14]:
[('TV', 0.044059280957465197),
('Radio', 0.19928749689893949),
('Newspaper', 0.0068824522222754002)]

So, the equation looks like

Sales = 2.709 + 0.044 * TV + 0.199 * Radio + 0.007 * Newspaper
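
To make the coefficients concrete, here is a small sketch (using only objects created above) that reproduces a single prediction by hand from this equation and compares it with linreg.predict():

# Sketch: compute a prediction by hand from intercept_ and coef_ for the
# first test row, and compare it with the model's own prediction.
first_row = X_test.iloc[0]
manual_pred = ( linreg.intercept_
                + linreg.coef_[0] * first_row.TV
                + linreg.coef_[1] * first_row.Radio
                + linreg.coef_[2] * first_row.Newspaper )
print( manual_pred, linreg.predict( X_test.iloc[[0]] )[0] )  # the two values should match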

Making predictions on test dataset

In [15]:
y_pred = linreg.predict( X_test )

Comparing the predictions with actuals

In [16]:
test_pred_df = pd.DataFrame( { 'actual': y_test,
                            'predicted': np.round( y_pred, 2),
                            'residuals': y_test - y_pred } )
In [17]:
test_pred_df[0:10]
Out[17]:
actual predicted residuals
95 16.9 16.57 0.334604
15 22.4 21.19 1.211772
30 21.4 21.55 -0.151071
158 7.3 10.89 -3.589238
128 24.7 22.20 2.497680
115 12.6 13.36 -0.755569
69 22.3 21.20 1.103075
170 8.4 7.35 1.049715
174 11.5 13.28 -1.775471
45 14.9 15.12 -0.224495

Measuring model accuracy using metrics

In [18]:
from sklearn import metrics

Calculating RMSE

  • RMSE measures the difference between the actual and predicted values of the response variable.
  • It is the square root of the mean of the squared errors.
  • Compared to the similar Mean Absolute Error, RMSE amplifies and more severely punishes large errors.
  • The lower the RMSE, the better the model.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}$$

In [19]:
rmse = np.sqrt( metrics.mean_squared_error( y_test, y_pred ) )
In [20]:
round( rmse, 2 )
Out[20]:
1.95
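
As a quick sanity check, the same value can be computed directly from the definition above, without the metrics module:

# Sketch: RMSE computed from its definition - the square root of the mean
# of the squared residuals.
manual_rmse = np.sqrt( np.mean( ( y_test - y_pred ) ** 2 ) )
round( manual_rmse, 2 )   # should agree with the value above (1.95)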

Evaluating Model Accuracy

  • R-squared is a statistical measure of how close the data are to the fitted regression line.
  • R-squared signifies the percentage of variation in the response variable that can be explained by the model.
  • R-squared = Explained variation / Total variation
  • Total variation is the variation of the response variable around its mean.
  • R-squared varies between 0 and 100%. 0% signifies that the model explains none of the variability, while 100% signifies that the model explains all of the variability of the response. The closer R-squared is to 100%, the better the model.

Calculating R-squared

In [21]:
metrics.r2_score( y_test, y_pred )
Out[21]:
0.86094665082303679
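
To relate this number back to the definition above, R-squared can also be computed by hand as one minus the ratio of residual variation to total variation:

# Sketch: R-squared = 1 - (residual variation / total variation)
ss_res = np.sum( ( y_test - y_pred ) ** 2 )              # unexplained (residual) variation
ss_tot = np.sum( ( y_test - np.mean( y_test ) ) ** 2 )   # total variation around the mean
1 - ss_res / ss_tot   # should agree with metrics.r2_score above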

Understanding residuals

In [22]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [23]:
residuals = y_test - y_pred
In [24]:
sn.jointplot( y_test, residuals, size = 6 )   # residuals plotted against the actual test values
Out[24]:
<seaborn.axisgrid.JointGrid at 0x116766be0>
In [25]:
sn.distplot( residuals )
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x116a83ac8>

The residuals appear randomly distributed, with no visible relationship to the actual values, so the linear model can be assumed to be a reasonable fit.

Creating a new feature and rebuilding the model

Perhaps spending $50,000 on television advertising and $50,000 on radio advertising results in more sales than allocating $100,000 to either television or radio individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.

In [26]:
X_train['tv_radio'] = X_train.TV * X_train.Radio
X_test['tv_radio'] = X_test.TV * X_test.Radio
In [27]:
linreg = LinearRegression()
linreg.fit( X_train, y_train )
Out[27]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [28]:
y_pred = linreg.predict( X_test )
In [29]:
metrics.r2_score(y_test, y_pred)
Out[29]:
0.9661335713236503
In [30]:
residuals = y_test - y_pred
In [31]:
sn.jointplot( y_test, residuals, size = 6 )   # residuals for the model with the interaction term
Out[31]:
<seaborn.axisgrid.JointGrid at 0x1129859b0>
In [32]:
sn.distplot( residuals )
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1129853c8>

K-Fold Cross Validation

In [33]:
from sklearn.model_selection import cross_val_score
In [34]:
linreg = LinearRegression()
In [35]:
cv_scores = cross_val_score( linreg, X_train, y_train, scoring = 'r2', cv = 10 )
cv_scores
Out[35]:
array([ 0.95903494,  0.9876863 ,  0.95936288,  0.9869768 ,  0.94930993,
      0.98115182,  0.96756731,  0.98532576,  0.83492477,  0.96338315])
In [36]:
print( "Average r2 score: ", np.round( np.mean( cv_scores ), 2 ) )
print( "Standard deviation in r2 score: ", np.round( np.std( cv_scores ), 2) )
Average r2 score:  0.96
Standard deviation in r2 score:  0.04
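
The same cross-validation call can be scored with other metrics. As a sketch, 'neg_mean_squared_error' gives the (negative) MSE per fold, from which an average RMSE across folds can be reported alongside the r2 scores:

# Sketch: 10-fold cross-validation scored with negative MSE, converted to RMSE.
mse_scores = cross_val_score( linreg, X_train, y_train,
                              scoring = 'neg_mean_squared_error', cv = 10 )
rmse_scores = np.sqrt( -mse_scores )   # sklearn returns negative MSE, so negate first
print( "Average RMSE across folds: ", np.round( np.mean( rmse_scores ), 2 ) )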

Feature Ranking based on importance

Here is what f_regression does, on an input matrix X and target array y. For every feature $X_{i}$ it first computes the correlation with y:

$$\rho_{i} = \frac{(X_{i} - \bar{X}_{i}) \cdot (y - \bar{y})}{\lVert X_{i} - \bar{X}_{i} \rVert \, \lVert y - \bar{y} \rVert}$$

Then it computes the F-statistic:

$$F_{i} = \frac{\rho_{i}^{2}}{1 - \rho_{i}^{2}}$$

These F-values are then returned, together with the associated p-values, so the result is a tuple (F-values, p-values). SelectKBest then takes the first component of this tuple (the scores), sorts it, and picks the k features of X with the highest scores.

In [37]:
from sklearn.feature_selection import SelectKBest, f_regression
In [38]:
model = SelectKBest( score_func=f_regression, k=4 )
In [39]:
results = model.fit( X_train, y_train )
In [40]:
results.scores_
Out[40]:
array([  185.64138393,    88.09887658,     8.83792204,  1681.74689385])
In [41]:
results.pvalues_
Out[41]:
array([  2.55193304e-27,   1.72271169e-16,   3.48260999e-03,
       3.62229342e-79])
In [42]:
['%.3f' % p for p in results.pvalues_]
Out[42]:
['0.000', '0.000', '0.003', '0.000']

As the p-values are all less than 5%, the variables are significant in the regression equation. Also, the higher the F-value, the higher the importance. So, as per this test, we can rank the features as below.

  1. TV*Radio
  2. TV
  3. Radio
  4. Newspaper
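
The same ranking can be derived programmatically by pairing each column of X_train with its F-score and sorting; a small sketch:

# Sketch: rank features by their F-scores (highest first).
feature_names = list( X_train.columns )   # ["TV", "Radio", "Newspaper", "tv_radio"]
ranked = sorted( zip( feature_names, results.scores_ ),
                 key = lambda pair: pair[1], reverse = True )
ranked   # expected order: tv_radio, TV, Radio, Newspaper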

Building and Exporting the model

In [43]:
linreg = LinearRegression()
linreg.fit( X_train, y_train )
Out[43]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [44]:
import pickle
In [45]:
from sklearn.externals import joblib
joblib.dump(linreg, 'lin_model.pkl', compress=9)
Out[45]:
['lin_model.pkl']

Importing and applying the model for prediction

In [46]:
model_clone = joblib.load('lin_model.pkl')
In [47]:
model_clone.intercept_
Out[47]:
6.2827413679807229
In [48]:
model_clone.coef_
Out[48]:
array([ 0.02065759,  0.04560647,  0.00444591,  0.00100427])
In [49]:
pred_y = model_clone.predict( X_test )
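
As a final check, the reloaded model should behave exactly like the in-memory one; a small sketch comparing their predictions:

# Sketch: the reloaded model's predictions should match those of the
# original linreg object fitted and dumped above.
np.allclose( pred_y, linreg.predict( X_test ) )   # expected to be True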