This tutorial explains the concept of linear regression and the use of statsmodels and sklearn to build a regression model and evaluate it.

Topics covered here are:

  • Exploratory analysis to understand the influence of factors on the dependent variable.
  • Building a simple linear regression model and a multiple linear regression model.
  • Understanding the OLS method to estimate model parameters.
  • How to use the statsmodels API in Python.
  • Interpreting the coefficients of the model.
  • How to find out whether the estimated parameters are significant.
  • Making predictions using the model.
  • Finding model residuals and analyzing them.
  • Evaluating model efficiency using RMSE and R-squared values.
  • Detecting multicollinearity.
In [2]:
import pandas as pd
import numpy as np

Loading the Advertising Dataset

The advertising dataset captures the sales revenue generated with respect to advertisement spends across multiple channels like radio, TV and newspaper.

  • The dataset is available here

Attribute Descriptions

  • TV - Spend on TV Advertisements
  • Radio - Spend on radio Advertisements
  • Newspaper - Spend on newspaper Advertisements
  • Sales - Sales revenue generated
  • Note: The amounts are in different units

Read the data

In [3]:
advt = pd.read_csv( "Advertising.csv" )
In [4]:
advt.head()
Out[4]:
Unnamed: 0 TV Radio Newspaper Sales
0 1 230.1 37.8 69.2 22.1
1 2 44.5 39.3 45.1 10.4
2 3 17.2 45.9 69.3 9.3
3 4 151.5 41.3 58.5 18.5
4 5 180.8 10.8 58.4 12.9
In [5]:
advt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
Unnamed: 0    200 non-null int64
TV            200 non-null float64
Radio         200 non-null float64
Newspaper     200 non-null float64
Sales         200 non-null float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB

Remove the first column

  • Remove the index column (Unnamed: 0), which is not needed for modeling
In [6]:
advt = advt[["TV", "Radio", "Newspaper", "Sales"]]
In [7]:
advt.head()
Out[7]:
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9

Exploring Data

Distribution of Variables

In [8]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [52]:
sns.distplot( advt.Sales );
In [53]:
sns.distplot( advt.Newspaper );
In [54]:
sns.distplot( advt.Radio );
In [55]:
sns.distplot( advt.TV );

Note:

Sales appears to be approximately normally distributed. Spend on newspaper advertisements is right skewed: most newspaper spends are fairly low, whereas spends on radio and TV look roughly uniformly distributed. Spends on TV are comparatively higher than spends on radio and newspaper.

Is there a relationship between sales and the spend on the various advertising channels?

Sales Vs. Newspaper advertisement spends

In [56]:
sns.jointplot(  advt.Newspaper, advt.Sales );

Sales Vs. TV advertisement spends

In [57]:
sns.jointplot(  advt.TV, advt.Sales );

Note:

Sales and spend on newspaper are not highly correlated, whereas sales and spend on TV are highly correlated.

Visualizing pairwise correlation

In [58]:
sns.pairplot( advt );

Calculating correlations

In [16]:
advt.TV.corr( advt.Sales )
Out[16]:
0.78222442486160615
In [17]:
advt.corr()
Out[17]:
TV Radio Newspaper Sales
TV 1.000000 0.054809 0.056648 0.782224
Radio 0.054809 1.000000 0.354104 0.576223
Newspaper 0.056648 0.354104 1.000000 0.228299
Sales 0.782224 0.576223 0.228299 1.000000

Note:

  • The diagonal of the above matrix shows the auto-correlation of the variables; it is always 1. You can observe that the correlation between TV and Sales is the highest, i.e. 0.78, followed by the correlation between Sales and Radio, i.e. 0.576.

  • Correlations can vary from -1 to +1. Closer to +1 means a strong positive correlation and closer to -1 means a strong negative correlation. Closer to 0 means the variables are not strongly correlated. Variables with strong correlations are the most probable candidates for model building.

Visualizing the correlation

The darker the color, the stronger the correlation.

In [60]:
sns.heatmap( advt.corr(), annot=True );

Building a Simple Linear Regression model

Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. Source: wikipedia

A simple linear regression model is given by

$$Y = mX + b$$

where m is the slope and b is the y-intercept. Y is the dependent variable and X is the explanatory variable.

Very briefly and simplistically, Linear Regression is a class of techniques for fitting a straight line to a set of data points.
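As a quick illustration of this idea (separate from the statsmodels workflow used below), the straight line can also be fitted with numpy.polyfit. This is a minimal sketch assuming the advt DataFrame loaded earlier; the slope and intercept names are purely illustrative.

# Minimal sketch: fit Y = mX + b for Sales vs. TV spend with numpy.polyfit.
# polyfit returns the coefficients from the highest degree to the lowest, i.e. [m, b].
m, b = np.polyfit( advt.TV, advt.Sales, deg=1 )
print( "slope (m):", m, "intercept (b):", b )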

In [61]:
import statsmodels.api as smf   # statsmodels API; aliased as smf and used throughout this notebook

Splitting Dataset

In [23]:
from sklearn.model_selection import train_test_split

Y = advt.Sales
X = smf.add_constant( advt.TV )

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.3 )
In [24]:
len( X_train )
Out[24]:
140
In [25]:
len( X_test )
Out[25]:
60

Estimating the best fitted line using OLS


  • Here Sales is the response variable and TV is the predictor. Additional predictors can be included as extra columns in the design matrix (or combined with a + symbol when using the formula API).
  • When an intercept is needed, a constant term has to be added to the design matrix; add_constant does this, as shown in the quick look below.
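A quick look at the design matrix built earlier (X = smf.add_constant( advt.TV )) shows what add_constant does: it prepends a column of ones named const, which is what lets OLS estimate the intercept.

# The 'const' column of ones added by add_constant allows OLS to fit an intercept.
X.head( 2 )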
In [26]:
lm = smf.OLS( y_test, X_test ).fit()

Getting model parameters

In [27]:
lm.params
Out[27]:
const    7.713080
TV       0.044906
dtype: float64

Parameters at 95% confidence interval

In [28]:
lm.conf_int()
Out[28]:
0 1
const 5.948499 9.477661
TV 0.034994 0.054817

Model Summary

In [29]:
lm.summary()
Out[29]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.586
Model: OLS Adj. R-squared: 0.579
Method: Least Squares F-statistic: 82.25
Date: Wed, 20 Jun 2018 Prob (F-statistic): 1.02e-12
Time: 05:33:35 Log-Likelihood: -154.13
No. Observations: 60 AIC: 312.3
Df Residuals: 58 BIC: 316.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 7.7131 0.882 8.750 0.000 5.948 9.478
TV 0.0449 0.005 9.069 0.000 0.035 0.055
Omnibus: 0.220 Durbin-Watson: 2.056
Prob(Omnibus): 0.896 Jarque-Bera (JB): 0.076
Skew: -0.086 Prob(JB): 0.963
Kurtosis: 2.982 Cond. No. 379.

A detailed article on explaining the summary output is available here:

http://connor-johnson.com/2014/02/18/linear-regression-with-python/

Note:

Estimated parameters are considered significant if the p-value is less than 0.05.

This indicates that both the intercept and TV are significant parameters, and the parameter estimates can be accepted. A quick programmatic check is sketched below.
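A minimal sketch of checking significance programmatically, assuming lm is the fitted model from above: keep only the parameters whose p-values fall below the 0.05 threshold.

# Parameters with p-value < 0.05 are considered statistically significant.
significant_params = lm.params[ lm.pvalues < 0.05 ]
significant_params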

So, the linear model is

$$Sales = 7.7131 + 0.0449 * {TV} $$

Evaluating Model Accuracy

  • R-squared is a statistical measure of how close the data are to the fitted regression line.
  • R-squared signifies the percentage of variation in the response variable that can be explained by the model.
  • R-squared = Explained variation / Total variation (see the short sketch after this list).
  • Total variation is the variation of the response variable around its mean.
  • The R-squared value varies between 0 and 100%. 0% signifies that the model explains none of the variability, while 100% signifies that the model explains all the variability of the response. The closer R-squared is to 100%, the better the model.
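A short sketch of computing R-squared by hand from the definitions above, assuming lm, X_test and y_test from the earlier cells; the result should match lm.rsquared reported below.

# R-squared = 1 - (unexplained variation / total variation)
y_pred = lm.predict( X_test )
ss_res = np.sum( ( y_test - y_pred ) ** 2 )          # unexplained (residual) variation
ss_tot = np.sum( ( y_test - y_test.mean() ) ** 2 )   # total variation around the mean
r_squared_manual = 1 - ss_res / ss_tot
r_squared_manual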
In [30]:
lm.rsquared
Out[30]:
0.58645305725504637
In [31]:
round( float( lm.rsquared ), 2 )
Out[31]:
0.59

Making Predictions

In [33]:
lmpredict = lm.predict( X_test )
In [34]:
lmpredict[0:10]
Out[34]:
164    12.976005
109    19.181946
22      8.305833
83     10.784616
142    17.614743
150    20.318055
168    17.385725
129    10.389448
86     11.139370
191    11.103446
dtype: float64

Calculating RMSE

  • RMSE measures the difference between the actual and predicted values of the response variable.
  • It is the square root of the mean of the squared errors.
  • Compared to the similar Mean Absolute Error, RMSE amplifies and severely penalizes large errors.
  • The lower the RMSE value, the better the model (a manual computation is sketched below).

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}$$
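The same number can be computed by hand from the formula above; a minimal sketch assuming y_test and lmpredict from the earlier cells, which should match the rmse() call below.

# RMSE: square the errors, average them, then take the square root.
rmse_manual = np.sqrt( np.mean( ( y_test - lmpredict ) ** 2 ) )
rmse_manual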

In [35]:
from statsmodels.tools.eval_measures import rmse
In [37]:
rmse( y_test, lmpredict )
Out[37]:
3.1577045638046197

Get the residuals and plot them

In [38]:
lm.resid[1:10]
Out[38]:
109    0.618054
22    -2.705833
83     2.815384
142    2.485257
150   -4.218055
168   -0.285725
129   -0.689448
86     0.860630
191   -1.203446
dtype: float64

One of the assumptions is that the residuals should be normally distributed, i.e. they should be random.

  • The residuals should be plotted against the response variable and should not show any pattern.
In [62]:
sns.jointplot(  advt.Sales, lm.resid );

Multiple Linear Regression Model

Adding all independent variables to the model.

In [63]:
X = smf.add_constant( advt[['TV', 'Radio','Newspaper']] )
In [64]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.3 )
In [65]:
X.head( 2 )
Out[65]:
const TV Radio Newspaper
0 1.0 230.1 37.8 69.2
1 1.0 44.5 39.3 45.1
In [66]:
lm = smf.OLS( y_train, X_train ).fit()
In [67]:
lm.params
Out[67]:
const        2.798989
TV           0.047169
Radio        0.183851
Newspaper   -0.000359
dtype: float64
In [68]:
lm.pvalues
Out[68]:
const        2.961458e-11
TV           3.189362e-58
Radio        1.055526e-35
Newspaper    9.593033e-01
dtype: float64
In [69]:
lm.summary()
Out[69]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.896
Model: OLS Adj. R-squared: 0.893
Method: Least Squares F-statistic: 388.5
Date: Wed, 20 Jun 2018 Prob (F-statistic): 1.75e-66
Time: 05:37:24 Log-Likelihood: -273.45
No. Observations: 140 AIC: 554.9
Df Residuals: 136 BIC: 566.7
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.7990 0.386 7.242 0.000 2.035 3.563
TV 0.0472 0.002 27.949 0.000 0.044 0.051
Radio 0.1839 0.011 17.103 0.000 0.163 0.205
Newspaper -0.0004 0.007 -0.051 0.959 -0.014 0.014
Omnibus: 46.069 Durbin-Watson: 1.940
Prob(Omnibus): 0.000 Jarque-Bera (JB): 113.500
Skew: -1.335 Prob(JB): 2.26e-25
Kurtosis: 6.511 Cond. No. 460.
In [70]:
lmpredict = lm.predict( X )
In [71]:
rmse( Y, lmpredict )
Out[71]:
1.6741073852074499
In [72]:
sns.jointplot(  advt.Sales, lm.resid );

Build the model with Significant Variables

In [73]:
X = smf.add_constant( advt[['TV', 'Radio']] )

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.3 )
In [74]:
X_train = smf.add_constant( X_train )
X_test = smf.add_constant( X_test )
In [75]:
lm = smf.OLS( y_train, X_train ).fit()
In [76]:
lm.summary()
Out[76]:
OLS Regression Results
Dep. Variable: Sales R-squared: 0.896
Model: OLS Adj. R-squared: 0.894
Method: Least Squares F-statistic: 588.0
Date: Wed, 20 Jun 2018 Prob (F-statistic): 5.78e-68
Time: 05:38:32 Log-Likelihood: -278.14
No. Observations: 140 AIC: 562.3
Df Residuals: 137 BIC: 571.1
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.5430 0.381 6.670 0.000 1.789 3.297
TV 0.0479 0.002 27.680 0.000 0.045 0.051
Radio 0.1889 0.010 18.728 0.000 0.169 0.209
Omnibus: 44.122 Durbin-Watson: 2.098
Prob(Omnibus): 0.000 Jarque-Bera (JB): 94.606
Skew: -1.349 Prob(JB): 2.86e-21
Kurtosis: 5.989 Cond. No. 440.
In [77]:
y_predict = lm.predict( X_test )
In [78]:
rmse( y_predict, y_test )
Out[78]:
1.4632534113995157
In [79]:
residuals = y_test - y_predict

Residual Plots

Scatter plot

In [80]:
sns.jointplot(  y_test, residuals )
Out[80]:
<seaborn.axisgrid.JointGrid at 0x11eb5d438>

The residuals do not look randomly distributed against the response: the p-value for the correlation shown on the joint plot is less than 0.05, so the estimated correlation is statistically significant.

In [81]:
import pylab
import scipy.stats as stats

Q-Q Plot

In [82]:
import seaborn as sn
In [84]:
sn.distplot( residuals );
In [85]:
stats.probplot( residuals, dist="norm", plot=pylab )
pylab.show();

Normality test for residuals

In [86]:
from scipy.stats import normaltest
In [87]:
normaltest( residuals )
Out[87]:
NormaltestResult(statistic=2.2783155635673511, pvalue=0.32008849268311235)

Note:

As the null hypothesis is that the sample is drawn from a normal distribution, and the p-value (0.32) is greater than 0.05, we fail to reject it; it can therefore be concluded that the residuals are normally distributed.
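A small sketch of the decision rule, assuming residuals from above: reject normality only when the p-value falls below the 0.05 significance level.

# normaltest returns (statistic, p-value); a p-value above 0.05 means we fail to
# reject the null hypothesis that the residuals come from a normal distribution.
stat, pvalue = normaltest( residuals )
print( "residuals look normally distributed" if pvalue > 0.05 else "residuals deviate from normality" )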

Estimating the values of the coefficients manually

In [88]:
X.head( 5 )
Out[88]:
const TV Radio
0 1.0 230.1 37.8
1 1.0 44.5 39.3
2 1.0 17.2 45.9
3 1.0 151.5 41.3
4 1.0 180.8 10.8

Converting to numpy arrays

In [89]:
X_mat = X.values   # as_matrix() is deprecated in newer pandas; .values returns the same numpy array
In [90]:
Y_mat = Y.values
In [91]:
from numpy.linalg import inv
  • It is well known that an estimate of β is given by (as per this wikipedia article)

    $$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$

Applying the above mathematical formula

In [92]:
beta_vals = np.matmul( 
    np.matmul( 
        inv( np.matmul( X_mat.transpose(), X_mat ) ), X_mat.transpose() ),
    Y_mat )
In [93]:
list( beta_vals )
Out[93]:
[2.9210999124051358, 0.045754815101076131, 0.18799422662030896]

Checking multicollinearity

  • VIF (Variance Inflation Factor) is an indicator of how strongly a predictor is correlated with the other predictors. It is obtained by taking the predictor in question and regressing it against all of the other predictors in the model.
  • The VIF is then calculated from the R-squared value of that auxiliary regression as below (a manual sketch follows the formula):

    $$VIF = \frac{1}{1 - R^{2}_{x_{1}}}$$
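A rough sketch of what the variance_inflation_factor call below does for the TV column, assuming the advt DataFrame from earlier: regress TV on the remaining column(s) of the matrix passed in (here just Radio, with no constant column, mirroring the call below) and plug the resulting R-squared into the formula.

# Auxiliary regression of TV on the other predictor(s); no constant is added,
# matching the matrix that is passed to variance_inflation_factor below.
aux_model = smf.OLS( advt.TV, advt[['Radio']] ).fit()
vif_tv = 1.0 / ( 1.0 - aux_model.rsquared )
vif_tv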

In [94]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
In [97]:
advt_mat = advt[['TV', "Radio"]].values   # .values instead of the deprecated as_matrix()
In [98]:
advt_mat.shape
Out[98]:
(200, 2)
In [99]:
vif = [ variance_inflation_factor( advt_mat, i ) for i in range( advt_mat.shape[1] ) ]
In [100]:
vif_factors = pd.DataFrame()
In [101]:
vif_factors['column'] = ['TV', "Radio"]
vif_factors['vif'] = vif
In [102]:
vif_factors
Out[102]:
column vif
0 TV 2.238085
1 Radio 2.238085

Note:

As the VIF factors are less than 5, there is no multicollinearity between the variables.

Creating New Features

Adding Interaction Variable

Perhaps spending $50,000 on television advertising and $50,000 on radio advertising results in more sales than allocating $100,000 to either television or radio individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect. Taken from ISLR book http://www-bcf.usc.edu/~gareth/ISL/

So, let's create an interaction variable for TV and Radio Spending

In [114]:
X = smf.add_constant( advt[['TV', 'Radio', 'Newspaper']] )

X['tv_radio'] = X.TV * X.Radio

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.3 )
In [115]:
X_train[0:5]
Out[115]:
const TV Radio Newspaper tv_radio
48 1.0 227.2 15.8 49.9 3589.76
182 1.0 56.2 5.7 29.7 320.34
192 1.0 17.2 4.1 31.6 70.52
27 1.0 240.1 16.7 22.9 4009.67
47 1.0 239.9 41.5 18.5 9955.85
In [131]:
lm = smf.OLS( y_train, 
              X_train ).fit()
In [132]:
lm.summary2()
Out[132]:
Model: OLS Adj. R-squared: 0.932
Dependent Variable: Sales AIC: 782.8204
Date: 2018-06-20 05:50 BIC: 788.7036
No. Observations: 140 Log-Likelihood: -389.41
Df Model: 2 F-statistic: 955.9
Df Residuals: 138 Prob (F-statistic): 1.39e-81
R-squared: 0.933 Scale: 15.479
Coef. Std.Err. t P>|t| [0.025 0.975]
TV 0.0554 0.0037 15.0057 0.0000 0.0481 0.0626
tv_radio 0.0011 0.0001 8.7420 0.0000 0.0009 0.0014
Omnibus: 73.255 Durbin-Watson: 1.629
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10.568
Skew: -0.251 Prob(JB): 0.005
Kurtosis: 1.751 Condition No.: 55

When the synergy (interaction) effect is added, Radio and Newspaper are no longer significant variables; only TV and tv_radio remain significant.

Keeping only the significant variables (TV and tv_radio) in the model

In [133]:
X = X[['TV', 'tv_radio']]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.3 )
In [134]:
lm = smf.OLS( y_train, 
              X_train ).fit()
In [135]:
lm.summary2()
Out[135]:
Model: OLS Adj. R-squared: 0.926
Dependent Variable: Sales AIC: 785.2657
Date: 2018-06-20 05:50 BIC: 791.1490
No. Observations: 140 Log-Likelihood: -390.63
Df Model: 2 F-statistic: 881.7
Df Residuals: 138 Prob (F-statistic): 2.49e-79
R-squared: 0.927 Scale: 15.752
Coef. Std.Err. t P>|t| [0.025 0.975]
TV 0.0541 0.0038 14.1991 0.0000 0.0466 0.0617
tv_radio 0.0012 0.0001 8.8029 0.0000 0.0009 0.0015
Omnibus: 53.647 Durbin-Watson: 1.628
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10.263
Skew: -0.297 Prob(JB): 0.006
Kurtosis: 1.814 Condition No.: 52

Model Deployment

The model can be exported to a file, which stores the details of the model parameters. The file can then be transferred to the system that needs to apply the model for prediction.

In [141]:
import joblib   # sklearn.externals.joblib is deprecated; import joblib directly
joblib.dump(lm, 'lin_model.pkl', compress=9)
Out[141]:
['lin_model.pkl']
In [142]:
model_clone = joblib.load('lin_model.pkl')
In [143]:
model_clone.params
Out[143]:
TV          0.054140
tv_radio    0.001216
dtype: float64
In [144]:
model_clone.conf_int(0.95)   # note: the argument is alpha, so this is a (1 - 0.95) = 5% confidence interval
Out[144]:
0 1
TV 0.053901 0.054380
tv_radio 0.001207 0.001224