This tutorial explains the concept of linear regression and the use of statsmodels and sklearn to build and evaluate regression models.

Topics covered here are:

• Exploratory analysis to understand the influence of factors on the dependent variable.
• Building a simple linear regression model and a multiple linear regression model.
• Understanding the OLS method to estimate model parameters.
• How to use the statsmodels API in Python.
• Interpreting the coefficients of the model.
• How to find whether the estimated parameters are significant.
• Making predictions using the model.
• Finding the model residuals and analyzing them.
• Evaluating model efficiency using RMSE and R-squared values.
• Finding multicollinearity.
In [2]:
import pandas as pd
import numpy as np


• The dataset is available here

• TV, Radio, Newspaper - Advertising spend on the respective channel
• Sales - Sales revenue generated
• Note: The amounts are in different units

In [3]:
advt = pd.read_csv( "Advertising.csv" )

In [4]:
advt.head()

Out[4]:
Unnamed: 0 TV Radio Newspaper Sales
0 1 230.1 37.8 69.2 22.1
1 2 44.5 39.3 45.1 10.4
2 3 17.2 45.9 69.3 9.3
3 4 151.5 41.3 58.5 18.5
4 5 180.8 10.8 58.4 12.9
In [5]:
advt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
Unnamed: 0    200 non-null int64
TV            200 non-null float64
Radio         200 non-null float64
Newspaper     200 non-null float64
Sales         200 non-null float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB


## Remove the first column

• Remove the index column
In [6]:
advt = advt[["TV", "Radio", "Newspaper", "Sales"]]

In [7]:
advt.head()

Out[7]:
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9

## Exploring Data

### Distribution of Variables

In [8]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [52]:
sns.distplot( advt.Sales );

In [53]:
sns.distplot( advt.Newspaper );

In [54]:
sns.distplot( advt.Radio );

In [55]:
sns.distplot( advt.TV );


#### Note:

Sales appears to be approximately normally distributed. Spending on newspaper advertising is right-skewed: most newspaper spends are fairly low, whereas spends on radio and TV look closer to uniform distributions. Spends on TV are comparatively higher than spends on radio and newspaper.
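To back up the visual impression with a number, pandas can compute the skewness of each column directly (values above zero indicate a right skew):

# Skewness of each column; Newspaper should come out clearly positive
advt[['TV', 'Radio', 'Newspaper', 'Sales']].skew()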

## Is there a relationship between sales and spend on the various advertising channels?

In [56]:
sns.jointplot(  advt.Newspaper, advt.Sales );


In [57]:
sns.jointplot(  advt.TV, advt.Sales );


#### Note:

Sales and newspaper spend are not highly correlated, whereas sales and TV spend are highly correlated.

## Visualizing pairwise correlation

In [58]:
sns.pairplot( advt );


## Calculating correlations

In [16]:
advt.TV.corr( advt.Sales )

Out[16]:
0.78222442486160615
In [17]:
advt.corr()

Out[17]:
           TV       Radio    Newspaper Sales
TV         1.000000 0.054809 0.056648  0.782224
Radio      0.054809 1.000000 0.354104  0.576223
Newspaper  0.056648 0.354104 1.000000  0.228299
Sales      0.782224 0.576223 0.228299  1.000000

#### Note:

• The diagonal of the above matrix shows the auto-correlation of the variables; it is always 1. You can observe that the correlation between TV and Sales is the highest, i.e. 0.78, followed by Sales and Radio, i.e. 0.576.

• Correlations can vary from -1 to +1. Closer to +1 means a strong positive correlation and closer to -1 means a strong negative correlation. Closer to 0 means the variables are not strongly correlated. Variables with strong correlations are the most probable candidates for model building.

## Visualizing the correlation

#### The darker the color, the stronger the correlation.

In [60]:
sns.heatmap( advt.corr(), annot=True );


## Building a Simple Linear Regression model

Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. Source: wikipedia

A simple linear regression model is given by

$$Y = mX + b$$

where m is the slope and b is the y-intercept. Y is the dependent variable and X is the explanatory variable.

Very briefly and simplistically, Linear Regression is a class of techniques for fitting a straight line to a set of data points.
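As a quick illustration of fitting a straight line before turning to statsmodels, numpy's polyfit can do a degree-1 least-squares fit (a minimal sketch, using the advt frame loaded above):

# Degree-1 polynomial fit: returns the slope m and intercept b of the best-fit line
m, b = np.polyfit( advt.TV, advt.Sales, 1 )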

In [61]:
import statsmodels.api as smf


### Splitting Dataset

In [23]:
from sklearn.model_selection import train_test_split

# Define the response (Y) and the predictor with an intercept term (X)
Y = advt.Sales
X = smf.add_constant( advt.TV )

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.3 )

In [24]:
len( X_train )

Out[24]:
140
In [25]:
len( X_test )

Out[25]:
60

## Estimating the best fitted line using OLS

• Here Sales is the response variable and TV is the predictor. With statsmodels' formula interface, multiple predictors can be combined with a + symbol; with the array interface used here, they are simply passed as additional columns of X.
• When an intercept is needed, a constant term has to be added (done above with smf.add_constant).
In [26]:
lm = smf.OLS( y_test, X_test ).fit()


## Getting model parameters

In [27]:
lm.params

Out[27]:
const    7.713080
TV       0.044906
dtype: float64

### Parameters at 95% confidence interval

In [28]:
lm.conf_int()

Out[28]:
0 1
const 5.948499 9.477661
TV 0.034994 0.054817

## Model Summary

In [29]:
lm.summary()

Out[29]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.586
Model:                            OLS   Adj. R-squared:                  0.579
Method:                 Least Squares   F-statistic:                     82.25
Date:                Wed, 20 Jun 2018   Prob (F-statistic):           1.02e-12
Time:                        05:33:35   Log-Likelihood:                -154.13
No. Observations:                  60   AIC:                             312.3
Df Residuals:                      58   BIC:                             316.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.7131      0.882      8.750      0.000       5.948       9.478
TV             0.0449      0.005      9.069      0.000       0.035       0.055
==============================================================================
Omnibus:                        0.220   Durbin-Watson:                   2.056
Prob(Omnibus):                  0.896   Jarque-Bera (JB):                0.076
Skew:                          -0.086   Prob(JB):                        0.963
Kurtosis:                       2.982   Cond. No.                         379.
==============================================================================

### A detailed article explaining the summary output is available here:

http://connor-johnson.com/2014/02/18/linear-regression-with-python/

#### Note:

Estimated parameters are considered significant if the p-value is less than 0.05.

This indicates that both the intercept and TV are significant parameters, and the parameter estimates can be accepted.

So, the linear model is

$$Sales = 7.7131 + 0.0449 * {TV}$$
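For example, a TV spend of 100 units gives a predicted Sales of about 7.7131 + 0.0449 × 100 ≈ 12.2.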

### Evaluating Model Accuracy

• R-squared is a statistical measure of how close the data are to the fitted regression line.
• R-squared signifies the percentage of variation in the response variable that can be explained by the model.
• R-squared = Explained variation / Total variation
• Total variation is the variation of the response variable around its mean.
• R-squared varies between 0 and 100%. 0% signifies that the model explains none of the variability, while 100% signifies that the model explains all the variability of the response. The closer the R-squared is to 100%, the better the model. A manual computation is sketched below.
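The same number can be computed by hand from the explained-vs-total-variation definition (a minimal sketch, assuming the lm fitted above on X_test, y_test):

ss_res = ((y_test - lm.fittedvalues) ** 2).sum()   # unexplained variation
ss_tot = ((y_test - y_test.mean()) ** 2).sum()     # total variation around the mean
r_squared = 1 - ss_res / ss_tot                    # should match lm.rsquared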
In [30]:
lm.rsquared

Out[30]:
0.58645305725504637
In [31]:
round( float( lm.rsquared ), 2 )

Out[31]:
0.59

## Making Predictions

In [33]:
lmpredict = lm.predict( X_test )

In [34]:
lmpredict[0:10]

Out[34]:
164    12.976005
109    19.181946
22      8.305833
83     10.784616
142    17.614743
150    20.318055
168    17.385725
129    10.389448
86     11.139370
191    11.103446
dtype: float64

## Calculating RMSE

• RMSE measures the difference between the actual and predicted values of the response variable.
• It is the square root of the mean of the squared errors; see the formula below.
• Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors.
• The lower the RMSE value, the better the model.
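In formula terms, with $y_i$ the actual and $\hat{y_i}$ the predicted values:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2}$$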

In [35]:
from statsmodels.tools.eval_measures import rmse

In [37]:
rmse( y_test, lmpredict )

Out[37]:
3.1577045638046197

## Get the residuals and plot them

In [38]:
lm.resid[1:10]

Out[38]:
109    0.618054
22    -2.705833
83     2.815384
142    2.485257
150   -4.218055
168   -0.285725
129   -0.689448
86     0.860630
191   -1.203446
dtype: float64

#### One of the assumptions of linear regression is that the residuals are normally distributed, i.e. they should look random.

• The residuals should be plotted against the response variable and should not show any pattern.
In [62]:
sns.jointplot(  advt.Sales, lm.resid );


## Multiple Linear Regression Model

Adding all independent variables to the model.

In [63]:
X = smf.add_constant( advt[['TV', 'Radio','Newspaper']] )

In [64]:
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.3 )

In [65]:
X.head( 2 )

Out[65]:
const TV Radio Newspaper
0 1.0 230.1 37.8 69.2
1 1.0 44.5 39.3 45.1
In [66]:
lm = smf.OLS( y_train, X_train ).fit()

In [67]:
lm.params

Out[67]:
const        2.798989
TV           0.047169
Newspaper   -0.000359
dtype: float64
In [68]:
lm.pvalues

Out[68]:
const        2.961458e-11
TV           3.189362e-58
Newspaper    9.593033e-01
dtype: float64
In [69]:
lm.summary()

Out[69]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.896
Model:                            OLS   Adj. R-squared:                  0.893
Method:                 Least Squares   F-statistic:                     388.5
Date:                Wed, 20 Jun 2018   Prob (F-statistic):           1.75e-66
Time:                        05:37:24   Log-Likelihood:                -273.45
No. Observations:                 140   AIC:                             554.9
Df Residuals:                     136   BIC:                             566.7
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.7990      0.386      7.242      0.000       2.035       3.563
TV             0.0472      0.002     27.949      0.000       0.044       0.051
Radio          0.1839      0.011     17.103      0.000       0.163       0.205
Newspaper     -0.0004      0.007     -0.051      0.959      -0.014       0.014
==============================================================================
Omnibus:                       46.069   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              113.500
Skew:                          -1.335   Prob(JB):                     2.26e-25
Kurtosis:                       6.511   Cond. No.                         460.
==============================================================================
In [70]:
lmpredict = lm.predict( X )

In [71]:
rmse( Y, lmpredict )

Out[71]:
1.6741073852074499
In [72]:
sns.jointplot(  advt.Sales, lm.resid );


## Build the model with Significant Variables

In [73]:
X = smf.add_constant( advt[['TV', 'Radio']] )

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.3 )

In [74]:
# X already includes a constant column, so add_constant leaves X_train unchanged
X_train = smf.add_constant( X_train )

In [75]:
lm = smf.OLS( y_train, X_train ).fit()

In [76]:
lm.summary()

Out[76]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.896
Model:                            OLS   Adj. R-squared:                  0.894
Method:                 Least Squares   F-statistic:                     588.0
Date:                Wed, 20 Jun 2018   Prob (F-statistic):           5.78e-68
Time:                        05:38:32   Log-Likelihood:                -278.14
No. Observations:                 140   AIC:                             562.3
Df Residuals:                     137   BIC:                             571.1
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.5430      0.381      6.670      0.000       1.789       3.297
TV             0.0479      0.002     27.680      0.000       0.045       0.051
Radio          0.1889      0.010     18.728      0.000       0.169       0.209
==============================================================================
Omnibus:                       44.122   Durbin-Watson:                   2.098
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               94.606
Skew:                          -1.349   Prob(JB):                     2.86e-21
Kurtosis:                       5.989   Cond. No.                         440.
==============================================================================
In [77]:
y_predict = lm.predict( X_test )

In [78]:
rmse( y_predict, y_test )

Out[78]:
1.4632534113995157
In [79]:
residuals = y_test - y_predict


## Residual Plots

### Scatter plot

In [80]:
sns.jointplot(  y_test, residuals )

Out[80]:
<seaborn.axisgrid.JointGrid at 0x11eb5d438>

### The residuals are not randomly distributed: the p-value for the correlation shown on the plot is less than 0.05, so the estimated correlation is statistically significant.

In [81]:
import pylab
import scipy.stats as stats


### Q-Q Plot

In [84]:
sns.distplot( residuals );

In [85]:
stats.probplot( residuals, dist="norm", plot=pylab )
pylab.show();


## Normality test for residuals

In [86]:
from scipy.stats import normaltest

In [87]:
normaltest( residuals )

Out[87]:
NormaltestResult(statistic=2.2783155635673511, pvalue=0.32008849268311235)

#### Note:

The null hypothesis of the test is that the sample is drawn from a normal distribution. As the p-value (0.32) is well above 0.05, the null hypothesis cannot be rejected, and it can be concluded that the residuals are normally distributed.
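The decision rule can be spelled out explicitly (a small sketch using the normaltest result from above):

stat, p_value = normaltest( residuals )
# Reject normality only when the p-value falls below the 0.05 threshold
print( "residuals look normal" if p_value >= 0.05 else "residuals deviate from normality" )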

## Estimating the coefficients manually

In [88]:
X.head( 5 )

Out[88]:
const TV Radio
0 1.0 230.1 37.8
1 1.0 44.5 39.3
2 1.0 17.2 45.9
3 1.0 151.5 41.3
4 1.0 180.8 10.8

### Converting into numpy array

In [89]:
X_mat = X.as_matrix()

In [90]:
Y_mat = Y.as_matrix()

In [91]:
from numpy.linalg import inv

• It is well known that an estimate of β is given by (as per this wikipedia article)

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$

### Applying the above mathematical formula

In [92]:
beta_vals = np.matmul(
    np.matmul( inv( np.matmul( X_mat.transpose(), X_mat ) ),
               X_mat.transpose() ),
    Y_mat )

In [93]:
list( beta_vals )

Out[93]:
[2.9210999124051358, 0.045754815101076131, 0.18799422662030896]

## Checking multi-collinearity

• VIF (Variance Inflation Factor) is an indicator of how strongly a predictor is correlated with the other predictors. It is obtained by taking the predictor in question and regressing it against all of the other predictors in the model.
• The VIF is then calculated from the R-squared value of that regression, as below (a hand-rolled version follows the formula):

$$VIF_{x1} = \frac{1}{1 - R^{2}_{x1}}$$
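To make the formula concrete, here is a minimal sketch mirroring what statsmodels' variance_inflation_factor does: regress column i on the remaining columns as given (without adding a constant, so statsmodels reports the uncentered R-squared) and plug the result into the formula above.

def vif_manual( exog, i ):
    # Regress column i on all the other columns of the design matrix
    y = exog[:, i]
    others = np.delete( exog, i, axis=1 )
    r_squared = smf.OLS( y, others ).fit().rsquared
    # VIF from the formula above
    return 1.0 / (1.0 - r_squared)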

In [94]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [97]:
advt_mat = advt[['TV', "Radio"]].as_matrix()

In [98]:
advt_mat.shape

Out[98]:
(200, 2)
In [99]:
vif = [ variance_inflation_factor( advt_mat, i ) for i in range( advt_mat.shape[1] ) ]

In [100]:
vif_factors = pd.DataFrame()

In [101]:
vif_factors['column'] = ['TV', "Radio"]
vif_factors['vif'] = vif

In [102]:
vif_factors

Out[102]:
column vif
0 TV 2.238085
1 Radio 2.238085

#### Note:

As the VIF factors are less than 5, there is no multicollinearity between the variables.

## Creating New Features

Perhaps spending \$50,000 on television advertising and \$50,000 on radio advertising results in more sales than allocating \$100,000 to either television or radio individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect. Taken from the ISLR book http://www-bcf.usc.edu/~gareth/ISL/

So, let's create an interaction variable for TV and Radio spending.

In [114]:
# Create the interaction (synergy) feature from the TV and Radio spends
advt['tv_radio'] = advt.TV * advt.Radio

X = smf.add_constant( advt[['TV', 'Radio', 'Newspaper', 'tv_radio']] )

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.3 )

In [115]:
X_train[0:5]

Out[115]:
const TV Radio Newspaper tv_radio
48 1.0 227.2 15.8 49.9 3589.76
182 1.0 56.2 5.7 29.7 320.34
192 1.0 17.2 4.1 31.6 70.52
27 1.0 240.1 16.7 22.9 4009.67
47 1.0 239.9 41.5 18.5 9955.85
In [131]:
lm = smf.OLS( y_train, X_train ).fit()

In [132]:
lm.summary2()

Out[132]:
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:      0.932
Dependent Variable: Sales            AIC:                 782.82
Date:               2018-06-20 05:50 BIC:                 788.704
No. Observations:   140              Log-Likelihood:      -389.41
Df Model:           2                F-statistic:         955.9
Df Residuals:       138              Prob (F-statistic):  1.39e-81
R-squared:          0.933            Scale:               15.479
------------------------------------------------------------------
             Coef.   Std.Err.      t      P>|t|    [0.025   0.975]
------------------------------------------------------------------
TV           0.0554    0.0037   15.0057   0.0000   0.0481   0.0626
tv_radio     0.0011    0.0001    8.7420   0.0000   0.0009   0.0014
------------------------------------------------------------------
Omnibus:              73.255    Durbin-Watson:          1.629
Prob(Omnibus):        0.000     Jarque-Bera (JB):       10.568
Skew:                 -0.251    Prob(JB):               0.005
Kurtosis:             1.751     Condition No.:          55
==================================================================

When the synergy effect is added, Radio and Newspaper are no longer significant variables; only TV and tv_radio are significant.

### Removing Radio from the model

In [133]:
X = X[['TV', 'tv_radio']]

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.3 )

In [134]:
lm = smf.OLS( y_train, X_train ).fit()

In [135]:
lm.summary2()

Out[135]:
                 Results: Ordinary least squares
==================================================================
Model:              OLS              Adj. R-squared:      0.926
Dependent Variable: Sales            AIC:                 785.266
Date:               2018-06-20 05:50 BIC:                 791.149
No. Observations:   140              Log-Likelihood:      -390.63
Df Model:           2                F-statistic:         881.7
Df Residuals:       138              Prob (F-statistic):  2.49e-79
R-squared:          0.927            Scale:               15.752
------------------------------------------------------------------
             Coef.   Std.Err.      t      P>|t|    [0.025   0.975]
------------------------------------------------------------------
TV           0.0541    0.0038   14.1991   0.0000   0.0466   0.0617
tv_radio     0.0012    0.0001    8.8029   0.0000   0.0009   0.0015
------------------------------------------------------------------
Omnibus:              53.647    Durbin-Watson:          1.628
Prob(Omnibus):        0.000     Jarque-Bera (JB):       10.263
Skew:                 -0.297    Prob(JB):               0.006
Kurtosis:             1.814     Condition No.:          52
==================================================================

## Model Deployment

The model can be exported to a file which stores the model parameters. The file can then be transferred to the system that needs to apply the model for prediction.

In [141]:
from sklearn.externals import joblib
joblib.dump(lm, 'lin_model.pkl', compress=9)

Out[141]:
['lin_model.pkl']
In [142]:
model_clone = joblib.load('lin_model.pkl')

In [143]:
model_clone.params

Out[143]:
TV          0.054140
tv_radio    0.0012
dtype: float64
In [144]:
# Confidence intervals for the loaded model's parameters (the default alpha=0.05 gives 95% intervals)
model_clone.conf_int()