This tutorial explains the concept of linear regression and shows how to use sklearn to build a regression model and evaluate it.

Topics covered here are:

• What is linear regression?
• Splitting datasets and cross-validating models.
• Building a linear regression model.
• How to use the sklearn API in Python.
• Interpreting the coefficients of the model.
• Making predictions using the model.
• Finding and analyzing model residuals.
• Evaluating model accuracy using RMSE and R-squared values.
In [1]:
import pandas as pd
import numpy as np


The advertising dataset captures sales revenue generated with respect to advertisement spend across multiple channels like radio, TV and newspaper.

• The dataset is available here

• TV - Spend on TV advertisements
• Radio - Spend on radio advertisements
• Newspaper - Spend on newspaper advertisements
• Sales - Sales revenue generated
• Note: The amounts are in different units

## Read the data¶

In [2]:
advt = pd.read_csv( "Advertising.csv" )

In [3]:
advt.head()

Out[3]:
Unnamed: 0 TV Radio Newspaper Sales
0 1 230.1 37.8 69.2 22.1
1 2 44.5 39.3 45.1 10.4
2 3 17.2 45.9 69.3 9.3
3 4 151.5 41.3 58.5 18.5
4 5 180.8 10.8 58.4 12.9
In [4]:
advt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
Unnamed: 0    200 non-null int64
TV            200 non-null float64
Radio         200 non-null float64
Newspaper     200 non-null float64
Sales         200 non-null float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB


## Remove the first column¶

• Remove the index column
In [5]:
advt = advt[["TV", "Radio", "Newspaper", "Sales"]]

In [6]:
advt.head()

Out[6]:
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9

## Building a Simple Linear Regression model¶

Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. Source: wikipedia

A simple linear regression model is given by

$$Y = mX + b$$

where m is the slope and b is the y-intercept. Y is the dependent variable and X is the explanatory variable.

Very briefly and simplistically, Linear Regression is a class of techniques for fitting a straight line to a set of data points.
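As a minimal illustration of fitting a straight line by least squares (using made-up data points, not the advertising dataset), `np.polyfit` with degree 1 recovers the slope m and intercept b:

```python
import numpy as np

# Illustrative data: y is roughly 2*x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# np.polyfit with degree 1 fits y = m*x + b by least squares
m, b = np.polyfit(x, y, 1)
```

The recovered slope and intercept land close to the values used to generate the data, which is the essence of what `LinearRegression` does below (generalized to multiple features).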

## Splitting into train and test datasets¶

#### Typically the model should be built on a training dataset and validated against a test dataset.¶

• Let's split the dataset in a 70/30 ratio: 70% for training and 30% for test.
In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X = advt[["TV", "Radio", "Newspaper"]]
y = advt.Sales

X_train, X_test, y_train, y_test = train_test_split( X,
                                                     y,
                                                     test_size = 0.3,
                                                     random_state = 42 )

In [9]:
len( X_train )

Out[9]:
140
In [10]:
len( X_test )

Out[10]:
60

## Using sklearn library to build the model¶

#### sklearn library has a comprehensive set of APIs to split datasets, build models, test models and calculate accuracy metrics.¶

In [11]:
from sklearn.linear_model import LinearRegression


## Building the model with training dataset¶

In [12]:
linreg = LinearRegression()
linreg.fit( X_train, y_train )

Out[12]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Finding the intercept and coefficients¶

In [13]:
linreg.intercept_

Out[13]:
2.7089490925159119
In [14]:
list( zip( ["TV", "Radio", "Newspaper"], list( linreg.coef_ ) ) )

Out[14]:
[('TV', 0.044059280957465197),
('Radio', 0.199...),
('Newspaper', 0.0068824522222754002)]

So, the equation looks like

Sales = 2.708 + 0.044 * TV + 0.199 * Radio + 0.006 * Newspaper
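Using these (rounded) coefficients, a prediction can be reproduced by hand. For example, for the first row of the dataset (TV = 230.1, Radio = 37.8, Newspaper = 69.2):

```python
# Coefficients from the fitted model (rounded, as in the equation above)
intercept, b_tv, b_radio, b_news = 2.708, 0.044, 0.199, 0.006

# First row of the dataset: TV=230.1, Radio=37.8, Newspaper=69.2
sales = intercept + b_tv * 230.1 + b_radio * 37.8 + b_news * 69.2
print(round(sales, 2))  # → 20.77
```

Because the coefficients are rounded, this differs slightly from `linreg.predict`; the observed value for that row is 22.1, so the model under-predicts it by roughly 1.3 units.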

### Making predictions on test dataset¶

In [15]:
y_pred = linreg.predict( X_test )


### Comparing the predictions with actuals¶

In [16]:
test_pred_df = pd.DataFrame( { 'actual': y_test,
'predicted': np.round( y_pred, 2),
'residuals': y_test - y_pred } )

In [17]:
test_pred_df[0:10]

Out[17]:
actual predicted residuals
95 16.9 16.57 0.334604
15 22.4 21.19 1.211772
30 21.4 21.55 -0.151071
158 7.3 10.89 -3.589238
128 24.7 22.20 2.497680
115 12.6 13.36 -0.755569
69 22.3 21.20 1.103075
170 8.4 7.35 1.049715
174 11.5 13.28 -1.775471
45 14.9 15.12 -0.224495

## Measuring model accuracy using metrics¶

In [18]:
from sklearn import metrics


### Calculating RMSE¶

• RMSE measures the difference between the actual and predicted values of the response variable.
• It is the square root of the mean of the squared errors.
• Compared to the similar Mean Absolute Error, RMSE amplifies and more severely penalizes large errors.
• The lower the RMSE value, the better the model.
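The definition above can be checked on a small made-up example, computing RMSE both directly and via sklearn:

```python
import numpy as np
from sklearn import metrics

# Toy actual and predicted values (illustrative only)
actual    = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 6.0, 9.5])

# RMSE by its definition: square root of the mean of squared errors
errors = actual - predicted
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same value via sklearn
rmse_sklearn = np.sqrt(metrics.mean_squared_error(actual, predicted))
```

Both routes give the same number; sklearn is just a convenience for the formula.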

In [19]:
rmse = np.sqrt( metrics.mean_squared_error( y_test, y_pred ) )

In [20]:
round( rmse, 2 )

Out[20]:
1.95

## Evaluating Model Accuracy¶

• R-squared is a statistical measure of how close the data are to the fitted regression line.
• R-squared signifies the percentage of variation in the response variable that can be explained by the model.
• R-squared = Explained variation / Total variation
• Total variation is the variation of the response variable around its mean.
• R-squared varies between 0% and 100%. 0% signifies that the model explains none of the variability, while 100% signifies that the model explains all the variability of the response. The closer R-squared is to 100%, the better the model.
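The "explained variation / total variation" definition can be verified on a toy example against sklearn's `r2_score`:

```python
import numpy as np
from sklearn import metrics

# Toy actual and predicted values (illustrative only)
actual    = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 6.0, 9.5])

# R-squared = 1 - (unexplained variation / total variation)
ss_res = np.sum((actual - predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((actual - np.mean(actual)) ** 2)  # variation around the mean
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = metrics.r2_score(actual, predicted)
```

Here ss_res = 1.75 and ss_tot = 20, so both computations give R-squared = 0.9125.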

### Calculating R squared¶

In [21]:
metrics.r2_score( y_test, y_pred )

Out[21]:
0.86094665082303679

## Understanding residuals¶

In [22]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [23]:
residuals = y_test - y_pred

In [24]:
sn.jointplot( y_test, residuals, size = 6 )

Out[24]:
<seaborn.axisgrid.JointGrid at 0x116766be0>
In [25]:
sn.distplot( residuals )

Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x116a83ac8>

Perhaps spending \$50,000 on television advertising and \$50,000 on radio advertising results in more sales than allocating \$100,000 to either television or radio individually. In marketing, this is known as a synergy effect, while in statistics it is called an interaction effect.

In [26]:
X_train['tv_radio'] = X_train.TV * X_train.Radio
X_test['tv_radio'] = X_test.TV * X_test.Radio

In [27]:
linreg = LinearRegression()
linreg.fit( X_train, y_train )

Out[27]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [28]:
y_pred = linreg.predict( X_test )

In [29]:
metrics.r2_score(y_test, y_pred)

Out[29]:
0.9661335713236503

In [30]:
residuals = y_test - y_pred

In [31]:
sn.jointplot( y_test, residuals, size = 6 )

Out[31]:
<seaborn.axisgrid.JointGrid at 0x1129859b0>

In [32]:
sn.distplot( residuals )

Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1129853c8>

## K-Fold Cross Validation¶

In [33]:
from sklearn.model_selection import cross_val_score

In [34]:
linreg = LinearRegression()

In [35]:
cv_scores = cross_val_score( linreg, X_train, y_train, scoring = 'r2', cv = 10 )
cv_scores

Out[35]:
array([ 0.95903494,  0.9876863 ,  0.95936288,  0.9869768 ,  0.94930993,
        0.98115182,  0.96756731,  0.98532576,  0.83492477,  0.96338315])

In [36]:
print( "Average r2 score: ", np.round( np.mean( cv_scores ), 2 ) )
print( "Standard deviation in r2 score: ", np.round( np.std( cv_scores ), 2 ) )

Average r2 score:  0.96
Standard deviation in r2 score:  0.04

## Feature Ranking based on importance¶

Here is what f_regression does on an input matrix X and array y. For every feature, it computes the correlation $\rho_i$ with y. It then computes the F-statistic, scaled by the degrees of freedom $n - 2$:

$$F_i = \frac{\rho_i^2}{1 - \rho_i^2}\,(n - 2)$$

These F-values are then returned, together with the associated p-values, so the result is a tuple (F-values, p-values). SelectKBest then takes the first component of this tuple (the scores), sorts it, and keeps the k features of X with the highest scores.
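As a sketch on toy data (not the advertising dataset), the F-value that `f_regression` returns for a feature can be reproduced from that feature's correlation with y using the formula above:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
X = rng.rand(50, 2)                    # two toy features
y = 3 * X[:, 0] + rng.rand(50) * 0.1   # y depends mostly on the first feature

f_values, p_values = f_regression(X, y)

# Reproduce the F-value of the first feature from its correlation with y;
# sklearn scales the ratio by the degrees of freedom, n - 2
rho = np.corrcoef(X[:, 0], y)[0, 1]
n = len(y)
f_manual = rho ** 2 / (1 - rho ** 2) * (n - 2)
```

Since y is built from the first feature, its F-value dwarfs that of the second, which is what SelectKBest exploits when ranking features.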

In [37]:
from sklearn.feature_selection import SelectKBest, f_regression

In [38]:
model = SelectKBest( score_func=f_regression, k=4 )

In [39]:
results = model.fit( X_train, y_train )

In [40]:
results.scores_

Out[40]:
array([  185.64138393,    88.09887658,     8.83792204,  1681.74689385])
In [41]:
results.pvalues_

Out[41]:
array([  2.55193304e-27,   1.72271169e-16,   3.48260999e-03,
3.62229342e-79])
In [42]:
['%.3f' % p for p in results.pvalues_]

Out[42]:
['0.000', '0.000', '0.003', '0.000']

Ranking the features by their F-scores (highest first):

1. tv_radio
2. TV
3. Radio
4. Newspaper

## Building and Exporting the model¶

In [43]:
linreg = LinearRegression()
linreg.fit( X_train, y_train )

Out[43]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [44]:
import pickle

In [45]:
import joblib   # sklearn.externals.joblib was removed in newer scikit-learn versions
joblib.dump(linreg, 'lin_model.pkl', compress=9)

Out[45]:
['lin_model.pkl']

## Importing and applying the model for prediction¶

In [46]:
model_clone = joblib.load('lin_model.pkl')

In [47]:
model_clone.intercept_

Out[47]:
6.2827413679807229
In [48]:
model_clone.coef_

Out[48]:
array([ 0.02065759,  0.04560647,  0.00444591,  0.00100427])
In [49]:
pred_y = model_clone.predict( X_test )
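The dump/load round trip above can be verified with a self-contained sketch (toy data and a hypothetical filename `lin_model_demo.pkl`, not the notebook's own model file): a reloaded model should make exactly the same predictions as the original.

```python
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

# Toy data standing in for the advertising features
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([0.05, 0.2, 0.01]) + 3.0

model = LinearRegression()
model.fit(X, y)

# Persist the fitted model, then reload it
joblib.dump(model, "lin_model_demo.pkl", compress=9)
restored = joblib.load("lin_model_demo.pkl")

# The restored model makes identical predictions
same = np.allclose(model.predict(X), restored.predict(X))
```

Because joblib serializes the fitted attributes (`intercept_`, `coef_`) along with the estimator, the restored model needs no refitting before use.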