This tutorial explains the concept of **linear regression** and shows how to use *statsmodels* and *sklearn* to build a regression model and evaluate it.

**Topics covered here are:**

- What is linear regression?
- Analyzing the Advertisement dataset.
- Splitting the dataset and cross-validating models.
- Building a linear regression model.
- How to use the sklearn API in Python?
- Interpreting the coefficients of the model.
- Making predictions using the model.
- Finding the model residuals and analyzing them.
- Evaluating model efficiency using RMSE and R-squared values.

```
import pandas as pd
import numpy as np
```

#### Loading the Advertisement Dataset

The advertising dataset captures the sales revenue generated with respect to advertisement spend across multiple channels such as radio, TV, and newspaper.

**The dataset is available here**

#### Attribute Descriptions

- TV - spend on TV advertisements
- Radio - spend on radio advertisements
- Newspaper - spend on newspaper advertisements
- Sales - sales revenue generated

*Note: the amounts are in different units.*

## Read the data

```
advt = pd.read_csv( "Advertising.csv" )
```

```
advt.head()
```

```
advt.info()
```

## Remove the first column

- Keep only the TV, Radio, Newspaper, and Sales columns, dropping the index column.

```
advt = advt[["TV", "Radio", "Newspaper", "Sales"]]
```

```
advt.head()
```

## Building a Simple Linear Regression model

**Linear regression** is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. (*Source: Wikipedia*)

A simple linear regression model is given by

$$Y = mX + b$$

where **m** is the slope and **b** is the y-intercept. **Y** is the dependent variable and **X** is the explanatory variable.

Very briefly and simplistically, Linear Regression is a class of techniques for fitting a straight line to a set of data points.
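As a quick illustration of the formula above, a least-squares line can be fitted to a handful of points with plain NumPy (the data values here are made up for illustration):

```python
import numpy as np

# Hypothetical points scattered around the line y = 2x + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# np.polyfit with deg=1 returns the least-squares slope m and intercept b
m, b = np.polyfit(X, y, deg=1)
print(m, b)  # slope close to 2, intercept close to 1
```

The fitted slope and intercept are the **m** and **b** of the equation above; sklearn's `LinearRegression` computes the same quantities for multiple features at once.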

## Splitting into Train and Test datasets

#### The model should typically be built on a training dataset and validated against a separate test dataset.

- Let's split the dataset in a 70/30 ratio: 70% for training and 30% for testing.

```
from sklearn.model_selection import train_test_split
```

```
X_train, X_test, y_train, y_test = train_test_split(
    advt[["TV", "Radio", "Newspaper"]],
    advt.Sales,
    test_size=0.3,
    random_state=42)
```

```
len( X_train )
```

```
len( X_test )
```

## Using the sklearn library to build the model

The *sklearn* library has a comprehensive set of APIs to split datasets, build models, test models, and calculate accuracy metrics.

```
from sklearn.linear_model import LinearRegression
```

## Building the model with the training dataset

```
linreg = LinearRegression()
linreg.fit( X_train, y_train )
```

### Finding the intercept and coefficients

```
linreg.intercept_
```

```
list( zip( ["TV", "Radio", "Newspaper"], list( linreg.coef_ ) ) )
```

So, the equation looks like

**Sales = 2.708 + 0.044 × TV + 0.199 × Radio + 0.006 × Newspaper**
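As a sanity check, a prediction can be reproduced by hand from the reported intercept and coefficients (the spend values below are made up for illustration):

```python
# Rounded intercept and coefficients reported above
intercept = 2.708
coefs = {"TV": 0.044, "Radio": 0.199, "Newspaper": 0.006}

# Hypothetical spends for a single observation
spend = {"TV": 100.0, "Radio": 20.0, "Newspaper": 10.0}

# linreg.predict() computes exactly this: the intercept plus the weighted sum
pred = intercept + sum(coefs[k] * spend[k] for k in coefs)
print(round(pred, 3))  # 11.148
```

This is all a fitted linear model does at prediction time: one multiply-add per feature.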

### Making predictions on the test dataset

```
y_pred = linreg.predict( X_test )
```

### Comparing the predictions with actuals

```
test_pred_df = pd.DataFrame({'actual': y_test,
                             'predicted': np.round(y_pred, 2),
                             'residuals': y_test - y_pred})
```

```
test_pred_df[0:10]
```

## Measuring model accuracy using metrics

```
from sklearn import metrics
```

### Calculating RMSE

- RMSE measures the difference between the actual and predicted values of the response variable.
- It is the square root of the mean of the squared errors.
- Compared to the similar mean absolute error, RMSE amplifies and severely punishes large errors.
- The lower the RMSE value, the better the model.
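The bullet points above translate directly into a few lines of NumPy; this hand-rolled version (with made-up numbers) should agree with taking the square root of `metrics.mean_squared_error`:

```python
import numpy as np

def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Made-up actuals and predictions; the errors are -1, 1 and 2
print(rmse([3.0, 5.0, 8.0], [4.0, 4.0, 6.0]))  # sqrt(2), about 1.414
```

Note how the single error of 2 contributes four times as much as each error of 1; that squaring is what "punishes" large errors.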

```
rmse = np.sqrt( metrics.mean_squared_error( y_test, y_pred ) )
```

```
round( rmse, 2 )
```

## Evaluating Model Accuracy

- R-squared is a statistical measure of how close the data are to the fitted regression line.
- R-squared signifies the percentage of variation in the response variable that can be explained by the model.
- R-squared = explained variation / total variation.
- Total variation is the variation of the response variable around its mean.
- R-squared varies between 0% and 100%: 0% signifies that the model explains none of the variability, while 100% signifies that it explains all of the variability of the response. The closer R-squared is to 100%, the better the model.
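The definition above (explained variation over total variation) can be computed by hand; a minimal sketch with made-up numbers:

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)        # unexplained variation
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)  # variation around the mean
    return 1 - ss_res / ss_tot

# Made-up example: the predictions track the actuals closely
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))  # close to 0.98
```

`metrics.r2_score` below computes exactly this quantity.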

### Calculating R-squared

```
metrics.r2_score( y_test, y_pred )
```

## Understanding residuals

```
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
```

```
residuals = y_test - y_pred
```

```
# Plot the residuals against the actual test-set sales values
# (size= was renamed to height= in newer seaborn versions)
sn.jointplot(x=y_test, y=residuals, height=6)
```

```
# distplot is deprecated in newer seaborn versions; histplot replaces it
sn.histplot(residuals, kde=True)
```

### The residuals appear randomly distributed, with no visible pattern, so the model can be assumed to be adequate.

## Creating a new feature and rebuilding the model

Perhaps spending \$50,000 on television advertising and \$50,000 on radio advertising results in more sales than allocating \$100,000 to either television or radio individually. In marketing, this is known as a **synergy effect**, while in statistics it is called an interaction effect.

Taken from the ISLR book: http://www-bcf.usc.edu/~gareth/ISL/

So, let's create an interaction variable for TV and radio spending:

```
X_train['tv_radio'] = X_train.TV * X_train.Radio
X_test['tv_radio'] = X_test.TV * X_test.Radio
```

```
linreg = LinearRegression()
linreg.fit( X_train, y_train )
```

```
y_pred = linreg.predict( X_test )
```

```
metrics.r2_score(y_test, y_pred)
```

```
residuals = y_test - y_pred
```

```
# Plot the residuals of the interaction model against the actual test-set sales
sn.jointplot(x=y_test, y=residuals, height=6)
```

```
# distplot is deprecated in newer seaborn versions; histplot replaces it
sn.histplot(residuals, kde=True)
```

## K-Fold Cross Validation

```
from sklearn.model_selection import cross_val_score
```

```
linreg = LinearRegression()
```

```
cv_scores = cross_val_score( linreg, X_train, y_train, scoring = 'r2', cv = 10 )
cv_scores
```

```
print( "Average r2 score: ", np.round( np.mean( cv_scores ), 2 ) )
print( "Standard deviation in r2 score: ", np.round( np.std( cv_scores ), 2) )
```

## Feature Ranking based on importance

Here is what `f_regression` does, given an input matrix X and target array y. For every feature it computes the Pearson correlation $\rho_i$ with y, and then the F-statistic

$$F_i = \frac{\rho_i^2}{1 - \rho_i^2}\,(n - 2)$$

where $n$ is the number of samples (so $n - 2$ is the degrees of freedom). These F-values are returned together with the associated p-values, so the result is a tuple (F-values, p-values). `SelectKBest` then takes the first component of this tuple (the scores), sorts it, and picks the k features of X with the highest scores.
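The F-value for a single feature can be reproduced from the Pearson correlation alone; a small sketch (the data are made up, and the $n - 2$ scaling is the degrees-of-freedom factor that `f_regression` applies):

```python
import numpy as np

def f_score(x, y):
    """F-statistic for one feature: rho^2 / (1 - rho^2), scaled by n - 2."""
    rho = np.corrcoef(x, y)[0, 1]   # Pearson correlation with the target
    n = len(x)
    return rho ** 2 / (1 - rho ** 2) * (n - 2)

# Made-up feature and target whose correlation works out to rho = 0.8
print(f_score([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]))
```

A stronger correlation with the target pushes $\rho^2$ toward 1 and the F-value toward infinity, which is why larger F means a more important feature.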

```
from sklearn.feature_selection import SelectKBest, f_regression
```

```
model = SelectKBest( score_func=f_regression, k=4 )  # k=4 keeps all four features; we only need the scores for ranking
```

```
results = model.fit( X_train, y_train )
```

```
results.scores_
```

```
results.pvalues_
```

```
['%.3f' % p for p in results.pvalues_]
```

### As the p-values are less than 5%, the variables are significant in the regression equation. Also, the higher the F-value, the greater the importance. So, as per this test, we can rank the features as below.

- TV*Radio
- TV
- Radio
- Newspaper

## Building and Exporting the model

```
linreg = LinearRegression()
linreg.fit( X_train, y_train )
```

```
import pickle
```

```
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn versions
joblib.dump(linreg, 'lin_model.pkl', compress=9)
```

## Importing and applying the model for prediction

```
model_clone = joblib.load('lin_model.pkl')
```

```
model_clone.intercept_
```

```
model_clone.coef_
```

```
pred_y = model_clone.predict( X_test )
```
