Understanding Bias & Variance - Part 1

When we build models and make predictions, every model tends to make some error. No model makes perfect predictions. Can we then control these errors? To answer this, we need to understand what these errors are composed of.

Prediction errors can be decomposed into two components: error due to bias and error due to variance. No model can minimize both bias and variance at once. So while building models, we always have to trade off between these two errors. Understanding these two types of error is key to diagnosing model results and to avoiding over- or under-fitted models.
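
For squared-error loss, the decomposition is commonly written as below. The notation is ours, not taken from the dataset or code in this tutorial: $\hat{f}$ is the fitted model, $f$ the true underlying function, and $\sigma^2$ the variance of the noise in y. The first term is the squared bias, the second is the variance, and the last term is noise that no model can remove.

$$E\left[\left(y-\hat{f}(x)\right)^{2}\right]=\left(E[\hat{f}(x)]-f(x)\right)^{2}+E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^{2}\right]+\sigma^{2}$$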

In this tutorial we will take a dataset, build underfitting and overfitting models, see how the errors change with the complexity of the model, and then find the optimal complexity, where the error is minimized.

In [2]:
import pandas as pd
import numpy as np
In [3]:
curve = pd.read_csv( "curve.csv" )
In [4]:
curve.head()
Out[4]:
    x         y
0   2 -1.999618
1   2 -1.999618
2   8 -3.978312
3   9 -1.969175
4  10 -0.957770
In [7]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

Fit a simple regression line

We start with a simple regression line, which assumes that y is linearly dependent on x.

$$y=\beta _{1}x_{1}+\varepsilon _{i}$$
In [8]:
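def fit_poly( degree ):
  # Fit a polynomial of the given degree and store the fitted values
  p = np.polyfit( curve.x, curve.y, deg = degree )
  curve['fit'] = np.polyval( p, curve.x )
  # Scatter plot of the data; recent seaborn releases need x and y as keywords
  sn.regplot( x = curve.x, y = curve.y, fit_reg = False )
  return plt.plot( curve.x, curve.fit, label='fit' )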
In [9]:
fit_poly( 1 )
Out[9]:
[<matplotlib.lines.Line2D at 0x5a31780>]

Note:

The regression line does not seem to fit the data. The model assumes that the relationship between y and x is linear. Such models are called high-bias models. High-bias models assume a simple relationship and do not explain the variance in the data well. Even if the sample changes, the parameters estimated by the model hardly respond: they are the least sensitive to changes in the sample.
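
One way to spot high bias is to look at the residuals of the straight-line fit. The sketch below uses a made-up U-shaped dataset in place of curve.csv (an assumption; the real data may be shaped differently) and compares the average residual at the edges of the x range with the average residual in the middle. If the line were an adequate model, both averages would be close to zero.

```python
import numpy as np

# Synthetic stand-in for curve.csv (an assumption): a U-shaped curve plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = (x - 5) ** 2 + rng.normal(scale=1.0, size=x.size)

# Fit a straight line, as a degree-1 polynomial fit does, and get the residuals.
p = np.polyfit(x, y, deg=1)
residuals = y - np.polyval(p, x)

# Average residual in the outer thirds versus the middle third of the x range.
outer = residuals[(x < 10 / 3) | (x > 20 / 3)].mean()
middle = residuals[(x >= 10 / 3) & (x <= 20 / 3)].mean()
print(f"mean residual, outer thirds: {outer:.2f}")
print(f"mean residual, middle third: {middle:.2f}")
```

On this synthetic data the line sits below the points at the edges and above them in the middle. That kind of systematic pattern is what a too-simple model leaves behind.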

Fit a regression with polynomial features

Let's bring in a polynomial feature, the square of the feature x.

$$y=\beta _{1}x_{1}+\beta _{2}x_{1}^2+\varepsilon _{i}$$
In [10]:
fit_poly( 2 )
Out[10]:
[<matplotlib.lines.Line2D at 0xa562e48>]

Note:

This regression line seems to fit the data better than the previous model. But can we improve the model further by adding polynomial features of higher order?

Building higher polynomial models

$$y=\beta _{1}x_{1}+\beta _{2}x_{1}^2+\beta _{3}x_{1}^3+\beta _{4}x_{1}^4+\beta _{5}x_{1}^5+\varepsilon _{i}$$
In [11]:
fit_poly( 5 )
Out[11]:
[<matplotlib.lines.Line2D at 0xaab6278>]
$$y=\beta _{1}x_{1}+\beta _{2}x_{1}^2+\beta _{3}x_{1}^3+\beta _{4}x_{1}^4+\beta _{5}x_{1}^5+\beta _{6}x_{1}^6+\beta _{7}x_{1}^7+\beta _{8}x_{1}^8+\beta _{9}x_{1}^9+\beta _{10}x_{1}^{10}+\varepsilon _{i}$$
In [12]:
fit_poly( 10 )
Out[12]:
[<matplotlib.lines.Line2D at 0xaadf940>]

Note:

As we continue to build higher-order polynomial models, the model starts to fit the training data really well. This can be a case of overfitting. These kinds of models are called high-variance models. They are very sensitive to the training data, i.e. the estimated model parameters are very sensitive to the individual data points. Any change, such as adding or removing data points, can alter the model parameters significantly. High-variance models tend to overfit the dataset and do not generalize well.
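
The sensitivity described above can be illustrated with a small simulation. This is a sketch under assumptions, not part of the original analysis: the data are synthetic (a U-shaped curve with Gaussian noise standing in for curve.csv), and the helper `fit_variability` is a hypothetical name introduced here. It draws many noisy samples from the same curve, refits a polynomial each time, and reports how much the fitted values move from sample to sample.

```python
import numpy as np

# Hypothetical stand-in for curve.csv: the same underlying curve, observed
# many times with fresh noise, to mimic drawing new samples.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
true_y = (x - 5) ** 2

def fit_variability(degree, n_samples=300, noise_sd=2.0):
    """Average std. dev. of the fitted values across repeated noisy samples."""
    fits = np.empty((n_samples, x.size))
    for k in range(n_samples):
        y = true_y + rng.normal(scale=noise_sd, size=x.size)
        fits[k] = np.polyval(np.polyfit(x, y, deg=degree), x)
    # Spread of the fitted curve at each x, averaged over the x grid.
    return fits.std(axis=0).mean()

low = fit_variability(1)
high = fit_variability(10)
print(f"degree 1:  {low:.2f}")
print(f"degree 10: {high:.2f}")
```

The degree-10 fit moves more between samples than the straight line does, which is the high-variance behaviour in question. The straight line is stable, but as seen earlier it is stable around the wrong shape.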

Deciding the complexity of the model

How, then, do we know what the optimal complexity of a model should be? To find out, we should split our dataset into train and test sets, and then build models of increasing complexity to watch for underfitting and overfitting. If the model fits neither the training nor the test dataset, it is a high-bias model (a case of underfitting). If the model fits the training dataset well but performs poorly on the test dataset, it is a high-variance model (a case of overfitting). There should be models of intermediate complexity that explain both the training and the test datasets well, and that should be the optimal complexity of the model.

In [14]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
In [15]:
def get_rmse( y, y_fit ):
  return np.sqrt( metrics.mean_squared_error( y, y_fit ) )
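
As a quick check of what the root mean squared error measures, here is the same calculation done by hand with NumPy on three made-up values (they are illustrative only and do not come from curve.csv). The RMSE is the square root of the average squared difference between actual and fitted values.

```python
import numpy as np

# Made-up values, chosen only to keep the arithmetic easy to follow.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred      # differences: 1, 0, -2
mse = np.mean(errors ** 2)    # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)
print(round(rmse, 3))         # 1.291
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.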

Split data into train and test datasets

In [100]:
train_X, test_X, train_y, test_y = train_test_split( curve.x,
                                                  curve.y,
                                                  test_size = 0.40,
                                                  random_state = 100 )

Build models with increasing complexity and measure train and test errors

In [101]:
rmse_df = pd.DataFrame( columns = ["degree", "rmse_train", "rmse_test"] )

for i in range( 1, 15 ):
  p = np.polyfit( train_X, train_y, deg = i )
  rmse_df.loc[i-1] = [ i,
                      get_rmse( train_y, np.polyval( p, train_X ) ),
                      get_rmse( test_y, np.polyval( p, test_X ) ) ]

Train and test error vs. degree of polynomial features in the model

In [102]:
rmse_df
Out[102]:
    degree  rmse_train  rmse_test
 0       1    5.226638   5.779652
 1       2    2.394509   2.755286
 2       3    2.233547   2.560184
 3       4    2.231998   2.549205
 4       5    2.197528   2.428728
 5       6    2.062201   2.703880
 6       7    2.039408   2.909237
 7       8    1.995852   3.270892
 8       9    1.979322   3.120420
 9      10    1.976326   3.115875
10      11    1.964484   3.218203
11      12    1.657948   4.457668
12      13    1.656719   4.358014
13      14    1.642308   4.659503

Plot both train and test errors against model complexity

In [103]:
plt.plot( rmse_df.degree,
       rmse_df.rmse_train,
       label='train',
       color = 'r' )

plt.plot( rmse_df.degree,
       rmse_df.rmse_test,
       label='test',
       color = 'g' )

plt.legend(bbox_to_anchor=(1.05, 1),
         loc=2,
         borderaxespad=0.)
Out[103]:
<matplotlib.legend.Legend at 0xbfdb358>

Note:

It can be observed that, as model complexity increases, the model begins to fit both the training and the test data. But beyond a certain point of model complexity, even though the training error keeps falling, the test error starts to swell. Below this point the model underfits, and beyond it the model overfits the data. So this is the point of optimal model complexity. In this example, the optimal complexity is a polynomial of degree 5.
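
The optimal degree can also be read off programmatically instead of from the plot. The sketch below copies the test RMSE values from the rmse_df output shown earlier into a NumPy array and picks the degree with the smallest test error.

```python
import numpy as np

# Degrees and test RMSE values copied from the rmse_df output shown earlier.
degree = np.arange(1, 15)
rmse_test = np.array([5.779652, 2.755286, 2.560184, 2.549205, 2.428728,
                      2.703880, 2.909237, 3.270892, 3.120420, 3.115875,
                      3.218203, 4.457668, 4.358014, 4.659503])

# The degree whose test error is lowest.
best_degree = int(degree[np.argmin(rmse_test)])
print(best_degree)  # 5
```

With rmse_df still in memory, `rmse_df.loc[rmse_df.rmse_test.idxmin(), "degree"]` should give the same answer. The result depends on the particular train/test split, so a different `random_state` may select a different degree.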

In the next blog posts, we will discuss more advanced concepts, such as how to bring down the variance of more complex models using ensemble methods.
