## Understanding Bias & Variance - Part 1

When we build models and make predictions, every model makes some error; no model can make perfect predictions. Can we then control these errors? To answer this, we need to understand the composition of the errors that models make.

Prediction errors can be decomposed into two components: error due to **bias** and error due to **variance**. No model can minimize both bias and variance simultaneously, so while building models we always have to trade off between these two errors. Understanding these two types of error is key to diagnosing model results and avoiding over- or under-fitting models.

In this tutorial, we will take a dataset, build underfitting and overfitting models, observe how the errors change with the complexity of the models, and then find the optimal complexity at which the error is minimized.

```
import pandas as pd
import numpy as np
```

```
curve = pd.read_csv( "curve.csv" )
```
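If `curve.csv` is not at hand, a stand-in data frame with the same column names can be generated instead; the nonlinear shape below is only an assumption for illustration, not the actual contents of the file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for curve.csv: columns x and y with a noisy
# nonlinear relationship (an assumption, not the real file's contents).
rng = np.random.RandomState(100)
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + rng.normal(0, 0.2, 50)
curve = pd.DataFrame({'x': x, 'y': y})
```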

```
curve.head()
```

```
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
```

### Fit a simple regression line

Let's first build a simple regression line, assuming that *y* depends linearly on *x*.

$$y_{i}=\beta _{1}x_{i}+\varepsilon _{i}$$

```
def fit_poly( degree ):
    # Fit a polynomial of the given degree and overlay it on the scatter plot
    p = np.polyfit( curve.x, curve.y, deg = degree )
    curve['fit'] = np.polyval( p, curve.x )
    sn.regplot( x = curve.x, y = curve.y, fit_reg = False )
    return plt.plot( curve.x, curve.fit, label = 'fit' )
```

```
fit_poly( 1 )
```

#### Note:

The regression line does not seem to fit the data. The model assumes that the relationship between *y* and *x* is linear. Such models are called high-bias models: they assume a simple relationship and do not explain the variance in the data well. Even if the sample changes, the parameters estimated by the model hardly respond; the model parameters are the least sensitive to any changes in the sample.

### Fit a regression with polynomial features

Let's bring in a polynomial feature: the square of the feature *x*.

$$y_{i}=\beta _{1}x_{i}+\beta _{2}x_{i}^{2}+\varepsilon _{i}$$

```
fit_poly( 2 )
```

#### Note:

This regression line seems to fit the data better than the previous model. But can we improve the model further by adding polynomial features of higher order?

### Building higher polynomial models

$$y_{i}=\beta _{1}x_{i}+\beta _{2}x_{i}^{2}+\beta _{3}x_{i}^{3}+\beta _{4}x_{i}^{4}+\beta _{5}x_{i}^{5}+\varepsilon _{i}$$

```
fit_poly( 5 )
```

```
fit_poly( 10 )
```

#### Note:

As we continue to build higher-degree polynomial models, the model starts to fit the training data very well. This can be a case of overfitting. Such models are called high-variance models: they are very sensitive to the training data, i.e. the estimated model parameters are very sensitive to the data points, and any addition or removal of data points can alter the parameters significantly. High-variance models tend to overfit the dataset and do not generalize well.

## Deciding the complexity of the model

How do we know what the optimal complexity of a model should be? To find out, we split the dataset into train and test sets, then build models of increasing complexity and watch for underfitting and overfitting. If the model fits neither the training nor the test dataset, it is a high-bias model (a case of underfitting). If it fits the training dataset well but performs poorly on the test dataset, it is a high-variance model (a case of overfitting). Somewhere in between there should be a model of intermediate complexity that explains both the training and the test data well, and that should be the optimal complexity of the model.

```
from sklearn import metrics
from sklearn.model_selection import train_test_split
```

```
def get_rmse( y, y_fit ):
    # Root mean squared error between actual and fitted values
    return np.sqrt( metrics.mean_squared_error( y, y_fit ) )
```

### Split data into train and test datasets

```
train_X, test_X, train_y, test_y = train_test_split( curve.x,
                                                     curve.y,
                                                     test_size = 0.40,
                                                     random_state = 100 )
```

### Build models with increasing complexity and measure train and test errors

```
rmse_df = pd.DataFrame( columns = ["degree", "rmse_train", "rmse_test"] )
for i in range( 1, 15 ):
    # Fit a degree-i polynomial on the training data and record both errors
    p = np.polyfit( train_X, train_y, deg = i )
    rmse_df.loc[i-1] = [ i,
                         get_rmse( train_y, np.polyval( p, train_X ) ),
                         get_rmse( test_y, np.polyval( p, test_X ) ) ]
```

### Train and test error vs. degrees of polynomial features in the model

```
rmse_df
```

### Plot both train and test errors against model complexity

```
plt.plot( rmse_df.degree,
          rmse_df.rmse_train,
          label = 'train',
          color = 'r' )
plt.plot( rmse_df.degree,
          rmse_df.rmse_test,
          label = 'test',
          color = 'g' )
plt.legend( bbox_to_anchor = (1.05, 1),
            loc = 2,
            borderaxespad = 0. )
```

#### Note:

It can be observed that as model complexity increases, the model begins to fit both the training and the test data better. But beyond a certain point, even though the training error keeps falling, the test error starts to swell. This point (degree 5 in this example) is where the model underfits below it and overfits beyond it: the point of **optimal model complexity**.
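Rather than reading the optimum off the plot, it can also be picked programmatically as the degree with the lowest test RMSE. The `rmse_df` values below are made up for illustration; with the real data frame computed above, only the last two lines are needed.

```python
import pandas as pd

# Hypothetical stand-in for the rmse_df built earlier: test RMSE falls,
# bottoms out, then rises again as the degree grows.
rmse_df = pd.DataFrame({
    'degree':    [1, 2, 3, 4, 5, 6, 7, 8],
    'rmse_test': [0.90, 0.55, 0.40, 0.33, 0.30, 0.32, 0.41, 0.62],
})

# idxmin gives the row label of the smallest test RMSE
best_degree = int(rmse_df.loc[rmse_df.rmse_test.idxmin(), 'degree'])
print(best_degree)
```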

In the next blog posts, we will discuss more advanced concepts, such as how to bring down the variance of high-complexity models using ensemble methods.

