Applying Regularization to Avoid Overfitting

Sometimes simple models do not explain the data very well. A model may be too simple because we assume a linear relationship or use too few features. Adding more features, either new variables or transformed forms of existing ones (such as quadratic or higher-order polynomial terms), can produce a better fit to the data. But how many features should we add? Every new feature may explain a little more of the data, and eventually the model starts to overfit.

Overfitted models generalize poorly and tend to do badly on unseen data. They also have a peculiar characteristic: the coefficients of the variables get inflated. Regularization is a technique that penalizes the loss function by adding a multiple of the ${L_1}$ (Lasso) or ${L_2}$ (Ridge) norm of the estimated parameter vector of the regression.
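Concretely, instead of minimizing the residual sum of squares alone, the penalized regressions minimize (writing $\alpha$ for the strength of the penalty, in the standard textbook form):

$$\text{Ridge } (L_2):\quad \min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\alpha\sum_{j=1}^{p}\beta_j^2$$

$$\text{Lasso } (L_1):\quad \min_{\beta}\ \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\alpha\sum_{j=1}^{p}\lvert\beta_j\rvert$$

(Scikit-learn, used below, parameterizes both penalties with alpha; its Lasso additionally scales the squared-error term by $1/(2n)$, but the idea is the same.)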

The objective of this blog is to explain:

  • How parameter coefficients change as we build more complex models
  • How train vs. test error behaves with respect to model complexity
  • Applying Ridge and Lasso regression independently and observing the parameter coefficients
  • Feature selection using Lasso regression
  • Applying both L1 and L2 regularization together
In [2]:
import random
import math
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

Generate random data following a sinusoidal pattern, with some random noise added

In [3]:
x = [round( 1 + random.random(), 3 ) for _ in range(0, 100)]
In [4]:
y = list( map( lambda x: round( math.sin(6*x), 3 ), x ) )
In [5]:
y[0:10]
Out[5]:
[0.998, 0.929, -0.891, -0.637, -0.919, -0.157, -0.661, -0.926, -0.436, -0.979]
In [6]:
noise = [round( random.random()/2, 3 ) for _ in range(0, 100)]
In [7]:
y_rand = list( map( lambda a, b: round( a + b, 3 ), y, noise ) )
In [8]:
y_rand[0:10]
Out[8]:
[1.166, 1.389, -0.397, -0.231, -0.617, 0.147, -0.187, -0.454, -0.153, -0.903]
In [9]:
x[0:10]
Out[9]:
[1.32, 1.372, 1.911, 1.686, 1.9, 1.597, 1.974, 1.768, 1.646, 1.798]
In [10]:
xy_df = pd.DataFrame( { 'x': x, 'y':y_rand } )
xy = xy_df.copy()
In [11]:
xy.head(5)
Out[11]:
x y
0 1.320 1.166
1 1.372 1.389
2 1.911 -0.397
3 1.686 -0.231
4 1.900 -0.617
In [12]:
sn.lmplot( "x", "y", data=xy, fit_reg=False, size = 5 )
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x11739bfd0>

y appears to follow a sinusoidal pattern with respect to x, with some amount of noise.

Create polynomial features

Let's create polynomial features in the range ${x^2}$ to ${x^{19}}$

In [13]:
for i in range( 2, 20 ):
  xy_df[ 'x'+ str( i ) ] = xy_df.x.map( lambda a: math.pow( a, i ) )
In [14]:
xy_df.columns
Out[14]:
Index(['x', 'y', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11',
     'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19'],
    dtype='object')
In [15]:
xy_df = xy_df[['x', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7',
             'x8', 'x9', 'x10', 'x11','x12', 'x13', 'x14',
             'x15', 'x16', 'x17', 'x18', 'x19', 'y']]
In [16]:
xy_df.head(5)
Out[16]:
x x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 y
0 1.320 1.742400 2.299968 3.035958 4.007464 5.289853 6.982606 9.217040 12.166492 16.059770 21.198896 27.982543 36.936956 48.756782 64.358953 84.953818 112.139039 148.023532 195.391062 1.166
1 1.372 1.882384 2.582631 3.543370 4.861503 6.669982 9.151215 12.555468 17.226102 23.634211 32.426138 44.488661 61.038443 83.744744 114.897789 157.639766 216.281759 296.738574 407.125323 1.389
2 1.911 3.651921 6.978821 13.336527 25.486103 48.703943 93.073235 177.862952 339.896102 649.541450 1241.273711 2372.074062 4533.033533 8662.627081 16554.280351 31635.229752 60454.924055 115529.359870 220776.606711 -0.397
3 1.686 2.842596 4.792617 8.080352 13.623474 22.969176 38.726031 65.292089 110.082462 185.599030 312.919965 527.583061 889.505041 1499.705499 2528.503472 4263.056854 7187.513856 12118.148361 20431.198136 -0.231
4 1.900 3.610000 6.859000 13.032100 24.760990 47.045881 89.387174 169.835630 322.687698 613.106626 1164.902589 2213.314919 4205.298346 7990.066858 15181.127030 28844.141357 54803.868578 104127.350298 197841.965566 -0.617
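The loop above builds one power of x at a time. An equivalent way to construct the same feature matrix is scikit-learn's PolynomialFeatures; a minimal sketch (not used in the rest of this post):

from sklearn.preprocessing import PolynomialFeatures

# Powers x^1 ... x^19 of the single input column (no bias/intercept column)
poly = PolynomialFeatures( degree = 19, include_bias = False )
X_poly = poly.fit_transform( xy[['x']] )
poly_df = pd.DataFrame( X_poly,
                        columns = ['x' + str(i) if i > 1 else 'x' for i in range(1, 20)] )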

Predict y using the polynomial features of x

In [17]:
from sklearn.linear_model import LinearRegression

def get_lm( curve, deg = 1 ):

  # Fit a linear model on the first `deg` polynomial feature columns
  lreg = LinearRegression()
  lreg.fit( curve.iloc[:,:deg], curve.y )

  lreg_predict_y = lreg.predict( curve.iloc[:,:deg] )

  # Plot the observations and the fitted curve
  plt.plot( curve.x, curve.y, 'k.')
  plt.plot( curve.x, lreg_predict_y, 'g-', label='degree ' + str(deg) + ' fit' )
  plt.legend(loc='upper right')

  return lreg

$$y=\beta _{1}x+\varepsilon _{i}$$

In [18]:
lreg_1 = get_lm( xy_df, 1)
In [19]:
lreg_1.coef_
Out[19]:
array([-1.70846171])

$$y=\beta _{1}x+\beta _{2}x^2+\varepsilon _{i}$$

In [20]:
xy_df = xy_df.sort_values( ['x'], ascending = True )

lreg_2 = get_lm( xy_df, 2)
lreg_2.coef_
Out[20]:
array([ 8.22515235, -3.30804732])

$$y=\beta _{1}x+\beta _{2}x^2+\beta _{3}x^3+\varepsilon _{i}$$

In [21]:
lreg_3 = get_lm( xy_df, 3 )
lreg_3.coef_
Out[21]:
array([ 134.02250256,  -89.0185297 ,   19.00726424])

$$y=\beta _{1}x+\beta _{2}x^2+\beta _{3}x^3+\beta _{4}x^4+\varepsilon _{i}$$

In [22]:
lreg_4 = get_lm( xy_df, 4 )
lreg_4.coef_
Out[22]:
array([ -66.94017155,  119.45404407,  -75.41587762,   15.76809304])

$$y=\beta _{1}x+\beta _{2}x^2+\beta _{3}x^3+...+\beta _{10}x^{10}+\varepsilon _{i}$$

In [23]:
lreg_10 = get_lm( xy_df, 10 )
lreg_10.coef_
Out[23]:
array([ -318894.6638388 ,   617679.02553378,  -401111.4681177 ,
      -447499.21108882,  1163125.88789259, -1127296.86290339,
       622589.70651094,  -205543.33636453,    37945.0243101 ,
        -3026.70994076])

$$y=\beta _{1}x+\beta _{2}x^2+\beta _{3}x^3+...+\beta _{15}x^{15}+\varepsilon _{i}$$

In [24]:
lreg_15 = get_lm( xy_df, 15 )
lreg_15.intercept_
Out[24]:
12354569.352559235

Note: The models seem to be overfitting as we use more polynomial features; even the intercept of the degree-15 fit has blown up to over ${10}^{7}$.

In [25]:
from sklearn import metrics
from sklearn.cross_validation import train_test_split

train_X, test_X, train_y, test_y = train_test_split( xy_df.iloc[:,:-1],
                                                   xy_df.y,
                                                   test_size = 0.40,
                                                   random_state = 100 )
In [26]:
train_X.iloc[:,:19].head( 5 )
Out[26]:
x x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19
7 1.768 3.125824 5.526457 9.770776 17.274731 30.541725 53.997770 95.468057 168.787525 298.416345 527.600098 932.796973 1649.185048 2915.759166 5155.062205 9114.149978 16113.817161 28489.228740 50368.956413
93 1.745 3.045025 5.313569 9.272177 16.179949 28.234012 49.268350 85.973271 150.023358 261.790759 456.824875 797.159407 1391.043166 2427.370324 4235.761215 7391.403321 12897.998795 22507.007897 39274.728780
91 1.606 2.579236 4.142253 6.652458 10.683848 17.158260 27.556166 44.255202 71.073854 114.144610 183.316244 294.405888 472.815856 759.342265 1219.503677 1958.522905 3145.387786 5051.492784 8112.697412
27 1.431 2.047761 2.930346 4.193325 6.000648 8.586928 12.287893 17.583976 25.162669 36.007779 51.527132 73.735326 105.515252 150.992325 216.070017 309.196195 442.459754 633.159909 906.051829
29 1.772 3.139984 5.564052 9.859500 17.471033 30.958671 54.858765 97.209731 172.255643 305.236999 540.879963 958.439294 1698.354429 3009.484048 5332.805734 9449.731760 16744.924679 29672.006531 52578.795574

Compare coefficients and Residual Errors of Complex Models

Now we will build increasingly complex models and observe how the coefficients change.

In [27]:
lreg = LinearRegression()
In [28]:
def get_detail_df():

  all_reg_df = pd.DataFrame( {'deg': [],
                              'intercept': [],
                              'x1':[], 'x2':[], 'x3':[], 'x4':[], 'x5':[],
                              'x6':[], 'x7':[], 'x8':[], 'x9':[], 'x10':[],
                              'x11':[], 'x12':[], 'x13':[], 'x14':[], 'x15':[],
                              'x16':[], 'x17':[], 'x18':[], 'x19':[], 'x20':[],
                              'train_rmse':[],
                              'test_rmse':[],
                              'train_r2':[],
                              'test_r2':[] } )

  all_reg_df.columns = ['deg', 'intercept',
         'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9',
         'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20',
         'train_rmse', 'test_rmse', 'train_r2', 'test_r2' ]

  return all_reg_df
In [29]:
len( get_detail_df().columns )
Out[29]:
26
In [30]:
def get_lm_details( trainX, trainY, testX, testY, reg_df, deg = 1 ):

  lreg = LinearRegression()
  lreg.fit( trainX.iloc[:,:deg], trainY )

  predict_y_train = lreg.predict( trainX.iloc[:,:deg] )
  predict_y_test = lreg.predict( testX.iloc[:,:deg] )

  lm_series =  ( [deg] +
      [lreg.intercept_] +
      list(lreg.coef_) +
      [np.nan for i in range( 1, 21 - deg )] +
      [ np.sqrt( metrics.mean_squared_error( trainY, predict_y_train ) ),
      np.sqrt( metrics.mean_squared_error( testY, predict_y_test ) ),
      metrics.r2_score( trainY, predict_y_train ),
      metrics.r2_score( testY, predict_y_test ) ] )

#    reg_df = reg_df.append( pd.DataFrame( lm_series).T )
  reg_df = reg_df.append( pd.Series( lm_series, index = reg_df.columns ),  ignore_index = True )

  return lm_series, reg_df

Build models of increasing complexity, using features from ${x}$ up to ${x^{19}}$

In [31]:
all_df = get_detail_df()

for i in range( 1, 20 ):
  lm_series, all_df = get_lm_details( train_X, train_y, test_X, test_y, all_df, i )
In [32]:
all_df
Out[32]:
deg intercept x1 x2 x3 x4 x5 x6 x7 x8 ... x15 x16 x17 x18 x19 x20 train_rmse test_rmse train_r2 test_r2
0 1.0 2.435750e+00 -1.466560e+00 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.449967 0.572252 0.500771 0.511368
1 2.0 -3.348269e+00 6.603320e+00 -2.695677e+00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.389359 0.485270 0.626199 0.648622
2 3.0 -5.914700e+01 1.233629e+02 -8.188726e+01 1.746089e+01 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.165130 0.191296 0.932766 0.945397
3 4.0 6.517367e+00 -6.199142e+01 1.107198e+02 -6.990516e+01 1.460666e+01 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.148107 0.170051 0.945913 0.956851
4 5.0 3.056873e+02 -1.122314e+03 1.592523e+03 -1.090807e+03 3.615027e+02 -4.653215e+01 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.137575 0.166260 0.953332 0.958754
5 6.0 -1.210616e+02 6.884514e+02 -1.572278e+03 1.825999e+03 -1.133973e+03 3.580201e+02 -4.512721e+01 NaN NaN ... NaN NaN NaN NaN NaN NaN 0.136917 0.168525 0.953778 0.957622
6 7.0 2.109537e+03 -1.033680e+04 2.155822e+04 -2.487673e+04 1.718857e+04 -7.116032e+03 1.633444e+03 -1.601551e+02 NaN ... NaN NaN NaN NaN NaN NaN 0.136396 0.168755 0.954129 0.957507
7 8.0 -2.266966e+04 1.296201e+05 -3.214282e+05 4.514907e+05 -3.929631e+05 2.170830e+05 -7.435901e+04 1.444502e+04 -1.218811e+03 ... NaN NaN NaN NaN NaN NaN 0.134606 0.164374 0.955324 0.959685
8 9.0 1.741336e+05 -1.120572e+06 3.182808e+06 -5.236997e+06 5.501030e+06 -3.825505e+06 1.761281e+06 -5.177217e+05 8.817264e+04 ... NaN NaN NaN NaN NaN NaN 0.131900 0.158642 0.957103 0.962447
9 10.0 1.091609e+05 -6.628316e+05 1.740816e+06 -2.562032e+06 2.264963e+06 -1.157661e+06 2.432825e+05 7.097439e+04 -6.075822e+04 ... NaN NaN NaN NaN NaN NaN 0.131890 0.158563 0.957110 0.962484
10 11.0 4.578920e+06 -3.539568e+07 1.237112e+08 -2.580704e+08 3.570455e+08 -3.440219e+08 2.355777e+08 -1.146580e+08 3.887407e+07 ... NaN NaN NaN NaN NaN NaN 0.130281 0.163028 0.958149 0.960342
11 12.0 8.528911e+06 -6.887467e+07 2.530699e+08 -5.593747e+08 8.282388e+08 -8.652451e+08 6.537789e+08 -3.598935e+08 1.431927e+08 ... NaN NaN NaN NaN NaN NaN 0.130246 0.162822 0.958172 0.960442
12 13.0 5.399358e+06 -3.884168e+07 1.209366e+08 -2.064541e+08 1.897060e+08 -3.875847e+07 -1.336482e+08 1.992770e+08 -1.527998e+08 ... NaN NaN NaN NaN NaN NaN 0.130189 0.162923 0.958209 0.960393
13 14.0 -4.486528e+07 3.718636e+08 -1.358232e+09 2.812944e+09 -3.433283e+09 1.972723e+09 9.169867e+08 -3.094968e+09 3.300750e+09 ... NaN NaN NaN NaN NaN NaN 0.129590 0.167240 0.958593 0.958266
14 15.0 -1.870482e+07 1.388100e+08 -4.340867e+08 7.038962e+08 -5.130753e+08 -1.537826e+08 6.150412e+08 -3.373201e+08 -3.332584e+08 ... 1.652110e+05 NaN NaN NaN NaN NaN 0.129239 0.168165 0.958816 0.957804
15 16.0 -7.573909e+06 5.057595e+07 -1.353798e+08 1.663391e+08 -4.047307e+07 -1.253046e+08 1.005224e+08 7.803535e+07 -1.401439e+08 ... -1.175214e+06 6.994810e+04 NaN NaN NaN NaN 0.129165 0.167953 0.958863 0.957910
16 17.0 -2.911222e+06 1.760599e+07 -4.033773e+07 3.549959e+07 1.075985e+07 -3.636457e+07 -1.913090e+06 3.611078e+07 -4.556071e+06 ... 3.305075e+06 -4.668237e+05 2.892562e+04 NaN NaN NaN 0.129140 0.167634 0.958879 0.958069
17 18.0 -9.896655e+05 5.494584e+06 -1.084291e+07 6.196379e+06 6.619831e+06 -6.714251e+06 -6.313303e+06 6.230046e+06 6.869384e+06 ... -4.482342e+06 1.187066e+06 -1.760201e+05 11339.451548 NaN NaN 0.129157 0.167214 0.958868 0.958279
18 19.0 -4.869904e+06 2.833746e+07 -6.118581e+07 4.686175e+07 2.371951e+07 -4.723136e+07 -1.662787e+07 4.481326e+07 1.589885e+07 ... -3.207359e+07 1.146116e+07 -2.533700e+06 321208.242896 -17965.210333 NaN 0.129074 0.168753 0.958921 0.957508

19 rows × 26 columns

Note: The coefficients of the higher-order polynomial models are inflated significantly; some of them reach the scale of ${10}^{8}$ and beyond.

Train Vs. Test Error

In [33]:
sn.set(rc={"figure.figsize": (16, 8)});

plt.plot( all_df.deg,
       all_df.train_rmse,
       label='train',
       color = 'r' )

plt.plot( all_df.deg,
       all_df.test_rmse,
       label='test',
       color = 'g' )

plt.legend(bbox_to_anchor=(1.05, 1),
         loc=2,
         borderaxespad=0.)
Out[33]:
<matplotlib.legend.Legend at 0x11ad37240>
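Note: The training RMSE keeps falling as the degree increases, but the test RMSE bottoms out around degree 9–10 and then starts creeping back up; the extra flexibility beyond that point is spent fitting noise.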

How parameter coefficients change as models become more complex

In [34]:
sn.set(rc={"figure.figsize": (16, 8)});

for i in range(1,5):
  column_name = 'x' + str(i)
  plt.plot( list(range(1,6)),
       all_df[column_name][0:5],
       label='coefficients')

Note: the parameter coefficients already reach the order of thousands once the model uses polynomial features up to the 5th degree

In [35]:
for i in range(1,7):
  column_name = 'x' + str(i)
  plt.plot( list(range(1,8)),
       all_df[column_name][0:7],
       label='coefficients')

Note: The coefficients inflate roughly exponentially for the higher-order polynomial models, so the next plot shows their magnitudes on a ${\log_{10}}$ scale.

In [36]:
for i in range(1,10):
  column_name = 'x' + str(i)
  plt.plot( list(range(1,11)),
       list( map( lambda a: math.log( abs( a ), 10 ) , all_df[column_name][0:10] ) ),
       label='coefficients')

Learning with Ridge Regularization
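Ridge regression keeps all the features but shrinks their coefficients by adding an L2 penalty on the parameter vector; the alpha argument controls how strong the penalty is (0.01 in the fits below).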

In [37]:
from sklearn.linear_model import Ridge

def get_lm_ridge( curve, alpha = 0.01, deg = 1 ):

  # Fit a ridge (L2-penalized) model on the first `deg` polynomial feature columns
  lreg = Ridge( alpha )
  lreg.fit( curve.iloc[:,:deg], curve.y )

  lreg_predict_y = lreg.predict( curve.iloc[:,:deg] )

  plt.plot( curve.x, curve.y, 'k.')
  plt.plot( curve.x, lreg_predict_y, 'g-', label='degree ' + str(deg) + ' fit' )
  plt.legend(loc='upper right')

  return lreg
In [38]:
sn.set(rc={"figure.figsize": (10, 6)});

get_lm_ridge( xy_df, deg = 10 )
Out[38]:
Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
 normalize=False, random_state=None, solver='auto', tol=0.001)
In [39]:
from sklearn.linear_model import Ridge

def get_lm_ridge_details( trainX, trainY, testX, testY, reg_df, deg = 1, alpha = 0.01 ):

  lreg = Ridge( alpha )
  lreg.fit( trainX.iloc[:,:deg], trainY )

  predict_y_train = lreg.predict( trainX.iloc[:,:deg] )
  predict_y_test = lreg.predict( testX.iloc[:,:deg] )

  lm_series =  ( [deg] +
      [lreg.intercept_] +
      list(lreg.coef_) +
      [np.nan for i in range( 1, 21 - deg )] +
      [ np.sqrt( metrics.mean_squared_error( trainY, predict_y_train ) ),
      np.sqrt( metrics.mean_squared_error( testY, predict_y_test ) ),
      metrics.r2_score( trainY, predict_y_train ),
      metrics.r2_score( testY, predict_y_test ) ] )

#    reg_df = reg_df.append( pd.DataFrame( lm_series).T )
  reg_df = reg_df.append( pd.Series( lm_series, index = reg_df.columns ),  ignore_index = True )

  return lm_series, reg_df
In [40]:
all_ridge_df = get_detail_df()

for i in range( 1, 20 ):
  lm_ridge_series, all_ridge_df = get_lm_ridge_details( train_X, train_y, test_X, test_y, all_ridge_df, i )
In [41]:
all_ridge_df.head(4)
Out[41]:
deg intercept x1 x2 x3 x4 x5 x6 x7 x8 ... x15 x16 x17 x18 x19 x20 train_rmse test_rmse train_r2 test_r2
0 1.0 2.431832 -1.463976 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.449967 0.572504 0.500770 0.510938
1 2.0 -2.407683 5.301448 -2.263923 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.391042 0.494212 0.622961 0.635553
2 3.0 -2.293916 5.025288 -2.050778 -0.052455 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.391811 0.494692 0.621477 0.634845
3 4.0 -7.750574 9.331802 5.111124 -8.466979 2.233533 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.303768 0.397748 0.772478 0.763940

4 rows × 26 columns

In [42]:
all_df.head(4)
Out[42]:
deg intercept x1 x2 x3 x4 x5 x6 x7 x8 ... x15 x16 x17 x18 x19 x20 train_rmse test_rmse train_r2 test_r2
0 1.0 2.435750 -1.466560 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.449967 0.572252 0.500771 0.511368
1 2.0 -3.348269 6.603320 -2.695677 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.389359 0.485270 0.626199 0.648622
2 3.0 -59.146996 123.362856 -81.887260 17.460885 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.165130 0.191296 0.932766 0.945397
3 4.0 6.517367 -61.991423 110.719845 -69.905163 14.60666 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 0.148107 0.170051 0.945913 0.956851

4 rows × 26 columns

In [43]:
sn.set(rc={"figure.figsize": (16, 8)});

for i in range(1,5):
  column_name = 'x' + str(i)
  plt.plot( list(range(1,6)),
       all_ridge_df[column_name][0:5],
       label='coefficients')
In [44]:
for i in range(1,10):
  column_name = 'x' + str(i)
  plt.plot( list(range(1,11)),
       all_ridge_df[column_name][0:10],
       label='coefficients')
In [45]:
plt.plot( all_ridge_df.deg,
       all_ridge_df.train_rmse,
       label='train',
       color = 'r' )

plt.plot( all_ridge_df.deg,
       all_ridge_df.test_rmse,
       label='test',
       color = 'g' )

plt.legend(bbox_to_anchor=(1.05, 1),
         loc=2,
         borderaxespad=0.)
Out[45]:
<matplotlib.legend.Legend at 0x11a770780>
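Note: With the ridge penalty the coefficients stay orders of magnitude smaller than in the unregularized fits (compare the two head(4) tables above). The fits here use a fixed alpha = 0.01; in practice alpha is usually tuned by cross-validation. A minimal sketch using scikit-learn's RidgeCV, assuming the train_X / train_y split created earlier:

from sklearn.linear_model import RidgeCV

# Search a grid of penalty strengths and keep the one with the best cross-validated score
ridge_cv = RidgeCV( alphas = [0.001, 0.01, 0.1, 1.0, 10.0], cv = 5 )
ridge_cv.fit( train_X, train_y )

print( ridge_cv.alpha_ )   # chosen regularization strength
print( ridge_cv.coef_ )    # coefficients at that alpha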

Learning with Lasso Regularization
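Lasso replaces the L2 penalty with an L1 penalty on the coefficients. Besides shrinking them, the L1 penalty tends to drive many coefficients exactly to zero, which is what makes lasso usable for feature selection.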

In [46]:
from sklearn.linear_model import Lasso

def get_lm_lasso( curve, alpha = 1, deg = 1 ):

  # Fit a lasso (L1-penalized) model on the first `deg` polynomial feature columns
  lreg = Lasso( alpha )
  lreg.fit( curve.iloc[:,:deg], curve.y )

  lreg_predict_y = lreg.predict( curve.iloc[:,:deg] )

  plt.plot( curve.x, curve.y, 'k.')
  plt.plot( curve.x, lreg_predict_y, 'g-', label='degree ' + str(deg) + ' fit' )
  plt.legend(loc='upper right')

  return lreg
In [47]:
sn.set(rc={"figure.figsize": (10, 6)});

lasso_reg = get_lm_lasso( xy_df, deg = 20 )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/linear_model/coordinate_descent.py:466: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations
ConvergenceWarning)
In [48]:
lasso_reg.coef_
Out[48]:
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
       0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
       0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
      -2.67826285e-03,  -2.85146631e-03,  -4.81888872e-04,
      -2.62943559e-05,   6.45720019e-05,   8.27839538e-05,
       3.40761043e-05,   1.36301086e-05,   5.17250911e-06,
       1.79101620e-06,   0.00000000e+00])
In [49]:
from sklearn.linear_model import Lasso

def get_lm_lasso_details( trainX, trainY, testX, testY, reg_df, deg = 1, alpha = 0.01 ):

  lreg = Lasso( alpha, max_iter=10000, tol=0.001 )
  lreg.fit( trainX.iloc[:,:deg], trainY )

  predict_y_train = lreg.predict( trainX.iloc[:,:deg] )
  predict_y_test = lreg.predict( testX.iloc[:,:deg] )

  lm_series =  ( [deg] +
      [lreg.intercept_] +
      list(lreg.coef_) +
      [np.nan for i in range( 1, 21 - deg )] +
      [ np.sqrt( metrics.mean_squared_error( trainY, predict_y_train ) ),
      np.sqrt( metrics.mean_squared_error( testY, predict_y_test ) ),
      metrics.r2_score( trainY, predict_y_train ),
      metrics.r2_score( testY, predict_y_test ) ] )

#    reg_df = reg_df.append( pd.DataFrame( lm_series).T )
  reg_df = reg_df.append( pd.Series( lm_series, index = reg_df.columns ),  ignore_index = True )

  return lm_series, reg_df
In [50]:
all_lasso_df = get_detail_df()

for i in range( 1, 20 ):
  lm_lasso_series, all_lasso_df = get_lm_lasso_details( train_X, train_y, test_X, test_y, all_lasso_df, i )
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/linear_model/coordinate_descent.py:466: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations
ConvergenceWarning)
In [51]:
all_lasso_df.iloc[0:10,1:12]
Out[51]:
intercept x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0 2.275186 -1.360659 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1.399867 0.000000 -0.496269 NaN NaN NaN NaN NaN NaN NaN NaN
2 1.076551 0.000000 0.000000 -0.220996 NaN NaN NaN NaN NaN NaN NaN
3 0.903122 0.000000 0.000000 -0.000000 -0.104994 NaN NaN NaN NaN NaN NaN
4 0.891373 0.000000 0.000000 -0.000000 -0.093835 -0.005440 NaN NaN NaN NaN NaN
5 0.859455 0.000000 0.077840 0.000000 -0.000000 -0.178715 0.059877 NaN NaN NaN NaN
6 -1.152482 0.000000 2.228656 0.109738 0.000000 -0.571527 -0.227236 0.185842 NaN NaN NaN
7 -0.851271 0.000000 0.000000 1.985633 0.000000 -0.180465 -0.593698 -0.001164 0.112138 NaN NaN
8 -0.953545 0.000000 0.000000 1.840729 0.000000 0.000000 -0.437896 -0.096776 0.003042 0.050479 NaN
9 -0.773015 0.000000 0.000000 1.479247 0.074911 0.000000 -0.250933 -0.136345 -0.008628 0.014248 0.015828
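Note how, unlike ridge, lasso sets many of the coefficients exactly to zero; the columns with non-zero coefficients are effectively the features the model has selected. A minimal sketch of reading the selected features off a fitted lasso model, assuming the train_X / train_y split created earlier:

from sklearn.linear_model import Lasso

lasso = Lasso( alpha = 0.01, max_iter = 10000 )
lasso.fit( train_X, train_y )

# Keep only the features whose coefficients were not shrunk to exactly zero
selected_features = [ col for col, coef in zip( train_X.columns, lasso.coef_ ) if coef != 0.0 ]
print( selected_features )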
In [52]:
for i in range(1,5):
  column_name = 'x' + str(i)
  plt.plot( list(range(1,6)),
       all_lasso_df[column_name][0:5],
       label='coefficients')
In [53]:
for i in range(1,10):
  column_name = 'x' + str(i)
  plt.plot( list(range(1,11)),
       all_lasso_df[column_name][0:10],
       label='coefficients')