Using Decision Trees, KNN, Random Forest, Bagging and Boosting for Classification

This tutorial discusses how to build classification models and how to evaluate them.

Topics covered in this tutorial

  • Loading the dataset
  • Creating dummy variables
  • Splitting the dataset into train and test
  • Building a decision tree model and searching for the optimal tree depth
  • Exporting the decision tree graph
  • Calculating Gini impurity
  • Building a KNN model and searching for the optimal number of neighbors
  • Building a random forest model and doing a grid search for optimal hyperparameters such as the number of estimators and max depth
  • Building a bagging model
  • Building AdaBoost and gradient boosting models

Here is an interesting problem: understanding which factors contribute to CHD and whether CHD can be predicted by building an analytical model.

The next two sections introduce some basics of CHD, where the dataset comes from, and what attributes are available in it.

What is coronary heart disease?

Coronary heart disease (CHD) occurs when your coronary arteries (the arteries that supply your heart muscle with oxygen-rich blood) become narrowed by a gradual build-up of fatty material, known as plaque, within their walls. Plaque is made up of cholesterol and other substances. Narrowed arteries can cause symptoms such as chest pain (angina), shortness of breath, and fatigue.

Dataset Description

Data is available at: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ and the header information is available at: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.info.txt

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD-positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al., 1983, South African Medical Journal.

Import and load the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [2]:
saheart_ds = pd.read_csv( "SAheart.data" )
In [3]:
saheart_ds.head()
Out[3]:
row.names sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
0 1 160 12.00 5.73 23.11 Present 49 25.30 97.20 52 1
1 2 144 0.01 4.41 28.61 Absent 55 28.87 2.06 63 1
2 3 118 0.08 3.48 32.28 Present 52 29.14 3.81 46 0
3 4 170 7.50 6.41 38.03 Present 51 31.99 24.26 58 1
4 5 134 13.60 3.50 27.78 Present 60 25.99 57.34 49 1
In [4]:
saheart_ds.columns
Out[4]:
Index(['row.names', 'sbp', 'tobacco', 'ldl', 'adiposity', 'famhist', 'typea',
     'obesity', 'alcohol', 'age', 'chd'],
    dtype='object')
In [5]:
saheart_ds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 11 columns):
row.names    462 non-null int64
sbp          462 non-null int64
tobacco      462 non-null float64
ldl          462 non-null float64
adiposity    462 non-null float64
famhist      462 non-null object
typea        462 non-null int64
obesity      462 non-null float64
alcohol      462 non-null float64
age          462 non-null int64
chd          462 non-null int64
dtypes: float64(5), int64(5), object(1)
memory usage: 39.8+ KB

The class label in the column chd indicates whether the person has coronary heart disease: negative (0) or positive (1).
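
A quick check (a sketch, not part of the original notebook output) of the class distribution, which also confirms the roughly two-controls-per-case design described above:

saheart_ds['chd'].value_counts()   # 0 = no CHD, 1 = CHD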

Attributes description:

  • sbp: systolic blood pressure
  • tobacco: cumulative tobacco (kg)
  • ldl: low density lipoprotein cholesterol
  • adiposity: the size of the hips compared to the person's height
  • famhist: family history of heart disease (Present, Absent)
  • typea: type-A behavior
  • obesity: body mass index (BMI)
  • alcohol: current alcohol consumption
  • age: age at onset

There are no missing values. This is good news, as we do not have to impute any data.
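
A quick way to verify this (a sketch; the info() output above already shows 462 non-null entries in every column):

saheart_ds.isnull().sum()   # all zeros means nothing needs to be imputed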

Encoding categorical features

In [6]:
saheart_model_df = pd.get_dummies( saheart_ds, drop_first = True )
In [7]:
saheart_model_df.head()
Out[7]:
row.names sbp tobacco ldl adiposity typea obesity alcohol age chd famhist_Present
0 1 160 12.00 5.73 23.11 49 25.30 97.20 52 1 1.0
1 2 144 0.01 4.41 28.61 55 28.87 2.06 63 1 0.0
2 3 118 0.08 3.48 32.28 52 29.14 3.81 46 0 1.0
3 4 170 7.50 6.41 38.03 51 31.99 24.26 58 1 1.0
4 5 134 13.60 3.50 27.78 60 25.99 57.34 49 1 1.0
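
Note that pd.get_dummies with drop_first = True turns the two-level famhist column into a single 0/1 column famhist_Present. A small sketch (not in the original notebook) to verify the encoding against the original column:

# famhist_Present should be 1 exactly where the original famhist column is 'Present'
(saheart_model_df['famhist_Present'] == (saheart_ds['famhist'] == 'Present')).all()
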
In [8]:
saheart_model_df = saheart_model_df.drop( "row.names", axis = 1 )
In [9]:
saheart_model_df.head()
Out[9]:
sbp tobacco ldl adiposity typea obesity alcohol age chd famhist_Present
0 160 12.00 5.73 23.11 49 25.30 97.20 52 1 1.0
1 144 0.01 4.41 28.61 55 28.87 2.06 63 1 0.0
2 118 0.08 3.48 32.28 52 29.14 3.81 46 0 1.0
3 170 7.50 6.41 38.03 51 31.99 24.26 58 1 1.0
4 134 13.60 3.50 27.78 60 25.99 57.34 49 1 1.0
In [10]:
saheart_model_df.columns
Out[10]:
Index(['sbp', 'tobacco', 'ldl', 'adiposity', 'typea', 'obesity', 'alcohol',
     'age', 'chd', 'famhist_Present'],
    dtype='object')

Splitting Dataset into Train and Test

In [11]:
from sklearn.model_selection import train_test_split


feature_cols = ['sbp', 'tobacco', 'ldl',
              'adiposity', 'typea',
              'obesity', 'alcohol',
              'age', 'famhist_Present' ]

train_X, test_X,  \
train_y, test_y = train_test_split( saheart_model_df[feature_cols],
                                  saheart_model_df['chd'],
                                  test_size = 0.3,
                                  random_state = 42 )
In [12]:
len( train_X )
Out[12]:
323
In [13]:
len( test_X )
Out[13]:
139

Building Decision Tree Model

In [14]:
from sklearn import metrics
In [15]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

Searching for optimal tree depth

In [16]:
depths_list = [2,3,4,5,6]

for depth in depths_list:
  clf_tree = DecisionTreeClassifier( max_depth = depth )
  clf_tree.fit( train_X, train_y )
  print( "Tree Depth: ",
        depth,
        " - ROC: ",
        metrics.roc_auc_score( test_y, clf_tree.predict( test_X ) ) )
Tree Depth:  2  - ROC:  0.519501133787
Tree Depth:  3  - ROC:  0.519614512472
Tree Depth:  4  - ROC:  0.590136054422
Tree Depth:  5  - ROC:  0.569727891156
Tree Depth:  6  - ROC:  0.548412698413

Note: A tree depth of 4 seems to be optimal, where the ROC AUC on the test set is highest.
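
For a visual check of the search, here is a small sketch (not part of the original notebook) that re-runs the loop above and plots the test AUC against tree depth. Without a fixed random_state the exact numbers may vary slightly between runs.

# Re-run the depth search and plot test ROC AUC vs. tree depth
auc_by_depth = []
for depth in depths_list:
    clf_tree = DecisionTreeClassifier( max_depth = depth )
    clf_tree.fit( train_X, train_y )
    auc_by_depth.append( metrics.roc_auc_score( test_y, clf_tree.predict( test_X ) ) )

plt.plot( depths_list, auc_by_depth, marker = 'o' )
plt.xlabel( 'Tree depth' )
plt.ylabel( 'Test ROC AUC' )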

In [17]:
clf_tree = DecisionTreeClassifier( max_depth = 4 )
In [18]:
clf_tree.fit( train_X, train_y )
Out[18]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
          max_features=None, max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          presort=False, random_state=None, splitter='best')
In [19]:
tree_predict = clf_tree.predict( test_X )
In [20]:
metrics.accuracy_score( test_y, tree_predict )
Out[20]:
0.64028776978417268
In [21]:
tree_cm = metrics.confusion_matrix( test_y, tree_predict, [1,0]  )
In [22]:
sn.heatmap(tree_cm, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[22]:
<matplotlib.text.Text at 0x110311400>

To create a decision tree visualization graph:

  • Install GraphViz (as per the OS and version you are using)
  • pip install pydotplus

Note: The notebook needs to be restarted after installing.

In [23]:
export_graphviz( clf_tree,
              out_file = "chd_tree.odt",
              feature_names = train_X.columns )
In [24]:
import pydotplus as pdot

chd_tree_graph = pdot.graphviz.graph_from_dot_file( 'chd_tree.odt' )
In [25]:
chd_tree_graph.write_jpg( 'chd_tree.jpg' )
Out[25]:
True
In [26]:
from IPython.display import Image
Image(filename='chd_tree.jpg')
Out[26]:

How is the Gini index calculated?

Gini impurity and entropy are two metrics for choosing how to split a tree. Gini impurity is the probability of a randomly chosen sample being classified incorrectly if we randomly pick a label according to the distribution of labels in that branch.

Gini impurity can be computed by summing the probability $f_{i}$ of an item with label $i$ being chosen times the probability $1-f_{i}$ of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.

$${\displaystyle I_{G}(f)=\sum _{i=1}^{J}f_{i}(1-f_{i})=\sum _{i=1}^{J}(f_{i}-{f_{i}}^{2})=\sum _{i=1}^{J}f_{i}-\sum _{i=1}^{J}{f_{i}}^{2}=1-\sum _{i=1}^{J}{f_{i}}^{2}}$$
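
A small helper (a sketch, not part of the original notebook) that implements this formula from raw class counts:

import numpy as np

def gini_impurity( class_counts ):
    """Gini impurity 1 - sum(f_i ** 2), where f_i are the class proportions."""
    f = np.asarray( class_counts, dtype = float )
    f = f / f.sum()
    return 1.0 - np.sum( f ** 2 )

# e.g. gini_impurity( [212, 111] ) should match the value computed in the next cell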

In this specific example, the sample size at the root node is 323, and the numbers of samples in the two classes are 212 and 111.

So, the Gini impurity at the root node (level 1) is:

In [27]:
gini_node_1 = 1 - pow(212/323, 2) - pow (111/323, 2)
print( gini_node_1 )
0.45111138801292056

Information gain is the reduction in impurity after splitting the dataset: the impurity of the parent node minus the weighted average of the impurities of its child nodes.

$Gini_{parent} - \sum_{k} \frac{n_{k}}{n}\, Gini_{child_{k}}$

For the second level, the Gini impurity for the left split, $Gini_{leftsplit}$, is

In [28]:
gini_left_split = 1 - pow(153/194, 2) - pow (41/194, 2)
gini_left_split
Out[28]:
0.3333510468700181

For the second level, the Gini impurity for the right split, $Gini_{rightsplit}$, is

In [29]:
gini_right_split = 1 - pow(59/129, 2) - pow (70/129, 2)
gini_right_split
Out[29]:
0.4963644011778139
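
Putting these together, the information gain of this split is the parent impurity minus the weighted average of the two child impurities. A small sketch using the counts shown above:

# Weighted information gain for the split at the root node
n_parent, n_left, n_right = 323, 194, 129
info_gain = ( gini_node_1
              - ( n_left / n_parent ) * gini_left_split
              - ( n_right / n_parent ) * gini_right_split )
print( round( info_gain, 4 ) )   # approximately 0.0526
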
In [30]:
from sklearn.model_selection import GridSearchCV
In [31]:
tuned_parameters = [{'max_depth': range(2,10),
                   'criterion': ['gini', 'entropy']}]
In [32]:
clf_tree = DecisionTreeClassifier()

clf = GridSearchCV(clf_tree,
                 tuned_parameters,
                 cv=5,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[32]:
GridSearchCV(cv=5, error_score='raise',
     estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
          max_features=None, max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          presort=False, random_state=None, splitter='best'),
     fit_params={}, iid=True, n_jobs=1,
     param_grid=[{'max_depth': range(2, 10), 'criterion': ['gini', 'entropy']}],
     pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
     scoring='roc_auc', verbose=0)
In [33]:
clf.grid_scores_
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:667: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
DeprecationWarning)
Out[33]:
[mean: 0.60909, std: 0.07579, params: {'max_depth': 2, 'criterion': 'gini'},
mean: 0.60525, std: 0.05638, params: {'max_depth': 3, 'criterion': 'gini'},
mean: 0.61821, std: 0.06023, params: {'max_depth': 4, 'criterion': 'gini'},
mean: 0.61578, std: 0.04535, params: {'max_depth': 5, 'criterion': 'gini'},
mean: 0.62046, std: 0.06502, params: {'max_depth': 6, 'criterion': 'gini'},
mean: 0.60406, std: 0.05802, params: {'max_depth': 7, 'criterion': 'gini'},
mean: 0.59814, std: 0.06339, params: {'max_depth': 8, 'criterion': 'gini'},
mean: 0.59145, std: 0.05120, params: {'max_depth': 9, 'criterion': 'gini'},
mean: 0.59634, std: 0.03486, params: {'max_depth': 2, 'criterion': 'entropy'},
mean: 0.60486, std: 0.06006, params: {'max_depth': 3, 'criterion': 'entropy'},
mean: 0.58674, std: 0.07092, params: {'max_depth': 4, 'criterion': 'entropy'},
mean: 0.58847, std: 0.04521, params: {'max_depth': 5, 'criterion': 'entropy'},
mean: 0.59607, std: 0.05168, params: {'max_depth': 6, 'criterion': 'entropy'},
mean: 0.58045, std: 0.02948, params: {'max_depth': 7, 'criterion': 'entropy'},
mean: 0.57065, std: 0.03131, params: {'max_depth': 8, 'criterion': 'entropy'},
mean: 0.52809, std: 0.03169, params: {'max_depth': 9, 'criterion': 'entropy'}]
In [34]:
clf.best_params_
Out[34]:
{'criterion': 'gini', 'max_depth': 6}

Build the final model with the chosen hyperparameters

In [35]:
clf_tree = DecisionTreeClassifier( criterion = 'entropy', max_depth = 7 )
In [36]:
clf_tree.fit( train_X, train_y )
Out[36]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=7,
          max_features=None, max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          presort=False, random_state=None, splitter='best')
In [37]:
tree_predict = clf_tree.predict( test_X )
In [38]:
metrics.accuracy_score( test_y, tree_predict )
Out[38]:
0.64028776978417268
In [39]:
tree_cm = metrics.confusion_matrix( test_y, tree_predict, [1,0]  )
In [40]:
sn.heatmap(tree_cm, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[40]:
<matplotlib.text.Text at 0x113b7e630>

Building KNN Model

In [41]:
from sklearn.neighbors import KNeighborsClassifier
In [42]:
knn_clf = KNeighborsClassifier( n_neighbors = 5 )
In [43]:
knn_clf.fit( train_X, train_y )
Out[43]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2,
         weights='uniform')
In [44]:
knn_cm = metrics.confusion_matrix( test_y, knn_clf.predict( test_X ), [1,0]  )
In [45]:
sn.heatmap(knn_cm, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[45]:
<matplotlib.text.Text at 0x113c3deb8>
In [46]:
metrics.accuracy_score( test_y, knn_clf.predict( test_X ) )
Out[46]:
0.65467625899280579
In [47]:
tuned_parameters = [{'n_neighbors': [5, 10, 15, 20]}]
In [48]:
clf = GridSearchCV(KNeighborsClassifier(),
                 tuned_parameters,
                 cv=10,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[48]:
GridSearchCV(cv=10, error_score='raise',
     estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2,
         weights='uniform'),
     fit_params={}, iid=True, n_jobs=1,
     param_grid=[{'n_neighbors': [5, 10, 15, 20]}],
     pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
     scoring='roc_auc', verbose=0)
In [49]:
clf.best_params_
Out[49]:
{'n_neighbors': 15}
In [50]:
clf.best_score_
Out[50]:
0.72955282591505499
In [51]:
knn_clf_10 = KNeighborsClassifier( n_neighbors = 15 )
knn_clf_10.fit( train_X, train_y )
Out[51]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=15, p=2,
         weights='uniform')
In [52]:
knn_cm_10 = metrics.confusion_matrix( test_y, knn_clf_10.predict( test_X ), [1,0]  )
In [53]:
sn.heatmap(knn_cm_10, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[53]:
<matplotlib.text.Text at 0x113bf1f28>

Using Random Forest Model

In [54]:
from sklearn.ensemble import RandomForestClassifier
In [55]:
radm_clf = RandomForestClassifier()
radm_clf.fit( train_X, train_y )
Out[55]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
          max_depth=None, max_features='auto', max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
          verbose=0, warm_start=False)
In [56]:
radm_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': radm_clf.predict( test_X ) } )
In [57]:
metrics.accuracy_score( radm_test_pred.actual, radm_test_pred.predicted )
Out[57]:
0.61151079136690645
In [58]:
tuned_parameters = [{'max_depth': [5,10,15],
                   'n_estimators': [10,50,100],
                   'max_features': [0.1,0.3,0.5]}]
In [59]:
radm_clf = RandomForestClassifier()

clf = GridSearchCV(radm_clf,
                 tuned_parameters,
                 cv=5,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[59]:
GridSearchCV(cv=5, error_score='raise',
     estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
          max_depth=None, max_features='auto', max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
          verbose=0, warm_start=False),
     fit_params={}, iid=True, n_jobs=1,
     param_grid=[{'n_estimators': [10, 50, 100], 'max_features': [0.1, 0.3, 0.5], 'max_depth': [5, 10, 15]}],
     pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
     scoring='roc_auc', verbose=0)
In [60]:
clf.best_params_
Out[60]:
{'max_depth': 5, 'max_features': 0.1, 'n_estimators': 100}
In [61]:
clf.best_score_
Out[61]:
0.72263043804175819
In [62]:
radm_clf = RandomForestClassifier( max_depth = 5,
                                max_features = 0.1,
                                n_estimators = 100)
radm_clf.fit( train_X, train_y )
Out[62]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
          max_depth=5, max_features=0.1, max_leaf_nodes=None,
          min_impurity_split=1e-07, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
          verbose=0, warm_start=False)
In [63]:
radm_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': radm_clf.predict( test_X ) } )
In [64]:
tree_cm = metrics.confusion_matrix( radm_test_pred.actual, radm_test_pred.predicted, [1,0] )

sn.heatmap(tree_cm, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )


plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[64]:
<matplotlib.text.Text at 0x113f96cf8>
In [65]:
indices = np.argsort(radm_clf.feature_importances_)[::-1]

feature_rank = pd.DataFrame( columns = ['rank', 'feature', 'importance'] )
for f in range(train_X.shape[1]):
  feature_rank.loc[f] = [f+1,
                         train_X.columns[indices[f]],
                         radm_clf.feature_importances_[indices[f]]]

sn.barplot( y = 'feature', x = 'importance', data = feature_rank )
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x113fa6ba8>

Using Bagging Classifier

In [66]:
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
In [67]:
logreg_clf = LogisticRegression()
logreg_clf.fit( train_X, train_y )
predict_proba_df = pd.DataFrame( logreg_clf.predict_proba( test_X ) )
predict_proba_df.columns = ['prob_0', 'prob_1']
auc_score = metrics.roc_auc_score( test_y, predict_proba_df.prob_1  )
auc_score
Out[67]:
0.7970521541950113
In [68]:
logreg_clf = LogisticRegression()
bag_clf = BaggingClassifier(logreg_clf, n_estimators = 1, max_features = 1.0, max_samples = 1.0 )
bag_clf.fit( train_X, train_y )
predict_proba_df = pd.DataFrame( bag_clf.predict_proba( test_X ) )
predict_proba_df.columns = ['prob_0', 'prob_1']
auc_score = metrics.roc_auc_score( test_y, predict_proba_df.prob_1  )
auc_score
Out[68]:
0.74444444444444446
In [69]:
bag_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': bag_clf.predict( test_X ) } )
bag_cm = metrics.confusion_matrix( bag_test_pred.actual, bag_test_pred.predicted, [1,0] )

sn.heatmap(bag_cm, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )


plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[69]:
<matplotlib.text.Text at 0x11433cc88>
In [70]:
tuned_parameters = [{'max_samples': [0.5,0.7,1.0],
                   'n_estimators': [10,20,50],
                   'max_features': [4,6,1.0]}]

logreg_clf = LogisticRegression()

bag_clf = BaggingClassifier(logreg_clf)

clf = GridSearchCV(bag_clf,
                 tuned_parameters,
                 cv=5,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[70]:
GridSearchCV(cv=5, error_score='raise',
     estimator=BaggingClassifier(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
        penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
        verbose=0, warm_start...n_estimators=10, n_jobs=1, oob_score=False,
       random_state=None, verbose=0, warm_start=False),
     fit_params={}, iid=True, n_jobs=1,
     param_grid=[{'n_estimators': [10, 20, 50], 'max_features': [4, 6, 1.0], 'max_samples': [0.5, 0.7, 1.0]}],
     pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
     scoring='roc_auc', verbose=0)
In [71]:
clf.best_score_
Out[71]:
0.75683191937061911
In [72]:
clf.best_params_
Out[72]:
{'max_features': 4, 'max_samples': 0.7, 'n_estimators': 10}

Using AdaBoost Classifier

In [73]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
In [74]:
tuned_parameters = [{'n_estimators': [100, 200, 500]}]

logreg_clf = LogisticRegression()

ada_clf = AdaBoostClassifier(logreg_clf)

clf = GridSearchCV(ada_clf,
                 tuned_parameters,
                 cv=5,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[74]:
GridSearchCV(cv=5, error_score='raise',
     estimator=AdaBoostClassifier(algorithm='SAMME.R',
        base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
        penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
        verbose=0, warm_start=False),
        learning_rate=1.0, n_estimators=50, random_state=None),
     fit_params={}, iid=True, n_jobs=1,
     param_grid=[{'n_estimators': [100, 200, 500]}],
     pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
     scoring='roc_auc', verbose=0)
In [75]:
clf.best_score_
Out[75]:
0.75810310287704041
In [76]:
clf.best_params_
Out[76]:
{'n_estimators': 500}
In [77]:
logreg_clf = LogisticRegression()

ada_clf = AdaBoostClassifier(logreg_clf, n_estimators = 500)

ada_clf.fit(train_X, train_y )

ada_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': ada_clf.predict( test_X ) } )
bag_cm = metrics.confusion_matrix( ada_test_pred.actual, ada_test_pred.predicted, [1,0] )

sn.heatmap(bag_cm, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )


plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[77]:
<matplotlib.text.Text at 0x114576d30>

Gradient Boosting Classifier

In [78]:
tuned_parameters = [{'n_estimators': [50, 100], 'max_depth': [1,2,3]}]

gboost_clf = GradientBoostingClassifier()

clf = GridSearchCV(gboost_clf,
                 tuned_parameters,
                 cv=5,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[78]:
GridSearchCV(cv=5, error_score='raise',
     estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
            learning_rate=0.1, loss='deviance', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, presort='auto', random_state=None,
            subsample=1.0, verbose=0, warm_start=False),
     fit_params={}, iid=True, n_jobs=1,
     param_grid=[{'n_estimators': [50, 100], 'max_depth': [1, 2, 3]}],
     pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
     scoring='roc_auc', verbose=0)
In [79]:
clf.best_score_
Out[79]:
0.70822281530161091
In [80]:
clf.best_params_
Out[80]:
{'max_depth': 1, 'n_estimators': 50}

Best Model

In [81]:
logreg_clf = LogisticRegression()
bag_clf = BaggingClassifier(logreg_clf, n_estimators = 10, max_features = 0.5, max_samples = 1.0 )
bag_clf.fit( train_X, train_y )

bag_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': bag_clf.predict( test_X ) } )
In [82]:
predict_proba_df = pd.DataFrame( bag_clf.predict_proba( test_X ) )
predict_proba_df.head()
Out[82]:
0 1
0 0.689258 0.310742
1 0.697193 0.302807
2 0.426387 0.573613
3 0.526153 0.473847
4 0.670731 0.329269
In [83]:
bag_test_pred = bag_test_pred.reset_index()
bag_test_pred['chd_0'] = predict_proba_df.iloc[:,0:1]
bag_test_pred['chd_1'] = predict_proba_df.iloc[:,1:2]
In [84]:
bag_test_pred[0:10]
Out[84]:
index actual predicted chd_0 chd_1
0 225 0 0 0.689258 0.310742
1 30 1 0 0.697193 0.302807
2 39 1 1 0.426387 0.573613
3 222 0 0 0.526153 0.473847
4 124 0 0 0.670731 0.329269
5 203 0 0 0.636480 0.363520
6 401 0 0 0.667898 0.332102
7 211 1 0 0.579987 0.420013
8 456 0 0 0.525463 0.474537
9 77 1 0 0.521641 0.478359
In [85]:
sn.distplot( bag_test_pred[bag_test_pred.actual == 1]["chd_1"], kde=False, color = 'b' )
sn.distplot( bag_test_pred[bag_test_pred.actual == 0]["chd_1"], kde=False, color = 'g' )
Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x1145fe390>
In [86]:
auc_score = metrics.roc_auc_score( bag_test_pred.actual, bag_test_pred.chd_1  )
round( float( auc_score ), 2 )
Out[86]:
0.79
In [87]:
fpr, tpr, thresholds = metrics.roc_curve( bag_test_pred.actual,
                                       bag_test_pred.chd_1,
                                       drop_intermediate = False )

plt.figure(figsize=(8, 6))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
In [88]:
tpr_fpr = pd.DataFrame( { 'tpr': tpr,
                       'fpr': fpr,
                       'thresholds': thresholds } )
tpr_fpr['diff'] = tpr_fpr.tpr - tpr_fpr.fpr
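
The next two cells pick the probability cutoff that maximizes the difference between the true positive rate and the false positive rate, i.e. Youden's J statistic:

$$J = TPR - FPR = sensitivity + specificity - 1$$
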
In [89]:
tpr_fpr = tpr_fpr.sort_values( 'diff', ascending = False )[0:10]
tpr_fpr
Out[89]:
fpr thresholds tpr diff
45 0.177778 0.420013 0.612245 0.434467
71 0.366667 0.347934 0.795918 0.429252
46 0.188889 0.418294 0.612245 0.423356
72 0.377778 0.347527 0.795918 0.418141
75 0.400000 0.336434 0.816327 0.416327
44 0.177778 0.421565 0.591837 0.414059
47 0.200000 0.416481 0.612245 0.412245
67 0.344444 0.363015 0.755102 0.410658
70 0.366667 0.354546 0.775510 0.408844
73 0.388889 0.339238 0.795918 0.407029
In [90]:
cutoff_prob = float( tpr_fpr['thresholds'].iloc[0] )
cutoff_prob
Out[90]:
0.4200128820095247
In [91]:
bag_test_pred['new_labels'] = bag_test_pred['chd_1'].map( lambda x: 1 if x >= cutoff_prob else 0 )
cm = metrics.confusion_matrix( bag_test_pred.actual,
                            bag_test_pred.new_labels, [1,0] )
sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["CHD", "NO CHD"] , yticklabels = ["CHD", "NO CHD"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[91]:
<matplotlib.text.Text at 0x11477bc50>