HR - Attrition Analytics - Part 2: Predict Attrition

Human resources are among the most critical assets of any organization. Organizations spend a huge amount of time and money to hire and nurture their employees, so it is a big loss when employees leave, especially key performers. If HR can predict whether an employee is at risk of leaving, they can identify attrition risks early and either provide the support needed to retain that employee or do preventive hiring to minimize the impact on the organization.

This dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics

Fields in the dataset include:

  • Employee satisfaction level
  • Last evaluation
  • Number of projects
  • Average monthly hours
  • Time spent at the company
  • Whether they have had a work accident
  • Whether they have had a promotion in the last 5 years
  • Department
  • Salary
  • Whether the employee has left

Given that we explored the relationships between the different attributes in Part 1, can we build a model to predict whether an employee will leave the company?

In [1]:
import pandas as pd
import numpy as np

Loading the dataset

In [2]:
hr_df = pd.read_csv( 'HR_comma_sep.csv' )
In [3]:
hr_df[0:5]
Out[3]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [4]:
hr_df.columns
Out[4]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'sales', 'salary'],
      dtype='object')
In [5]:
hr_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

Encoding Categorical Features

In [6]:
numerical_features = ['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company']
In [7]:
categorical_features = ['Work_accident','promotion_last_5years', 'sales', 'salary']

Create dummy variables for the categorical features using pd.get_dummies, dropping the first level of each to avoid collinearity.

In [8]:
hr_df = pd.get_dummies( hr_df, columns = categorical_features, drop_first = True )
In [9]:
hr_df[0:5]
Out[9]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company left Work_accident_1 promotion_last_5years_1 sales_RandD sales_accounting sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical salary_low salary_medium
0 0.38 0.53 2 157 3 1 0 0 0 0 0 0 0 0 1 0 0 1 0
1 0.80 0.86 5 262 6 1 0 0 0 0 0 0 0 0 1 0 0 0 1
2 0.11 0.88 7 272 4 1 0 0 0 0 0 0 0 0 1 0 0 0 1
3 0.72 0.87 5 223 5 1 0 0 0 0 0 0 0 0 1 0 0 1 0
4 0.37 0.52 2 159 3 1 0 0 0 0 0 0 0 0 1 0 0 1 0
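As a quick aside, here is a minimal, self-contained illustration (using a toy frame, not the HR data) of what drop_first = True does: for a column with k levels, get_dummies keeps k-1 dummy columns and drops the first level (alphabetically), which becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with a 3-level categorical column (illustrative only)
toy = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})

dummies = pd.get_dummies(toy, columns=['salary'], drop_first=True)

# Levels sort alphabetically: 'high', 'low', 'medium'.
# 'high' is dropped as the baseline, leaving two dummy columns.
print(list(dummies.columns))  # ['salary_low', 'salary_medium']
```

This is why the HR frame above has salary_low and salary_medium but no salary_high column: a row with zeros in both represents a high-salary employee.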

Splitting the dataset

In [10]:
feature_columns = hr_df.columns.difference( ['left'] )
In [11]:
feature_columns
Out[11]:
Index(['Work_accident_1', 'average_montly_hours', 'last_evaluation',
       'number_project', 'promotion_last_5years_1', 'salary_low',
       'salary_medium', 'sales_RandD', 'sales_accounting', 'sales_hr',
       'sales_management', 'sales_marketing', 'sales_product_mng',
       'sales_sales', 'sales_support', 'sales_technical', 'satisfaction_level',
       'time_spend_company'],
      dtype='object')
In [12]:
from sklearn.model_selection import train_test_split


train_X, test_X, train_y, test_y = train_test_split( hr_df[feature_columns],
                                                    hr_df['left'],
                                                    test_size = 0.2,
                                                    random_state = 42 )

Building Models

Logistic Regression Model

In [13]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit( train_X, train_y )
Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [14]:
list( zip( feature_columns, logreg.coef_[0] ) )
Out[14]:
[('Work_accident_1', -1.492662273697928),
 ('average_montly_hours', 0.0049756319469671751),
 ('last_evaluation', 0.59258561199640902),
 ('number_project', -0.3037334201263634),
 ('promotion_last_5years_1', -1.2172794554977486),
 ('salary_low', 1.8131727203902352),
 ('salary_medium', 1.3088620529311437),
 ('sales_RandD', -0.57076353396328328),
 ('sales_accounting', 0.093003101014934059),
 ('sales_hr', 0.35887721222928909),
 ('sales_management', -0.36238815711951106),
 ('sales_marketing', 0.13047436227743936),
 ('sales_product_mng', 0.023809236497969326),
 ('sales_sales', 0.075841821963689354),
 ('sales_support', 0.13493943705067998),
 ('sales_technical', 0.19545538533883172),
 ('satisfaction_level', -4.1082674718875776),
 ('time_spend_company', 0.26529847508134713)]
In [15]:
logreg.intercept_
Out[15]:
array([-1.53003344])
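Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives an odds ratio. A small sketch, using the satisfaction_level coefficient from the output above:

```python
import numpy as np

# Coefficient taken from the fitted model's output above (log-odds scale).
# exp(coef) gives the multiplicative change in the odds of leaving
# for a one-unit increase in the feature.
coef_satisfaction = -4.108
odds_ratio = np.exp(coef_satisfaction)
print(round(float(odds_ratio), 4))
```

An odds ratio far below 1 confirms the intuitive reading of the large negative coefficient: higher satisfaction sharply reduces the odds of leaving.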

Predicting the test cases

In [16]:
hr_test_pred = pd.DataFrame( { 'actual':  test_y,
                              'predicted': logreg.predict( test_X ) } )
In [17]:
hr_test_pred = hr_test_pred.reset_index()

Comparing the predictions with actual test data

In [18]:
hr_test_pred.sample( n = 10 )
Out[18]:
index actual predicted
796 10633 0 0
2724 9982 0 0
2901 6126 0 0
2417 3290 0 0
238 2128 0 0
2343 9241 0 0
891 11014 0 0
2892 5380 0 0
1184 9612 0 0
2773 13566 0 0

Creating a confusion matrix

In [19]:
from sklearn import metrics

cm = metrics.confusion_matrix( hr_test_pred.actual,
                              hr_test_pred.predicted, [1,0] )
cm
Out[19]:
array([[ 225,  481],
       [ 175, 2119]])
In [20]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [21]:
sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[21]:
<matplotlib.text.Text at 0x11a766940>
In [22]:
score = metrics.roc_auc_score( hr_test_pred.actual, hr_test_pred.predicted )
round( float(score), 2 )
Out[22]:
0.62

Observation:

  • Overall test accuracy is 78%, but accuracy is not a good measure here: the dataset is dominated by employees who did not leave, and the model predicts "not left" for most records, so accuracy looks high by default.

  • The objective of the model is to identify the people who are likely to leave, so that the company can intervene and act.

  • Part of the problem is the default cutoff: the classifier labels an employee as "left" only when the predicted probability of leaving exceeds 0.5.
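To see why accuracy is misleading under class imbalance, consider a toy illustration (not part of the notebook's data): a degenerate model that predicts nobody leaves still scores high accuracy, while catching none of the actual leavers.

```python
import numpy as np
from sklearn import metrics

# Illustrative labels: 90% of employees stay, 10% leave
actual = np.array([0] * 90 + [1] * 10)

# A degenerate "model" that always predicts "not left"
predicted = np.zeros(100, dtype=int)

accuracy = metrics.accuracy_score(actual, predicted)  # 0.9 -- looks good
recall = metrics.recall_score(actual, predicted)      # 0.0 -- catches no leavers
print(accuracy, recall)
```

This is why the rest of this section focuses on predicted probabilities and on choosing a better cutoff, rather than on raw accuracy.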

Predicting Probabilities

In [23]:
test_X[:1]
Out[23]:
Work_accident_1 average_montly_hours last_evaluation number_project promotion_last_5years_1 salary_low salary_medium sales_RandD sales_accounting sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical satisfaction_level time_spend_company
6723 1 226 0.96 5 0 0 1 0 0 0 0 1 0 0 0 0 0.65 2
In [24]:
logreg.predict_proba( test_X[:1] )
Out[24]:
array([[ 0.97203473,  0.02796527]])

Note:

The model predicts that the probability of this employee leaving the company is only 0.027, which is very low.

How good is the model?

In [25]:
predict_proba_df = pd.DataFrame( logreg.predict_proba( test_X ) )
predict_proba_df.head()
Out[25]:
0 1
0 0.972035 0.027965
1 0.917792 0.082208
2 0.770442 0.229558
3 0.523038 0.476962
4 0.975843 0.024157
In [26]:
hr_test_pred = pd.concat( [hr_test_pred, predict_proba_df], axis = 1 )
In [27]:
hr_test_pred.columns = ['index', 'actual', 'predicted', 'Left_0', 'Left_1']
In [28]:
auc_score = metrics.roc_auc_score( hr_test_pred.actual, hr_test_pred.Left_1  )
round( float( auc_score ), 2 )
Out[28]:
0.81
In [29]:
sn.distplot( hr_test_pred[hr_test_pred.actual == 1]["Left_1"], color = 'b' )
sn.distplot( hr_test_pred[hr_test_pred.actual == 0]["Left_1"], color = 'g' )
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a7e15c0>

Finding the optimal cutoff probability

In [30]:
fpr, tpr, thresholds = metrics.roc_curve( hr_test_pred.actual,
                                       hr_test_pred.Left_1,
                                       drop_intermediate = False )

plt.figure(figsize=(6, 4))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
In [31]:
thresholds[0:10]
Out[31]:
array([ 1.91905399,  0.91905399,  0.90057484,  0.88605892,  0.88001361,
        0.87933851,  0.87233298,  0.86974565,  0.86193266,  0.85880291])
In [32]:
fpr[0:10]
Out[32]:
array([ 0.        ,  0.00087184,  0.00130776,  0.00174368,  0.0021796 ,
        0.00261552,  0.00305144,  0.00348736,  0.00392328,  0.0043592 ])
In [33]:
tpr[0:10]
Out[33]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

Finding the optimal cutoff using Youden's index

  • Youden's index is the point where (Sensitivity + Specificity - 1) is maximum.
  • That is, where (TPR + TNR - 1) is maximum:
    • max( TPR - (1 - TNR) )
    • max( TPR - FPR )
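The sorted-DataFrame approach below finds this point by inspection; equivalently, the optimal cutoff can be picked directly with np.argmax. A self-contained sketch on synthetic scores (the notebook applies the same idea to the Left_1 probabilities):

```python
import numpy as np
from sklearn import metrics

# Synthetic labels and scores for illustration only
rng = np.random.RandomState(42)
actual = rng.binomial(1, 0.3, size=1000)
scores = np.clip(actual * 0.4 + rng.uniform(0, 0.7, size=1000), 0, 1)

fpr, tpr, thresholds = metrics.roc_curve(actual, scores)

# Youden's index: the threshold at which TPR - FPR is maximum
best = np.argmax(tpr - fpr)
optimal_cutoff = thresholds[best]
print(round(float(optimal_cutoff), 3))
```

np.argmax returns the index of the largest TPR - FPR gap, so thresholds[best] is the probability cutoff that best separates the two classes on the ROC curve.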
In [34]:
tpr_fpr = pd.DataFrame( { 'tpr': tpr, 'fpr': fpr, 'thresholds': thresholds } )
tpr_fpr['diff'] = tpr_fpr.tpr - tpr_fpr.fpr
tpr_fpr.sort_values( 'diff', ascending = False )[0:10]
Out[34]:
fpr thresholds tpr diff
996 0.243243 0.254034 0.736544 0.493301
992 0.241935 0.256022 0.735127 0.493192
997 0.243679 0.253207 0.736544 0.492865
993 0.242371 0.254657 0.735127 0.492756
994 0.242807 0.254589 0.735127 0.492320
1056 0.265475 0.237079 0.757790 0.492315
990 0.241500 0.256560 0.733711 0.492211
998 0.244551 0.253191 0.736544 0.491993
995 0.243243 0.254048 0.735127 0.491884
1057 0.265911 0.236939 0.757790 0.491879

Note:

  • Optimal cut-off probability is 0.254

Predicting with new cut-off probability

In [35]:
hr_test_pred['new_labels'] = hr_test_pred['Left_1'].map( lambda x: 1 if x >= 0.254 else 0 )
In [36]:
hr_test_pred[0:10]
Out[36]:
index actual predicted Left_0 Left_1 new_labels
0 6723 0 0 0.972035 0.027965 0
1 6473 0 0 0.917792 0.082208 0
2 4679 0 0 0.770442 0.229558 0
3 862 1 0 0.523038 0.476962 1
4 7286 0 0 0.975843 0.024157 0
5 8127 0 0 0.722851 0.277149 1
6 3017 0 0 0.985596 0.014404 0
7 3087 0 1 0.130254 0.869746 1
8 6425 0 0 0.769714 0.230286 0
9 2250 0 1 0.398617 0.601383 1
In [37]:
cm = metrics.confusion_matrix( hr_test_pred.actual,
                            hr_test_pred.new_labels, [1,0] )
sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[37]:
<matplotlib.text.Text at 0x11aae9828>

Building a Decision Tree

In [38]:
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier, export_graphviz
In [39]:
clf_tree = DecisionTreeClassifier( max_depth = 3 )
clf_tree.fit( train_X, train_y, )
Out[39]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [40]:
tree_test_pred = pd.DataFrame( {'actual':  test_y,
                              'predicted': clf_tree.predict( test_X ) } )
In [41]:
tree_test_pred.sample( n = 10 )
Out[41]:
actual predicted
83 1 1
7406 0 0
1346 1 1
11777 0 0
11956 0 1
14620 1 1
4429 0 0
13578 0 0
13249 0 0
6927 0 0
In [42]:
metrics.roc_auc_score( tree_test_pred.actual, tree_test_pred.predicted )
Out[42]:
0.94257590314430306
In [43]:
tree_cm = metrics.confusion_matrix( tree_test_pred.predicted, 
                                   tree_test_pred.actual, 
                                   [1,0] )
sn.heatmap(tree_cm, annot=True,  
           fmt='.2f', 
           xticklabels = ["Left", "No Left"] , 
           yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label');
In [45]:
import pydotplus as pdot
In [46]:
export_graphviz( clf_tree,
                out_file = "hr_tree_1.odt",
                class_names = ['No Left', 'Left'],
                filled = True,
                feature_names = train_X.columns )
chd_tree_graph = pdot.graphviz.graph_from_dot_file( 'hr_tree_1.odt' )
chd_tree_graph.write_jpg( 'hr_tree_1.jpg' )
from IPython.display import Image
Image(filename='hr_tree_1.jpg')
Out[46]:

Finding optimal Depth for Decision Tree

In [47]:
from sklearn.model_selection import GridSearchCV
In [48]:
tuned_parameters = [{'max_depth': range(4,10),
                   'criterion': ['gini', 'entropy']}]
In [49]:
clf_tree = DecisionTreeClassifier()

clf = GridSearchCV(clf_tree,
                 tuned_parameters,
                 cv=5,
                 return_train_score = True,  
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[49]:
GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'max_depth': range(4, 10), 'criterion': ['gini', 'entropy']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=0)
In [50]:
clf.grid_scores_
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[50]:
[mean: 0.97180, std: 0.00540, params: {'max_depth': 4, 'criterion': 'gini'},
 mean: 0.97616, std: 0.00469, params: {'max_depth': 5, 'criterion': 'gini'},
 mean: 0.97687, std: 0.00436, params: {'max_depth': 6, 'criterion': 'gini'},
 mean: 0.97811, std: 0.00280, params: {'max_depth': 7, 'criterion': 'gini'},
 mean: 0.97896, std: 0.00419, params: {'max_depth': 8, 'criterion': 'gini'},
 mean: 0.97762, std: 0.00263, params: {'max_depth': 9, 'criterion': 'gini'},
 mean: 0.97380, std: 0.00369, params: {'max_depth': 4, 'criterion': 'entropy'},
 mean: 0.97994, std: 0.00365, params: {'max_depth': 5, 'criterion': 'entropy'},
 mean: 0.98227, std: 0.00299, params: {'max_depth': 6, 'criterion': 'entropy'},
 mean: 0.98286, std: 0.00168, params: {'max_depth': 7, 'criterion': 'entropy'},
 mean: 0.98309, std: 0.00224, params: {'max_depth': 8, 'criterion': 'entropy'},
 mean: 0.98187, std: 0.00289, params: {'max_depth': 9, 'criterion': 'entropy'}]
In [51]:
clf.best_params_
Out[51]:
{'criterion': 'entropy', 'max_depth': 8}
In [52]:
clf.best_score_
Out[52]:
0.98309125387892149

Random Forest Model

In [53]:
from sklearn.ensemble import RandomForestClassifier
In [54]:
radm_clf = RandomForestClassifier()
radm_clf.fit( train_X, train_y )
Out[54]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [55]:
radm_test_pred = pd.DataFrame( { 'actual':  test_y,
                              'predicted': radm_clf.predict( test_X ) } )
In [56]:
metrics.roc_auc_score( radm_test_pred.actual, radm_test_pred.predicted )
Out[56]:
0.97777241282221627
In [57]:
tree_cm = metrics.confusion_matrix( radm_test_pred.predicted, 
                                   radm_test_pred.actual, 
                                   [1,0] )
sn.heatmap(tree_cm, annot=True,  
           fmt='.2f', 
           xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label');

Grid Search For Optimal Parameters

In [58]:
tuned_parameters = [{'max_depth': [5,10,15],
                   'n_estimators': [10,50,100],
                   'max_features': [0.1,0.3,0.5]}]
In [59]:
radm_clf = RandomForestClassifier()

clf = GridSearchCV(radm_clf,
                 tuned_parameters,
                 cv=5,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[59]:
GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_estimators': [10, 50, 100], 'max_features': [0.1, 0.3, 0.5], 'max_depth': [5, 10, 15]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)
In [60]:
clf.best_params_
Out[60]:
{'max_depth': 15, 'max_features': 0.3, 'n_estimators': 100}
In [61]:
clf.best_score_
Out[61]:
0.99318995635928464

Building the final model

In [62]:
radm_clf = RandomForestClassifier( max_depth = 15,
                                max_features = 0.5,
                                n_estimators = 100)
radm_clf.fit( train_X, train_y )
Out[62]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=15, max_features=0.5, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [63]:
radm_test_pred = pd.DataFrame( { 'actual':  test_y,
                              'predicted': radm_clf.predict( test_X ) } )
In [64]:
metrics.roc_auc_score( radm_test_pred.actual, radm_test_pred.predicted )
Out[64]:
0.97864425240373332
In [65]:
tree_cm = metrics.confusion_matrix( radm_test_pred.predicted, 
                                   radm_test_pred.actual, 
                                   [1,0] )
sn.heatmap(tree_cm, annot=True,  
           fmt='.2f', 
           xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label');

Feature Importance from Random Forest Model

In [66]:
radm_clf.feature_importances_
Out[66]:
array([ 0.00297763,  0.12125744,  0.11377684,  0.16365538,  0.00061652,
        0.00329833,  0.00260713,  0.00085446,  0.00098673,  0.00103632,
        0.00088107,  0.00076572,  0.00055859,  0.00225127,  0.00193248,
        0.00305224,  0.40312821,  0.17636364])
In [69]:
indices = np.argsort(radm_clf.feature_importances_)[::-1]
In [70]:
feature_rank = pd.DataFrame( columns = ['rank', 'feature', 'importance'] )
for f in range(train_X.shape[1]):
    feature_rank.loc[f] = [f+1,
                           train_X.columns[indices[f]],
                           radm_clf.feature_importances_[indices[f]]]
In [71]:
feature_rank
Out[71]:
rank feature importance
0 1 satisfaction_level 0.403128
1 2 time_spend_company 0.176364
2 3 number_project 0.163655
3 4 average_montly_hours 0.121257
4 5 last_evaluation 0.113777
5 6 salary_low 0.003298
6 7 sales_technical 0.003052
7 8 Work_accident_1 0.002978
8 9 salary_medium 0.002607
9 10 sales_sales 0.002251
10 11 sales_support 0.001932
11 12 sales_hr 0.001036
12 13 sales_accounting 0.000987
13 14 sales_management 0.000881
14 15 sales_RandD 0.000854
15 16 sales_marketing 0.000766
16 17 promotion_last_5years_1 0.000617
17 18 sales_product_mng 0.000559
In [72]:
sn.barplot( y = 'feature', x = 'importance', data = feature_rank );

Note:

  • As per the model, the most important features influencing whether an employee leaves the company, in descending order, are

    1. satisfaction_level
    2. time_spend_company
    3. number_project
    4. average_montly_hours
    5. last_evaluation
In [73]:
selected_features = ['satisfaction_level',
                     'number_project',
                     'time_spend_company',
                     'last_evaluation',
                     'average_montly_hours']

Building a Decision Tree with important features

In [74]:
clf_tree = DecisionTreeClassifier( max_depth = 4 )
clf_tree.fit( train_X[selected_features], train_y, )
Out[74]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [75]:
dtree_test_pred = pd.DataFrame( { 'actual':  test_y,
                              'predicted': clf_tree.predict( test_X[selected_features] ) } )
In [76]:
metrics.accuracy_score( dtree_test_pred.actual, 
                        dtree_test_pred.predicted )
Out[76]:
0.96733333333333338
In [77]:
export_graphviz( clf_tree,
                out_file = "hr_tree_2.odt",
                class_names = ['No Left', 'Left'],
                filled = True,
                feature_names = selected_features )
chd_tree_graph = pdot.graphviz.graph_from_dot_file( 'hr_tree_2.odt' )
chd_tree_graph.write_jpg( 'hr_tree_2.jpg' )
from IPython.display import Image
Image(filename='hr_tree_2.jpg')
Out[77]:

KNN Model

In [78]:
from sklearn.neighbors import KNeighborsClassifier
In [79]:
knn_clf = KNeighborsClassifier( n_neighbors = 10 )
knn_clf.fit( train_X, train_y )
Out[79]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform')
In [80]:
knn_test_pred = pd.DataFrame( { 'actual':  test_y,
                              'predicted': knn_clf.predict( test_X ) } )
In [81]:
metrics.roc_auc_score( knn_test_pred.actual, knn_test_pred.predicted )
Out[81]:
0.92426974173296028
In [82]:
tree_cm = metrics.confusion_matrix( knn_test_pred.predicted,
                                    knn_test_pred.actual,
                                   [1,0] )
sn.heatmap(tree_cm, annot=True,  
           fmt='.2f', 
           xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label');

Grid Search for KNN

In [83]:
tuned_parameters = [{'n_neighbors': [5, 10, 15, 20]}]
In [84]:
clf = GridSearchCV(KNeighborsClassifier(),
                 tuned_parameters,
                 cv=5,
                 scoring='roc_auc')

clf.fit(train_X, train_y )
Out[84]:
GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_neighbors': [5, 10, 15, 20]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)
In [85]:
clf.grid_scores_
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20
  DeprecationWarning)
Out[85]:
[mean: 0.96985, std: 0.00354, params: {'n_neighbors': 5},
 mean: 0.97154, std: 0.00315, params: {'n_neighbors': 10},
 mean: 0.97028, std: 0.00360, params: {'n_neighbors': 15},
 mean: 0.96826, std: 0.00381, params: {'n_neighbors': 20}]
In [86]:
clf.best_params_
Out[86]:
{'n_neighbors': 10}
In [88]:
clf.best_score_
Out[88]:
0.97154212393449657

Conclusion:

We learnt the following in this tutorial:

  1. How to build logistic regression, decision tree, random forest and KNN models.
  2. How to derive simple rules from a decision tree, which can be interpreted and used to build retention strategies.
  3. How to find the optimal parameters for a model using grid search.
  4. How to find the most important features using a random forest.