HR - Attrition Analytics - Part 2: Predict Attrition

Human resources are critical to any organization. Organizations spend a huge amount of time and money hiring and nurturing their employees, so it is a big loss when employees leave, especially key people. If HR can predict whether employees are at risk of leaving, it can identify attrition risks early, understand and provide the necessary support to retain those employees, or do preventive hiring to minimize the impact on the organization.

The dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics

Fields in the dataset include:

  • Employee satisfaction level
  • Last evaluation
  • Number of projects
  • Average monthly hours
  • Time spent at the company
  • Whether they have had a work accident
  • Whether they have had a promotion in the last 5 years
  • Department
  • Salary
  • Whether the employee has left

Given that we explored the relationships between the different attributes in Part 1, can we build a model to predict whether an employee will leave the company?

In [1]:
import pandas as pd
import numpy as np

Loading the dataset

In [2]:
hr_df = pd.read_csv( 'HR_comma_sep.csv' )
In [3]:
hr_df[0:5]
Out[3]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [4]:
hr_df.columns
Out[4]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
     'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
     'promotion_last_5years', 'sales', 'salary'],
    dtype='object')

Encoding Categorical Features

In [5]:
numerical_features = ['satisfaction_level', 'last_evaluation', 'number_project',
     'average_montly_hours', 'time_spend_company']
In [6]:
categorical_features = ['Work_accident','promotion_last_5years', 'sales', 'salary']

A utility function to create dummy variables

In [7]:
def create_dummies( df, colname ):
    """One-hot encode a column and drop the original column."""
    col_dummies = pd.get_dummies( df[colname], prefix = colname )
    # drop the first dummy column to avoid the dummy-variable trap
    col_dummies.drop( col_dummies.columns[0], axis = 1, inplace = True )
    df = pd.concat( [df, col_dummies], axis = 1 )
    df.drop( colname, axis = 1, inplace = True )
    return df
In [8]:
for c_feature in categorical_features:
  hr_df = create_dummies( hr_df, c_feature )
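As an aside, the same encoding with the first level dropped can be done in a single call with pandas' built-in `drop_first` option; a minimal sketch on a toy frame (not the HR data):

```python
import pandas as pd

# toy frame with one categorical column
df = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})

# drop_first=True drops the first (alphabetical) level,
# just as create_dummies drops the first dummy column
encoded = pd.get_dummies(df, columns=['salary'], drop_first=True)
print(encoded.columns.tolist())  # → ['salary_low', 'salary_medium']
```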
In [9]:
hr_df[0:5]
Out[9]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company left Work_accident_1 promotion_last_5years_1 sales_RandD sales_accounting sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical salary_low salary_medium
0 0.38 0.53 2 157 3 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
1 0.80 0.86 5 262 6 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
2 0.11 0.88 7 272 4 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0
3 0.72 0.87 5 223 5 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
4 0.37 0.52 2 159 3 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0

Splitting the dataset

In [10]:
feature_columns = hr_df.columns.difference( ['left'] )
In [11]:
feature_columns
Out[11]:
Index(['Work_accident_1', 'average_montly_hours', 'last_evaluation',
     'number_project', 'promotion_last_5years_1', 'salary_low',
     'salary_medium', 'sales_RandD', 'sales_accounting', 'sales_hr',
     'sales_management', 'sales_marketing', 'sales_product_mng',
     'sales_sales', 'sales_support', 'sales_technical', 'satisfaction_level',
     'time_spend_company'],
    dtype='object')
In [12]:
from sklearn.model_selection import train_test_split


train_X, test_X, train_y, test_y = train_test_split( hr_df[feature_columns],
                                                  hr_df['left'],
                                                  test_size = 0.2,
                                                  random_state = 42 )
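The split above samples rows at random. Since leavers are a minority class, an optional variation (not used in this notebook) is to pass `stratify` so that train and test preserve the same leaver ratio; a sketch on toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame mirroring the roughly 24% leaver share in the HR data
df = pd.DataFrame({'x': range(100), 'left': [0] * 76 + [1] * 24})

train_X, test_X, train_y, test_y = train_test_split(
    df[['x']], df['left'],
    test_size = 0.2,
    stratify = df['left'],   # preserve the 0/1 ratio in both splits
    random_state = 42)

print(test_y.mean())         # close to the overall leaver ratio of 0.24
```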

Building Models

Logistic Regression Model

In [13]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit( train_X, train_y )
Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
        penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
        verbose=0, warm_start=False)
In [14]:
list( zip( feature_columns, logreg.coef_[0] ) )
Out[14]:
[('Work_accident_1', -1.492662273241121),
('average_montly_hours', 0.0049756295542857168),
('last_evaluation', 0.59258560152125361),
('number_project', -0.30373338699890184),
('promotion_last_5years_1', -1.2172794666390891),
('salary_low', 1.8131727427552242),
('salary_medium', 1.3088620777422102),
('sales_RandD', -0.5707635224786074),
('sales_accounting', 0.0930031125057868),
('sales_hr', 0.35887723308063302),
('sales_management', -0.36238815703379673),
('sales_marketing', 0.13047437008142895),
('sales_product_mng', 0.023809246134627697),
('sales_sales', 0.075841827611528925),
('sales_support', 0.1349394388253958),
('sales_technical', 0.1954553863872413),
('satisfaction_level', -4.1082674603754104),
('time_spend_company', 0.26529857471457918)]
In [15]:
logreg.intercept_
Out[15]:
array([-1.53003346])

Predicting the test cases

In [16]:
hr_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': logreg.predict( test_X ) } )
In [17]:
hr_test_pred = hr_test_pred.reset_index()

Comparing the predictions with actual test data

In [18]:
hr_test_pred.sample( n = 10 )
Out[18]:
index actual predicted
2049 299 1 0
722 3477 0 0
1469 11693 0 1
190 14836 1 0
1301 10328 0 0
1831 9605 0 0
729 13275 0 0
1385 9639 0 0
153 14920 1 0
90 6248 0 0

Creating a confusion matrix

In [19]:
from sklearn import metrics

cm = metrics.confusion_matrix( hr_test_pred.actual,
                            hr_test_pred.predicted, labels = [1, 0] )
cm
Out[19]:
array([[ 225,  481],
     [ 175, 2119]])
In [20]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
In [21]:
sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[21]:
<matplotlib.text.Text at 0x11a58c9e8>
In [22]:
score = metrics.accuracy_score( hr_test_pred.actual, hr_test_pred.predicted )
round( float(score), 2 )
Out[22]:
0.78

Observation:

  • Overall test accuracy is 78%, but accuracy is not a good measure here. The number looks high because most employees did not leave and the model predicts "not left" for most of them.

  • The objective of the model is to identify the people who will leave, so that the company can intervene and act.

  • The default cutoff may be the problem: the model labels an employee as a leaver only when the predicted probability of leaving exceeds 0.5.
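A quick way to see this imbalance is the class distribution of the target; a sketch on a toy series with the same roughly 24% leaver share as the dataset:

```python
import pandas as pd

# toy target column: roughly 24% leavers, as in the HR data
left = pd.Series([0] * 76 + [1] * 24)
print(left.value_counts(normalize=True))  # share of each class
```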

Predict Probability

In [23]:
test_X[:1]
Out[23]:
Work_accident_1 average_montly_hours last_evaluation number_project promotion_last_5years_1 salary_low salary_medium sales_RandD sales_accounting sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical satisfaction_level time_spend_company
6723 1.0 226 0.96 5 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.65 2
In [24]:
logreg.predict_proba( test_X[:1] )
Out[24]:
array([[ 0.97203474,  0.02796526]])

Note:

The model predicts that the probability of this employee leaving the company is only 0.028, which is very low.

How good is the model?

In [25]:
predict_proba_df = pd.DataFrame( logreg.predict_proba( test_X ) )
predict_proba_df.head()
Out[25]:
0 1
0 0.972035 0.027965
1 0.917792 0.082208
2 0.770442 0.229558
3 0.523038 0.476962
4 0.975843 0.024157
In [26]:
hr_test_pred = pd.concat( [hr_test_pred, predict_proba_df], axis = 1 )
In [27]:
hr_test_pred.columns = ['index', 'actual', 'predicted', 'Left_0', 'Left_1']
In [28]:
auc_score = metrics.roc_auc_score( hr_test_pred.actual, hr_test_pred.Left_1  )
round( float( auc_score ), 2 )
Out[28]:
0.81
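ROC AUC measures how well the model ranks leavers above stayers: 1.0 is a perfect ranking, 0.5 is random guessing. A minimal illustration on four hand-made points:

```python
from sklearn import metrics

actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (leaver, stayer) pairs are ranked correctly -> AUC = 0.75
print(metrics.roc_auc_score(actual, scores))
```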
In [29]:
sn.distplot( hr_test_pred[hr_test_pred.actual == 1]["Left_1"], color = 'b' )
sn.distplot( hr_test_pred[hr_test_pred.actual == 0]["Left_1"], color = 'g' )
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a5c9828>

Finding the optimal cutoff probability

In [30]:
fpr, tpr, thresholds = metrics.roc_curve( hr_test_pred.actual,
                                     hr_test_pred.Left_1,
                                     drop_intermediate = False )

plt.figure(figsize=(6, 4))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
In [31]:
thresholds[0:10]
Out[31]:
array([ 1.91905403,  0.91905403,  0.90057485,  0.88605896,  0.88001362,
      0.87933851,  0.87233301,  0.86974565,  0.86193267,  0.85880292])
In [32]:
fpr[0:10]
Out[32]:
array([ 0.        ,  0.00087184,  0.00130776,  0.00174368,  0.0021796 ,
      0.00261552,  0.00305144,  0.00348736,  0.00392328,  0.0043592 ])
In [33]:
tpr[0:10]
Out[33]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
In [34]:
cutoff_prob = thresholds[(np.abs(tpr - 0.7)).argmin()]
In [35]:
round( float( cutoff_prob ), 2 )
Out[35]:
0.28
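The cutoff above is chosen to hit a true positive rate of about 0.7. A common alternative (not used in this notebook) is Youden's J statistic, which picks the threshold maximizing `tpr - fpr`; a sketch on hand-made ROC arrays:

```python
import numpy as np

# toy ROC arrays, shaped like the output of metrics.roc_curve
fpr = np.array([0.0, 0.1, 0.2, 0.4, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 0.9, 1.0])
thresholds = np.array([1.9, 0.7, 0.5, 0.3, 0.0])

# Youden's J = tpr - fpr; pick the threshold where it peaks
best_cutoff = thresholds[np.argmax(tpr - fpr)]
print(best_cutoff)  # → 0.5
```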

Predicting with new cut-off probability

In [36]:
hr_test_pred['new_labels'] = hr_test_pred['Left_1'].map( lambda x: 1 if x >= 0.28 else 0 )
In [37]:
hr_test_pred[0:10]
Out[37]:
index actual predicted Left_0 Left_1 new_labels
0 6723 0 0 0.972035 0.027965 0
1 6473 0 0 0.917792 0.082208 0
2 4679 0 0 0.770442 0.229558 0
3 862 1 0 0.523038 0.476962 1
4 7286 0 0 0.975843 0.024157 0
5 8127 0 0 0.722851 0.277149 0
6 3017 0 0 0.985596 0.014404 0
7 3087 0 1 0.130254 0.869746 1
8 6425 0 0 0.769714 0.230286 0
9 2250 0 1 0.398617 0.601383 1
In [38]:
cm = metrics.confusion_matrix( hr_test_pred.actual,
                          hr_test_pred.new_labels, labels = [1, 0] )
sn.heatmap(cm, annot=True,  fmt='.2f', xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )
plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[38]:
<matplotlib.text.Text at 0x11acf5400>
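To quantify the gain from the new cutoff, recall (the share of actual leavers caught) is more relevant than accuracy here; a sketch with hypothetical probabilities, not values from the notebook:

```python
import numpy as np
from sklearn import metrics

# hypothetical predicted leave-probabilities and true labels
probs = np.array([0.05, 0.30, 0.20, 0.60, 0.10, 0.35])
actual = np.array([0, 1, 1, 1, 0, 0])

# relabel with the 0.28 cutoff instead of the default 0.5
new_labels = (probs >= 0.28).astype(int)
print(metrics.recall_score(actual, new_labels))  # 2 of 3 leavers caught
```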

Building Decision Tree

In [39]:
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV
In [40]:
param_grid = {'max_depth': np.arange(3, 10)}

tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv = 10)
tree.fit( train_X, train_y )
Out[40]:
GridSearchCV(cv=10, error_score='raise',
     estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
          max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          presort=False, random_state=None, splitter='best'),
     fit_params={}, iid=True, n_jobs=1,
     param_grid={'max_depth': array([3, 4, 5, 6, 7, 8, 9])},
     pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
In [41]:
tree.best_params_
Out[41]:
{'max_depth': 9}
In [42]:
tree.best_score_
Out[42]:
0.98058171514292858

Build Final Decision Tree Model

In [43]:
clf_tree = DecisionTreeClassifier( max_depth = 9 )
clf_tree.fit( train_X, train_y, )
Out[43]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
          max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          presort=False, random_state=None, splitter='best')

Observation:

Wow! The cross-validated accuracy is about 98%, a big improvement over the logistic regression model.

In [44]:
tree_test_pred = pd.DataFrame( { 'actual':  test_y,
                            'predicted': clf_tree.predict( test_X ) } )
In [45]:
tree_test_pred.sample( n = 10 )
Out[45]:
actual predicted
1964 1 1
14335 1 1
5646 0 0
3214 0 0
6847 0 0
9963 0 0
184 1 1
10525 0 0
9237 0 0
13514 0 0
In [46]:
metrics.accuracy_score( tree_test_pred.actual, tree_test_pred.predicted )
Out[46]:
0.97866666666666668
In [47]:
tree_cm = metrics.confusion_matrix( tree_test_pred.actual,
                                 tree_test_pred.predicted,
                                 labels = [1, 0] )
sn.heatmap(tree_cm, annot=True,
         fmt='.2f',
         xticklabels = ["Left", "No Left"] , yticklabels = ["Left", "No Left"] )

plt.ylabel('True label')
plt.xlabel('Predicted label')
Out[47]:
<matplotlib.text.Text at 0x11afb1f28>
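Beyond accuracy, the fitted tree exposes `feature_importances_`, which shows which attributes drive the splits. A self-contained sketch on synthetic data (the feature names here are illustrative, not taken from the HR dataset):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = pd.DataFrame({'satisfaction_level': rng.rand(200),
                  'noise': rng.rand(200)})
# the target depends only on satisfaction_level
y = (X['satisfaction_level'] < 0.3).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```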

Generate rules from the decision tree

In [48]:
export_graphviz( clf_tree,
              out_file = "hr_tree.odt",
              feature_names = train_X.columns )
In [49]:
import pydotplus as pdot

chd_tree_graph = pdot.graphviz.graph_from_dot_file( 'hr_tree.odt' )
In [50]:
chd_tree_graph.write_jpg( 'hr_tree.jpg' )
Out[50]:
True
In [51]:
from IPython.display import Image
Image(filename='hr_tree.jpg')
Out[51]:
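Note that the graphviz round-trip above requires the graphviz binaries to be installed. Newer scikit-learn versions (0.21+) can draw the tree directly with `sklearn.tree.plot_tree`; a sketch on a small tree fitted to the bundled iris data:

```python
import matplotlib
matplotlib.use('Agg')            # render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

plt.figure(figsize=(10, 6))
plot_tree(clf, filled=True)      # draws the tree with matplotlib only
plt.savefig('toy_tree.png')
```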