HR - Attrition Analytics - Part 1: Exploratory Analysis

Human resources are critical to any organization. Organizations spend a huge amount of time and money to hire and nurture their employees, so it is a big loss when employees leave, especially key people. If HR can predict whether employees are at risk of leaving, they can identify attrition risks early and either provide the support needed to retain those employees or do preventive hiring to minimize the impact on the organization.

This dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics

Fields in the dataset include:

  • Employee satisfaction level
  • Last evaluation
  • Number of projects
  • Average monthly hours
  • Time spent at the company
  • Whether they have had a work accident
  • Whether they have had a promotion in the last 5 years
  • Department
  • Salary
  • Whether the employee has left

Why are our best and most experienced employees leaving prematurely?

Load the dataset

In [2]:
import pandas as pd
import numpy as np
In [3]:
hr_df = pd.read_csv( 'HR_comma_sep.csv' )

Let's look at a few records

In [4]:
hr_df.head( 5 )
Out[4]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

Basic information about columns, their types, and any missing data

In [5]:
hr_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

Are there any null or missing values in the dataset?

In [6]:
hr_df.isnull().any().sum()
Out[6]:
0

No missing data. This seems to be a clean dataset.
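Note that `isnull().any().sum()` counts the number of *columns* containing at least one null, not the number of missing cells. A minimal sketch on a synthetic frame (a stand-in for `hr_df`, which is not needed here) distinguishing the two idioms:

```python
import pandas as pd
import numpy as np

# Tiny synthetic frame with one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1, 2, 3]})

cols_with_nulls = df.isnull().any().sum()  # number of COLUMNS containing nulls -> 1
total_nulls = df.isnull().sum().sum()      # total number of null CELLS -> 1
per_column = df.isnull().sum()             # null count per column: a=1, b=0
```

For this dataset both idioms return 0, so the distinction does not matter here, but on partially missing data they can differ.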

In [7]:
hr_df.columns
Out[7]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
     'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
     'promotion_last_5years', 'sales', 'salary'],
    dtype='object')

Exploratory Analysis

How many records of people leaving the company exist in the dataset?

In [8]:
hr_left_df = pd.DataFrame( hr_df.left.value_counts() )
In [9]:
hr_left_df
Out[9]:
left
0 11428
1 3571
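These counts can also be expressed as proportions directly with `value_counts(normalize=True)`. A sketch, rebuilding the `left` column as a synthetic series from the counts above:

```python
import pandas as pd

# Synthetic stand-in for hr_df.left, rebuilt from the counts above.
left = pd.Series([0] * 11428 + [1] * 3571, name="left")

# Share of employees who left; roughly 0.238, i.e. ~24% attrition.
attrition_rate = left.value_counts(normalize=True)[1]
```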
In [10]:
import matplotlib as plt
import seaborn as sn
%matplotlib inline
In [11]:
sn.barplot( hr_left_df.index, hr_left_df.left )
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x117bad390>

Summary of columns

In [12]:
hr_df.describe()
Out[12]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

The summary statistics for Work_accident, left and promotion_last_5years do not make sense as-is, because they are categorical (0/1) variables.
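One nuance worth noting: for a 0/1 indicator the mean is still interpretable, since it equals the proportion of 1s, even though quartiles and standard deviation read oddly. A small sketch on a synthetic indicator column:

```python
import pandas as pd

# Synthetic 0/1 indicator (stand-in for the `left` column).
s = pd.Series([0, 1, 0, 0, 1], name="left")

attrition_rate = s.mean()   # mean of a 0/1 column = proportion of 1s -> 0.4
counts = s.value_counts()   # categorical view: 0 -> 3, 1 -> 2
```

This is why the `left` mean of 0.238 in the table above matches the ~24% attrition rate seen earlier.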

How many people, who had work accidents, actually left the company?

In [13]:
work_accident_count = hr_df[['Work_accident', 'left']].groupby(['Work_accident', 'left']).size().reset_index()
work_accident_count.columns = ['Work_accident', 'left', 'count']

sn.factorplot(x="Work_accident", y = 'count', hue="left", data=work_accident_count,
               size=4, kind="bar", palette="muted")
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x113f49ac8>
In [14]:
work_accident_count = hr_df[['Work_accident', 'left']].groupby(['Work_accident', 'left']).size()
work_accident_percent = work_accident_count.groupby(level=[0]).apply(lambda x: x / x.sum()).reset_index()

The same breakdown, in terms of percentages

In [15]:
work_accident_percent.columns = ['Work_accident', 'left', 'percent']
In [16]:
sn.factorplot(x="Work_accident", y = 'percent', hue="left", data=work_accident_percent,
               size=4, kind="bar", palette="muted")
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x11a6224a8>
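The `groupby(level=0).apply(lambda x: x / x.sum())` idiom used above normalizes counts within each Work_accident group. An equivalent, index-preserving variant uses `transform`, which avoids the index reshuffling that `apply` can cause in newer pandas versions. A sketch on toy counts (the numbers here are illustrative, not from the dataset):

```python
import pandas as pd

# Toy counts indexed by (Work_accident, left); values are illustrative only.
counts = pd.Series(
    [10, 30, 5, 5],
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 0), (1, 1)],
        names=["Work_accident", "left"],
    ),
)

# Divide each count by its group total; transform keeps the original index.
percent = counts / counts.groupby(level=0).transform("sum")
# percent -> (0,0): 0.25, (0,1): 0.75, (1,0): 0.50, (1,1): 0.50
```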

How have work accidents impacted the satisfaction level of the employees?

In [18]:
sn.distplot( hr_df[hr_df.Work_accident == 1]['satisfaction_level'], color = 'r')
sn.distplot( hr_df[hr_df.Work_accident == 0]['satisfaction_level'], color = 'g')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a943278>

How do satisfaction levels influence whether employees stay or leave the company?

In [19]:
sn.distplot( hr_df[hr_df.left == 0]['satisfaction_level'], color = 'g')
sn.distplot( hr_df[hr_df.left == 1]['satisfaction_level'], color = 'r')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab67c88>

It can be noted that a large number of people with lower satisfaction levels have left the company, especially those with satisfaction below 0.5, which makes sense. But there is also a surge at the higher satisfaction levels; these employees need to be understood and retained with a different strategy.

Average satisfaction levels for people who leave vs. those who stay with the company

In [23]:
sl_left_mean = np.mean( hr_df[hr_df.left == 0]['satisfaction_level'] )
sl_left_mean
Out[23]:
0.666809590479516
In [24]:
np.std( hr_df[hr_df.left == 0]['satisfaction_level'] )
Out[24]:
0.21709425554771716
In [25]:
np.mean( hr_df[hr_df.left == 1]['satisfaction_level'] )
Out[25]:
0.44009801176140917
In [26]:
np.std( hr_df[hr_df.left == 1]['satisfaction_level'] )
Out[26]:
0.2638964784854295

Hypothesis Test: Do lower satisfaction levels lead to people leaving the company?

  • $H_{0}$: The average satisfaction level of people leaving is the same as that of people staying

  • $H_{1}$: The average satisfaction level of people leaving is less than that of people staying

In [27]:
from scipy import stats

stats.ttest_ind( hr_df[hr_df.left == 1]['satisfaction_level'], hr_df[hr_df.left == 0]['satisfaction_level'])
Out[27]:
Ttest_indResult(statistic=-51.61280155890104, pvalue=0.0)

With a p-value of effectively zero, the test rejects the null hypothesis: the average satisfaction levels of the two groups are different.
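Note that `ttest_ind` is two-sided by default and assumes equal variances, whereas $H_1$ above is one-sided. Recent SciPy versions (1.6+) accept an `alternative` keyword, and `equal_var=False` gives Welch's test, which does not assume equal variances. A sketch on synthetic data drawn to mimic the group means and standard deviations computed above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic satisfaction scores mimicking the observed group statistics.
stayed = rng.normal(0.67, 0.22, 500)
left_ = rng.normal(0.44, 0.26, 300)

# Welch's t-test with the one-sided alternative mean(left) < mean(stayed).
t_stat, p_value = stats.ttest_ind(left_, stayed, equal_var=False,
                                  alternative="less")
```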

How do last evaluation scores influence whether employees stay or leave the company?

In [28]:
sn.distplot( hr_df[hr_df.left == 0]['last_evaluation'], color = 'g')
sn.distplot( hr_df[hr_df.left == 1]['last_evaluation'], color = 'r')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x11aea3278>

People with low evaluations and very high evaluations are leaving, whereas people with average evaluation scores are staying. That seems interesting.

How time spent in company influences attrition?

In [29]:
time_spend_count = hr_df[['time_spend_company', 'left']].groupby(['time_spend_company', 'left']).size()
time_spend_percent = time_spend_count.groupby(level=[0]).apply(lambda x: x / x.sum()).reset_index()
time_spend_percent.columns = ['time_spend_company', 'left', 'percent']
In [30]:
sn.factorplot(x="time_spend_company", y = 'percent', hue="left", data=time_spend_percent,
               size=4, kind="bar", palette="muted")
Out[30]:
<seaborn.axisgrid.FacetGrid at 0x11a81b908>

People who have spent only 2 years at the company are mostly staying. As tenure grows, attrition rises, peaking at 5 years. But once employees cross the golden 7-year mark, they are not leaving.

Which department has maximum attrition?

In [31]:
dept_count = hr_df[['sales', 'left']].groupby(['sales', 'left']).size()
dept_count_percent = dept_count.groupby(level=[0]).apply(lambda x: x / x.sum()).reset_index()
dept_count_percent.columns = ['dept', 'left', 'percent']
sn.factorplot(y="dept",
            x = 'percent',
            hue="left",
            data = dept_count_percent,
            size=6,
            kind="bar",
            palette="muted")
Out[31]:
<seaborn.axisgrid.FacetGrid at 0x11b38f7f0>

The percentage of people leaving the company is fairly evenly distributed across departments. Surprisingly, the percentage is highest in HR itself, and lowest in management.

Effect of whether someone got promoted in last 5 years

In [33]:
pd.crosstab( hr_df.promotion_last_5years, hr_df.left )
Out[33]:
left 0 1
promotion_last_5years
0 11128 3552
1 300 19
In [34]:
sn.factorplot(x="promotion_last_5years", hue = 'left', data=hr_df,
               size=4, kind="count", palette="muted")
Out[34]:
<seaborn.axisgrid.FacetGrid at 0x11b7d45f8>

Very few people who were promoted in the last 5 years left the company, compared to those who were not promoted.
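The visual difference can be backed by a chi-square test of independence on the crosstab counts above; a minimal sketch:

```python
import numpy as np
from scipy import stats

# Counts from the crosstab above:
# rows = promotion_last_5years (0, 1), cols = left (0, 1).
table = np.array([[11128, 3552],
                  [  300,   19]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
# A tiny p-value means promotion status and attrition are not independent.
```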

How does salary influence attrition decisions?

In [137]:
sn.factorplot(x="salary", hue = 'left', data=hr_df,
               size=4, kind="count", palette="muted")
Out[137]:
<seaborn.axisgrid.FacetGrid at 0x13e6f0fd0>

Does higher salary lead to higher satisfaction level?

In [164]:
sn.distplot( hr_df[hr_df.salary == 'low']['satisfaction_level'], color = 'b')
sn.distplot( hr_df[hr_df.salary == 'medium']['satisfaction_level'], color = 'g')
sn.distplot( hr_df[hr_df.salary == 'high']['satisfaction_level'], color = 'r')
Out[164]:
<matplotlib.axes._subplots.AxesSubplot at 0x141940b70>
In [118]:
sn.factorplot( y = "sales",
            col="salary",
            hue = "left",
            data=hr_df,
            kind="count",
            size=5)
Out[118]:
<seaborn.axisgrid.FacetGrid at 0x126586828>

No surprises: people with the lowest salaries have the highest percentage of exodus, while people with the highest salaries leave the least.

Let's check the correlation between variables

In [131]:
corrmat = hr_df.corr()
f, ax = plt.pyplot.subplots(figsize=(6, 6))
sn.heatmap(corrmat, vmax=.8, square=True, annot=True)
plt.pyplot.show()

Some key observations:

  • Satisfaction level reduces as people spend more time in the company and, interestingly, as they work on more projects.
  • Evaluation score is positively correlated with average monthly hours and number of projects.
  • As satisfaction level reduces, people tend to leave the company.
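One caveat when reproducing the heatmap: newer pandas versions raise on `DataFrame.corr()` when object columns like sales and salary are present, so the numeric columns should be selected first (or `numeric_only=True` passed). A sketch on a small synthetic frame:

```python
import pandas as pd

# Synthetic mixed-type frame (stand-in for hr_df).
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "left": [1, 1, 1, 0],
    "salary": ["low", "medium", "medium", "low"],  # object column
})

# Restrict to numeric columns before correlating.
corrmat = df.select_dtypes("number").corr()
```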