HR - Attrition Analytics - Part 1: Exploratory Analysis¶
Human resources are critical to any organization. Organizations spend a huge amount of time and money to hire and nurture their employees, so it is a big loss when employees leave, especially key people. If HR can predict whether employees are at risk of leaving the company, it can identify attrition risks early, understand the causes, and either provide the support needed to retain those employees or hire preventively to minimize the impact on the organization.
This dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics
Fields in the dataset include:
- Employee satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department
- Salary
- Whether the employee has left
Why are our best and most experienced employees leaving prematurely?¶
Load the dataset¶
import pandas as pd
import numpy as np
hr_df = pd.read_csv( 'HR_comma_sep.csv' )
Let's look at a few records¶
hr_df.head( 5 )
Basic information about columns, types, and any missing data¶
hr_df.info()
Are there any null or missing values in the dataset?¶
hr_df.isnull().any().sum()
No Missing Data. This seems to be a good dataset.¶
hr_df.columns
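Note that `isnull().any().sum()` reports how many columns contain at least one missing value, not how many values are missing. A minimal sketch of the per-column count follows, using a small made-up frame (`toy_df` is a hypothetical stand-in, not the Kaggle data) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy stand-in for hr_df (hypothetical values, not the Kaggle data)
toy_df = pd.DataFrame({
    'satisfaction_level': [0.38, np.nan, 0.11, 0.72],
    'number_project': [2, 5, 7, 5],
    'salary': ['low', 'medium', None, 'low'],
})

# Number of missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Number of columns that contain at least one missing value
# (this is what isnull().any().sum() reports)
columns_with_missing = toy_df.isnull().any().sum()
print(columns_with_missing)
```

On the real data, `hr_df.isnull().sum()` would give the same per-column view.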
Exploratory Analysis¶
How many records of people leaving the company exist in the dataset?¶
hr_left_df = pd.DataFrame( hr_df.left.value_counts() )
hr_left_df
import matplotlib as plt
import seaborn as sn
%matplotlib inline
# Pass x and y by keyword; recent seaborn releases do not accept them positionally
sn.barplot( x = hr_left_df.index, y = hr_left_df.left )
Summary of columns¶
hr_df.describe()
The summary statistics for Work_accident, left and promotion_last_5years do not make sense, as they are categorical variables.¶
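One way to get a more meaningful summary for the 0/1 flags is to treat them as categories before calling `describe()`. The sketch below uses a small made-up frame (`toy_df` is hypothetical) with the same column names as the dataset:

```python
import pandas as pd

# Toy stand-in for hr_df (hypothetical values, not the Kaggle data)
toy_df = pd.DataFrame({
    'Work_accident': [0, 0, 1, 0, 1, 0],
    'left': [1, 0, 0, 0, 1, 0],
    'promotion_last_5years': [0, 0, 0, 1, 0, 0],
})

binary_cols = ['Work_accident', 'left', 'promotion_last_5years']

# As categories, describe() reports count, number of unique values,
# the most frequent value (top) and how often it occurs (freq)
categorical_summary = toy_df[binary_cols].astype('category').describe()
print(categorical_summary)

# Share of each level for a single flag
left_share = toy_df['left'].value_counts(normalize=True)
print(left_share)
```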
How many people, who had work accidents, actually left the company?¶
work_accident_count = hr_df[['Work_accident', 'left']].groupby(['Work_accident', 'left']).size().reset_index()
work_accident_count.columns = ['Work_accident', 'left', 'count']
sn.factorplot(x="Work_accident", y = 'count', hue="left", data=work_accident_count,
size=4, kind="bar", palette="muted")
work_accident_count = hr_df[['Work_accident', 'left']].groupby(['Work_accident', 'left']).size()
work_accident_percent = work_accident_count.groupby(level=[0]).apply(lambda x: x / x.sum()).reset_index()
In terms of percentage¶
work_accident_percent.columns = ['Work_accident', 'left', 'percent']
sn.factorplot(x="Work_accident", y = 'percent', hue="left", data=work_accident_percent,
size=4, kind="bar", palette="muted")
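The within-group percentages can also be computed in one step with `pd.crosstab` and `normalize='index'`, which avoids the `groupby(...).apply(...)` chain. This is a sketch on made-up data (`toy_df` is hypothetical), not the notebook's own method:

```python
import pandas as pd

# Toy stand-in for hr_df (hypothetical values, not the Kaggle data)
toy_df = pd.DataFrame({
    'Work_accident': [0, 0, 0, 0, 1, 1, 1, 1],
    'left':          [0, 0, 1, 1, 0, 0, 0, 1],
})

# normalize='index' divides each row by its row total, so every
# Work_accident group sums to 1 across the left = 0 / left = 1 columns
accident_share = pd.crosstab(toy_df['Work_accident'], toy_df['left'],
                             normalize='index')
print(accident_share)
```

Calling `accident_share.stack().reset_index(name='percent')` would give the long format that the plotting call expects.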
How have work accidents impacted the satisfaction level of employees?¶
sn.distplot( hr_df[hr_df.Work_accident == 1]['satisfaction_level'], color = 'r')
sn.distplot( hr_df[hr_df.Work_accident == 0]['satisfaction_level'], color = 'g')
How do satisfaction levels influence whether employees stay or leave?¶
sn.distplot( hr_df[hr_df.left == 0]['satisfaction_level'], color = 'g')
sn.distplot( hr_df[hr_df.left == 1]['satisfaction_level'], color = 'r')
A large number of the people who left had low satisfaction levels, especially below 0.5. This makes sense. But there is also a surge in departures at higher satisfaction levels. These employees need to be understood and handled with a different strategy.¶
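To put numbers on the pattern seen in the density plots, satisfaction can be binned and the share of leavers computed per bin. The sketch below uses made-up data (`toy_df` is hypothetical) and bin edges chosen only for illustration:

```python
import pandas as pd

# Toy stand-in for hr_df (hypothetical values, not the Kaggle data)
toy_df = pd.DataFrame({
    'satisfaction_level': [0.10, 0.15, 0.40, 0.45, 0.60, 0.65, 0.80, 0.85],
    'left':               [1,    1,    1,    0,    0,    0,    1,    0],
})

# Bin edges are an illustrative assumption, not taken from the notebook
bins = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_df['satisfaction_band'] = pd.cut(toy_df['satisfaction_level'], bins=bins)

# left is coded 0/1, so its mean within a band is the share who left
attrition_by_band = toy_df.groupby('satisfaction_band', observed=False)['left'].mean()
print(attrition_by_band)
```

A second peak in the top band, as in this toy data, would correspond to the group of satisfied employees who still leave.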
Average satisfaction levels for people who leave and stay back in the company¶
# left == 0 selects the employees who stayed
sl_stayed_mean = np.mean( hr_df[hr_df.left == 0]['satisfaction_level'] )
sl_stayed_mean
np.std( hr_df[hr_df.left == 0]['satisfaction_level'] )
np.mean( hr_df[hr_df.left == 1]['satisfaction_level'] )
np.std( hr_df[hr_df.left == 1]['satisfaction_level'] )
Hypothesis Test: Do lower satisfaction levels lead to people leaving the company?¶
$H_{0}$: Average satisfaction level of people leaving is the same as the average satisfaction level of people staying
$H_{1}$: Average satisfaction level of people leaving is less than the average satisfaction level of people staying
from scipy import stats
stats.ttest_ind( hr_df[hr_df.left == 1]['satisfaction_level'], hr_df[hr_df.left == 0]['satisfaction_level'])
The test, which is two-sided by default, establishes that the average satisfaction levels are different.¶
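Since $H_{1}$ is directional (leavers are less satisfied), a one-sided test matches the stated hypothesis more closely than the default two-sided one. The sketch below uses synthetic scores (the means, spreads and sample sizes are made up) and assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument. `equal_var=False` switches to Welch's test, which does not assume equal variances in the two groups:

```python
import numpy as np
from scipy import stats

# Synthetic satisfaction scores (hypothetical, not the Kaggle data)
rng = np.random.default_rng(0)
left_scores = rng.normal(loc=0.45, scale=0.2, size=300).clip(0, 1)
stayed_scores = rng.normal(loc=0.65, scale=0.2, size=900).clip(0, 1)

# alternative='less' tests H1: mean(left_scores) < mean(stayed_scores)
result = stats.ttest_ind(left_scores, stayed_scores,
                         equal_var=False, alternative='less')
print(result.statistic, result.pvalue)
```

On the real data the two samples would be `hr_df[hr_df.left == 1]['satisfaction_level']` and `hr_df[hr_df.left == 0]['satisfaction_level']`.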
How do last evaluation scores influence whether employees stay or leave?¶
# Same colors as the satisfaction plots: green = stayed, red = left
sn.distplot( hr_df[hr_df.left == 0]['last_evaluation'], color = 'g')
sn.distplot( hr_df[hr_df.left == 1]['last_evaluation'], color = 'r')
People with low evaluation scores and very high evaluation scores are leaving, whereas people with average evaluation scores are staying. That seems interesting.¶
How does time spent in the company influence attrition?¶
time_spend_count = hr_df[['time_spend_company', 'left']].groupby(['time_spend_company', 'left']).size()
time_spend_percent = time_spend_count.groupby(level=[0]).apply(lambda x: x / x.sum()).reset_index()
time_spend_percent.columns = ['time_spend_company', 'left', 'percent']
sn.factorplot(x="time_spend_company", y = 'percent', hue="left", data=time_spend_percent,
size=4, kind="bar", palette="muted")
People who have spent 2 years at the company are not leaving. As tenure grows, people start leaving, and attrition is highest after 5 years in the company. But once they cross the 'golden' 7-year mark, they are not leaving.¶
Which department has maximum attrition?¶
dept_count = hr_df[['sales', 'left']].groupby(['sales', 'left']).size()
dept_count_percent = dept_count.groupby(level=[0]).apply(lambda x: x / x.sum()).reset_index()
dept_count_percent.columns = ['dept', 'left', 'percent']
sn.factorplot(y="dept",
x = 'percent',
hue="left",
data = dept_count_percent,
size=6,
kind="bar",
palette="muted")
The percentage of people leaving the company is fairly evenly distributed across departments. Surprisingly, it is high in HR itself and lowest in management.¶
Effect of whether someone got promoted in the last 5 years¶
pd.crosstab( hr_df.promotion_last_5years, hr_df.left )
sn.factorplot(x="promotion_last_5years", hue = 'left', data=hr_df,
size=4, kind="count", palette="muted")
Very few of the people who were promoted in the last 5 years left the company, compared to people who were not promoted in that period.¶
How does salary influence attrition decisions?¶
sn.factorplot(x="salary", hue = 'left', data=hr_df,
size=4, kind="count", palette="muted")
Does higher salary lead to higher satisfaction level?¶
sn.distplot( hr_df[hr_df.salary == 'low']['satisfaction_level'], color = 'b')
sn.distplot( hr_df[hr_df.salary == 'medium']['satisfaction_level'], color = 'g')
sn.distplot( hr_df[hr_df.salary == 'high']['satisfaction_level'], color = 'r')
How are salaries across departments related to attrition?¶
sn.factorplot( y = "sales",
col="salary",
hue = "left",
data=hr_df,
kind="count",
size=5)
No surprises. People with lower salaries have the highest percentage of exits, while people with higher salaries have the lowest.¶
Let's check the correlation between variables¶
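The count plots show absolute numbers, so a table of attrition rates per department and salary band makes the percentage comparison explicit. This is a sketch on made-up data (`toy_df` is hypothetical); it keeps the dataset's column name `sales` for the department:

```python
import pandas as pd

# Toy stand-in for hr_df (hypothetical values, not the Kaggle data)
toy_df = pd.DataFrame({
    'sales':  ['hr', 'hr', 'hr', 'hr', 'IT', 'IT', 'IT', 'IT'],
    'salary': ['low', 'low', 'high', 'high', 'low', 'low', 'high', 'high'],
    'left':   [1, 0, 0, 0, 1, 1, 1, 0],
})

# left is coded 0/1, so the mean per cell is the share of employees who left
attrition_rate = toy_df.pivot_table(index='sales', columns='salary',
                                    values='left', aggfunc='mean')
print(attrition_rate)
```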
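# Restrict to numeric columns; the department and salary columns are text
corrmat = hr_df.select_dtypes( include = 'number' ).corr()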
f, ax = plt.pyplot.subplots(figsize=(6, 6))
sn.heatmap(corrmat, vmax=.8, square=True, annot=True)
plt.pyplot.show()
Some key observations:¶
- Satisfaction level decreases as people spend more time in the company and, interestingly, as they work on more projects.
- Evaluation score is positively correlated with average monthly hours and number of projects.
- As satisfaction level decreases, people tend to leave the company.
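To read the relationship with attrition directly rather than from the heatmap, the `left` column of the correlation matrix can be pulled out and sorted. The sketch below uses made-up data (`toy_df` is hypothetical) with a subset of the dataset's column names:

```python
import pandas as pd

# Toy stand-in for hr_df (hypothetical values, not the Kaggle data)
toy_df = pd.DataFrame({
    'satisfaction_level': [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
    'number_project':     [3, 4, 3, 6, 2, 7],
    'salary':             ['low', 'high', 'medium', 'low', 'low', 'medium'],
    'left':               [0, 0, 0, 1, 1, 1],
})

# Keep numeric columns only, then take each variable's correlation with left
numeric_df = toy_df.select_dtypes(include='number')
corr_with_left = numeric_df.corr()['left'].drop('left').sort_values()
print(corr_with_left)
```

The most negative entries come first, so a variable such as `satisfaction_level` that falls as attrition rises appears at the top.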