Machine Learning Employees Attrition


In this machine learning project we look at why employees leave a company. Which factors push employees to leave? Which department is most affected, and why? Here is what I have done in this project:




1. Analyzing the data

2. Finding which department is most affected

3. Finding the reasons behind attrition

4. Removing unnecessary data from our dataset

5. Converting our dataset into a form a machine learning model can train on


The machine learning models used here:

1. Logistic Regression (accuracy and confusion matrix)

2. Random Forest Classifier (accuracy and confusion matrix)


GitHub link for the project



The first thing we need to do is understand what kind of data we have. Then we remove the unnecessary columns, fill in null values if there are any, visualise the data, and finally train our machine learning models.


The key to success in any organization is attracting and retaining top talent. I’m an HR analyst at my company, and one of my tasks is to determine which factors keep employees at my company and which prompt others to leave. I need to know what factors I can change to prevent the loss of good people. 


Content

I have data about past and current employees in a spreadsheet on my desktop. It has various data points on our employees, but I’m most interested in whether they’re still with my company or whether they’ve gone to work somewhere else. And I want to understand how this relates to workforce attrition.


Several columns store ratings as numeric codes:

Education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor'

EnvironmentSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

JobInvolvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

JobSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

PerformanceRating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'

RelationshipSatisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'

WorkLifeBalance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best'
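When reading plots or tables it can help to turn these numeric codes back into their labels. The snippet below is a minimal sketch of one way to do that with pandas. The two dictionaries are copied from the list above; the three sample rows are made up for illustration and are not taken from employee.csv.

```python
import pandas as pd

# Label maps copied from the data dictionary above.
EDUCATION = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}
WORK_LIFE_BALANCE = {1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}

# Made-up rows for illustration only; the real values come from employee.csv.
sample = pd.DataFrame({'Education': [1, 3, 5], 'WorkLifeBalance': [4, 2, 1]})

# Series.map replaces each code with its label and leaves the original frame untouched.
decoded = sample.assign(
    Education=sample['Education'].map(EDUCATION),
    WorkLifeBalance=sample['WorkLifeBalance'].map(WORK_LIFE_BALANCE),
)
print(decoded)
```

Keep the numeric columns for training, though. The decoded copy is only for reading and plotting.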



Step 1


Import the libraries needed to load the data and plot graphs:

numpy is for numerical calculations

matplotlib and seaborn are both used for plotting graphs

pandas is for data handling


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

d = pd.read_csv('employee.csv')
d.head()

Step 2 : Data Exploration


d.info() # shows the number of rows and columns, the data type of each column, and the non-null counts

d.describe() # shows the mean, standard deviation, and other summary statistics for the numeric columns



# Categorical columns

d.select_dtypes(include='object').columns # object (text) columns can't be fed directly into a machine learning model; here we only list their names and see how many there are, and we will convert them later


Step 3 : Removing All the Unnecessary Data


Looking at the data in Excel, we see that some columns are useless and do not need to be passed to the model. Examples are the employee ID (EmployeeNumber) and EmployeeCount, which has the same value in every row and so carries no information for the model. In the same way I identified the other useless columns, so we are going to drop them.


dataset = d.drop(columns=['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'])

dataset.shape # the dataset has fewer columns now

dataset.isnull().any() # checks whether each column contains any null value

dataset.isnull().sum() # counts the null values in each column
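Instead of spotting constant columns by eye in Excel, they can also be found in code. The helper below is a minimal sketch, not part of the original notebook: `constant_columns` is a hypothetical name, and the toy values are made up for illustration. Note that it only catches columns with a single distinct value; an ID column such as EmployeeNumber still has to be dropped by hand.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'EmployeeCount': [1, 1, 1],
    'StandardHours': [80, 80, 80],
    'Age': [41, 49, 37],
})

print(constant_columns(toy))
```

On the real data this would be called as `constant_columns(d)` before the `drop` step.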

Step 4 : Visualising Our DataSet

sns.countplot(x='Attrition', data=dataset) # shows how many employees leave and how many stay

leaving = (dataset.Attrition=='Yes').sum() # Attrition == 'Yes' marks the employees who leave
staying = (dataset.Attrition=='No').sum()
leaving_pt = leaving*100/(leaving+staying)
print(' ',leaving,'Employee want to leave',' ',staying, 'want to stay')
print(' ',leaving_pt,'% Percentage of employee want to leave')


Output :
237 Employee want to leave   1233 want to stay
 16.122448979591837 % Percentage of employee want to leave
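The same counts and percentages can be obtained in one step with `value_counts`. This is a small sketch, not the original notebook code: it rebuilds a stand-in Attrition column from the two counts printed above instead of reading employee.csv.

```python
import pandas as pd

# Stand-in column built from the counts printed above (237 'Yes', 1233 'No').
attrition = pd.Series(['Yes'] * 237 + ['No'] * 1233, name='Attrition')

counts = attrition.value_counts()                       # absolute counts per class
shares = attrition.value_counts(normalize=True) * 100   # percentage per class

print(counts)
print(shares.round(2))
```

On the real data the equivalent call would be `dataset['Attrition'].value_counts(normalize=True)`. The split is heavily imbalanced, which matters later when reading the accuracy scores.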


plt.subplots(figsize=(12,4))
sns.countplot(x='Age', hue='Attrition', data=dataset, palette = 'colorblind')



# shows which job roles have the highest number of employees leaving
plt.figure(figsize=(25,10))
sns.countplot(x ='JobRole',hue ='Attrition',data = dataset)


plt.figure(figsize=(15,15))
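sns.heatmap(dataset.corr(numeric_only=True),annot =True,fmt='.0%') # numeric_only=True skips the object columns, which recent pandas versions refuse to correlate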

Step 5 : Handling Object Data

We handle object data because it can't be sent directly to a machine learning model: the models we are using only understand numbers, not words. So first we will see how many object columns we have, and then we will convert them into numeric indicator columns (uint8 in older pandas versions, bool from pandas 2.0 on).

dataset.select_dtypes(include ='object').columns # lists the object columns; we ran this command earlier as well, so compare the two outputs (run it on your system or see my GitHub for a better understanding)


dataset = pd.get_dummies(data=dataset, drop_first=True) # one-hot encodes the object columns into numeric indicator columns

dataset.info() # run this command and check the new column types yourself

dataset.rename(columns={'Attrition_Yes':'Attrition'}, inplace=True) # renames the encoded target column back to Attrition
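To see what `get_dummies` with `drop_first=True` does, here is a minimal sketch on a three-row toy frame. The rows and the Department values are made up for illustration. Each object column becomes one indicator column per category, minus the first category in sorted order, which is why the target ends up as `Attrition_Yes` and needs renaming.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Age': [41, 49, 37],
    'Attrition': ['Yes', 'No', 'Yes'],
    'Department': ['Sales', 'Research & Development', 'Sales'],
})

# Numeric columns pass through unchanged; object columns are one-hot encoded.
encoded = pd.get_dummies(toy, drop_first=True)
encoded = encoded.rename(columns={'Attrition_Yes': 'Attrition'})

print(encoded.columns.tolist())
print(encoded['Attrition'].astype(int).tolist())
```

Dropping the first category avoids a redundant column: with two categories, one indicator already tells you everything.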


Step 6 : Data Splitting


x = dataset.drop(columns='Attrition') #include all except Attrition
y = dataset['Attrition'] #include only Attrition data column 
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

This splits the data into training data and testing data.
test_size = 0.2 means that 20% of our data is used for testing and the rest is used for training
random_state = 42 means that we get the same split every time we run this command

x_train = the input variables (every column except Attrition) used to train the model
x_test = the same input variables for the held-out rows, used only to make predictions

y_train = the Attrition values for the training rows
y_test = the Attrition values for the test rows, used to check how well our machine learning model predicts
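The effect of `test_size` and `random_state` can be checked on stand-in data. The sketch below assumes only the row count implied by the output earlier (237 + 1233 = 1470 rows); the feature values are synthetic and have nothing to do with the real columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1470 rows, 2 dummy feature columns, 237 positives.
X = np.arange(1470 * 2).reshape(1470, 2)
y = np.array([1] * 237 + [0] * 1233)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # 1176 training rows, 294 test rows

# Repeating the call with the same random_state gives the identical split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))
```

The 294 test rows match the total of the confusion matrices shown in Step 7.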

Step 7 : Apply Machine Learning Model

We first use a Logistic Regression model, because we only want to know whether an employee is going to leave or not, and logistic regression is well suited to that kind of yes/no (binary) classification.

from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression(random_state=0)
lgr.fit(x_train,y_train)

y_pred = lgr.predict(x_test)
from sklearn.metrics import accuracy_score, confusion_matrix
acc = accuracy_score(y_test, y_pred)
print(acc*100)

Output : 86.39455782312925
confusion_matrix(y_test,y_pred)
array([[253,   2],
       [ 38,   1]], dtype=int64)
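Accuracy alone hides how the model does on the employees who actually leave. As a worked example, the sketch below recomputes accuracy, recall, and precision for the "leave" class from the four numbers in the confusion matrix above. It assumes scikit-learn's layout: rows are the actual classes, columns the predicted ones, with "stay" first.

```python
# Confusion matrix printed above: rows = actual (stay, leave), columns = predicted.
tn, fp = 253, 2   # actual stay:  predicted stay, predicted leave
fn, tp = 38, 1    # actual leave: predicted stay, predicted leave

total = tn + fp + fn + tp
accuracy = (tn + tp) / total        # share of all predictions that are correct
recall_leave = tp / (tp + fn)       # share of real leavers the model catches
precision_leave = tp / (tp + fp)    # share of predicted leavers who really leave

print(f'accuracy:  {accuracy:.2%}')
print(f'recall:    {recall_leave:.2%}')
print(f'precision: {precision_leave:.2%}')
```

The accuracy matches the 86.39 printed above, but only 1 of the 39 leavers in the test set is identified.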



Random Forest Classifier Model

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=0)
rfc.fit(x_train,y_train)

y_rfc_pred =rfc.predict(x_test)
from sklearn.metrics import accuracy_score , confusion_matrix
accuracy_score(y_test,y_rfc_pred)*100

87.75 %

confusion_matrix(y_test,y_rfc_pred)

array([[254,   1],
       [ 35,   4]], dtype=int64)
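Because most employees stay, a useful sanity check is to compare both accuracies with a baseline that always predicts "stay". The sketch below does this with plain arithmetic on the two confusion matrices shown in this step; no model is retrained, and the variable names are only illustrative.

```python
# Row sums of either confusion matrix: 255 test employees stay, 39 leave.
actual_stay, actual_leave = 255, 39
total = actual_stay + actual_leave

# A model that always predicts "stay" is right for every employee who stays.
baseline_acc = actual_stay / total

# Correct predictions are the diagonal entries of each matrix.
logistic_acc = (253 + 1) / total
forest_acc = (254 + 4) / total

print(f'always "stay" baseline: {baseline_acc:.2%}')
print(f'logistic regression:    {logistic_acc:.2%}')
print(f'random forest:          {forest_acc:.2%}')
```

On this split the logistic regression lands slightly below the baseline and the random forest slightly above it, so the accuracy figures should be read together with the confusion matrices.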






