Analysis of Data related to Salary and Education

In this era of technology, an education that is both sound and productive can set a person up for success, measured here by their satisfaction and by the work they go on to do.

Thus, we need to find the hidden patterns among the factors that determine a person's salary. Several factors play a role, such as education, working hours and determination, and exploring them is the goal of this analysis.

**Extraction of data was done by Barry Becker from the 1994 Census database.**


# importing libraries
import numpy as np #to do the numerical calculation
import pandas as pd # to explore and clean the data
# importing libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# a magic command to display the plots within the Jupyter Notebook
%matplotlib inline 

Let’s load the dataset we are going to explore; it is available from the UCI Machine Learning Repository at the URL used in the code below.

# as our dataset doesn't have predefined column names, we have to specify them first
column_names = ['age',"workclass","fnlwgt","education","education-num","marital_status", "occupation","relationship","race",'sex',"capital-gain","capital-loss","hours-per-week","native-country","salary"]
# loading the data
df_adult = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",names = column_names , index_col=False)

**Let’s explore it by looking at it in tabular form.**

Data Exploration

# print out 5 instances of data for exploration of data in tabular form
df_adult.head()
age workclass fnlwgt education education-num marital_status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
# gives dimensions of data
print(df_adult.shape)
(32561, 15)

Thus, we have **32561 rows** and **15 columns** in our data.


Now we have to analyse our data systematically, starting with how many NaN values the attribute columns contain. Identifying **NaN values** is crucial before exploring further, because they can distort or break numerical calculations and leave gaps in visualizations.
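
As a small illustration of how NaN values behave, here is a sketch on a made-up frame (the `toy` frame and its values are hypothetical, not taken from the census data). It shows how to count NaN values per column and how a single NaN affects a mean in NumPy versus pandas.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (values are made up for illustration)
toy = pd.DataFrame({"age": [39.0, np.nan, 38.0], "hours": [40, 13, 40]})

# Number of NaN values in each column
nan_counts = toy.isna().sum()
print(nan_counts)

# NumPy propagates NaN, while pandas skips it by default (skipna=True)
numpy_mean = np.mean(toy["age"].to_numpy())
pandas_mean = toy["age"].mean()
print(numpy_mean, pandas_mean)
```

The same `isna().sum()` call can be applied to the full DataFrame to get a per-column count of missing values.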

df_adult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
salary            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

This shows that there are no NaN values anywhere in our data, which is good news. Note, however, that the raw file marks unknown entries with a ? placeholder (for example in workclass and occupation), and info() counts those as ordinary non-null strings, so they do not show up here.
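
The Adult data marks unknown entries with a ? string rather than NaN, so a non-null count alone does not reveal them. The following is a minimal sketch of how such placeholders could be counted; the helper name `count_placeholders` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def count_placeholders(df, placeholder="?"):
    """Count cells equal to `placeholder` (ignoring surrounding spaces) per column."""
    as_text = df.astype(str)  # numeric columns simply never match the placeholder
    return as_text.apply(lambda col: (col.str.strip() == placeholder).sum())

# Toy rows shaped like the raw file, where text values keep a leading space
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
    "age": [25, 40, 31],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

Calling `count_placeholders(df_adult)` would give the same kind of per-column count for the real data.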

It’s time to explore our data with some numbers, and descriptive statistics make this easy.

Descriptive statistics tell us a lot about the distribution of the numerical attributes in our data. Along with the count and the quartiles, they give us the following values:

1) Mean

2) Standard Deviation (std)

3) Minimum and maximum values
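
To make these quantities concrete, here is a small sketch that computes them by hand for the five ages shown in the first rows of the table above, using only Python's standard library. Note that `statistics.stdev` is the sample standard deviation (dividing by n - 1), which is also the convention pandas uses by default.

```python
import statistics

# Ages of the first five rows printed by df_adult.head()
ages = [39, 50, 38, 53, 28]

mean_age = statistics.mean(ages)   # arithmetic mean
std_age = statistics.stdev(ages)   # sample standard deviation (divides by n - 1)
min_age, max_age = min(ages), max(ages)

print(mean_age, round(std_age, 2), min_age, max_age)
```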

df_adult.describe()
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 32561.000000 3.256100e+04 32561.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.080679 1077.648844 87.303830 40.437456
std 13.640433 1.055500e+05 2.572720 7385.292085 402.960219 12.347429
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000 99.000000

Data Visualization

Now it’s the right time to look for patterns within the attributes of our data, as we have loaded the data and checked it for NaN values.

The first thing we are going to do is find the **correlation coefficient** between the numerical factors.

The correlation coefficient measures the strength of the linear relationship between two variables: it tells us how the value of one variable changes as the other varies. Its range is **-1 to 1**. If one variable increases as the other increases, we say they are **positively correlated**; if one decreases as the other increases, they are negatively correlated.

The closer the absolute value of the correlation coefficient is to 1, the more strongly correlated the variables are; a value near 0 means there is little linear relationship.
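
The following sketch computes Pearson's correlation coefficient directly from its definition on three tiny made-up series, to show the values +1, -1 and 0. The function name `pearson_r` and the sample lists are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # y rises with x
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls as x rises
r_vee = pearson_r(xs, [2, 1, 0, 1, 2])    # V-shaped: related, but not linearly

print(r_up, r_down, r_vee)
```

The V-shaped series is a reminder that a coefficient near 0 rules out a linear relationship only; the two variables can still depend on each other in a non-linear way.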

A heatmap will help us visualize the correlation coefficients between the numerical factors in our data.

# gives the correlation matrix of the numerical columns only
correlation = df_adult.select_dtypes(include="number").corr()
fig = plt.figure(figsize=(10,8))
sns.heatmap(correlation, annot =True , cmap = "coolwarm" , linewidths=2 , linecolor="purple")
plt.show()

Takeaway from Heatmap–

As we can see, no two features in our data are highly correlated. This tells us that there is very little linear relationship between the numerical features of our data.


Before moving further, it is worth visualizing the distribution of the ages of the people in our dataset: it gives us a rough idea of the age range and, indirectly, of their work experience.

**Let’s do it with the help of a histogram.**

# switching the default Matplotlib style to the seaborn style
sns.set_theme()
fig,ax = plt.subplots()
ax.hist(df_adult['age'] ,bins= 10)
ax.set_xlabel("Age", weight = "bold" , fontsize = 10)
ax.set_ylabel("Number of People" , weight = "bold" , fontsize = 10)
ax.set_title("Distribution of Age" , weight = "bold" , fontsize = 20)
ax.spines["top"].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(False)
plt.show()

Conclusion From the visualization–

Most of the values in the age column fall within the range 20-50. Thus, we have data on the younger population as well as on some more experienced people.


Now we have enough information about the distribution of ages, but several other important features can help us find hidden patterns in our data, and one of them is the education category.

It’s important to know which educational qualification each person holds, as it is one of the most important features shaping someone’s future.

Let’s visualize it using a bar chart.

# count each education category once; value_counts() sorts by frequency
education_counts = df_adult["education"].value_counts()
education_cat = education_counts.index.tolist()
values_for_education_cat = education_counts.tolist()
print(len(education_cat))
16
sns.set_theme()
fig,ax = plt.subplots()
bar_positions = np.linspace(1,32,16)
# Matplotlib's bar() takes the bar positions as `x` (formerly `left`)
ax.bar(x = bar_positions , height =values_for_education_cat ,width = 1)
ax.set_xlabel("Education Categories", weight = "bold")
ax.set_ylabel("Number of People in Each Education Category" , weight = "bold")
ax.set_xticks(bar_positions)
ax.set_xticklabels(education_cat ,rotation = 90)
ax.set_title(" Demonstration of Educational Categories" , weight = "bold", fontsize = 25 , color ="purple")
ax.spines["top"].set_visible(False)
ax.spines['left'].set_visible(False)
ax.grid(False)
plt.show()

Takeaway from the distribution of Educational Categories

Most people hold an **HS-grad** qualification, followed by **Some-college**; **Preschool** has the fewest people.


Now, to explore the data further, we have to look at the relationships between important attributes, which can tell us a lot. A good next question is whether there is any correlation between a person's age and working hours. We could use either a scatter plot or a joint plot; a scatter plot should be enough to show whether these two attributes are correlated.

Let’s visualize it.

# creating figure and axes objects simultaneously

fig , ax = plt.subplots()
ax.scatter(x = df_adult['age'] , y= df_adult["hours-per-week"] )
ax.set_xlabel("Age" , weight = "bold")
ax.set_ylabel("Hours-per-week" , weight="bold")
ax.set_title("Relationship among Age and Working Hours" , fontsize=20 , weight="bold")
plt.show()

It looks pretty awkward, doesn’t it? This visualization suggests that there is no correlation between these attributes, but we should confirm that. **Mathematics and its beauty help us do this.**

Let’s compute the correlation coefficient, which tells us mathematically how strongly these attributes are correlated.

**The larger the absolute value of the correlation coefficient, the stronger the correlation.**

from scipy.stats import pearsonr
r_value , p_value = pearsonr(df_adult['age'] , df_adult['hours-per-week'])
r_value , p_value
(0.068755707509557354, 2.011285562158478e-35)

As the calculated values confirm, the linear correlation between age and working hours is very weak (r ≈ 0.07). The tiny p-value only reflects the large sample size, not a strong relationship. In other words, age alone tells us very little about how many hours a person works.
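
One way to read the size of this coefficient is to square it: for a straight-line fit, r squared is the share of the variance in one variable that the other accounts for. The short calculation below uses the r value printed above; the variable names are just for illustration.

```python
r_value = 0.068755707509557354  # Pearson r between age and hours-per-week, from above
r_squared = r_value ** 2        # share of variance explained by a straight-line fit

print(f"r^2 = {r_squared:.4f} ({r_squared:.2%} of the variance)")
```

So under a linear model, age accounts for well under one percent of the variation in weekly working hours.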


The following boxplot gives us a small but valuable piece of information: the age distributions of **Males** and **Females** are almost the same, so neither group is noticeably older or younger in this dataset.

sns.boxplot(x="sex" , y ="age" , data = df_adult )
plt.show()

It’s time to uncover the remaining important patterns within the data.

For our main feature, salary, the most important relationship to examine is how much work a person puts in. So it becomes important to look at the distribution of hours-per-week for each salary group, split by sex.

sns.boxplot( x = "salary" , y="hours-per-week" , data = df_adult , hue ="sex")
plt.grid(False)
plt.show()

Takeaways from this visualization–

1) The left part of the plot shows the distribution of weekly working hours for people who earn 50K or less. Males clearly tend to work more hours than females in this group; at the same time, the percentage of females who fall in this bracket is greater than the percentage of males.

2) The right part of the plot shows the distribution for people who earn more than 50K. They clearly tend to work more hours than those who earn 50K or less, although the plot alone cannot tell us that longer hours are the only reason for the higher pay.

**More hours worked tends to go with higher pay**
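
The boxplot comparison can also be put into numbers with a grouped median. This is a sketch on made-up rows that reuse the column names and raw labels of df_adult (including the leading space kept from the CSV); the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy rows (made up) with the same column names and raw labels as df_adult
toy = pd.DataFrame({
    "salary": [" <=50K", " <=50K", " <=50K", " >50K", " >50K", " >50K"],
    "sex": [" Male", " Female", " Female", " Male", " Male", " Female"],
    "hours-per-week": [40, 35, 30, 50, 45, 40],
})

# Median weekly hours for each combination of salary bracket and sex
median_hours = toy.groupby(["salary", "sex"])["hours-per-week"].median()
print(median_hours)
```

Replacing `toy` with `df_adult` would give the medians that correspond to the centre lines of the boxes in the plot.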


Before completing our analysis, we should check whether marital status has some relation to salary. It could be one of the factors that keeps someone from putting in more effort; on the other hand, it could also be one of the factors that motivates someone to work harder for his/her family.

Let’s break it down using visualization.


marital_status_cat = df_adult['marital_status'].value_counts().index
values_for_marital_cat = df_adult['marital_status'].value_counts()
less_than_50k = []
greater_than_50k =[]
for category in marital_status_cat:
    temp_counts = df_adult[(df_adult["marital_status"]== category)&(df_adult["salary"]==' <=50K')].shape[0]
    less_than_50k.append(temp_counts)
for category in marital_status_cat:
    temp_counts = df_adult[(df_adult["marital_status"]== category)&(df_adult["salary"]==' >50K')].shape[0]
    greater_than_50k.append(temp_counts)   
sns.set_theme()
fig = plt.figure(figsize=(15,7))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
bar_positions = np.linspace(1,14,7)
ax1.bar(x = bar_positions , height =less_than_50k , width=1 , label="Less than 50k")
# the legend belongs to ax1; `ax` would refer to the axes of an earlier figure
ax1.legend(loc="best")
ax1.set_xlabel(" Marital-Status", weight = "bold")
ax1.set_ylabel("Number of People in Each Marital-Status Category" , weight = "bold")
ax1.set_xticks(bar_positions)
ax1.set_xticklabels(marital_status_cat ,rotation = 90)
ax1.set_title(" Demonstration of Marital-Status Categories" , weight = "bold", fontsize = 25 , color ="purple", loc ="left")
ax1.spines["top"].set_visible(False)
ax1.spines['left'].set_visible(False)
ax1.grid(False)
ax1.text(x = 8 , y=10000 , s ="Less Than 50K", fontsize = 20 )
ax2.bar(x = bar_positions , height =greater_than_50k  , width=1)
ax2.set_xlabel("Marital-Status", weight = "bold")
ax2.set_xticks(bar_positions)
ax2.set_xticklabels(marital_status_cat ,rotation = 90)
ax2.spines["top"].set_visible(False)
ax2.spines['left'].set_visible(False)
ax2.grid(False)
ax2.text(x =8, y=6500, s ="Greater Than 50K", fontsize = 20 )
plt.show()

Takeaways–

  • Married-civ-spouse accounts for a large number of people in both cases, i.e. for salaries greater than 50K as well as for 50K or less
  • The two plots look similar except for Never-married: people who have never married appear far more often in the lower salary bracket than in the higher one
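
Because the two salary groups differ a lot in size, raw counts can be hard to compare across categories. A row-normalised crosstab gives the share of each salary bracket within a category instead. The sketch below runs on made-up rows; the helper name `salary_shares` and the toy data are assumptions for illustration.

```python
import pandas as pd

def salary_shares(df, column):
    """Share of each salary bracket within every category of `column`."""
    return pd.crosstab(df[column], df["salary"], normalize="index")

# Toy rows (made up) using the raw labels, which keep a leading space
toy = pd.DataFrame({
    "marital_status": [" Never-married"] * 4 + [" Married-civ-spouse"] * 4,
    "salary": [" <=50K", " <=50K", " <=50K", " >50K",
               " <=50K", " <=50K", " >50K", " >50K"],
})
shares = salary_shares(toy, "marital_status")
print(shares)
```

The same helper could be called as `salary_shares(df_adult, "marital_status")`, or with any other categorical column such as "education".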

Let’s do the same for the Educational Categories

greater_50k = []
lesser_50k =[]
for category in education_cat:
    counts = df_adult[(df_adult["education"] == category)&(df_adult['salary']==' <=50K')].shape[0]
    lesser_50k.append(counts)
for category in education_cat:
    counts = df_adult[(df_adult["education"] == category)&(df_adult['salary']==' >50K')].shape[0]
    greater_50k.append(counts)    
sns.set_theme()
fig = plt.figure(figsize=(15,7))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
bar_positions = np.linspace(1,32,16)
ax1.bar(x = bar_positions , height =lesser_50k , width=1 , label="Less than 50k")
# the legend belongs to ax1; `ax` would refer to the axes of an earlier figure
ax1.legend(loc="best")
ax1.set_xlabel("Education Categories", weight = "bold")
ax1.set_ylabel("Number of People in Each Education Category" , weight = "bold")
ax1.set_xticks(bar_positions)
ax1.set_xticklabels(education_cat ,rotation = 90)
ax1.set_title(" Demonstration of Educational Categories" , weight = "bold", fontsize = 30 , color ="purple", loc ="left")
ax1.spines["top"].set_visible(False)
ax1.spines['left'].set_visible(False)
ax1.grid(False)
ax1.text(x = 18 , y= 8000 , s ="Less Than 50K", fontsize = 20 )
ax2.bar(x = bar_positions , height =greater_50k  , width=1)
ax2.set_xlabel("Education Categories", weight = "bold")
ax2.set_xticks(bar_positions)
ax2.set_xticklabels(education_cat ,rotation = 90)
ax2.spines["top"].set_visible(False)
ax2.spines['left'].set_visible(False)
ax2.grid(False)
ax2.text(x = 15 , y= 2000 , s ="Greater Than 50K", fontsize = 20 )
plt.show()

The major takeaways are -

1) The main point we can extract from this plot is that far more people have a salary of 50K or less: roughly 3 times as many as have a salary greater than 50K.

**Count of people (<=50K) ≈ 3 × count of people (>50K)**

2) People with a Bachelors qualification form the largest group among those earning more than 50K, while people with HS-grad form the largest group among those earning 50K or less.

3) People with higher qualifications tend to have higher salaries.

**Higher qualification can give you high pay**

Conclusions :

  • HS-grad is the most common educational qualification in this dataset.
  • A higher educational qualification tends to go with higher pay.
  • People who work more hours per week tend to earn a higher salary.