Salaries in Data Science

About

This dataset was found on Kaggle. It was built by scraping job postings for the position of 'Data Scientist' in the USA from www.glassdoor.com, using Selenium.

Objective:

Find the salary distribution for each type of job in the field of Data Science.
Create an easy way to visualize filtered jobs in Streamlit.
Use this dataset to implement a multiclass classification machine learning model.

Data processing:

Link to notebook used for Cleaning and EDA

Steps taken when cleaning this data set and learning about it:

Looked at the data types of each field.
Checked for missing values.
Checked for Nulls.
Checked the cardinality of each field.
Cleaned the Job Title Column.
Cleaned the Degree Column.
Cleaned the seniority_by_title Column.
Converted the multiple skill columns from binary 1 or 0 to boolean true or false.
Built a correlation matrix.
Plotted a quick chart of average salary by job title.
One-hot encoded the categorical columns of the data frame so they could be used by the random forest.
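
The steps above can be sketched with pandas. This is a minimal illustration on a toy data frame, not the notebook's actual code: the column names (`job_title`, `degree`, `python`, `sql`, `avg_salary`) and all values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the scraped data; column names and values are hypothetical.
df = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Analyst", "Data Engineer", "Data Scientist"],
    "degree": ["M", "P", None, "M"],
    "python": [1, 0, 1, 1],
    "sql": [0, 1, 1, 1],
    "avg_salary": [120.0, 80.0, 110.0, 130.0],
})

print(df.dtypes)             # data type of each field
missing = df.isna().sum()    # missing values / nulls per column
cardinality = df.nunique()   # number of distinct values per column

# Convert the binary skill flags (1 or 0) to booleans.
skill_cols = ["python", "sql"]
df = df.astype({col: bool for col in skill_cols})

# Correlation matrix between the skill flags and salary.
corr = df[skill_cols + ["avg_salary"]].astype(float).corr()

# Average salary per job title (the data behind a quick bar chart).
avg_by_title = df.groupby("job_title")["avg_salary"].mean()

# One-hot encode the categorical feature so a random forest can use it.
features = pd.get_dummies(
    df.drop(columns=["job_title", "avg_salary"]), columns=["degree"]
)
print(features.columns.tolist())
```

Missing degrees are left without a dummy column here; whether to encode them as their own category is a design choice that depends on how the real data is cleaned.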

Key findings

The skills data is very imbalanced.
The required degree level is highly imbalanced: very few job listings ask for a bachelor's degree.
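
One way to quantify this kind of imbalance is to look at the share of each class. The sketch below uses an invented `degree` series purely for illustration; the labels and counts are not taken from the dataset.

```python
import pandas as pd

# Hypothetical degree labels; the real column and its values may differ.
degrees = pd.Series(["M", "M", "M", "P", "P", "B"], name="degree")

# Share of listings per degree level.
shares = degrees.value_counts(normalize=True)

# Ratio between the most and least common class; 1.0 would be perfectly balanced.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```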

Classifier using Random Forest

The goal of this section is to classify the different types of jobs in data science based on their skills. The user inputs their skill set into a multi-select box in Streamlit, and the model predicts which type of job best fits those skills.
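
A minimal sketch of that idea with scikit-learn's `RandomForestClassifier` follows. The skill list, the training rows, and the `predict_job` helper are all hypothetical and only show the shape of the approach; in the app, the list of selected skills would come from the Streamlit multi-select box.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical skill vocabulary; the real feature columns may differ.
skills = ["python", "sql", "spark", "tableau"]

# Toy training data: one row per job posting, one 0/1 flag per skill.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
])
y = np.array([
    "Data Scientist", "Data Engineer", "Data Analyst",
    "Data Scientist", "Data Engineer", "Data Analyst",
])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)


def predict_job(selected_skills):
    """Turn a list of selected skill names into a 0/1 row and predict a job type."""
    row = np.array([[int(skill in selected_skills) for skill in skills]])
    return model.predict(row)[0]


print(predict_job(["python", "sql"]))
```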

Some models considered:

Multinomial logistic regression
SVM
Neural Net
KMode Clustering

Future Work

Moving forward there are several things I would like to add / change to improve this application:

Use Streamlit caching to improve performance and smooth the user experience when using the random forest classifier.
Collect more data or find a better dataset that is more balanced across data science job types.
Try different algorithms (see models considered above).
Implement some sort of boosting algorithm to increase model accuracy.
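
As a rough sketch of how the last two items might be approached, the candidates can be compared with cross-validation. The data here is synthetic (generated with `make_classification` and binarized to mimic skill flags), and the choice of `GradientBoostingClassifier` as the boosting algorithm is an assumption, not something this project has settled on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem standing in for the job-type data.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X = (X > 0).astype(int)  # binarize to mimic 0/1 skill flags

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With imbalanced classes, accuracy alone can be misleading, so a metric such as macro-averaged F1 (`scoring="f1_macro"`) may be a better basis for comparison.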
