This dataset was found on Kaggle. It was built by scraping job postings for the 'Data Scientist' position in the USA from www.glassdoor.com, using Selenium.
- Find the salary distribution for each type of job in the field of Data Science.
- Create an easy way to visualize filtered jobs in Streamlit.
- Implement a multiclass classification machine learning model using this dataset.
- Looked at the data types of each field.
- Checked for missing values.
- Checked for Nulls.
- Checked the cardinality of each field.
- Cleaned the Job Title Column.
- Cleaned the Degree Column.
- Cleaned the seniority_by_title Column.
- Converted the multiple skill columns from binary 1/0 to boolean true/false.
- Correlation matrix
- Quick chart of average salary by job title.
- One-hot encoded the categorical columns of the data frame so they could be used in the random forest.
- The skills data is highly imbalanced.
- The degree level required is also highly imbalanced: very few job listings ask for a bachelor's degree.
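The preparation steps above can be sketched roughly as follows. This is a minimal illustration on a made-up toy table, not the project's actual code: the column names (`job_title`, `degree`, `python`, `sql`, `spark`), the values, and the random forest settings are all assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the cleaned job postings.
# All column names and values here are hypothetical.
df = pd.DataFrame({
    "job_title": ["data scientist", "data analyst", "data engineer",
                  "data scientist", "data analyst", "data scientist"],
    "degree": ["master", "bachelor", "master", "phd", "master", "phd"],
    "python": [1, 0, 1, 1, 0, 1],
    "sql": [1, 1, 1, 0, 1, 0],
    "spark": [0, 0, 1, 0, 0, 1],
})

skill_cols = ["python", "sql", "spark"]

# Convert the 1/0 skill flags to booleans.
df[skill_cols] = df[skill_cols].astype(bool)

# Inspect class balance for the degree column (share of rows per level).
degree_share = df["degree"].value_counts(normalize=True)
print(degree_share)

# One-hot encode the categorical predictor; the skill columns pass through.
X = pd.get_dummies(df.drop(columns="job_title"), columns=["degree"])
y = df["job_title"]

# Fit a random forest on the encoded features (settings are illustrative).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X.iloc[[0]]))
```

On the real data, the same `value_counts(normalize=True)` check is a quick way to see how skewed the skills and degree columns are before training.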
The goal of this section is to classify the different types of data science jobs based on the skills they require.
The user would enter their skill set in a multi-select box in Streamlit, and the model would predict which type of job best fits those skills.
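One way to turn the selected skills into model input is sketched below. The skill list and the helper `skills_to_features` are hypothetical; in the app, the selection would presumably come from a widget such as `st.multiselect` rather than a hard-coded list.

```python
import pandas as pd

# Hypothetical skill columns, in the order the model was trained on.
SKILL_COLUMNS = ["python", "sql", "spark", "tableau"]


def skills_to_features(selected_skills, skill_columns=SKILL_COLUMNS):
    """Build a one-row boolean DataFrame marking which skills were selected."""
    selected = {skill.lower() for skill in selected_skills}
    row = {col: col in selected for col in skill_columns}
    return pd.DataFrame([row], columns=skill_columns)


# Hard-coded here; in the app this list would come from the multi-select box.
features = skills_to_features(["Python", "SQL"])
print(features)
```

A fitted classifier could then be called as `model.predict(features)`, provided the columns match the ones used in training.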
- Multinomial logistic regression
- SVM
- Neural Net
- KMode Clustering
Moving forward there are several things I would like to add / change to improve this application:
- Use Streamlit caching to improve performance and give users a smoother experience with the random forest classifier.
- Collect more data, or find a dataset that is better balanced across data science job types.
- Try different algorithms (see models considered above).
- Implement some sort of boosting algorithm to increase model accuracy.
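As a rough sketch of how a boosting model might be compared against the random forest, the snippet below cross-validates both on synthetic data. The data from `make_classification`, the choice of scikit-learn's `GradientBoostingClassifier`, and all settings are assumptions for illustration; the scores say nothing about how either model would perform on the job postings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class data standing in for the encoded job features.
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

# Candidate models; hyperparameters are illustrative defaults.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With imbalanced classes like the ones noted above, a metric such as macro-averaged F1 (`scoring="f1_macro"`) may be more informative than plain accuracy.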