MachineLearning-species-classification-project

Project Overview

This project is part of the Machine Learning course at The University of Queensland. The goal is to classify species using machine learning algorithms, specifically Principal Component Analysis (PCA) for feature reduction and Random Forest (my choice) for classification. The objective is to improve classification accuracy by selecting the most informative features.

This study includes:

Feature selection through correlation analysis and PCA.
Dimensionality reduction using PCA to improve model performance.
Species classification using Random Forest.
Performance evaluation with accuracy metrics.

Project Repository

MachineLearning-species-classification-project
│── README.md                  # Project Overview
│── requirements.txt            # List of dependencies
│── data/
│   ├── 83_Loeschcke_et_al_2000_Thorax_&_wing_traits_lab pops.csv  # Dataset
│── images/
│   ├── correlation_map.png      # Correlation heatmap visualization
│   ├── PCA_result.png           # PCA visualization
│   ├── scree_plot.png           # Scree plot for PCA
│── reports/
│   ├── 47801873_Report.pdf      # Final report describing the project
│── notebooks/
│   ├── fly analysis.ipynb       # Jupyter Notebook for full analysis

Setup

Clone the repository

git clone https://github.com/TsungTseTu122/MachineLearning-species-classification-project.git

cd MachineLearning-species-classification-project

Set Up a Virtual Environment (If needed) For window users:

python -m venv venv venv\Scripts\activate

For macOS/Linux users:

python3 -m venv venv source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Dataset overview

The dataset consists of thorax and wing trait measurements from biological samples. The features include:

Latitude & Longitude: Geographic location of the sample.
Temperature: Environmental temperature at the time of collection.
Vial & Replicate: Experimental grouping information.
Thorax_length: Measurement of the thorax size.
L2, L3p, L3d, lpd, l3: Wing length measurements from different parts.
w1, w2, w3: Additional wing-related metrics.
Wing_loading: A derived trait combining thorax and wing characteristics.

These features were analyzed using correlation heatmaps and PCA to determine their importance in classification.

How to use

The Jupyter Notebook (fly_analysis.ipynb) contains the complete workflow for analyzing the dataset and training the model. The steps included in the notebook are:

Data Preprocessing:

Loading the dataset and inspecting missing values.
Converting necessary columns to numerical format.
Performing correlation analysis to identify highly related features.

2 Feature Selection with PCA:

Applying Principal Component Analysis (PCA) to reduce dimensionality.
Determining the optimal number of components using a scree plot.
Visualizing data distribution in lower-dimensional space.

Feature Importance Analysis:

Extracting and ranking the most influential features in classification.
Comparing PCA-selected features with the original dataset.

Model Training with Random Forest:

Training a Random Forest classifier on the processed dataset.
Splitting data into training and testing sets.
Evaluating model performance using accuracy metrics.

All results are pre-saved, so rerunning is not required. If you want to modify or extend the analysis, simply open the notebook and adjust the code as needed.

Key Results

Feature Selection: PCA was used to reduce dimensionality while preserving important information.
Visualizations:

Correlation heatmap to understand feature relationships.

Scree plot to determine optimal PCA components.

PCA projection to visualize transformed data.
Feature Importance Ranking: The notebook provides insights into which features contribute the most to classification accuracy.
Model Performance: Random Forest achieved a final accuracy of 66.67%, improving from a baseline of 51.63%.

Future improvement

Try additional classifiers: Support Vector Machines (SVM) or Neural Networks.
Hyperparameter tuning: Exploring alternative parameter search strategies beyond GridSearchCV such as Bayesian Optimization or RandomizedSearchCV.
Explore dataset augmentation: Synthetic data generation to improve classification.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MachineLearning-species-classification-project

Project Overview

Project Repository

Setup

Dataset overview

How to use

Key Results

Future improvement

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
images		images
notebooks		notebooks
reports		reports
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

MachineLearning-species-classification-project

Project Overview

Project Repository

Setup

Dataset overview

How to use

Key Results

Future improvement

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages