nulllvector/forest-cover-prediction-ml


Forest Cover Type Classification using Machine Learning

Overview

This project applies machine learning techniques to predict the dominant forest cover type of 30m × 30m land cells within the Roosevelt National Forest in northern Colorado. Using cartographic variables derived from USGS and US Forest Service datasets, the project compares multiple supervised learning algorithms for large-scale environmental classification and identifies the most effective model for forest cover prediction.

Theodore Roosevelt National Park Map


Dataset

  • Dataset: UCI Forest Covertype Dataset
  • Original Size: 581,012 instances
  • Features: 54 cartographic variables
  • Target Classes: 7 forest cover types
  • Project Sample: 50,000 randomly selected records
  • Source: https://archive.ics.uci.edu/ml/datasets/covertype
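One way to obtain a 50,000-record sample is sketched below. It is not taken from the project notebook: the helper names `sample_records` and `load_covertype_sample`, the seed value, and the use of scikit-learn's `fetch_covtype` loader (rather than a manual download from the UCI page) are all assumptions made for illustration.

```python
import numpy as np


def sample_records(X, y, n_samples=50_000, seed=42):
    """Draw n_samples rows uniformly at random, without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return X[idx], y[idx]


def load_covertype_sample(n_samples=50_000, seed=42):
    """Fetch the full dataset with scikit-learn, then subsample it.

    Hypothetical helper: the notebook may load the data differently.
    """
    # Imported here because the first call downloads the data over the network.
    from sklearn.datasets import fetch_covtype

    X, y = fetch_covtype(return_X_y=True)  # 54 feature columns, labels 1..7
    return sample_records(X, y, n_samples=n_samples, seed=seed)
```

Fixing the seed keeps the sample reproducible between runs, which matters when comparing models on the same subset.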

Problem Statement

Accurate forest cover classification is important for:

  • Environmental monitoring
  • Forest resource management
  • Biodiversity conservation
  • Land-use planning

Traditional mapping techniques are expensive and time-consuming. This project explores how machine learning can automate and improve forest cover prediction.

Technologies Used

  • Python
  • Pandas
  • NumPy
  • Scikit-Learn
  • Matplotlib
  • Seaborn
  • Jupyter Notebook

Machine Learning Pipeline

  1. Data Collection
  2. Data Preprocessing
  3. Random Sampling (50,000 records)
  4. Feature Scaling
  5. Correlation-Based Feature Selection
  6. Train-Test Split (80/20)
  7. Model Training
  8. Model Evaluation
  9. Performance Comparison
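The preprocessing steps above could be sketched roughly as follows. Several details are assumptions rather than the notebook's actual choices: "correlation-based feature selection" is interpreted here as dropping one feature from each highly correlated pair, the 0.9 threshold is hypothetical, the split is stratified, and the scaler is fitted on the training split only (after the split, to avoid leaking test statistics), which differs slightly from the order listed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def drop_correlated_features(df, threshold=0.9):
    """Drop one feature from every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def prepare_data(df, target_col, n_samples=50_000, seed=42):
    """Sample, select features, split 80/20, then scale."""
    sample = df.sample(n=min(n_samples, len(df)), random_state=seed)
    X = drop_correlated_features(sample.drop(columns=[target_col]))
    y = sample[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Fit the scaler on training rows only; reuse it for the test rows.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```

Scaling matters mainly for the distance- and margin-based models (KNN, SVM, Logistic Regression); tree-based models are largely insensitive to it.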

Models Implemented

  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Decision Tree
  • Support Vector Machine (SVM)
  • Random Forest
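A minimal sketch of how these five models might be trained and compared with scikit-learn is shown below. The hyperparameters (`max_iter=1000`, `n_neighbors=5`, the RBF kernel, `n_estimators=100`) and the helper names `build_models` and `compare_models` are illustrative assumptions; the settings used in the notebook are not stated here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=42):
    """Return the five classifiers keyed by display name (assumed settings)."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "SVM": SVC(kernel="rbf"),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }


def compare_models(X_train, X_test, y_train, y_test, seed=42):
    """Fit every model on the training split and score it on the test split."""
    scores = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores
```

Keeping the models in one dictionary means every classifier sees exactly the same split, so the resulting accuracies are directly comparable.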

Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • G-Mean
  • False Positive Rate (FPR)

Results

| Model               | Accuracy |
|---------------------|----------|
| Logistic Regression | 72%      |
| KNN                 | 82%      |
| Decision Tree       | 81%      |
| SVM                 | 77%      |
| Random Forest       | 87%      |

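Random Forest achieved the highest accuracy (87%), five percentage points ahead of KNN (82%) and six ahead of a single Decision Tree (81%), which suggests that ensembling trees handles the 54 cartographic features better than any individual model tested.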

Project Structure

forest-cover-classification/
│
├── notebooks/
│   └── ml_project.ipynb
│
├── reports/
│   └── Forest_Cover_Classification.pdf
│
├── images/
│   ├── correlation_matrix.png
│   ├── confusion_matrix.png
│   └── feature_importance.png
│
├── requirements.txt
├── README.md
└── LICENSE

Key Learnings

  • Feature engineering and feature selection
  • Multi-class classification
  • Model comparison and benchmarking
  • Evaluation metrics beyond accuracy
  • Ensemble learning using Random Forest

Future Improvements

  • Train on the complete 581K-instance dataset
  • Implement XGBoost and LightGBM
  • Address class imbalance using SMOTE
  • Deploy the model using Flask/FastAPI
  • Create a web application for real-time predictions

Sample Results

Correlation Matrix


Model Performance Comparison

Model Accuracy Comparison

Model Performance Comparison Heatmap

About

End-to-end machine learning pipeline for forest cover type prediction using the UCI Covertype dataset (581K+ records). Includes feature engineering, model benchmarking, and performance evaluation.
