Skip to content

Real-J/Machine-Learning-Model-for-Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

9 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Machine Learning Model for Classification

This repository contains a machine learning pipeline that processes a dataset, trains multiple models, tunes hyperparameters, and evaluates performance using accuracy and confusion matrices.

The pipeline supports handling imbalanced data using SMOTE and includes Random Forest hyperparameter tuning using GridSearchCV.

πŸš€ Features

  • Preprocessing: Encodes categorical data, normalizes numerical features, and handles missing values.
  • Handles Imbalanced Data: Uses SMOTE (Synthetic Minority Over-sampling Technique).
  • Trains multiple models:
    • Decision Tree
    • Random Forest
    • Support Vector Machine (SVM)
    • k-Nearest Neighbors (KNN)
    • (Optional) LightGBM (if installed)
  • Hyperparameter Tuning: Optimizes Random Forest with GridSearchCV.
  • Feature Importance Analysis: Displays feature contributions for model decisions.
  • Model Saving and Loading: Saves the best-trained model for future predictions.

πŸ“‚ Dataset Information

Predicting Diabetes based on factors such as bmi, age, number of pregnancies etc

For Each Attribute:

Number of times pregnant Plasma glucose concentration a 2 hours in an oral glucose tolerance test Diastolic blood pressure (mm Hg) Triceps skin fold thickness (mm) 2-Hour serum insulin (mu U/ml) Body mass index (weight in kg/(height in m)^2) Diabetes pedigree function Age (years)

The dataset should be a CSV file with relevant attributes for classification tasks. The dataset used in this project focuses on medical diagnostics, predicting diabetes based on health indicators.

Dataset Columns

  • Pregnancies: Number of times the person has been pregnant
  • Glucose: Plasma glucose concentration after a 2-hour oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Genetic predisposition score
  • Age: Age in years
  • Outcome: Target variable (0 = No Diabetes, 1 = Diabetes)

The dataset is cleaned by handling missing values using median imputation.

πŸ† Machine Learning Models

The pipeline implements various models to classify individuals based on their health attributes.

Model Overview

  1. Decision Tree Classifier
    • A simple tree-based model that learns decision rules.
    • Good for interpretability but prone to overfitting.
  2. Random Forest Classifier
    • An ensemble of decision trees, reducing variance and improving generalization.
    • Tuned using GridSearchCV to find the best hyperparameters.
  3. Support Vector Machine (SVM)
    • Uses a polynomial kernel to separate classes.
    • Works well for small and medium-sized datasets.
  4. k-Nearest Neighbors (KNN)
    • Classifies based on the majority vote of k nearest neighbors.
    • Performs best with well-distributed data.
  5. LightGBM (Optional)
    • Gradient boosting framework for high-performance classification.
    • Used if installed; skipped otherwise.

πŸ›  Installation

  1. Clone this repository:

    git clone https://github.com/Real-J/ml-classification.git
    cd ml-classification
  2. Install dependencies:

    pip install -r requirements.txt

    If using Anaconda, install missing dependencies:

    conda install -c conda-forge imbalanced-learn lightgbm

πŸš€ Usage

  1. Place the dataset (your_dataset.csv) in the project folder.
  2. Run the script:
    python ml_pipeline.py
  3. The script will:
    • Preprocess the dataset
    • Handle imbalanced data using SMOTE
    • Train multiple models
    • Tune Random Forest hyperparameters
    • Display feature importance
    • Plot the confusion matrix
    • Save the best model as best_model.pkl

πŸ“Š Model Performance Example

Original class distribution: [500 268]
Resampled class distribution: [500 500]
Accuracy with Decision Tree: 0.7450
Accuracy with Random Forest: 0.7900
Accuracy with SVM: 0.7100
Accuracy with KNN: 0.7050
Accuracy with LightGBM: 0.7800
Tuned Random Forest Accuracy: 0.81

Dataset Result

graph

graph

πŸ“ˆ Feature Importance

The script generates a feature importance plot, helping to identify the most influential features in classification.

πŸ” Troubleshooting

  • If you get ModuleNotFoundError, install missing libraries:
    pip install numpy pandas scikit-learn seaborn matplotlib imbalanced-learn lightgbm
  • If LightGBM is not installed, it will skip training that model without affecting the rest.

πŸ€– Future Improvements

  • Support for deep learning models (e.g., TensorFlow/Keras, PyTorch)
  • AutoML integration for hyperparameter tuning
  • Web API for real-time model inference

πŸ“œ License

This project is licensed under the MIT License.

About

This project predicts diabetes using machine learning models like Random Forest, SVM, Decision Tree, and KNN. It includes data preprocessing, SMOTE for handling class imbalance, hyperparameter tuning, and model evaluation. The best model is saved as best_diabetes_model.pkl for future predictions. πŸš€

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages