This repository contains a machine learning pipeline that processes a dataset, trains multiple models, tunes hyperparameters, and evaluates performance using accuracy and confusion matrices.
The pipeline supports handling imbalanced data using SMOTE and includes Random Forest hyperparameter tuning using GridSearchCV.
- Preprocessing: Encodes categorical data, normalizes numerical features, and handles missing values.
- Handles Imbalanced Data: Uses SMOTE (Synthetic Minority Over-sampling Technique).
- Trains multiple models:
  - Decision Tree
  - Random Forest
  - Support Vector Machine (SVM)
  - k-Nearest Neighbors (KNN)
  - (Optional) LightGBM (if installed)
- Hyperparameter Tuning: Optimizes Random Forest with `GridSearchCV`.
- Feature Importance Analysis: Displays feature contributions for model decisions.
- Model Saving and Loading: Saves the best-trained model for future predictions.
Predicting diabetes based on factors such as BMI, age, and number of pregnancies.
The dataset should be a CSV file with relevant attributes for classification tasks. The dataset used in this project focuses on medical diagnostics, predicting diabetes based on health indicators.
- Pregnancies: Number of times the person has been pregnant
- Glucose: Plasma glucose concentration after a 2-hour oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Genetic predisposition score
- Age: Age in years
- Outcome: Target variable (0 = No Diabetes, 1 = Diabetes)
The dataset is cleaned by handling missing values using median imputation.
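Median imputation can be sketched as follows. The column names follow the dataset description above; treating zeros as missing in these columns is a common convention for this data, and the exact columns the script imputes are an assumption:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataset; a zero in Glucose/BMI denotes a missing reading
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

for col in ["Glucose", "BMI"]:
    df[col] = df[col].replace(0, np.nan)       # mark zeros as missing
    df[col] = df[col].fillna(df[col].median())  # fill with the column median

print(df)
```

The median is preferred over the mean here because medical measurements like insulin are skewed, and the median is robust to outliers.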
The pipeline implements various models to classify individuals based on their health attributes.
- Decision Tree Classifier
  - A simple tree-based model that learns decision rules.
  - Good for interpretability but prone to overfitting.
- Random Forest Classifier
  - An ensemble of decision trees, reducing variance and improving generalization.
  - Tuned using `GridSearchCV` to find the best hyperparameters.
- Support Vector Machine (SVM)
  - Uses a polynomial kernel to separate classes.
  - Works well for small and medium-sized datasets.
- k-Nearest Neighbors (KNN)
  - Classifies based on the majority vote of k nearest neighbors.
  - Performs best with well-distributed data.
- LightGBM (Optional)
  - Gradient boosting framework for high-performance classification.
  - Used if installed; skipped otherwise.
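The Random Forest tuning step can be sketched as below. The parameter grid shown is illustrative, not necessarily the one the script uses, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split
X, y = make_classification(n_samples=300, random_state=42)

# Illustrative grid; GridSearchCV tries every combination with cross-validation
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))
```

`search.best_estimator_` is then the refit model used for final evaluation.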
- Clone this repository:
  git clone https://github.com/Real-J/ml-classification.git
  cd ml-classification
- Install dependencies:
  pip install -r requirements.txt
  If using Anaconda, install missing dependencies:
  conda install -c conda-forge imbalanced-learn lightgbm
- Place the dataset (`your_dataset.csv`) in the project folder.
- Run the script:
python ml_pipeline.py
- The script will:
  - Preprocess the dataset
  - Handle imbalanced data using SMOTE
  - Train multiple models
  - Tune Random Forest hyperparameters
  - Display feature importance
  - Plot the confusion matrix
  - Save the best model as `best_model.pkl`
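Saving and reloading the best model can be sketched with `joblib` (whether the script uses `joblib` or the `pickle` module is an assumption; both produce a `.pkl` file):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the tuned best model
X, y = make_classification(n_samples=100, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

joblib.dump(model, "best_model.pkl")    # save the trained model
loaded = joblib.load("best_model.pkl")  # reload it for future predictions
print(loaded.predict(X[:3]))
```

A loaded model only predicts correctly if new inputs go through the same preprocessing (encoding, scaling, imputation) as the training data.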
Original class distribution: [500 268]
Resampled class distribution: [500 500]
Accuracy with Decision Tree: 0.7450
Accuracy with Random Forest: 0.7900
Accuracy with SVM: 0.7100
Accuracy with KNN: 0.7050
Accuracy with LightGBM: 0.7800
Tuned Random Forest Accuracy: 0.81
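The per-model accuracies above come from evaluating on a held-out test split. A minimal sketch of that evaluation loop for one model, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the preprocessed dataset
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"Accuracy with Decision Tree: {accuracy_score(y_te, pred):.4f}")
print(confusion_matrix(y_te, pred))
```

The same fit/predict/score pattern repeats for each of the other classifiers.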
The script generates a feature importance plot, helping to identify the most influential features in classification.
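A sketch of such a plot using Random Forest's built-in importances; the feature names and plotting details here are illustrative, not the script's exact code:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove to display interactively
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with one column per dataset feature
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
         "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
model = RandomForestClassifier(random_state=1).fit(X, y)

# Importances sum to 1; larger bars mean larger contributions to the splits
imp = pd.Series(model.feature_importances_, index=names).sort_values()
imp.plot(kind="barh", title="Feature importance")
plt.tight_layout()
plt.savefig("feature_importance.png")
```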
- If you get a `ModuleNotFoundError`, install the missing libraries:
pip install numpy pandas scikit-learn seaborn matplotlib imbalanced-learn lightgbm
- If LightGBM is not installed, the script skips that model without affecting the rest.
- Support for deep learning models (e.g., TensorFlow/Keras, PyTorch)
- AutoML integration for hyperparameter tuning
- Web API for real-time model inference
This project is licensed under the MIT License.

