A complete EDA-to-Modeling workflow using Decision Tree, Random Forest, KNN, and Support Vector Regression
This project examines how various regression algorithms learn from data and perform under different levels of complexity.
The pipeline covers data cleaning, preprocessing, EDA, model training, and performance comparison using four popular machine learning models.
The goal was not only to achieve strong predictive performance but also to deepen understanding of how algorithm choice affects real-world results.
## Objectives

- Build a complete machine learning workflow
- Explore data patterns through EDA
- Train and compare four regression models
- Evaluate and interpret model performance
- Strengthen intuition behind algorithm selection
## Tech Stack

- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
- Jupyter Notebook
## Data Cleaning

- Handle missing values
- Fix inconsistent data types
- Remove outliers where necessary
## Exploratory Data Analysis

- Correlation heatmaps
- Distribution plots
- Feature relationships
- Outlier and noise detection
## Preprocessing

- Feature scaling
- Train-test split
- Normalization for distance-based and kernel models
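A minimal sketch of this step, using synthetic data in place of the project's dataset (the array shapes and random seed are illustrative assumptions). The key detail is fitting the scaler on the training split only, so no test-set statistics leak into preprocessing:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project's cleaned feature matrix and target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```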
## Models

The following models were trained and evaluated:
```python
# Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor(max_depth=5)

# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=100)

# KNN Regressor
from sklearn.neighbors import KNeighborsRegressor
knn_model = KNeighborsRegressor(n_neighbors=5)

# Support Vector Regression
from sklearn.svm import SVR
svr_model = SVR()
```
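One way these four models might be trained side by side is a simple dictionary loop; the synthetic data below stands in for the project's preprocessed features:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in data with a mild non-linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVR": SVR(),
}

# Fit every model on the same data and collect its predictions
predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
```

Keeping the models in one mapping makes it trivial to evaluate them with identical code paths, which is what a fair comparison requires.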
## Evaluation

Performance metrics used:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R² Score
- Cross-validation performance
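These metrics are all available in scikit-learn; a sketch on synthetic stand-in data (model and seed are illustrative) shows how they fit together, with cross-validated R² giving a less optimistic picture than in-sample scores:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project's dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
pred = model.predict(X)

# In-sample error and goodness-of-fit metrics
mae = mean_absolute_error(y, pred)
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)

# 5-fold cross-validated R² estimates performance on unseen data
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
```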
## Model Comparison

| Model | Strengths | Weaknesses |
| ----------------- | ------------------------------- | -------------------------------- |
| Decision Tree | Easy to interpret, fast | Overfits easily |
| Random Forest | Best generalization, stable | Slower training |
| KNN Regressor | Simple, captures local patterns | Sensitive to scaling |
| SVR | Handles non-linear patterns | Needs tuning, slow on large data |
## Results

Random Forest achieved the most consistent and reliable performance, while SVR showed strong results with proper scaling and tuning.
## Key Takeaways

- Preprocessing dramatically affects model performance
- Ensemble methods reduce variance and improve generalization
- Distance-based models require careful scaling
- Kernel methods can outperform others with the right parameters
- EDA is essential before selecting or tuning models
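The scaling point can be made concrete with a hypothetical setup: two features on very different scales, where the large-scale one is pure noise. Unscaled, KNN's Euclidean distances are dominated by the noisy axis; standardizing restores the informative feature's influence:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)                # informative feature, scale ~1
x2 = rng.normal(scale=1000.0, size=300)  # uninformative feature, scale ~1000
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=300)

# Without scaling, neighbors are chosen almost entirely by the noisy x2 axis
unscaled_score = cross_val_score(
    KNeighborsRegressor(n_neighbors=5), X, y, cv=5, scoring="r2"
).mean()

# With standardization, both features contribute comparably to the distance
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_score = cross_val_score(scaled_knn, X, y, cv=5, scoring="r2").mean()
```

On this construction the scaled pipeline should score clearly higher in cross-validated R², which is the behavior the takeaway describes.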