This project applies regularized statistical learning methods (Ridge, Lasso, and Elastic Net) across multiple machine learning paradigms. Using two distinct datasets, it examines how regularization affects model performance, overfitting, and generalization in each algorithmic approach.
The work showcases expertise in:
- Multi-paradigm Analysis: Comparative study across OLS, SVM, and Neural Networks
- Regularization Techniques: Ridge (L2), Lasso (L1), and Elastic Net (L1+L2) implementations
- Real-world Applications: Crime prediction and health analytics use cases
- Rigorous Evaluation: Statistical inference, cross-validation, and performance metrics
1. Boston Housing Dataset
- Target: Predict per capita crime rates in Boston suburbs
- Features: 13 neighborhood characteristics (506 observations)
- Challenges: Heavy right-skew, multicollinearity, outliers
- Solution: Log-transformation and regularization techniques
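The log-transform-plus-regularization idea can be sketched as follows. This is a minimal illustration on synthetic right-skewed data standing in for the Boston data, and it assumes scikit-learn; the alpha grid, the variable names, and the choice of Ridge are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 506 rows, 13 features, two nearly collinear columns.
X = rng.normal(size=(506, 13))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=506)
# A log-normal target is strictly positive and heavily right-skewed.
y = np.exp(0.8 * X[:, 0] - 0.5 * X[:, 2] + 0.3 * rng.normal(size=506))

# Standardize features so the L2 penalty treats them on a common scale,
# and let RidgeCV pick the penalty strength from a grid.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))

# Cross-validated R^2 on the log scale, where the skewed tail no longer dominates.
log_scale_r2 = cross_val_score(ridge, X, np.log(y), cv=5, scoring="r2").mean()

# Fit on log(y) and map predictions back to the original scale with exp.
model = TransformedTargetRegressor(regressor=ridge, func=np.log, inverse_func=np.exp)
model.fit(X, y)
predictions = model.predict(X)
```

If the real target contains zeros, `np.log1p` and `np.expm1` would be the safer pair of transforms.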
2. Life Expectancy Dataset
- Source: WHO Life Expectancy Dataset (Kaggle)
- 2900+ rows, 20 predictors across health, demographic, and economic domains
- Target: Life expectancy at birth (years)
- Predictors include:
  - Health: Adult Mortality, Infant deaths, Immunisation coverage, HIV/AIDS, BMI
  - Demographics: Country, Year, Status (Developed/Developing)
  - Economics: GDP per capita, Schooling, Income composition of resources, Health expenditure
- Data challenges addressed: Missing values handled with multi-level imputation (moving averages, country means, global medians)
- Solution: Advanced imputation strategies and nonlinear modeling
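The multi-level imputation described above can be sketched with pandas as three fallback passes. The function name `impute_multilevel`, the window size, and the toy data are hypothetical; the project's actual ordering and window may differ.

```python
import numpy as np
import pandas as pd


def impute_multilevel(df, group_col, time_col, value_cols, window=3):
    """Fill gaps with a rolling mean per group, then the group mean, then the global median."""
    out = df.sort_values([group_col, time_col]).copy()
    for col in value_cols:
        # Fallback statistics are computed from the observed values only.
        group_mean = out.groupby(group_col)[col].transform("mean")
        global_median = out[col].median()
        # Level 1: centered moving average over neighboring years within each group.
        rolling = out.groupby(group_col)[col].transform(
            lambda s: s.rolling(window, center=True, min_periods=1).mean()
        )
        out[col] = out[col].fillna(rolling)
        # Level 2: group (country) mean for gaps the moving average cannot reach.
        out[col] = out[col].fillna(group_mean)
        # Level 3: global median for groups with no observed values at all.
        out[col] = out[col].fillna(global_median)
    return out


# Toy data (assumption): country C has no observed GDP at all.
df = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 3 + ["C"] * 2,
    "Year": [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2000, 2001],
    "GDP": [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 10.0, np.nan, np.nan],
})
filled = impute_multilevel(df, "Country", "Year", ["GDP"])
```

Ordering the passes from most local to most global keeps each filled value as close as possible to the country's own trajectory before falling back to coarser statistics.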
Machine Learning Pipeline
Raw Data → Preprocessing → Feature Engineering → Model Training → Evaluation → Comparison
Models Implemented
1. Ordinary Least Squares (OLS)
- Baseline unregularized linear regression
- Statistical hypothesis testing
- Cross-validation assessment
2. Regularized Linear Regression
- Ridge Regression: L2 penalty for coefficient shrinkage
- Lasso Regression: L1 penalty for feature selection
- Elastic Net: Combined L1+L2 approach
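To make the difference between the three penalties concrete, the sketch below fits each model on synthetic data where only three of ten features carry signal, then counts the coefficients driven exactly to zero. It assumes scikit-learn; the penalty strengths are arbitrary illustrative values, not tuned settings from the project.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data (assumption): 10 standardized features, only the first 3 matter.
X = StandardScaler().fit_transform(rng.normal(size=(200, 10)))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=200)

models = {
    "ridge": Ridge(alpha=1.0),                            # L2: shrinks, but keeps all features
    "lasso": Lasso(alpha=0.1),                            # L1: can set coefficients to exactly zero
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),   # weighted mix of L1 and L2
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0.0))
    print(f"{name:12s} zero coefficients: {zero_counts[name]}")
```

Ridge typically leaves every coefficient nonzero, while Lasso tends to drop the noise features entirely, which is why it doubles as a feature selector.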
3. Support Vector Machines
- Support Vector Regression (SVR) with L2 regularization
- Support Vector Classification (SVC) with multiple penalty types
- Linear and RBF kernel comparisons
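A linear-versus-RBF kernel comparison for SVR might look like the sketch below. It assumes scikit-learn and uses a synthetic target with a nonlinear component; the value `C=10.0` is illustrative. In `SVR`, `C` is the inverse of the L2 regularization strength, so smaller values mean stronger regularization.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Synthetic data (assumption): one nonlinear effect, one linear effect.
X = rng.uniform(0.0, 5.0, size=(300, 2))
y = 2.0 * np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

scores = {}
for kernel in ("linear", "rbf"):
    # SVR is sensitive to feature scale, so standardize inside the pipeline.
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel, C=10.0))
    scores[kernel] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{kernel:6s} mean CV R^2: {scores[kernel]:.3f}")
```

On data like this the RBF kernel should score higher because the linear kernel cannot represent the sinusoidal term.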
4. Multi-Layer Perceptron (MLP)
- Deep neural networks with regularization
- Comprehensive hyperparameter tuning
- Performance optimization across regularization schemes
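Tuning the regularization strength of an MLP together with its architecture can be sketched as a small grid search. This assumes scikit-learn's `MLPRegressor`, whose `alpha` parameter is an L2 penalty only; how the project applied other penalty types to neural networks is not stated, so this covers just the L2 case. The grid values and synthetic data are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic data (assumption): a nonlinear target with three pure-noise features.
X = rng.normal(size=(300, 5))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(max_iter=1000, random_state=0)),
])

param_grid = {
    "mlp__alpha": [1e-4, 1e-2, 1.0],               # L2 penalty strength
    "mlp__hidden_layer_sizes": [(32,), (32, 16)],  # one or two hidden layers
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)
best_alpha = search.best_params_["mlp__alpha"]
print("best parameters:", search.best_params_)
```

Keeping the scaler inside the pipeline means it is refit on each training fold, which avoids leaking validation statistics into training.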
- Education & Income Matter: Schooling and income composition consistently emerged as strong positive predictors, reinforcing the importance of investing in education and economic equity to extend life expectancy.
- Mortality Reduction is Critical: High infant, child, and adult mortality rates remain the most significant negative drivers. Interventions targeting child survival and disease prevention (e.g., HIV/AIDS) can have outsized effects.
- Non-linear Interactions Exist: Life expectancy is not simply additive. Factors like GDP and BMI show diminishing returns, which only the non-linear models could capture.
- Policy Relevance: Results highlight that health outcomes are multi-dimensional, requiring holistic strategies across healthcare, education, and economic policy.
- Analytics Impact: Regularization improved model reliability on noisy, real-world data, showing its practical role in building robust health prediction systems.