This repository demonstrates a complete machine learning engineering (MLE) workflow that turns a standard classification task into a deployable, reproducible product.
Unlike typical data science notebooks, this project focuses on production readiness:
- Hybrid Ensemble Architecture: Combines a statistical baseline (Softmax(TF-IDF + LR)) with a neural network (MLP(Structured + TF-IDF → SVD)).
- Lightweight Inference Engine: A standalone pred.py script that performs inference using pure NumPy, removing heavy production dependencies like PyTorch or Scikit-Learn.
- Drift Prevention: Shared feature engineering logic (features.py) ensures strict alignment between training and inference pipelines.
- Automated Reporting: Automatically generates confusion matrices and per-class F1 charts upon training.
.
├── src/
│ ├── train.py # Main pipeline: Train -> Eval -> Export -> Report
│ ├── features.py # Shared feature engineering (prevents training-serving skew)
│ ├── export_artifacts.py # Serializes weights/vocab for the numpy inference engine
│ ├── reporting.py # Metrics calculation and visualization (Matplotlib)
│ └── config.py # Centralized configuration (hyperparams, column definitions)
├── baselines/
│ └── knn_baseline.py # KNN baseline for performance benchmarking
├── pred.py # NumPy-only inference script (loads artifacts/)
├── artifacts/ # Serialized model assets (generated during training)
├── reports/ # Performance graphs and metrics (generated during training)
└── data/
    └── training_data_clean.csv # (Not committed) Place your dataset here
Set up a clean virtual environment and install dependencies:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
Place the dataset at data/.
- Required Column: label (target class).
- Optional Column: student_id (used for group-aware splitting, which prevents data leakage from the same user appearing in both the train and test sets).
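The idea behind group-aware splitting can be sketched in a few lines. This is a minimal illustration, not the code in src/train.py: the function name `group_split`, the 20% test fraction, and the toy IDs are all assumptions made for the example.

```python
import random
from collections import defaultdict

def group_split(group_ids, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that no group spans both sets."""
    # Collect row indices per group (e.g. per student_id).
    rows_by_group = defaultdict(list)
    for i, g in enumerate(group_ids):
        rows_by_group[g].append(i)

    # Shuffle the groups, not the rows, then hold out whole groups.
    groups = sorted(rows_by_group)
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))

    test_idx = [i for g in groups[:n_test] for i in rows_by_group[g]]
    train_idx = [i for g in groups[n_test:] for i in rows_by_group[g]]
    return sorted(train_idx), sorted(test_idx)

# Toy data: eight rows written by five students.
student_ids = ["s1", "s1", "s2", "s3", "s3", "s4", "s5", "s5"]
train_idx, test_idx = group_split(student_ids)
train_students = {student_ids[i] for i in train_idx}
test_students = {student_ids[i] for i in test_idx}
```

Because whole groups are held out, every row from a given student lands on exactly one side of the split.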
Run the full pipeline. This command handles training, evaluation, artifact export, and report generation:
python -m src.train --data data/training_data_clean.csv --seed 42
Outputs:
- artifacts/*: JSON and NPY files containing vocabularies, IDF vectors, and model weights.
- reports/metrics.json: Detailed evaluation metrics.
- reports/confusion_matrix.png & per_class_f1.png: Visualization of model performance.
Use the lightweight pred.py script to predict on new data. This script mimics a production environment by loading the exported artifacts and running inference without the training stack.
python pred.py path/to/unlabeled.csv > preds.txt
Note: pred.py exposes a simple API: predict_all(csv_path) -> List[str].
The system automatically compares the ensemble model against baselines. The table below lists the reported results:
| Model | Accuracy | Macro-F1 | Architecture Notes |
|---|---|---|---|
| KNN Baseline | 0.5679 | 0.5646 | Features: Likert Scale + Keyword Indicators |
| Softmax Branch | 0.6364 | 0.6159 | TF-IDF + Logistic Regression |
| MLP Branch | 0.6909 | 0.6876 | One-Hot + SVD → 3-Layer MLP |
| **Ensemble** | 0.7030 | 0.6969 | Weighted Probability Averaging |
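For readers unfamiliar with the Macro-F1 column: it is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. The sketch below shows the calculation on made-up labels; it is not taken from src/reporting.py, and `macro_f1` is a name chosen for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (illustrative only).
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```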
To run the KNN baseline for comparison:
python -m baselines.knn_baseline --data data/training_data_clean.csv
To simulate a low-latency, lightweight deployment environment, the inference engine was rebuilt from scratch using only NumPy.
- Custom TF-IDF & SVD: Implemented vectorization logic that mirrors Scikit-Learn but relies solely on the exported vocab.json and idf.npy.
- Matrix Operations: The MLP forward pass and Logistic Regression probability calculations are performed via raw matrix multiplication.
- Benefit: Dropping PyTorch and Scikit-Learn from the runtime shrinks the inference Docker image and shortens cold-start times.
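A NumPy-only TF-IDF row followed by a softmax over linear logits might look like the sketch below. It assumes raw term counts and L2 normalization (Scikit-Learn's TfidfVectorizer defaults); the vocabulary, IDF values, and weights are toy stand-ins for the exported artifacts, and the function names are hypothetical rather than those used in pred.py.

```python
import numpy as np

def tfidf_vector(tokens, vocab, idf):
    """Build one L2-normalized TF-IDF row from pre-tokenized text."""
    vec = np.zeros(len(idf), dtype=np.float64)
    for tok in tokens:
        col = vocab.get(tok)
        if col is not None:   # out-of-vocabulary tokens are ignored
            vec[col] += 1.0   # raw term count
    vec *= idf
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy stand-ins for vocab.json, idf.npy, and the exported LR weights.
vocab = {"good": 0, "bad": 1, "course": 2}
idf = np.array([1.5, 1.5, 1.0])
W = np.array([[2.0, -2.0], [-2.0, 2.0], [0.0, 0.0]])  # (n_features, n_classes)
b = np.zeros(2)

x = tfidf_vector(["good", "course", "unknown"], vocab, idf)
probs = softmax(x @ W + b)  # class probabilities for one document
```

Subtracting the row maximum before exponentiating keeps the softmax from overflowing on large logits without changing the result.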
A common pitfall in ML systems is Training-Serving Skew. This project solves it by isolating feature logic in src/features.py:
- Text Construction: Consistent concatenation of Likert scales, multi-select headers, and raw text.
- Tokenization: Regex-based tokenization used in train.py is exactly replicated in pred.py.
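As an illustration of why a single shared module prevents skew, the sketch below defines text construction and tokenization once so that both training and inference would import the same functions. The token pattern, column names, and function signatures are assumptions made for the example; the real definitions live in src/features.py.

```python
import re

# Hypothetical token pattern (lowercase letters, digits, underscores).
TOKEN_RE = re.compile(r"[a-z0-9_]+")

def build_text(row, likert_cols, text_cols):
    """Concatenate Likert answers (as col_value tokens) and free text."""
    parts = [f"{col}_{row.get(col, '')}" for col in likert_cols]
    parts += [str(row.get(col, "")) for col in text_cols]
    return " ".join(parts)

def tokenize(text):
    """Lowercase the text, then extract runs matching TOKEN_RE."""
    return TOKEN_RE.findall(text.lower())

# The same two calls would run at training time and at inference time.
row = {"q1": 4, "comment": "Great course, would recommend!"}
tokens = tokenize(build_text(row, likert_cols=["q1"], text_cols=["comment"]))
```

If the pattern or the concatenation order changed in only one of the two code paths, the inference-time tokens would no longer line up with the exported vocabulary.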
The src/export_artifacts.py module handles serialization of:
- Scikit-Learn pipelines (Imputers, Scalers).
- PyTorch model weights (transposed for NumPy compatibility).
- Ensemble weights.
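The transpose matters because PyTorch's nn.Linear stores its weight as (out_features, in_features) and computes x @ W.T + b, whereas a NumPy forward pass is simplest with (in_features, out_features). The sketch below uses random arrays in place of a real state_dict, a toy two-layer network, and hypothetical file names; it is not the actual export code.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# Arrays standing in for tensors pulled from a PyTorch state_dict.
# Weight shapes follow nn.Linear: (out_features, in_features).
torch_style = {
    "fc1.weight": rng.normal(size=(8, 5)), "fc1.bias": rng.normal(size=8),
    "fc2.weight": rng.normal(size=(3, 8)), "fc2.bias": rng.normal(size=3),
}

def export_weights(state, out_dir):
    """Save each array as .npy, transposing 2-D weights to (in, out)."""
    for name, arr in state.items():
        arr = arr.T if arr.ndim == 2 else arr
        np.save(os.path.join(out_dir, name + ".npy"), arr)

def mlp_forward(x, out_dir):
    """Two-layer forward pass with ReLU, using only the exported files."""
    def load(name):
        return np.load(os.path.join(out_dir, name + ".npy"))
    h = np.maximum(x @ load("fc1.weight") + load("fc1.bias"), 0.0)
    return h @ load("fc2.weight") + load("fc2.bias")

x = rng.normal(size=(4, 5))
with tempfile.TemporaryDirectory() as tmp:
    export_weights(torch_style, tmp)
    logits = mlp_forward(x, tmp)

# Reference computed the PyTorch way: x @ W.T + b.
h_ref = np.maximum(x @ torch_style["fc1.weight"].T + torch_style["fc1.bias"], 0.0)
logits_ref = h_ref @ torch_style["fc2.weight"].T + torch_style["fc2.bias"]
```

Comparing `logits` against `logits_ref` is a cheap parity check worth running after every export.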
Hyperparameters and column definitions are managed centrally in src/config.py.
- Hardware: Automatically detects CUDA/CPU.
- Ensemble Weights: Adjustable in src/config.py (Default: 0.5/0.5).
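Weighted probability averaging with the default 0.5/0.5 weights can be sketched as follows. The function name, parameter names, and probability values are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def ensemble_predict(p_softmax, p_mlp, classes, w_softmax=0.5, w_mlp=0.5):
    """Blend two (n_samples, n_classes) probability matrices, then take argmax."""
    blended = w_softmax * p_softmax + w_mlp * p_mlp
    # Renormalize so rows still sum to 1 if the weights do not.
    blended = blended / blended.sum(axis=1, keepdims=True)
    labels = [classes[i] for i in blended.argmax(axis=1)]
    return labels, blended

# Toy branch outputs for two samples and three classes.
classes = ["A", "B", "C"]
p_softmax = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_mlp = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]])
labels, blended = ensemble_predict(p_softmax, p_mlp, classes)
```

In this toy case each branch alone would pick a different class for both samples; the blend resolves the disagreement in favor of the branch that is more confident.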