Sentiment & Survey Response Classification System

End-to-End | Ensemble Learning | Production-Ready Inference

This repository demonstrates a complete machine learning engineering (MLE) workflow that turns a standard classification task into a deployable, reproducible product.

Unlike typical data science notebooks, this project focuses on production readiness:

Hybrid Ensemble Architecture: Combines a statistical baseline (Softmax(TF-IDF + LR)) with a Neural Network (MLP(Structured + TF-IDF → SVD)).
Lightweight Inference Engine: A standalone pred.py script that performs inference using Pure NumPy, removing heavy production dependencies like PyTorch or Scikit-Learn.
Drift Prevention: Shared feature engineering logic (features.py) ensures strict alignment between training and inference pipelines.
Automated Reporting: Automatically generates Confusion Matrices and Per-Class F1 charts upon training.

Repository Structure

.
├── src/
│   ├── train.py               # Main pipeline: Train -> Eval -> Export -> Report
│   ├── features.py            # Shared feature engineering (prevents training-serving skew)
│   ├── export_artifacts.py    # Serializes weights/vocab for the numpy inference engine
│   ├── reporting.py           # Metrics calculation and visualization (Matplotlib)
│   └── config.py              # Centralized configuration (hyperparams, column definitions)
├── baselines/
│   └── knn_baseline.py        # KNN baseline for performance benchmarking
├── pred.py                    # NumPy-only inference script (loads artifacts/)
├── artifacts/                 # Serialized model assets (generated during training)
├── reports/                   # Performance graphs and metrics (generated during training)
└── data/
    └── training_data_clean.csv  # (Not committed) Place your dataset here

Quick Start

1. Installation

Set up a clean virtual environment and install dependencies:

python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

2. Data Setup

Place the dataset at data/training_data_clean.csv.

Required Column: label (Target class)
Optional Column: student_id (Used for Group-Aware Splitting to prevent data leakage from the same user appearing in both train and test sets).
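
One way to picture Group-Aware Splitting: shuffle the unique student_id values and assign whole groups to one side of the split, so no student contributes rows to both. The sketch below is a minimal illustration using only the standard library. The function name group_split, the test_fraction parameter, and the sample IDs are hypothetical and are not taken from src/train.py.

```python
import random


def group_split(group_ids, test_fraction=0.2, seed=42):
    """Split row indices so that no group appears on both sides.

    group_ids: one group key (e.g. student_id) per row.
    Returns (train_indices, test_indices).
    """
    unique_groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # Reserve at least one whole group for the test side.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])

    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Hypothetical student IDs: each student answered several survey rows.
student_ids = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5", "s5", "s6"]
train_idx, test_idx = group_split(student_ids, test_fraction=0.3, seed=42)

train_students = {student_ids[i] for i in train_idx}
test_students = {student_ids[i] for i in test_idx}
```

Because the split is made over groups rather than rows, the realized test share of rows can differ from test_fraction when groups have uneven sizes.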

3. Train & Export

Run the full pipeline. This command handles training, evaluation, artifact export, and report generation:

python -m src.train --data data/training_data_clean.csv --seed 42

Outputs:

artifacts/*: JSON and NPY files containing vocabularies, IDF vectors, and model weights.
reports/metrics.json: Detailed evaluation metrics.
reports/confusion_matrix.png and reports/per_class_f1.png: Visualizations of model performance.

4. Inference (Production Mode)

Use the lightweight pred.py script to predict on new data. This script mimics a production environment by loading the exported artifacts and running inference without the training stack.

python pred.py path/to/unlabeled.csv > preds.txt

Note: pred.py exposes a simple API: predict_all(csv_path) -> List[str].

Performance & Benchmarking

The system compares the ensemble model against its individual branches and a KNN baseline. Replace the figures below with the results from your own run:

Model           Accuracy  Macro-F1  Architecture Notes
KNN Baseline    0.5679    0.5646    Features: Likert Scale + Keyword Indicators
Softmax Branch  0.6364    0.6159    TF-IDF + Logistic Regression
MLP Branch      0.6909    0.6876    One-Hot + SVD → 3-Layer MLP
Ensemble        0.7030    0.6969    Weighted Probability Averaging

To run the KNN baseline for comparison:

python -m baselines.knn_baseline --data data/training_data_clean.csv
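
For readers comparing the two metric columns: Macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. A minimal plain-Python sketch follows. The helper name macro_f1 and the toy labels are illustrative only, and src/reporting.py may compute the metric differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); the guard avoids division by zero.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, purely illustrative.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
score = macro_f1(y_true, y_pred)
```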

Engineering Highlights

1. Pure NumPy Inference (`pred.py`)

To simulate a low-latency, lightweight deployment environment, the inference engine was rebuilt from scratch using only NumPy.

Custom TF-IDF & SVD: Implemented vectorization logic that mirrors Scikit-Learn but relies solely on the exported vocab.json and idf.npy.
Matrix Operations: The MLP forward pass and Logistic Regression probability calculations are performed via raw matrix multiplication.
Benefit: Reduces the size of the inference Docker image and shortens cold-start time, because PyTorch and Scikit-Learn no longer need to be installed or imported.
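
The points above can be sketched in a few NumPy functions. This is an illustration under stated assumptions, not the contents of pred.py: the token pattern, the ReLU activation, the weight shapes, and every function name are hypothetical, and the SVD projection step is left out for brevity. The TF-IDF variant shown (raw term counts, multiplied by IDF, then L2-normalized) matches Scikit-Learn's default TfidfVectorizer behavior.

```python
import re

import numpy as np

TOKEN_RE = re.compile(r"\b\w\w+\b")  # assumed pattern, not confirmed by the repo


def tfidf_vector(text, vocab, idf):
    """TF-IDF with raw term counts and L2 normalization."""
    x = np.zeros(len(vocab), dtype=np.float64)
    for token in TOKEN_RE.findall(text.lower()):
        idx = vocab.get(token)
        if idx is not None:  # out-of-vocabulary tokens are ignored
            x[idx] += 1.0
    x *= idf
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def lr_predict_proba(x, W, b):
    """Multinomial logistic regression: softmax(x @ W + b)."""
    return softmax(x @ W + b)


def mlp_predict_proba(x, layers):
    """Forward pass: ReLU after every layer except the last, then softmax."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return softmax(h)


# Toy stand-ins for vocab.json, idf.npy, and the exported weights.
vocab = {"good": 0, "bad": 1, "course": 2}
idf = np.array([1.5, 1.5, 1.0])
rng = np.random.default_rng(0)
W_lr, b_lr = rng.normal(size=(3, 2)), np.zeros(2)
layers = [
    (rng.normal(size=(3, 4)), np.zeros(4)),
    (rng.normal(size=(4, 4)), np.zeros(4)),
    (rng.normal(size=(4, 2)), np.zeros(2)),
]

x = tfidf_vector("Good course, really good!", vocab, idf)
p_lr = lr_predict_proba(x, W_lr, b_lr)
p_mlp = mlp_predict_proba(x, layers)
```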

2. Feature Consistency

A common pitfall in ML systems is training-serving skew, where features are computed differently at training time and at inference time. This project guards against it by isolating feature logic in src/features.py:

Text Construction: Consistent concatenation of Likert scales, multi-select headers, and raw text.
Tokenization: Regex-based tokenization used in train.py is exactly replicated in pred.py.
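
A shared feature module of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the token pattern, the function names build_text and tokenize, the column names, and the way Likert and multi-select answers are turned into tokens. The actual logic lives in src/features.py.

```python
import re

# Assumed pattern: tokens of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def build_text(row, likert_cols, multi_cols, text_cols):
    """Concatenate Likert answers, multi-select answers, and free text."""
    parts = []
    for col in likert_cols:
        value = row.get(col)
        if value not in (None, ""):
            # Prefix with the column name so equal ratings on different
            # questions produce different tokens.
            parts.append(f"{col}_{value}")
    for col in multi_cols:
        for choice in str(row.get(col) or "").split(","):
            choice = choice.strip()
            if choice:
                parts.append(f"{col}_{choice.replace(' ', '_')}")
    for col in text_cols:
        parts.append(str(row.get(col) or ""))
    return " ".join(parts)


def tokenize(text):
    """Lowercase, then apply the single shared regex."""
    return TOKEN_RE.findall(text.lower())


# Hypothetical survey row.
row = {"q1_rating": 4, "q2_tools": "Slides, Live Coding", "q3_comment": "Great pacing!"}
text = build_text(row, ["q1_rating"], ["q2_tools"], ["q3_comment"])
tokens = tokenize(text)
```

Keeping the pattern in one constant means a change to tokenization only has to be made once, which is the property the inference side must preserve.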

3. Automated Artifact Export

The src/export_artifacts.py module handles the complex task of serializing:

Scikit-Learn pipelines (Imputers, Scalers).
PyTorch model weights (transposed for NumPy compatibility).
Ensemble weights.
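
The transpose step deserves a short illustration. A PyTorch nn.Linear layer stores its weight as (out_features, in_features) and computes x @ W.T + b, so transposing once at export time lets the NumPy side compute x @ W + b directly. The sketch below simulates this with a NumPy array standing in for the PyTorch tensor. The file names mlp_W0.npy and mlp_b0.npy are hypothetical, and the temporary directory stands in for artifacts/.

```python
import json
import os
import tempfile

import numpy as np

# Stand-in for an nn.Linear weight, stored as (out_features, in_features).
rng = np.random.default_rng(0)
torch_style_weight = rng.normal(size=(4, 3))  # 3 inputs -> 4 outputs
bias = rng.normal(size=4)
vocab = {"good": 0, "bad": 1, "course": 2}

with tempfile.TemporaryDirectory() as artifacts_dir:
    # Export: transpose once so inference can compute x @ W + b.
    np.save(os.path.join(artifacts_dir, "mlp_W0.npy"), torch_style_weight.T)
    np.save(os.path.join(artifacts_dir, "mlp_b0.npy"), bias)
    with open(os.path.join(artifacts_dir, "vocab.json"), "w", encoding="utf-8") as f:
        json.dump(vocab, f)

    # Inference side: load the artifacts back with NumPy and json only.
    W = np.load(os.path.join(artifacts_dir, "mlp_W0.npy"))
    b = np.load(os.path.join(artifacts_dir, "mlp_b0.npy"))
    with open(os.path.join(artifacts_dir, "vocab.json"), encoding="utf-8") as f:
        loaded_vocab = json.load(f)

x = np.ones((2, 3))
exported_out = x @ W + b
torch_style_out = x @ torch_style_weight.T + bias  # what nn.Linear computes
```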

Configuration

Hyperparameters and column definitions are managed centrally in src/config.py.

Hardware: Automatically detects CUDA/CPU.
Ensemble Weights: Adjustable in src/config.py (Default: 0.5/0.5).
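
To show what the ensemble weights control, here is a minimal sketch of weighted probability averaging over the two branches. The function name ensemble_proba, the class labels, and the probability values are made up for illustration. Only the 0.5/0.5 default comes from the configuration described above.

```python
import numpy as np


def ensemble_proba(p_softmax, p_mlp, w_softmax=0.5, w_mlp=0.5):
    """Weighted average of two (n_samples, n_classes) probability arrays."""
    total = w_softmax + w_mlp  # normalize so the result still sums to 1
    return (w_softmax * p_softmax + w_mlp * p_mlp) / total


classes = np.array(["negative", "neutral", "positive"])  # hypothetical labels
p_softmax = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_mlp = np.array([[0.2, 0.3, 0.5], [0.1, 0.2, 0.7]])

p = ensemble_proba(p_softmax, p_mlp)
labels = classes[p.argmax(axis=1)].tolist()

# Shifting weight toward the MLP branch can change the predicted class.
p_mlp_heavy = ensemble_proba(p_softmax, p_mlp, w_softmax=0.2, w_mlp=0.8)
labels_mlp_heavy = classes[p_mlp_heavy.argmax(axis=1)].tolist()
```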
