An end-to-end Data Science & MLOps pipeline designed to predict customer churn using the IBM Telco Customer dataset. This repository demonstrates a complete machine learning lifecycle, focusing on reproducibility, clean code, and production-ready serving.
Customer churn prediction matters to telecommunications companies: by forecasting which customers are at risk of leaving, they can target those customers with retention strategies before they cancel and protect recurring revenue. This project covers the full workflow needed to train a model and expose it as an API.
Technologies Used: Python · Poetry · Uvicorn/FastAPI (Serving) · MLflow (Tracking) · Sphinx (Documentation)
graph LR
A[Raw Data] -->|make_dataset.py| B(Clean Data)
B -->|build_features.py| C(Processed Features)
C -->|train.py| D[XGBoost / LightGBM / CatBoost]
D --> E((MLflow Tracking))
D --> F[FastAPI Serving]
F -->|predict API| G(Client)
- Python 3.12 (recommended)
- Poetry for environment and dependency management
Clone the repository and install the dependencies:
pip install poetry
poetry install
This will create an isolated virtual environment and automatically download all the dependencies defined in pyproject.toml.
.
├── README.md # Project description
├── pyproject.toml # Poetry dependencies configuration
├── notebooks/ # Jupyter notebooks (EDA, features, models, tracking)
│ ├── 01_eda.ipynb # Exploratory Data Analysis (EDA)
│ ├── 02_features.ipynb # Feature creation and transformation
│ ├── 03_models.ipynb # Model training and validation
│ └── 04_mlflow.ipynb # Experiment tracking with MLflow
├── reports/ # Rendered HTML results and figures
├── docs/ # Sphinx generated documentation
├── src/ # Main source code
│ ├── data/
│ │ └── make_dataset.py # Data downloading and cleaning
│ ├── features/
│ │ ├── build_features.py # Feature engineering and preparation
│ │ └── feature_selection.py # Feature selection methods
│ ├── models/
│ │ ├── models.py # Classifiers definition and hyperparameter grids
│ │ ├── train.py # Nested CV training and artifact saving
│ │ └── predict.py # Batch predictions with trained models
│ └── serving/
│     └── app.py # Prediction API (FastAPI)
- src/data/make_dataset.py: Downloads, cleans, and transforms raw data to prepare it for modeling.
- src/features/build_features.py: Handles feature engineering, encoding, scaling, and splitting data.
- src/features/feature_selection.py: Utilities for variable selection (variance, correlation, collinearity, RFECV).
- src/models/models.py: Defines available classifiers and their hyperparameter grids.
- src/models/train.py: Trains models with nested cross-validation, selects the optimal threshold, and persists artifacts.
- src/models/predict.py: Generates batch predictions from a saved model.
- src/serving/app.py: FastAPI application exposing the production models via REST endpoints.
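The description of src/models/train.py mentions selecting an optimal decision threshold, but this README does not say which criterion is used. As a rough illustration only, the sketch below assumes the threshold is chosen to maximize F1 on held-out predictions. The function names `f1_at_threshold` and `select_threshold` and the toy data are hypothetical and are not taken from the repository.

```python
from typing import Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """F1 score when probabilities >= threshold are labelled as churn (1)."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(y_true: Sequence[int], y_prob: Sequence[float]) -> Tuple[float, float]:
    """Scan the observed probabilities as candidate cut-offs and keep the best F1.

    Assumption: F1 is the selection criterion; the real train.py may use another one.
    Ties are resolved in favour of the lowest threshold.
    """
    best_threshold, best_f1 = 0.5, -1.0
    for candidate in sorted(set(y_prob)):
        score = f1_at_threshold(y_true, y_prob, candidate)
        if score > best_f1:
            best_threshold, best_f1 = candidate, score
    return best_threshold, best_f1


# Toy validation fold (made-up numbers, for illustration only).
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
threshold, f1 = select_threshold(y_true, y_prob)
print(f"threshold={threshold:.2f} f1={f1:.3f}")
```

In a nested cross-validation setup, a selection like this would normally run on the inner validation folds only, so that the outer test folds stay untouched when the final metrics are reported.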
- Install Dependencies
poetry install
- Generate Documentation
- Windows: .\docs\make.bat html
- Linux/macOS: make -C docs html
- Download 'Telco Churn' Data
poetry run python src/data/make_dataset.py --out data/raw/telco.csv
- Data Preparation
poetry run python src/features/build_features.py --in data/preprocessed/telco_preprocessed.xlsx --out data/processed --kind cc
- Train, Evaluate, and Save Model
poetry run python src/models/train.py --data data/processed --models models
Note: This script may take up to an hour to execute. It saves the trained model and the test metrics.
- Serve the API
poetry run uvicorn src.serving.app:app --host 127.0.0.1 --port 8000
- Documentation is available at: GET http://localhost:8000/documentation
- Note: In the production repo this is exposed via GitHub Pages.
- Example Prediction Request
curl -X POST http://127.0.0.1:8000/predict -H "Content-Type: application/json" --data @sample.json
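The curl request above can also be issued from Python. The sketch below uses only the standard library and mirrors the same URL, method, and Content-Type header. The payload schema and the shape of the response are not documented in this README, so the helper names `build_predict_request` and `predict` are hypothetical and the code simply forwards whatever JSON object it is given.

```python
import json
import urllib.request

# Host, port, and path are taken from the curl example above.
API_URL = "http://127.0.0.1:8000/predict"


def build_predict_request(payload: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl command."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload: dict, url: str = API_URL, timeout: float = 10.0) -> object:
    """Send the payload to the running API and return the decoded JSON response.

    Assumption: the endpoint answers with a JSON body; its fields are not
    specified here, so the result is returned as-is.
    """
    request = build_predict_request(payload, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from the previous step running, loading sample.json with `json.load` and passing the result to `predict` should be equivalent to the curl call.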
To view the experiment logs and interactive UI:
poetry run mlflow ui
After starting the server, run the 04_mlflow.ipynb notebook to log the experiments.