This repository contains an end-to-end machine learning system for predicting mortgage loan default using Fannie Mae data. It includes training pipelines, model artifacts, and a Streamlit-based scoring app.
- Task: Predict 24-month mortgage default probability (origination-time features only; no leakage)
- Model: Logistic Regression (class-weighted baseline)
- Data: Fannie Mae Single-Family Loan Performance (processed features; raw excluded)
- Scale: 338,473 loans • default rate 0.78%
- Performance (holdout): ROC-AUC 0.819 • PR-AUC 0.0536
- Artifacts: artifacts/model.joblib, artifacts/feature_schema.json, artifacts/metadata.json
- Demo: Streamlit scoring app (app/app.py): upload a CSV, get PD + flag + risk bucket
Repro proof: Metrics and training metadata are persisted in artifacts/metadata.json after running python3 -m src.train.
End-to-end credit risk modeling project using Fannie Mae mortgage data.
This repository demonstrates a full pipeline from data preparation and model training
to a deployable inference application for scoring new loans.
The project is intentionally structured to reflect production-oriented workflows: training and inference are separated, model artifacts are persisted, and scoring is exposed through a lightweight application interface.
Objective:
Estimate the probability that a mortgage loan defaults within 24 months of origination,
using only origination-time features (no post-origination data leakage).
Key components:
- Data ingestion, labeling, and feature engineering
- Baseline probability-of-default (PD) model
- Reproducible training pipeline with persisted artifacts
- Browser-based scoring app for inference
.
├── src/
│ ├── train.py # Training pipeline (produces model artifacts)
│ └── config.py # Paths and configuration
├── app/
│ └── app.py # Streamlit inference/scoring app
├── artifacts/
│ ├── model.joblib # Trained model + preprocessing pipeline
│ ├── feature_schema.json # Required input feature contract
│ └── metadata.json # Training metadata and metrics
├── Notebooks/
│ ├── 01_data_ingestion_and_labeling.ipynb
│ ├── 02_feature_engineering.ipynb
│ └── 03_modeling_and_evaluation.ipynb
├── data/
│ └── processed/ # Processed feature data (raw files excluded)
├── requirements.txt
└── README.md
The model is trained using Fannie Mae Single-Family Loan Performance data.
- Raw quarterly loan files are not included in this repository due to size.
- Labeling logic and feature construction are demonstrated in the notebooks.
- The final model uses origination-level features only, avoiding look-ahead bias and data leakage.
- Algorithm: Logistic Regression (baseline)
- Target: 24-month default indicator
- Class imbalance: handled via class weighting
- Evaluation metrics:
  - ROC AUC
  - Precision–Recall AUC
- Thresholding: separated from training and configurable at inference time
The baseline model is intentionally simple and interpretable. The emphasis of this project is on end-to-end system design and reproducibility, not model complexity.
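To make the class-weighting idea concrete, here is a minimal sketch of the common "balanced" heuristic, where each class is weighted by n_samples / (n_classes * class_count). Whether src/train.py uses exactly this scheme is an assumption. The counts are approximations derived from the figures in this README (338,473 loans, about 0.78% default), and `balanced_class_weights` is a hypothetical helper, not part of the repository.

```python
def balanced_class_weights(n_negative: int, n_positive: int) -> dict:
    """Weights inversely proportional to class frequency: n / (n_classes * count)."""
    n_total = n_negative + n_positive
    return {
        0: n_total / (2 * n_negative),  # non-default (majority) class
        1: n_total / (2 * n_positive),  # default (minority) class
    }


# Approximate counts implied by the README figures; not read from the data.
n_loans = 338_473
n_defaults = round(n_loans * 0.0078)
weights = balanced_class_weights(n_loans - n_defaults, n_defaults)
print(weights)
```

Under these assumptions each defaulted loan carries a weight of roughly 64 against roughly 0.5 for a performing loan, which is why the fitted probabilities sit well above the true base rate.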
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 -m src.train
ls -lah artifacts
python3 -c "import json; print(json.dumps(json.load(open('artifacts/metadata.json')), indent=2))"
A lightweight inference application is included to score new loans using the trained model artifact. The app loads the persisted preprocessing + model pipeline and applies it deterministically at inference time.
streamlit run app/app.py
The application opens in a browser and allows users to upload a CSV for scoring.
The app expects a clean origination-level feature CSV with the following columns:
- orig_interest_rate
- orig_upb
- orig_loan_term
- property_type
- loan_purpose
- property_state
- loan_type
Raw Fannie Mae quarterly performance files are intentionally not accepted by the app.
In a production system, ETL and labeling occur upstream; the scoring service operates on
validated feature tables only.
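The input contract can be checked before any scoring happens. The sketch below is one way to do that with the standard library. The column names are copied from this README; the actual contract lives in artifacts/feature_schema.json, whose layout is not shown here, so the hardcoded list is an assumption. `missing_columns` is a hypothetical helper and the sample values are made up for illustration.

```python
import csv
import io

# Column names as listed in this README (assumed to match feature_schema.json).
REQUIRED_COLUMNS = [
    "orig_interest_rate", "orig_upb", "orig_loan_term",
    "property_type", "loan_purpose", "property_state", "loan_type",
]


def missing_columns(csv_text: str) -> list:
    """Return required columns absent from the CSV header, in contract order."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Illustrative inputs only; the values are not real loan records.
good = (
    "orig_interest_rate,orig_upb,orig_loan_term,property_type,"
    "loan_purpose,property_state,loan_type\n"
    "6.5,250000,360,SF,P,CA,FRM\n"
)
bad = "orig_interest_rate,orig_upb\n6.5,250000\n"

print(missing_columns(good))
print(missing_columns(bad))
```

Rejecting a file with missing columns up front keeps the failure at the boundary, which matches the idea that the scoring service only sees validated feature tables.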
For each loan, the app produces:
- pd_default_24m: predicted probability of default within 24 months
- flag: binary decision based on a configurable threshold
- risk_bucket: coarse risk category (Low / Medium / High) for reporting
The decision threshold can be adjusted at runtime to support different risk policies (e.g., screening vs. underwriting).
Scored results can be downloaded as a CSV.
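The mapping from a predicted PD to the flag and risk bucket could look like the sketch below. The default threshold and the bucket cut points are placeholders chosen for illustration, not the values used in app/app.py, and `score_row` is a hypothetical function name.

```python
def score_row(pd_default_24m: float, threshold: float = 0.5,
              bucket_edges: tuple = (0.2, 0.6)) -> dict:
    """Attach a binary flag and a coarse risk bucket to one predicted PD."""
    if not 0.0 <= pd_default_24m <= 1.0:
        raise ValueError("PD must be in [0, 1]")
    low_edge, high_edge = bucket_edges  # assumed cut points, not from the app
    if pd_default_24m < low_edge:
        bucket = "Low"
    elif pd_default_24m < high_edge:
        bucket = "Medium"
    else:
        bucket = "High"
    return {
        "pd_default_24m": pd_default_24m,
        "flag": int(pd_default_24m >= threshold),  # 1 means flagged
        "risk_bucket": bucket,
    }


# The same PD can be flagged or not depending on the policy threshold.
print(score_row(0.3, threshold=0.5))
print(score_row(0.3, threshold=0.3))
```

Keeping the threshold as an argument rather than baking it into the model is what lets a screening policy and an underwriting policy share one set of artifacts.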
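- Probabilities are uncalibrated: class weighting pushes predicted PDs above the true base rate, so they are suited to ranking and screening rather than to use as literal default rates.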
- Class imbalance is handled via class weighting.
- Threshold selection is policy-dependent and intentionally separated from training.
- The baseline model is intentionally simple; the emphasis of this project is on end-to-end system design and reproducibility rather than model complexity.
- Because default is rare (~0.78%), PR-AUC is emphasized alongside ROC-AUC to reflect real screening performance under class imbalance.
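To put the PR-AUC figure in context: a ranker with no skill has an expected PR-AUC close to the positive-class prevalence, so the reported value is best read as a lift over that baseline. The short calculation below uses only the two numbers quoted in this README.

```python
default_rate = 0.0078  # prevalence reported in this README (~0.78%)
pr_auc = 0.0536        # holdout PR-AUC reported in this README

# A no-skill ranker's expected PR-AUC is approximately the prevalence.
no_skill_pr_auc = default_rate
lift = pr_auc / no_skill_pr_auc
print(f"PR-AUC lift over no-skill baseline: {lift:.1f}x")
```

This prints a lift of about 6.9x, which is a more informative summary than the raw 0.0536 when defaults are this rare.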
MIT License