Peer-to-peer lending platforms like Lending Club, Upstart, and Prosper revolutionized consumer credit by connecting borrowers directly with investors. However, accurate default prediction is critical for platform success:
Key Challenges:
- Credit Access: Traditional FICO-only models reject creditworthy borrowers with thin credit files
- Risk Pricing: Interest rates must accurately reflect default probability to protect investor returns
- Portfolio Performance: High default rates erode investor confidence and platform viability
The Opportunity: As a risk management professional with 10+ years in fintech lending, I've seen how traditional credit scoring models miss nuanced risk signals in alternative data. This project applies machine learning to predict loan defaults using rich behavioral and financial features beyond traditional credit scores.
Goal: Build interpretable ML models that reduce false decline rates by 15% while maintaining portfolio risk within acceptable thresholds, enabling $2-3M in additional revenue from approved creditworthy applicants.
Technical: Build interpretable ML models achieving >0.85 AUC-ROC for credit default prediction
Business: Reduce false decline rate by 15% while maintaining risk appetite, enabling an estimated $2-3M in additional revenue from approved creditworthy applicants
Learning: Master end-to-end ML workflow from EDA to production deployment in fintech context
End-to-end machine learning pipeline: From raw Lending Club data (2.26M loans) to production-ready credit risk models with quantified business impact ($2-5M revenue opportunity).
Current Progress: Week 2 Complete (25%) ✅ → Week 3 Feature Engineering In Progress 🔄
- ✅ Week 1: Exploratory Data Analysis completed
- ✅ Week 2 Day 1: Missing value handling completed
- 🔄 Week 2 Day 2: Feature engineering & data leakage prevention in progress
This week: FICO binning, interaction features, categorical encoding, data leakage audit, train/test split
- Source: Lending Club Loan Data (2007-2018)
- Platform: Kaggle / Lending Club public data
- Link: https://www.kaggle.com/datasets/wordsforthewise/lending-club
- Records: 2,260,701 personal loans from US peer-to-peer lending marketplace
- Original Features: 151 variables
- After Initial Cleaning: 95 variables (removed structurally missing columns)
- Key Variables:
- Credit Metrics: FICO score, credit history length, delinquencies, public records
- Loan Characteristics: Amount, term (36/60 months), interest rate, grade (A-G), purpose
- Borrower Profile: Annual income, employment length, home ownership, debt-to-income ratio
- Behavioral Data: Payment history, credit utilization, number of open accounts
- Target: Loan status (Fully Paid, Charged Off, Default)
- Time Period: 2007-2018 (includes pre/post financial crisis data)
- Geographic Coverage: Across all 50 US states
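The three statuses listed under Target can be collapsed into a binary label before modeling. The snippet below is a minimal sketch of one way to do that, not the project's confirmed implementation. The column name `is_default` is hypothetical, and the "Current" status in the sample frame is an assumption standing in for any loan that has no final outcome yet.

```python
import pandas as pd

# Statuses treated as bad outcomes (both are named in the Target line above).
DEFAULT_STATUSES = {"Charged Off", "Default"}
# Status treated as a good outcome.
PAID_STATUSES = {"Fully Paid"}


def build_target(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only loans with a final outcome and add a 0/1 `is_default` column."""
    resolved = df[df["loan_status"].isin(DEFAULT_STATUSES | PAID_STATUSES)].copy()
    resolved["is_default"] = resolved["loan_status"].isin(DEFAULT_STATUSES).astype(int)
    return resolved


# Tiny illustrative frame; "Current" is an assumed unresolved status.
sample = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Charged Off", "Current", "Default", "Fully Paid"]}
)
labeled = build_target(sample)
print(labeled["is_default"].tolist())
```

Dropping unresolved loans rather than labeling them as non-defaults avoids counting loans that simply have not had time to fail.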
- Python 3.8+
- Pandas - Data manipulation
- NumPy - Numerical computing
- Scikit-learn - Machine learning
- Matplotlib & Seaborn - Visualization
- Jupyter Notebook - Interactive analysis
git clone https://github.com/HammurabiCodes/Credit-Risk-Prediction-ML.git
cd Credit-Risk-Prediction-ML
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Download the data from the Kaggle Lending Club Dataset page and place it in the data folder.
jupyter notebook
- Total Loans: 2,260,701
- Time Period: 2007-2018
- Overall Default Rate: 15.68%
- Original Features: 151 columns
- Loan Amount Range: $500 - $40,000 (Average: $15,047, Median: $12,900)
- Average FICO Score: 703
- Average DTI Ratio: 18.8%
1. FICO Score Impact
- FICO <660: 30.88% default rate (HIGH RISK)
- FICO 660-700: 15.96% default rate
- FICO 700-740: 9.91% default rate
- FICO 740-780: 6.12% default rate
- FICO >780: 4.14% default rate (LOW RISK)
Key Insight: Borrowers with FICO scores below 660 default at roughly 7.5 times the rate of borrowers with scores above 780 (30.88% vs 4.14%).
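A table like the one above can be produced by binning scores and averaging a binary default flag per band. This is a rough sketch under assumptions: the flag column `is_default` is hypothetical, the choice of `fico_range_low` as the score is a guess, and each band is treated as including its lower edge, which the source does not specify.

```python
import pandas as pd

# Band edges matching the table above; right=False makes each band [low, high).
FICO_EDGES = [0, 660, 700, 740, 780, 851]
FICO_LABELS = ["<660", "660-700", "700-740", "740-780", "780+"]


def default_rate_by_fico(df, score_col="fico_range_low", target_col="is_default"):
    """Return the mean default flag (the default rate) within each FICO band."""
    bands = pd.cut(df[score_col], bins=FICO_EDGES, labels=FICO_LABELS, right=False)
    return df.groupby(bands, observed=True)[target_col].mean()


# Toy data only; the real rates come from the full 2.26M-loan dataset.
toy = pd.DataFrame(
    {
        "fico_range_low": [640, 650, 680, 690, 720, 760, 800, 810],
        "is_default": [1, 0, 0, 1, 0, 0, 0, 0],
    }
)
rates = default_rate_by_fico(toy)
print(rates)
```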
2. Loan Grade Performance
Grade G loans default at 11x the rate of Grade A loans (40.01% vs 3.59%). Clear gradient validates Lending Club's risk assessment system.
3. Debt-to-Income (DTI) Ratio
DTI ratios above 30% show 75% higher default risk compared to DTI below 10%.
4. Loan Purpose Analysis
Small business and educational loans are 2x riskier than car loans (20%+ vs 10% default rates).
5. Financial Crisis Impact (2007-2018)
2007-2008 loans show 8x higher default rates than 2018 loans. Time-based features will be critical for model accuracy.
Opportunity Identified: Current lending may be too conservative on 660-700 FICO range when combined with strong secondary signals (low DTI, stable employment, home improvement purpose).
Estimated Business Impact:
- Potential to approve an additional 50,000-100,000 creditworthy borrowers annually
- Estimated revenue impact: $2-5M from interest on approved loans
- Risk mitigation through better pricing (interest rates aligned with predicted default probability)
After the structurally missing features were dropped, the remaining 95 columns still contained 588,262+ missing values. A strategic approach was needed to preserve predictive signal while ensuring data quality.
Step 1: Drop High-Missing Columns (>50%)
- Removed: 44 columns
- Examples: member_id (100%), hardship columns (99.5%), settlement columns (98.5%)
- Rationale: Structurally missing - only populated for rare edge cases (under 2% of loans)
Step 2: Selective Dropping (20-50% missing)
- Removed: 14 columns
- Examples: open_acc_6m (38%), mths_since_rcnt_il (40%)
- Rationale: Newer features not available for 2007-2012 loans (temporal data gap)
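Steps 1 and 2 can be sketched as two passes over the per-column missing share. This is an illustrative sketch rather than the project's actual code: the helper `drop_high_missing` is hypothetical, and because Step 2 is described as selective, a real pass would probably review the 20-50% candidates instead of dropping them mechanically as done here.

```python
import numpy as np
import pandas as pd


def drop_high_missing(df: pd.DataFrame, threshold: float):
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    to_drop = missing_share[missing_share > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy frame with 10 rows: 100%, 40%, 10% and 0% missing respectively.
toy = pd.DataFrame(
    {
        "member_id": [np.nan] * 10,
        "open_acc_6m": [np.nan] * 4 + [1.0] * 6,
        "dti": [np.nan] + [18.8] * 9,
        "loan_amnt": [12_900] * 10,
    }
)

step1, dropped_step1 = drop_high_missing(toy, threshold=0.50)
cleaned, dropped_step2 = drop_high_missing(step1, threshold=0.20)
print(dropped_step1, dropped_step2, list(cleaned.columns))
```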
Step 3: Impute Critical Features
| Feature | Missing % | Strategy | Rationale |
|---|---|---|---|
| annual_inc | 0.00% | Median | Robust to outlier millionaires |
| dti | 0.08% | Median | Financial data is right-skewed |
| revol_util | 0.08% | Median | Credit utilization skewed |
| emp_length | 6.50% | Mode + Indicator | Missingness may signal unemployment |
| emp_title | 7.39% | "Unknown" + Indicator | High cardinality feature |
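The strategies in the table could look roughly like this in pandas. It is a minimal sketch: the function name `impute_critical` is hypothetical, and the indicator column names follow the `emp_length_missing` / `emp_title_missing` naming used elsewhere in this README.

```python
import numpy as np
import pandas as pd


def impute_critical(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the median, mode + indicator, and "Unknown" + indicator strategies."""
    out = df.copy()
    # Median imputation for right-skewed numeric features.
    for col in ["annual_inc", "dti", "revol_util"]:
        out[col] = out[col].fillna(out[col].median())
    # Record missingness before filling so the signal is preserved.
    out["emp_length_missing"] = out["emp_length"].isna().astype(int)
    out["emp_length"] = out["emp_length"].fillna(out["emp_length"].mode().iloc[0])
    out["emp_title_missing"] = out["emp_title"].isna().astype(int)
    out["emp_title"] = out["emp_title"].fillna("Unknown")
    return out


toy = pd.DataFrame(
    {
        "annual_inc": [50_000, 60_000, np.nan, 5_000_000],
        "dti": [10.0, np.nan, 20.0, 30.0],
        "revol_util": [30.0, 50.0, 70.0, np.nan],
        "emp_length": ["10+ years", None, "10+ years", "2 years"],
        "emp_title": ["Teacher", "Nurse", None, "Driver"],
    }
)
filled = impute_critical(toy)
print(filled)
```

In a modeling pipeline the medians and mode would normally be computed on the training split only and then reused on the test split, so that the imputation itself does not leak information.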
Step 4: Domain-Driven "Months Since" Handling
| Column | Missing % | Fill Value | Interpretation |
|---|---|---|---|
| mths_since_recent_inq | 13.1% | 999 | Never had credit inquiry = POSITIVE |
| num_tl_120dpd_2m | 6.8% | 0 | Never severely delinquent = POSITIVE |
| mo_sin_old_il_acct | 6.2% | 999 | No installment loan history |
Key Domain Insight: In credit risk, "never had a delinquency" is a positive signal, not missing data. Filling with 999 (vs 0) distinguishes "never occurred" from "happened recently."
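The fill values in the table map directly onto a per-column dictionary passed to `DataFrame.fillna`. The snippet is a small sketch of that idea; the constant name `SENTINEL_FILLS` is hypothetical.

```python
import numpy as np
import pandas as pd

# Fill values taken from the table above.
SENTINEL_FILLS = {
    "mths_since_recent_inq": 999,  # no recorded inquiry
    "num_tl_120dpd_2m": 0,         # no accounts 120 days past due
    "mo_sin_old_il_acct": 999,     # no installment loan history
}

toy = pd.DataFrame(
    {
        "mths_since_recent_inq": [2.0, np.nan, 14.0],
        "num_tl_120dpd_2m": [0.0, np.nan, 1.0],
        "mo_sin_old_il_acct": [120.0, 36.0, np.nan],
    }
)

# A dict passed to fillna applies a different fill value to each column.
filled = toy.fillna(value=SENTINEL_FILLS)
print(filled)
```

A large sentinel such as 999 is easy for tree-based models to isolate with a split. For a linear model such as logistic regression, it would likely need a companion indicator column, since 999 would otherwise be read as a very large number of months.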
| Metric | Before | After | Change |
|---|---|---|---|
| Rows | 2,260,701 | 2,260,701 | No data loss ✅ |
| Columns | 151 | 95 | -56 columns |
| Missing Values | 588,262+ | 0 | 100% complete* ✅ |
| File Size | 1.6 GB | 1.37 GB | -14% |
*Important Note on "0% Missing": We achieved 0% missing values through a combination of:
- Dropping columns with >20% missingness (58 columns removed)
- Strategic imputation of remaining features using domain-appropriate methods
- Creating indicator variables to preserve information about missingness patterns
This is standard practice in credit modeling, but it's critical to understand:
- We didn't magically have perfect data
- We made deliberate, documented decisions about how to handle each feature
- Missingness patterns themselves became features (emp_length_missing, emp_title_missing)
Example: Annual Income
- Dataset: 99 borrowers earn $50,000, 1 earns $5,000,000
- Mean: $99,500 ❌ (Unrealistic - pulled up by outlier)
- Median: $50,000 ✅ (Representative of typical borrower)
This principle applies to ALL financial data with outliers.
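The arithmetic in the annual income example can be checked with Python's standard library. This only reproduces the toy numbers above and uses no project data.

```python
from statistics import mean, median

# 99 borrowers earning $50,000 and one earning $5,000,000.
incomes = [50_000] * 99 + [5_000_000]

avg_income = mean(incomes)       # pulled up by the single outlier
typical_income = median(incomes)  # unaffected by the outlier

print(f"Mean:   ${avg_income:,.0f}")
print(f"Median: ${typical_income:,.0f}")
```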
Why This Matters: In production credit models, you must predict default at the time of loan origination, using only information available before approval. Using post-origination data creates artificially inflated model performance that won't work in real lending decisions.
Categories of Features:
These features are known when the borrower applies:
- FICO score (fico_range_low, fico_range_high)
- Annual income (annual_inc)
- Employment length (emp_length)
- Debt-to-income ratio (dti)
- Home ownership status (home_ownership)
- Loan purpose (purpose)
- Loan amount requested (loan_amnt)
- Credit history length (earliest_cr_line)
- Number of open accounts (open_acc)
- Total credit lines (total_acc)
- Revolving utilization (revol_util)
- Delinquencies in last 2 years (delinq_2yrs)
- Public records (pub_rec)
- Inquiries in last 6 months (inq_last_6mths)
These may contain information from after loan approval:
- Last payment date (last_pymnt_d) - DEFINITE LEAKAGE
- Last payment amount (last_pymnt_amnt) - DEFINITE LEAKAGE
- Next payment date (next_pymnt_d) - DEFINITE LEAKAGE
- Total payment to date (total_pymnt) - DEFINITE LEAKAGE
- Total received principal (total_rec_prncp) - DEFINITE LEAKAGE
- Total received interest (total_rec_int) - DEFINITE LEAKAGE
- Recoveries (recoveries, collection_recovery_fee) - DEFINITE LEAKAGE
- Outstanding principal (out_prncp) - POST-ORIGINATION
These features directly reveal or strongly indicate the target:
- loan_status (this IS our target variable)
- Any "hardship" or "settlement" variables (already removed - indicate default)
- Total payment to date (perfect predictor of fully paid status)
- Collection amounts (only exist after default)
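One simple way to enforce the categories above is a deny-list that is removed before any model sees the data. This is a minimal sketch: the helper `split_features_target` is hypothetical, and the list contains only the post-origination columns named in this section, so a full audit would likely add more.

```python
import pandas as pd

# Post-origination columns named in the lists above.
LEAKAGE_COLUMNS = [
    "last_pymnt_d", "last_pymnt_amnt", "next_pymnt_d", "total_pymnt",
    "total_rec_prncp", "total_rec_int", "recoveries",
    "collection_recovery_fee", "out_prncp",
]


def split_features_target(df: pd.DataFrame, target_col: str = "loan_status"):
    """Return (X, y, removed): features without leakage columns or the target."""
    removed = [col for col in LEAKAGE_COLUMNS if col in df.columns]
    X = df.drop(columns=removed + [target_col])
    y = df[target_col]
    return X, y, removed


toy = pd.DataFrame(
    {
        "fico_range_low": [700, 650],
        "dti": [12.5, 28.0],
        "total_pymnt": [15_000.0, 3_200.0],
        "recoveries": [0.0, 450.0],
        "loan_status": ["Fully Paid", "Charged Off"],
    }
)
X, y, removed = split_features_target(toy)
print(list(X.columns), removed)
```

An allow-list of known origination-time features is a stricter alternative, because any column that has not been reviewed is excluded by default.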
Without Leakage Prevention:
- Model achieves 0.99+ AUC (suspiciously perfect)
- Uses payment history to predict default (circular logic)
- Completely unusable in production
- Would be flagged immediately by credit risk committees
With Proper Leakage Prevention:
- Model achieves a realistic 0.75-0.85 AUC
- Uses only origination-time features
- Deployable in production lending decisions
- Demonstrates professional understanding of credit modeling
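The contrast above can be illustrated on synthetic data. Everything in this sketch is made up for demonstration: the FICO distribution, the default probabilities, and the `repaid_share` feature are assumptions, not values from the Lending Club dataset. It only shows why a post-origination feature produces a near-perfect AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5_000

# Origination-time feature: a synthetic FICO score.
fico = rng.normal(700, 40, n)
# Assumed relationship: lower scores default more often.
p_default = 1 / (1 + np.exp((fico - 650) / 30))
is_default = (rng.random(n) < p_default).astype(int)

# Post-origination feature: share of principal repaid.
# Fully paid loans repay 100%; defaulted loans repay less than 90%.
repaid_share = np.where(is_default == 1, rng.uniform(0.0, 0.9, n), 1.0)

# Use the negated feature as a risk score (lower value = higher risk).
auc_honest = roc_auc_score(is_default, -fico)
auc_leaky = roc_auc_score(is_default, -repaid_share)

print(f"AUC with origination-time feature: {auc_honest:.3f}")
print(f"AUC with leaked feature:           {auc_leaky:.3f}")
```

The leaked feature separates the classes almost by construction, which is the circular logic described above: it encodes the outcome rather than predicting it.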
Real-World Example: A fintech startup built a loan default model with 98% accuracy. When they tried to deploy it, they discovered it used "months since last payment" as a top feature - which obviously isn't available when approving new loans. They had to rebuild from scratch. This mistake cost them 6 months and $500K.
- ✅ Exploratory Data Analysis (EDA) with large datasets (2.26M records)
- ✅ Strategic missing value handling with domain context
- ✅ Statistical imputation techniques
- 🔄 Data leakage prevention in time-series financial data
- 🔄 Feature engineering for credit risk
- 🔄 Machine learning model development (coming soon)
- ✅ Credit risk modeling in fintech lending
- ✅ Understanding FICO scores, DTI ratios, loan grades
- 🔄 Point-in-time feature selection for credit models
- 🔄 Production deployment considerations
- ✅ Business impact quantification ($2-5M opportunity)
- Implement SHAP values for model explainability (regulatory requirement)
- Ensemble stacking (XGBoost + Logistic Regression)
- Time-based validation (train 2007-2015, test 2016-2018)
- Hyperparameter tuning with GridSearchCV
- Build Streamlit dashboard for credit officers
- Real-time scoring API integration
- A/B testing framework (champion/challenger)
- MLOps pipeline with automated retraining
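The time-based validation item in the list above (train 2007-2015, test 2016-2018) could be sketched as follows. The column name `issue_d` and its "Mon-YYYY" format are assumptions about the dataset that this README does not confirm, and the helper `time_based_split` is hypothetical.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, date_col: str = "issue_d",
                     last_train_year: int = 2015):
    """Split by issue year so the test set is strictly later than the training set."""
    issue_year = pd.to_datetime(df[date_col], format="%b-%Y").dt.year
    train = df[issue_year <= last_train_year]
    test = df[issue_year > last_train_year]
    return train, test


toy = pd.DataFrame(
    {
        "issue_d": ["Jun-2008", "Dec-2015", "Jan-2016", "Mar-2018"],
        "loan_amnt": [5_000, 12_000, 20_000, 8_000],
    }
)
train, test = time_based_split(toy)
print(len(train), len(test))
```

Splitting by time rather than at random keeps the evaluation closer to how a credit model is actually used: trained on past vintages and scored on later ones.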
Zaina - Risk Management Professional → ML/Data Science
- LinkedIn: linkedin.com/in/olivia-tamimi
- GitHub: github.com/HammurabiCodes
- Email: olivia.tamimi1@gmail.com
- Education: MS Business Analytics (Fintech Track) @ Rutgers University
Background: 10+ years fintech lending | Combining domain expertise with technical ML skills
Last Updated: March 3, 2026
Project Status: Week 2 Day 2 - Data Leakage Audit in Progress
Expected Completion: May 2026
"0% missing values" doesn't mean we started with perfect data - it means we made strategic, documented decisions about imputation and dropping features.
Understanding that "never had a delinquency" is a positive signal (not missing data) comes from credit risk domain experience. Technical skills alone aren't enough.
Always ask: "Would this feature be available when making the actual decision?" In credit models, using payment history to predict default is useless in production.
Future employers want to see your reasoning, not just your results. Why did you choose median over mean? Why did you remove certain features? Show your thinking.
A deployable model with 0.75 AUC is far more valuable than a 0.99 AUC model with data leakage that can never go to production.