Customer Propensity ML Pipeline

A production-ready machine learning system that predicts customer purchase propensity. The best model (Random Forest) reaches a 91.8% F1 score and a 99.7% ROC-AUC.

Quick Start

1. Training Pipeline

python3 ml_pipeline_mvp.py

This runs the complete ML pipeline including:

Data loading and validation
Advanced feature engineering
Multiple model training (Logistic Regression, Random Forest, XGBoost)
Cross-validation and hyperparameter tuning
Model evaluation and comparison
Test prediction generation
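The model training and comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the code in ml_pipeline_mvp.py: the synthetic data, the helper names (make_synthetic_binary_data, compare_models) and the hyperparameters are all assumptions, and XGBoost is left out to keep the sketch dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def make_synthetic_binary_data(n_samples=2000, n_features=6, seed=0):
    """Generate 0/1 features and an imbalanced 0/1 target (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, n_features))
    # Hypothetical rule: a purchase needs the first two signals together, plus noise.
    y = ((X[:, 0] == 1) & (X[:, 1] == 1) & (rng.random(n_samples) < 0.5)).astype(int)
    return X, y


def compare_models(X, y, n_splits=3, seed=0):
    """Return the mean cross-validated F1 score for each candidate model."""
    candidates = {
        "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
        "random_forest": RandomForestClassifier(
            n_estimators=50, class_weight="balanced", random_state=seed
        ),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="f1").mean())
        for name, model in candidates.items()
    }


X, y = make_synthetic_binary_data()
scores = compare_models(X, y)
best_name = max(scores, key=scores.get)
print(scores, "best:", best_name)
```

Stratified folds keep the share of positives similar in every split, which matters when the positive class is rare.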

2. Making Predictions

python3 model_inference.py --input_file your_data.csv --output_file predictions.csv

3. Viewing Results

Check the ml_outputs/ directory for:

Trained models (*.pkl)
Performance metrics (model_evaluation_results.json)
Test predictions (test_predictions.csv)
Feature importance analysis (*_feature_importance.csv)

System Architecture

Core Components

ml_pipeline_mvp.py - Complete training pipeline
- Handles 455K+ training samples with a heavily imbalanced target (4.19% positive class)
- Advanced feature engineering with interaction features
- Multiple model training with class weight balancing
- Comprehensive evaluation with business-relevant metrics
model_inference.py - Production inference system
- Loads best model (Random Forest)
- Validates input data format
- Generates predictions with probability scores
- Handles missing features gracefully
InteractionFeatureTransformer - Custom feature engineering
- Creates interaction features between high-impact variables
- Builds engagement and intent scores
- Handles device-behavior interactions
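For readers unfamiliar with custom scikit-learn transformers, here is a minimal sketch of what an interaction-feature step could look like. It is not the actual InteractionFeatureTransformer: the class name InteractionFeatureSketch, the chosen column pairs and the engagement_score definition are assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class InteractionFeatureSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in that adds pairwise products of binary columns."""

    def __init__(self, pairs=(("sign_in", "saw_checkout"),
                              ("device_mobile", "basket_add_detail"))):
        self.pairs = pairs

    def fit(self, X, y=None):
        # Nothing is learned from the data; the transformer is stateless.
        return self

    def transform(self, X):
        out = X.copy()
        # For 0/1 columns the product is 1 only when both signals are present.
        for left, right in self.pairs:
            out[f"{left}_x_{right}"] = out[left] * out[right]
        # Assumed engagement score: number of basket interactions per user.
        basket_cols = ["basket_icon_click", "basket_add_list", "basket_add_detail"]
        out["engagement_score"] = out[basket_cols].sum(axis=1)
        return out


frame = pd.DataFrame({
    "sign_in": [1, 0, 1],
    "saw_checkout": [1, 1, 0],
    "device_mobile": [1, 1, 0],
    "basket_icon_click": [0, 1, 1],
    "basket_add_list": [1, 1, 0],
    "basket_add_detail": [1, 0, 0],
})
result = InteractionFeatureSketch().fit_transform(frame)
print(result[["sign_in_x_saw_checkout", "engagement_score"]])
```

Because the class follows the fit/transform convention, it can be placed in front of a classifier inside a scikit-learn Pipeline.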

Performance Results

Model	F1 Score	Precision	Recall	ROC-AUC
Random Forest	0.9182	0.8555	0.9908	0.9974
XGBoost	0.9125	0.8454	0.9911	0.9974
Logistic Regression	0.9123	0.8441	0.9924	0.9974
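
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the precision and recall values in the table; only the helper name f1_from_precision_recall is introduced here.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "Random Forest": (0.8555, 0.9908, 0.9182),
    "XGBoost": (0.8454, 0.9911, 0.9125),
    "Logistic Regression": (0.8441, 0.9924, 0.9123),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from_precision_recall(precision, recall)
    print(f"{name}: computed {f1:.4f}, reported {f1_reported:.4f}")
```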

Key Achievements

99.08% Recall: Captures nearly all potential customers
85.55% Precision: Minimizes false positives for cost-effective targeting
99.74% ROC-AUC: Exceptional discrimination ability
0.84% Test Prediction Rate: Only 0.84% of test-set users are predicted to purchase, giving conservative, high-confidence targeting

Installation

Requirements

pip install pandas numpy scikit-learn matplotlib seaborn xgboost imbalanced-learn

Or use the provided requirements file:

pip install -r requirements.txt

Verified Environment

Python 3.11+
All major ML libraries supported
Tested on macOS (compatible with Linux/Windows)

Input Data Format

Expected CSV format with binary features (0/1):

UserID,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,closed_minibasket_click,checked_delivery_detail,checked_returns_detail,sign_in,saw_checkout,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,loc_uk
user123,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,1,1,0,0,1,1

Required Features (23 total):

Engagement: basket_icon_click, basket_add_list, basket_add_detail, sort_by, image_picker
Account: account_page_click, sign_in, saw_account_upgrade
Shopping Intent: checked_delivery_detail, checked_returns_detail, saw_checkout, saw_delivery, saw_sizecharts
Content: promo_banner_click, detail_wishlist_add, list_size_dropdown, closed_minibasket_click, saw_homepage
Device: device_mobile, device_computer, device_tablet
User Type: returning_user
Location: loc_uk
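
Before calling a model, it can help to check an input file against this schema. The sketch below is one possible approach, assuming pandas; prepare_features is a hypothetical helper and not the validation code inside model_inference.py. It fills absent columns with 0, as described under Troubleshooting, and rejects any value other than 0 or 1.

```python
import pandas as pd

REQUIRED_FEATURES = [
    "basket_icon_click", "basket_add_list", "basket_add_detail", "sort_by",
    "image_picker", "account_page_click", "promo_banner_click",
    "detail_wishlist_add", "list_size_dropdown", "closed_minibasket_click",
    "checked_delivery_detail", "checked_returns_detail", "sign_in",
    "saw_checkout", "saw_sizecharts", "saw_delivery", "saw_account_upgrade",
    "saw_homepage", "device_mobile", "device_computer", "device_tablet",
    "returning_user", "loc_uk",
]


def prepare_features(frame):
    """Return (features, missing): the 23 columns in order, plus the names that were absent."""
    missing = [col for col in REQUIRED_FEATURES if col not in frame.columns]
    # reindex adds the absent columns filled with 0 and drops extras such as UserID.
    features = frame.reindex(columns=REQUIRED_FEATURES, fill_value=0)
    not_binary = [col for col in REQUIRED_FEATURES if not features[col].isin([0, 1]).all()]
    if not_binary:
        raise ValueError(f"Non-binary values in columns: {not_binary}")
    return features.astype(int), missing


sample = pd.DataFrame({"UserID": ["user123"], "saw_checkout": [1], "sign_in": [0]})
features, missing = prepare_features(sample)
print(features.shape, len(missing))
```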

Output Format

Predictions include:

UserID: Customer identifier
predicted_ordered: Binary prediction (0/1)
purchase_probability: Probability score (0-1)
prediction_timestamp: When prediction was made

Example output:

UserID,predicted_ordered,purchase_probability,prediction_timestamp
user123,1,0.8234,2025-08-15T01:33:13.227942
user456,0,0.0234,2025-08-15T01:33:13.227942
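
As an illustration of how these four columns fit together, the sketch below builds an output table from user IDs and probability scores. The helper build_prediction_frame and the default 0.5 cut-off are assumptions made for this example, not the code in model_inference.py.

```python
from datetime import datetime

import pandas as pd


def build_prediction_frame(user_ids, probabilities, threshold=0.5):
    """Assemble the output columns from IDs and scores (assumed 0.5 cut-off)."""
    probs = pd.Series(list(probabilities), dtype=float)
    # One timestamp for the whole batch, in ISO 8601 format.
    timestamp = datetime.now().isoformat()
    return pd.DataFrame({
        "UserID": list(user_ids),
        "predicted_ordered": (probs >= threshold).astype(int),
        "purchase_probability": probs.round(4),
        "prediction_timestamp": timestamp,
    })


output = build_prediction_frame(["user123", "user456"], [0.8234, 0.0234])
print(output.to_csv(index=False))
```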

Business Applications

Marketing Campaigns

Target customers with probability > 0.5 for high-conversion campaigns
Use probability scores for budget allocation
A/B test different thresholds for optimal ROI

Customer Segmentation

High Intent (p > 0.5): Premium campaigns, personalized offers
Medium Intent (0.1 < p < 0.5): Nurture campaigns, retargeting
Low Intent (p < 0.1): Brand awareness, long-term nurturing
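
These bands translate directly into a small lookup function. In the sketch below, assign_segment is a hypothetical helper. The bands above do not say where scores of exactly 0.5 or 0.1 belong, so the code assumes they fall into the lower band.

```python
def assign_segment(probability):
    """Map a purchase probability to one of the three intent segments."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {probability}")
    if probability > 0.5:
        return "High Intent"
    # Assumption: exactly 0.5 counts as Medium and exactly 0.1 counts as Low.
    if probability > 0.1:
        return "Medium Intent"
    return "Low Intent"


for score in (0.8234, 0.30, 0.0234):
    print(score, assign_segment(score))
```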

Feature Insights

Top predictive features for business strategy:

Checkout Behavior: saw_checkout, checked_delivery_detail
Engagement: basket interactions, account activity
Intent Signals: sign_in combined with shopping actions

Advanced Usage

Custom Threshold Tuning

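# Load the model and adjust the threshold based on business needs
import pickle

import pandas as pd

# Unpickling requires the custom classes used at training time to be importable
with open('ml_outputs/random_forest_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Feature matrix: the input columns without the identifier
# (assumes the pickled model accepts the raw feature columns)
X = pd.read_csv('your_data.csv').drop(columns=['UserID'])

# Probability of the positive class (purchase)
probs = model.predict_proba(X)[:, 1]

# Custom threshold (e.g., 0.7 to favor precision over recall)
custom_threshold = 0.7
predictions = (probs >= custom_threshold).astype(int)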
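
To choose a threshold, it helps to see how precision and recall move as the cut-off changes on labeled data. The sketch below uses a tiny hand-made sample rather than real pipeline output; precision_recall_at is a hypothetical helper written with NumPy only.

```python
import numpy as np


def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    y_true = np.asarray(y_true)
    preds = (np.asarray(probs) >= threshold).astype(int)
    tp = int(((preds == 1) & (y_true == 1)).sum())
    fp = int(((preds == 1) & (y_true == 0)).sum())
    fn = int(((preds == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels and scores (not real model output)
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
probs = [0.95, 0.80, 0.75, 0.60, 0.40, 0.30, 0.10, 0.05]

for threshold in (0.5, 0.7, 0.9):
    precision, recall = precision_recall_at(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this sample, raising the threshold from 0.5 to 0.9 lifts precision while recall drops, which is the trade-off to weigh against campaign cost.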

Batch Processing

import pandas as pd

from model_inference import CustomerPropensityPredictor

predictor = CustomerPropensityPredictor('ml_outputs/')
predictor.load_best_model()

# Process large files in chunks, appending every chunk to one output file
for i, chunk in enumerate(pd.read_csv('large_dataset.csv', chunksize=10000)):
    predictions = predictor.predict(chunk)
    # Overwrite and write the header on the first chunk, then append without it
    predictions.to_csv('predictions_chunk.csv', mode='w' if i == 0 else 'a',
                       header=(i == 0), index=False)

Monitoring and Maintenance

Model Performance Tracking

Monitor prediction distribution over time
Track business conversion rates vs predicted probabilities
Alert if input data distribution shifts significantly
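
One simple way to implement the input-shift alert for binary features is to compare the share of 1s per column between a reference sample and recent data. The sketch below is an assumption about how such a check could look: feature_rate_shift and the 0.10 tolerance are illustrative choices, not part of the pipeline.

```python
import pandas as pd


def feature_rate_shift(reference, current, tolerance=0.10):
    """Return features whose share of 1s moved by more than `tolerance` (absolute)."""
    # For 0/1 columns the column mean equals the share of 1s.
    shift = (current.mean() - reference.mean()).abs()
    return shift[shift > tolerance].sort_values(ascending=False)


# Illustrative samples (not real data)
reference = pd.DataFrame({"saw_checkout": [0, 0, 0, 1], "sign_in": [1, 0, 1, 0]})
current = pd.DataFrame({"saw_checkout": [1, 1, 0, 1], "sign_in": [1, 0, 1, 0]})

flagged = feature_rate_shift(reference, current)
if not flagged.empty:
    print("Input drift detected:")
    print(flagged)
```

The same comparison can be run on the predicted probabilities to monitor the prediction distribution over time.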

Recommended Retraining Schedule

Monthly: Performance monitoring and drift detection
Quarterly: Full model retraining with new data
Annually: Feature engineering review and architecture updates

Files Generated

ml_outputs/
├── Models (Production Ready)
│   ├── random_forest_model.pkl           # Best model (F1: 0.9182)
│   ├── xgboost_model.pkl                 # Alternative model
│   └── logistic_regression_model.pkl     # Baseline model
├── Predictions
│   └── test_predictions.csv              # Test set predictions
├── Analysis
│   ├── model_evaluation_results.json     # Performance metrics
│   ├── random_forest_feature_importance.csv
│   ├── xgboost_feature_importance.csv
│   └── logistic_regression_coefficients.csv
├── Visualizations
│   └── model_comparison.png              # Performance comparison
└── Internal
    └── detailed_evaluation_results.pkl   # Complete evaluation data

Troubleshooting

Common Issues

Import Errors

pip install --upgrade scikit-learn xgboost imbalanced-learn

Memory Issues with Large Files

Process data in chunks using pandas chunksize parameter
Use model_inference.py which handles memory efficiently

Feature Missing Warnings

Missing features automatically filled with 0 (appropriate for binary data)
Check input data format against expected schema

Support

For technical issues:

Check log files (ml_pipeline.log) for detailed error messages
Verify input data format matches expected schema
Ensure all required packages are installed with compatible versions

License

This project is designed for production use with comprehensive error handling, logging, and scalability features.


