A production-ready machine learning system for predicting customer purchase propensity with exceptional performance (91.8% F1 score, 99.7% ROC-AUC).
python3 ml_pipeline_mvp.py
This runs the complete ML pipeline, including:
- Data loading and validation
- Advanced feature engineering
- Multiple model training (Logistic Regression, Random Forest, XGBoost)
- Cross-validation and hyperparameter tuning
- Model evaluation and comparison
- Test prediction generation
python3 model_inference.py --input_file your_data.csv --output_file predictions.csv
Check the ml_outputs/ directory for:
- Trained models (*.pkl)
- Performance metrics (model_evaluation_results.json)
- Test predictions (test_predictions.csv)
- Feature importance analysis (*_feature_importance.csv)
- ml_pipeline_mvp.py - Complete training pipeline
  - Handles 455K+ training samples with 4.19% class imbalance
  - Advanced feature engineering with interaction features
  - Multiple model training with class weight balancing
  - Comprehensive evaluation with business-relevant metrics
- model_inference.py - Production inference system
  - Loads best model (Random Forest)
  - Validates input data format
  - Generates predictions with probability scores
  - Handles missing features gracefully
- InteractionFeatureTransformer - Custom feature engineering
  - Creates interaction features between high-impact variables
  - Builds engagement and intent scores
  - Handles device-behavior interactions
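To make the feature engineering step more concrete, here is a minimal sketch of what a transformer of this kind could look like. It is not the project's InteractionFeatureTransformer: the class name `InteractionFeatureSketch`, the column groupings, and the output column names are all assumptions, chosen from the feature categories listed in this README.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class InteractionFeatureSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for InteractionFeatureTransformer (illustration only)."""

    # Assumed groupings, borrowed from the feature categories in this README.
    ENGAGEMENT = ["basket_icon_click", "basket_add_list", "basket_add_detail",
                  "sort_by", "image_picker"]
    INTENT = ["checked_delivery_detail", "checked_returns_detail", "saw_checkout",
              "saw_delivery", "saw_sizecharts"]

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the training data.
        return self

    def transform(self, X):
        X = X.copy()
        # Scores: how many of the binary flags in each group are set.
        X["engagement_score"] = X[self.ENGAGEMENT].sum(axis=1)
        X["intent_score"] = X[self.INTENT].sum(axis=1)
        # Pairwise interactions: 1 only when both binary flags are 1.
        X["signin_x_checkout"] = X["sign_in"] * X["saw_checkout"]
        X["mobile_x_basket_add"] = X["device_mobile"] * X["basket_add_detail"]
        return X


# One row using the user123 values from the input format example.
row = pd.DataFrame([{
    "basket_icon_click": 0, "basket_add_list": 1, "basket_add_detail": 0,
    "sort_by": 0, "image_picker": 0, "checked_delivery_detail": 1,
    "checked_returns_detail": 0, "saw_checkout": 1, "saw_delivery": 0,
    "saw_sizecharts": 0, "sign_in": 1, "device_mobile": 1,
}])
features = InteractionFeatureSketch().fit_transform(row)
print(features[["engagement_score", "intent_score",
                "signin_x_checkout", "mobile_x_basket_add"]])
```

Because the sketch follows the scikit-learn `fit`/`transform` convention, it could be placed as a step in a `Pipeline` ahead of the classifier, so the same features are built at training and inference time.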
| Model | F1 Score | Precision | Recall | ROC-AUC |
|---|---|---|---|---|
| Random Forest | 0.9182 | 0.8555 | 0.9908 | 0.9974 |
| XGBoost | 0.9125 | 0.8454 | 0.9911 | 0.9974 |
| Logistic Regression | 0.9123 | 0.8441 | 0.9924 | 0.9974 |
- 99.08% Recall: Captures nearly all potential customers
- 85.55% Precision: Minimizes false positives for cost-effective targeting
- 99.74% ROC-AUC: Exceptional discrimination ability
- 0.84% Test Prediction Rate: Conservative, high-confidence targeting
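As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so each reported F1 score can be recomputed from the other two columns. The helper name `f1_from` below is only illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "Random Forest": (0.8555, 0.9908, 0.9182),
    "XGBoost": (0.8454, 0.9911, 0.9125),
    "Logistic Regression": (0.8441, 0.9924, 0.9123),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from(precision, recall)
    print(f"{name}: computed F1 = {f1:.4f}, reported F1 = {f1_reported:.4f}")
```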
pip install pandas numpy scikit-learn matplotlib seaborn xgboost imbalanced-learn
Or use the provided requirements file:
pip install -r requirements.txt
- Python 3.11+
- All major ML libraries supported
- Tested on macOS (compatible with Linux/Windows)
Expected CSV format with binary features (0/1):
UserID,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,closed_minibasket_click,checked_delivery_detail,checked_returns_detail,sign_in,saw_checkout,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,loc_uk
user123,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,1,1,0,0,1,1
- Engagement: basket_icon_click, basket_add_list, basket_add_detail, sort_by, image_picker
- Account: account_page_click, sign_in, saw_account_upgrade
- Shopping Intent: checked_delivery_detail, checked_returns_detail, saw_checkout, saw_delivery, saw_sizecharts
- Content: promo_banner_click, detail_wishlist_add, list_size_dropdown, closed_minibasket_click, saw_homepage
- Device: device_mobile, device_computer, device_tablet
- User Type: returning_user
- Location: loc_uk
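Before sending a file to inference, it can help to check it against the schema above. The following is a rough sketch using only the standard library; the function name `find_schema_problems` is hypothetical, and this is not necessarily how model_inference.py performs its own validation.

```python
import csv
import io

FEATURE_COLUMNS = [
    "basket_icon_click", "basket_add_list", "basket_add_detail", "sort_by",
    "image_picker", "account_page_click", "promo_banner_click",
    "detail_wishlist_add", "list_size_dropdown", "closed_minibasket_click",
    "checked_delivery_detail", "checked_returns_detail", "sign_in",
    "saw_checkout", "saw_sizecharts", "saw_delivery", "saw_account_upgrade",
    "saw_homepage", "device_mobile", "device_computer", "device_tablet",
    "returning_user", "loc_uk",
]
EXPECTED_COLUMNS = ["UserID"] + FEATURE_COLUMNS


def find_schema_problems(csv_text: str) -> list[str]:
    """Return human-readable problems; an empty list means the input looks valid."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in EXPECTED_COLUMNS if c not in header]
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in FEATURE_COLUMNS:
            value = row.get(col)
            # Columns absent from the file are already reported above.
            if value is not None and value.strip() not in ("0", "1"):
                problems.append(f"line {line_no}: {col}={value!r} is not 0 or 1")
    return problems


sample = (
    ",".join(EXPECTED_COLUMNS)
    + "\nuser123,0,1,0,0,0,1,0,0,1,0,1,0,1,1,0,0,0,1,1,0,0,1,1\n"
)
print(find_schema_problems(sample))
```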
Predictions include:
- UserID: Customer identifier
- predicted_ordered: Binary prediction (0/1)
- purchase_probability: Probability score (0-1)
- prediction_timestamp: When prediction was made
Example output:
UserID,predicted_ordered,purchase_probability,prediction_timestamp
user123,1,0.8234,2025-08-15T01:33:13.227942
user456,0,0.0234,2025-08-15T01:33:13.227942
- Target customers with probability > 0.5 for high-conversion campaigns
- Use probability scores for budget allocation
- A/B test different thresholds for optimal ROI
- High Intent (p > 0.5): Premium campaigns, personalized offers
- Medium Intent (0.1 < p < 0.5): Nurture campaigns, retargeting
- Low Intent (p < 0.1): Brand awareness, long-term nurturing
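The three tiers above can be expressed as a small lookup function. This is a sketch: the README does not say which tier a probability of exactly 0.5 or 0.1 belongs to, so the handling of those two boundary values below is an assumption, and the function name `intent_segment` is illustrative.

```python
def intent_segment(p: float) -> str:
    """Map a purchase probability to the High / Medium / Low intent tiers."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {p}")
    if p > 0.5:
        return "high"
    if p > 0.1:
        # Assumption: exactly 0.5 falls in the medium tier.
        return "medium"
    # Assumption: exactly 0.1 falls in the low tier.
    return "low"


# Probabilities taken from the example output, plus one mid-range value.
for p in (0.8234, 0.3, 0.0234):
    print(p, intent_segment(p))
```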
Top predictive features for business strategy:
- Checkout Behavior: saw_checkout, checked_delivery_detail
- Engagement: basket interactions, account activity
- Intent Signals: sign_in combined with shopping actions
# Load model and adjust threshold based on business needs
import pickle
with open('ml_outputs/random_forest_model.pkl', 'rb') as f:
    model = pickle.load(f)
# Get probabilities (X must hold the same feature columns used in training)
probs = model.predict_proba(X)[:, 1]
# Custom threshold (e.g., for high precision)
custom_threshold = 0.7
predictions = (probs >= custom_threshold).astype(int)

import pandas as pd
from model_inference import CustomerPropensityPredictor
predictor = CustomerPropensityPredictor('ml_outputs/')
predictor.load_best_model()
# Process large files in chunks
for i, chunk in enumerate(pd.read_csv('large_dataset.csv', chunksize=10000)):
    predictions = predictor.predict(chunk)
    # Write the header with the first chunk only, then append the rest
    predictions.to_csv('predictions_chunk.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)
- Monitor prediction distribution over time
- Track business conversion rates vs predicted probabilities
- Alert if input data distribution shifts significantly
- Monthly: Performance monitoring and drift detection
- Quarterly: Full model retraining with new data
- Annually: Feature engineering review and architecture updates
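One common way to monitor the prediction distribution for drift is the Population Stability Index (PSI), which compares binned score distributions between a baseline period and a current period. The README does not say which drift metric the project uses, so this is a sketch of one option; the function name, the bin count, and the sample scores are assumptions.

```python
import math


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """PSI between two lists of probabilities in [0, 1], using equal-width bins."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            # A probability of exactly 1.0 goes into the last bin.
            counts[min(int(v * bins), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    base = bin_fractions(baseline)
    curr = bin_fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))


# Made-up scores: the share of high-probability predictions rises from 5% to 20%.
baseline_scores = [0.02] * 95 + [0.9] * 5
current_scores = [0.02] * 80 + [0.9] * 20
psi = population_stability_index(baseline_scores, current_scores)
print(f"PSI = {psi:.3f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Those cutoffs are a convention rather than something stated in this project, so any alert threshold should be tuned against real data.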
ml_outputs/
├── Models (Production Ready)
│   ├── random_forest_model.pkl          # Best model (F1: 0.9182)
│   ├── xgboost_model.pkl                # Alternative model
│   └── logistic_regression_model.pkl    # Baseline model
├── Predictions
│   └── test_predictions.csv             # Test set predictions
├── Analysis
│   ├── model_evaluation_results.json    # Performance metrics
│   ├── random_forest_feature_importance.csv
│   ├── xgboost_feature_importance.csv
│   └── logistic_regression_coefficients.csv
├── Visualizations
│   └── model_comparison.png             # Performance comparison
└── Internal
    └── detailed_evaluation_results.pkl  # Complete evaluation data
Import Errors
pip install --upgrade scikit-learn xgboost imbalanced-learn
Memory Issues with Large Files
- Process data in chunks using the pandas chunksize parameter
- Use model_inference.py, which handles memory efficiently
Feature Missing Warnings
- Missing features automatically filled with 0 (appropriate for binary data)
- Check input data format against expected schema
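For readers who want to reproduce the zero-fill behaviour outside the inference script, the sketch below aligns a DataFrame to the expected feature schema with pandas. It is an illustration under assumptions, not the code in model_inference.py; the function name `align_features` is hypothetical.

```python
import pandas as pd

FEATURE_COLUMNS = [
    "basket_icon_click", "basket_add_list", "basket_add_detail", "sort_by",
    "image_picker", "account_page_click", "promo_banner_click",
    "detail_wishlist_add", "list_size_dropdown", "closed_minibasket_click",
    "checked_delivery_detail", "checked_returns_detail", "sign_in",
    "saw_checkout", "saw_sizecharts", "saw_delivery", "saw_account_upgrade",
    "saw_homepage", "device_mobile", "device_computer", "device_tablet",
    "returning_user", "loc_uk",
]


def align_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with exactly FEATURE_COLUMNS, in order; absent columns become 0."""
    missing = [c for c in FEATURE_COLUMNS if c not in df.columns]
    if missing:
        print(f"Warning: filling {len(missing)} missing feature(s) with 0: {missing}")
    # fill_value applies only to newly created columns, not to NaN in existing ones.
    return df.reindex(columns=FEATURE_COLUMNS, fill_value=0)


partial = pd.DataFrame({"UserID": ["user123"], "sign_in": [1], "saw_checkout": [1]})
aligned = align_features(partial)
print(aligned.shape)
```

Note that `reindex` also drops columns that are not in the schema, such as UserID here, so identifiers need to be kept separately and rejoined to the predictions.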
For technical issues:
- Check log files (ml_pipeline.log) for detailed error messages
- Verify input data format matches expected schema
- Ensure all required packages are installed with compatible versions
This project is designed for production use with comprehensive error handling, logging, and scalability features.