Build a high‑performance fraud detection model with XGBoost on the IEEE‑CIS dataset, using unique cardholder identifiers and targeted feature engineering. The pipeline normalizes temporal features, applies frequency encoding and group aggregations, and produces a submission file ready for the Kaggle competition.
The core of this solution is the identification of unique cardholders (UIDs) and the aggregation of their transaction behavior over time. By normalizing temporal features and analyzing transaction patterns, the model can effectively distinguish between legitimate users and fraudulent actors.
- D-Column Normalization: Converting relative time deltas to absolute points in time for stability.
- Cardholder UID Creation: Combining multiple card and address features to track individual credit cards.
- Advanced Encodings:
  - Frequency Encoding for high-cardinality features.
  - Group Aggregations (Mean, Std, Nunique) based on cardholder UIDs.
- Optimized Pipeline: Uses `pd.concat` to avoid DataFrame fragmentation and improve performance.
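The features listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `xgb_magic_model.py`: the choice of `card1`, `addr1`, `D1`, and `TransactionAmt`, the derived column names (`D1n`, `uid`, `*_FE`), and all helper function names are assumptions made for the example. It assumes `TransactionDT` is measured in seconds and the D columns in days.

```python
import numpy as np
import pandas as pd

SECONDS_PER_DAY = 86_400


def normalize_d_columns(df, d_cols):
    """Convert 'days since event' deltas into a fixed reference day."""
    day = np.floor(df["TransactionDT"] / SECONDS_PER_DAY)
    # The raw delta grows over time; (day - delta) stays constant per card.
    return pd.DataFrame({f"{c}n": day - df[c] for c in d_cols}, index=df.index)


def make_uid(df, d1n):
    """Hypothetical UID: card1 + addr1 + normalized D1 (an assumed recipe)."""
    uid = df["card1"].astype(str) + "_" + df["addr1"].astype(str) + "_" + d1n.astype(str)
    return uid.rename("uid")


def frequency_encode(df, cols):
    """Replace each value by its relative frequency in the column."""
    encoded = {f"{c}_FE": df[c].map(df[c].value_counts(normalize=True)) for c in cols}
    return pd.DataFrame(encoded, index=df.index)


def group_aggregate(df, value_cols, key, aggs=("mean", "std", "nunique")):
    """Per-group statistics broadcast back to every row of the group."""
    parts = [
        df.groupby(key)[col].transform(agg).rename(f"{col}_{key}_{agg}")
        for col in value_cols
        for agg in aggs
    ]
    # One concat instead of many single-column assignments avoids fragmentation.
    return pd.concat(parts, axis=1)


def build_features(df):
    d_norm = normalize_d_columns(df, ["D1"])
    uid = make_uid(df, d_norm["D1n"])
    base = pd.concat([df, d_norm, uid], axis=1)
    fe = frequency_encode(base, ["card1", "uid"])
    agg = group_aggregate(base, ["TransactionAmt"], "uid")
    return pd.concat([base, fe, agg], axis=1)


# Toy data: rows 0 and 1 belong to the same card, seen 5 days apart.
toy = pd.DataFrame({
    "TransactionDT": [SECONDS_PER_DAY * 10, SECONDS_PER_DAY * 15, SECONDS_PER_DAY * 12],
    "D1": [5, 10, 0],
    "card1": [1000, 1000, 2000],
    "addr1": [300, 300, 400],
    "TransactionAmt": [50.0, 150.0, 20.0],
})
features = build_features(toy)
print(features[["uid", "uid_FE", "TransactionAmt_uid_mean"]])
```

In the toy data the first two rows have different raw `D1` values (5 and 10) but the same normalized value, so they receive the same `uid` and share their aggregated statistics. Collecting new columns and attaching them with a single `pd.concat` call is what keeps the DataFrame from becoming fragmented.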
Below is the result of our model performance on the Kaggle leaderboard:
- Python 3.x
- pandas
- numpy
- xgboost
- scikit-learn
- Place the competition datasets (`train_transaction.csv`, `train_identity.csv`, `test_transaction.csv`, `test_identity.csv`) in the root directory.
- Run the training script: `python xgb_magic_model.py`
- The script will generate a `submission_xgb_magic.csv` file ready for Kaggle submission.
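For orientation, the sketch below shows one plausible way the input files relate to the output file. It is not taken from `xgb_magic_model.py`: the helper names `load_split` and `make_submission` are hypothetical, and the example assumes the transaction and identity tables share a `TransactionID` key and that the submission has `TransactionID` and `isFraud` columns. In-memory buffers stand in for the CSV files so the example runs on its own.

```python
import io

import pandas as pd


def load_split(transaction_csv, identity_csv):
    """Read one split and left-join identity rows onto transactions."""
    transaction = pd.read_csv(transaction_csv)
    identity = pd.read_csv(identity_csv)
    # Not every transaction has identity data, so keep all transaction rows.
    return transaction.merge(identity, on="TransactionID", how="left")


def make_submission(test_df, fraud_probability):
    """Pair each test TransactionID with a predicted fraud probability."""
    return pd.DataFrame({
        "TransactionID": test_df["TransactionID"].to_numpy(),
        "isFraud": fraud_probability,
    })


# Tiny stand-ins for test_transaction.csv and test_identity.csv.
transaction_csv = io.StringIO("TransactionID,TransactionAmt\n1,10.0\n2,20.0\n")
identity_csv = io.StringIO("TransactionID,DeviceType\n2,mobile\n")

test_df = load_split(transaction_csv, identity_csv)
# Placeholder scores; a trained model's predicted probabilities would go here.
submission = make_submission(test_df, [0.1, 0.9])
print(submission)
```

With real data, the buffers would be replaced by the file paths, and the result could be written with `submission.to_csv("submission_xgb_magic.csv", index=False)`.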
A detailed technical report of the model architecture and feature engineering process can be found in `XGBoost_Model_Report.md`.
