DATA-3402, Spring 2026 | Student Name: George Pallob Dewri | ID: 1002242956
This project focuses on predicting bank customer churn using a machine learning classification approach. Customer churn occurs when clients discontinue their relationship with a bank. By analyzing a comprehensive training dataset of 165,034 bank customers, we developed a binary classification model to identify individuals at high risk of exiting the bank. The data was preprocessed using standardization and one-hot encoding, and a Random Forest Classifier was trained to predict the churn status. Finally, the model was applied to an unseen test dataset of 110,023 customers to generate predictions for a Kaggle competition submission.
The data used in this project consists of tabular Bank Customer Churn datasets sourced from Kaggle:
- train.csv (165,034 rows, 14 columns): Used for training and validating the model. Contains demographic features (Geography, Gender, Age), financial features (Credit Score, Balance, Estimated Salary), and the target variable Exited (1 = Churn, 0 = Stay).
- test.csv (110,023 rows, 13 columns): The holdout dataset used to evaluate the model's final performance on the Kaggle leaderboard.
There were no missing values in either dataset.
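The row counts, data types, and missing-value check described above can be reproduced with a few pandas calls. The following is a minimal sketch, not the notebook's actual code: the helper name `summarize` is hypothetical, and the small frame is made-up data standing in for train.csv so that the example runs on its own.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return row/column counts, dtypes, and the total number of missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": int(df.isna().sum().sum()),
    }


# Made-up rows standing in for train.csv (hypothetical values, real column names).
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [33.0, 41.0, 52.0],
    "Exited": [0, 0, 1],
})

summary = summarize(toy)
print(summary["rows"], summary["columns"], summary["missing"])
```

With the real files in place, the same helper could be called as `summarize(pd.read_csv("train.csv"))` and `summarize(pd.read_csv("test.csv"))`.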
Customer retention is a critical issue for financial institutions, as acquiring new customers is often more expensive than retaining existing ones. The goal of this project is to build a robust predictive model that accurately classifies whether a customer will churn based on their profile.
The project pipeline involves:
- Data Loading & Initial Analysis: Counting rows/features and identifying data types.
- Data Visualization: Comparing the distributions of key features like Age and Balance against the target class to identify predictive signals.
- Data Preprocessing: Removing non-predictive identifiers (id, CustomerId, Surname), encoding categorical variables, and scaling continuous variables.
- Machine Learning: Training a classification algorithm, evaluating it on a validation set, and predicting outcomes for the test.csv dataset.
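Two of the pipeline steps above, dropping the identifier columns and comparing feature distributions by target class, might look roughly like the sketch below. It uses a numeric summary in place of plots, and the helper names (`drop_identifiers`, `compare_by_target`) and the toy rows are assumptions made for illustration, not taken from the notebook.

```python
import pandas as pd

# Identifier columns named in the preprocessing step.
ID_COLUMNS = ["id", "CustomerId", "Surname"]


def drop_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    # errors="ignore" lets the same call work on frames lacking some of these columns.
    return df.drop(columns=ID_COLUMNS, errors="ignore")


def compare_by_target(df: pd.DataFrame, features, target: str = "Exited") -> pd.DataFrame:
    """Mean and median of each feature, computed separately per target class."""
    return df.groupby(target)[features].agg(["mean", "median"])


# Made-up rows for demonstration only.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["A", "B", "C", "D"],
    "Age": [30, 34, 50, 58],
    "Balance": [0.0, 1000.0, 90000.0, 120000.0],
    "Exited": [0, 0, 1, 1],
})

features = drop_identifiers(toy)
by_class = compare_by_target(features, ["Age", "Balance"])
print(by_class)
```

A large gap between the per-class means or medians of a feature is a quick hint that the feature carries predictive signal, which is the same question the distribution plots address visually.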
To solve this binary classification problem, the following approach was implemented:
- Feature Engineering: Categorical features (Geography and Gender) were converted into numerical formats using One-Hot Encoding (pd.get_dummies).
- Rescaling: Continuous numerical features (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary) were standardized using StandardScaler so that they share a common scale (zero mean, unit variance). Note that tree-based models such as Random Forest are largely insensitive to this step.
- Algorithm Selection: A Random Forest Classifier was selected due to its robustness, ability to handle complex non-linear relationships, and strong performance on tabular data.
- Test Set Alignment: Ensured the final test dataset had exactly the same feature columns, in the same order, as the training dataset prior to generating predictions.
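The encoding, rescaling, and alignment steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the notebook's code: the function name `preprocess` is hypothetical, fitting the scaler on the training data only is an assumption (the write-up does not say how it was fitted), and the rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

CATEGORICAL = ["Geography", "Gender"]
CONTINUOUS = ["CreditScore", "Age", "Tenure", "Balance",
              "NumOfProducts", "EstimatedSalary"]


def preprocess(train_X: pd.DataFrame, test_X: pd.DataFrame):
    # One-hot encode the categorical columns in each frame.
    train_enc = pd.get_dummies(train_X, columns=CATEGORICAL)
    test_enc = pd.get_dummies(test_X, columns=CATEGORICAL)

    # Align: give the test frame exactly the training columns, in the same order.
    # A category absent from the test frame becomes an all-zero column.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

    # Cast to float so the scaled values can be stored without dtype conflicts.
    train_enc = train_enc.astype({c: float for c in CONTINUOUS})
    test_enc = test_enc.astype({c: float for c in CONTINUOUS})

    # Assumption: fit the scaler on training data only, then reuse it on the test data.
    scaler = StandardScaler()
    train_enc[CONTINUOUS] = scaler.fit_transform(train_enc[CONTINUOUS])
    test_enc[CONTINUOUS] = scaler.transform(test_enc[CONTINUOUS])
    return train_enc, test_enc


# Made-up rows for demonstration only.
train_X = pd.DataFrame({
    "CreditScore": [600, 650, 700, 720],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [30.0, 40.0, 50.0, 60.0],
    "Tenure": [1, 3, 5, 7],
    "Balance": [0.0, 50000.0, 100000.0, 150000.0],
    "NumOfProducts": [1, 2, 1, 2],
    "EstimatedSalary": [40000.0, 60000.0, 80000.0, 100000.0],
})
test_X = train_X.iloc[:2].copy()  # contains no "Germany" rows

train_enc, test_enc = preprocess(train_X, test_X)
print(list(test_enc.columns))
```

Reindexing against the training columns is one simple way to guarantee alignment: it adds any dummy column the test set lacks and drops any the training set never saw.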
The Random Forest model was initially evaluated on a 10% validation sample split from the training data. The model successfully identified patterns leading to customer churn, particularly highlighting the importance of features like Age and Balance.
- Validation Performance: The classification report indicated a solid balance between precision and recall, effectively identifying true churners while minimizing false alarms.
- Final Submission: The trained model successfully generated predictions for all 110,023 customers in the test set, outputting the results into submission.csv for Kaggle evaluation.
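The evaluation and submission workflow described above can be sketched as follows. The data here is synthetic, and several details are assumptions not stated in this write-up: the hyperparameters (`n_estimators=100`, `random_state=42`), the stratified split, and the `id`/`Exited` column layout of the submission file.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training data (not the real dataset).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Age": rng.normal(40, 10, n),
    "Balance": rng.uniform(0, 150000, n),
})
y = (X["Age"] + rng.normal(0, 5, n) > 45).astype(int)  # synthetic churn label

# Hold out 10% of the training data for validation (stratification is an assumption).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Per-class precision, recall, and F1 on the validation sample.
report = classification_report(y_val, model.predict(X_val))
print(report)

# Feature importances indicate which columns the forest relied on most.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Build a submission frame; the ids and test rows here are placeholders.
test_ids = pd.Series(range(1000, 1005), name="id")
X_test = X.iloc[:5]
submission = pd.DataFrame({"id": test_ids, "Exited": model.predict(X_test)})
# submission.to_csv("submission.csv", index=False)  # uncomment to write the file
```

If the leaderboard metric rewards ranking rather than hard labels, `model.predict_proba(X_test)[:, 1]` could be written to the `Exited` column instead; which form the competition expects should be checked against its evaluation page.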
To run the analysis and reproduce the results, follow these steps:
- Clone this repository to your local machine.
- Ensure you have the required Python libraries installed (pandas, numpy, matplotlib, seaborn, scikit-learn, tabulate).
- Place both the train.csv and test.csv datasets in the same directory as the Jupyter Notebook.
- Open the Jupyter Notebook (.ipynb file) and run all cells sequentially.
- The code will process the data, train the model, output the performance tables, and generate the final submission.csv file.
- Dataset: https://www.kaggle.com/competitions/playground-series-s4e1/overview
- Course Materials: Project Guidelines and template provided for DATA-3402 by Dr. Farbin.