Bank Customer Churn Prediction

DATA-3402, Spring 2026 | Student Name: George Pallob Dewri | ID: 1002242956

Abstract

This project focuses on predicting bank customer churn using a machine learning classification approach. Customer churn occurs when clients discontinue their relationship with a bank. By analyzing a comprehensive training dataset of 165,034 bank customers, we developed a binary classification model to identify individuals at high risk of exiting the bank. The data was preprocessed using standardization and one-hot encoding, and a Random Forest Classifier was trained to predict the churn status. Finally, the model was applied to an unseen test dataset of 110,023 customers to generate predictions for a Kaggle competition submission.

Dataset

The data used in this project consists of tabular Bank Customer Churn datasets sourced from Kaggle:

train.csv (165,034 rows, 14 columns): Used for training and validating the model. Contains demographic features (Geography, Gender, Age), financial features (Credit Score, Balance, Estimated Salary), and the target variable Exited (1 = Churn, 0 = Stay).
test.csv (110,023 rows, 13 columns): The holdout dataset used to evaluate the model's final performance on the Kaggle leaderboard.

There were no missing values in either dataset.
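
The shape and missing-value checks above can be reproduced with a few lines of pandas. This is a minimal sketch, not the notebook's exact code: the helper name `summarize` is hypothetical, and the small frame is made-up data standing in for train.csv.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return row count, column count, and total number of missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),
    }


# Made-up rows standing in for train.csv (assumption for illustration).
# With the real file this would be: demo = pd.read_csv("train.csv")
demo = pd.DataFrame(
    {
        "CreditScore": [668, 627, 678],
        "Geography": ["France", "France", "Spain"],
        "Age": [33.0, 33.0, 40.0],
        "Exited": [0, 0, 1],
    }
)

summary = summarize(demo)
print(summary)
```

Running the same helper on both train.csv and test.csv should report zero missing cells if the statement above holds.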

Project Description

Customer retention is a critical issue for financial institutions, as acquiring new customers is often more expensive than retaining existing ones. The goal of this project is to build a robust predictive model that accurately classifies whether a customer will churn based on their profile.

The project pipeline involves:

Data Loading & Initial Analysis: Counting rows/features and identifying data types.
Data Visualization: Comparing the distributions of key features like Age and Balance against the target class to identify predictive signals.
Data Preprocessing: Removing non-predictive identifiers (id, CustomerId, Surname), encoding categorical variables, and scaling continuous variables.
Machine Learning: Training a classification algorithm, evaluating it on a validation set, and predicting outcomes for the test.csv dataset.
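
As a numeric counterpart to the visualization step, the sketch below compares Age and Balance between churned and retained customers. It is an illustration under assumptions: the helper name `compare_by_target` is hypothetical, the rows are made-up, and the notebook itself may have used plots rather than a summary table.

```python
import pandas as pd


def compare_by_target(df: pd.DataFrame, features: list, target: str = "Exited") -> pd.DataFrame:
    """Summarise each feature separately for each value of the target class."""
    return df.groupby(target)[features].agg(["mean", "median"])


# Made-up rows for illustration only; not taken from the Kaggle data.
demo = pd.DataFrame(
    {
        "Age": [30.0, 35.0, 50.0, 55.0],
        "Balance": [0.0, 1000.0, 120000.0, 90000.0],
        "Exited": [0, 0, 1, 1],
    }
)

table = compare_by_target(demo, ["Age", "Balance"])
print(table)
```

A clear gap between the two rows of the table for a feature suggests that the feature carries predictive signal for the target.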

Approach

To solve this binary classification problem, the following approach was implemented:

Feature Engineering: Categorical features (Geography and Gender) were converted into numerical formats using One-Hot Encoding (pd.get_dummies).
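Rescaling: Continuous numerical features (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary) were standardized using StandardScaler. Tree-based models such as Random Forest are largely insensitive to feature scale, so this step mainly keeps the pipeline compatible with scale-sensitive algorithms.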
Algorithm Selection: A Random Forest Classifier was selected due to its robustness, ability to handle complex non-linear relationships, and strong performance on tabular data.
Test Set Alignment: The test dataset was given the same feature columns, in the same order, as the training dataset before predictions were generated.
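
The preprocessing steps above can be sketched as follows. This is an illustration under assumptions rather than the notebook's exact code: the function name `preprocess` is hypothetical, the demo rows are made-up, and fitting the scaler on the training data only and aligning columns with `reindex` are choices made here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

ID_COLS = ["id", "CustomerId", "Surname"]
CAT_COLS = ["Geography", "Gender"]
NUM_COLS = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary"]


def preprocess(train: pd.DataFrame, test: pd.DataFrame, target: str = "Exited"):
    """Drop identifiers, one-hot encode, align test columns, and standardize."""
    y = train[target]
    X_train = pd.get_dummies(train.drop(columns=ID_COLS + [target]), columns=CAT_COLS, dtype=int)
    X_test = pd.get_dummies(test.drop(columns=ID_COLS), columns=CAT_COLS, dtype=int)

    # Give the test set exactly the training columns; a category missing
    # from the test set becomes an all-zero dummy column.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

    # Cast to float first so the scaled values can be written back safely.
    X_train = X_train.astype({c: float for c in NUM_COLS})
    X_test = X_test.astype({c: float for c in NUM_COLS})

    # Assumption: fit the scaler on training rows only, then reuse it.
    scaler = StandardScaler()
    X_train[NUM_COLS] = scaler.fit_transform(X_train[NUM_COLS])
    X_test[NUM_COLS] = scaler.transform(X_test[NUM_COLS])
    return X_train, X_test, y


# Made-up rows for illustration only.
train_demo = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["A", "B", "C", "D"],
    "CreditScore": [668, 627, 678, 581],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [33, 40, 52, 34],
    "Tenure": [3, 1, 10, 2],
    "Balance": [0.0, 0.0, 148882.5, 0.0],
    "NumOfProducts": [2, 2, 1, 1],
    "EstimatedSalary": [181449.9, 49503.5, 184866.7, 84560.9],
    "Exited": [0, 0, 1, 0],
})
# The demo test set has no German customers, so alignment must add that column.
test_demo = train_demo.drop(columns=["Exited"]).iloc[:2].copy()

X_train, X_test, y = preprocess(train_demo, test_demo)
print(list(X_test.columns))
```

Reindexing against the training columns guards against a category that appears in only one of the two files, which would otherwise leave the model with mismatched inputs.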

Results

The Random Forest model was first evaluated on a 10% validation sample split from the training data. Age and Balance stood out as informative features for predicting churn.

Validation Performance: The classification report showed a reasonable balance between precision and recall on the churn class. The exact scores are printed in the notebook's performance tables.
Final Submission: The trained model successfully generated predictions for all 110,023 customers in the test set, outputting the results into submission.csv for Kaggle evaluation.
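
The validation and submission steps can be sketched as below. This is a minimal example under assumptions: the function name `fit_and_validate`, the hyperparameters (`n_estimators=100`, `random_state=42`), and the stratified split are illustrative choices, and the data is synthetic. The source does not state whether the notebook submitted class labels or probabilities; labels are used here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def fit_and_validate(X: pd.DataFrame, y: pd.Series, seed: int = 42):
    """Hold out 10% of the rows, fit a Random Forest, and report on the holdout."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    report = classification_report(y_val, model.predict(X_val), zero_division=0)
    return model, report


# Synthetic data for illustration only: churn is simply "Age above 45".
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.uniform(18, 80, size=200),
    "Balance": rng.uniform(0, 200000, size=200),
})
y = (X["Age"] > 45).astype(int)

model, report = fit_and_validate(X, y)
print(report)

# Predict for unseen rows and build a submission-style frame.
X_new = pd.DataFrame({"Age": [25.0, 60.0, 70.0], "Balance": [0.0, 50000.0, 120000.0]})
submission = pd.DataFrame({"id": [0, 1, 2], "Exited": model.predict(X_new)})
# With real data, this would be written out with:
# submission.to_csv("submission.csv", index=False)
```

Stratifying the split keeps the churn rate in the validation sample close to that of the full training data, which makes the report more representative when the classes are imbalanced.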

Usage

To run the analysis and reproduce the results, follow these steps:

Clone this repository to your local machine.
Ensure you have the required Python libraries installed (pandas, numpy, matplotlib, seaborn, scikit-learn, tabulate).
Place both the train.csv and test.csv datasets in the same directory as the Jupyter Notebook.
Open the Jupyter Notebook (.ipynb file) and run all cells sequentially.
The code will process the data, train the model, output the performance tables, and generate the final submission.csv file.

References

Dataset: https://www.kaggle.com/competitions/playground-series-s4e1/overview
Course Materials: Project Guidelines and template provided for DATA-3402 by Dr. Farbin.

