George-12104014/Final-Kaggle-Project

Bank Customer Churn Prediction

DATA-3402, Spring 2026 | Student Name: George Pallob Dewri | ID: 1002242956

Abstract

This project predicts bank customer churn with a machine learning classifier. Churn occurs when a client ends their relationship with the bank. Using a training dataset of 165,034 bank customers, we built a binary classification model to identify customers at high risk of leaving. The data was preprocessed with standardization and one-hot encoding, and a Random Forest Classifier was trained to predict churn status. Finally, the model was applied to an unseen test dataset of 110,023 customers to generate predictions for a Kaggle competition submission.

Dataset

The data used in this project consists of tabular Bank Customer Churn datasets sourced from Kaggle:

  • train.csv (165,034 rows, 14 columns): Used for training and validating the model. Contains demographic features (Geography, Gender, Age), financial features (Credit Score, Balance, Estimated Salary), and the target variable Exited (1 = Churn, 0 = Stay).
  • test.csv (110,023 rows, 13 columns): The holdout dataset. It has the same columns as train.csv except for the Exited target, and the predictions made on it are scored on the Kaggle leaderboard.

There were no missing values in either dataset.
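The shape and missing-value checks described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `summarize` helper is a hypothetical name, and a tiny in-memory frame stands in for `train.csv` so the snippet runs on its own.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return row/column counts and the total number of missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isna().sum().sum()),
    }

# Tiny stand-in frame (assumption); in the project this would be
# pd.read_csv("train.csv") and pd.read_csv("test.csv").
sample = pd.DataFrame({
    "CreditScore": [668, 627, 678],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [33.0, 40.0, 29.0],
    "Exited": [0, 1, 0],
})

summary = summarize(sample)
print(summary)
```

Running the same helper on both CSV files would be one way to confirm the row and column counts listed above and the absence of missing values.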

Project Description

Customer retention is a critical issue for financial institutions, as acquiring new customers is often more expensive than retaining existing ones. The goal of this project is to build a robust predictive model that accurately classifies whether a customer will churn based on their profile.

The project pipeline involves:

  1. Data Loading & Initial Analysis: Counting rows/features and identifying data types.
  2. Data Visualization: Comparing the distributions of key features like Age and Balance against the target class to identify predictive signals.
  3. Data Preprocessing: Removing non-predictive identifiers (id, CustomerId, Surname), encoding categorical variables, and scaling continuous variables.
  4. Machine Learning: Training a classification algorithm, evaluating it on a validation set, and predicting outcomes for the test.csv dataset.
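As a numeric companion to the visualization step, the sketch below compares Age and Balance between churned and retained customers. The `compare_by_target` helper and the sample values are illustrative assumptions; the notebook itself may use plots instead.

```python
import pandas as pd

def compare_by_target(df: pd.DataFrame, features: list, target: str = "Exited") -> pd.DataFrame:
    """Summarize each feature separately for every target class."""
    return df.groupby(target)[features].agg(["mean", "median"])

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "Age": [30, 35, 50, 55],
    "Balance": [0.0, 1000.0, 120000.0, 90000.0],
    "Exited": [0, 0, 1, 1],
})

table = compare_by_target(sample, ["Age", "Balance"])
print(table)
```

A large gap between the two classes for a feature suggests it carries predictive signal, which is the same thing the distribution plots are meant to reveal.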

Approach

To solve this binary classification problem, the following approach was implemented:

  • Feature Engineering: Categorical features (Geography and Gender) were converted into numerical formats using One-Hot Encoding (pd.get_dummies).
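  • Rescaling: Numerical features (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary) were standardized using StandardScaler so that they share a comparable scale. Tree-based models such as Random Forest are largely insensitive to feature scaling, so this step matters mainly if a scale-sensitive model is swapped in later.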
  • Algorithm Selection: A Random Forest Classifier was selected due to its robustness, ability to handle complex non-linear relationships, and strong performance on tabular data.
  • Test Set Alignment: The final test dataset was checked to have exactly the same feature columns as the training dataset before generating predictions.
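The preprocessing and alignment steps above might look roughly like the sketch below. The column names come from this README; the `preprocess` function, the use of `reindex` for alignment, and the tiny sample frame are assumptions made for illustration and may differ from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

ID_COLS = ["id", "CustomerId", "Surname"]
CAT_COLS = ["Geography", "Gender"]
NUM_COLS = ["CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary"]

def preprocess(train: pd.DataFrame, test: pd.DataFrame):
    """Drop identifiers, one-hot encode, align test columns, and standardize."""
    X_train = pd.get_dummies(train.drop(columns=ID_COLS + ["Exited"]), columns=CAT_COLS, dtype=int)
    X_test = pd.get_dummies(test.drop(columns=ID_COLS), columns=CAT_COLS, dtype=int)

    # Assumed alignment strategy: give the test set the training columns,
    # filling categories that never appear in the test data with 0.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

    # Cast to float first so the scaled values fit the column dtype.
    X_train = X_train.astype({c: float for c in NUM_COLS})
    X_test = X_test.astype({c: float for c in NUM_COLS})

    # Fit the scaler on training data only, then reuse it for the test data.
    scaler = StandardScaler()
    X_train[NUM_COLS] = scaler.fit_transform(X_train[NUM_COLS])
    X_test[NUM_COLS] = scaler.transform(X_test[NUM_COLS])
    return X_train, X_test, train["Exited"]

# Made-up rows for illustration only.
train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["A", "B", "C", "D"],
    "CreditScore": [600, 650, 700, 750],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [30, 40, 50, 60],
    "Tenure": [1, 2, 3, 4],
    "Balance": [0.0, 50000.0, 100000.0, 0.0],
    "NumOfProducts": [1, 2, 1, 2],
    "EstimatedSalary": [50000.0, 60000.0, 70000.0, 80000.0],
    "Exited": [0, 0, 1, 1],
})
test = train.drop(columns=["Exited"]).head(1)  # contains only France / Male

X_train, X_test, y_train = preprocess(train, test)
print(list(X_test.columns))
```

Fitting the scaler on the training data alone keeps test-set statistics out of the model, and the `reindex` call guards against a category that is present in one file but missing from the other.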

Results

The Random Forest model was first evaluated on a 10% validation sample split from the training data. Age and Balance stood out as important predictors of customer churn.

  • Validation Performance: The classification report showed a reasonable balance between precision and recall, meaning the model identified true churners while limiting false alarms. The exact figures are in the performance tables printed by the notebook.
  • Final Submission: The trained model successfully generated predictions for all 110,023 customers in the test set, outputting the results into submission.csv for Kaggle evaluation.
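The evaluation and submission flow described above can be sketched as below. Synthetic data from `make_classification` stands in for the preprocessed features so the snippet is self-contained. The hyperparameters, the stratified split, and the use of hard class predictions in the submission are assumptions rather than the notebook's confirmed settings.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training features and labels.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Hold out 10% for validation; stratifying by the label is an assumption.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

# Hyperparameters here are illustrative defaults, not the project's values.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

# Per-class precision, recall, and F1 on the validation sample.
report = classification_report(y_val, model.predict(X_val))
print(report)

# Stand-in for the aligned test features and their ids.
X_test = X[:5]
test_ids = np.arange(5)
submission = pd.DataFrame({"id": test_ids, "Exited": model.predict(X_test)})

# With no path, to_csv returns the CSV text; pass "submission.csv" to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

If the competition scores probabilities rather than labels, `model.predict_proba(X_test)[:, 1]` could replace `model.predict(X_test)` in the submission frame.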

Usage

To run the analysis and reproduce the results, follow these steps:

  1. Clone this repository to your local machine.
  2. Ensure you have the required Python libraries installed (pandas, numpy, matplotlib, seaborn, scikit-learn, tabulate).
  3. Place both the train.csv and test.csv datasets in the same directory as the Jupyter Notebook.
  4. Open the Jupyter Notebook (.ipynb file) and run all cells sequentially.
  5. The code will process the data, train the model, output the performance tables, and generate the final submission.csv file.

References
