Skip to content

filesofShane/Email-Classifier

Repository files navigation

Email Classifier Pipeline

An automated email classification pipeline that categorises incoming financial services emails into one of five categories using the Anthropic Claude API (Claude Haiku). The classifier uses a zero-shot prompting approach — no model training is required.

Categories

  • Account Management
  • Insurance Claims
  • Investment Advisory
  • Loan Processing
  • Other

Project Structure

Email Classifier/
├── train/                  # Raw training HTML emails
├── test/                   # Raw test HTML emails
├── parse_01.py             # HTML email parser utility
├── EDA_02.ipynb            # Exploratory Data Analysis notebook
├── training_03.py          # Data preparation, loading, and train/validation split
├── evaluation_04.py        # Evaluation framework, classification report, and confusion matrix
├── generate_data.py        # Synthetic data generator (used to augment the small sample size)
├── main.py                 # Classifier and prediction pipeline entry point
├── train_labels.csv        # Ground truth labels for the training set
├── predictions.csv         # Final predicted labels and confidence scores for the test set
├── requirements.txt        # Python dependencies
└── .env.example            # Environment variable template

Setup

1. Clone or download the project

2. Install dependencies

pip install -r requirements.txt

3. Set up the Anthropic API key

Copy .env.example to .env and add your key:

ANTHROPIC_API_KEY=your-api-key-here

Or set it directly in your terminal:

# Windows PowerShell
$env:ANTHROPIC_API_KEY="sk-ant-..."
# Windows Command Prompt
set ANTHROPIC_API_KEY=sk-ant-...

4. Run the pipeline

python main.py

Output

  • predictions.csv — final predictions on the test set with columns:
    • email_id — email identifier
    • predicted_category — predicted class label
    • confidence_score — model's confidence margin between top 2 predicted categories (0 to 1)
  • confusion_matrix.png — confusion matrix from validation set evaluation

Synthetic Data Generation (Optional)

The original dataset consisted of 44 training emails and 12 test emails — a relatively small sample size. A synthetic data generator (generate_data.py) was included to augment the dataset while preserving the original class imbalance distribution.

How to generate synthetic data

Step 1 — Update the starting email IDs

Open generate_data.py and update the starting ID constants to continue from where your existing data ends:

TRAIN_START_ID = 45   # change this to: existing train emails + 1
TEST_START_ID  = 13   # change this to: existing test emails + 1

For example if your existing train set has 44 emails and test set has 12, the values above are correct. If your dataset is different, adjust accordingly.

Step 2 — Run the generator

python generate_data.py

This will generate HTML files in a synthetic/ folder:

  • synthetic/train/ — synthetic training emails
  • synthetic/test/ — synthetic test emails
  • synthetic/synthetic_train_labels.csv — labels for synthetic training emails

Step 3 — Merge synthetic data with original data

Copy the synthetic HTML files into your existing train/ and test/ folders, then merge the labels:

import pandas as pd

original  = pd.read_csv("train_labels.csv")
synthetic = pd.read_csv("synthetic/synthetic_train_labels.csv")
pd.concat([original, synthetic]).reset_index(drop=True).to_csv("train_labels.csv", index=False)

No changes to the pipeline scripts are needed — the parser automatically picks up all HTML files in the train and test folders.


Requirements

See requirements.txt. Key dependencies:

  • anthropic — Claude API client
  • beautifulsoup4 — HTML email parser
  • pandas — data manipulation
  • scikit-learn — train/validation split and evaluation metrics
  • matplotlib — confusion matrix visualisation

About

An email classifier that makes use of prompts using Anthropic's Claude API.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors