Email Classifier Pipeline

An automated pipeline that classifies incoming financial services emails into one of five categories using the Anthropic Claude API (Claude Haiku). The classifier uses zero-shot prompting, so no model training is required.
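To make the zero-shot idea concrete, here is a minimal sketch of how a prompt could be built and the reply checked. It is not the code in main.py. The category names are hypothetical placeholders, since the five categories are not listed here, and build_prompt and parse_label are illustrative helper names.

```python
from typing import Optional

# Hypothetical category names: placeholders only, not the project's real labels.
CATEGORIES = [
    "account_query",
    "complaint",
    "loan_application",
    "fraud_report",
    "general_enquiry",
]


def build_prompt(email_text: str) -> str:
    """Build a zero-shot prompt: instructions and the label set, no worked examples."""
    labels = "\n".join(f"- {name}" for name in CATEGORIES)
    return (
        "Classify the email below into exactly one of these categories:\n"
        f"{labels}\n\n"
        "Reply with the category name only.\n\n"
        f"Email:\n{email_text}"
    )


def parse_label(reply: str) -> Optional[str]:
    """Map the model's raw reply to a known category, or None if unrecognised."""
    cleaned = reply.strip().lower()
    return cleaned if cleaned in CATEGORIES else None
```

The string returned by build_prompt would be sent as the user message in a Claude API request, and the text of the response passed to parse_label. Returning None for an unrecognised reply lets the caller decide whether to retry or flag the email for review.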

Project Structure

Email Classifier/
├── train/                  # Raw training HTML emails
├── test/                   # Raw test HTML emails
├── parse_01.py             # HTML email parser utility
├── EDA_02.ipynb            # Exploratory Data Analysis notebook
├── training_03.py          # Data preparation, loading, and train/validation split
├── evaluation_04.py        # Evaluation framework, classification report, and confusion matrix
├── generate_data.py        # Synthetic data generator (used to augment the small sample size)
├── main.py                 # Classifier and prediction pipeline entry point
├── train_labels.csv        # Ground truth labels for the training set
├── predictions.csv         # Final predicted labels and confidence scores for the test set
├── requirements.txt        # Python dependencies
└── .env.example            # Environment variable template

Setup

1. Clone or download the project

2. Install dependencies

pip install -r requirements.txt

3. Set up the Anthropic API key

Copy .env.example to .env and add your key:

ANTHROPIC_API_KEY=your-api-key-here

Or set it directly in your terminal:

# Windows PowerShell
$env:ANTHROPIC_API_KEY="sk-ant-..."

# Windows Command Prompt
set ANTHROPIC_API_KEY=sk-ant-...

# macOS / Linux (bash, zsh)
export ANTHROPIC_API_KEY="sk-ant-..."
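A quick way to confirm the key is visible to Python before running the pipeline is sketched below. It assumes the variable has already reached the process environment; how main.py loads the .env file is not shown in this README, and get_api_key is an illustrative helper name.

```python
import os


def get_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; see the Setup section.")
    return key
```

Failing early with a named variable is usually easier to debug than an authentication error returned by the API on the first request.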

4. Run the pipeline

python main.py

Output

predictions.csv — final predictions on the test set with columns:
- email_id — email identifier
- predicted_category — predicted class label
- confidence_score — model's confidence margin between top 2 predicted categories (0 to 1)
confusion_matrix.png — confusion matrix from validation set evaluation
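
The confidence margin can be illustrated with a short sketch. It assumes the model returns a score between 0 and 1 for each category. This README does not say how main.py obtains those scores, so the input format, the example labels, and the function name confidence_margin are all assumptions.

```python
from typing import Dict, Tuple


def confidence_margin(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the top category and the gap between the two highest scores."""
    if len(scores) < 2:
        raise ValueError("need scores for at least two categories")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best), (_, runner_up) = ranked[0], ranked[1]
    return best_label, best - runner_up


# Hypothetical scores for one email
label, margin = confidence_margin(
    {"complaint": 0.75, "general_enquiry": 0.25, "fraud_report": 0.0}
)
print(label, margin)  # complaint 0.5
```

A margin near 0 means the top two categories were almost tied, while a margin near 1 means the top category clearly dominated.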

Synthetic Data Generation (Optional)

The original dataset consisted of 44 training emails and 12 test emails, a relatively small sample. A synthetic data generator (generate_data.py) is included to augment the dataset while preserving the original class distribution, including its imbalance.

How to generate synthetic data

Step 1 — Update the starting email IDs

Open generate_data.py and update the starting ID constants to continue from where your existing data ends:

TRAIN_START_ID = 45   # change this to: existing train emails + 1
TEST_START_ID  = 13   # change this to: existing test emails + 1

For example, if your existing train set has 44 emails and your test set has 12, the values above are correct. If your dataset differs, adjust accordingly.
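
If you prefer not to count files by hand, the "existing emails + 1" rule can be sketched as below. This assumes every email is a .html file directly inside the folder and that IDs run contiguously from 1. next_start_id is an illustrative helper, not part of generate_data.py.

```python
from pathlib import Path


def next_start_id(folder: str) -> int:
    """Return the number of existing .html files in the folder, plus one."""
    return sum(1 for _ in Path(folder).glob("*.html")) + 1


# Example: values to copy into generate_data.py
print("TRAIN_START_ID =", next_start_id("train"))
print("TEST_START_ID  =", next_start_id("test"))
```

If IDs in your dataset have gaps, the count will be lower than the highest existing ID, so check the file names before relying on it.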

Step 2 — Run the generator

python generate_data.py

This writes the synthetic emails and their labels to a synthetic/ folder:

synthetic/train/ — synthetic training emails
synthetic/test/ — synthetic test emails
synthetic/synthetic_train_labels.csv — labels for synthetic training emails

Step 3 — Merge synthetic data with original data

Copy the synthetic HTML files into your existing train/ and test/ folders, then merge the labels:

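import pandas as pd

# Load the original and synthetic label files
original  = pd.read_csv("train_labels.csv")
synthetic = pd.read_csv("synthetic/synthetic_train_labels.csv")

# Append the synthetic labels and overwrite train_labels.csv
merged = pd.concat([original, synthetic], ignore_index=True)
merged.to_csv("train_labels.csv", index=False)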

No changes to the pipeline scripts are needed — the parser automatically picks up all HTML files in the train and test folders.
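
The sketch below shows why new files are picked up without code changes: the folder is scanned for every .html file at run time. It is not the code in parse_01.py. The project lists beautifulsoup4 as its HTML parser, but this example swaps in the standard library's html.parser so it runs without extra installs. html_to_text and load_emails are illustrative names.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def load_emails(folder: str) -> dict:
    """Map each file stem to the text extracted from that HTML file."""
    return {
        path.stem: html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(Path(folder).glob("*.html"))
    }
```

Calling load_emails("train") after copying the synthetic files into train/ would return the original and synthetic emails together, keyed by file name without the extension.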

Requirements

See requirements.txt. Key dependencies:

anthropic — Claude API client
beautifulsoup4 — HTML email parser
pandas — data manipulation
scikit-learn — train/validation split and evaluation metrics
matplotlib — confusion matrix visualisation
