An automated email classification pipeline that categorises incoming financial services emails into one of five categories using the Anthropic Claude API (Claude Haiku). The classifier uses a zero-shot prompting approach — no model training is required.
- Account Management
- Insurance Claims
- Investment Advisory
- Loan Processing
- Other
Email Classifier/
├── train/ # Raw training HTML emails
├── test/ # Raw test HTML emails
├── parse_01.py # HTML email parser utility
├── EDA_02.ipynb # Exploratory Data Analysis notebook
├── training_03.py # Data preparation, loading, and train/validation split
├── evaluation_04.py # Evaluation framework, classification report, and confusion matrix
├── generate_data.py # Synthetic data generator (used to augment the small sample size)
├── main.py # Classifier and prediction pipeline entry point
├── train_labels.csv # Ground truth labels for the training set
├── predictions.csv # Final predicted labels and confidence scores for the test set
├── requirements.txt # Python dependencies
└── .env.example # Environment variable template
pip install -r requirements.txtCopy .env.example to .env and add your key:
ANTHROPIC_API_KEY=your-api-key-here
Or set it directly in your terminal:
# Windows PowerShell
$env:ANTHROPIC_API_KEY="sk-ant-..."# Windows Command Prompt
set ANTHROPIC_API_KEY=sk-ant-...python main.pypredictions.csv— final predictions on the test set with columns:email_id— email identifierpredicted_category— predicted class labelconfidence_score— model's confidence margin between top 2 predicted categories (0 to 1)
confusion_matrix.png— confusion matrix from validation set evaluation
The original dataset consisted of 44 training emails and 12 test emails — a relatively small sample size. A synthetic data generator (generate_data.py) was included to augment the dataset while preserving the original class imbalance distribution.
Step 1 — Update the starting email IDs
Open generate_data.py and update the starting ID constants to continue from where your existing data ends:
TRAIN_START_ID = 45 # change this to: existing train emails + 1
TEST_START_ID = 13 # change this to: existing test emails + 1For example if your existing train set has 44 emails and test set has 12, the values above are correct. If your dataset is different, adjust accordingly.
Step 2 — Run the generator
python generate_data.pyThis will generate HTML files in a synthetic/ folder:
synthetic/train/— synthetic training emailssynthetic/test/— synthetic test emailssynthetic/synthetic_train_labels.csv— labels for synthetic training emails
Step 3 — Merge synthetic data with original data
Copy the synthetic HTML files into your existing train/ and test/ folders, then merge the labels:
import pandas as pd
original = pd.read_csv("train_labels.csv")
synthetic = pd.read_csv("synthetic/synthetic_train_labels.csv")
pd.concat([original, synthetic]).reset_index(drop=True).to_csv("train_labels.csv", index=False)No changes to the pipeline scripts are needed — the parser automatically picks up all HTML files in the train and test folders.
See requirements.txt. Key dependencies:
anthropic— Claude API clientbeautifulsoup4— HTML email parserpandas— data manipulationscikit-learn— train/validation split and evaluation metricsmatplotlib— confusion matrix visualisation