This project is a phishing email detection system built as a Chrome Extension with a FastAPI backend. The extension scans Gmail or normal webpages, extracts visible text and links, sends that content to the backend, and displays a phishing label, phishing probability score, and explanation flags.
The system supports two phases:
- Pretraining mode: the backend has no trained model loaded and uses a deliberately weak heuristic baseline.
- Post-training mode: the backend loads a trained TF-IDF + Logistic Regression model and uses it to calculate phishing probability.
The training workflow is designed to avoid data leakage:
- Training uses a balanced set of 900 phishing + 900 non-phishing emails.
- Emails already used inside test_data/ are excluded from the training pool.
- The same test set can be evaluated before and after training to show improvement.
phishing/
├── backend/
│ ├── app/
│ │ ├── evaluation.py
│ │ ├── feedback.py
│ │ ├── flags.py
│ │ ├── main.py
│ │ ├── model_io.py
│ │ ├── reputation.py
│ │ └── schemas.py
│ ├── artifacts/
│ │ ├── classifier.joblib
│ │ ├── vectorizer.joblib
│ │ ├── metrics.json
│ │ └── confusion_matrix.csv
│ ├── evaluate_test_data.py
│ ├── prepare_test_data.py
│ ├── train.py
│ └── Dockerfile
├── extension/
│ ├── manifest.json
│ ├── content.js
│ ├── service_worker.js
│ ├── popup.html
│ ├── popup.js
│ ├── options.html
│ ├── options.js
│ └── icon128.png
├── test_data/
│ ├── phishing/
│ ├── non_phishing/
│ ├── evaluation_log_template.csv
│ └── fixed_evaluation_log_template.csv
├── requirements.txt
└── README.md
+-----------------------------+ +-------------------------------+
| Chrome Browser | | FastAPI Backend |
|-----------------------------| |-------------------------------|
| Gmail / Website Page | | /predict |
| visible text + links | | /health |
+-------------+---------------+ | /preferences/block |
| | /preferences/allow |
v | /reputation/check |
+-----------------------------+ +---------------+---------------+
| extension/content.js | |
|-----------------------------| |
| isVisible() | |
| getGmailScanRoot() | |
| extractVisibleText() | |
| extractLinks() | |
| detects page changes | |
+-------------+---------------+ |
| chrome.runtime message |
v |
+-----------------------------+ HTTP POST /predict |
| extension/service_worker.js |------------------------->|
|-----------------------------| |
| runScanForTab() | v
| requestPageData() | +-------------------------------+
| maybeAutoScan() | | backend/app/main.py |
| saveLastScanResult() | |-------------------------------|
| showScanNotification() | | startup_event() |
+-------------+---------------+ | predict() |
| | apply_user_domain_policy() |
| | has_strong_heuristic_evidence |
v +---------------+---------------+
+-----------------------------+ |
| extension/popup.js | |
|-----------------------------| |
| renderResult() | |
| Scan button | |
| Block Domain button | |
| Allow Domain button | |
+-----------------------------+ |
|
+----------------------------------+----------------------------------+
| | |
v v v
+-----------------------------+ +-----------------------------+ +-----------------------------+
| backend/app/flags.py | | backend/app/model_io.py | | backend/app/reputation.py |
|-----------------------------| |-----------------------------| |-----------------------------|
| detect_flags() | | load_artifacts() | | DomainReputationService |
| keyword checks | | vectorizer.joblib | | trusted domains |
| money scam checks | | classifier.joblib | | flagged domains |
| pressure checks | +-------------+---------------+ | Google Safe Browsing opt. |
| link checks | | +-----------------------------+
+-----------------------------+ |
v
+-----------------------------+
| backend/artifacts/ |
|-----------------------------|
| vectorizer.joblib |
| classifier.joblib |
| metrics.json |
| confusion_matrix.csv |
+-----------------------------+
+-----------------------------+
| backend/train.py |
+-------------+---------------+
|
v
+-----------------------------+
| Kaggle phishing dataset |
| naserabdullahalam/... |
+-------------+---------------+
|
v
+-----------------------------+
| Clean and normalize text |
| Convert labels to 0 or 1 |
+-------------+---------------+
|
v
+-----------------------------+
| Exclude test_data emails |
| Prevents training leakage |
+-------------+---------------+
|
v
+-----------------------------+
| Sample balanced data |
| 900 phishing |
| 900 non-phishing |
+-------------+---------------+
|
v
+-----------------------------+
| Split training/validation |
| 80% train, 20% validation |
+-------------+---------------+
|
v
+-----------------------------+
| TfidfVectorizer |
| Converts words to numbers |
| max_features=30000 |
| ngram_range=(1,2) |
+-------------+---------------+
|
v
+-----------------------------+
| LogisticRegression |
| Learns phishing patterns |
| class_weight=balanced |
+-------------+---------------+
|
v
+-----------------------------+
| Save model artifacts |
| vectorizer.joblib |
| classifier.joblib |
| metrics.json |
+-----------------------------+
The project has three main parts:
- The Chrome Extension is the user-facing layer. It scans Gmail or a normal webpage, extracts visible text and links, and displays the result.
- The FastAPI backend is the decision engine. It receives text and links, applies rule-based phishing checks, checks domain reputation, applies allowlist/blocklist policy, and runs the trained machine learning model when available.
- The training pipeline is separate from the live app. It downloads a phishing email dataset, removes test-data overlap, trains the model, and saves the model artifacts.
When Gmail or a webpage changes, extension/content.js detects the page change and extracts visible page text and links.
extension/service_worker.js receives the scan trigger, asks content.js for the page data, then sends this request to the backend:
POST http://127.0.0.1:8000/predict
The backend returns a response like this:
{
"label": "phishing",
"probability_phishing": 0.630395,
"flags": ["Risky words found: beneficiary, investment, transfer"]
}
The popup displays:
- label: either phishing or legitimate
- probability_phishing: a score from 0.0 to 1.0
- flags: human-readable explanation of what looked suspicious
If the result is medium or high risk, the extension also shows a browser notification.
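For testing outside the browser, the same request can be built from a script. The following is a minimal sketch that assumes the request body uses the visible_text and links fields shown in this README's curl example. The helper names build_predict_request and scan are illustrative and are not part of the project.

```python
import json
import urllib.request

API_BASE_URL = "http://127.0.0.1:8000"  # default local backend URL from this README


def build_predict_request(visible_text, links, api_base_url=API_BASE_URL):
    """Build (but do not send) a POST /predict request."""
    payload = {"visible_text": visible_text, "links": links}
    return urllib.request.Request(
        url=f"{api_base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan(visible_text, links, api_base_url=API_BASE_URL):
    """Send the request and return the decoded JSON body.

    Requires a running backend; raises urllib.error.URLError otherwise.
    """
    request = build_predict_request(visible_text, links, api_base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, calling scan("Urgent! Verify your account immediately", [{"text": "Click here", "href": "http://suspicious-example.com/login"}]) should return a dictionary with the label, probability_phishing, and flags keys described above.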
Training happens by running:
python backend/train.py
The script uses the Kaggle dataset:
naserabdullahalam/phishing-email-dataset
The training script performs these steps:
- Loads email text and labels from the dataset.
- Converts labels into 1 = phishing and 0 = legitimate.
- Extracts email text from test_data.
- Removes matching test emails from the training dataset.
- Randomly samples 900 phishing and 900 non-phishing emails.
- Splits the 1800 examples into training and validation sets.
- Converts email text into numeric TF-IDF features.
- Trains a Logistic Regression classifier.
- Saves the trained files into backend/artifacts.
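The leakage-removal and balanced-sampling steps can be sketched as follows. This is an illustration under stated assumptions, not the code in backend/train.py: it assumes that "matching" means the normalized text is identical, and the function names and the seed value are hypothetical.

```python
import random


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies still match."""
    return " ".join(text.lower().split())


def exclude_test_emails(pool, test_texts):
    """Drop (text, label) rows whose normalized text also appears in the test set."""
    held_out = {normalize(t) for t in test_texts}
    return [(text, label) for text, label in pool if normalize(text) not in held_out]


def sample_balanced(pool, per_class=900, seed=42):
    """Randomly pick per_class phishing (1) and per_class legitimate (0) rows."""
    rng = random.Random(seed)  # seed is an assumption, used for repeatability
    phishing = [row for row in pool if row[1] == 1]
    legitimate = [row for row in pool if row[1] == 0]
    if len(phishing) < per_class or len(legitimate) < per_class:
        raise ValueError("not enough examples in one class")
    sampled = rng.sample(phishing, per_class) + rng.sample(legitimate, per_class)
    rng.shuffle(sampled)
    return sampled
```

Filtering before sampling matters: if the test emails were removed after sampling, the classes could end up slightly unbalanced.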
Current saved training metrics:
Training examples: 1800
Phishing examples: 900
Non-phishing examples: 900
Precision: 95.48%
Recall: 93.89%
F1 score: 94.68%
PR AUC: 98.56%
Validation confusion matrix:
actual legitimate predicted legitimate: 172
actual legitimate predicted phishing: 8
actual phishing predicted legitimate: 11
actual phishing predicted phishing: 169
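The precision, recall, and F1 figures above follow directly from this confusion matrix. A short worked check, treating phishing as the positive class (the function name is illustrative):

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return (precision, recall, f1) for the positive (phishing) class."""
    precision = tp / (tp + fp)  # of everything flagged as phishing, how much was phishing
    recall = tp / (tp + fn)     # of all real phishing, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = metrics_from_confusion(tn=172, fp=8, fn=11, tp=169)
print(f"Precision: {precision:.2%}")  # Precision: 95.48%
print(f"Recall:    {recall:.2%}")     # Recall:    93.89%
print(f"F1 score:  {f1:.2%}")         # F1 score:  94.68%
```

The four cells sum to 360, which is the 20% validation split of the 1800 examples. PR AUC cannot be recovered from the matrix because it depends on the full ranking of scores rather than on a single threshold.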
There are two scoring modes.
Before training, the backend has no machine learning files. It uses a deliberately weak heuristic baseline:
pretraining_score = 0.15 * number_of_flags
maximum score = 0.45
Because the score is capped at 0.45 and the heuristic phishing threshold is 0.50, the flag score alone can never cross the threshold. Pretraining mode therefore records explanations but does not overstate confidence, which makes the trained model's improvement clearer during evaluation.
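The baseline arithmetic can be written out as a small sketch. The constants come from this README; the function names are illustrative, and the rounding is added here only to hide floating-point noise.

```python
FLAG_WEIGHT = 0.15            # each detected flag adds 0.15
MAX_PRETRAINING_SCORE = 0.45  # cap on the heuristic score
HEURISTIC_THRESHOLD = 0.50    # heuristic phishing threshold


def pretraining_score(flags):
    """Score a page from its explanation flags alone, capped at 0.45."""
    return round(min(FLAG_WEIGHT * len(flags), MAX_PRETRAINING_SCORE), 2)


def pretraining_label(flags):
    """Label from the capped score; the cap sits below the threshold."""
    return "phishing" if pretraining_score(flags) >= HEURISTIC_THRESHOLD else "legitimate"
```

Three flags already reach the cap, so any further flags add explanation text without raising the score.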
After training, the backend loads:
backend/artifacts/vectorizer.joblib
backend/artifacts/classifier.joblib
The trained model score is calculated as:
probability_phishing = classifier.predict_proba(vectorized_email_text)[0][1]
The backend marks a page as phishing when:
probability_phishing >= 0.60
The threshold is 0.60 instead of 0.50 because earlier testing showed that the lower threshold produced too many false positives.
The pretraining heuristic baseline uses manual scoring:
Each detected flag adds 0.15
Maximum pretraining score is 0.45
Heuristic phishing threshold is 0.50
The trained model uses learned weighting:
TF-IDF gives importance to words and two-word phrases.
Logistic Regression learns which words and phrases increase or decrease phishing probability.
class_weight="balanced" prevents the model from favoring one class too strongly.
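A minimal sketch of how these pieces fit together, assuming scikit-learn is installed. The settings max_features=30000, ngram_range=(1, 2), and class_weight="balanced" come from this README; the toy corpus, max_iter, and the helper name phishing_probability are illustrative only, and backend/train.py may differ in its details.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus for illustration only; the real training set has 1800 emails.
texts = [
    "urgent verify your account immediately",
    "transfer the investment to the beneficiary today",
    "your password expires click here to confirm",
    "you won a prize send your bank details",
    "meeting moved to 3pm see agenda attached",
    "lunch on friday with the team",
    "quarterly report draft for your review",
    "thanks for the update see you tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = phishing, 0 = legitimate

# Words and two-word phrases become weighted numeric features.
vectorizer = TfidfVectorizer(max_features=30000, ngram_range=(1, 2))
features = vectorizer.fit_transform(texts)

classifier = LogisticRegression(class_weight="balanced", max_iter=1000)
classifier.fit(features, labels)


def phishing_probability(email_text):
    """Return P(phishing); column 1 matches label 1 because classes_ is [0, 1]."""
    vectorized = vectorizer.transform([email_text])
    return float(classifier.predict_proba(vectorized)[0][1])
```

The returned value is the number that the backend compares against the 0.60 threshold.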
There are also domain policy overrides:
Blocklisted domain: force phishing, score at least 0.98.
Allowlisted clean domain: force legitimate, score at most 0.10.
Allowlisted risky page: still warn if the ML score is high or strong phishing rules appear.
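The three overrides can be read as a small decision function. The sketch below is not the project's apply_user_domain_policy(): the cutoff for a "high" ML score, the precedence of the blocklist over the allowlist, and the choice to label an allowlisted risky page as phishing with its score unchanged are all assumptions made for illustration.

```python
PHISHING_THRESHOLD = 0.60  # model threshold from this README
HIGH_ML_SCORE = 0.90       # ASSUMPTION: the README does not define "high"


def apply_domain_policy_sketch(score, domain, blocklist, allowlist,
                               strong_rule_evidence=False):
    """Return (label, score) after applying user allow/block overrides."""
    if domain in blocklist:
        # Blocklisted: force phishing, score at least 0.98.
        return "phishing", max(score, 0.98)
    if domain in allowlist:
        risky = strong_rule_evidence or score >= HIGH_ML_SCORE
        if not risky:
            # Allowlisted and clean: force legitimate, score at most 0.10.
            return "legitimate", min(score, 0.10)
        # Allowlisted but risky: still warn (assumed: keep the score as-is).
        return "phishing", score
    # No override: fall back to the plain threshold.
    label = "phishing" if score >= PHISHING_THRESHOLD else "legitimate"
    return label, score
```

The third branch is what keeps an allowlist entry from silencing a compromised but trusted domain.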
The user does not directly interact with the ML model. The extension hides that complexity.
User opens Gmail/email/page
Extension extracts text and links
Extension sends content to FastAPI
FastAPI converts text into ML features
ML model calculates phishing probability
FastAPI returns label, score, and explanation flags
Extension displays warning to user
Use the command set that matches your operating system.
Open Terminal (macOS / Linux).
cd /path/to/phishing
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Run the backend:
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Health check:
curl http://127.0.0.1:8000/health
Stop the backend with Control + C.
Deactivate the virtual environment:
deactivate
The backend can now be started in either mode without deleting files.
Use without model when you want the pretraining or heuristic-only baseline. In this mode, the API still runs, but it does not load vectorizer.joblib or classifier.joblib.
Use with model when you want the trained TF-IDF + Logistic Regression classifier to calculate the phishing probability.
macOS / Linux, without model:
source .venv/bin/activate
PHISHING_DISABLE_MODEL=1 python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Health check:
curl http://127.0.0.1:8000/health
Expected health response:
{
"status": "ok",
"model_loaded": false,
"model_disabled_by_env": true
}
macOS / Linux, with model:
source .venv/bin/activate
python backend/train.py
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Expected health response:
{
"status": "ok",
"model_loaded": true,
"model_disabled_by_env": false
}
Windows PowerShell, without model:
.\.venv\Scripts\Activate.ps1
$env:PHISHING_DISABLE_MODEL="1"
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Health check:
Invoke-RestMethod http://127.0.0.1:8000/health
To return to normal model loading in the same PowerShell window:
Remove-Item Env:\PHISHING_DISABLE_MODEL
Windows PowerShell, with model:
.\.venv\Scripts\Activate.ps1
python backend/train.py
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Windows CMD, without model:
.venv\Scripts\activate.bat
set PHISHING_DISABLE_MODEL=1
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Health check:
curl http://127.0.0.1:8000/health
To return to normal model loading in the same CMD window:
set PHISHING_DISABLE_MODEL=
Windows CMD, with model:
.venv\Scripts\activate.bat
python backend\train.py
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Open PowerShell in the project folder.
cd C:\path\to\phishing
py -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
If PowerShell blocks activation, run this once for the current user:
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
Then activate again:
.\.venv\Scripts\Activate.ps1
Run the backend:
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Health check:
Invoke-RestMethod http://127.0.0.1:8000/health
Stop the backend with Ctrl + C.
Deactivate the virtual environment:
deactivate
Open Command Prompt in the project folder.
cd C:\path\to\phishing
py -m venv .venv
.venv\Scripts\activate.bat
python -m pip install --upgrade pip
pip install -r requirements.txt
Run the backend:
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Health check:
curl http://127.0.0.1:8000/health
Stop the backend with Ctrl + C.
Deactivate the virtual environment:
deactivate
Activate your virtual environment first, then run:
macOS / Linux:
python backend/train.py
Windows PowerShell:
python backend/train.py
Windows CMD:
python backend\train.py
Useful options:
python backend/train.py --samples-per-class 900
python backend/train.py --samples-per-class 900 --allow-synthetic-non-phishing
Training outputs:
backend/artifacts/vectorizer.joblib
backend/artifacts/classifier.joblib
backend/artifacts/metrics.json
backend/artifacts/confusion_matrix.csv
After training, restart the backend so it loads the new model artifacts.
The project includes test emails under:
test_data/phishing/
test_data/non_phishing/
The main fixed evaluation CSV is:
test_data/fixed_evaluation_log_template.csv
macOS / Linux:
Start the backend in heuristic-only mode if you want to record pretraining results:
PHISHING_DISABLE_MODEL=1 python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
In a second Terminal window:
source .venv/bin/activate
python backend/evaluate_test_data.py --api-base-url http://127.0.0.1:8000 --evaluation-csv test_data/fixed_evaluation_log_template.csv --reset-results
Train the model:
python backend/train.py
Restart the backend without PHISHING_DISABLE_MODEL, then run post-training evaluation:
python backend/evaluate_test_data.py --api-base-url http://127.0.0.1:8000 --evaluation-csv test_data/fixed_evaluation_log_template.csv
Windows PowerShell:
Start the backend in heuristic-only mode:
$env:PHISHING_DISABLE_MODEL="1"
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
In a second PowerShell window:
.\.venv\Scripts\Activate.ps1
python backend/evaluate_test_data.py --api-base-url http://127.0.0.1:8000 --evaluation-csv test_data/fixed_evaluation_log_template.csv --reset-results
Train the model:
python backend/train.py
Clear the heuristic-only switch, restart the backend, then run post-training evaluation:
Remove-Item Env:\PHISHING_DISABLE_MODEL
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
python backend/evaluate_test_data.py --api-base-url http://127.0.0.1:8000 --evaluation-csv test_data/fixed_evaluation_log_template.csv
Windows CMD:
Start the backend in heuristic-only mode:
set PHISHING_DISABLE_MODEL=1
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
In a second CMD window:
.venv\Scripts\activate.bat
python backend\evaluate_test_data.py --api-base-url http://127.0.0.1:8000 --evaluation-csv test_data\fixed_evaluation_log_template.csv --reset-results
Train the model:
python backend\train.py
Clear the heuristic-only switch, restart the backend, then run post-training evaluation:
set PHISHING_DISABLE_MODEL=
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
python backend\evaluate_test_data.py --api-base-url http://127.0.0.1:8000 --evaluation-csv test_data\fixed_evaluation_log_template.csv
The current fixed evaluation result is:
Pretraining accuracy: 50.00%
Post-training accuracy: 97.50%
This is optional. It is useful if you want to manually open the test emails and see how the Chrome extension reacts.
macOS / Linux:
python -m http.server 8010 --directory test_data
Windows PowerShell:
python -m http.server 8010 --directory test_data
Windows CMD:
python -m http.server 8010 --directory test_data
Then open:
http://127.0.0.1:8010/
Use this method while developing, testing, or presenting from your own machine.
- Open Chrome.
- Go to chrome://extensions.
- Turn on Developer mode in the top-right corner.
- Click Load unpacked.
- Select the extension folder inside this project.
- The extension should now appear in Chrome's extensions list.
- Click the extension's Details button if you want to pin it to the toolbar.
- Open the extension options page.
- Confirm the API URL is:
http://127.0.0.1:8000
Usage:
- Open Gmail or a test email page.
- Wait for the automatic scan or click the extension icon.
- Click Scan Current Page for a manual scan.
- Review the label, score, and flags.
- Use Block Domain if a linked domain should always warn.
- Use Allow Domain if a linked domain is trusted, while still allowing risky content to trigger warnings.
Use this method when you want to share the extension folder with another tester without publishing it to the Chrome Web Store.
- Make sure the backend URL in the extension options points to the backend the tester can access.
- Zip the extension folder.
- Send the zip file to the tester.
- The tester should unzip it.
- The tester should open chrome://extensions.
- The tester should enable Developer mode.
- The tester should click Load unpacked.
- The tester should select the unzipped extension folder.
Important: the extension does not include the backend. The FastAPI backend must also be running locally or deployed remotely.
For a remote demo or real deployment:
- Deploy the FastAPI backend to a public HTTPS URL.
- Confirm the deployed backend responds at /health.
- Load the Chrome extension locally with Load unpacked.
- Open the extension options page.
- Change the API URL from http://127.0.0.1:8000 to the deployed backend URL.
- Save the option.
- Open Gmail or a test page and run a scan.
For Chrome Web Store release:
- Keep the extension files inside the extension folder.
- Use a production HTTPS backend URL.
- Review requested permissions in extension/manifest.json.
- Zip the extension folder.
- Create a Chrome Web Store Developer account.
- Upload the zip package.
- Complete the privacy, permissions, and listing information.
- Submit for review.
Health:
curl http://127.0.0.1:8000/health
Prediction:
curl -X POST http://127.0.0.1:8000/predict \
-H "Content-Type: application/json" \
-d '{
"visible_text": "Urgent! Verify your account immediately",
"links": [{"text": "Click here", "href": "http://suspicious-example.com/login"}]
}'
Preferences:
curl http://127.0.0.1:8000/preferences
curl http://127.0.0.1:8000/feedback/stats
Block a domain:
curl -X POST http://127.0.0.1:8000/preferences/block \
-H "Content-Type: application/json" \
-d '{"domain":"bad-domain.com"}'
Allow a domain:
curl -X POST http://127.0.0.1:8000/preferences/allow \
-H "Content-Type: application/json" \
-d '{"domain":"trusted-domain.com"}'
The backend can use local reputation files:
backend/artifacts/reputation/trusted_domains.txt
backend/artifacts/reputation/flagged_domains.txt
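As an illustration of how such lists could be applied to the links extracted from a page, here is a hedged sketch. The file format (one domain per line, with # comments), the parent-domain matching, and every function name are assumptions; backend/app/reputation.py may work differently.

```python
from urllib.parse import urlsplit


def parse_domain_list(text):
    """Parse an ASSUMED format: one domain per line, '#' starts a comment."""
    domains = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip().lower()
        if entry:
            domains.add(entry)
    return domains


def reputation_of(href, trusted, flagged):
    """Classify a link as 'flagged', 'trusted', or 'unknown' by its hostname."""
    host = urlsplit(href).hostname  # None for mailto: links or malformed URLs
    if not host:
        return "unknown"
    parts = host.split(".")
    # The host itself plus every parent domain, e.g. a.b.com -> a.b.com, b.com, com.
    candidates = {".".join(parts[i:]) for i in range(len(parts))}
    if candidates & flagged:  # flagged wins over trusted in this sketch
        return "flagged"
    if candidates & trusted:
        return "trusted"
    return "unknown"
```

Checking the flagged list first is a deliberately cautious choice: a domain that appears in both files is treated as suspicious.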
Reload reputation lists without restarting:
curl -X POST http://127.0.0.1:8000/reputation/reload
Check reputation status:
curl http://127.0.0.1:8000/reputation/status
Google Safe Browsing is optional. If you have an API key, set it before starting the backend.
macOS / Linux:
export GOOGLE_SAFE_BROWSING_API_KEY="your_api_key_here"
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Windows PowerShell:
$env:GOOGLE_SAFE_BROWSING_API_KEY="your_api_key_here"
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Windows CMD:
set GOOGLE_SAFE_BROWSING_API_KEY=your_api_key_here
python -m uvicorn backend.app.main:app --reload --host 127.0.0.1 --port 8000
Build the backend image from the project root:
docker build -f backend/Dockerfile -t phishing-backend .
Run the backend container:
docker run --rm -p 8000:8000 phishing-backend
Important: make sure these files exist before building or deploying the container:
backend/artifacts/vectorizer.joblib
backend/artifacts/classifier.joblib
For a real deployment:
- Host the FastAPI backend on a service such as Render, Railway, Fly.io, AWS, Azure, or a university server.
- Include vectorizer.joblib and classifier.joblib with the backend deployment.
- Use HTTPS for the deployed backend.
- Change the extension API URL from http://127.0.0.1:8000 to the deployed backend URL.
- Restrict CORS so only the Chrome extension can call the backend.
- Add production logging for predictions, errors, false positives, and false negatives.
- Retrain the model periodically with newer phishing and legitimate examples.
- Version each trained model so evaluation results can be linked to the exact model used.
- Package the Chrome extension and submit it to the Chrome Web Store if public release is required.
- Keep training data and test data separate for every retraining cycle.
If the extension shows an error:
- Confirm the backend is running on http://127.0.0.1:8000.
- Confirm the extension options page uses the same API URL.
- Confirm the page is not a restricted Chrome page such as chrome://extensions.
- Open http://127.0.0.1:8000/health and check that the API responds.
If /health says "model_loaded": false:
- The backend did not find model artifacts.
- Run python backend/train.py.
- Restart the backend.
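When diagnosing this case, it helps to read model_loaded together with model_disabled_by_env, since a missing model and a deliberately disabled model call for different fixes. A small sketch that interprets the /health fields shown in this README (the function name and the returned wording are illustrative):

```python
import json


def describe_backend_mode(health):
    """Turn a parsed /health response into a short diagnosis string."""
    if health.get("status") != "ok":
        return "backend unhealthy"
    if health.get("model_loaded"):
        return "trained model loaded"
    if health.get("model_disabled_by_env"):
        # Heuristic-only on purpose: clear PHISHING_DISABLE_MODEL and restart.
        return "heuristic-only: PHISHING_DISABLE_MODEL is set"
    # Not disabled, but nothing loaded: the artifacts are probably missing.
    return "heuristic-only: no model artifacts found, run train.py and restart"


sample = json.loads('{"status": "ok", "model_loaded": false, "model_disabled_by_env": true}')
print(describe_backend_mode(sample))  # heuristic-only: PHISHING_DISABLE_MODEL is set
```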
If PowerShell blocks virtual environment activation:
- Run Set-ExecutionPolicy -Scope CurrentUser RemoteSigned.
- Then run .\.venv\Scripts\Activate.ps1 again.
This project implements a Chrome extension that extracts email and webpage text, sends it to a FastAPI backend, combines explainable phishing rules with a trained TF-IDF Logistic Regression model, and returns a phishing probability, label, and human-readable explanation to warn users about suspicious content.