An automated defensive gateway that protects the Khmer community from digital financial fraud. NETH intercepts, parses, and classifies three threat vectors:
| # | Threat | Engine |
|---|---|---|
| 01 | KHQR payload tampering / identity-routing mismatch | khqr_core + bakong_verify |
| 02 | Physical QR sticker overlays on merchant placards | vision_overlay |
| 03 | Localized Khmer-language phishing / social engineering | nlp_khmer |
Every check returns one of three risk levels: ✅ Safe (0), ⚠️ Suspicious, or ❌ Blocked.

                  ┌──────────────────────────────────────────────┐
                  │   Front ends: Web UI · Telegram bot · API    │
                  └───────────────────────┬──────────────────────┘
                                          │
                               ┌──────────▼──────────┐
                               │ scoring.NethGateway │  ← max-severity aggregation
                               └──────────┬──────────┘
           ┌────────────────┬─────────────┼──────────────┬─────────────┐
           ▼                ▼             ▼              ▼             ▼
    vision_overlay      khqr_core   bakong_verify    nlp_khmer   (your model)
    multi-QR /         TLV walker   account / txn  Khmer phishing
    extract             + CRC-16    verification  heuristic/XLM-R

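The "max-severity aggregation" step in the diagram can be pictured as taking the worst risk level reported by any engine. The following is a minimal sketch under stated assumptions, not the code in `scoring.NethGateway`: the names `Risk`, `Signal`, and `aggregate` are hypothetical, and only the value 0 for Safe comes from the text; the other numeric values are assumed.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    SAFE = 0
    SUSPICIOUS = 1  # assumed value, not stated in the README
    BLOCKED = 2     # assumed value, not stated in the README


@dataclass
class Signal:
    engine: str  # e.g. "khqr_core" or "nlp_khmer"
    risk: Risk
    reason: str


def aggregate(signals: list[Signal]) -> Risk:
    """Max-severity aggregation: the worst single signal decides the verdict."""
    # With no signals this sketch falls back to SAFE; a real gateway may
    # prefer to report that nothing could be checked.
    return max((s.risk for s in signals), default=Risk.SAFE)
```

With this rule a single Blocked signal cannot be outvoted by several Safe ones, which suits a defensive gateway.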
A real overlay scam uses a structurally perfect KHQR that points at the
attacker's own account: valid CRC, valid TLV. So `khqr_core` treats CRC/TLV only as
a validity pre-filter, and the decisive check is `identity_match`, an
offline cross-field check that compares the displayed name (Tag 59) against the
account routing (Tag 29/30 bank code). If a QR labeled "ABA" routes to an
ACLEDA account, that is the classic overlay pattern, and the verdict is Blocked.
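To make the "validity pre-filter" concrete, here is a minimal sketch of a TLV walk and CRC check for an EMVCo-style payload. It assumes the standard layout (two-digit tag, two-digit length, value, with the CRC in tag 63 as the last field) and the CRC-16/CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF). The function names are hypothetical and are not the `khqr_core` API; the demo payload is synthetic and ASCII-only.

```python
def crc16_ccitt_false(data: bytes) -> int:
    """CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def walk_tlv(payload: str) -> dict[str, str]:
    """Split a payload into {tag: value}; raise ValueError if it is truncated."""
    fields: dict[str, str] = {}
    i = 0
    while i < len(payload):
        tag = payload[i:i + 2]
        length = int(payload[i + 2:i + 4])  # ValueError on a non-numeric length
        value = payload[i + 4:i + 4 + length]
        if len(tag) != 2 or len(value) != length:
            raise ValueError(f"truncated field at offset {i}")
        fields[tag] = value
        i += 4 + length
    return fields


def crc_is_valid(payload: str) -> bool:
    """The CRC covers everything up to and including the '6304' header."""
    if len(payload) < 8 or payload[-8:-4] != "6304":
        return False
    actual = f"{crc16_ccitt_false(payload[:-4].encode('utf-8')):04X}"
    return actual == payload[-4:].upper()


def with_crc(body: str) -> str:
    """Append a CRC field to a payload body (helper for the demo only)."""
    body += "6304"
    return body + f"{crc16_ccitt_false(body.encode('utf-8')):04X}"


demo = with_crc("000201" + "5908ABA SHOP")  # synthetic payload, not a real KHQR
print(walk_tlv(demo)["59"], crc_is_valid(demo))
```

Passing both checks only shows the payload is well formed. An attacker's own QR passes them too, which is why the routing check carries the decision.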
Note: the public Bakong API does not resolve an account id to a holder name (privacy/anti-enumeration), so the identity defense is the cross-field routing check above, not a name lookup.
`bakong_verify` is reserved for account-existence / transaction verification once a token is configured.
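The cross-field routing check can be sketched as a three-way outcome: match, mismatch, or unverified. This is an illustration only, not the logic in `identity_match.py`. The names `Routing`, `BANK_CODES`, and `check_routing` are hypothetical, the bank-code values are placeholders (the real ones live in `data/bank_codes.yaml`), and the word-level brand matching is a simplifying assumption.

```python
from enum import Enum


class Routing(Enum):
    MATCH = "match"
    MISMATCH = "mismatch"      # the overlay pattern: name and routing disagree
    UNVERIFIED = "unverified"  # nothing to compare against; not the same as safe


# Placeholder brand -> bank-code table; these code values are invented.
BANK_CODES = {"ABA": "bank_a", "ACLEDA": "bank_b"}


def check_routing(display_name: str, routed_bank_code: str) -> Routing:
    """Compare the brand claimed in Tag 59 with the bank code from Tag 29/30."""
    words = display_name.upper().split()
    claimed = [brand for brand in BANK_CODES if brand in words]
    if not claimed:
        # The name claims no brand we know, so routing cannot be judged.
        return Routing.UNVERIFIED
    if any(BANK_CODES[brand] == routed_bank_code for brand in claimed):
        return Routing.MATCH
    return Routing.MISMATCH
```

For example, `check_routing("ABA Coffee", "bank_b")` yields `Routing.MISMATCH` under this placeholder table. Keeping `UNVERIFIED` distinct from `MATCH` avoids presenting an unknown bank as safe.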

    pip install -r requirements.txt   # core deps
    uvicorn neth.api:app --reload     # then open http://127.0.0.1:8000/
    pytest -q                         # run the offline smoke tests

The gateway is useful immediately with no model download or API key: the NLP engine ships a working Khmer heuristic baseline, and KHQR/vision run offline.
- Open the bot in Telegram: @neth_watch_bot
- Press Start.
- Send it either:
  - 📷 a photo of a KHQR to check it, or
  - 🔗 a forwarded message or link you're unsure about.
- It replies in Khmer with ✅ Safe / ⚠️ Suspicious / ❌ Blocked and the reason.

Always confirm the recipient's name in your banking app before paying; NETH is an advisory aid, not a guarantee.
- In Telegram, message @BotFather → `/newbot` → pick a name + username → copy the token it gives you.
- Provide the token to NETH (any one of these; never commit it to git):
  - File: create `.telegram_token` in the project root containing just the token, or
  - Env var: `setx NETH_TELEGRAM_TOKEN "<token>"` (Windows; then open a new terminal), or
  - Argument: `python -m neth.bot <token>`
- Install and run:

      pip install -r requirements.txt
      python -m neth.bot

  When you see `Telegram bot running…`, message your bot and press Start.

Notes:
- Only one instance may poll a token at a time: don't run it locally and on a server with the same token (Telegram returns a "Conflict" error).
- The token is a secret. `.telegram_token`, `.env`, and `*.token` are git-ignored.
- For always-on hosting (so the bot runs without your PC), see DEPLOY.md.
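The three ways of supplying the token could be resolved along these lines. This is a sketch with a hypothetical `resolve_token` helper; the precedence shown (argument, then environment variable, then file) is an assumption, since the text does not say which source wins when several are present.

```python
from pathlib import Path
from typing import Mapping, Optional, Sequence


def resolve_token(
    args: Sequence[str], env: Mapping[str, str], root: Path
) -> Optional[str]:
    """Return the first token found, or None. The order is an assumption."""
    if args:  # python -m neth.bot <token>
        return args[0].strip()
    if env.get("NETH_TELEGRAM_TOKEN"):  # environment variable
        return env["NETH_TELEGRAM_TOKEN"].strip()
    token_file = root / ".telegram_token"  # file in the project root
    if token_file.is_file():
        return token_file.read_text(encoding="utf-8").strip() or None
    return None
```

A caller would typically pass `sys.argv[1:]`, `os.environ`, and the project root. Taking them as parameters keeps the function testable without touching the real environment.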
| Variable | Enables |
|---|---|
| `NETH_BAKONG_TOKEN` | Bakong account-existence / transaction verification (`bakong_verify`) |
| `NETH_BAKONG_BASE` | override Bakong API base URL |
| `NETH_NLP_MODEL` | path to a fine-tuned XLM-RoBERTa Khmer classifier (else heuristic) |
| `NETH_URL_ONLINE` | set to `1` to enable shortener expansion + URL threat feeds |
| `NETH_URLHAUS_KEY` | URLhaus (abuse.ch) auth key for known-malicious-URL lookups |
| `NETH_GSB_KEY` | Google Safe Browsing API key |
| `NETH_TELEGRAM_TOKEN` | token for the Telegram bot (`python -m neth.bot`) |
`url_reputation.py` scores links in layers. Offline (always on): canonical
bank-domain matching (e.g. ABA = ababank.com, not aba.com),
brand-off-domain lookalikes, punycode/homoglyph hosts, IP-literal hosts, `@`
userinfo tricks, and shortener detection. Online (opt-in via `NETH_URL_ONLINE=1`):
shortener expansion plus URLhaus / Google Safe Browsing feeds; a feed hit is
decisive. Offline gives a usable prior; the feeds make accuracy measurable.
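Several of the offline, brand-agnostic checks need nothing beyond the standard library. The sketch below is an illustration rather than the contents of `url_reputation.py`: `offline_url_flags` and the short `SHORTENERS` set are hypothetical, and homoglyph detection is not attempted.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative shortener list; a real deployment would keep a longer one.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}


def offline_url_flags(url: str) -> list[str]:
    """Return the names of the brand-agnostic heuristics that fire for a URL."""
    parts = urlsplit(url)  # needs a scheme, otherwise hostname is None
    host = (parts.hostname or "").lower()
    flags = []
    if parts.username is not None:
        flags.append("userinfo")  # e.g. http://ababank.com@evil.example/
    if any(label.startswith("xn--") for label in host.split(".")):
        flags.append("punycode")
    try:
        ipaddress.ip_address(host)
        flags.append("ip_literal")
    except ValueError:
        pass  # host is a name, not an IP literal
    if host in SHORTENERS:
        flags.append("shortener")
    return flags
```

Returning flag names rather than a single score leaves the weighting to the caller, so each signal can carry its own reason in the verdict.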
Bank coverage. The brand-lookalike list lives in
`data/bank_domains.yaml` (~30 Cambodian + international
brands) and is loaded at runtime, so adding a bank means editing YAML, with no code change.
Brand-agnostic checks (feeds, IP/punycode/@/shortener/TLD) protect every
bank, listed or not; the lookalike rule only covers listed brands.
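The lookalike rule for listed brands can be approximated by checking whether a host names a brand without sitting on that brand's canonical domain. In this sketch the `BANK_DOMAINS` dict stands in for `data/bank_domains.yaml` (only the ABA entry comes from the text), `brand_off_domain` is a hypothetical name, and matching on whole host tokens is an assumption chosen to avoid substring false positives.

```python
import re
from urllib.parse import urlsplit

# Stand-in for data/bank_domains.yaml; only this ABA entry is from the README.
BANK_DOMAINS = {"aba": "ababank.com"}


def brand_off_domain(url: str) -> list[str]:
    """Return brands named in the host when the host is not the brand's domain."""
    host = (urlsplit(url).hostname or "").lower()
    tokens = set(re.split(r"[.\-]", host))  # split on dots and hyphens
    hits = []
    for brand, canonical in BANK_DOMAINS.items():
        names = {brand, canonical.split(".")[0]}  # "aba" and "ababank"
        on_canonical = host == canonical or host.endswith("." + canonical)
        if names & tokens and not on_canonical:
            hits.append(brand)
    return hits
```

Whole-token matching means `alibaba.com` is not flagged for containing "aba", while `aba.com` and `ababank.com.evil.example` are. Misspellings and homoglyphs would need a separate check.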

    GET  /health
    POST /api/analyze/text    {"text": "..."}          → verdict
    POST /api/analyze/khqr    {"payload": "000201…"}   → verdict
    POST /api/analyze/image   multipart file=<photo>   → verdict
    POST /api/feedback        {input_type,input_excerpt,predicted_score,correct_label,note}
    GET  /api/feedback/stats                           → counts + scam-missed-as-safe

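A client for the text endpoint can be written with the standard library alone. The sketch below assumes the server from the quick start is listening on `http://127.0.0.1:8000`; the helper names are hypothetical, and the full shape of the verdict JSON is not specified here, so the response is returned as a plain dict.

```python
import json
import urllib.request


def build_text_request(base_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/analyze/text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/analyze/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_text(base_url: str, text: str) -> dict:
    """Send the request and decode the JSON verdict (needs a running server)."""
    with urllib.request.urlopen(build_text_request(base_url, text), timeout=10) as resp:
        return json.load(resp)


# Example, with the server running locally:
#   verdict = analyze_text("http://127.0.0.1:8000", "...")
```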
Responses are Khmer-first: every verdict carries `summary_km` and each
signal a `reason_km`, with English kept alongside for logs. Inputs are
size-capped and the URL fetcher is SSRF-guarded (blocks internal/metadata IPs).
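The core of an SSRF guard is refusing to connect to non-public addresses. The following sketch covers only that address test and is not the guard used by NETH; `is_blocked_address` is a hypothetical name. A complete guard would also resolve the hostname first, apply the test to every resolved address, and repeat it after each redirect.

```python
import ipaddress


def is_blocked_address(ip_text: str) -> bool:
    """True if an already-resolved IP address should not be fetched."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254 (cloud metadata)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```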

    python scripts/fetch_eval_data.py --n 500   # download URLhaus + Tranco -> eval_data/
    python bench_urls.py --phish eval_data/phish.txt --benign eval_data/benign.txt --sweep
    python bench_gateway.py                     # whole-gateway: text + KHQR modalities

`bench_urls.py` measures the URL engine; `bench_gateway.py` measures the full
pipeline across text and KHQR. The bundled samples are tiny (they validate logic, not a
real-world score), so use `fetch_eval_data.py` for a meaningful number.
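A threshold sweep of the kind `--sweep` suggests boils down to computing precision and recall at each cut-off. The sketch below is not the benchmark script: it assumes each URL gets a numeric score where higher means more suspicious, `precision_recall` is a hypothetical helper, and the scores in the usage comment are toy values rather than measurements.

```python
def precision_recall(
    phish_scores: list[float], benign_scores: list[float], threshold: float
) -> tuple[float, float]:
    """Precision and recall when scores >= threshold are treated as phishing."""
    tp = sum(s >= threshold for s in phish_scores)   # phishing caught
    fn = len(phish_scores) - tp                      # phishing missed
    fp = sum(s >= threshold for s in benign_scores)  # benign flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Sweep a few thresholds over toy scores (illustrative values only):
#   for t in (0.3, 0.5, 0.7):
#       print(t, precision_recall([0.9, 0.8, 0.4], [0.1, 0.6], t))
```

For a tool like this, recall on phishing (scams missed as safe) is usually the number to watch, since a missed scam costs more than a false alarm.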
Users can report wrong verdicts (web buttons / `POST /api/feedback`). Corrections
are stored in `data/feedback.db` (SQLite, git-ignored) as truncated excerpts,
not full payloads. Export for training with `FeedbackStore.export_jsonl()`. This
is how NETH gathers ground truth and the labeled corpus to train the Khmer NLP.
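The store-then-export flow can be illustrated with `sqlite3` and `json`. This is a stand-in, not the real `FeedbackStore`. The column names follow the `/api/feedback` fields, but the class name `SketchFeedbackStore`, the 200-character excerpt cap, and the integer types for the score and label are all assumptions.

```python
import json
import sqlite3

EXCERPT_LIMIT = 200  # assumed cap; the real truncation length is not stated
COLUMNS = ["input_type", "input_excerpt", "predicted_score", "correct_label", "note"]


class SketchFeedbackStore:
    """Hypothetical stand-in for the feedback store described above."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS feedback ("
            "input_type TEXT, input_excerpt TEXT, predicted_score INTEGER, "
            "correct_label INTEGER, note TEXT)"
        )

    def add(self, input_type: str, raw_input: str, predicted_score: int,
            correct_label: int, note: str = "") -> None:
        # Only a truncated excerpt is persisted, never the full payload.
        self.db.execute(
            "INSERT INTO feedback VALUES (?, ?, ?, ?, ?)",
            (input_type, raw_input[:EXCERPT_LIMIT], predicted_score, correct_label, note),
        )
        self.db.commit()

    def export_jsonl(self) -> str:
        """One JSON object per line, keeping Khmer text readable."""
        rows = self.db.execute(f"SELECT {', '.join(COLUMNS)} FROM feedback")
        return "\n".join(
            json.dumps(dict(zip(COLUMNS, row)), ensure_ascii=False) for row in rows
        )
```

Truncating at write time, rather than at export, means the full message never reaches disk.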

    neth/
    ├── khqr_core.py        EMVCo/KHQR TLV parser + CRC-16 (validity pre-filter)
    ├── identity_match.py   offline Tag 59 vs Tag 29/30 routing mismatch (overlay defense)
    ├── bakong_verify.py    account-existence / transaction verification (needs token)
    ├── nlp_khmer.py        Khmer phishing detector (heuristic + optional transformer)
    ├── url_reputation.py   layered URL scoring (SSRF-guarded) + threat feeds
    ├── vision_overlay.py   QR extraction + multi-QR overlay detection
    ├── scoring.py          signal aggregation → final verdict
    ├── i18n.py             Khmer localization of verdicts
    ├── feedback.py         SQLite feedback/ground-truth store
    ├── api.py              FastAPI server (JSON API + web UI)
    ├── bot.py              Telegram front end
    └── web/                static UI (index.html, style.css, app.js)
    bench_urls.py · bench_gateway.py · scripts/fetch_eval_data.py   benchmarking
    data/bank_domains.yaml · data/bank_codes.yaml                   editable brand data
    tests/test_engines.py                                           20 offline tests

NETH is an advisory aid, not an authority. It reduces risk on the common scams; it does not guarantee a QR or link is safe. Always confirm the recipient name in your banking app before paying. Known gaps:
- Identity/routing check has limited coverage. It only flags a name/bank mismatch for banks whose codes are in `data/bank_codes.yaml` (currently 4: ABA, ACLEDA, Canadia, Wing). For any other bank it says "couldn't verify routing", not "safe." It also cannot detect a scammer who pastes a QR for their own account at the same bank as the real merchant.
- No account-name verification. The public Bakong API does not expose account → holder-name lookup, so NETH cannot confirm who an account belongs to. Only your banking app can.
- Khmer NLP is an unvalidated heuristic. A keyword/URL model with no measured accuracy; it is evaded by rewording and will both miss scams and false-alarm. Treat its verdict as a weak hint until a model is trained on real labeled data.
- Overlay (photo) detection is weak. It flags multiple QR codes in a frame, but misses the common case where a sticker fully covers the original (one QR).
- URL accuracy is unproven at scale. The benchmark passes on a tiny bundled sample; no real-world precision/recall figure exists yet (see `bench_*`).
- Threat feeds are opt-in and rate-limited. Without `NETH_URL_ONLINE=1` and API keys, URL scoring is heuristic-only.
- Not a substitute for vigilance. A clever, well-localized scam with a valid QR and clean link can pass every check.

- Train the XLM-RoBERTa Khmer phishing classifier on a labeled local dataset
- Train a YOLOv8 sticker-boundary model to augment `vision_overlay.detect()`
- Known-bad URL/domain feed for `nlp_khmer` (URLhaus + Google Safe Browsing)
- Per-merchant known-good QR reference store for overlay comparison

Open-source community edition. NETH assists detection; always verify the recipient name before paying.