Dictionary + LLM correction stages (--enrich, --llm-verify) #3

Open · Boladi888 wants to merge 6 commits into master from feat-dict-correction

Conversation

Boladi888 (Owner) commented May 10, 2026

Two-phase correction pipeline. The Chinese subtitles are the source of truth for any Chinese-language film; the existing English on Blu-ray rips / fan torrents is often missing or imprecise. This PR adds a dictionary stage that fills empty cues conservatively, then an optional local-LLM stage that re-translates the dictionary-filled cues into idiomatic English.

Pipeline integration

parse → simplify → pinyin → align → merge → translations.json overrides
        → PHASE 1 (--enrich, dict)   → PHASE 2 (--llm-verify, local LLM)
        → validate → write

Phase 1 runs after translations.json overrides so manual per-cue overrides always win. Phase 2 only runs when Phase 1 actually filled cues.

Phase 1 — CC-CEDICT dictionary fill

  • Conservative. Only fills empty English cues; never overwrites existing English.
  • load_cedict(path) parses the standard CC-CEDICT text format into {simplified: [(pinyin, [defs])]}. Multi-entry keys are kept.
  • segment_greedy(text, cedict) does left-to-right longest-match against dictionary keys with a 12-char window; no external segmenter dependency. Both helpers are sketched after this list.
  • gloss_segment(seg, cedict) picks the entry with the most definitions ("most common reading" heuristic).
  • clean_gloss(gloss) strips editorial annotations: leading parentheticals ((literary), (courteous, as opposed to ...)), bracketed pinyin ([ni3]), inner editorial annotations, (CL:...) classifier hints, and see also X cross-references.
  • gloss_hanzi(hanzi, cedict, name_map=None) is the full-line glosser. Segments matching a name_map key use the bare-pinyin form of the name instead of the literal gloss — so 慕白 stays Mubai instead of becoming to admire white.
  • Override tables for high-frequency cases:
    • PARTICLE_OVERRIDES — sentence-ending modal particles (啦, 啊, 呀, 哦, 嘛, 呢, 吧, 嗯, 唉, 哟, etc.) drop entirely.
    • IMPERATIVE_LEMMA_OVERRIDES — common dialogue verbs (停, 来, 走, 等, 看, 听, 请, 起, 坐, 让开, 小心, 别动, 不要, 快, etc.) emit the bare command form rather than the dictionary's to X infinitive.
  • Every change logged to <out>.changes.tsv.
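
For orientation, a minimal sketch of the two Phase-1 helpers named above. The function names and the returned dictionary shape follow the description; the regex and control flow are assumptions, not the PR's actual code.

```python
import re
from collections import defaultdict

# CC-CEDICT line format: "傳統 传统 [chuan2 tong3] /tradition/traditional/..."
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

def load_cedict(path):
    """Parse CC-CEDICT into {simplified: [(pinyin, [defs]), ...]}."""
    cedict = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):        # header / comment lines
                continue
            m = CEDICT_LINE.match(line.strip())
            if m:
                _trad, simp, pinyin, defs = m.groups()
                cedict[simp].append((pinyin, defs.split("/")))
    return dict(cedict)

def segment_greedy(text, cedict, max_len=12):
    """Left-to-right longest-match segmentation over dictionary keys."""
    segs, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in cedict:
                segs.append(text[i:j])
                i = j
                break
        else:                               # no hit: pass the char through
            segs.append(text[i])
            i += 1
    return segs
```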

Phase 2 — local-LLM verifier (via llama.cpp-vulkan)

Spawns a local llama.cpp HTTP server on a Vulkan-capable GPU and re-translates the cues Phase 1 filled. Designed for an Intel Arc A770 16 GB; works on any Vulkan device with enough VRAM for the chosen model.

  • find_arc_vulkan_device(server_exe) runs llama-server --list-devices and picks the Arc adapter (skipping the integrated GPU on multi-device systems).
  • start_llama_server() spawns the server with --gpu-layers auto --no-warmup, polls /health until ready. Prints a 🟧 banner and a 🔴 STARTING line per the project's server-safety convention.
  • stop_llama_server() terminates and verifies the port is freed.
  • System prompt says the Chinese is truth and to output only the English line. /no_think is appended to suppress Qwen3's reasoning mode (which otherwise consumes the whole token budget into reasoning_content with empty content).
  • User prompt carries: the Chinese, the existing English (if any), the dict suggestion as a hint, the two prior cues for context, and name-map hints for any character mentioned.
  • Defensive _clean_llm_output() strips wrapping quotes, leftover chat-template tags, and lead-in labels like "English:" / "Translation:".
  • Defensive fallback in llm_complete() — if a model ignores /no_think and returns empty content with non-empty reasoning_content, extract a best-guess answer from the reasoning stream rather than failing.
  • Per-cue cache at Research/cache/<imdb>.llm.json, keyed by sha1(model | hanzi). Re-runs skip cached cues; the cache survives across runs of the same film (key scheme sketched below).
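
A sketch of that cache scheme under stated assumptions: the key is sha1 over model + Hanzi per the PR, but the helper names, the `translate` callable, and the exact separator are illustrative.

```python
import hashlib, json
from pathlib import Path

def llm_cache_key(model: str, hanzi: str) -> str:
    # Key scheme per the PR: sha1 over model + Hanzi ("|" separator assumed).
    return hashlib.sha1(f"{model}|{hanzi}".encode("utf-8")).hexdigest()

def verify_cue(imdb_id: str, model: str, hanzi: str, translate) -> str:
    """Consult the per-cue cache before calling the LLM; `translate` is a
    placeholder callable standing in for the Phase-2 request."""
    cache_path = Path("Research/cache") / f"{imdb_id}.llm.json"
    cache = json.loads(cache_path.read_text("utf-8")) if cache_path.exists() else {}
    key = llm_cache_key(model, hanzi)
    if key not in cache:                    # re-runs skip cached cues
        cache[key] = translate(hanzi)
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache, ensure_ascii=False), "utf-8")
    return cache[key]
```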

CLI

| Flag | Default | What it does |
| --- | --- | --- |
| --enrich | off | Phase 1 dictionary correction. Writes <out>.changes.tsv. |
| --cedict <path> | Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt | Path to CC-CEDICT. |
| --llm-verify | off | Phase 2 LLM verify pass. Requires --enrich. |
| --llm-model <path> | $PINSUB_LLM_GGUF env var | GGUF model file. |
| --llm-server <path> | $PINSUB_LLAMA_SERVER env var | llama-server.exe path. |
| --llm-port <int> | 8765 | localhost port (avoids colliding with port 8080). |
| --llm-ctx <int> | 4096 | llama-server context size. |
| --llm-max <int> | 0 = all | Cap the verify pass at N cues (smoke testing). |

Validation — Crouching Tiger, Hidden Dragon (2000)

  • 983 merged cues; 44 had no English.
  • Phase 1: 42 of 44 filled. The 2 skips are correct no-ops (a pure-particle cue, and an English-only wanted-poster cue with no Hanzi).
  • Phase 2 (Qwen3-14B Q5_K_M, Vulkan): 39 of 42 LLM-replaced. The 3 unreplaced are cases where the dict's imperative override or name_map already produced the right output (Stop!, Mubai, Xiulian) and the LLM agreed.

Sample of the Phase-1 → Phase-2 diff:

| Cue | Hanzi | Phase 1 (dict) | Phase 2 (LLM) |
| --- | --- | --- | --- |
| 1 | 卧虎藏龙 | lit. hidden dragon, crouching tiger (idiom) | Crouching Tiger, Hidden Dragon |
| 188 | 上房顶了 | apex | He's on the rooftop! |
| 208 | 青冥剑是没找回来 | green dark double-edged sword to be... drowned to retrieve Come | The Green Darkness Sword wasn't recovered. |
| 352 | 你这老狐貍害得我家破人亡 我今天杀了你…… | (verbose 6-line literal) | You old fox, you ruined my family, and today I'll kill you... |
| 906 | 小心 | Be careful | Watch out. |
| 983 | 剧终,谢谢观赏! | The End (appearing at the end of a movie etc), to thank to look at sth with pleasure... | The End. Thank you for watching! |

Dependencies

No new Python deps. Phase 2 talks HTTP to a separately-installed llama-server.exe. CC-CEDICT (CC BY-SA 4.0) is acquired separately, not shipped with this repo.

CC-CEDICT acquisition

mkdir -p Research/primary_sources/cedict
curl -sL -o Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz \
  https://www.mdbg.net/chinese/export/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz
gunzip Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz

Phase 2 prerequisites

  • A llama.cpp build with Vulkan support (any recent llama-server.exe).
  • A GGUF model with strong Chinese capability — Qwen2.5 or Qwen3 family recommended (Alibaba; trained heavily on Chinese).
  • Sufficient VRAM for the chosen model + a ~1.5 GB KV cache at ctx=4096.

Set the two env vars and run:

$env:PINSUB_LLAMA_SERVER = "path\to\llama-server.exe"
$env:PINSUB_LLM_GGUF     = "path\to\YourModel.gguf"

python PinSub.py --mkv ... --zh ... --english ... --out ... --enrich --llm-verify

CC-CEDICT-driven empty-cue fill. The Chinese subtitles are the source of
truth; this stage fills cues where the merged trilingual result has no
English line, building a literal gloss from CC-CEDICT entries via
greedy-longest-match segmentation.

Conservative: never overwrites existing English (LLM verifier territory).
Every fill is logged to <out>.changes.tsv as a side channel for review.

- load_cedict() parses the standard CC-CEDICT format into
  {simplified: [(pinyin, [defs])]}.
- segment_greedy() does left-to-right longest-match with a 12-char window.
- gloss_hanzi() respects name_map: known proper nouns use the bare-pinyin
  form (e.g. 慕白 -> "Mubai") instead of the literal CC-CEDICT gloss
  (would have been "to admire white").
- clean_gloss() strips editorial annotations: leading parentheticals
  ("(literary)", "(courteous, as opposed to ...)"), bracketed pinyin
  ("[ni3]"), inner annotations, "(CL:...)" classifier hints, and "see
  also X" cross-references.
- Particle overrides drop sentence-ending modal particles (啦, 啊, 呀,
  哦, 嘛, 呢, 吧, etc.) entirely; they convey no English-row content.
- Imperative overrides emit the bare command form for common dialogue
  verbs (停 -> "Stop", 来 -> "Come", 让开 -> "Move aside") instead of
  the dictionary's "to X" infinitive.

CLI: --enrich (off by default), --cedict <path>. Default cedict path is
Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt.

CTHD test run: 42 of 44 empty cues filled. The 2 skips are correct
no-ops (pure-particle cue and English-only wanted-poster cue).

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdbabe6280

Comment thread: PinSub.py

    return cues
changes: list[tuple[int, str, str, str, str]] = []
for c in cues:
    if c.hanzi and not c.english.strip():
P2 · Skip enrichment for cues without Hanzi

When --enrich sees an empty-English cue whose Chinese track text contains only Latin text/numbers/punctuation, this condition still calls gloss_hanzi; that function passes non-Hanzi characters through one by one and then joins them with spaces, so a cue like WANTED becomes an English fill of W A N T E D and is logged as a dictionary change. This affects no-Hanzi subtitle cards or signs in the Chinese SRT; require an actual Hanzi match before filling, or have gloss_hanzi return empty when no Hanzi was glossed.
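
A minimal sketch of the guard this review suggests, assuming `c` is the cue object from the quoted loop; the helper name is hypothetical.

```python
def has_hanzi(text: str) -> bool:
    # Basic CJK Unified Ideographs block only; extension blocks omitted.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

# Hypothetical guard: only enrich cues that actually contain Hanzi.
if c.hanzi and has_hanzi(c.hanzi) and not c.english.strip():
    ...
```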


Layers a local-LLM verification stage on top of Phase 1. When --llm-verify
is set, spawn a llama.cpp HTTP server on the Vulkan device, re-translate
the dictionary-filled cues, replace dict glosses with idiomatic English,
and cache per-cue results so re-runs skip cues already verified.

CTHD test run: 42 dictionary fills → 39 LLM replacements; the 3 unreplaced
are cases where the dict's imperative override (Stop!, Move aside) or
name_map (Mubai, Xiulian) already produced the right output and the LLM
agreed.

- Server lifecycle: find_arc_vulkan_device() picks the Arc device id from
  `llama-server --list-devices` (skips the iGPU). start_llama_server()
  spawns with --gpu-layers auto --no-warmup and polls /health. The 🔴
  STARTING / 🟧 A770 USE banner and final "double-confirmed to be closed"
  line meet Master §1.6.
- Prompt design: system prompt tells the model the Chinese is truth and
  to output only the English line. User prompt carries the Chinese, the
  existing English (if any), the dict suggestion as a hint, the two prior
  cues for context, and name_map hints for any character mentioned. /no_think
  appended to suppress Qwen3's reasoning mode (which otherwise eats the
  whole token budget into reasoning_content with empty content).
- Defensive fallback: if a model ignores /no_think and returns empty
  content with non-empty reasoning_content, llm_complete extracts a
  best-guess answer from the reasoning stream (sketched after this list).
- Cache: Research/cache/<imdb>.llm.json, keyed by sha1(model | hanzi).
  Re-runs skip cached cues entirely. Cache survives across runs of the
  same film.
- A770 logging: every dispatch appends a structured row to
  Logs/A770_usage.md with timestamp, model, quant, n_ctx, device, call
  count, prompt/completion token totals, completion tok/s, JIT time,
  total wall, success/failure, and source session pointer. Matches the
  K-Arc2/A770/data/characterization.md row format so the data is
  publishable.
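
A sketch of that reasoning-stream fallback, assuming an OpenAI-style chat-completion `message` dict from llama-server; the last-line heuristic is an assumption, not the commit's actual logic.

```python
def extract_answer(message: dict) -> str:
    """Prefer normal content; fall back to the reasoning stream when a
    model ignores /no_think and leaves content empty."""
    content = (message.get("content") or "").strip()
    if content:
        return content
    reasoning = (message.get("reasoning_content") or "").strip()
    # Assumption: the final non-empty reasoning line is the best guess.
    lines = [ln.strip() for ln in reasoning.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```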

CLI: --llm-verify (requires --enrich), --llm-model <gguf-path>, --llm-server
<exe-path>, --llm-port (default 8765 to avoid colliding with TIMMY on
8080), --llm-ctx, --llm-max (cap cue count for smoke tests).

Default sources for paths are PINSUB_LLM_GGUF and PINSUB_LLAMA_SERVER env
vars, so the only environment surface needed is two paths — no
directory-wide access to the model store.

No new Python dependencies (stdlib http.client + json + subprocess).
Boladi888 changed the title from "Add Phase-1 dictionary correction (--enrich flag)" to "Dictionary + LLM correction stages (--enrich, --llm-verify)" on May 11, 2026
Boladi888 added 4 commits May 10, 2026 20:20
Bigger-scope correction than --llm-verify. Where Phase 2 only re-translates
the cues Phase 1 filled (empty English), Phase 3 also goes after cues
whose existing English diverges from what the Chinese says. Python does
the heavy lifting (heuristic prefiltering, decision logic, name-map
enforcement); the LLM is the dumb worker that translates and judges.

Pipeline per cue:
  1. Python heuristic decides whether the cue is worth sending to the
     LLM at all. Cheap signals: empty English; low content-word overlap
     between the CC-CEDICT dict-gloss and the existing English; length
     ratio outliers between English chars and Hanzi chars. Tweakable
     overlap_threshold (default 0.2). Reports a histogram of reasons so
     the user can tune. (Heuristic sketched after this list.)
  2. TIMMY-translate (fresh prompt, no existing English in context)
     produces a Chinese-faithful candidate. Per-cue cache.
  3. If existing English exists AND differs from TIMMY's, a separate
     fresh-context TIMMY call judges A=existing, B=TIMMY, or C=write a
     better one. Per-cue cache.
  4. Python applies the verdict, logs the change.
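
A sketch of the step-1 prefilter under stated assumptions: only overlap_threshold's default comes from the commit; the Jaccard-style overlap formula, the stopword list, and the length-ratio bounds are illustrative.

```python
import re

STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "it", "in", "on"}  # assumed

def content_words(text: str) -> set:
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def suspect_reason(english: str, dict_gloss: str, hanzi: str,
                   overlap_threshold: float = 0.2):
    """Return a reason tag if the cue should go to the LLM, else None."""
    if not english.strip():
        return "empty"
    a, b = content_words(english), content_words(dict_gloss)
    if a and b and len(a & b) / len(a | b) < overlap_threshold:
        return "low-overlap"                   # English diverges from dict gloss
    ratio = len(english) / max(len(hanzi), 1)  # English chars per Hanzi char
    if ratio < 0.8 or ratio > 12.0:            # outlier bounds are assumptions
        return "length-ratio"
    return None
```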

CLI:
  --llm-correct                Enable Phase 3.
  --llm-scope {fills,divergent,all}
                              Which cues to consider. Default 'divergent'
                              (Python heuristic flags suspicion). 'fills'
                              is the conservative Phase-2-equivalent
                              (empty-English cues only). 'all' touches
                              every cue with Chinese — expensive.
  --llm-no-compare            Skip the comparison pass (use TIMMY's
                              translation outright when it differs from
                              existing). Halves LLM call volume.

Caching:
  Translation cache key: T|sha1(model | hanzi). Compare cache key:
  C|sha1(model | hanzi | en_a | en_b). Stored at
  Research/cache/<imdb>.correct.json. Re-runs against the same film
  skip both translate and compare for cues already seen.

A770 logging:
  Each --llm-correct run appends one row to Logs/A770_usage.md with the
  job string including the chosen scope and a per-action count summary
  in the notes column.

CTHD divergent-scope pre-survey: 752 of 1030 cues flagged as suspect
(73%). Smoke test of 10 cues showed 9/10 replaced with high-quality
output — TIMMY correctly identified that the Bluray English subs were
shifted/misaligned with the Chinese in many cues. Full run gated on
owner approval given the A770 contention cost.

No new Python dependencies; reuses the LlamaSession infrastructure
introduced in Phase 2.
TIMMY's system prompt now lives in README_TIMMY.md next to PinSub.py
(loaded at startup, falls back to an inline default if absent). Pattern
matches the owner's TIMMY_SDNext.md convention — prompt-engineering
decisions live in markdown, not strings buried 1500 lines into Python.

Translation philosophy in the new prompt: word-for-word fidelity on
content (every noun/verb/name/modifier gets translated), function-word
latitude (drop Chinese particles English doesn't have, add English
articles/copulas Chinese implies), synonym latitude on word choice,
Chinese word order preserved when grammatically tolerable. Rearrange
only when literal order is incomprehensible.

Name handling generalized: the README_TIMMY rule (bare-pinyin with
tone marks stripped, syllable spacing preserved, capitals on each name
part) works on any film. If a per-film name_map provides a canonical
spelling, that's passed as a "Known names:" hint and wins. If not,
TIMMY uses the pinyin row in the prompt to derive the English form.

Glossary database:
  - New glossary.json at the project root with seed wuxia/kungfu terms
    (枪 → spear, 师娘 → Master's wife, 镖局 → security agency, 江湖 →
    the martial world, 师父 → Master, etc. ~20 entries). Whitelisted in
    .gitignore so it ships publicly like names/example.json.
  - names/<imdb>.json gets an optional `glossary` field for per-film
    overrides + additions. Same schema, per-film entries override
    global on key collisions.
  - PinSub scans each cue's Hanzi for any glossary key, passes hits
    to TIMMY as a "Glossary:" section. Longest-match-first so compound
    terms (天下第一枪) match before atomic terms (枪); see the sketch below.
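
A sketch of that longest-match-first scan; the span-masking trick is an assumed implementation detail, not necessarily PinSub's.

```python
def glossary_hits(hanzi: str, glossary: dict) -> dict:
    """Find glossary terms in a cue, longest key first, so 天下第一枪
    matches before 枪. Matched spans are masked so an atomic term
    cannot re-match inside a compound that already claimed it."""
    hits, remaining = {}, hanzi
    for key in sorted(glossary, key=len, reverse=True):
        if key in remaining:
            hits[key] = glossary[key]
            remaining = remaining.replace(key, "\x00" * len(key))
    return hits
```

The hits would then be rendered into the prompt's "Glossary:" section.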

Python validators + retry loop:
  - validate_translation() checks each TIMMY output for Hanzi-in-output
    (hard reject), name canonicalization (when name_map provided), and
    length sanity (only on longer cues); its shape is sketched after this list.
  - Failures feed back into the next TIMMY call as explicit feedback
    ("Your output still contains Chinese characters: 镖局. Translate
    every Hanzi..."). Loop up to --llm-rounds (default 3) iterations.
  - rounds_used histogram emitted in action counts so the operator can
    see how often round-2/3 retries are needed.
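
A sketch of the validator's shape. The three checks mirror the list above; the thresholds and the feedback wording (beyond the quoted Hanzi message) are assumptions.

```python
def validate_translation(out: str, hanzi: str = "", name_map: dict = None):
    """Return feedback text for the retry loop, or None if the output passes."""
    leaked = [ch for ch in out if "\u4e00" <= ch <= "\u9fff"]
    if leaked:                                 # hard reject: Hanzi in output
        return ("Your output still contains Chinese characters: "
                + "".join(leaked) + ". Translate every Hanzi.")
    if name_map:
        for zh, en in name_map.items():        # name canonicalization
            if zh in hanzi and en.lower() not in out.lower():
                return f"Use the canonical spelling '{en}' for {zh}."
    if len(hanzi) > 20 and len(out) < len(hanzi):  # length sanity, assumed bound
        return "Output looks too short for the source line."
    return None
```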

CPU thread cap (--llm-threads, default 4): caps llama-server's CPU
side. Vulkan path spends most time on the GPU; this is the knob to keep
host CPU headroom (e.g. when other workloads are using the CPU).

CLI additions:
  --llm-rounds <int>    Max validator-driven retry rounds (default 3).
  --llm-threads <int>   CPU thread cap for llama-server (default 4).

CTHD smoke (10 cues, refactor pass):
  - 镖局 now translated as "security agency" via global glossary
    (previously left as Hanzi by TIMMY).
  - All 10 cues passed round-1 validation; no retries needed in this
    sample. Loop is wired; just didn't fire because the prompt + hints
    were sufficient.
  - File-specific glossary entries (青冥剑 → Qingming Sword, 天下第一枪
    → Number One Spear under Heaven) ready for use on larger runs.

Public-repo additions: README_TIMMY.md, glossary.json. Both tracked.
Surfaces subtitle FORMATTING risks the user actually cares about: long
pinyin rows that overflow on screen, multi-cue mergers that span 4+
lines, fast-dialogue clusters where Python alignment may have slipped,
hold-for-reading extensions that bleed into the next beat.

Python heuristics score each cue on multiple signals (scoring sketched after the list):
  - chars=N         total joined-text length over 100
  - long-line=N     single line over 80 chars (pinyin overflow risk)
  - lines=N         joined cue has 4+ lines (merger artifact)
  - long-dur=Nms    duration > 8 s (hold-for-reading bleed)
  - short-dur=Nms   < 700 ms with > 12 chars (fast-dialogue alignment risk)
  - dense=Nc/s      char density > 30 chars/sec (unreadable in time)
  - fast-cluster    both neighbor gaps < 400 ms (rapid dialogue)
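
A sketch of the scoring pass, assuming one point per fired signal and illustrative cue fields (start_ms, end_ms, joined_text); the thresholds come from the list above.

```python
def score_cue(cue, prev_gap_ms: float, next_gap_ms: float):
    """Return (score, tags) for one cue; one point per fired signal."""
    text = cue.joined_text                     # hanzi + pinyin + english joined
    n_lines = text.count("\n") + 1
    dur_s = (cue.end_ms - cue.start_ms) / 1000
    tags = []
    if len(text) > 100:
        tags.append(f"chars={len(text)}")
    if any(len(line) > 80 for line in text.splitlines()):
        tags.append("long-line")               # pinyin overflow risk
    if n_lines >= 4:
        tags.append(f"lines={n_lines}")        # merger artifact
    if dur_s > 8:
        tags.append(f"long-dur={int(dur_s * 1000)}ms")
    if dur_s < 0.7 and len(text) > 12:
        tags.append(f"short-dur={int(dur_s * 1000)}ms")
    if dur_s > 0 and len(text) / dur_s > 30:
        tags.append(f"dense={len(text) / dur_s:.0f}c/s")
    if prev_gap_ms < 400 and next_gap_ms < 400:
        tags.append("fast-cluster")            # rapid dialogue
    return len(tags), tags
```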

Top-N (default 12) cues by combined score are picked. ffmpeg extracts a
single frame at each cue's mid-timestamp with the trilingual subtitle
burned in via the `subtitles=` filter. Frames + an HTML grid index land
under <out>.spotcheck/ next to the trilingual output.

The HTML is self-contained — file:// img URLs work offline. Each card
shows the frame, the cue index, timing, duration, the score tags that
explain WHY the cue was picked, and the trilingual text.

CTHD test run: 8/8 picked cues all carry long-line + dense signals,
matching the user's primary concern (pinyin row overflow on fast-talking
scenes). Example: cue 471 has a 23-syllable pinyin line crammed into a
4.2-second cue at 39 chars/sec — overflow at burn-in is nearly certain.

CLI:
  --spotcheck       enable the spot-check pass after writing the trilingual output
  --spotcheck-n N   number of cues to surface (default 12)

No new Python dependencies (uses ffmpeg plus the stdlib html module; no
PIL). Path B (vision-LLM judging frame+subtitle pairs) layers on top once
a VLM GGUF is available.
ffmpeg's `subtitles=` filter (libass) uses output PTS to decide which
cue is active. Input -ss zeros the PTS, so the filter saw time 0 and
emitted no subtitle pixels (subtitle:0KiB in the muxer report). Frames
came out clean of the burned-in subtitle, defeating the spot-check.

Fix: -copyts preserves the original stream PTS through the seek, so
libass sees the real timestamp and renders the right cue. Also add
-update 1 to silence the muxer's image-sequence-pattern warning, and
keep the fast input-seek (-ss 5s before cue) + precise output-seek
combination for performance.
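
A sketch of the fixed extraction call as a Python subprocess, matching the flag combination described above (fast input-seek, -copyts, precise output-seek, -update 1); the function name is hypothetical and subtitle-filter path escaping is simplified.

```python
import subprocess

def extract_frame(video: str, srt: str, ts: float, out_png: str) -> None:
    seek = max(ts - 5.0, 0.0)           # fast input-seek ~5 s before the cue
    subprocess.run([
        "ffmpeg", "-hide_banner", "-y",
        "-ss", str(seek), "-copyts",    # -copyts keeps original PTS for libass
        "-i", video,
        "-ss", str(ts),                 # precise output-seek to the cue midpoint
        "-vf", f"subtitles={srt}",      # burn the trilingual subtitle in
        "-frames:v", "1",
        "-update", "1",                 # single-image output, no pattern warning
        out_png,
    ], check=True)
```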

Tested on CTHD cue 368 (Wudang temple scene): trilingual subtitle now
burned in correctly — Hanzi top, pinyin middle, English bottom.

Also de-duplicated the fast-cluster tag/score increment in
select_spotcheck_cues — a leftover from prior iteration was scoring
some cues twice.
