Dictionary + LLM correction stages (--enrich, --llm-verify) #3
CC-CEDICT-driven empty-cue fill. The Chinese subtitles are the source of
truth; this stage fills cues where the merged trilingual result has no
English line, building a literal gloss from CC-CEDICT entries via
greedy-longest-match segmentation.
Conservative: never overwrites existing English (LLM verifier territory).
Every fill is logged to <out>.changes.tsv as a side channel for review.
- load_cedict() parses the standard CC-CEDICT format into
{simplified: [(pinyin, [defs])]}.
- segment_greedy() does left-to-right longest-match with a 12-char window.
- gloss_hanzi() respects name_map: known proper nouns use the bare-pinyin
form (e.g. 慕白 -> "Mubai") instead of the literal CC-CEDICT gloss
(would have been "to admire white").
- clean_gloss() strips editorial annotations: leading parentheticals
("(literary)", "(courteous, as opposed to ...)"), bracketed pinyin
("[ni3]"), inner annotations, "(CL:...)" classifier hints, and "see
also X" cross-references.
- Particle overrides drop sentence-ending modal particles (啦, 啊, 呀,
哦, 嘛, 呢, 吧, etc.) entirely; they convey no English-row content.
- Imperative overrides emit the bare command form for common dialogue
verbs (停 -> "Stop", 来 -> "Come", 让开 -> "Move aside") instead of
the dictionary's "to X" infinitive.
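The greedy segmenter described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: it assumes a plain dict keyed by simplified Hanzi, whereas the real load_cedict() maps each key to a list of (pinyin, defs) entries.

```python
def segment_greedy(text: str, cedict: dict, max_len: int = 12) -> list[str]:
    """Left-to-right longest-match over dictionary keys, 12-char window."""
    segs, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking toward a single char
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in cedict:
                segs.append(text[i:j])
                i = j
                break
        else:
            # no dictionary hit: emit the single character and move on
            segs.append(text[i])
            i += 1
    return segs
```

With a toy dict containing both 让开 and 让, the compound wins: 让开 segments as one unit, not as 让 + 开.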
CLI: --enrich (off by default), --cedict <path>. Default cedict path is
Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt.
CTHD test run: 42 of 44 empty cues filled. The 2 skips are correct
no-ops (pure-particle cue and English-only wanted-poster cue).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cdbabe6280
```python
        return cues
    changes: list[tuple[int, str, str, str, str]] = []
    for c in cues:
        if c.hanzi and not c.english.strip():
```
Skip enrichment for cues without Hanzi
When `--enrich` sees an empty-English cue whose Chinese-track text contains only Latin letters, numbers, or punctuation, this condition still calls `gloss_hanzi`. That function passes non-Hanzi characters through one by one and then joins them with spaces, so a cue like WANTED becomes an English fill of "W A N T E D" and is logged as a dictionary change. This affects no-Hanzi subtitle cards and signs in the Chinese SRT. Require an actual Hanzi match before filling, or have `gloss_hanzi` return empty when no Hanzi was glossed.
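A guard along the lines the review suggests could look like this (a hypothetical helper, not code from the PR; it checks only the main CJK Unified Ideographs block):

```python
def has_hanzi(text: str) -> bool:
    # CJK Unified Ideographs main block; extension blocks omitted for brevity
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

# Gate the fill on has_hanzi(cue_text) so no-Hanzi cards
# (like the WANTED poster above) are skipped instead of glossed.
```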
Layers a local-LLM verification stage on top of Phase 1. When --llm-verify is set, PinSub spawns a llama.cpp HTTP server on the Vulkan device, re-translates the dictionary-filled cues, replaces dict glosses with idiomatic English, and caches per-cue results so re-runs skip cues already verified.
CTHD test run: 42 dictionary fills → 39 LLM replacements; the 3 unreplaced are cases where the dict's imperative override (Stop!, Move aside) or name_map (Mubai, Xiulian) already produced the right output and the LLM agreed.
- Server lifecycle: find_arc_vulkan_device() picks the Arc device id from `llama-server --list-devices` (skipping the iGPU). start_llama_server() spawns with --gpu-layers auto --no-warmup and polls /health. The 🔴 STARTING / 🟧 A770 USE banner and final "double-confirmed to be closed" line meet Master §1.6.
- Prompt design: the system prompt tells the model the Chinese is truth and to output only the English line. The user prompt carries the Chinese, the existing English (if any), the dict suggestion as a hint, the two prior cues for context, and name_map hints for any character mentioned. /no_think is appended to suppress Qwen3's reasoning mode (which otherwise spends the whole token budget on reasoning_content while content stays empty).
- Defensive fallback: if a model ignores /no_think and returns empty content with non-empty reasoning_content, llm_complete extracts a best-guess answer from the reasoning stream.
- Cache: Research/cache/<imdb>.llm.json, keyed by sha1(model | hanzi). Re-runs skip cached cues entirely; the cache survives across runs of the same film.
- A770 logging: every dispatch appends a structured row to Logs/A770_usage.md with timestamp, model, quant, n_ctx, device, call count, prompt/completion token totals, completion tok/s, JIT time, total wall time, success/failure, and a source-session pointer. Matches the K-Arc2/A770/data/characterization.md row format so the data is publishable.
CLI:
--llm-verify            Enable the verifier (requires --enrich).
--llm-model <gguf-path>
--llm-server <exe-path>
--llm-port              Default 8765, to avoid colliding with TIMMY on 8080.
--llm-ctx
--llm-max               Cap cue count for smoke tests.
Default sources for the two paths are the PINSUB_LLM_GGUF and PINSUB_LLAMA_SERVER env vars, so the only environment surface needed is two paths — no directory-wide access to the model store. No new Python dependencies (stdlib http.client + json + subprocess).
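The skip-if-cached behaviour can be sketched like this. The "|" separator inside the sha1 input is an assumption; the source only states the key is sha1(model | hanzi).

```python
import hashlib

def cue_key(model: str, hanzi: str) -> str:
    # per-cue cache key over the model identifier and the cue's Hanzi
    return hashlib.sha1(f"{model}|{hanzi}".encode("utf-8")).hexdigest()

def verify_cue(cache: dict, model: str, hanzi: str, translate) -> str:
    """Call the LLM only when this (model, hanzi) pair was never verified."""
    key = cue_key(model, hanzi)
    if key not in cache:
        cache[key] = translate(hanzi)  # expensive LLM round-trip
    return cache[key]
```

On a re-run against the same film the cache dict is loaded from `<imdb>.llm.json`, so `translate` is never invoked for already-verified cues.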
Bigger-scope correction than --llm-verify. Where Phase 2 only re-translates
the cues Phase 1 filled (empty English), Phase 3 also goes after cues
whose existing English diverges from what the Chinese says. Python does
the heavy lifting (heuristic prefiltering, decision logic, name-map
enforcement); the LLM is the dumb worker that translates and judges.
Pipeline per cue:
1. Python heuristic decides whether the cue is worth sending to the
LLM at all. Cheap signals: empty English; low content-word overlap
between the CC-CEDICT dict-gloss and the existing English; length
ratio outliers between English chars and Hanzi chars. Tweakable
overlap_threshold (default 0.2). Reports a histogram of reasons so
the user can tune.
2. TIMMY-translate (fresh prompt, no existing English in context)
produces a Chinese-faithful candidate. Per-cue cache.
3. If existing English exists AND differs from TIMMY's, a separate
fresh-context TIMMY call judges A=existing, B=TIMMY, or C=write a
better one. Per-cue cache.
4. Python applies the verdict, logs the change.
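The step-1 heuristic might look like the sketch below. The stopword set and exact signal names are illustrative; the source fixes only the 0.2 default for overlap_threshold.

```python
STOP = {"the", "a", "an", "to", "of", "is", "in", "and", "you", "i"}

def content_words(s: str) -> set[str]:
    # crude content-word extraction: lowercase, strip punctuation, drop stopwords
    return {w.strip(".,!?").lower() for w in s.split()} - STOP

def flag_cue(dict_gloss: str, english: str, overlap_threshold: float = 0.2):
    """Return a reason string when the cue looks worth sending to the LLM."""
    if not english.strip():
        return "empty-english"
    g, e = content_words(dict_gloss), content_words(english)
    if g and len(g & e) / len(g) < overlap_threshold:
        return "low-overlap"
    return None
```

Counting the returned reasons across all cues gives the histogram the operator uses to tune the threshold.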
CLI:
--llm-correct Enable Phase 3.
--llm-scope {fills,divergent,all}
Which cues to consider. Default 'divergent'
(Python heuristic flags suspicion). 'fills'
is the conservative Phase-2-equivalent
(empty-English cues only). 'all' touches
every cue with Chinese — expensive.
--llm-no-compare Skip the comparison pass (use TIMMY's
translation outright when it differs from
existing). Halves LLM call volume.
Caching:
Translation cache key: T|sha1(model | hanzi). Compare cache key:
C|sha1(model | hanzi | en_a | en_b). Stored at
Research/cache/<imdb>.correct.json. Re-runs against the same film
skip both translate and compare for cues already seen.
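The two key families can be derived with one shared helper. A sketch: joining the parts with "|" is an assumption, chosen to match the sha1(model | hanzi) notation above.

```python
import hashlib

def _sha1(*parts: str) -> str:
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

def translate_key(model: str, hanzi: str) -> str:
    # translation cache: one entry per (model, cue Hanzi)
    return "T|" + _sha1(model, hanzi)

def compare_key(model: str, hanzi: str, en_a: str, en_b: str) -> str:
    # compare cache: also keyed on both English candidates,
    # so a changed TIMMY translation forces a fresh verdict
    return "C|" + _sha1(model, hanzi, en_a, en_b)
```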
A770 logging:
Each --llm-correct run appends one row to Logs/A770_usage.md with the
job string including the chosen scope and a per-action count summary
in the notes column.
CTHD divergent-scope pre-survey: 752 of 1030 cues flagged as suspect
(73%). Smoke test of 10 cues showed 9/10 replaced with high-quality
output — TIMMY correctly identified that the Bluray English subs were
shifted/misaligned with the Chinese in many cues. Full run gated on
owner approval given the A770 contention cost.
No new Python dependencies; reuses the LlamaSession infrastructure
introduced in Phase 2.
TIMMY's system prompt now lives in README_TIMMY.md next to PinSub.py
(loaded at startup, falls back to an inline default if absent). Pattern
matches the owner's TIMMY_SDNext.md convention — prompt-engineering
decisions live in markdown, not strings buried 1500 lines into Python.
Translation philosophy in the new prompt: word-for-word fidelity on
content (every noun/verb/name/modifier gets translated), function-word
latitude (drop Chinese particles English doesn't have, add English
articles/copulas Chinese implies), synonym latitude on word choice,
Chinese word order preserved when grammatically tolerable. Rearrange
only when literal order is incomprehensible.
Name handling generalized: the README_TIMMY rule (bare-pinyin with
tone marks stripped, syllable spacing preserved, capitals on each name
part) works on any film. If a per-film name_map provides a canonical
spelling, that's passed as a "Known names:" hint and wins. If not,
TIMMY uses the pinyin row in the prompt to derive the English form.
Glossary database:
- New glossary.json at the project root with seed wuxia/kungfu terms
(枪 → spear, 师娘 → Master's wife, 镖局 → security agency, 江湖 →
the martial world, 师父 → Master, etc. ~20 entries). Whitelisted in
.gitignore so it ships publicly like names/example.json.
- names/<imdb>.json gets an optional `glossary` field for per-film
overrides + additions. Same schema, per-film entries override
global on key collisions.
- PinSub scans each cue's Hanzi for any glossary key, passes hits
to TIMMY as a "Glossary:" section. Longest-match-first so compound
terms (天下第一枪) match before atomic terms (枪).
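The longest-match-first scan could be sketched as below, assuming the merged global + per-film glossary is a flat dict of Hanzi term → English rendering.

```python
def glossary_hits(hanzi: str, glossary: dict) -> list[tuple[str, str]]:
    """Collect glossary terms present in the cue, longest first, so a
    compound like 天下第一枪 suppresses its atomic part 枪."""
    hits = []
    for term in sorted(glossary, key=len, reverse=True):
        # skip terms already covered by a longer matched compound
        if term in hanzi and not any(term in longer for longer in hits):
            hits.append(term)
    return [(t, glossary[t]) for t in hits]
```

The returned pairs are what gets formatted into the "Glossary:" section of the TIMMY prompt.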
Python validators + retry loop:
- validate_translation() checks each TIMMY output for Hanzi-in-output
(hard reject), name canonicalization (when name_map provided), and
length-sanity (only on longer cues).
- Failures feed back into the next TIMMY call as explicit feedback
("Your output still contains Chinese characters: 镖局. Translate
every Hanzi..."). Loop up to --llm-rounds (default 3) iterations.
- rounds_used histogram emitted in action counts so the operator can
see how often round-2/3 retries are needed.
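The validator-driven loop, roughly. This sketch implements only the hard Hanzi reject; the name-canonicalization and length-sanity checks described above are elided.

```python
import re

HANZI = re.compile(r"[\u4e00-\u9fff]+")

def validate_translation(out: str):
    # hard reject: any Hanzi surviving in the English output
    leftover = HANZI.findall(out)
    if leftover:
        return ("Your output still contains Chinese characters: "
                + " ".join(leftover) + ". Translate every Hanzi.")
    return None

def translate_with_retries(llm, hanzi: str, rounds: int = 3):
    """Feed validator failures back as explicit feedback; returns (text, rounds_used)."""
    feedback = None
    for used in range(1, rounds + 1):
        out = llm(hanzi, feedback)
        feedback = validate_translation(out)
        if feedback is None:
            return out, used
    return out, rounds
```

The second element of the return value is what feeds the rounds_used histogram.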
CPU thread cap (--llm-threads, default 4): caps llama-server's CPU
side. Vulkan path spends most time on the GPU; this is the knob to keep
host CPU headroom (e.g. when other workloads are using the CPU).
CLI additions:
--llm-rounds <int> Max validator-driven retry rounds (default 3).
--llm-threads <int> CPU thread cap for llama-server (default 4).
CTHD smoke (10 cues, refactor pass):
- 镖局 now translated as "security agency" via global glossary
(previously left as Hanzi by TIMMY).
- All 10 cues passed round-1 validation; no retries needed in this
sample. Loop is wired; just didn't fire because the prompt + hints
were sufficient.
- File-specific glossary entries (青冥剑 → Qingming Sword, 天下第一枪
→ Number One Spear under Heaven) ready for use on larger runs.
Public-repo additions: README_TIMMY.md, glossary.json. Both tracked.
Surfaces subtitle FORMATTING risks the user actually cares about: long pinyin rows that overflow on screen, multi-cue mergers that span 4+ lines, fast-dialogue clusters where Python alignment may have slipped, hold-for-reading extensions that bleed into the next beat.
Python heuristics score each cue on multiple signals:
- chars=N        total joined-text length over 100
- long-line=N    single line over 80 chars (pinyin overflow risk)
- lines=N        joined cue has 4+ lines (merger artifact)
- long-dur=Nms   duration > 8 s (hold-for-reading bleed)
- short-dur=Nms  duration < 700 ms with > 12 chars (fast-dialogue alignment risk)
- dense=Nc/s     char density > 30 chars/sec (unreadable in time)
- fast-cluster   both neighbor gaps < 400 ms (rapid dialogue)
The top-N (default 12) cues by combined score are picked. ffmpeg extracts a single frame at each cue's mid-timestamp with the trilingual subtitle burned in via the `subtitles=` filter. Frames plus an HTML grid index land under <out>.spotcheck/ next to the trilingual output. The HTML is self-contained — file:// img URLs work offline. Each card shows the frame, the cue index, timing, duration, the score tags that explain WHY the cue was picked, and the trilingual text.
CTHD test run: 8/8 picked cues all carry long-line + dense signals, matching the user's primary concern (pinyin row overflow in fast-talking scenes). Example: cue 471 has a 23-syllable pinyin line crammed into a 4.2-second cue at 39 chars/sec — overflow at burn-in is nearly certain.
CLI:
--spotcheck      Enable the spot-check pass after writing the trilingual output.
--spotcheck-n N  Number of cues to surface (default 12).
No new Python dependencies (uses ffmpeg + stdlib html; no PIL). Path B (a vision LLM judging frame+subtitle pairs) layers on top once a VLM GGUF is available.
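The signal list above can be condensed into a scorer like this sketch. Thresholds are copied from the list; the tag formatting and one-point-per-signal weighting are assumptions.

```python
def score_cue(lines: list[str], dur_ms: int, gap_prev_ms: int, gap_next_ms: int):
    """Return (score, tags) for one cue; one point per tripped signal."""
    tags = []
    n = sum(len(line) for line in lines)          # joined-text length
    if n > 100:
        tags.append(f"chars={n}")
    longest = max((len(line) for line in lines), default=0)
    if longest > 80:
        tags.append(f"long-line={longest}")       # pinyin overflow risk
    if len(lines) >= 4:
        tags.append(f"lines={len(lines)}")        # merger artifact
    if dur_ms > 8000:
        tags.append(f"long-dur={dur_ms}ms")       # hold-for-reading bleed
    if dur_ms < 700 and n > 12:
        tags.append(f"short-dur={dur_ms}ms")      # fast-dialogue alignment risk
    cps = n * 1000 / dur_ms if dur_ms else 0.0
    if cps > 30:
        tags.append(f"dense={cps:.0f}c/s")        # unreadable in time
    if gap_prev_ms < 400 and gap_next_ms < 400:
        tags.append("fast-cluster")               # rapid dialogue
    return len(tags), tags
```

Sorting all cues by the returned score descending and taking the top N gives the spot-check candidates.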
ffmpeg's `subtitles=` filter (libass) uses output PTS to decide which cue is active. Input -ss zeros the PTS, so the filter saw time 0 and emitted no subtitle pixels (subtitle:0KiB in the muxer report). Frames came out clean of the burned-in subtitle, defeating the spot-check.
Fix: -copyts preserves the original stream PTS through the seek, so libass sees the real timestamp and renders the right cue. Also add -update 1 to silence the muxer's image-sequence-pattern warning, and keep the fast input-seek (-ss 5 s before the cue) + precise output-seek combination for performance.
Tested on CTHD cue 368 (Wudang temple scene): the trilingual subtitle now burns in correctly — Hanzi top, pinyin middle, English bottom.
Also de-duplicated the fast-cluster tag/score increment in select_spotcheck_cues — a leftover from a prior iteration was scoring some cues twice.
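Assembled as an argument list, the fixed extraction command looks roughly like the sketch below. The exact option set in the PR may differ; the point is the ordering — fast input seek before -i, then -copyts with the precise output seek so libass sees the original timestamps.

```python
def spotcheck_cmd(video: str, srt: str, cue_ms: int, out_png: str) -> list[str]:
    """Build an ffmpeg command extracting one subtitle-burned frame (sketch)."""
    t = cue_ms / 1000.0
    return [
        "ffmpeg",
        "-ss", f"{max(t - 5.0, 0):.3f}",  # fast input seek, ~5 s before the cue
        "-i", video,
        "-copyts",                         # preserve original PTS for libass
        "-ss", f"{t:.3f}",                 # precise output seek to the cue
        "-vf", f"subtitles={srt}",         # burn the trilingual subtitles
        "-frames:v", "1",                  # single frame
        "-update", "1",                    # no image-sequence-pattern warning
        "-y", out_png,
    ]
```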
Two-phase correction pipeline. The Chinese subtitles are the source of truth for any Chinese-language film; the existing English on Bluray rips / fan torrents is often missing or imprecise. This PR adds a dictionary stage that fills empty cues conservatively, then an optional local-LLM stage that re-translates the dictionary-filled cues into idiomatic English.
Pipeline integration
Phase 1 runs after `translations.json` overrides so manual per-cue overrides always win. Phase 2 only runs when Phase 1 actually filled cues.

Phase 1 — CC-CEDICT dictionary fill

- `load_cedict(path)` parses the standard CC-CEDICT text format into `{simplified: [(pinyin, [defs])]}`. Multi-entry keys are kept.
- `segment_greedy(text, cedict)` does left-to-right longest-match against dictionary keys. 12-char window. No external segmenter dependency.
- `gloss_segment(seg, cedict)` picks the entry with the most definitions ("most common reading" heuristic).
- `clean_gloss(gloss)` strips editorial annotations: leading parentheticals (`(literary)`, `(courteous, as opposed to ...)`), bracketed pinyin (`[ni3]`), inner editorial annotations, `(CL:...)` classifier hints, and `see also X` cross-references.
- `gloss_hanzi(hanzi, cedict, name_map=None)` is the full-line glosser. Segments matching a `name_map` key use the bare-pinyin form of the name instead of the literal gloss — so 慕白 stays `Mubai` instead of becoming `to admire white`.
- `PARTICLE_OVERRIDES` — sentence-ending modal particles (啦, 啊, 呀, 哦, 嘛, 呢, 吧, 嗯, 唉, 哟, etc.) drop entirely.
- `IMPERATIVE_LEMMA_OVERRIDES` — common dialogue verbs (停, 来, 走, 等, 看, 听, 请, 起, 坐, 让开, 小心, 别动, 不要, 快, etc.) emit the bare command form rather than the dictionary's `to X` infinitive.
- Every fill is logged to `<out>.changes.tsv`.

Phase 2 — local-LLM verifier (via llama.cpp-vulkan)
Spawns a local llama.cpp HTTP server on a Vulkan-capable GPU and re-translates the cues Phase 1 filled. Designed for an Intel Arc A770 16 GB; works on any Vulkan device with enough VRAM for the chosen model.
- `find_arc_vulkan_device(server_exe)` runs `llama-server --list-devices` and picks the Arc adapter (skipping the integrated GPU on multi-device systems).
- `start_llama_server()` spawns the server with `--gpu-layers auto --no-warmup` and polls `/health` until ready. Prints a 🟧 banner and a 🔴 STARTING line per the project's server-safety convention. `stop_llama_server()` terminates and verifies the port is freed.
- `/no_think` is appended to suppress Qwen3's reasoning mode (which otherwise consumes the whole token budget into `reasoning_content` with empty `content`).
- `_clean_llm_output()` strips wrapping quotes, leftover chat-template tags, and lead-in labels like "English:" / "Translation:".
- `llm_complete()` — if a model ignores `/no_think` and returns empty `content` with non-empty `reasoning_content`, it extracts a best-guess answer from the reasoning stream rather than failing.
- Cache: `Research/cache/<imdb>.llm.json`, keyed by `sha1(model | hanzi)`. Re-runs skip cached cues; the cache survives across runs of the same film.

CLI

- `--enrich` — enable the dictionary fill; every change is logged to `<out>.changes.tsv`.
- `--cedict <path>` — default `Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt`.
- `--llm-verify` — requires `--enrich`.
- `--llm-model <path>` — defaults to the `$PINSUB_LLM_GGUF` env var.
- `--llm-server <path>` — the `llama-server.exe` path; defaults to the `$PINSUB_LLAMA_SERVER` env var.
- `--llm-port <int>`
- `--llm-ctx <int>`
- `--llm-max <int>`

Validation — Crouching Tiger, Hidden Dragon (2000)
Phase 1 filled 42 of 44 empty cues; the 2 skips are correct no-ops (a pure-particle cue, 嗯, and an English-only wanted-poster cue with no Hanzi). Phase 2 replaced 39 of the 42 fills; the 3 unreplaced are cases where the dict's imperative override or `name_map` already produced the right output (`Stop!`, `Mubai`, `Xiulian`) and the LLM agreed.

Sample of the Phase-1 → Phase-2 diff:
- lit. hidden dragon, crouching tiger (idiom) → Crouching Tiger, Hidden Dragon
- apex → He's on the rooftop!
- green dark double-edged sword to be... drowned to retrieve Come → The Green Darkness Sword wasn't recovered.
- You old fox, you ruined my family, and today I'll kill you...
- Be careful → Watch out.
- The End (appearing at the end of a movie etc) , to thank to look at sth with pleasure... → The End. Thank you for watching!

Dependencies
No new Python deps. Phase 2 talks HTTP to a separately installed `llama-server.exe`. CC-CEDICT (CC BY-SA 4.0) is acquired separately, not shipped with this repo.

CC-CEDICT acquisition
Phase 2 prerequisites
Set the two env vars and run: