Dictionary + LLM correction stages (--enrich, --llm-verify) #3

Open · Boladi888 wants to merge 6 commits into master from feat-dict-correction

Conversation

Boladi888 (Owner) commented May 10, 2026

Two-phase correction pipeline. The Chinese subtitles are the source of truth for any Chinese-language film; the existing English on Blu-ray rips / fan torrents is often missing or imprecise. This PR adds a dictionary stage that fills empty cues conservatively, then an optional local-LLM stage that re-translates the dictionary-filled cues into idiomatic English.

Pipeline integration

parse → simplify → pinyin → align → merge → translations.json overrides
        → PHASE 1 (--enrich, dict)   → PHASE 2 (--llm-verify, local LLM)
        → validate → write

Phase 1 runs after translations.json overrides so manual per-cue overrides always win. Phase 2 only runs when Phase 1 actually filled cues.

Phase 1 — CC-CEDICT dictionary fill

  • Conservative. Only fills empty English cues; never overwrites existing English.
  • load_cedict(path) parses the standard CC-CEDICT text format into {simplified: [(pinyin, [defs])]}. Multi-entry keys are kept.
  • segment_greedy(text, cedict) does left-to-right longest-match against dictionary keys with a 12-char window; no external segmenter dependency. Both helpers are sketched after this list.
  • gloss_segment(seg, cedict) picks the entry with the most definitions ("most common reading" heuristic).
  • clean_gloss(gloss) strips editorial annotations: leading parentheticals ((literary), (courteous, as opposed to ...)), bracketed pinyin ([ni3]), inner editorial annotations, (CL:...) classifier hints, and see also X cross-references.
  • gloss_hanzi(hanzi, cedict, name_map=None) is the full-line glosser. Segments matching a name_map key use the bare-pinyin form of the name instead of the literal gloss — so 慕白 stays Mubai instead of becoming to admire white.
  • Override tables for high-frequency cases:
    • PARTICLE_OVERRIDES — sentence-ending modal particles (啦, 啊, 呀, 哦, 嘛, 呢, 吧, 嗯, 唉, 哟, etc.) drop entirely.
    • IMPERATIVE_LEMMA_OVERRIDES — common dialogue verbs (停, 来, 走, 等, 看, 听, 请, 起, 坐, 让开, 小心, 别动, 不要, 快, etc.) emit the bare command form rather than the dictionary's to X infinitive.
  • Every change logged to <out>.changes.tsv.
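
For orientation, a minimal sketch of the two Phase-1 helpers named above. The function names and the returned dictionary shape follow the description; the regex and control flow are assumptions, not the PR's actual code.

```python
import re
from collections import defaultdict

# CC-CEDICT line format: "傳統 传统 [chuan2 tong3] /tradition/traditional/..."
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]+)\] /(.+)/$")

def load_cedict(path):
    """Parse CC-CEDICT into {simplified: [(pinyin, [defs]), ...]}."""
    cedict = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):        # header / comment lines
                continue
            m = CEDICT_LINE.match(line.strip())
            if m:
                _trad, simp, pinyin, defs = m.groups()
                cedict[simp].append((pinyin, defs.split("/")))
    return dict(cedict)

def segment_greedy(text, cedict, max_len=12):
    """Left-to-right longest-match segmentation over dictionary keys."""
    segs, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in cedict:
                segs.append(text[i:j])
                i = j
                break
        else:                               # no hit: pass the char through
            segs.append(text[i])
            i += 1
    return segs
```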

Phase 2 — local-LLM verifier (via llama.cpp-vulkan)

Spawns a local llama.cpp HTTP server on a Vulkan-capable GPU and re-translates the cues Phase 1 filled. Designed for an Intel Arc A770 16 GB; works on any Vulkan device with enough VRAM for the chosen model.

  • find_arc_vulkan_device(server_exe) runs llama-server --list-devices and picks the Arc adapter (skipping the integrated GPU on multi-device systems).
  • start_llama_server() spawns the server with --gpu-layers auto --no-warmup, polls /health until ready. Prints a 🟧 banner and a 🔴 STARTING line per the project's server-safety convention.
  • stop_llama_server() terminates and verifies the port is freed.
  • System prompt says the Chinese is truth and to output only the English line. /no_think is appended to suppress Qwen3's reasoning mode (which otherwise consumes the whole token budget into reasoning_content with empty content).
  • User prompt carries: the Chinese, the existing English (if any), the dict suggestion as a hint, the two prior cues for context, and name-map hints for any character mentioned.
  • Defensive _clean_llm_output() strips wrapping quotes, leftover chat-template tags, and lead-in labels like "English:" / "Translation:".
  • Defensive fallback in llm_complete() — if a model ignores /no_think and returns empty content with non-empty reasoning_content, extract a best-guess answer from the reasoning stream rather than failing.
  • Per-cue cache at Research/cache/<imdb>.llm.json, keyed by sha1(model | hanzi). Re-runs skip cached cues; the cache survives across runs of the same film (key scheme sketched below).
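
A sketch of that cache scheme under stated assumptions: the key is sha1 over model + Hanzi per the PR, but the helper names, the `translate` callable, and the exact separator are illustrative.

```python
import hashlib, json
from pathlib import Path

def llm_cache_key(model: str, hanzi: str) -> str:
    # Key scheme per the PR: sha1 over model + Hanzi ("|" separator assumed).
    return hashlib.sha1(f"{model}|{hanzi}".encode("utf-8")).hexdigest()

def verify_cue(imdb_id: str, model: str, hanzi: str, translate) -> str:
    """Consult the per-cue cache before calling the LLM; `translate` is a
    placeholder callable standing in for the Phase-2 request."""
    cache_path = Path("Research/cache") / f"{imdb_id}.llm.json"
    cache = json.loads(cache_path.read_text("utf-8")) if cache_path.exists() else {}
    key = llm_cache_key(model, hanzi)
    if key not in cache:                    # re-runs skip cached cues
        cache[key] = translate(hanzi)
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache, ensure_ascii=False), "utf-8")
    return cache[key]
```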

CLI

| Flag | Default | What it does |
| --- | --- | --- |
| --enrich | off | Phase 1 dictionary correction. Writes <out>.changes.tsv. |
| --cedict <path> | Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt | Path to CC-CEDICT. |
| --llm-verify | off | Phase 2 LLM verify pass. Requires --enrich. |
| --llm-model <path> | $PINSUB_LLM_GGUF env var | GGUF model file. |
| --llm-server <path> | $PINSUB_LLAMA_SERVER env var | llama-server.exe path. |
| --llm-port <int> | 8765 | localhost port (avoids colliding with port 8080). |
| --llm-ctx <int> | 4096 | llama-server context size. |
| --llm-max <int> | 0 = all | Cap the verify pass at N cues (smoke testing). |

Validation — Crouching Tiger, Hidden Dragon (2000)

  • 983 merged cues; 44 had no English.
  • Phase 1: 42 of 44 filled. The 2 skips are correct no-ops (a pure-particle cue, and an English-only wanted-poster cue with no Hanzi).
  • Phase 2 (Qwen3-14B Q5_K_M, Vulkan): 39 of 42 LLM-replaced. The 3 unreplaced are cases where the dict's imperative override or name_map already produced the right output (Stop!, Mubai, Xiulian) and the LLM agreed.

Sample of the Phase-1 → Phase-2 diff:

| Cue | Hanzi | Phase 1 (dict) | Phase 2 (LLM) |
| --- | --- | --- | --- |
| 1 | 卧虎藏龙 | lit. hidden dragon, crouching tiger (idiom) | Crouching Tiger, Hidden Dragon |
| 188 | 上房顶了 | apex | He's on the rooftop! |
| 208 | 青冥剑是没找回来 | green dark double-edged sword to be... drowned to retrieve Come | The Green Darkness Sword wasn't recovered. |
| 352 | 你这老狐貍害得我家破人亡 我今天杀了你…… | (verbose 6-line literal) | You old fox, you ruined my family, and today I'll kill you... |
| 906 | 小心 | Be careful | Watch out. |
| 983 | 剧终,谢谢观赏! | The End (appearing at the end of a movie etc), to thank to look at sth with pleasure... | The End. Thank you for watching! |

Dependencies

No new Python deps. Phase 2 talks HTTP to a separately-installed llama-server.exe. CC-CEDICT (CC BY-SA 4.0) is acquired separately, not shipped with this repo.

CC-CEDICT acquisition

mkdir -p Research/primary_sources/cedict
curl -sL -o Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz \
  https://www.mdbg.net/chinese/export/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz
gunzip Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz

Phase 2 prerequisites

  • A llama.cpp build with Vulkan support (any recent llama-server.exe).
  • A GGUF model with strong Chinese capability — Qwen2.5 or Qwen3 family recommended (Alibaba; trained heavily on Chinese).
  • Sufficient VRAM for the chosen model + a ~1.5 GB KV cache at ctx=4096.

Set the two env vars and run:

$env:PINSUB_LLAMA_SERVER = "path\to\llama-server.exe"
$env:PINSUB_LLM_GGUF     = "path\to\YourModel.gguf"

python PinSub.py --mkv ... --zh ... --english ... --out ... --enrich --llm-verify

CC-CEDICT-driven empty-cue fill. The Chinese subtitles are the source of
truth; this stage fills cues where the merged trilingual result has no
English line, building a literal gloss from CC-CEDICT entries via
greedy-longest-match segmentation.

Conservative: never overwrites existing English (LLM verifier territory).
Every fill is logged to <out>.changes.tsv as a side channel for review.

- load_cedict() parses the standard CC-CEDICT format into
  {simplified: [(pinyin, [defs])]}.
- segment_greedy() does left-to-right longest-match with a 12-char window.
- gloss_hanzi() respects name_map: known proper nouns use the bare-pinyin
  form (e.g. 慕白 -> "Mubai") instead of the literal CC-CEDICT gloss
  (would have been "to admire white").
- clean_gloss() strips editorial annotations: leading parentheticals
  ("(literary)", "(courteous, as opposed to ...)"), bracketed pinyin
  ("[ni3]"), inner annotations, "(CL:...)" classifier hints, and "see
  also X" cross-references.
- Particle overrides drop sentence-ending modal particles (啦, 啊, 呀,
  哦, 嘛, 呢, 吧, etc.) entirely; they convey no English-row content.
- Imperative overrides emit the bare command form for common dialogue
  verbs (停 -> "Stop", 来 -> "Come", 让开 -> "Move aside") instead of
  the dictionary's "to X" infinitive.

CLI: --enrich (off by default), --cedict <path>. Default cedict path is
Research/primary_sources/cedict/cedict_1_0_ts_utf-8_mdbg.txt.

CTHD test run: 42 of 44 empty cues filled. The 2 skips are correct
no-ops (pure-particle cue and English-only wanted-poster cue).

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdbabe6280

Comment thread: PinSub.py

    return cues
changes: list[tuple[int, str, str, str, str]] = []
for c in cues:
    if c.hanzi and not c.english.strip():
P2 · Skip enrichment for cues without Hanzi

When --enrich sees an empty-English cue whose Chinese track text contains only Latin text/numbers/punctuation, this condition still calls gloss_hanzi; that function passes non-Hanzi characters through one by one and then joins them with spaces, so a cue like WANTED becomes an English fill of W A N T E D and is logged as a dictionary change. This affects no-Hanzi subtitle cards or signs in the Chinese SRT; require an actual Hanzi match before filling, or have gloss_hanzi return empty when no Hanzi was glossed.
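
A minimal sketch of the guard this review suggests, assuming `c` is the cue object from the quoted loop; the helper name is hypothetical.

```python
def has_hanzi(text: str) -> bool:
    # Basic CJK Unified Ideographs block only; extension blocks omitted.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

# Hypothetical guard: only enrich cues that actually contain Hanzi.
if c.hanzi and has_hanzi(c.hanzi) and not c.english.strip():
    ...
```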


Layers a local-LLM verification stage on top of Phase 1. When --llm-verify
is set, spawn a llama.cpp HTTP server on the Vulkan device, re-translate
the dictionary-filled cues, replace dict glosses with idiomatic English,
and cache per-cue results so re-runs skip cues already verified.

CTHD test run: 42 dictionary fills → 39 LLM replacements; the 3 unreplaced
are cases where the dict's imperative override (Stop!, Move aside) or
name_map (Mubai, Xiulian) already produced the right output and the LLM
agreed.

- Server lifecycle: find_arc_vulkan_device() picks the Arc device id from
  `llama-server --list-devices` (skips the iGPU). start_llama_server()
  spawns with --gpu-layers auto --no-warmup and polls /health. The 🔴
  STARTING / 🟧 A770 USE banner and final "double-confirmed to be closed"
  line meet Master §1.6.
- Prompt design: system prompt tells the model the Chinese is truth and
  to output only the English line. User prompt carries the Chinese, the
  existing English (if any), the dict suggestion as a hint, the two prior
  cues for context, and name_map hints for any character mentioned. /no_think
  appended to suppress Qwen3's reasoning mode (which otherwise eats the
  whole token budget into reasoning_content with empty content).
- Defensive fallback: if a model ignores /no_think and returns empty
  content with non-empty reasoning_content, llm_complete extracts a
  best-guess answer from the reasoning stream (sketched after this list).
- Cache: Research/cache/<imdb>.llm.json, keyed by sha1(model | hanzi).
  Re-runs skip cached cues entirely. Cache survives across runs of the
  same film.
- A770 logging: every dispatch appends a structured row to
  Logs/A770_usage.md with timestamp, model, quant, n_ctx, device, call
  count, prompt/completion token totals, completion tok/s, JIT time,
  total wall, success/failure, and source session pointer. Matches the
  K-Arc2/A770/data/characterization.md row format so the data is
  publishable.
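
A sketch of that reasoning-stream fallback, assuming an OpenAI-style chat-completion `message` dict from llama-server; the last-line heuristic is an assumption, not the commit's actual logic.

```python
def extract_answer(message: dict) -> str:
    """Prefer normal content; fall back to the reasoning stream when a
    model ignores /no_think and leaves content empty."""
    content = (message.get("content") or "").strip()
    if content:
        return content
    reasoning = (message.get("reasoning_content") or "").strip()
    # Assumption: the final non-empty reasoning line is the best guess.
    lines = [ln.strip() for ln in reasoning.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```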

CLI: --llm-verify (requires --enrich), --llm-model <gguf-path>, --llm-server
<exe-path>, --llm-port (default 8765 to avoid colliding with TIMMY on
8080), --llm-ctx, --llm-max (cap cue count for smoke tests).

Default sources for paths are PINSUB_LLM_GGUF and PINSUB_LLAMA_SERVER env
vars, so the only environment surface needed is two paths — no
directory-wide access to the model store.

No new Python dependencies (stdlib http.client + json + subprocess).
Boladi888 changed the title from "Add Phase-1 dictionary correction (--enrich flag)" to "Dictionary + LLM correction stages (--enrich, --llm-verify)" on May 11, 2026
Boladi888 added 4 commits May 10, 2026 20:20
Bigger-scope correction than --llm-verify. Where Phase 2 only re-translates
the cues Phase 1 filled (empty English), Phase 3 also goes after cues
whose existing English diverges from what the Chinese says. Python does
the heavy lifting (heuristic prefiltering, decision logic, name-map
enforcement); the LLM is the dumb worker that translates and judges.

Pipeline per cue:
  1. Python heuristic decides whether the cue is worth sending to the
     LLM at all. Cheap signals: empty English; low content-word overlap
     between the CC-CEDICT dict-gloss and the existing English; length
     ratio outliers between English chars and Hanzi chars. Tweakable
     overlap_threshold (default 0.2). Reports a histogram of reasons so
     the user can tune. (Heuristic sketched after this list.)
  2. TIMMY-translate (fresh prompt, no existing English in context)
     produces a Chinese-faithful candidate. Per-cue cache.
  3. If existing English exists AND differs from TIMMY's, a separate
     fresh-context TIMMY call judges A=existing, B=TIMMY, or C=write a
     better one. Per-cue cache.
  4. Python applies the verdict, logs the change.
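
A sketch of the step-1 prefilter under stated assumptions: only overlap_threshold's default comes from the commit; the Jaccard-style overlap formula, the stopword list, and the length-ratio bounds are illustrative.

```python
import re

STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "it", "in", "on"}  # assumed

def content_words(text: str) -> set:
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def suspect_reason(english: str, dict_gloss: str, hanzi: str,
                   overlap_threshold: float = 0.2):
    """Return a reason tag if the cue should go to the LLM, else None."""
    if not english.strip():
        return "empty"
    a, b = content_words(english), content_words(dict_gloss)
    if a and b and len(a & b) / len(a | b) < overlap_threshold:
        return "low-overlap"                   # English diverges from dict gloss
    ratio = len(english) / max(len(hanzi), 1)  # English chars per Hanzi char
    if ratio < 0.8 or ratio > 12.0:            # outlier bounds are assumptions
        return "length-ratio"
    return None
```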

CLI:
  --llm-correct                Enable Phase 3.
  --llm-scope {fills,divergent,all}
                              Which cues to consider. Default 'divergent'
                              (Python heuristic flags suspicion). 'fills'
                              is the conservative Phase-2-equivalent
                              (empty-English cues only). 'all' touches
                              every cue with Chinese — expensive.
  --llm-no-compare            Skip the comparison pass (use TIMMY's
                              translation outright when it differs from
                              existing). Halves LLM call volume.

Caching:
  Translation cache key: T|sha1(model | hanzi). Compare cache key:
  C|sha1(model | hanzi | en_a | en_b). Stored at
  Research/cache/<imdb>.correct.json. Re-runs against the same film
  skip both translate and compare for cues already seen.

A770 logging:
  Each --llm-correct run appends one row to Logs/A770_usage.md with the
  job string including the chosen scope and a per-action count summary
  in the notes column.

CTHD divergent-scope pre-survey: 752 of 1030 cues flagged as suspect
(73%). Smoke test of 10 cues showed 9/10 replaced with high-quality
output — TIMMY correctly identified that the Bluray English subs were
shifted/misaligned with the Chinese in many cues. Full run gated on
owner approval given the A770 contention cost.

No new Python dependencies; reuses the LlamaSession infrastructure
introduced in Phase 2.
TIMMY's system prompt now lives in README_TIMMY.md next to PinSub.py
(loaded at startup, falls back to an inline default if absent). Pattern
matches the owner's TIMMY_SDNext.md convention — prompt-engineering
decisions live in markdown, not strings buried 1500 lines into Python.

Translation philosophy in the new prompt: word-for-word fidelity on
content (every noun/verb/name/modifier gets translated), function-word
latitude (drop Chinese particles English doesn't have, add English
articles/copulas Chinese implies), synonym latitude on word choice,
Chinese word order preserved when grammatically tolerable. Rearrange
only when literal order is incomprehensible.

Name handling generalized: the README_TIMMY rule (bare-pinyin with
tone marks stripped, syllable spacing preserved, capitals on each name
part) works on any film. If a per-film name_map provides a canonical
spelling, that's passed as a "Known names:" hint and wins. If not,
TIMMY uses the pinyin row in the prompt to derive the English form.

Glossary database:
  - New glossary.json at the project root with seed wuxia/kungfu terms
    (枪 → spear, 师娘 → Master's wife, 镖局 → security agency, 江湖 →
    the martial world, 师父 → Master, etc. ~20 entries). Whitelisted in
    .gitignore so it ships publicly like names/example.json.
  - names/<imdb>.json gets an optional `glossary` field for per-film
    overrides + additions. Same schema, per-film entries override
    global on key collisions.
  - PinSub scans each cue's Hanzi for any glossary key, passes hits
    to TIMMY as a "Glossary:" section. Longest-match-first so compound
    terms (天下第一枪) match before atomic terms (枪); see the sketch below.
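
A sketch of that longest-match-first scan; the span-masking trick is an assumed implementation detail, not necessarily PinSub's.

```python
def glossary_hits(hanzi: str, glossary: dict) -> dict:
    """Find glossary terms in a cue, longest key first, so 天下第一枪
    matches before 枪. Matched spans are masked so an atomic term
    cannot re-match inside a compound that already claimed it."""
    hits, remaining = {}, hanzi
    for key in sorted(glossary, key=len, reverse=True):
        if key in remaining:
            hits[key] = glossary[key]
            remaining = remaining.replace(key, "\x00" * len(key))
    return hits
```

The hits would then be rendered into the prompt's "Glossary:" section.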

Python validators + retry loop:
  - validate_translation() checks each TIMMY output for Hanzi-in-output
    (hard reject), name canonicalization (when name_map provided), and
    length sanity (only on longer cues); its shape is sketched after this list.
  - Failures feed back into the next TIMMY call as explicit feedback
    ("Your output still contains Chinese characters: 镖局. Translate
    every Hanzi..."). Loop up to --llm-rounds (default 3) iterations.
  - rounds_used histogram emitted in action counts so the operator can
    see how often round-2/3 retries are needed.
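
A sketch of the validator's shape. The three checks mirror the list above; the thresholds and the feedback wording (beyond the quoted Hanzi message) are assumptions.

```python
def validate_translation(out: str, hanzi: str = "", name_map: dict = None):
    """Return feedback text for the retry loop, or None if the output passes."""
    leaked = [ch for ch in out if "\u4e00" <= ch <= "\u9fff"]
    if leaked:                                 # hard reject: Hanzi in output
        return ("Your output still contains Chinese characters: "
                + "".join(leaked) + ". Translate every Hanzi.")
    if name_map:
        for zh, en in name_map.items():        # name canonicalization
            if zh in hanzi and en.lower() not in out.lower():
                return f"Use the canonical spelling '{en}' for {zh}."
    if len(hanzi) > 20 and len(out) < len(hanzi):  # length sanity, assumed bound
        return "Output looks too short for the source line."
    return None
```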

CPU thread cap (--llm-threads, default 4): caps llama-server's CPU
side. Vulkan path spends most time on the GPU; this is the knob to keep
host CPU headroom (e.g. when other workloads are using the CPU).

CLI additions:
  --llm-rounds <int>    Max validator-driven retry rounds (default 3).
  --llm-threads <int>   CPU thread cap for llama-server (default 4).

CTHD smoke (10 cues, refactor pass):
  - 镖局 now translated as "security agency" via global glossary
    (previously left as Hanzi by TIMMY).
  - All 10 cues passed round-1 validation; no retries needed in this
    sample. Loop is wired; just didn't fire because the prompt + hints
    were sufficient.
  - File-specific glossary entries (青冥剑 → Qingming Sword, 天下第一枪
    → Number One Spear under Heaven) ready for use on larger runs.

Public-repo additions: README_TIMMY.md, glossary.json. Both tracked.
Surfaces subtitle FORMATTING risks the user actually cares about: long
pinyin rows that overflow on screen, multi-cue mergers that span 4+
lines, fast-dialogue clusters where Python alignment may have slipped,
hold-for-reading extensions that bleed into the next beat.

Python heuristics score each cue on multiple signals (scoring sketched after the list):
  - chars=N         total joined-text length over 100
  - long-line=N     single line over 80 chars (pinyin overflow risk)
  - lines=N         joined cue has 4+ lines (merger artifact)
  - long-dur=Nms    duration > 8 s (hold-for-reading bleed)
  - short-dur=Nms   < 700 ms with > 12 chars (fast-dialogue alignment risk)
  - dense=Nc/s      char density > 30 chars/sec (unreadable in time)
  - fast-cluster    both neighbor gaps < 400 ms (rapid dialogue)
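
A sketch of the scoring pass, assuming one point per fired signal and illustrative cue fields (start_ms, end_ms, joined_text); the thresholds come from the list above.

```python
def score_cue(cue, prev_gap_ms: float, next_gap_ms: float):
    """Return (score, tags) for one cue; one point per fired signal."""
    text = cue.joined_text                     # hanzi + pinyin + english joined
    n_lines = text.count("\n") + 1
    dur_s = (cue.end_ms - cue.start_ms) / 1000
    tags = []
    if len(text) > 100:
        tags.append(f"chars={len(text)}")
    if any(len(line) > 80 for line in text.splitlines()):
        tags.append("long-line")               # pinyin overflow risk
    if n_lines >= 4:
        tags.append(f"lines={n_lines}")        # merger artifact
    if dur_s > 8:
        tags.append(f"long-dur={int(dur_s * 1000)}ms")
    if dur_s < 0.7 and len(text) > 12:
        tags.append(f"short-dur={int(dur_s * 1000)}ms")
    if dur_s > 0 and len(text) / dur_s > 30:
        tags.append(f"dense={len(text) / dur_s:.0f}c/s")
    if prev_gap_ms < 400 and next_gap_ms < 400:
        tags.append("fast-cluster")            # rapid dialogue
    return len(tags), tags
```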

Top-N (default 12) cues by combined score are picked. ffmpeg extracts a
single frame at each cue's mid-timestamp with the trilingual subtitle
burned in via the `subtitles=` filter. Frames + an HTML grid index land
under <out>.spotcheck/ next to the trilingual output.

The HTML is self-contained — file:// img URLs work offline. Each card
shows the frame, the cue index, timing, duration, the score tags that
explain WHY the cue was picked, and the trilingual text.

CTHD test run: 8/8 picked cues all carry long-line + dense signals,
matching the user's primary concern (pinyin row overflow on fast-talking
scenes). Example: cue 471 has a 23-syllable pinyin line crammed into a
4.2-second cue at 39 chars/sec — overflow at burn-in is nearly certain.

CLI:
  --spotcheck       enable the spot-check pass after writing the trilingual output
  --spotcheck-n N   number of cues to surface (default 12)

No new Python dependencies (uses ffmpeg plus the stdlib html module; no
PIL). Path B (vision-LLM judging frame+subtitle pairs) layers on top once
a VLM GGUF is available.
ffmpeg's `subtitles=` filter (libass) uses output PTS to decide which
cue is active. Input -ss zeros the PTS, so the filter saw time 0 and
emitted no subtitle pixels (subtitle:0KiB in the muxer report). Frames
came out clean of the burned-in subtitle, defeating the spot-check.

Fix: -copyts preserves the original stream PTS through the seek, so
libass sees the real timestamp and renders the right cue. Also add
-update 1 to silence the muxer's image-sequence-pattern warning, and
keep the fast input-seek (-ss 5s before cue) + precise output-seek
combination for performance.
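
A sketch of the fixed extraction call as a Python subprocess, matching the flag combination described above (fast input-seek, -copyts, precise output-seek, -update 1); the function name is hypothetical and subtitle-filter path escaping is simplified.

```python
import subprocess

def extract_frame(video: str, srt: str, ts: float, out_png: str) -> None:
    seek = max(ts - 5.0, 0.0)           # fast input-seek ~5 s before the cue
    subprocess.run([
        "ffmpeg", "-hide_banner", "-y",
        "-ss", str(seek), "-copyts",    # -copyts keeps original PTS for libass
        "-i", video,
        "-ss", str(ts),                 # precise output-seek to the cue midpoint
        "-vf", f"subtitles={srt}",      # burn the trilingual subtitle in
        "-frames:v", "1",
        "-update", "1",                 # single-image output, no pattern warning
        out_png,
    ], check=True)
```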

Tested on CTHD cue 368 (Wudang temple scene): trilingual subtitle now
burned in correctly — Hanzi top, pinyin middle, English bottom.

Also de-duplicated the fast-cluster tag/score increment in
select_spotcheck_cues — a leftover from prior iteration was scoring
some cues twice.
