Boladi888/PinSub


PinSub

A small Python tool that builds a trilingual subtitle track for a Chinese-language film, aimed at language learners. Each subtitle cue shows three lines:

汉字
hànzì
English

Simplified Chinese characters go on top, pinyin in the middle, and English on the bottom. The three lines share a single timestamp, so they appear together in any standard player (VLC, MPV, Plex, Subtitle Edit). If your source .srt is Traditional, PinSub converts it to Simplified before merging.
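
To make the "three lines, one timestamp" layout concrete, here is a minimal sketch of how a single .srt cue of that shape could be assembled. The function names `srt_timestamp` and `trilingual_cue` are made up for illustration and are not taken from PinSub.py.

```python
def srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def trilingual_cue(index: int, start_ms: int, end_ms: int,
                   hanzi: str, pinyin: str, english: str) -> str:
    """Build one SRT cue: index, timing line, then the three stacked rows."""
    timing = f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}"
    return "\n".join([str(index), timing, hanzi, pinyin, english]) + "\n"


if __name__ == "__main__":
    # Hypothetical cue, only to show the layout.
    print(trilingual_cue(1, 61_500, 64_000, "你好", "nǐ hǎo", "Hello"))
```

Because the three rows live inside one cue, a player has no way to show one without the others, which is the point of merging them.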

Why

If you're learning Mandarin and watching wuxia, dramas, or animation, you usually have to choose between English subs (you understand the plot) and Chinese subs (you practice reading). Even with both, you can't always tell how a character should be pronounced. PinSub merges everything into one track so you can read the characters, check the pinyin when you don't recognize one, and fall back to English when you're lost.

What it does

  1. Pulls the English subtitle out of an .mkv (Blu-ray rips usually carry an English track).
  2. Reads an existing Chinese .srt (Traditional or Simplified) and converts to Simplified if needed.
  3. Generates pinyin for the Chinese line: lowercase with tone-marked vowels by default (nǐ hǎo), with a fallback to numeric tones (ni3 hao3) if your player's font drops the diacritics. Capitals appear only at the start of a sentence and on known proper nouns (Lǐ Mùbái).
  4. Aligns the two tracks if their timing differs (Blu-ray English vs. fan-source Chinese rarely line up exactly).
  5. Writes a single trilingual file you can drop next to the video — both .srt and .ass are supported (.ass lets PinSub size each row independently and manage word wrap).
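
The alignment step can be pictured with a small sketch. This README does not describe PinSub's actual matching algorithm, so what follows is only one plausible approach: pair each Chinese cue with the English cue whose start time is nearest, and leave the English row empty when nothing falls inside a tolerance window. `Cue`, `nearest_cue`, and `pair_tracks` are hypothetical names; the 1500 ms default mirrors the documented --window-ms default.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional


class Cue(NamedTuple):
    start_ms: int
    end_ms: int
    text: str


def nearest_cue(start_ms: int, candidates: list[Cue], starts: list[int],
                window_ms: int) -> Optional[Cue]:
    """Return the candidate starting closest to start_ms, or None if it is
    more than window_ms away. candidates must be sorted by start_ms and
    starts must hold their start times in the same order."""
    pos = bisect_left(starts, start_ms)
    # Only the neighbours on either side of the insertion point can be nearest.
    neighbours = candidates[max(pos - 1, 0):pos + 1]
    best = min(neighbours, key=lambda c: abs(c.start_ms - start_ms), default=None)
    if best is None or abs(best.start_ms - start_ms) > window_ms:
        return None
    return best


def pair_tracks(zh: list[Cue], en: list[Cue],
                window_ms: int = 1500) -> list[tuple[Cue, Optional[Cue]]]:
    """Pair every Chinese cue with its nearest English cue, or with None."""
    en_sorted = sorted(en, key=lambda c: c.start_ms)
    starts = [c.start_ms for c in en_sorted]
    return [(z, nearest_cue(z.start_ms, en_sorted, starts, window_ms)) for z in zh]
```

A real aligner would probably also need to handle a constant offset between the two sources and cues that were split differently; the sketch ignores both.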

Requirements

  • Python 3.10+
  • ffmpeg and ffprobe on PATH (for inspecting and extracting subtitle streams).
  • Optional: Subtitle Edit if your Blu-ray English subtitle is image-based (PGS) and needs OCR before PinSub can read it.

Install the two Python deps:

pip install pypinyin opencc-python-reimplemented

Developed on Windows 11; Linux/macOS should also work.

Usage

Inspect the subtitle streams in a video:

python PinSub.py --inspect --mkv "Movie.mkv"
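
For readers curious what such an inspection involves, the sketch below shows one way to list subtitle streams with ffprobe and flag the image-based ones. It is not PinSub's code: the helper names are invented, and the set of bitmap codecs is my own shortlist of ffprobe codec names (PGS appears as hdmv_pgs_subtitle).

```python
import json
import subprocess

# Bitmap subtitle codecs as ffprobe names them; these need OCR to become text.
IMAGE_CODECS = {"hdmv_pgs_subtitle", "dvd_subtitle", "dvb_subtitle"}


def ffprobe_command(mkv_path: str) -> list[str]:
    """ffprobe invocation that reports subtitle streams only, as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "s",
        "-show_entries", "stream=index,codec_name:stream_tags=language,title",
        "-of", "json",
        mkv_path,
    ]


def parse_streams(ffprobe_json: str) -> list[dict]:
    """Reduce ffprobe's JSON to the fields needed to pick a track."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [
        {
            "index": s["index"],
            "codec": s.get("codec_name", "unknown"),
            "language": s.get("tags", {}).get("language", "und"),
            "image_based": s.get("codec_name") in IMAGE_CODECS,
        }
        for s in streams
    ]


def list_subtitle_streams(mkv_path: str) -> list[dict]:
    """Run ffprobe (must be on PATH) and return the parsed subtitle streams."""
    result = subprocess.run(ffprobe_command(mkv_path),
                            capture_output=True, text=True, check=True)
    return parse_streams(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to test the logic without a video file at hand.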

Generate trilingual subtitles:

python PinSub.py \
  --mkv "Movie.mkv" \
  --zh  "Movie.zh.srt" \
  --out "Movie.zh-en-pinyin.ass"

If the English subtitle stream is image-based (PGS, common on Blu-ray rips), PinSub writes a .sup file and stops with a message asking you to OCR it in Subtitle Edit. After OCR, re-run with --english:

python PinSub.py \
  --mkv "Movie.mkv" \
  --zh  "Movie.zh.srt" \
  --english "Movie.english.srt" \
  --out "Movie.zh-en-pinyin.ass"

CLI flags

Flag                          Default    What it does
--mkv <path>                  required   source video
--zh <path>                   required   Chinese .srt (Traditional or Simplified)
--out <path>                  required   output trilingual file (.ass recommended; .srt also supported)
--english <path>              -          pre-existing English .srt; skip ffmpeg extraction
--names <path>                auto       per-film proper-noun JSON (see Names files)
--inspect                     -          list subtitle streams in --mkv and exit
--pinyin-style tone|number    tone       tone marks (Lǐ) or numeric (Li3)
--no-simplify                 -          skip OpenCC Trad→Simp (use if --zh is already Simplified)
--window-ms <int>             1500       per-cue alignment tolerance, in milliseconds
--no-bom                      -          write output without UTF-8 BOM
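
As an illustration of what the two --pinyin-style values mean, the sketch below converts tone-marked syllables to numeric form using only the standard library. It is an assumption-laden example, not PinSub's implementation: it expects the pinyin to be split into syllables already (as a per-character pinyin library would return it), and it leaves neutral-tone syllables without a digit, which may differ from what PinSub emits.

```python
import unicodedata

# Combining marks that carry the four pinyin tones, keyed to their tone number.
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def syllable_to_numeric(syllable: str) -> str:
    """Convert one tone-marked syllable (e.g. "hǎo") to numeric form ("hao3")."""
    tone = ""
    letters = []
    # NFD splits "ǐ" into "i" plus a combining caron, so the mark can be removed.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # NFC rejoins any remaining marks, so ü stays a single character.
    return unicodedata.normalize("NFC", "".join(letters)) + tone


def to_numeric(syllables: list[str]) -> str:
    """Join already-separated syllables into a numeric-tone pinyin row."""
    return " ".join(syllable_to_numeric(s) for s in syllables)
```

The numeric form uses plain ASCII apart from ü, which is why it survives fonts that drop the combined diacritics.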

Names files

Different films have different proper nouns, and PinSub leaves it to the user to supply them per film. The format is one JSON file per film:

{
  "imdb_id": "tt0190332",
  "title": "Crouching Tiger, Hidden Dragon",
  "name_map": {
    "李慕白": "Lǐ Mùbái",
    "俞秀莲": "Yú Xiùlián"
  },
  "english_name_map": {
    "Shu Lien": "Xiu Lian"
  }
}

  • name_map maps Hanzi (Simplified, as it appears in the cue) to the capitalized pinyin output. Tone marks live on the lowercase vowel; the capital sits on the initial consonant, as in Lǐ Mùbái. PinSub processes longest keys first, so include both full names and given-name fragments if the dialogue uses both (李慕白 AND 慕白).
  • english_name_map maps old or Wade-Giles English spellings to the Hanyu Pinyin form used in the pinyin row, so the English line matches (Shu Lien → Xiu Lian). Matching uses a word-boundary regex and is case-sensitive.
  • Top-level fields like imdb_id and title are for your reference only — PinSub ignores them. Underscore-prefixed keys inside the maps (e.g. _help) are also ignored, useful for in-file comments.
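
The rules in the bullets above can be sketched in a few lines. How PinSub applies the maps internally is not documented here, so treat this as one possible reading: `usable_entries`, `split_on_names`, and `fix_english` are hypothetical helpers, and the real tool may integrate name handling with pinyin generation differently.

```python
import re
from typing import Optional


def usable_entries(mapping: dict[str, str]) -> dict[str, str]:
    """Drop underscore-prefixed keys such as "_help", which act as comments."""
    return {k: v for k, v in mapping.items() if not k.startswith("_")}


def split_on_names(hanzi: str, name_map: dict[str, str]) -> list[tuple[str, Optional[str]]]:
    """Split a cue into (text, pinyin) pieces; pinyin is None for non-name text."""
    names = usable_entries(name_map)
    if not names:
        return [(hanzi, None)] if hanzi else []
    # Longest keys first, so a full name is tried before any shorter key.
    keys = sorted(names, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(map(re.escape, keys)) + ")")
    # With a capturing group, re.split keeps the matched names in the result.
    return [(chunk, names.get(chunk)) for chunk in pattern.split(hanzi) if chunk]


def fix_english(line: str, english_name_map: dict[str, str]) -> str:
    """Replace old romanizations on word boundaries, case-sensitively."""
    for old, new in usable_entries(english_name_map).items():
        # A function replacement avoids backslash handling in the new spelling.
        line = re.sub(r"\b" + re.escape(old) + r"\b", lambda _m, new=new: new, line)
    return line
```

With the example file shown earlier, `fix_english("Shu Lien, wait!", ...)` would rewrite the name and leave the rest of the line alone, while a lowercase "shu lien" would stay untouched because matching is case-sensitive.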

PinSub looks for the names file in this order:

  1. --names <path> — explicit, always wins.
  2. Auto-detect: if your .mkv filename contains an IMDb tag like {imdb-tt0190332} (the Plex / TRaSH-Guides convention), PinSub looks for names/tt0190332.json next to PinSub.py.
  3. No file found: the pipeline still runs; pinyin won't get name capitalization and English won't get Wade-Giles fixes, but output is still readable.
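
The lookup order above could be implemented roughly as follows. This is a sketch rather than the code in PinSub.py: `find_names_file` and its parameters are invented for the example, and `script_dir` stands in for the folder that contains PinSub.py.

```python
import re
from pathlib import Path
from typing import Optional

# Matches tags like {imdb-tt0190332} in a filename and captures the IMDb id.
IMDB_TAG = re.compile(r"\{imdb-(tt\d+)\}")


def find_names_file(mkv_path: str, names_arg: Optional[str],
                    script_dir: Path) -> Optional[Path]:
    """Resolve the names file: explicit path, then IMDb tag lookup, then None."""
    if names_arg:
        # 1. An explicit --names path always wins.
        return Path(names_arg)
    match = IMDB_TAG.search(Path(mkv_path).name)
    if match:
        # 2. Auto-detect: names/<imdb-id>.json next to the script.
        candidate = script_dir / "names" / f"{match.group(1)}.json"
        if candidate.is_file():
            return candidate
    # 3. Nothing found: the caller carries on without name handling.
    return None
```

Returning None instead of raising keeps the third case cheap to handle: the caller can simply skip capitalization and Wade-Giles fixes.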

A starter file is at names/example.json. Copy it, rename to names/<your-imdb-id>.json, and fill in the maps for your film.

Roadmap

  • MT-Chinese for English-only films. Today PinSub needs an existing Chinese .srt. Generating a Chinese row by machine-translating the English would let PinSub work on any film. Engine choice (API vs local model) and trilingual layout are still being designed.

License

MIT.

Author

boladi
