feat: SRT-driven edit pipeline + edit-plan recommender #41
xiaogang-sudo wants to merge 3 commits into
Independent helper that assembles a final cut by aligning source ranges to an SRT timeline, bypassing the existing transcript-based EDL flow. Use when you have a finished script (script.srt = final captions timeline) and a list of source ranges keyed by SRT id.

Pipeline: parse SRT + plan -> strict validate -> align -> extract segments (per-source ffprobe, HDR tone-map, sync tails, cache) -> gap clips for non-contiguous SRT cues -> lossless concat -> final pass with optional global voice mix + subtitle burn LAST (Hard Rule 1).

Key correctness properties:
- All intermediates land in a safe-ASCII temp work_dir; CJK / quoted user paths never reach libavfilter or the concat demuxer.
- SRT input decoded with utf-8-sig / utf-8 / gb18030 / cp936 / cp1252 fallback; cue settings (position:90% etc.) tolerated.
- Per-segment cache keyed by ffmpeg version + encoding params + effective bg_volume so encoder tweaks invalidate stale clips.
- Source streams probed once; no-audio source auto-degrades bg_volume to 0 for its segments; out-of-bounds ranges fail fast.
- Global --voice spans the whole timeline (apad/atrim to total_duration in the final compose), not per-segment, so a 5s VO does not restart at every cut.
- 30ms audio fades + fps=24,setpts and aresample sync tails on every segment prevent A/V drift through many short concats.
- burn_subtitles is self-defending: unsafe subs paths are copied to a temp ASCII SRT before being fed to libavfilter.
- Batch (jobs.json / .csv) auto-isolates outputs by manifest index; --continue-on-error skips failing rows; --no-overwrite refuses to clobber existing outputs.

Includes examples (Form A array, Form B object with multi-source + voices, batch manifest, CJK SRT) and pytest coverage (14 e2e + batch tests using lavfi-synthesized media; passes against ffmpeg 8.x on Windows).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
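The encoding fallback described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code in `helpers/srt_driven_edit.py`: the function name `decode_srt_bytes`, the constant `SRT_ENCODINGS`, and the choice to raise `ValueError` when every codec fails are all assumptions.

```python
from __future__ import annotations

# Codec order taken from the commit message; the names below are hypothetical.
SRT_ENCODINGS = ("utf-8-sig", "utf-8", "gb18030", "cp936", "cp1252")


def decode_srt_bytes(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding) for the first codec that decodes without error."""
    for enc in SRT_ENCODINGS:
        try:
            # Strict decoding: a single invalid byte moves on to the next codec.
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Assumption: fail loudly rather than guess with replacement characters.
    raise ValueError("SRT bytes could not be decoded with any known encoding")


if __name__ == "__main__":
    text, enc = decode_srt_bytes("你好".encode("gb18030"))
    print(enc, text)
```

Order matters in a chain like this: Python's `utf-8-sig` codec also accepts input without a BOM, so in this sketch the plain `utf-8` entry acts only as a safety net, and the permissive legacy codecs come last so they cannot shadow valid UTF-8.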
…cript
Bridges the gap between Scribe word-level transcripts and the
srt_driven_edit pipeline. Given a final-cut script.srt and a source
recording's Scribe JSON, produces an edit_plan.json (Form A or B) plus
a sidecar review markdown for human-in-the-loop QA.
Matching strategy is intentionally local (no LLM, no API):
1. Filter the transcript to timestamped 'word' tokens (audio_event /
   spacing skipped; --keep-audio-events keeps markers as context).
2. Group consecutive words into non-overlapping candidates, breaking
   on sentence-end punctuation, silences >= gap_threshold, or speaker
   change. Long candidates split at phrase punctuation, then by hard
   word-level windows. All edges land on word boundaries.
3. Score each (cue, candidate) pair as
       0.7 * (0.6 * SequenceMatcher + 0.4 * Jaccard)
     + 0.3 * 1/(1+|dur_delta|/cue_dur)
   where Jaccard auto-switches between Latin word-token and CJK
   character-bigram representations.
4. Greedy assignment; --allow-reuse drops the no-reuse constraint.
5. Emit Form A (default, drop-in for srt_driven_edit --plan) or Form
   B; review markdown lists matched text, score, duration delta, and
   warnings (low score / duration mismatch /
   candidate-shorter-than-cue).
Hard failure modes (exit 1): any cue with no assignable candidate;
malformed transcript JSON; transcript with no word tokens.
Soft failures (warnings only): low score, candidate too short for cue.
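A rough sketch of how greedy assignment and the "no assignable candidate" hard failure could fit together is shown below. The commit message does not say whether pairs are taken globally best-first or cue by cue; this sketch assumes best-first, and `greedy_assign` and its argument shapes are hypothetical.

```python
from __future__ import annotations


def greedy_assign(
    scores: dict[tuple[int, int], float],
    cue_ids: list[int],
    *,
    allow_reuse: bool = False,
) -> dict[int, int]:
    """Map each cue id to a candidate index, taking the best-scoring pairs first."""
    assigned: dict[int, int] = {}
    used: set[int] = set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (cue, cand), _score in ranked:
        if cue in assigned:
            continue  # this cue already has its best available candidate
        if cand in used and not allow_reuse:
            continue  # candidate taken; the no-reuse constraint applies
        assigned[cue] = cand
        used.add(cand)
    missing = [c for c in cue_ids if c not in assigned]
    if missing:
        # SystemExit with a string message exits with status 1.
        raise SystemExit(f"no assignable candidate for cue(s): {missing}")
    return assigned


if __name__ == "__main__":
    demo = {(1, 0): 0.9, (2, 0): 0.8, (2, 1): 0.5}
    print(greedy_assign(demo, [1, 2]))
```

In the demo, cue 2 would prefer candidate 0 but loses it to cue 1, so it falls back to candidate 1; passing `allow_reuse=True` lets both cues share candidate 0.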
The matcher cannot understand storyline — if SRT narration words do
not appear in the source transcript, scores will be low. The sidecar
review.md is the manual QA surface; it is intentionally not pulled
into the plan (parse_plan in srt_driven_edit stays strict).
--packed (takes_packed.md) and --context-window flags are reserved
placeholders only; both raise no error but do not yet alter behavior.
Includes 11 pytest tests including a full end-to-end:
recommend -> sde.run_job -> final.mp4 against lavfi-synthesized media.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CLAUDE.md is auto-loaded by Claude Code when working in this directory, giving sessions a consistent picture of the project's scope, tech constraints, and out-of-bounds behaviors before the user has to say it. AGENTS.md does the same for Codex review sessions, classifying review output into must-fix / should-improve / later so suggestions are actionable rather than open-ended rewrites.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3 issues found across 15 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="helpers/recommend_edit_plan.py">
<violation number="1" location="helpers/recommend_edit_plan.py:134">
P2: `--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.</violation>
</file>
<file name="helpers/srt_driven_edit.py">
<violation number="1" location="helpers/srt_driven_edit.py:553">
P2: Per-segment voice files lack preflight audio stream validation</violation>
<violation number="2" location="helpers/srt_driven_edit.py:771">
P1: Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.</violation>
</file>
    vf_parts: list[str] = []
    if is_hdr_source(seg.source_path):
        vf_parts.append(TONEMAP_CHAIN)
    vf_parts.append(scale_filter_for(seg.source_path))
P1: Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/srt_driven_edit.py, line 771:
<comment>Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.</comment>
<file context>
@@ -0,0 +1,1522 @@
+ vf_parts: list[str] = []
+ if is_hdr_source(seg.source_path):
+ vf_parts.append(TONEMAP_CHAIN)
+ vf_parts.append(scale_filter_for(seg.source_path))
+
+ if seg.pad_short and seg.plan_src_dur + 1e-6 < target:
</file context>
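One possible way to address this finding is to pick a single output canvas per job and letterbox every segment into it, so all clips handed to the concat demuxer share one resolution. The sketch below is an assumption about a fix, not code from the PR: `uniform_canvas_filter` and `mismatched_sizes` are hypothetical helpers, while the `scale`, `pad`, and `setsar` filters and their options are standard ffmpeg.

```python
from __future__ import annotations


def uniform_canvas_filter(width: int, height: int) -> str:
    """Scale to fit inside width x height, then pad to exactly that size."""
    return (
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
        "setsar=1"
    )


def mismatched_sizes(clip_sizes: dict[str, tuple[int, int]]) -> dict[str, tuple[int, int]]:
    """Return the clips whose (width, height) differs from the first clip's."""
    if not clip_sizes:
        return {}
    expected = next(iter(clip_sizes.values()))
    return {name: size for name, size in clip_sizes.items() if size != expected}


if __name__ == "__main__":
    print(uniform_canvas_filter(1920, 1080))
    print(mismatched_sizes({"a.mp4": (1920, 1080), "b.mp4": (1080, 1920)}))
```

A preflight like `mismatched_sizes` could run just before the `-c copy` concat so that a mixed portrait/landscape plan fails with a clear message instead of producing glitched output.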
    return out


def build_candidates(
P2: `--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/recommend_edit_plan.py, line 134:
<comment>`--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.</comment>
<file context>
@@ -0,0 +1,561 @@
+ return out
+
+
+def build_candidates(
+ words: list[dict],
+ *,
</file context>
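One way to make the flag effective is to route both the loader and the candidate builder through a single predicate, so the two stages cannot disagree about which token types survive. The sketch below is only an assumption about such a fix: `filter_tokens` and `candidate_text` are hypothetical, and the token shape (dicts with `type`, `text`, `start`, `end`) is assumed rather than taken from the PR.

```python
from __future__ import annotations


def filter_tokens(tokens: list[dict], *, keep_audio_events: bool = False) -> list[dict]:
    """Keep timestamped 'word' tokens, plus 'audio_event' tokens when requested."""
    allowed = {"word", "audio_event"} if keep_audio_events else {"word"}
    return [
        t for t in tokens
        if t.get("type") in allowed and "start" in t and "end" in t
    ]


def candidate_text(tokens: list[dict], *, keep_audio_events: bool = False) -> str:
    """Join surviving token texts, using the same predicate as the loader."""
    kept = filter_tokens(tokens, keep_audio_events=keep_audio_events)
    return " ".join(t["text"] for t in kept)


if __name__ == "__main__":
    sample = [
        {"type": "word", "text": "hi", "start": 0.0, "end": 0.2},
        {"type": "spacing", "text": " ", "start": 0.2, "end": 0.3},
        {"type": "audio_event", "text": "(laughter)", "start": 0.3, "end": 1.0},
    ]
    print(candidate_text(sample))
    print(candidate_text(sample, keep_audio_events=True))
```

A regression test that asserts the two flag states produce different candidate text would catch the dead-code case described in the comment.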
            raise SystemExit(f"source '{name}' missing on disk: {sp}")
    for name, vp in voices_map.items():
        if not vp.exists():
            raise SystemExit(f"voice '{name}' missing on disk: {vp}")
P2: Per-segment voice files lack preflight audio stream validation
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/srt_driven_edit.py, line 553:
<comment>Per-segment voice files lack preflight audio stream validation</comment>
<file context>
@@ -0,0 +1,1522 @@
+ raise SystemExit(f"source '{name}' missing on disk: {sp}")
+ for name, vp in voices_map.items():
+ if not vp.exists():
+ raise SystemExit(f"voice '{name}' missing on disk: {vp}")
+ if legacy_default_source is not None and not legacy_default_source.exists():
+ raise SystemExit(f"--source missing on disk: {legacy_default_source}")
</file context>
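A preflight along the lines this comment asks for could probe each voice file for an audio stream before any extraction starts. The sketch below is an assumption, not the PR's code: `has_audio_stream` and `preflight_voices` are hypothetical names, and the injectable `run` / `probe` parameters exist only to keep the example testable without ffprobe installed. The ffprobe flags used (`-select_streams a`, `-show_entries stream=codec_type`, `-of csv=p=0`) are standard.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def has_audio_stream(path: Path, run=subprocess.run) -> bool:
    """Return True if ffprobe reports at least one audio stream in the file."""
    proc = run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_type", "-of", "csv=p=0", str(path)],
        capture_output=True,
        text=True,
    )
    # ffprobe prints one "audio" line per audio stream; empty output means none.
    return proc.returncode == 0 and "audio" in proc.stdout


def preflight_voices(voices_map: dict[str, Path], probe=has_audio_stream) -> None:
    """Fail fast on voice files that are missing or carry no audio stream."""
    for name, vp in voices_map.items():
        if not vp.exists():
            raise SystemExit(f"voice '{name}' missing on disk: {vp}")
        if not probe(vp):
            raise SystemExit(f"voice '{name}' has no audio stream: {vp}")
```

Running the probe in the same loop as the existence check keeps all voice-related failures in one place, before any segment has been rendered.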
Summary
Adds an independent SRT-driven editing pipeline plus a lexical recommender that bridges Scribe transcripts to it. All existing helpers (`render.py`, `grade.py`, `transcribe.py`, etc.) are untouched.

- `helpers/srt_driven_edit.py`: full `extract → gap → concat → final-compose` pipeline:
  - … (`position:90%` etc.)
  - … (`fps=24,setpts=PTS-STARTPTS` / `aresample=async=1:first_pts=0,asetpts`) on every clip
  - `--voice` spans the whole output timeline (mixed in the final compose, not per-segment)
  - … `--continue-on-error`, `--no-overwrite`
- `helpers/recommend_edit_plan.py`: bridges Scribe transcript JSON → `edit_plan.json`:
  - … `0.6 SequenceMatcher + 0.4 Jaccard` (token-level for Latin, char-2-gram for CJK) blended with duration similarity
  - … `srt_driven_edit --plan`) or Form B; sidecar `*_review.md` for human QA
  - `--packed` / `--context-window` flags reserved as placeholders (documented as such)
- `tests/`: 28 pytest tests using lavfi-synthesized media:
  - … `--no-overwrite`, gap insertion)
  - … `sde.run_job` → final.mp4
- `pyproject.toml`: adds `dev = ["pytest>=7"]` as an optional dependency.
- `CLAUDE.md` / `AGENTS.md`: project guidance for AI assistants working in this repo. Happy to remove or reword if these conflict with upstream framing.

Pipeline position
This complements the existing transcript-first EDL flow rather than replacing it — use the new pipeline when you already have a finished narration script and want to align it to a source recording.
Reviewer notes
- `srt_driven_edit.py` reuses 4 symbols from `render.py` via `try: from render import ...` with fallbacks, so it still runs if `render.py` is unavailable.
- … `ffmpeg` + `ffprobe` on `PATH`; `conftest.py` skips the whole `tests/` directory otherwise.
- `examples/srt_driven/_smoke_test.py` is a no-pytest fallback that covers the parser / encoding / cache-key layers.

Test plan
- `pip install -e ".[dev]"`
- `python -m pytest tests/ -v` (~40s on a typical machine)
- `python examples/srt_driven/_smoke_test.py`

🤖 Generated with Claude Code
Summary by cubic
Add a standalone SRT‑driven edit pipeline and a local edit‑plan recommender to align finished scripts to source footage without changing existing helpers. Enables an offline flow: script.srt + transcript.json → edit_plan.json → final.mp4.
New Features
- `helpers/srt_driven_edit.py`: End‑to‑end SRT‑driven pipeline (parse + validate + align → cached extract → gap insert → concat → final compose with global `--voice`), with safe‑ASCII temp paths, SRT encoding fallback, ffmpeg/ffprobe preflight, batch manifests, QC report, and “burn subtitles last” with short audio fades.
- `helpers/recommend_edit_plan.py`: Builds `edit_plan.json` from script.srt and Scribe word‑level transcript via local lexical scoring and greedy assignment; outputs Form A/B and a `*_review.md` for QA.

Dependencies
- `dev` extra: `pytest>=7`.

Written for commit 87439d1. Summary will update on new commits.