cc_quality: post-processing validator that ensures generated captions are actually readable #24
Open
bhuvan-somisetty wants to merge 1 commit into
… SRT files

Validates generated captions against WCAG 2.1, FCC 47 CFR § 79.1, and BBC Subtitle Guidelines before they reach viewers. Catches problems that detection modules cannot (reading speed, minimum on-screen duration, overlaps, and line-length limits) and auto-fixes timing violations without altering text.

- validator.py: five rules (MIN_DURATION, READING_SPEED, LINE_LENGTH, OVERLAP, MIN_GAP) with per-caption violation details and a 0-100 quality score
- optimizer.py: timestamp-only auto-fix pass + SRT serialiser
- cli.py: cc-quality CLI with --fix, --report json, --content-type children
- Full Hindi/Devanagari support (script auto-detected, tighter line limit applied)
- 39 passing unit tests with zero external dependencies
- Closes PlanetRead#23
What this is
Every PR in this project answers when to add a caption. This one answers the question that comes after: are the resulting captions usable by the viewers they are meant to serve?
A caption can be technically present in an SRT file and still be effectively invisible to a deaf student: it may flash on-screen for 0.4 seconds, pack in 60 words that scroll at 900 WPM, or overlap the next caption and produce a visual blur. This module catches those problems before the file reaches a viewer.
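The three failure modes above reduce to simple arithmetic on caption timestamps. The following is a minimal sketch of that arithmetic, not the code in validator.py: the `Caption` class, `words_per_minute`, and `check` are hypothetical names, and `MAX_WPM = 180` is an assumed placeholder because the PR does not state its limit. The 1.5 s minimum is the threshold listed under "Rules enforced".

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_DURATION_S = 1.5  # threshold listed under "Rules enforced"
MAX_WPM = 180         # ASSUMPTION: placeholder, the PR does not state its limit


@dataclass
class Caption:
    """Hypothetical caption record; times are in seconds."""
    start: float
    end: float
    text: str


def words_per_minute(caption: Caption) -> float:
    """Reading speed = whitespace-separated words / on-screen seconds * 60."""
    duration = caption.end - caption.start
    if duration <= 0:
        return float("inf")  # zero or negative duration can never be read
    return len(caption.text.split()) / duration * 60.0


def check(captions: list[Caption]) -> list[tuple[int, str]]:
    """Return (caption index, rule name) pairs for three of the five rules."""
    violations = []
    for i, cap in enumerate(captions):
        if cap.end - cap.start < MIN_DURATION_S:
            violations.append((i, "MIN_DURATION"))
        if words_per_minute(cap) > MAX_WPM:
            violations.append((i, "READING_SPEED"))
        # Overlap: this caption ends after the next one has already started.
        if i + 1 < len(captions) and cap.end > captions[i + 1].start:
            violations.append((i, "OVERLAP"))
    return violations


captions = [
    Caption(0.0, 0.4, "Hi"),                       # flashes for 0.4 s
    Caption(1.0, 5.0, " ".join(["word"] * 60)),    # 60 words in 4 s = 900 WPM
    Caption(4.5, 8.0, "Next caption here"),        # starts before the previous ends
]
violations = check(captions)
```

With this sample data, the first caption trips MIN_DURATION and the second trips both READING_SPEED and OVERLAP. Counting words by splitting on whitespace is a simplification; a real validator may count characters instead, which is common for scripts where word boundaries are less informative.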
What changed
cc_quality/ is a standalone post-processing module with zero ML dependencies and no coupling to any other PR.
Rules enforced
- MIN_DURATION: caption < 1.5 s on-screen
- READING_SPEED: WPM exceeds limit
- LINE_LENGTH: line too long to fit video frame
- OVERLAP: caption end > next caption start
- MIN_GAP: gap < 83 ms between captions
CLI
Example output on a problematic file:
As a library
Hindi / Devanagari
Script is auto-detected. Hindi captions get the tighter 28-character line limit and are preserved intact through parse → validate → optimize → write cycles. Test fixture included.
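One plausible way to auto-detect the script is to look for code points in the Unicode Devanagari block (U+0900 to U+097F). The sketch below illustrates that idea only; it is not the detection logic in this PR. The function names are hypothetical, the "any Devanagari character" rule is an assumption, and `DEFAULT_LINE_LIMIT = 42` is an assumed placeholder since the PR states only the 28-character Hindi limit.

```python
HINDI_LINE_LIMIT = 28    # stated in this PR
DEFAULT_LINE_LIMIT = 42  # ASSUMPTION: placeholder, not stated in the PR


def is_devanagari(text: str) -> bool:
    """True if any character falls in the Unicode Devanagari block."""
    return any(0x0900 <= ord(ch) <= 0x097F for ch in text)


def line_limit(text: str) -> int:
    """Pick the per-line character limit based on the detected script."""
    return HINDI_LINE_LIMIT if is_devanagari(text) else DEFAULT_LINE_LIMIT


def long_lines(text: str) -> list:
    """Return the lines of a caption that exceed the limit for its script."""
    limit = line_limit(text)
    return [line for line in text.splitlines() if len(line) > limit]
```

Note that `len()` counts Unicode code points, so Devanagari vowel signs and the virama each count as one character even though they render as part of a single visual cluster. Whether the PR counts code points or grapheme clusters is not stated here.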
Tests
The module has no external dependencies; the tests run with the standard library and pytest only.
Why this fits alongside the other PRs
This module sits at the end of every pipeline, not inside any one of them. It does not compete with Goal 1, 2, or 3 implementations; it validates whatever they produce. Any merged implementation can pipe its SRT output through cc-quality as a final quality gate.
Closes #23