cc_quality: post-processing validator that ensures generated captions are actually readable #24

Open
bhuvan-somisetty wants to merge 1 commit into PlanetRead:main from bhuvan-somisetty:feat/cc-quality-validator


bhuvan-somisetty commented May 14, 2026

What this is

The other PRs in this project answer when to add a caption. This one answers the question that comes after: are the resulting captions usable by the viewers they are meant to serve?

A caption can be technically present in an SRT file and still be invisible to a deaf student - if it flashes on-screen for 0.4 seconds, or contains 60 words that scroll at 900 WPM, or overlaps the next caption and produces a visual blur. This module catches those problems before the file reaches a viewer.

What changed

cc_quality/ - a standalone post-processing module, zero ML dependencies, no coupling to any other PR:

cc_quality/
├── models.py      # Caption, Violation, ValidationReport dataclasses
├── validator.py   # Five accessibility rules → quality score 0–100
├── optimizer.py   # Timestamp-only auto-fix + SRT serialiser
└── cli.py         # cc-quality CLI

tests/
├── test_validator.py   # 28 tests covering every rule and edge case
├── test_optimizer.py   # 11 tests for auto-fix and round-trip
└── fixtures/           # good.srt, violations.srt, hindi.srt
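
For orientation, here is a minimal sketch of what the three dataclasses in models.py might look like. Only the attributes used elsewhere in this description (`v.severity`, `v.rule`, `v.detail`, `report.violations`, `report.quality_score`, `report.passed()`) come from the PR. The `Caption` fields, the `caption_index` field, and the rule that `passed()` fails only on errors are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Caption:
    # Field names are assumed; the PR does not list them.
    index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Violation:
    rule: str            # e.g. "MIN_DURATION"
    severity: str        # "error" or "warning"
    caption_index: int   # assumed field
    detail: str


@dataclass
class ValidationReport:
    violations: list[Violation] = field(default_factory=list)
    quality_score: float = 100.0

    def passed(self) -> bool:
        # Assumption: warnings alone do not fail a file.
        return not any(v.severity == "error" for v in self.violations)
```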

Rules enforced

| Rule | Severity | Standard |
| --- | --- | --- |
| MIN_DURATION - caption < 1.5 s on-screen | error | BBC Subtitle Guidelines 2024 |
| READING_SPEED - WPM exceeds limit | error | FCC 47 CFR § 79.1 (220 adult / 130 children) |
| LINE_LENGTH - line too long to fit video frame | warning | BBC (42 chars Latin, 28 Devanagari) |
| OVERLAP - caption end > next caption start | error | WCAG 2.1 SC 1.2.2 |
| MIN_GAP - gap < 83 ms between captions | warning | ~2 frames at 24 fps |
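
The four timing rules above can be sketched in a few lines. This is an illustration built from the thresholds in the table, not the code in validator.py: the `Cue` type, the function names, and counting words by whitespace splitting are all assumptions. LINE_LENGTH is left out because it depends on script detection.

```python
from typing import List, NamedTuple, Tuple


class Cue(NamedTuple):
    """Hypothetical stand-in for the module's Caption type."""
    start: float  # seconds
    end: float    # seconds
    text: str


MIN_DURATION_S = 1.5                          # MIN_DURATION threshold
MIN_GAP_S = 0.083                             # MIN_GAP threshold (83 ms)
WPM_LIMITS = {"adult": 220, "children": 130}  # READING_SPEED limits


def words_per_minute(cue: Cue) -> float:
    duration = cue.end - cue.start
    if duration <= 0:
        return float("inf")
    # Assumption: a "word" is any whitespace-separated token.
    return len(cue.text.split()) * 60.0 / duration


def check_timing(cues: List[Cue], content_type: str = "adult") -> List[Tuple[str, int]]:
    """Return (rule, cue position) pairs for every timing rule that fires."""
    found = []
    limit = WPM_LIMITS[content_type]
    for i, cue in enumerate(cues):
        if cue.end - cue.start < MIN_DURATION_S:
            found.append(("MIN_DURATION", i))
        if words_per_minute(cue) > limit:
            found.append(("READING_SPEED", i))
        if i + 1 < len(cues):
            gap = cues[i + 1].start - cue.end
            if gap < 0:
                found.append(("OVERLAP", i))
            elif gap < MIN_GAP_S:
                found.append(("MIN_GAP", i))
    return found
```

Checking overlap and gap against the next cue only assumes the cues are already sorted by start time, which is the normal case for a generated SRT file.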

CLI

pip install -e .

# Validate
cc-quality output.srt
cc-quality output.srt --content-type children
cc-quality output.srt --report json

# Auto-fix timing violations (never changes caption text)
cc-quality output.srt --fix
cc-quality output.srt --fix --output reviewed.srt

Example output on a problematic file:

────────────────────────────────────────────────────────────
  CC Quality Report  ·  output.srt
────────────────────────────────────────────────────────────
  Quality score : 74.0 / 100
  Captions      : 4  |  Errors : 2  |  Warnings : 1
────────────────────────────────────────────────────────────

  ✗ [MIN_DURATION] Caption #1  @00:01.00
     Caption displays for 0.80s (minimum 1.5s per BBC guidelines)
     → Extend end time to 2.500

  ✗ [READING_SPEED] Caption #2  @00:03.00
     Reading speed 960 WPM exceeds FCC limit of 220 WPM
     → Extend display duration or shorten the caption text
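
The numbers in this report follow from simple arithmetic, sketched below. The start time, displayed duration, and thresholds come from the report and the rules table. The word count and duration used for caption #2 (16 words shown for 1.0 s) are hypothetical values chosen to reproduce 960 WPM, since the report does not state them.

```python
MIN_DURATION_S = 1.5     # from the rules table
ADULT_WPM_LIMIT = 220    # from the rules table


def reading_speed_wpm(word_count: int, duration_s: float) -> float:
    """Words per minute for a caption shown for duration_s seconds."""
    return word_count * 60.0 / duration_s


def seconds_needed(word_count: int, wpm_limit: float) -> float:
    """Shortest display time that keeps word_count words at or under the limit."""
    return word_count * 60.0 / wpm_limit


# Caption #1 starts at 1.000 s and is shown for only 0.80 s.
suggested_end = 1.000 + MIN_DURATION_S

# Caption #2: hypothetical 16 words shown for 1.0 s.
speed = reading_speed_wpm(16, 1.0)
needed = seconds_needed(16, ADULT_WPM_LIMIT)

print(f"{suggested_end:.3f}")  # 2.500
print(f"{speed:.0f} WPM")      # 960 WPM
print(f"{needed:.2f} s")       # 4.36 s
```

Under these assumed numbers the caption would need more than four times its display time, which is why the report also offers shortening the text as a fix.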

As a library

from cc_quality import parse_srt, validate, optimize, write_srt

# Read as UTF-8 so Devanagari captions survive the round trip
with open("output.srt", encoding="utf-8") as f:
    captions = parse_srt(f.read())
report = validate(captions, content_type="adult")

print(f"Score: {report.quality_score:.1f}/100")
for v in report.violations:
    print(f"[{v.severity}] {v.rule}: {v.detail}")

if not report.passed():
    fixed = optimize(captions)  # adjusts timestamps only, never caption text
    with open("fixed.srt", "w", encoding="utf-8") as f:
        f.write(write_srt(fixed))
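
To give a feel for what a timestamp-only fix pass involves, here is one possible sketch. It is not the logic in optimizer.py: the tuple representation, the function name `fix_timing`, and the policy of preferring a clean gap over the minimum duration when the two conflict are all assumptions.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start, end, text), a hypothetical representation


def fix_timing(cues: List[Cue], min_duration: float = 1.5, min_gap: float = 0.083) -> List[Cue]:
    """Adjust end times only; start times and text are left untouched."""
    fixed = []
    for i, (start, end, text) in enumerate(cues):
        # Extend captions that are too short.
        end = max(end, start + min_duration)
        if i + 1 < len(cues):
            # Never run into the next caption: stop min_gap before it starts,
            # but do not move the end before this caption's own start.
            latest_end = cues[i + 1][0] - min_gap
            end = max(start, min(end, latest_end))
        fixed.append((start, round(end, 3), text))
    return fixed
```

With this policy a caption squeezed between two others can still end up shorter than 1.5 s, so re-running validation after a fix pass is worthwhile.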

Hindi / Devanagari

Script is auto-detected. Hindi captions get the tighter 28-character line limit and are preserved intact through parse → validate → optimize → write cycles. Test fixture included.
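
A minimal sketch of how script detection and the per-script line limit could work. It assumes detection by counting characters in the Unicode Devanagari block (U+0900 to U+097F) and measuring line length in code points. The PR does not say how the module detects script or whether it counts code points or grapheme clusters, so both choices, and the function names, are illustrative.

```python
from typing import List

LINE_LIMITS = {"latin": 42, "devanagari": 28}  # limits from the rules table


def detect_script(text: str) -> str:
    """Classify a caption by whichever script contributes more characters."""
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "devanagari" if devanagari > latin else "latin"


def long_lines(text: str) -> List[str]:
    """Return the lines of a caption that exceed the limit for its script."""
    limit = LINE_LIMITS[detect_script(text)]
    # Assumption: length is measured in code points, so combining vowel
    # signs (matras) each count as one character.
    return [line for line in text.splitlines() if len(line) > limit]
```

Counting code points is the simplest option but treats a consonant plus its vowel sign as two characters; a grapheme-based count would be more forgiving for Hindi text.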

Tests

39 passed in 0.17s

No runtime dependencies beyond the standard library - pytest is needed only to run the tests.

Why this fits alongside the other PRs

This module sits at the end of every pipeline, not inside any one of them. It does not compete with Goal 1, 2, or 3 implementations - it validates whatever they produce. Any merged implementation can pipe its SRT output through cc-quality as a final quality gate.

Closes #23

… SRT files

Validates generated captions against WCAG 2.1, FCC 47 CFR § 79.1, and BBC
Subtitle Guidelines before they reach viewers. Catches problems that detection
modules cannot — reading speed, minimum on-screen duration, overlaps, and
line-length limits — and auto-fixes timing violations without altering text.

- validator.py: five rules (MIN_DURATION, READING_SPEED, LINE_LENGTH, OVERLAP,
  MIN_GAP) with per-caption violation details and a 0-100 quality score
- optimizer.py: timestamp-only auto-fix pass + SRT serialiser
- cli.py: cc-quality CLI with --fix, --report json, --content-type children
- Full Hindi/Devanagari support (script auto-detected, tighter line limit applied)
- 39 passing unit tests with zero external dependencies
- Closes PlanetRead#23

Development

Successfully merging this pull request may close these issues.

Caption output quality: generated SRT files should meet WCAG/FCC accessibility standards
