feat: syllabic caption localizer for 9 Indian languages with RCI accessibility scoring by bhuvan-somisetty · Pull Request #29 · PlanetRead/Intelligent-cc-generation

bhuvan-somisetty · 2026-05-16T22:45:15Z

PlanetRead's CC pipeline emits English labels like [alarm] and [tabla] regardless of what language the viewer reads. For a low-literacy Telugu student in a classroom, that caption is effectively invisible: they can hear the sound, but the text does not reach them.

This PR adds src/caption/localizer.py: a zero-dependency localization layer that translates every CC event into the viewer's language before it reaches the screen. Nine languages are supported: Hindi, Telugu, Tamil, Kannada, Bengali, Marathi, Gujarati, Punjabi, and English. All 25 CC labels including the India-specific ones ([tabla], [dhol], [temple bells], [firecrackers]) are translated for each language in their native script.

Pipeline position

The localizer sits between the fusion engine (PR #28) and the SRT/SLS writer (PR #24). CaptionSignal is interface-compatible with the CCDecision dataclass from the decision engine, so no adapters are needed.

Syllabic reading unit counting

Every existing CC system, BBC/FCC spec included, calculates display duration from word count. That works for English but badly underestimates reading load for Indic scripts. A single Telugu compound like హర్షధ్వానాలు spans five syllables. Counting it as "one word" gives a display time 5× too short for a low-literacy learner.

This module counts aksharas: Unicode vowel nuclei (independent vowel letters + dependent vowel signs / matras) across all seven Indic script blocks. Each vowel nucleus corresponds to one syllable, the real cognitive unit of reading in Indic orthography.
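As a rough illustration of the counting idea, the sketch below counts one reading unit per syllable onset. It is not the code in localizer.py: `count_aksharas`, `is_indic`, and `INDIC_BLOCKS` are hypothetical names, the block ranges are taken from the Unicode charts, and the technique is a simplification. Instead of matching vowel signs directly, it counts every base letter that is not joined to the preceding consonant by a virama, so consonants that carry only the inherent vowel are counted too.

```python
import unicodedata

# Unicode blocks for the seven Indic scripts covered by the nine languages.
# Assumption: ranges come from the Unicode block charts, not from the PR.
INDIC_BLOCKS = (
    (0x0900, 0x097F),  # Devanagari (Hindi, Marathi)
    (0x0980, 0x09FF),  # Bengali
    (0x0A00, 0x0A7F),  # Gurmukhi (Punjabi)
    (0x0A80, 0x0AFF),  # Gujarati
    (0x0B80, 0x0BFF),  # Tamil
    (0x0C00, 0x0C7F),  # Telugu
    (0x0C80, 0x0CFF),  # Kannada
)

# Viramas (including the Tamil pulli) have canonical combining class 9.
VIRAMA_COMBINING_CLASS = 9


def is_indic(ch):
    """True if the character lies in one of the seven script blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in INDIC_BLOCKS)


def count_aksharas(text):
    """Count syllable onsets: base letters not preceded by a virama."""
    count = 0
    prev_is_virama = False
    for ch in text:
        # Consonants and independent vowels are category "Lo";
        # vowel signs, viramas, and digits are not.
        is_base_letter = is_indic(ch) and unicodedata.category(ch) == "Lo"
        if is_base_letter and not prev_is_virama:
            count += 1
        prev_is_virama = unicodedata.combining(ch) == VIRAMA_COMBINING_CLASS
    return count


print(count_aksharas("హర్షధ్వానాలు"))  # 5
```

Text outside the seven blocks, such as the English label "alarm", yields 0 in this sketch, so a real implementation would need a separate fallback for Latin script.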

Reading speed is expressed in syllables per minute (SPM) calibrated per script per literacy tier:

Dravidian scripts (Telugu, Tamil, Kannada) are calibrated slower than Devanagari. Tamil sits at 100 SPM for low-literacy because its consonant cluster ligatures demand extra orthographic decoding per reading unit. Using a single shared display time is not a minor inaccuracy: it systematically disadvantages the learners PlanetRead exists to serve.
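The arithmetic behind an SPM-based display time can be sketched as follows. `required_duration_s` is a hypothetical helper, not a function from this PR, and the 100 SPM rate is simply the Tamil low-literacy figure quoted above, reused here as an illustrative value.

```python
def required_duration_s(akshara_count, spm):
    """Seconds needed to read `akshara_count` units at `spm` syllables per minute."""
    if spm <= 0:
        raise ValueError("spm must be positive")
    return 60.0 * akshara_count / spm


# A five-syllable caption treated as "one word" versus counted per syllable.
as_one_unit = required_duration_s(1, 100)
as_five_units = required_duration_s(5, 100)
print(as_one_unit, as_five_units)
```

At the same rate, the per-syllable count asks for five times the display time of the one-unit count, which is the gap described in the word-count comparison earlier.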

Reading Complexity Index

batch_localize_with_report() returns a LocalizationReport alongside the captions. The report includes a Reading Complexity Index (RCI) per caption:

RCI = min(100, 100 × display_duration / low_literacy_required_duration)

RCI = 100 means even the weakest reader in that language has enough time. RCI < 60 flags content for review. The two heatmaps below show why this matters:

Left panel: when content is calibrated for low-literacy viewers, RCI is 100 across the board. Right panel: when the same content is calibrated for expert viewers, Tamil and Telugu start producing red cells: captions that fluent readers can follow but low-literacy learners cannot. The RCI makes that gap measurable and actionable.
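The RCI formula and the review threshold stated above can be sketched in a few lines. The function and constant names are hypothetical, and the handling of a zero required duration is an assumption of this sketch rather than documented behavior.

```python
RCI_REVIEW_THRESHOLD = 60.0  # captions scoring below this are flagged for review


def reading_complexity_index(display_duration_s, low_literacy_required_s):
    """RCI = min(100, 100 * display_duration / low_literacy_required_duration)."""
    if low_literacy_required_s <= 0:
        # Assumption: a caption with nothing to read is treated as fully accessible.
        return 100.0
    return min(100.0, 100.0 * display_duration_s / low_literacy_required_s)


def needs_review(rci):
    """True when the score falls under the review threshold."""
    return rci < RCI_REVIEW_THRESHOLD


# A caption shown for 1.5 s when a low-literacy reader needs 3.0 s.
score = reading_complexity_index(1.5, 3.0)
print(score, needs_review(score))  # 50.0 True
```

Capping at 100 means extra display time never raises the score, so the index only measures shortfall for the slowest readers.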

The full accessibility analysis across all 9 languages and 9 labels:

Running the demo

# Telugu, low-literacy (default)
python scripts/demo_localize.py --lang te

# Tamil, expert calibration: watch the RCI flag 2 captions
python scripts/demo_localize.py --lang ta --level expert

# All supported languages
python scripts/demo_localize.py --list-languages

Public API

from src.caption import localize, batch_localize, batch_localize_with_report
from src.caption import LiteracyLevel  # assumed to be exported alongside the functions above

# Single event
cap = localize(signal, language="te", literacy_level=LiteracyLevel.LOW_LITERACY)
# cap.label_local    → translated label in Telugu script
# cap.akshara_count  → syllabic reading units
# cap.rci            → 0–100 accessibility score for low-literacy viewers
# cap.end_s          → duration adjusted for Telugu SPM at LOW_LITERACY tier

# Batch with accessibility report
captions, report = batch_localize_with_report(signals, "ta", LiteracyLevel.EXPERT)
# report.mean_rci              → average accessibility score
# report.under_threshold_count → captions flagged as too fast
# report.coverage_pct          → % of events with a native-script translation
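To make the three report fields concrete, here is a minimal sketch of how such a summary could be aggregated from per-caption values. `ReportSketch` and `summarize` are hypothetical stand-ins, not the PR's LocalizationReport; only the field names and the threshold of 60 come from the description above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReportSketch:
    mean_rci: float             # average accessibility score
    under_threshold_count: int  # captions flagged as too fast
    coverage_pct: float         # % of events with a native-script translation


def summarize(rcis, has_native_translation, threshold=60.0):
    """Aggregate per-caption RCI scores and translation flags into a summary."""
    if not rcis or len(rcis) != len(has_native_translation):
        raise ValueError("expected two non-empty sequences of equal length")
    return ReportSketch(
        mean_rci=mean(rcis),
        under_threshold_count=sum(1 for r in rcis if r < threshold),
        coverage_pct=100.0 * sum(has_native_translation) / len(rcis),
    )


report = summarize([100.0, 50.0, 80.0], [True, True, False])
print(report)
```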

Tests

136 tests cover all 9 languages, all 25 labels, Unicode script detection, akshara counting for every script block, SPM floor and ceiling clamps, literacy level ordering, RCI computation, LocalizationReport fields, batch_localize_with_report behavior, fallback for unknown labels and language codes, and score passthrough. CI from PR #25 runs them on Python 3.10, 3.11, and 3.12.

136 passed in 0.14s

…ssibility scoring Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>

feat: syllabic caption localizer for 9 Indian languages with RCI acce…

5982914

bhuvan-somisetty wants to merge 1 commit into PlanetRead:main from bhuvan-somisetty:feat/caption-enricher
