
feat: syllabic caption localizer for 9 Indian languages with RCI accessibility scoring #29

Open
bhuvan-somisetty wants to merge 1 commit into PlanetRead:main from bhuvan-somisetty:feat/caption-enricher



bhuvan-somisetty commented May 16, 2026


PlanetRead's CC pipeline emits English labels like [alarm] and [tabla] regardless of what language the viewer reads. For a low-literacy Telugu student in a classroom, that caption is invisible: they can hear the sound, but the text does not reach them.

This PR adds src/caption/localizer.py: a zero-dependency localization layer that translates every CC event into the viewer's language before it reaches the screen. Nine languages are supported: Hindi, Telugu, Tamil, Kannada, Bengali, Marathi, Gujarati, Punjabi, and English. All 25 CC labels, including the India-specific ones ([tabla], [dhol], [temple bells], [firecrackers]), are translated into each language's native script.
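As a rough illustration of the lookup-with-fallback idea, the sketch below maps a bracketed label to a native-script label. The table contents, the `localize_label` name, and the choice to fall back to the original English label are assumptions made for this example, not the PR's actual table or API.

```python
# Hypothetical two-entry table for illustration only; the PR describes
# a table covering 25 labels across 9 languages.
TRANSLATIONS = {
    "hi": {"tabla": "तबला"},
    "te": {"tabla": "తబలా"},
}


def localize_label(label: str, language: str) -> str:
    """Return the bracketed label in the viewer's script.

    Falls back to the original label when the language or the label
    is not in the table (assumed behavior for this sketch).
    """
    key = label.strip("[]").lower()
    local = TRANSLATIONS.get(language, {}).get(key)
    return f"[{local}]" if local is not None else label


print(localize_label("[tabla]", "te"))   # known language and label
print(localize_label("[alarm]", "te"))   # label missing from the table
print(localize_label("[tabla]", "xx"))   # unknown language code
```

Keeping the table as plain nested dictionaries is one way to stay dependency-free; a missing entry degrades to the English label rather than raising.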

Pipeline position

Pipeline architecture

The localizer sits between the fusion engine (PR #28) and the SRT/SLS writer (PR #24). CaptionSignal is interface-compatible with the CCDecision dataclass from the decision engine, so no adapters are needed.

Syllabic reading unit counting

Every existing CC system, BBC/FCC spec included, calculates display duration from word count. That works for English but badly underestimates reading load for Indic scripts. A single Telugu compound like హర్షధ్వానాలు spans five syllables. Counting it as "one word" gives a display time 5× too short for a low-literacy learner.

This module counts aksharas: Unicode vowel nuclei (independent vowel letters + dependent vowel signs / matras) across all seven Indic script blocks. Each vowel nucleus corresponds to one syllable, the real cognitive unit of reading in Indic orthography.
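One possible way to count syllabic units with only the standard library is sketched below. Note that it uses a slightly different rule from the one described above: it counts every base letter (Unicode category `Lo`) that is not silenced by a following virama, so consonants carrying only the inherent vowel are included. The function name and this rule are assumptions for illustration, not the PR's implementation, and the sketch is meant for Indic text only.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class shared by Indic viramas
NUKTA_CCC = 7   # canonical combining class of the nukta sign


def count_aksharas(text: str) -> int:
    """Approximate the syllable count of Indic text.

    Counts base letters (category 'Lo') that are not followed by a
    virama. A consonant followed by a virama belongs to a conjunct or
    is a vowel-less final, so it does not start a new syllable.
    """
    text = unicodedata.normalize("NFC", text)
    count = 0
    for i, ch in enumerate(text):
        if unicodedata.category(ch) != "Lo":
            continue  # matras, viramas, digits, spaces, Latin letters
        j = i + 1
        # A nukta may sit between the consonant and its virama.
        while j < len(text) and unicodedata.combining(text[j]) == NUKTA_CCC:
            j += 1
        if j < len(text) and unicodedata.combining(text[j]) == VIRAMA_CCC:
            continue
        count += 1
    return count


print(count_aksharas("హర్షధ్వానాలు"))  # 5, matching the example above
```

Relying on Unicode categories and combining classes avoids hard-coding code point ranges for each script block, at the cost of also counting letters from unrelated scripts that share category `Lo`.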

Reading speed is expressed in syllables per minute (SPM) calibrated per script per literacy tier:

Reading speed calibration

The scripts of the Dravidian languages (Telugu, Tamil, Kannada) are calibrated slower than Devanagari. Tamil sits at 100 SPM for low-literacy because its consonant cluster ligatures demand extra orthographic decoding per reading unit. Using a single shared display time is not a minor inaccuracy: it systematically disadvantages the learners PlanetRead exists to serve.
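The relationship between syllable count, SPM, and display time can be sketched as below. The function name is hypothetical, and the only calibration value taken from this description is Tamil at 100 SPM for low-literacy viewers; the module's floor and ceiling clamps are omitted.

```python
def required_duration_s(akshara_count: int, spm: float) -> float:
    """Seconds needed to read `akshara_count` syllables at `spm`
    syllables per minute (no clamping applied in this sketch)."""
    if spm <= 0:
        raise ValueError("spm must be positive")
    return akshara_count * 60.0 / spm


# Tamil, low-literacy tier (100 SPM, from the description above):
print(required_duration_s(5, 100))  # 3.0 seconds for a 5-syllable caption
```

At that rate a five-syllable label needs three seconds on screen, whereas a word-count rule would treat it as a single unit.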

Reading Complexity Index

batch_localize_with_report() returns a LocalizationReport alongside the captions. The report includes a Reading Complexity Index (RCI) per caption:

RCI = min(100, 100 × display_duration / low_literacy_required_duration)

RCI = 100 means even the weakest reader in that language has enough time. RCI < 60 flags content for review. The two heatmaps below show why this matters:

RCI heatmap

Left panel: when content is calibrated for low-literacy viewers, RCI is 100 across the board. Right panel: when the same content is calibrated for expert viewers, Tamil and Telugu start producing red cells: captions that fluent readers can follow but low-literacy learners cannot. The RCI makes that gap measurable and actionable.
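The RCI formula and the review threshold stated above translate directly into a few lines. The function names and the handling of a zero required duration are assumptions of this sketch.

```python
REVIEW_THRESHOLD = 60.0  # RCI below this flags a caption for review


def rci(display_duration: float, low_literacy_required_duration: float) -> float:
    """RCI = min(100, 100 * display_duration / required_duration)."""
    if low_literacy_required_duration <= 0:
        return 100.0  # nothing to read; assumed to be fully accessible
    return min(100.0, 100.0 * display_duration / low_literacy_required_duration)


def needs_review(score: float) -> bool:
    return score < REVIEW_THRESHOLD


# A caption shown for 1.5 s that a low-literacy reader needs 3.0 s to read:
score = rci(1.5, 3.0)
print(score, needs_review(score))  # 50.0 True
```

Because the score is capped at 100, extra display time beyond what the slowest reader needs does not raise it further.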

The full accessibility analysis across all 9 languages and 9 labels:

Accessibility analysis

Running the demo

# Telugu, low-literacy (default)
python scripts/demo_localize.py --lang te

# Tamil, expert calibration: watch the RCI flag 2 captions
python scripts/demo_localize.py --lang ta --level expert

# All supported languages
python scripts/demo_localize.py --list-languages

Public API

# LiteracyLevel is assumed to be exported from src.caption alongside the functions
from src.caption import LiteracyLevel, localize, batch_localize, batch_localize_with_report

# Single event
cap = localize(signal, language="te", literacy_level=LiteracyLevel.LOW_LITERACY)
# cap.label_local    → translated label in Telugu script
# cap.akshara_count  → syllabic reading units
# cap.rci            → 0–100 accessibility score for low-literacy viewers
# cap.end_s          → duration adjusted for Telugu SPM at LOW_LITERACY tier

# Batch with accessibility report
captions, report = batch_localize_with_report(signals, "ta", LiteracyLevel.EXPERT)
# report.mean_rci            → average accessibility score
# report.under_threshold_count → captions flagged as too fast
# report.coverage_pct        → % of events with a native-script translation
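For readers wondering how the three report fields might be derived, here is a minimal sketch. `ReportSketch` and `summarize` are hypothetical names, and the real LocalizationReport may carry more fields or compute these differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReportSketch:
    mean_rci: float             # average RCI across captions
    under_threshold_count: int  # captions with RCI below the threshold
    coverage_pct: float         # % of events with a native-script translation


def summarize(rcis, translated, threshold=60.0):
    """Aggregate per-caption RCI scores and translation flags.

    `rcis` is a list of floats; `translated` is a same-length list of
    booleans saying whether each event had a native-script translation.
    """
    if len(rcis) != len(translated):
        raise ValueError("rcis and translated must have the same length")
    if not rcis:
        return ReportSketch(0.0, 0, 0.0)
    return ReportSketch(
        mean_rci=mean(rcis),
        under_threshold_count=sum(1 for r in rcis if r < threshold),
        coverage_pct=100.0 * sum(translated) / len(translated),
    )


print(summarize([100.0, 50.0], [True, False]))
```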

Tests

136 tests cover all 9 languages, all 25 labels, Unicode script detection, akshara counting for every script block, SPM floor and ceiling clamps, literacy level ordering, RCI computation, LocalizationReport fields, batch_localize_with_report behavior, fallback for unknown labels and language codes, and score passthrough. CI from PR #25 runs them on Python 3.10, 3.11, and 3.12.

136 passed in 0.14s

…ssibility scoring

Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>