feat: category-aware CC decision engine (Goal 3) by bhuvan-somisetty · Pull Request #28 · PlanetRead/Intelligent-cc-generation

bhuvan-somisetty · 2026-05-16T15:56:24Z

Goal 1 detects what a sound is. Goal 2 detects whether anyone on screen reacted to it. But neither of those alone is enough to decide whether to add a caption - and that decision is the whole point of this tool.

This is Goal 3: the fusion engine that combines both signals and makes the call.

The core problem this solves

Without a decision engine, you have two bad options: caption everything (overcaptioning with noise that viewers learn to ignore) or caption only sounds above a single arbitrary threshold (missing real events that had low audio confidence but strong visual reactions). Neither serves a hearing-impaired student watching a Hindi educational video.

The engine introduces a third approach: category-aware fusion, where the rules for firing a caption depend on what kind of sound was detected.

Decision logic — three tiers

Sound event detected by Goal 1
          |
          v
    What category?
    /           |           \
HIGH_IMPACT   AMBIENT     GENERAL
    |           |           |
 (gunshot,   (music,    (applause,
explosion,   rain,       crying,
  alarm,     wind,      dog bark,
  siren,    traffic)     tabla,
glass brk)              crowd...)
    |           |           |
Audio-led   Gate first:  Weighted
80/20 mix.  reaction    fusion.
No visual   must exist.  60/40
required.   Then 35/65  audio-led.
            audio/visual.
            |           |
            v           v
      combined >= 0.55 or REJECT
      (avoids ambient overcaption)

HIGH_IMPACT events (gunshot, explosion, alarm, siren, glass breaking): a strong audio signal alone is sufficient. A visual reaction can rescue lower-confidence detections: if the scene reacts strongly to what might be a suppressed gunshot, that is worth a caption even if YAMNet scored it only 0.35.

AMBIENT events (music, rain, wind, traffic): these are the overcaption trap. A lesson with rain outside will trigger [rain] at every YAMNet frame. The engine first gates ambient sounds on a minimum visual reaction score (0.35 in the example below), then requires the weighted combined score to clear a higher threshold (0.55 in the diagram above). Music playing in the background with no visible reaction from anyone on screen gets no caption.

GENERAL events (applause, crying, dog barking, tabla, firecrackers, etc.): audio leads at 60%, visual fills in the gap. A strong audio detection passes without any visual reaction. A borderline detection can still be accepted if the speaker clearly reacts.
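The routing above can be sketched as follows. This is a minimal illustration, not the code in src/fusion/engine.py. The weights (80/20, 35/65, 60/40), the ambient reaction minimum of 0.35, and the 0.55 ambient threshold are taken from this description. The HIGH_IMPACT and GENERAL thresholds (0.45 and 0.50), the label sets, and the names `categorize` and `decide_sketch` are assumptions made for the example.

```python
from enum import Enum
from typing import Tuple


class Category(Enum):
    HIGH_IMPACT = "high_impact"
    AMBIENT = "ambient"
    GENERAL = "general"


# Label sets copied from the description; the real engine may list more.
HIGH_IMPACT_LABELS = {"gunshot", "explosion", "alarm", "siren", "glass breaking"}
AMBIENT_LABELS = {"music", "rain", "wind", "traffic"}

# (audio weight, visual weight) per category, as stated in the description.
WEIGHTS = {
    Category.HIGH_IMPACT: (0.80, 0.20),
    Category.AMBIENT: (0.35, 0.65),
    Category.GENERAL: (0.60, 0.40),
}

# Only the ambient value (0.55) appears in the description.
THRESHOLDS = {
    Category.HIGH_IMPACT: 0.45,  # ASSUMED for illustration
    Category.AMBIENT: 0.55,
    Category.GENERAL: 0.50,      # ASSUMED for illustration
}
AMBIENT_MIN_REACTION = 0.35  # from the ambient rejection reason string


def categorize(label: str) -> Category:
    """Map a caption label such as "[rain]" to a category; unknown labels are GENERAL."""
    name = label.strip("[]").lower()
    if name in HIGH_IMPACT_LABELS:
        return Category.HIGH_IMPACT
    if name in AMBIENT_LABELS:
        return Category.AMBIENT
    return Category.GENERAL


def decide_sketch(label: str, audio_conf: float, reaction: float) -> Tuple[bool, float, str]:
    """Return (accepted, combined_score, short_reason) for one detected sound."""
    category = categorize(label)
    w_audio, w_visual = WEIGHTS[category]
    combined = w_audio * audio_conf + w_visual * reaction
    # Ambient sounds must pass the reaction gate before the score is considered.
    if category is Category.AMBIENT and reaction < AMBIENT_MIN_REACTION:
        return False, combined, "ambient gate: reaction below minimum"
    if combined >= THRESHOLDS[category]:
        return True, combined, "combined score cleared threshold"
    return False, combined, "combined score below threshold"
```

Under these assumed thresholds, `decide_sketch("[music]", 0.92, 0.08)` is rejected at the reaction gate despite its high audio confidence, while `decide_sketch("[gunshot]", 0.35, 0.9)` is accepted because the visual reaction lifts the combined score.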

What each decision looks like

Every CCDecision carries a plain-English reason field explaining what happened:

# A gunshot accepted on audio confidence alone
CCDecision(
    accepted=True,
    label="[gunshot]",
    start_s=12.48,
    end_s=12.96,
    audio_confidence=0.81,
    reaction_score=0.0,
    combined_score=0.648,
    reason="high-impact event: combined score 0.65 (audio 0.81 × 0.8 + reaction 0.0 × 0.2)"
)

# Background music rejected because nobody reacted
CCDecision(
    accepted=False,
    label="[music]",
    start_s=45.12,
    end_s=46.08,
    audio_confidence=0.92,
    reaction_score=0.08,
    combined_score=0.374,
    reason="ambient sound rejected: reaction score 0.08 below minimum 0.35 (no visible scene response - likely background noise)"
)

# Tabla accepted - GENERAL category, strong audio confidence
CCDecision(
    accepted=True,
    label="[tabla]",
    start_s=30.0,
    end_s=32.4,
    audio_confidence=0.85,
    reaction_score=0.0,
    combined_score=0.51,
    reason="accepted: combined score 0.51 (audio 0.85 × 0.6 + reaction 0.0 × 0.4)"
)

The reason strings are designed to be useful for editors reviewing suggestions - they can see at a glance whether a rejection was due to weak audio, no visual reaction, or just not clearing the combined threshold.
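To make the reason format concrete, here is a hedged sketch of how such a string could be assembled from the weights. The CCDecision fields mirror the examples above; the helper `weighted_reason` and the choice of a frozen dataclass are hypothetical and may not match the module.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CCDecision:
    accepted: bool
    label: str
    start_s: float
    end_s: float
    audio_confidence: float
    reaction_score: float
    combined_score: float
    reason: str


def weighted_reason(prefix: str, audio: float, reaction: float,
                    w_audio: float, w_visual: float) -> Tuple[float, str]:
    """Compute the weighted score and spell out the arithmetic for editors."""
    combined = w_audio * audio + w_visual * reaction
    reason = (
        f"{prefix}: combined score {combined:.2f} "
        f"(audio {audio} × {w_audio} + reaction {reaction} × {w_visual})"
    )
    return combined, reason


# Rebuild the tabla example: GENERAL category, 60/40 weights.
combined, reason = weighted_reason("accepted", 0.85, 0.0, 0.6, 0.4)
tabla = CCDecision(True, "[tabla]", 30.0, 32.4, 0.85, 0.0, round(combined, 3), reason)
print(tabla.reason)
# accepted: combined score 0.51 (audio 0.85 × 0.6 + reaction 0.0 × 0.4)
```

The same helper with 0.8/0.2 weights gives 0.81 × 0.8 = 0.648 for the gunshot, which the reason string rounds to 0.65.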

How it connects to the rest of the pipeline

  src/audio/detector.py  (Goal 1)
  AudioEvent { label, start_s, end_s, confidence }
                        |
                        v
  src/vision/reaction.py  (Goal 2)
  ReactionResult { reaction_score }
                        |
                        v
  src/fusion/engine.py  (this PR — Goal 3)
  decide(AudioSignal, VisualSignal) -> CCDecision
                        |
              accepted? |
             /          \
           yes           no
            |             |
     write SRT/SLS    discard
    (Goals 1+2 PRs)

AudioSignal and VisualSignal are thin dataclasses defined in this module - the engine does not import from Goal 1 or Goal 2 directly, so it can be reviewed and merged independently.
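One way the pipeline glue could bridge the modules is a small adapter that copies plain values into the engine's dataclasses. In this sketch, AudioEvent and ReactionResult are stand-ins built from the field names in the diagram above. The field layout of AudioSignal and VisualSignal, the function `to_signals`, and the choice to treat a missing reaction as 0.0 are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Stand-ins for the Goal 1 and Goal 2 outputs named in the pipeline diagram.
@dataclass
class AudioEvent:
    label: str
    start_s: float
    end_s: float
    confidence: float


@dataclass
class ReactionResult:
    reaction_score: float


# Hypothetical field layout for the engine-side dataclasses.
@dataclass(frozen=True)
class AudioSignal:
    label: str
    start_s: float
    end_s: float
    confidence: float


@dataclass(frozen=True)
class VisualSignal:
    reaction_score: float


def to_signals(event: AudioEvent,
               reaction: Optional[ReactionResult]) -> Tuple[AudioSignal, VisualSignal]:
    """Copy values across the module boundary so the engine needs no upstream imports."""
    audio = AudioSignal(event.label, event.start_s, event.end_s, event.confidence)
    # ASSUMPTION: no reaction result is treated as "nobody reacted".
    score = reaction.reaction_score if reaction is not None else 0.0
    return audio, VisualSignal(score)
```

Keeping the adapter outside the engine module is what would let the fusion code be tested with hand-built signals only.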

FusionConfig exposes every threshold and weight as a dataclass field with documented defaults. Researchers who want to tune precision vs. recall for a specific content type can do so without touching the logic.
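As an illustration of tuning without touching the logic, a config of this shape could be overridden with `dataclasses.replace`. The field names below are hypothetical. The defaults shown are the values quoted in this description, except `general_threshold`, which is an assumed placeholder. The weight-sum check is an added safeguard for the sketch, not something the PR is known to include.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FusionConfig:
    # Hypothetical field names; values quoted from the description unless noted.
    high_impact_audio_weight: float = 0.80
    high_impact_visual_weight: float = 0.20
    ambient_audio_weight: float = 0.35
    ambient_visual_weight: float = 0.65
    ambient_min_reaction: float = 0.35
    ambient_threshold: float = 0.55
    general_audio_weight: float = 0.60
    general_visual_weight: float = 0.40
    general_threshold: float = 0.50  # ASSUMED placeholder

    def __post_init__(self) -> None:
        # Each audio/visual weight pair should describe a full mix.
        pairs = [
            (self.high_impact_audio_weight, self.high_impact_visual_weight),
            (self.ambient_audio_weight, self.ambient_visual_weight),
            (self.general_audio_weight, self.general_visual_weight),
        ]
        for audio_w, visual_w in pairs:
            if not math.isclose(audio_w + visual_w, 1.0):
                raise ValueError("audio and visual weights must sum to 1.0")


# Favor precision on ambient sounds: demand a clearer visual reaction.
stricter = replace(FusionConfig(), ambient_min_reaction=0.5)
```

Because `replace` builds a new instance, the defaults stay untouched and several tuned configs can be compared side by side.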

Tests

38 tests cover all three category paths, the ambient reaction gate, the combined-score boundary conditions, timestamp and label preservation, batch_decide, and custom FusionConfig overrides. India-specific labels ([tabla], [firecrackers]) have explicit test cases confirming they route through GENERAL rather than AMBIENT.

38 passed in 0.19s

No ML dependencies. Runs with pip install pytest && pytest tests/.
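The routing checks for India-specific labels mentioned above might look roughly like this. The `categorize` function here is a hypothetical stand-in so the snippet runs on its own; the real tests would import the engine's own routing instead.

```python
# Label sets copied from the PR description.
HIGH_IMPACT = {"gunshot", "explosion", "alarm", "siren", "glass breaking"}
AMBIENT = {"music", "rain", "wind", "traffic"}


def categorize(label: str) -> str:
    """Stand-in router: anything not listed above falls through to GENERAL."""
    name = label.strip("[]").lower()
    if name in HIGH_IMPACT:
        return "HIGH_IMPACT"
    if name in AMBIENT:
        return "AMBIENT"
    return "GENERAL"


def test_india_specific_labels_route_to_general():
    # Tabla and firecrackers must not fall under the ambient reaction gate.
    for label in ("[tabla]", "[firecrackers]"):
        assert categorize(label) == "GENERAL"


def test_known_ambient_labels_stay_ambient():
    for label in ("[music]", "[rain]", "[wind]", "[traffic]"):
        assert categorize(label) == "AMBIENT"
```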

Refs #2

Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>

feat: add category-aware CC decision engine (Goal 3)

38f4d02


bhuvan-somisetty mentioned this pull request May 16, 2026

feat: syllabic caption localizer for 9 Indian languages with RCI accessibility scoring #29

Open

bhuvan-somisetty wants to merge 1 commit into PlanetRead:main from bhuvan-somisetty:feat/cc-decision-engine-goal3
