feat: category-aware CC decision engine (Goal 3) #28

Open
bhuvan-somisetty wants to merge 1 commit into PlanetRead:main from bhuvan-somisetty:feat/cc-decision-engine-goal3

bhuvan-somisetty commented May 16, 2026

Goal 1 detects what a sound is. Goal 2 detects whether anyone on screen reacted to it. But neither of those alone is enough to decide whether to add a caption - and that decision is the whole point of this tool.

This is Goal 3: the fusion engine that combines both signals and makes the call.


The core problem this solves

Without a decision engine, you have two bad options. You can caption everything, which buries viewers in noise they learn to ignore. Or you can caption only sounds above a single arbitrary audio threshold, which misses real events that had low audio confidence but a strong visual reaction. Neither serves a hearing-impaired student watching a Hindi educational video.

The engine introduces a third approach: category-aware fusion, where the rules for firing a caption depend on what kind of sound was detected.


Decision logic — three tiers

Sound event detected by Goal 1
          |
          v
    What category?
    /           |           \
HIGH_IMPACT   AMBIENT     GENERAL
    |           |           |
 (gunshot,   (music,    (applause,
explosion,   rain,       crying,
  alarm,     wind,      dog bark,
  siren,    traffic)     tabla,
glass brk)              crowd...)
    |           |           |
Audio-led   Gate first:  Weighted
80/20 mix.  reaction    fusion.
No visual   must exist.  60/40
required.   Then 35/65  audio-led.
            audio/visual.
            |           |
            v           v
      combined >= 0.55 or REJECT
      (avoids ambient overcaption)

HIGH_IMPACT events (gunshot, explosion, alarm, siren, glass breaking): a strong audio signal alone is sufficient. Visual reaction can rescue lower-confidence detections: if the scene reacts strongly to what might be a suppressed gunshot, that's worth a caption even if YAMNet only scored it 0.35.

AMBIENT events (music, rain, wind, traffic): these are the overcaption trap. A lesson with rain outside will trigger [rain] at every YAMNet frame. The engine first gates ambient sounds on a minimum visual reaction score (0.35 in the rejection example further down), then requires the weighted combined score (35% audio, 65% visual) to reach the 0.55 threshold shown in the diagram. Music playing in the background with no visible reaction from anyone on screen gets no caption.

GENERAL events (applause, crying, dog barking, tabla, firecrackers, etc.): audio leads at 60%, visual fills in the gap. A strong audio detection passes without any visual reaction. A borderline detection can still be accepted if the speaker clearly reacts.
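
The three tiers can be sketched as one small function. This is an illustration, not the code in src/fusion/engine.py. The weights (80/20, 35/65, 60/40), the 0.35 ambient reaction gate, and the 0.55 ambient threshold are taken from the description above. The 0.45 and 0.50 thresholds for HIGH_IMPACT and GENERAL, the label sets, and the names `categorize` and `fuse` are assumptions made for the example.

```python
# Label sets follow the diagram; the exact strings are assumed.
HIGH_IMPACT = {"gunshot", "explosion", "alarm", "siren", "glass breaking"}
AMBIENT = {"music", "rain", "wind", "traffic"}

# (audio weight, visual weight, combined threshold) per category.
# Weights come from the description; the HIGH_IMPACT and GENERAL
# thresholds are assumptions for this sketch.
RULES = {
    "HIGH_IMPACT": (0.80, 0.20, 0.45),
    "AMBIENT": (0.35, 0.65, 0.55),
    "GENERAL": (0.60, 0.40, 0.50),
}
AMBIENT_MIN_REACTION = 0.35


def categorize(label: str) -> str:
    """Map a caption label such as "[rain]" to one of the three tiers."""
    name = label.strip("[]").lower()
    if name in HIGH_IMPACT:
        return "HIGH_IMPACT"
    if name in AMBIENT:
        return "AMBIENT"
    return "GENERAL"  # everything not listed falls through to GENERAL


def fuse(label: str, audio_confidence: float, reaction_score: float) -> tuple[bool, float]:
    """Return (accepted, combined_score) for one detected sound event."""
    category = categorize(label)
    w_audio, w_visual, threshold = RULES[category]
    combined = w_audio * audio_confidence + w_visual * reaction_score
    # Ambient sounds must show a visible reaction before the score is considered.
    if category == "AMBIENT" and reaction_score < AMBIENT_MIN_REACTION:
        return False, combined
    return combined >= threshold, combined
```

Under these assumed thresholds, `fuse("[gunshot]", 0.35, 0.9)` gives a combined score of about 0.46 and is accepted, which matches the rescue case described for HIGH_IMPACT, while `fuse("[music]", 0.92, 0.08)` is rejected at the reaction gate regardless of its audio confidence.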


What each decision looks like

Every CCDecision carries a plain-English reason field explaining what happened:

# A gunshot accepted on audio confidence alone
CCDecision(
    accepted=True,
    label="[gunshot]",
    start_s=12.48,
    end_s=12.96,
    audio_confidence=0.81,
    reaction_score=0.0,
    combined_score=0.648,
    reason="high-impact event: combined score 0.65 (audio 0.81 × 0.8 + reaction 0.0 × 0.2)"
)

# Background music rejected because nobody reacted
CCDecision(
    accepted=False,
    label="[music]",
    start_s=45.12,
    end_s=46.08,
    audio_confidence=0.92,
    reaction_score=0.08,
    combined_score=0.374,
    reason="ambient sound rejected: reaction score 0.08 below minimum 0.35 (no visible scene response - likely background noise)"
)

# Tabla accepted - GENERAL category, strong audio confidence
CCDecision(
    accepted=True,
    label="[tabla]",
    start_s=30.0,
    end_s=32.4,
    audio_confidence=0.85,
    reaction_score=0.0,
    combined_score=0.51,
    reason="accepted: combined score 0.51 (audio 0.85 × 0.6 + reaction 0.0 × 0.4)"
)

The reason strings are designed to be useful for editors reviewing suggestions - they can see at a glance whether a rejection was due to weak audio, no visual reaction, or just not clearing the combined threshold.
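
The combined_score values in the three examples follow directly from the per-category weights. The check below recomputes them from the quoted figures. It is only arithmetic on the numbers shown above and is not part of the PR.

```python
import math

examples = [
    # (label, audio_confidence, reaction_score, audio_weight, visual_weight, reported)
    ("[gunshot]", 0.81, 0.00, 0.80, 0.20, 0.648),
    ("[music]", 0.92, 0.08, 0.35, 0.65, 0.374),
    ("[tabla]", 0.85, 0.00, 0.60, 0.40, 0.51),
]

recomputed = {}
for label, audio, reaction, w_audio, w_visual, reported in examples:
    combined = w_audio * audio + w_visual * reaction
    recomputed[label] = combined
    # Compare with a tolerance because these are binary floating-point values.
    assert math.isclose(combined, reported, abs_tol=1e-9), label
```

For [music], the reaction score of 0.08 fails the 0.35 gate before the combined score is consulted, so the 0.374 is reported for context only.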


How it connects to the rest of the pipeline

  src/audio/detector.py  (Goal 1)
  AudioEvent { label, start_s, end_s, confidence }
                        |
                        v
  src/vision/reaction.py  (Goal 2)
  ReactionResult { reaction_score }
                        |
                        v
  src/fusion/engine.py  (this PR — Goal 3)
  decide(AudioSignal, VisualSignal) -> CCDecision
                        |
              accepted? |
             /          \
           yes           no
            |             |
     write SRT/SLS    discard
    (Goals 1+2 PRs)

AudioSignal and VisualSignal are thin dataclasses defined in this module - the engine does not import from Goal 1 or Goal 2 directly, so it can be reviewed and merged independently.
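
A minimal sketch of what such thin dataclasses and an adapter could look like. The field names mirror AudioEvent and ReactionResult from the diagram, which is an assumption about this module, and `to_signals` is a hypothetical helper that copies fields by attribute name so nothing from Goal 1 or Goal 2 has to be imported.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class AudioSignal:
    label: str
    start_s: float
    end_s: float
    confidence: float


@dataclass(frozen=True)
class VisualSignal:
    reaction_score: float


def to_signals(audio_event, reaction_result):
    """Hypothetical adapter: duck-typed copy of the upstream fields."""
    audio = AudioSignal(
        label=audio_event.label,
        start_s=audio_event.start_s,
        end_s=audio_event.end_s,
        confidence=audio_event.confidence,
    )
    visual = VisualSignal(reaction_score=reaction_result.reaction_score)
    return audio, visual


# Stand-ins for the objects produced by Goal 1 and Goal 2.
event = SimpleNamespace(label="[tabla]", start_s=30.0, end_s=32.4, confidence=0.85)
reaction = SimpleNamespace(reaction_score=0.0)
audio, visual = to_signals(event, reaction)
```

Because the adapter only reads attributes, any object with matching field names works, which keeps the fusion module free of imports from the other two goals.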

FusionConfig exposes every threshold and weight as a dataclass field with documented defaults. Researchers who want to tune precision vs. recall for a specific content type can do so without touching the logic.
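
As a rough illustration of that tuning workflow, a config of this shape can be overridden with `dataclasses.replace`. The field names below are hypothetical and may not match the real FusionConfig. The weights and the 0.35 gate are taken from this description, the 0.55 threshold is applied to AMBIENT only as an interpretation of the diagram, and the remaining two thresholds are assumed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FusionConfig:
    # Audio weights stated in the description (visual weight = 1 - audio weight).
    high_impact_audio_weight: float = 0.80
    ambient_audio_weight: float = 0.35
    general_audio_weight: float = 0.60
    # Ambient gate and threshold quoted in the description.
    ambient_min_reaction: float = 0.35
    ambient_threshold: float = 0.55
    # Assumed values for this sketch only.
    high_impact_threshold: float = 0.45
    general_threshold: float = 0.50


default = FusionConfig()

# Favour precision on ambient sounds: demand a stronger reaction and a higher score.
strict = replace(default, ambient_min_reaction=0.50, ambient_threshold=0.65)
```

Keeping the config frozen means an override always produces a new object, so a tuned variant cannot silently alter the defaults used elsewhere.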


Tests

38 tests cover all three category paths, the ambient reaction gate, the combined-score boundary conditions, timestamp and label preservation, batch_decide, and custom FusionConfig overrides. India-specific labels ([tabla], [firecrackers]) have explicit test cases confirming they route through GENERAL rather than AMBIENT.

38 passed in 0.19s

No ML dependencies. Runs with pip install pytest && pytest tests/.
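
For readers who have not opened the test file, tests of the kind listed above might look like the following. This is a self-contained sketch with inline stand-ins (`categorize` and `clears` are hypothetical helpers), not an excerpt from tests/ in this PR.

```python
HIGH_IMPACT = {"[gunshot]", "[explosion]", "[alarm]", "[siren]", "[glass breaking]"}
AMBIENT = {"[music]", "[rain]", "[wind]", "[traffic]"}


def categorize(label: str) -> str:
    # Hypothetical router: anything not listed falls through to GENERAL.
    if label in HIGH_IMPACT:
        return "HIGH_IMPACT"
    if label in AMBIENT:
        return "AMBIENT"
    return "GENERAL"


def clears(combined: float, threshold: float = 0.55) -> bool:
    # The diagram uses ">=", so a score exactly on the threshold passes.
    return combined >= threshold


def test_india_specific_labels_route_through_general():
    for label in ("[tabla]", "[firecrackers]"):
        assert categorize(label) == "GENERAL"


def test_combined_threshold_is_inclusive():
    assert clears(0.55)
    assert not clears(0.5499)
```

Plain `test_` functions with bare asserts are enough for pytest to discover and run them, so no extra imports are needed.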


Refs #2

Signed-off-by: bhuvan-somisetty <somisettybhuvan5@gmail.com>