[DMP 2026]: Create Intelligent Closed Caption (CC) Suggestion Tool #2

@keerthiseelan-planetread

Ticket Contents

Description

Our goal is to develop an AI-powered tool that intelligently identifies moments in a video where a Closed Caption (CC) annotation is genuinely necessary — such as when a non-speech audio event meaningfully affects the speakers or the scene — and suggests contextually relevant CC text, without over-captioning routine or low-impact sounds. The tool will analyze both the audio and visual tracks together to determine whether a non-speech event is significant enough to warrant a CC, reducing the manual effort of editors and accessibility teams who currently add CC annotations by hand.

Goals & Mid-Point Milestone

Goals

  • Goal 1: Sound Event Detection Module. Automatically detect and classify non-speech audio events in a given video file, with confidence scores and timestamps. Steps involved: The video file is taken as input. The audio track is extracted and passed through an open-source sound event detection model. The model classifies events such as honking, explosions, laughter, music, glass breaking, alarms, and applause. The output is a list of detected events with confidence scores and start/end timestamps.

  • Goal 2: Speaker Reaction Detection Module (Mid-Point Milestone). Detect visible speaker or scene reactions to audio events using visual analysis of video frames. Steps involved: At each detected audio event timestamp, the corresponding video frames are extracted. A visual analysis model detects reactions such as head turns, startled body language, paused speech, or facial expressions. A reaction confidence score is assigned per event and stored alongside the audio event data for downstream combination.

  • Goal 3: CC Decision Engine & SRT/SLS Output. Combine audio event signals and visual reaction signals to make a CC/no-CC decision and generate a labelled output file. Steps involved: The audio event confidence and visual reaction confidence are combined to determine whether a CC is warranted. A CC text label is auto-generated for each accepted event (e.g., [honking], [gunshot], [crowd cheering]). The accepted suggestions are exported with correct timestamps into a standard SRT or SLS file. The tool is tested on a sample set of Hindi and regional-language content, and feedback is collected from editors on suggestion accuracy.

  • The mid-point milestone is the completion of Goal 1 and Goal 2.
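
To make the Goal 3 combination step concrete, here is a minimal sketch of one possible decision rule: a weighted average of the audio confidence and the visual reaction confidence, compared against a threshold. The ticket does not specify how the two signals are combined, so the `Candidate` class, the `cc_score` and `suggest_captions` helpers, the 0.5 weight, and the 0.6 threshold are all hypothetical placeholders to be tuned against editor feedback.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One detected audio event plus its visual reaction score (hypothetical schema)."""
    label: str            # audio class name, e.g. "honking"
    start: float          # event start, in seconds
    end: float            # event end, in seconds
    audio_conf: float     # 0..1, from the sound event detector (Goal 1)
    reaction_conf: float  # 0..1, from the reaction detector (Goal 2)


def cc_score(c: Candidate, audio_weight: float = 0.5) -> float:
    """Weighted average of the two confidences. The weight is an assumption."""
    return audio_weight * c.audio_conf + (1.0 - audio_weight) * c.reaction_conf


def suggest_captions(candidates, threshold: float = 0.6, audio_weight: float = 0.5):
    """Return (start, end, "[label]") for each candidate scoring at or above threshold."""
    accepted = []
    for c in candidates:
        if cc_score(c, audio_weight) >= threshold:
            accepted.append((c.start, c.end, f"[{c.label}]"))
    return accepted


# A loud horn that startles the speaker is kept; background music that
# nobody reacts to is dropped even though the audio model is confident.
candidates = [
    Candidate("honking", 1.0, 2.0, audio_conf=0.9, reaction_conf=0.8),
    Candidate("music", 3.0, 9.0, audio_conf=0.9, reaction_conf=0.1),
]
suggestions = suggest_captions(candidates)
```

A weighted average is only one option. A rule that requires both scores to clear separate thresholds, or a small trained classifier, would fit the same interface.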

Setup/Installation

No response

Expected Outcome

The Intelligent Closed Caption (CC) Suggestion Tool is a Python-based backend pipeline that accepts any video file as input and produces a ready-to-use SRT or SLS file containing only contextually meaningful, non-speech closed caption annotations — reducing manual effort for accessibility editors and teams working on Hindi and regional-language content.
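
As an illustration of the SRT half of the output, the sketch below turns accepted suggestions into SRT text: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, the caption text, and a blank line between cues. The `(start, end, text)` tuple layout and the helper names `srt_timestamp` and `to_srt` are assumptions for this example; the SLS format is not covered here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues) -> str:
    """Build SRT text from an iterable of (start_s, end_s, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Joining with a newline leaves one blank line between consecutive cues.
    return "\n".join(blocks)


srt_text = to_srt([(1.0, 2.5, "[honking]"), (12.0, 14.0, "[crowd cheering]")])
```

The resulting string can be written to a `.srt` file with UTF-8 encoding, for example `open("suggestions.srt", "w", encoding="utf-8").write(srt_text)`.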

Acceptance Criteria

The tool should successfully detect non-speech audio events, assess speaker/scene reaction, and produce a CC-annotated SRT or SLS file for any given video file. It must avoid over-captioning ambient sounds that do not affect the speakers or narrative.

Implementation Details

Open-source stack — Python, audio event detection model (e.g., YAMNet or PANNs), OpenCV (frame extraction), MediaPipe or similar (pose and expression analysis), decision combiner logic, SRT/SLS file output.
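
Frame-level audio taggers typically emit one score per class per short analysis window, so Goal 1 needs a step that turns those scores into events with start/end timestamps. The sketch below shows one simple way to do that for a single class: group consecutive windows at or above a threshold into one event. The `frames_to_events` helper is hypothetical and uses no model API. The default `hop_s` and `window_s` values are assumptions and should be checked against the documentation of whichever model is chosen.

```python
def frames_to_events(scores, label, hop_s=0.48, window_s=0.96, threshold=0.5):
    """Merge runs of consecutive frames with score >= threshold into events.

    scores   : per-frame confidences (0..1) for one sound class
    hop_s    : time between the starts of consecutive frames (assumed value)
    window_s : duration covered by each frame (assumed value)
    Returns a list of dicts with label, start, end, and confidence (run maximum).
    """
    events = []
    run_start = None  # index of the first frame in the current run
    run_max = 0.0     # highest score seen in the current run

    def close_run(last_index):
        events.append({
            "label": label,
            "start": run_start * hop_s,
            "end": last_index * hop_s + window_s,
            "confidence": run_max,
        })

    for i, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start, run_max = i, score
            else:
                run_max = max(run_max, score)
        elif run_start is not None:
            close_run(i - 1)
            run_start = None

    if run_start is not None:  # a run that lasts until the final frame
        close_run(len(scores) - 1)
    return events


# Round hop/window values are used here only to keep the arithmetic easy to follow.
events = frames_to_events([0.1, 0.8, 0.9, 0.2, 0.7], "honking", hop_s=0.5, window_s=1.0)
```

Taking the maximum as the event confidence is a design choice; the mean over the run is an equally reasonable alternative if short spikes turn out to cause false positives.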

Mockups/Wireframes

No response

Product Name

Intelligent Closed Caption (CC) Suggestion Tool

Organisation Name

Planet Read

Domain

Education

Tech Skills Needed

Artificial Intelligence, Computer Vision, Python, Machine Learning

Mentor(s)

@abinash-sketch @keerthiseelan-planetread

Category

Backend, Machine Learning, AI
