[DMP 2026]: Create Intelligent Closed Caption (CC) Suggestion Tool #2

@keerthiseelan-planetread

Ticket Contents

Description

Our goal is to develop an AI-powered tool that intelligently identifies moments in a video where a Closed Caption (CC) annotation is genuinely necessary — such as when a non-speech audio event meaningfully affects the speakers or the scene — and suggests contextually relevant CC text, without over-captioning routine or low-impact sounds. The tool will analyze both the audio and visual tracks together to determine whether a non-speech event is significant enough to warrant a CC, reducing the manual effort of editors and accessibility teams who currently add CC annotations by hand.

Goals & Mid-Point Milestone

Goals

  • Goal 1: Sound Event Detection Module. Automatically detect and classify non-speech audio events in a given video file, with confidence scores and timestamps. Steps involved: (1) the video file is taken as input; (2) the audio track is extracted and passed through an open-source sound event detection model; (3) the model classifies events such as honking, explosions, laughter, music, glass breaking, alarms, and applause; (4) the output is a list of detected events with confidence scores and start/end timestamps.

  • Goal 2: Speaker Reaction Detection Module (Mid-Point Milestone). Detect visible speaker or scene reactions to audio events using visual analysis of video frames. Steps involved: (1) at each detected audio event timestamp, the corresponding video frames are extracted; (2) a visual analysis model detects reactions such as head turns, startled body language, paused speech, or facial expressions; (3) a reaction confidence score is assigned per event and stored alongside the audio event data for downstream combination.

  • Goal 3: CC Decision Engine & SRT/SLS Output. Combine audio event signals and visual reaction signals to make a CC/no-CC decision and generate a labelled output file. Steps involved: (1) the audio event confidence and visual reaction confidence are combined to determine whether a CC is warranted; (2) a CC text label is auto-generated for each accepted event (e.g., [honking], [gunshot], [crowd cheering]); (3) the accepted suggestions are exported with correct timestamps into a standard SRT or SLS file; (4) the tool is tested on a sample set of Hindi and regional-language content, and feedback is collected from editors on suggestion accuracy.

  • The mid-point milestone is the completion of Goals 1 and 2.
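
To make the Goal 3 combination step concrete, here is a minimal sketch of one possible decision rule. The ticket does not specify how the two confidences are combined, so the weighted average, the `audio_weight` and `threshold` values, and all names (`Candidate`, `combined_score`, `decide`) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One detected audio event plus its visual reaction score (hypothetical schema)."""
    label: str            # class name from the sound event detector (Goal 1)
    start: float          # event start, in seconds
    end: float            # event end, in seconds
    audio_conf: float     # detector confidence in [0, 1] (Goal 1)
    reaction_conf: float  # reaction confidence in [0, 1] (Goal 2)


def combined_score(c: Candidate, audio_weight: float = 0.5) -> float:
    """Weighted average of the two signals; the weighting is an assumption."""
    return audio_weight * c.audio_conf + (1.0 - audio_weight) * c.reaction_conf


def decide(candidates, threshold: float = 0.6, audio_weight: float = 0.5):
    """Return (start, end, cc_text) for every candidate that clears the threshold."""
    accepted = []
    for c in candidates:
        if combined_score(c, audio_weight) >= threshold:
            accepted.append((c.start, c.end, f"[{c.label}]"))
    return accepted


if __name__ == "__main__":
    events = [
        Candidate("honking", 12.0, 13.5, audio_conf=0.9, reaction_conf=0.8),
        # Loud but ignored by everyone on screen, so it should not be captioned.
        Candidate("traffic hum", 20.0, 25.0, audio_conf=0.9, reaction_conf=0.1),
    ]
    print(decide(events))
```

With these example values only the honking event is kept: a confidently detected sound with almost no visible reaction falls below the threshold, which is the over-captioning case the tool is meant to avoid. A learned classifier or per-class thresholds could replace the fixed rule once editor feedback is available.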

Setup/Installation

No response

Expected Outcome

The Intelligent Closed Caption (CC) Suggestion Tool is a Python-based backend pipeline that accepts any video file as input and produces a ready-to-use SRT or SLS file containing only contextually meaningful, non-speech closed caption annotations — reducing manual effort for accessibility editors and teams working on Hindi and regional-language content.
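
As an illustration of the SRT side of this output, the sketch below formats accepted suggestions as SRT cues (a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the caption text, then a blank line). The `(start, end, text)` tuple layout and the function names are assumptions, and the SLS format is not covered here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues) -> str:
    """Render (start, end, text) tuples as SRT text, ordered by start time."""
    blocks = []
    for index, (start, end, text) in enumerate(sorted(cues), start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a single blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    cues = [(75.25, 77.0, "[crowd cheering]"), (12.0, 13.5, "[honking]")]
    print(to_srt(cues))
```

Writing the returned string to a `.srt` file with UTF-8 encoding should be enough for most players and subtitle editors.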

Acceptance Criteria

The tool should successfully detect non-speech audio events, assess speaker/scene reaction, and produce a CC-annotated SRT or SLS file for any given video file. It must avoid over-captioning ambient sounds that do not affect the speakers or narrative.

Implementation Details

Open-source stack — Python, audio event detection model (e.g., YAMNet or PANNs), OpenCV (frame extraction), MediaPipe or similar (pose and expression analysis), decision combiner logic, SRT/SLS file output.
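
Models such as YAMNet emit one score per class for each short analysis frame rather than ready-made events, so the Goal 1 output needs a grouping step that turns consecutive high-scoring frames into events with start/end timestamps. The sketch below shows one way to do that for a single class. The default hop and window lengths (0.48 s and 0.96 s) follow YAMNet's documented framing, but the threshold, the event dictionary layout, and the function name are assumptions, and the model call itself is left out.

```python
def frames_to_events(scores, label, hop_s=0.48, window_s=0.96, threshold=0.5):
    """Group consecutive frames scoring >= threshold into timestamped events.

    scores: per-frame confidences for one class, in frame order.
    Returns a list of dicts with label, start, end (seconds) and confidence.
    """
    events = []
    run = []  # (frame_index, score) pairs of the run currently being built
    # A trailing -inf sentinel guarantees that the final run gets closed.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold:
            run.append((i, score))
        elif run:
            first, last = run[0][0], run[-1][0]
            events.append({
                "label": label,
                "start": round(first * hop_s, 3),
                # The last frame still covers a full analysis window.
                "end": round(last * hop_s + window_s, 3),
                "confidence": round(max(s for _, s in run), 3),
            })
            run = []
    return events


if __name__ == "__main__":
    horn_scores = [0.1, 0.7, 0.9, 0.2, 0.6]
    for event in frames_to_events(horn_scores, "honking"):
        print(event)
```

The peak score is used as the event confidence here; a mean over the run is an equally plausible choice and may be less sensitive to single-frame spikes.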

Mockups/Wireframes

No response

Product Name

Intelligent Closed Caption (CC) Suggestion Tool

Organisation Name

Planet Read

Domain

Education

Tech Skills Needed

Artificial Intelligence, Computer Vision, Python, Machine Learning

Mentor(s)

@abinash-sketch @keerthiseelan-planetread

Category

Backend, Machine Learning, AI
