Ticket Contents
Description
Our goal is to develop an AI-powered tool that intelligently identifies moments in a video where a Closed Caption (CC) annotation is genuinely necessary — such as when a non-speech audio event meaningfully affects the speakers or the scene — and suggests contextually relevant CC text, without over-captioning routine or low-impact sounds. The tool will analyze both the audio and visual tracks together to determine whether a non-speech event is significant enough to warrant a CC, reducing the manual effort of editors and accessibility teams who currently add CC annotations by hand.
Goals & Mid-Point Milestone
Goals
Goal 1: Sound Event Detection Module
Automatically detect and classify non-speech audio events in a given video file, with confidence scores and timestamps.
Steps Involved:
- The video file is taken as input.
- The audio track is extracted and passed through an open-source sound event detection model.
- The model classifies events such as honking, explosions, laughter, music, glass breaking, alarms, and applause.
- The output is a list of detected events with confidence scores and start/end timestamps.
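The last step of Goal 1, turning per-frame classifier scores into a list of timestamped events, could be sketched roughly as below. This is only an illustration under stated assumptions: `AudioEvent`, `frames_to_events`, and the `threshold` value are hypothetical names and settings, not part of the ticket, and the per-frame scores are assumed to come from whichever model is chosen (the default window and hop mirror YAMNet's 0.96 s frames with a 0.48 s hop).

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    label: str
    start: float       # seconds from the start of the audio track
    end: float         # seconds from the start of the audio track
    confidence: float  # mean model score over the frames in the event


def frames_to_events(scores, label, hop_s=0.48, win_s=0.96, threshold=0.5):
    """Group consecutive frames scoring >= threshold into one event.

    `scores` holds one score per analysis frame for a single class.
    Frame i is assumed to cover [i * hop_s, i * hop_s + win_s].
    """
    events = []
    run = []        # scores of the frames in the current run
    run_start = 0   # index of the first frame in the current run
    # A sentinel below any real score closes a run that reaches the last frame.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score >= threshold:
            if not run:
                run_start = i
            run.append(score)
        elif run:
            start = run_start * hop_s
            end = (i - 1) * hop_s + win_s
            events.append(
                AudioEvent(label, round(start, 3), round(end, 3), sum(run) / len(run))
            )
            run = []
    return events


if __name__ == "__main__":
    for event in frames_to_events([0.1, 0.8, 0.9, 0.2, 0.7], "honking"):
        print(event)
```

Running one pass per class keeps the sketch simple; a real module would likely also merge events separated by very short gaps and drop events shorter than some minimum duration.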
Goal 2: Speaker Reaction Detection Module (Mid-Point Milestone)
Detect visible speaker or scene reactions to audio events using visual analysis of video frames.
Steps Involved:
- At each detected audio event timestamp, the corresponding video frames are extracted.
- A visual analysis model detects reactions such as head turns, startled body language, paused speech, or facial expressions.
- A reaction confidence score is assigned per event and stored alongside the audio event data for downstream combination.
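One way the reaction confidence in Goal 2 might be computed is to compare a per-frame visual signal just before and just after the audio event onset. The sketch below assumes a head-yaw angle per frame is already available from a pose model (for example MediaPipe); `reaction_score`, `window_s`, and `scale_deg` are hypothetical, and the exponential squashing is an arbitrary choice for mapping a shift in degrees onto a 0 to 1 score, not a method prescribed by the ticket.

```python
import math


def reaction_score(yaw_deg, fps, onset_s, window_s=1.0, scale_deg=15.0):
    """Score how strongly head yaw shifts after an audio event onset.

    `yaw_deg` is one head-yaw angle (degrees) per video frame. The score is
    0.0 when the head stays at its pre-event position and approaches 1.0 as
    the largest post-onset deviation grows relative to `scale_deg`.
    """
    onset = int(round(onset_s * fps))
    n = int(round(window_s * fps))
    before = yaw_deg[max(0, onset - n):onset]
    after = yaw_deg[onset:onset + n]
    if not before or not after:
        return 0.0  # not enough context to judge a reaction
    baseline = sum(before) / len(before)
    shift = max(abs(value - baseline) for value in after)
    return 1.0 - math.exp(-shift / scale_deg)


if __name__ == "__main__":
    # Ten frames facing forward, then a 30-degree head turn at t = 1.0 s.
    yaw = [0.0] * 10 + [30.0] * 10
    record = {
        "label": "honking",
        "start": 1.0,
        "audio_conf": 0.85,
        "reaction_conf": reaction_score(yaw, fps=10, onset_s=1.0),
    }
    print(record)
```

The same before/after comparison could be applied to other signals named in the steps (body keypoint motion, facial landmarks, speech pauses) and the individual scores combined into a single reaction confidence per event.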
Goal 3: CC Decision Engine & SRT/SLS Output
Combine audio event signals and visual reaction signals to make a CC/no-CC decision and generate a labelled output file.
Steps Involved:
- The audio event confidence and visual reaction confidence are combined to determine whether a CC is warranted.
- A CC text label is auto-generated for each accepted event (e.g., [honking], [gunshot], [crowd cheering]).
- The accepted suggestions are exported with correct timestamps into a standard SRT or SLS file.
- The tool is tested on a sample set of Hindi and regional-language content, and feedback is collected from editors on suggestion accuracy.
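The decision and export steps of Goal 3 might look something like the following sketch. The ticket does not say how the two confidences are combined, so the weighted average, the weight `w_audio`, and the `threshold` are assumptions, as are the function names and the dictionary keys. Only SRT output is shown, since the SLS file layout is not described here.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def should_caption(audio_conf, reaction_conf, w_audio=0.5, threshold=0.6):
    """Accept an event when the weighted mean of both scores meets the threshold."""
    combined = w_audio * audio_conf + (1.0 - w_audio) * reaction_conf
    return combined >= threshold


def to_srt(events):
    """Render accepted events as SRT cues, numbered in order of start time."""
    kept = [e for e in events if should_caption(e["audio_conf"], e["reaction_conf"])]
    blocks = []
    for index, event in enumerate(sorted(kept, key=lambda e: e["start"]), start=1):
        timing = f"{srt_timestamp(event['start'])} --> {srt_timestamp(event['end'])}"
        blocks.append(f"{index}\n{timing}\n[{event['label']}]\n")
    return "\n".join(blocks)  # a blank line separates consecutive cues


if __name__ == "__main__":
    events = [
        {"label": "honking", "start": 1.0, "end": 2.5,
         "audio_conf": 0.9, "reaction_conf": 0.8},
        {"label": "music", "start": 5.0, "end": 9.0,
         "audio_conf": 0.7, "reaction_conf": 0.1},
    ]
    print(to_srt(events))
```

With these example values the honking event is kept and the background music is dropped because nobody visibly reacts to it, which is the over-captioning case the tool is meant to avoid. The weight and threshold would need tuning against editor feedback.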
The mid-point milestone is the completion of Goals 1 and 2.
Setup/Installation
No response
Expected Outcome
The Intelligent Closed Caption (CC) Suggestion Tool is a Python-based backend pipeline that accepts any video file as input and produces a ready-to-use SRT or SLS file containing only contextually meaningful, non-speech closed caption annotations — reducing manual effort for accessibility editors and teams working on Hindi and regional-language content.
Acceptance Criteria
The tool should successfully detect non-speech audio events, assess speaker/scene reaction, and produce a CC-annotated SRT or SLS file for any given video file. It must avoid over-captioning ambient sounds that do not affect the speakers or narrative.
Implementation Details
Open-source stack — Python, audio event detection model (e.g., YAMNet or PANNs), OpenCV (frame extraction), MediaPipe or similar (pose and expression analysis), decision combiner logic, SRT/SLS file output.
Mockups/Wireframes
No response
Product Name
Intelligent Closed Caption (CC) Suggestion Tool
Organisation Name
Planet Read
Domain
Education
Tech Skills Needed
Artificial Intelligence, Computer Vision, Python, Machine Learning
Mentor(s)
@abinash-sketch @keerthiseelan-planetread
Category
Backend, Machine Learning, AI