
feat: implement MVP for Intelligent Closed Caption Suggestion Tool #7

Open
anmol457 wants to merge 1 commit into PlanetRead:main from anmol457:main

Conversation


@anmol457 anmol457 commented May 7, 2026

Intelligent Closed Caption (CC) Suggestion Tool — initial implementation

A backend pipeline that analyzes a video file and generates closed-caption suggestions for meaningful non-speech audio events. The system avoids over-captioning by combining audio event detection with visual reaction analysis and applying a weighted decision engine before producing SRT or SLS output.


Architecture

The system is organized as a backend pipeline with separate modules for each responsibility.

Video File
   |
   v
Audio Extraction
   |
   v
Sound Event Detection
   |
   v
Timestamped Audio Events
   |
   v
Visual Reaction Detection
   |
   v
Reaction Scores
   |
   v
CC Decision Engine
   |
   v
SRT / SLS Output
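
The stage sequence above can be sketched as a simple fold over stage functions. This is a hypothetical illustration of the orchestration pattern, not the actual pipeline.py code:

```python
def run_pipeline(video_path, stages):
    """Feed each stage's output into the next, in diagram order."""
    data = video_path
    for stage in stages:
        data = stage(data)
    return data

# Toy stand-ins for the real stages (audio extraction, event detection,
# reaction analysis, decision engine); real stages return typed models.
toy_stages = [
    lambda path: f"audio<{path}>",
    lambda audio: f"events<{audio}>",
    lambda events: f"captions<{events}>",
]
print(run_pipeline("video.mp4", toy_stages))
```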




(Diagrams: Audio Analysis Pipeline, Visual Reaction Analysis Pipeline, Decision Engine Logic)

Module breakdown

1. CLI & orchestration — cli.py, pipeline.py

Entry point and pipeline coordinator. Accepts video path, thresholds, and output format; runs all stages sequentially and returns a structured PipelineResult.

intelligent-cc video.mp4 -o output.srt
intelligent-cc video.mp4 --format sls -o output.sls
intelligent-cc video.mp4 --audio-threshold 0.30 --decision-threshold 0.55 --max-events 20
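
The flags shown above could be wired up with argparse roughly as follows; this is a sketch of the interface, and the real cli.py may define the options differently:

```python
import argparse

def build_parser():
    # Mirrors the CLI examples above; defaults are assumptions.
    p = argparse.ArgumentParser(prog="intelligent-cc")
    p.add_argument("video", help="path to the input video file")
    p.add_argument("-o", "--output", default="output.srt")
    p.add_argument("--format", choices=["srt", "sls"], default="srt")
    p.add_argument("--audio-threshold", type=float, default=0.30)
    p.add_argument("--decision-threshold", type=float, default=0.55)
    p.add_argument("--max-events", type=int, default=None)
    return p

args = build_parser().parse_args(["video.mp4", "--format", "sls", "-o", "out.sls"])
print(args.format, args.output)
```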

2. Data models — models.py

Typed dataclasses that form the pipeline's shared language.

| Model | Key fields |
| --- | --- |
| AudioEvent | label, confidence, start, end |
| ReactionSignal | motion_score, face_shift_score, frame_count |
| CaptionSuggestion | label, text, timestamps, scores, reason |
| PipelineResult | full run output with all accepted suggestions |
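
A minimal sketch of three of these dataclasses, assuming only the field names listed above (the real models.py may carry additional metadata):

```python
from dataclasses import dataclass

@dataclass
class AudioEvent:
    label: str
    confidence: float
    start: float   # seconds
    end: float     # seconds

@dataclass
class ReactionSignal:
    motion_score: float
    face_shift_score: float
    frame_count: int

@dataclass
class CaptionSuggestion:
    label: str
    text: str
    start: float
    end: float
    audio_confidence: float
    reaction_confidence: float
    decision_score: float
    reason: str

event = AudioEvent("gunshot", 0.97, 12.0, 18.72)
print(event.label, round(event.end - event.start, 2))
```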

3. Audio event detection — audio.py

Extracts mono 16 kHz audio via imageio-ffmpeg, loads with librosa, and classifies patches using YAMNet (TensorFlow Hub).

  • Filters speech-like classes; retains: alarm, applause, honking, gunshot, laughter, music, siren, explosion, cheering, glass breaking
  • Merges adjacent detections to prevent fragmented captions for the same continuous sound
  • Uses a bundled ffmpeg binary — no global install required
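
The merge step in the second bullet can be illustrated like this; the 0.5 s gap threshold and the tuple representation are assumptions, not audio.py's actual values:

```python
def merge_adjacent(events, gap=0.5):
    """Merge same-label detections whose silence gap is below `gap`
    seconds, so one continuous sound yields a single caption.
    `events` is a list of (label, start, end) sorted by start time."""
    merged = []
    for label, start, end in events:
        if merged and merged[-1][0] == label and start - merged[-1][2] <= gap:
            prev_label, prev_start, prev_end = merged[-1]
            merged[-1] = (label, prev_start, max(prev_end, end))
        else:
            merged.append((label, start, end))
    return merged

print(merge_adjacent([("siren", 0.0, 1.0), ("siren", 1.2, 2.4),
                      ("music", 3.0, 4.0)]))
# [('siren', 0.0, 2.4), ('music', 3.0, 4.0)]
```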

4. Visual reaction detection — vision.py

Samples video frames around each audio event's timestamp and computes two signals:

  • Motion score — Farneback Optical Flow measures pixel-level movement magnitude
  • Face/head shift score — MediaPipe tracks facial position, with an OpenCV Haar Cascade as the fallback when MediaPipe is unavailable

Outputs a ReactionSignal per event.
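
As a rough NumPy-only stand-in for the Farneback flow magnitude (vision.py itself uses OpenCV's optical flow), a normalized motion score over sampled frames might look like:

```python
import numpy as np

def motion_score(frames):
    """Mean absolute inter-frame difference, normalized to 0..1.
    A crude illustrative proxy for optical-flow magnitude."""
    if len(frames) < 2:
        return 0.0
    diffs = [np.abs(b.astype(float) - a.astype(float)).mean()
             for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs)) / 255.0

still = [np.zeros((4, 4), dtype=np.uint8)] * 3
moving = [np.full((4, 4), v, dtype=np.uint8) for v in (0, 128, 255)]
print(motion_score(still), motion_score(moving))
# 0.0 0.5
```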

5. Decision engine — decision.py

decision_score = (audio_confidence × 0.55)
               + (reaction_confidence × 0.45)
               + 0.12  # if high-impact label

High-impact labels: alarm, explosion, glass breaking, gunshot, honking, scream, siren

Low-confidence background sounds with no visual reaction are suppressed. High-impact events can bypass a weak reaction score.
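
The formula translates directly to code. The 0.55/0.45 weights, the 0.12 bonus, the high-impact label set, and the 0.55 decision threshold all come from this PR description; the helper names are hypothetical:

```python
HIGH_IMPACT = {"alarm", "explosion", "glass breaking", "gunshot",
               "honking", "scream", "siren"}

def decision_score(label, audio_confidence, reaction_confidence):
    """Weighted combination of audio and visual evidence."""
    score = audio_confidence * 0.55 + reaction_confidence * 0.45
    if label in HIGH_IMPACT:
        score += 0.12  # lets high-impact events bypass a weak reaction score
    return score

def accept(label, audio_conf, reaction_conf, threshold=0.55):
    return decision_score(label, audio_conf, reaction_conf) >= threshold

print(round(decision_score("gunshot", 0.9698, 0.4640), 3))  # 0.862
print(accept("music", 0.40, 0.10))  # low-confidence background: False
```

Note the gunshot example reproduces the decision_score shown in the .sls sample below, which suggests the weights in the PR text match the shipped defaults.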

6. Output generation — output.py

.srt — Standard SubRip, compatible with all major players and editing tools.

1
00:00:00,000 --> 00:00:00,480
[music]
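
SubRip timestamps use the HH:MM:SS,mmm layout shown above; a small formatter (hypothetical helper names, not necessarily output.py's) could look like:

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm layout SubRip requires."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_block(index, start, end, text):
    """One numbered SubRip cue, matching the sample above."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_block(1, 0.0, 0.48, "[music]"))
```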

.sls — Structured JSON with full metadata per suggestion: label, timestamps, audio_confidence, reaction_confidence, decision_score, reason (audio+visual / high-impact-audio).

{
    "index": 4,
    "label": "gunshot",
    "text": "[gunshot]",
    "start": 12.0,
    "end": 18.72,
    "audio_confidence": 0.9698405265808105,
    "reaction_confidence": 0.4639640204182693,
    "decision_score": 0.862196098807667,
    "reason": "audio+visual"
}

Video Demonstration

Intelligent.CC.Generation.MVP.Demo.-.Anmol.Varshney.mp4


anmol457 commented May 7, 2026

@keerthiseelan-planetread @abinash-sketch I have implemented the MVP and attached the demonstration video. Please review it.
