
feat: implement MVP for Intelligent Closed Caption Suggestion Tool #7

Open
anmol457 wants to merge 1 commit into PlanetRead:main from anmol457:main

Conversation


@anmol457 anmol457 commented May 7, 2026

Intelligent Closed Caption (CC) Suggestion Tool — initial implementation

A backend pipeline that analyzes a video file and generates closed-caption suggestions for meaningful non-speech audio events. The system avoids over-captioning by combining audio event detection with visual reaction analysis and applying a weighted decision engine before producing SRT or SLS output.


Architecture

The system is organized as a backend pipeline with separate modules for each responsibility.

Video File
   |
   v
Audio Extraction
   |
   v
Sound Event Detection
   |
   v
Timestamped Audio Events
   |
   v
Visual Reaction Detection
   |
   v
Reaction Scores
   |
   v
CC Decision Engine
   |
   v
SRT / SLS Output
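
The stage sequence above can be sketched as a simple fold over stage functions. This is a hypothetical illustration of the orchestration pattern, not the actual pipeline.py code:

```python
def run_pipeline(video_path, stages):
    """Feed each stage's output into the next, in diagram order."""
    data = video_path
    for stage in stages:
        data = stage(data)
    return data

# Toy stand-ins for the real stages (audio extraction, event detection,
# reaction analysis, decision engine); real stages return typed models.
toy_stages = [
    lambda path: f"audio<{path}>",
    lambda audio: f"events<{audio}>",
    lambda events: f"captions<{events}>",
]
print(run_pipeline("video.mp4", toy_stages))
```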




(Diagrams: Audio Analysis Pipeline, Visual Reaction Analysis Pipeline, Decision Engine Logic)

Module breakdown

1. CLI & orchestration — cli.py, pipeline.py

Entry point and pipeline coordinator. Accepts video path, thresholds, and output format; runs all stages sequentially and returns a structured PipelineResult.

intelligent-cc video.mp4 -o output.srt
intelligent-cc video.mp4 --format sls -o output.sls
intelligent-cc video.mp4 --audio-threshold 0.30 --decision-threshold 0.55 --max-events 20
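
The flags shown above could be wired up with argparse roughly as follows; this is a sketch of the interface, and the real cli.py may define the options differently:

```python
import argparse

def build_parser():
    # Mirrors the CLI examples above; defaults are assumptions.
    p = argparse.ArgumentParser(prog="intelligent-cc")
    p.add_argument("video", help="path to the input video file")
    p.add_argument("-o", "--output", default="output.srt")
    p.add_argument("--format", choices=["srt", "sls"], default="srt")
    p.add_argument("--audio-threshold", type=float, default=0.30)
    p.add_argument("--decision-threshold", type=float, default=0.55)
    p.add_argument("--max-events", type=int, default=None)
    return p

args = build_parser().parse_args(["video.mp4", "--format", "sls", "-o", "out.sls"])
print(args.format, args.output)
```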

2. Data models — models.py

Typed dataclasses that form the pipeline's shared language.

| Model | Key fields |
| --- | --- |
| AudioEvent | label, confidence, start, end |
| ReactionSignal | motion_score, face_shift_score, frame_count |
| CaptionSuggestion | label, text, timestamps, scores, reason |
| PipelineResult | full run output with all accepted suggestions |
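
A minimal sketch of three of these dataclasses, assuming only the field names listed above (the real models.py may carry additional metadata):

```python
from dataclasses import dataclass

@dataclass
class AudioEvent:
    label: str
    confidence: float
    start: float   # seconds
    end: float     # seconds

@dataclass
class ReactionSignal:
    motion_score: float
    face_shift_score: float
    frame_count: int

@dataclass
class CaptionSuggestion:
    label: str
    text: str
    start: float
    end: float
    audio_confidence: float
    reaction_confidence: float
    decision_score: float
    reason: str

event = AudioEvent("gunshot", 0.97, 12.0, 18.72)
print(event.label, round(event.end - event.start, 2))
```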

3. Audio event detection — audio.py

Extracts mono 16 kHz audio via imageio-ffmpeg, loads with librosa, and classifies patches using YAMNet (TensorFlow Hub).

  • Filters speech-like classes; retains: alarm, applause, honking, gunshot, laughter, music, siren, explosion, cheering, glass breaking
  • Merges adjacent detections to prevent fragmented captions for the same continuous sound
  • Uses a bundled ffmpeg binary — no global install required
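
The merge step in the second bullet can be illustrated like this; the 0.5 s gap threshold and the tuple representation are assumptions, not audio.py's actual values:

```python
def merge_adjacent(events, gap=0.5):
    """Merge same-label detections whose silence gap is below `gap`
    seconds, so one continuous sound yields a single caption.
    `events` is a list of (label, start, end) sorted by start time."""
    merged = []
    for label, start, end in events:
        if merged and merged[-1][0] == label and start - merged[-1][2] <= gap:
            prev_label, prev_start, prev_end = merged[-1]
            merged[-1] = (label, prev_start, max(prev_end, end))
        else:
            merged.append((label, start, end))
    return merged

print(merge_adjacent([("siren", 0.0, 1.0), ("siren", 1.2, 2.4),
                      ("music", 3.0, 4.0)]))
# [('siren', 0.0, 2.4), ('music', 3.0, 4.0)]
```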

4. Visual reaction detection — vision.py

Samples video frames around each audio event's timestamp and computes two signals:

  • Motion score — Farneback Optical Flow measures pixel-level movement magnitude
  • Face/head shift score — MediaPipe tracks facial position, with an OpenCV Haar Cascade as the fallback when MediaPipe is unavailable

Outputs a ReactionSignal per event.
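
As a rough NumPy-only stand-in for the Farneback flow magnitude (vision.py itself uses OpenCV's optical flow), a normalized motion score over sampled frames might look like:

```python
import numpy as np

def motion_score(frames):
    """Mean absolute inter-frame difference, normalized to 0..1.
    A crude illustrative proxy for optical-flow magnitude."""
    if len(frames) < 2:
        return 0.0
    diffs = [np.abs(b.astype(float) - a.astype(float)).mean()
             for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs)) / 255.0

still = [np.zeros((4, 4), dtype=np.uint8)] * 3
moving = [np.full((4, 4), v, dtype=np.uint8) for v in (0, 128, 255)]
print(motion_score(still), motion_score(moving))
# 0.0 0.5
```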

5. Decision engine — decision.py

decision_score = (audio_confidence × 0.55)
               + (reaction_confidence × 0.45)
               + 0.12  # if high-impact label

High-impact labels: alarm, explosion, glass breaking, gunshot, honking, scream, siren

Low-confidence background sounds with no visual reaction are suppressed. High-impact events can bypass a weak reaction score.
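
The formula translates directly to code. The 0.55/0.45 weights, the 0.12 bonus, the high-impact label set, and the 0.55 decision threshold all come from this PR description; the helper names are hypothetical:

```python
HIGH_IMPACT = {"alarm", "explosion", "glass breaking", "gunshot",
               "honking", "scream", "siren"}

def decision_score(label, audio_confidence, reaction_confidence):
    """Weighted combination of audio and visual evidence."""
    score = audio_confidence * 0.55 + reaction_confidence * 0.45
    if label in HIGH_IMPACT:
        score += 0.12  # lets high-impact events bypass a weak reaction score
    return score

def accept(label, audio_conf, reaction_conf, threshold=0.55):
    return decision_score(label, audio_conf, reaction_conf) >= threshold

print(round(decision_score("gunshot", 0.9698, 0.4640), 3))  # 0.862
print(accept("music", 0.40, 0.10))  # low-confidence background: False
```

Note the gunshot example reproduces the decision_score shown in the .sls sample below, which suggests the weights in the PR text match the shipped defaults.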

6. Output generation — output.py

.srt — Standard SubRip, compatible with all major players and editing tools.

1
00:00:00,000 --> 00:00:00,480
[music]
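
SubRip timestamps use the HH:MM:SS,mmm layout shown above; a small formatter (hypothetical helper names, not necessarily output.py's) could look like:

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm layout SubRip requires."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_block(index, start, end, text):
    """One numbered SubRip cue, matching the sample above."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_block(1, 0.0, 0.48, "[music]"))
```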

.sls — Structured JSON with full metadata per suggestion: label, timestamps, audio_confidence, reaction_confidence, decision_score, reason (audio+visual / high-impact-audio).

{
    "index": 4,
    "label": "gunshot",
    "text": "[gunshot]",
    "start": 12.0,
    "end": 18.72,
    "audio_confidence": 0.9698405265808105,
    "reaction_confidence": 0.4639640204182693,
    "decision_score": 0.862196098807667,
    "reason": "audio+visual"
}

Video Demonstration

Intelligent.CC.Generation.MVP.Demo.-.Anmol.Varshney.mp4


anmol457 commented May 7, 2026

@keerthiseelan-planetread @abinash-sketch I have implemented the MVP and attached the demonstration video. Please review it.
