feat: Implement foundational MVP pipeline for Whisper-based closed caption generation#4

Open
ravencore06 wants to merge 8 commits into PlanetRead:main from ravencore06:cc-audio-extraction

Conversation

@ravencore06

This PR introduces the foundational Minimum Viable Product (MVP) pipeline for the Intelligent Closed Caption Generation tool. It establishes the core project structure, dependencies, and functional modules to ingest video, transcribe speech via AI, and output precisely timed subtitle files.

Fixes #2

Key Changes

  • Project Scaffolding: Established a clean, modular src/ directory structure separating audio extraction, ASR, and subtitle formatting.
  • Audio Extraction (src/audio_extractor.py): Integrated moviepy to dynamically strip audio tracks from incoming video files and convert them to 16 kHz WAV, the sample rate Whisper expects.
  • ASR Engine (src/speech_to_text.py): Integrated openai-whisper for highly accurate speech-to-text with precise segment timestamps.
  • Zero-Config FFmpeg Handling: Engineered a runtime injection of imageio-ffmpeg to automatically expose bundled FFmpeg binaries to Whisper. This prevents the pipeline from crashing on Windows machines that do not have FFmpeg manually installed in their system PATH.
  • Subtitle Generation (src/subtitle_generator.py): Created a custom formatter to convert raw Whisper segment timestamps into standard SubRip (.srt) syntax.
  • CLI Entry Point (main.py): Added a simple command-line interface utilizing argparse to allow easy execution of the pipeline.
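The actual code in src/subtitle_generator.py is not shown in this description; as a rough sketch of the segment-to-SRT conversion it describes (function names here are hypothetical, and `segments` mirrors the shape of openai-whisper's `result["segments"]`):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SubRip 'HH:MM:SS,mmm' timestamp."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render an SRT document: index, 'start --> end', text, blank line."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

SRT uses a comma (not a period) before the milliseconds field, which is the one detail most hand-rolled formatters get wrong.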

How to Test

  1. Pull this branch and ensure your virtual environment is active.
  2. Install the new dependencies:
    pip install -r requirements.txt
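Note that no manual FFmpeg install is needed. The zero-config handling described under Key Changes can be sketched roughly as below; the function name is hypothetical, and in the real pipeline the binary path would come from imageio-ffmpeg's `get_ffmpeg_exe()`:

```python
import os

def expose_ffmpeg_on_path(ffmpeg_exe: str) -> None:
    """Prepend the directory holding an FFmpeg binary to PATH so that
    subprocess lookups (e.g. Whisper's audio loader) can find it."""
    ffmpeg_dir = os.path.dirname(ffmpeg_exe)
    os.environ["PATH"] = ffmpeg_dir + os.pathsep + os.environ.get("PATH", "")

# In the pipeline this would be driven by the bundled binary:
#   import imageio_ffmpeg
#   expose_ffmpeg_on_path(imageio_ffmpeg.get_ffmpeg_exe())
```

Prepending (rather than appending) ensures the bundled binary wins even if a broken FFmpeg is already on the system PATH.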

@ravencore06
Author

@keerthiseelan-planetread , could you please review this MVP and suggest if the direction aligns with project goals?

@abinash-sketch

Let me know when we can connect.

@ravencore06
Author

ravencore06 commented May 7, 2026

Hi @abinash-sketch, thank you for reaching out.
I’d be happy to connect and discuss the project further.

Email: srinidhisadhanala@gmail.com
LinkedIn: https://www.linkedin.com/in/srinidhi-sadhanala-raven21

My schedule is flexible, so please feel free to connect whenever is convenient for you.

@ravencore06
Author

ravencore06 commented May 8, 2026

Module 1: Sound Event Detection Demo

Detects sounds like car horns, gunshots, dog barks, and sirens from audio using YAMNet + MediaPipe. Outputs timestamps and confidence scores.

Files: sound_event_detection.py (core), demo_module1.py (runner), create_demo_samples.py (downloads 6 sample clips)

To run:

pip install -r requirements.txt
python create_demo_samples.py && python demo_module1.py --all
Results: All 6 test sounds detected correctly (car horn 80%, siren 67%, dog bark 85%, glass breaking 33%, gunshot 89%, engine 67%).
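The core of sound_event_detection.py is not shown here; a minimal sketch of the post-processing step it implies (collapsing per-window classifier scores into timestamped events) might look like the following. The function name and the 0.48 s hop are assumptions — 0.48 s is YAMNet's approximate frame hop, and the real module runs the model via MediaPipe:

```python
def scores_to_events(frames, hop_s=0.48, threshold=0.3):
    """Collapse per-window (label, score) predictions into timestamped events.

    `frames` is a list of (label, score) pairs, one per analysis window.
    Consecutive windows with the same label above `threshold` are merged
    into one event carrying start/end times and the peak confidence.
    """
    events, current = [], None
    for i, (label, score) in enumerate(frames):
        t = i * hop_s
        if score >= threshold and current and current["label"] == label:
            # extend the open event and keep the best score seen
            current["end"] = t + hop_s
            current["confidence"] = max(current["confidence"], score)
        elif score >= threshold:
            if current:
                events.append(current)
            current = {"label": label, "start": t, "end": t + hop_s,
                       "confidence": score}
        else:
            if current:
                events.append(current)
            current = None
    if current:
        events.append(current)
    return events
```

Merging adjacent windows is what turns raw model output into the single "car horn 80%" style entries reported above.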

Sample inputs:
car_horn.wav

dog_bark.wav

glass_break.wav

Demo video:
https://drive.google.com/file/d/1t2Wag1vHcxlU2NVxShGO8zol5845mjb0/view?usp=sharing

@ravencore06
Author

Module 2: Visual Scene/Action Detection

Detects visual events from video frames:

  • Objects: person, vehicle, helicopter, dog, etc. (via MediaPipe Image Classifier / MobileNet)
  • Actions: punching, fall down (via MediaPipe Pose keypoint analysis)
  • Scene changes: camera cuts, location switches (via histogram comparison)

Outputs events with timestamps as JSON.
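The histogram-comparison idea behind the scene-change detector can be sketched without the video stack; this dependency-free version (hypothetical names, grayscale pixel lists standing in for decoded frames) uses histogram intersection, one of the standard comparison metrics:

```python
def histogram(frame, bins=16):
    """Normalized grayscale histogram of a frame (iterable of 0-255 pixels)."""
    counts = [0] * bins
    for px in frame:
        counts[min(px * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]

def is_scene_cut(prev_frame, frame, threshold=0.5):
    """Flag a cut when histogram intersection drops below `threshold`.

    Intersection is sum(min(h1, h2)) over bins: 1.0 for identical
    pixel distributions, near 0.0 for completely different ones.
    """
    h1, h2 = histogram(prev_frame), histogram(frame)
    return sum(min(a, b) for a, b in zip(h1, h2)) < threshold
```

In practice the real module would compute histograms on decoded video frames (e.g. via OpenCV), but the cut decision reduces to this comparison.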

To run: python demo_module2.py "Avengers vs Ultron.mp4"

Input video:

https://drive.google.com/file/d/1mcOrIeyXr_A3lWq6fktQUJ14eS8vw6xQ/view?usp=sharing

Demo video:

https://drive.google.com/file/d/1mMGjgSg7qD4oFw_Npz7l8Omrud60FCmx/view?usp=sharing

@ravencore06
Author

Module 3 takes sounds + visuals + speech and combines them into a single subtitle file.

It decides which caption is most important at every moment. If a gunshot happens during dialogue, the gunshot shows first, and the speech waits until it's quiet again. It outputs a .srt file you can load into VLC or any other video player.
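The prioritization described above can be sketched as a simple merge where sound events preempt overlapping speech; the function name and dict shape are hypothetical, not the actual Module 3 code:

```python
def merge_captions(speech, sounds):
    """Merge speech and sound-event captions so sounds take priority.

    Both inputs are lists of dicts with 'start', 'end', 'text' (seconds).
    When a sound event overlaps a speech segment, the speech caption is
    delayed until the sound ends; the result is one sorted caption track.
    """
    merged = [dict(s) for s in sounds]
    for seg in speech:
        seg = dict(seg)
        for snd in sounds:
            # intervals overlap: push the speech caption past the sound
            if seg["start"] < snd["end"] and snd["start"] < seg["end"]:
                shift = snd["end"] - seg["start"]
                seg["start"] += shift
                seg["end"] += shift
        merged.append(seg)
    return sorted(merged, key=lambda c: c["start"])
```

The merged track can then be fed through the same SRT formatter used for speech-only output.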

Input video: https://drive.google.com/file/d/1mcOrIeyXr_A3lWq6fktQUJ14eS8vw6xQ/view?usp=sharing

Demo video:

Screen.Recording.2026-05-12.140457.mp4



Development

Successfully merging this pull request may close these issues.

[DMP 2026]: Create Intelligent Closed Caption (CC) Suggestion Tool

2 participants