feat: Implement foundational MVP pipeline for Whisper-based closed caption generation #4
ravencore06 wants to merge 8 commits
Conversation
@keerthiseelan-planetread, could you please review this MVP and let me know whether the direction aligns with the project goals?
Let me know when we can connect.
Hi @abinash-sketch, thank you for reaching out. My email is srinidhisadhanala@gmail.com and my timings are flexible, so please feel free to connect whenever it is convenient for you.
Module 1: Sound Event Detection Demo
Detects sounds such as car horns, gunshots, dog barks, and sirens from audio using YAMNet + MediaPipe, and outputs timestamps with confidence scores.
Files: sound_event_detection.py (core), demo_module1.py (runner), create_demo_samples.py (downloads 6 sample clips)
To run: pip install -r requirements.txt
Sample inputs: Demo video:
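To illustrate the shape of Module 1's output, here is a minimal sketch of the post-processing step, assuming the classifier emits (timestamp, label, confidence) tuples. The function name and threshold are hypothetical, not taken from sound_event_detection.py.

```python
# Hypothetical post-processing of classifier output; the real
# sound_event_detection.py may differ. Each detection is
# (timestamp_sec, label, confidence), as a YAMNet-style audio
# classifier would produce per analysis window.

def filter_detections(detections, threshold=0.5):
    """Keep detections at or above the confidence threshold,
    sorted by timestamp."""
    kept = [d for d in detections if d[2] >= threshold]
    return sorted(kept, key=lambda d: d[0])

detections = [
    (3.2, "Dog bark", 0.91),
    (1.0, "Car horn", 0.34),   # below threshold, dropped
    (7.5, "Siren", 0.78),
]
print(filter_detections(detections))
# [(3.2, 'Dog bark', 0.91), (7.5, 'Siren', 0.78)]
```

Thresholding like this keeps spurious low-confidence classes out of the caption stream.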
Module 2: Visual Scene/Action Detection
Detects visual events from video frames: objects such as person, vehicle, helicopter, and dog (via MediaPipe Image Classifier / MobileNet).
To run: python demo_module2.py "Avengers vs Ultron.mp4"
Input video: https://drive.google.com/file/d/1mcOrIeyXr_A3lWq6fktQUJ14eS8vw6xQ/view?usp=sharing
Demo video: https://drive.google.com/file/d/1mMGjgSg7qD4oFw_Npz7l8Omrud60FCmx/view?usp=sharing
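A per-frame classifier emits one label per sampled frame, so a caption pipeline needs to collapse consecutive identical labels into timed events. This is a hypothetical sketch of that step; the function name, data shapes, and frame rate are assumptions, not Module 2's actual code.

```python
# Hypothetical helper: collapse consecutive per-frame labels
# (one top label per sampled frame, as a MobileNet-style image
# classifier might emit) into (start_sec, end_sec, label) events.

def frames_to_events(labels, fps=1.0):
    """labels: one top label per sampled frame, in order."""
    events = []
    for i, label in enumerate(labels):
        t = i / fps
        if events and events[-1][2] == label:
            # Same label as the running event: extend it.
            events[-1] = (events[-1][0], t + 1 / fps, label)
        else:
            # New label: open a new event one frame long.
            events.append((t, t + 1 / fps, label))
    return events

print(frames_to_events(["person", "person", "dog", "dog", "dog"]))
# [(0.0, 2.0, 'person'), (2.0, 5.0, 'dog')]
```

Grouping this way turns noisy frame-level predictions into spans that can be rendered as on-screen captions.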
Module 3 takes sounds + visuals + speech and combines them into a single subtitle file, deciding which caption is most important at every moment. If a gunshot happens during dialogue, the gunshot shows first; the speech waits until it is quiet again. It outputs a combined subtitle file.
Input video: https://drive.google.com/file/d/1mcOrIeyXr_A3lWq6fktQUJ14eS8vw6xQ/view?usp=sharing
Demo video: Screen.Recording.2026-05-12.140457.mp4
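The prioritisation rule described above (sound events outrank dialogue, which is delayed until the event ends) can be sketched as follows. The data shapes and function name are assumptions for illustration, not Module 3's actual API.

```python
# Hypothetical sketch of the merge rule: non-speech sound events
# win; any overlapping speech caption is pushed back to start
# after the sound event, keeping its original duration.

def merge_captions(speech, sounds):
    """speech/sounds: lists of (start_sec, end_sec, text)."""
    merged = list(sounds)
    for start, end, text in speech:
        for s_start, s_end, _ in sounds:
            if start < s_end and s_start < end:
                # Overlap: delay the dialogue, keep its duration.
                start, end = s_end, s_end + (end - start)
        merged.append((start, end, text))
    return sorted(merged, key=lambda c: c[0])

captions = merge_captions(
    speech=[(1.0, 3.0, "We need to leave.")],
    sounds=[(1.5, 2.0, "[gunshot]")],
)
print(captions)
# [(1.5, 2.0, '[gunshot]'), (2.0, 4.0, 'We need to leave.')]
```

Here the gunshot caption appears at its real time and the dialogue is shifted to begin once the gunshot caption ends, matching the behaviour described in the demo.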
This PR introduces the foundational Minimum Viable Product (MVP) pipeline for the Intelligent Closed Caption Generation tool. It establishes the core project structure, dependencies, and functional modules to ingest video, transcribe speech via AI, and output precisely timestamped subtitle files.
Fixes #2
Key Changes
- Project structure: a src/ directory separating audio extraction, ASR, and subtitle formatting.
- Audio extraction (src/audio_extractor.py): integrated moviepy to dynamically strip audio tracks from incoming video files and convert them to 16 kHz WAV format (optimal for Whisper).
- Speech recognition (src/speech_to_text.py): integrated openai-whisper for highly accurate speech-to-text with precise segment timestamps.
- FFmpeg bundling: uses imageio-ffmpeg to automatically expose bundled FFmpeg binaries to Whisper. This prevents the pipeline from crashing on Windows machines that do not have FFmpeg manually installed in their system PATH.
- Subtitle generation (src/subtitle_generator.py): created a custom formatter to convert raw Whisper segment timestamps into standard SubRip (.srt) syntax.
- CLI (main.py): added a simple command-line interface utilizing argparse to allow easy execution of the pipeline.

How to Test
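One quick sanity check for the subtitle output is the timestamp formatting: SubRip uses HH:MM:SS,mmm with a comma before the milliseconds. This helper is hypothetical (subtitle_generator.py may implement it differently) but shows the expected conversion from Whisper's float-second timestamps.

```python
# Hypothetical helper for checking SRT timestamp formatting.
# Whisper segments carry start/end times as float seconds;
# SubRip wants HH:MM:SS,mmm (comma-separated milliseconds).

def to_srt_time(seconds):
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)   # hours
    m, ms = divmod(ms, 60_000)      # minutes
    s, ms = divmod(ms, 1_000)       # seconds / milliseconds
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(to_srt_time(3671.5))   # 01:01:11,500
print(to_srt_time(0.042))    # 00:00:00,042
```

Comparing a few generated .srt timestamps against this formula is a fast way to confirm the output is valid SubRip.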