Add intelligent CC suggestion pipeline#17
Open
Uneeb808 wants to merge 6 commits into
Conversation
Author
Here is the implementation: c4gt-video.mp4
Author
Please review some of the important changes I made.
PR: Intelligent CC Suggestion Tool — Module 1 Complete
Summary
This PR delivers a fully working Module 1 (sound event detection with confidence scores and timestamps → SRT/SLS output) and lays the architectural groundwork for Module 2 (visual reaction detection). The pipeline accepts any video file and produces closed-caption suggestions for meaningful non-speech audio events, without over-captioning ambient sounds.
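To make the SRT output stage concrete, here is a minimal sketch of turning timestamped detections into SRT cues. The function names and tuple layout are illustrative assumptions, not the PR's actual code; bracketed labels follow the usual CC convention for non-speech sounds.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def events_to_srt(events) -> str:
    """Render (start_s, end_s, label) tuples as numbered SRT cues.

    Labels are wrapped in brackets, the common closed-caption
    convention for non-speech sounds, e.g. "[dog barking]".
    """
    blocks = []
    for i, (start, end, label) in enumerate(events, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n[{label}]\n"
        )
    return "\n".join(blocks)
```

A detection like `(1.5, 3.0, "dog barking")` becomes a cue spanning `00:00:01,500 --> 00:00:03,000`.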
Unlike raw sound-event pipelines, this system is optimized specifically for accessibility-oriented CC generation by suppressing ambient, speech-leakage, and low-context detections.
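The suppression policy described above might look like the following sketch. The label sets and the 0.5 confidence threshold are assumptions for illustration; the PR's actual blocklists and thresholds may differ.

```python
# Assumed blocklists: ambient sounds we never caption, and speech labels
# that belong to the ASR track rather than non-speech CC.
AMBIENT_LABELS = {"silence", "white noise", "room tone", "hum"}
SPEECH_LABELS = {"speech", "conversation", "narration"}

def keep_event(label: str, confidence: float, min_conf: float = 0.5) -> bool:
    """Keep only confident, non-ambient, non-speech detections."""
    lbl = label.lower()
    if lbl in AMBIENT_LABELS or lbl in SPEECH_LABELS:
        return False  # suppress ambient and speech-leakage detections
    return confidence >= min_conf  # suppress low-confidence detections
```

With this filter, a confident "Dog barking" passes, while "Speech" is routed away from CC regardless of its score.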
What this PR includes
Pipeline Architecture
Run
📎 Colab links: https://colab.research.google.com/drive/1aAbBrZBw1xg8ASqS98lyCewVWRSZb_Bj?usp=sharing,
https://colab.research.google.com/drive/15kpMJkWYWQO0sBoJZhYFMqcRBbLLzVMy?usp=sharing
Research: Benchmark Across 5 Model Families
Before settling on YAMNet as the production solution, I benchmarked five model families. Here's what I found:
WAV2CLIP + CLAP — Not viable
Both models embed audio into CLIP/text space and score against text prompts via cosine similarity. In theory, this allows free-form labels; in practice:
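For reference, the cosine-similarity scoring these models rely on reduces to the sketch below: normalize the audio and prompt embeddings and take dot products. The 512-dimensional size and random vectors are placeholders, not the models' real embeddings.

```python
import numpy as np

def cosine_scores(audio_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Score one audio embedding against each text-prompt embedding.

    audio_emb: shape (d,); text_embs: shape (n_prompts, d).
    Returns n_prompts cosine similarities in [-1, 1].
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return t @ a

# Placeholder embeddings standing in for CLAP/WAV2CLIP outputs.
rng = np.random.default_rng(0)
audio = rng.normal(size=512)
prompts = rng.normal(size=(3, 512))
scores = cosine_scores(audio, prompts)
```

The predicted label is simply `argmax(scores)`, which is why these models struggle when no prompt describes the sound well: something always wins the argmax.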
PANNs CNN14 — Better mAP, but wrong fit for CC
Despite the stronger mAP (0.385 vs YAMNet's 0.306), it was outperformed by YAMNet in practice because it over-predicted broad labels (e.g., "animal sound" instead of "dog barking"); it is not built for specific event detection.
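One way to counter that over-prediction of broad labels is to drop a parent label whenever a more specific child is also detected. The tiny ontology slice below is a hypothetical stand-in for an AudioSet-style hierarchy, not code from this PR.

```python
# Hypothetical slice of an AudioSet-style parent/child label ontology.
PARENT = {"dog barking": "animal sound", "cat meowing": "animal sound"}

def prefer_specific(detections):
    """Drop broad parent labels covered by a more specific child.

    detections: list of (label, confidence) tuples.
    """
    labels = {label for label, _ in detections}
    parents_covered = {PARENT[l] for l in labels if l in PARENT}
    return [d for d in detections if d[0] not in parents_covered]
```

So if both "animal sound" and "dog barking" fire on the same window, only the specific "dog barking" caption survives.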
Qwen2-Audio-7B — Most promising, fine-tune path forward
Qwen2-Audio is a 7B audio-language model (Whisper-large-v2 encoder + LLM). Instead of cosine similarity or fixed class
indices, it reasons about audio in natural language and returns structured JSON.
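Consuming that structured JSON might look like the sketch below. The field names (`events`, `label`, `start`, `end`, `confidence`) are illustrative assumptions about the prompt's requested schema, not Qwen2-Audio's fixed output format, so the parser validates defensively.

```python
import json

def parse_events(response_text: str):
    """Parse a JSON reply into (start, end, label) tuples.

    Drops malformed entries rather than crashing, since LLM output
    is not guaranteed to follow the requested schema exactly.
    """
    data = json.loads(response_text)
    events = []
    for e in data.get("events", []):
        if {"label", "start", "end"} <= e.keys():
            events.append((float(e["start"]), float(e["end"]), e["label"]))
    return events

# Example reply; the schema here is assumed, set by the prompt.
raw = '{"events": [{"label": "glass breaking", "start": 4.2, "end": 5.0, "confidence": 0.91}]}'
events = parse_events(raw)
```

The resulting tuples feed the same SRT stage as the YAMNet path, which is what makes Qwen2-Audio a drop-in second option.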
It came closest to YAMNet's accuracy and was its most promising competitor, but at the cost of a huge size difference, which matters in production. We used 4-bit quantization to shrink it enough to run on Colab; with further tweaking and tuning it could match or even outperform YAMNet.
I will keep researching it and treat it as a second option for now.
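The size gap that motivates the 4-bit quantization can be quantified with a quick back-of-envelope calculation (weights only; activation memory and the ~0.5 bits/param of quantization metadata are rough assumptions):

```python
# Approximate weight memory for a 7B-parameter model at different precisions.
PARAMS = 7e9

def weight_gb(bits_per_param: float) -> float:
    """Weight storage in GB (decimal) at a given bits-per-parameter."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16_gb = weight_gb(16)   # ~14 GB: does not fit a free Colab T4 (16 GB) once activations are added
int4_gb = weight_gb(4.5)  # ~3.9 GB: 4-bit weights plus assumed quantization overhead
```

That roughly 3.5x reduction is what brings a 7B audio-language model within reach of a free Colab GPU, while YAMNet (a few million parameters) fits trivially.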
Full Comparison Table
cc @abinash-sketch @keerthiseelan-planetread