Add intelligent CC suggestion pipeline #17

Open

Uneeb808 wants to merge 6 commits into PlanetRead:main from Uneeb808:yamnet-cc-detector-clean


Conversation


Uneeb808 commented May 9, 2026

PR: Intelligent CC Suggestion Tool — Module 1 Complete

Summary

This PR delivers a fully working Module 1 (sound event detection with confidence scores and timestamps → SRT/SLS output) and lays the architectural groundwork for Module 2 (visual reaction detection). The pipeline accepts any video file and produces closed-caption suggestions for meaningful non-speech audio events, without over-captioning ambient sounds.

Unlike raw sound-event pipelines, this system is optimized specifically for accessibility-oriented CC generation by suppressing ambient, speech-leakage, and low-context detections.

What this PR includes

  • Full YAMNet-based sound event detection pipeline with a transient-aware 3-path filter to catch short sounds while reducing ambient false positives
  • English + Hindi SRT/SLS/JSON/CSV export, fully offline with no translation API
  • Silero VAD speech suppression so speech frames never become false CC events
  • librosa onset pass to recover very short transients (<0.2s) missed by YAMNet windows
  • Accessibility-oriented CC filtering to avoid over-captioning low-context background sounds
  • Architectural groundwork for Module 2 visual reaction scoring
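The transient-aware 3-path filter mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class names, thresholds, and set contents are assumptions for demonstration.

```python
# Sketch of a transient-aware 3-path filter over YAMNet frame predictions.
# All class names, thresholds, and set contents below are illustrative assumptions.

TRANSIENT_CLASSES = {"Dog bark", "Gunshot", "Door slam", "Glass"}  # accepted on a single frame
BLOCKLIST = {"Silence", "Speech", "Inside, small room"}            # never captioned

def accept_event(label, frame_scores, rms, rms_gate=0.01, score_thr=0.5):
    """Decide whether a candidate sound event becomes a CC suggestion.

    frame_scores: per-frame confidence for `label` over a 3-frame window.
    """
    if label in BLOCKLIST or rms < rms_gate:   # path 0: hard gates
        return False
    if label in TRANSIENT_CLASSES:             # path 1: transients accepted immediately
        return max(frame_scores) >= score_thr
    # path 2: sustained sounds (engine, rain, crowd) need 2-of-3 frame consensus
    votes = sum(s >= score_thr for s in frame_scores)
    return votes >= 2

print(accept_event("Dog bark", [0.7, 0.1, 0.1], rms=0.05))  # True  (transient path)
print(accept_event("Rain", [0.7, 0.1, 0.1], rms=0.05))      # False (no consensus)
print(accept_event("Rain", [0.7, 0.6, 0.1], rms=0.05))      # True  (2/3 consensus)
print(accept_event("Speech", [0.9, 0.9, 0.9], rms=0.05))    # False (blocklist)
```

The point of the split is that a single high-confidence frame is enough evidence for a short percussive sound, while sustained ambient classes need agreement across frames before they earn a caption.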

Pipeline Architecture

```
INPUT VIDEO
    │
    ├──▶ AUDIO EXTRACTION (imageio-ffmpeg, no system install needed)
    │           │
    │    ┌──────┴──────┐
    │    │ Silero VAD  │──▶ speech intervals (suppressed from detection)
    │    └─────────────┘
    │           │
    │    ┌──────▼──────────────────────────────────┐
    │    │  YAMNet  ·  RMS gate  ·  Blocklist     │
    │    │                                         │
    │    │  Transient? ──YES──▶ accept immediately │
    │    │      │               (dog bark, gunshot,│
    │    │      NO              door slam, glass…) │
    │    │      ▼                                  │
    │    │  Consensus voting (2/3 frames)          │
    │    │  + onset check (engine, rain, crowd)    │
    │    └──────────────────┬──────────────────────┘
    │                       │
    │           librosa onset pass (catches <0.2s events)
    │                       │
    │           Merge · Deduplicate · Sort
    │
    └──▶ SRT (EN + HI)  ·  SLS  ·  JSON  ·  CSV
```
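The speech-suppression and "Merge · Deduplicate · Sort" stages of the diagram can be sketched in a few lines. The tuple shapes and function names here are illustrative assumptions, not the PR's actual API:

```python
# Sketch of the post-detection stage: drop events that overlap VAD speech
# intervals, then sort and merge overlapping same-label duplicates.
# Event tuples are (start_s, end_s, label); speech intervals are (start_s, end_s).

def overlaps(ev, iv):
    return ev[0] < iv[1] and iv[0] < ev[1]

def postprocess(events, speech_intervals):
    # suppress anything overlapping speech so speech never becomes a CC event
    kept = [e for e in events
            if not any(overlaps(e, iv) for iv in speech_intervals)]
    kept.sort(key=lambda e: (e[0], e[1]))
    merged = []
    for ev in kept:
        # merge consecutive same-label events that overlap in time
        if merged and merged[-1][2] == ev[2] and ev[0] <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], ev[1]), ev[2])
        else:
            merged.append(ev)
    return merged

events = [(5.0, 5.4, "Dog bark"), (1.0, 2.0, "Rain"), (1.5, 3.0, "Rain"),
          (10.0, 10.3, "Door slam")]
speech = [(9.8, 11.0)]
print(postprocess(events, speech))
# [(1.0, 3.0, 'Rain'), (5.0, 5.4, 'Dog bark')]
```

Here the door slam is discarded because it falls inside a speech interval, and the two overlapping rain detections collapse into one caption span.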

Run

```shell
python detect.py --input video.mp4 --srt outputs/cc_en.srt --srt-hi outputs/cc_hi.srt
```
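Downstream of this command, the SRT writer itself is small. A minimal sketch (the event-tuple shape and bracketed-caption style are assumptions; the timestamp layout follows the SubRip format):

```python
# Sketch of the SRT export step: format detected events as numbered SubRip cues.
# The (start_s, end_s, caption) tuple shape is an illustrative assumption.

def srt_timestamp(seconds):
    """Format seconds as the SubRip HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(events):
    """events: iterable of (start_s, end_s, caption) tuples, already sorted."""
    blocks = []
    for i, (start, end, caption) in enumerate(events, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n[{caption}]\n")
    return "\n".join(blocks)

print(to_srt([(1.0, 3.0, "rain"), (5.0, 5.4, "dog barking")]))
```

The Hindi export would reuse the same writer with the manually mapped Hindi caption strings, since no translation API is involved.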

📎 Colab links: https://colab.research.google.com/drive/1aAbBrZBw1xg8ASqS98lyCewVWRSZb_Bj?usp=sharing,
https://colab.research.google.com/drive/15kpMJkWYWQO0sBoJZhYFMqcRBbLLzVMy?usp=sharing


Research: Benchmark Across 5 Model Families

Before settling on YAMNet as the production solution, I benchmarked five model families. Here's what I found:


WAV2CLIP + CLAP — Not viable

Both models embed audio into a shared audio/text space and score clips against text prompts via cosine similarity. In theory this allows free-form labels; in practice:

  • CLAP (HTSAT-base) — repeated load failures due to architecture and config mismatches; not reliable enough for production
  • WAV2CLIP — better suited as a multimodal embedding component than as a standalone environmental sound detector
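The prompt-scoring scheme both models share reduces to cosine similarity between one audio embedding and several text embeddings. A numpy-only sketch, where the random vectors stand in for real CLAP/WAV2CLIP embeddings:

```python
import numpy as np

# Sketch of prompt-based zero-shot scoring: rank text labels by cosine
# similarity to an audio embedding in a shared space. The vectors below
# are random stand-ins for real CLAP / WAV2CLIP model outputs.

def cosine_scores(audio_emb, text_embs):
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return t @ a   # one cosine similarity per prompt

rng = np.random.default_rng(0)
prompts = ["a dog barking", "rain falling", "glass breaking"]
text_embs = rng.normal(size=(3, 512))
audio_emb = text_embs[0] + 0.1 * rng.normal(size=512)  # "sounds like" prompt 0

scores = cosine_scores(audio_emb, text_embs)
print(prompts[int(np.argmax(scores))])  # a dog barking
```

The weakness for CC work is visible in this shape: every clip always gets a ranked label list, so without careful thresholds the approach happily captions ambient noise.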

PANNs CNN14 — Better mAP, but wrong fit for CC

Despite the stronger AudioSet mAP (0.385 vs. YAMNet's 0.306), it was outperformed by YAMNet for this use case: it over-predicts broad labels (e.g. "Animal" instead of "Dog barking") because it is not tuned for specific event detection.

Qwen2-Audio-7B — Most promising, fine-tune path forward

Qwen2-Audio is a 7B audio-language model (Whisper-large-v2 encoder + LLM). Instead of cosine similarity or fixed class indices, it reasons about audio in natural language and returns structured JSON.

It came closest to YAMNet's accuracy and was its most promising competitor, but the huge size difference matters in production. We used 4-bit quantization to shrink it enough to run on Colab; with further tuning it could plausibly match or even outperform YAMNet.

I will keep researching it and treat it as the second option for now.
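Because Qwen2-Audio returns structured JSON rather than class indices, the glue code mostly reduces to parsing that reply into the same event tuples the rest of the pipeline uses. A minimal sketch — the response schema shown is an assumption about what the model is prompted to return, not a guaranteed output format:

```python
import json

# Sketch: convert a structured JSON reply from an audio-language model into
# (start, end, label, confidence) event tuples. The schema below is an
# illustrative assumption about the prompted output, not Qwen2-Audio's spec.

raw_reply = """
{"events": [
  {"label": "dog barking", "start": 1.2, "end": 1.8, "confidence": 0.91},
  {"label": "ambient hum", "start": 0.0, "end": 30.0, "confidence": 0.40}
]}
"""

def parse_events(reply, min_conf=0.5):
    data = json.loads(reply)
    return [(e["start"], e["end"], e["label"], e["confidence"])
            for e in data.get("events", [])
            if e.get("confidence", 0.0) >= min_conf]  # drop low-context detections

print(parse_events(raw_reply))
# [(1.2, 1.8, 'dog barking', 0.91)]
```

Keeping this boundary as plain JSON means the YAMNet path and a future Qwen path could feed the same merge/export stages unchanged.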

Full Comparison Table

|                        | YAMNet            | PANNs CNN14              | CLAP           | WAV2CLIP       | Qwen2-Audio           |
|------------------------|-------------------|--------------------------|----------------|----------------|-----------------------|
| AudioSet mAP           | 0.306             | 0.385                    | ~0.47          | ~0.40          | LLM-based             |
| Parameters             | 3.7M              | 81M                      | 87M            | ~60M           | 7B                    |
| Label type             | Fixed 521         | Fixed 527                | Free text      | Free text      | Free text + reasoning |
| Hindi support          | manual map        | manual map               | via prompt     | via prompt     | native                |
| False-positive control | gates + blocklist | blocklist too aggressive | mAP ceiling    | inconsistent   | LLM reasoning         |
| Speed                  | fastest           | fast                     | fast           | moderate       | slow                  |
| Offline                | ✅                | ✅                       | ✅             | ✅             | ✅ (quantized)        |
| Verdict                | ✅ production     | ❌ noisy for CC          | ❌ load errors | ❌ low quality | 🔬 fine-tune target   |

cc @abinash-sketch @keerthiseelan-planetread


Uneeb808 commented May 9, 2026

Here is the implementation:

c4gt-video.mp4

@Uneeb808

Please review some of the important changes I made.
