Add intelligent CC suggestion pipeline #17

Open

Uneeb808 wants to merge 6 commits into PlanetRead:main from Uneeb808:yamnet-cc-detector-clean


Conversation


Uneeb808 commented May 9, 2026

PR: Intelligent CC Suggestion Tool — Module 1 Complete

Summary

This PR delivers a fully working Module 1 (sound event detection with confidence scores and timestamps → SRT/SLS output) and lays the architectural groundwork for Module 2 (visual reaction detection). The pipeline accepts any video file and produces closed-caption suggestions for meaningful non-speech audio events, without over-captioning ambient sounds.

Unlike raw sound-event pipelines, this system is optimized specifically for accessibility-oriented CC generation by suppressing ambient, speech-leakage, and low-context detections.

What this PR includes

  • Full YAMNet-based sound event detection pipeline with a transient-aware 3-path filter to catch short sounds while reducing ambient false positives
  • English + Hindi SRT/SLS/JSON/CSV export, fully offline with no translation API
  • Silero VAD speech suppression so speech frames never become false CC events
  • librosa onset pass to recover very short transients (<0.2s) missed by YAMNet windows
  • Accessibility-oriented CC filtering to avoid over-captioning low-context background sounds
  • Architectural groundwork for Module 2 visual reaction scoring
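The transient-aware 3-path filter mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class names, thresholds, and set contents are assumptions for demonstration.

```python
# Sketch of a transient-aware 3-path filter over YAMNet frame predictions.
# All class names, thresholds, and set contents below are illustrative assumptions.

TRANSIENT_CLASSES = {"Dog bark", "Gunshot", "Door slam", "Glass"}  # accepted on a single frame
BLOCKLIST = {"Silence", "Speech", "Inside, small room"}            # never captioned

def accept_event(label, frame_scores, rms, rms_gate=0.01, score_thr=0.5):
    """Decide whether a candidate sound event becomes a CC suggestion.

    frame_scores: per-frame confidence for `label` over a 3-frame window.
    """
    if label in BLOCKLIST or rms < rms_gate:   # path 0: hard gates
        return False
    if label in TRANSIENT_CLASSES:             # path 1: transients accepted immediately
        return max(frame_scores) >= score_thr
    # path 2: sustained sounds (engine, rain, crowd) need 2-of-3 frame consensus
    votes = sum(s >= score_thr for s in frame_scores)
    return votes >= 2

print(accept_event("Dog bark", [0.7, 0.1, 0.1], rms=0.05))  # True  (transient path)
print(accept_event("Rain", [0.7, 0.1, 0.1], rms=0.05))      # False (no consensus)
print(accept_event("Rain", [0.7, 0.6, 0.1], rms=0.05))      # True  (2/3 consensus)
print(accept_event("Speech", [0.9, 0.9, 0.9], rms=0.05))    # False (blocklist)
```

The point of the split is that a single high-confidence frame is enough evidence for a short percussive sound, while sustained ambient classes need agreement across frames before they earn a caption.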

Pipeline Architecture

```
INPUT VIDEO
    │
    ├──▶ AUDIO EXTRACTION (imageio-ffmpeg, no system install needed)
    │           │
    │    ┌──────┴──────┐
    │    │ Silero VAD  │──▶ speech intervals (suppressed from detection)
    │    └─────────────┘
    │           │
    │    ┌──────▼──────────────────────────────────┐
    │    │  YAMNet  ·  RMS gate  ·  Blocklist     │
    │    │                                         │
    │    │  Transient? ──YES──▶ accept immediately │
    │    │      │               (dog bark, gunshot,│
    │    │      NO              door slam, glass…) │
    │    │      ▼                                  │
    │    │  Consensus voting (2/3 frames)          │
    │    │  + onset check (engine, rain, crowd)    │
    │    └──────────────────┬──────────────────────┘
    │                       │
    │           librosa onset pass (catches <0.2s events)
    │                       │
    │           Merge · Deduplicate · Sort
    │
    └──▶ SRT (EN + HI)  ·  SLS  ·  JSON  ·  CSV
```
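The speech-suppression and "Merge · Deduplicate · Sort" stages of the diagram can be sketched in a few lines. The tuple shapes and function names here are illustrative assumptions, not the PR's actual API:

```python
# Sketch of the post-detection stage: drop events that overlap VAD speech
# intervals, then sort and merge overlapping same-label duplicates.
# Event tuples are (start_s, end_s, label); speech intervals are (start_s, end_s).

def overlaps(ev, iv):
    return ev[0] < iv[1] and iv[0] < ev[1]

def postprocess(events, speech_intervals):
    # suppress anything overlapping speech so speech never becomes a CC event
    kept = [e for e in events
            if not any(overlaps(e, iv) for iv in speech_intervals)]
    kept.sort(key=lambda e: (e[0], e[1]))
    merged = []
    for ev in kept:
        # merge consecutive same-label events that overlap in time
        if merged and merged[-1][2] == ev[2] and ev[0] <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], ev[1]), ev[2])
        else:
            merged.append(ev)
    return merged

events = [(5.0, 5.4, "Dog bark"), (1.0, 2.0, "Rain"), (1.5, 3.0, "Rain"),
          (10.0, 10.3, "Door slam")]
speech = [(9.8, 11.0)]
print(postprocess(events, speech))
# [(1.0, 3.0, 'Rain'), (5.0, 5.4, 'Dog bark')]
```

Here the door slam is discarded because it falls inside a speech interval, and the two overlapping rain detections collapse into one caption span.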

Run

```shell
python detect.py --input video.mp4 --srt outputs/cc_en.srt --srt-hi outputs/cc_hi.srt
```
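Downstream of this command, the SRT writer itself is small. A minimal sketch (the event-tuple shape and bracketed-caption style are assumptions; the timestamp layout follows the SubRip format):

```python
# Sketch of the SRT export step: format detected events as numbered SubRip cues.
# The (start_s, end_s, caption) tuple shape is an illustrative assumption.

def srt_timestamp(seconds):
    """Format seconds as the SubRip HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(events):
    """events: iterable of (start_s, end_s, caption) tuples, already sorted."""
    blocks = []
    for i, (start, end, caption) in enumerate(events, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n[{caption}]\n")
    return "\n".join(blocks)

print(to_srt([(1.0, 3.0, "rain"), (5.0, 5.4, "dog barking")]))
```

The Hindi export would reuse the same writer with the manually mapped Hindi caption strings, since no translation API is involved.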

📎 Colab links: https://colab.research.google.com/drive/1aAbBrZBw1xg8ASqS98lyCewVWRSZb_Bj?usp=sharing,
https://colab.research.google.com/drive/15kpMJkWYWQO0sBoJZhYFMqcRBbLLzVMy?usp=sharing


Research: Benchmark Across 5 Model Families

Before settling on YAMNet as the production solution, I benchmarked five model families. Here's what I found:


WAV2CLIP + CLAP — Not viable

Both models embed audio into a shared audio/text space and score clips against text prompts via cosine similarity. In theory this allows free-form labels; in practice:

  • CLAP (HTSAT-base) — repeated load failures due to architecture and config mismatches; not reliable enough for production
  • WAV2CLIP — better suited as a multimodal embedding component than as a standalone environmental sound detector
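The prompt-scoring scheme both models share reduces to cosine similarity between one audio embedding and several text embeddings. A numpy-only sketch, where the random vectors stand in for real CLAP/WAV2CLIP embeddings:

```python
import numpy as np

# Sketch of prompt-based zero-shot scoring: rank text labels by cosine
# similarity to an audio embedding in a shared space. The vectors below
# are random stand-ins for real CLAP / WAV2CLIP model outputs.

def cosine_scores(audio_emb, text_embs):
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return t @ a   # one cosine similarity per prompt

rng = np.random.default_rng(0)
prompts = ["a dog barking", "rain falling", "glass breaking"]
text_embs = rng.normal(size=(3, 512))
audio_emb = text_embs[0] + 0.1 * rng.normal(size=512)  # "sounds like" prompt 0

scores = cosine_scores(audio_emb, text_embs)
print(prompts[int(np.argmax(scores))])  # a dog barking
```

The weakness for CC work is visible in this shape: every clip always gets a ranked label list, so without careful thresholds the approach happily captions ambient noise.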

PANNs CNN14 — Better mAP, but wrong fit for CC

Despite the stronger AudioSet mAP (0.385 vs. YAMNet's 0.306), it was outperformed by YAMNet for this use case: it over-predicts broad labels (e.g. "Animal" instead of "Dog barking") because it is not tuned for specific event detection.

Qwen2-Audio-7B — Most promising, fine-tune path forward

Qwen2-Audio is a 7B audio-language model (Whisper-large-v2 encoder + LLM). Instead of cosine similarity or fixed class indices, it reasons about audio in natural language and returns structured JSON.

It came closest to YAMNet's accuracy and was its most promising competitor, but the huge size difference matters in production. We used 4-bit quantization to shrink it enough to run on Colab; with further tuning it could plausibly match or even outperform YAMNet.

I will keep researching it and treat it as the second option for now.
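Because Qwen2-Audio returns structured JSON rather than class indices, the glue code mostly reduces to parsing that reply into the same event tuples the rest of the pipeline uses. A minimal sketch — the response schema shown is an assumption about what the model is prompted to return, not a guaranteed output format:

```python
import json

# Sketch: convert a structured JSON reply from an audio-language model into
# (start, end, label, confidence) event tuples. The schema below is an
# illustrative assumption about the prompted output, not Qwen2-Audio's spec.

raw_reply = """
{"events": [
  {"label": "dog barking", "start": 1.2, "end": 1.8, "confidence": 0.91},
  {"label": "ambient hum", "start": 0.0, "end": 30.0, "confidence": 0.40}
]}
"""

def parse_events(reply, min_conf=0.5):
    data = json.loads(reply)
    return [(e["start"], e["end"], e["label"], e["confidence"])
            for e in data.get("events", [])
            if e.get("confidence", 0.0) >= min_conf]  # drop low-context detections

print(parse_events(raw_reply))
# [(1.2, 1.8, 'dog barking', 0.91)]
```

Keeping this boundary as plain JSON means the YAMNet path and a future Qwen path could feed the same merge/export stages unchanged.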

Full Comparison Table

|                        | YAMNet            | PANNs CNN14              | CLAP           | WAV2CLIP       | Qwen2-Audio           |
|------------------------|-------------------|--------------------------|----------------|----------------|-----------------------|
| AudioSet mAP           | 0.306             | 0.385                    | ~0.47          | ~0.40          | LLM-based             |
| Parameters             | 3.7M              | 81M                      | 87M            | ~60M           | 7B                    |
| Label type             | Fixed 521         | Fixed 527                | Free text      | Free text      | Free text + reasoning |
| Hindi support          | manual map        | manual map               | via prompt     | via prompt     | native                |
| False-positive control | gates + blocklist | blocklist too aggressive | mAP ceiling    | inconsistent   | LLM reasoning         |
| Speed                  | fastest           | fast                     | fast           | moderate       | slow                  |
| Offline                | ✅                | ✅                       | ✅             | ✅             | ✅ (quantized)        |
| Verdict                | ✅ production     | ❌ noisy for CC          | ❌ load errors | ❌ low quality | 🔬 fine-tune target   |

cc @abinash-sketch @keerthiseelan-planetread


Uneeb808 commented May 9, 2026

Here is the implementation:

c4gt-video.mp4

@Uneeb808

Please review some of the important changes I made.
