[DMP 2026] Intelligent CC Suggestion Tool Complete Pipeline (Goals 1, 2 & 3) #8

Open · Ashutoshx7 wants to merge 28 commits into PlanetRead:main from Ashutoshx7:main

Conversation

Ashutoshx7 commented May 7, 2026

Intelligent CC Suggestion Tool

fixes #2

DMP 2026 · PlanetRead · C4GT

Proposal document: https://docs.google.com/document/d/18DmqvkRuaiw3bRKe-c_-T5-KrA5auziadDFyorAlWG8/edit?usp=sharing

AI-powered tool that identifies moments in a video where a Closed Caption (CC) annotation is genuinely necessary — such as when a non-speech audio event meaningfully affects the speakers or the scene — and suggests contextually relevant CC text, without over-captioning routine or low-impact sounds.

Demo video: PLanet.Read.mp4

YouTube link: https://youtu.be/zOPK43g-OwQ?si=iPUpVk_uKhRCEQDV

Architecture

Video → Audio Extraction → YAMNet Detection → Speech Filtering
     → Scene Cut Detection → Reaction Window Frame Extraction
     → Pose Analysis (flinch, head turn) + Face Analysis (surprise)
     → Category-Aware Fusion Engine → SRT Output
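
The sound-detection stage rides on YAMNet's TensorFlow Hub interface. Below is a minimal sketch, assuming 16 kHz mono float32 input (YAMNet's required format); the top-k handling shown here is illustrative, and the real implementation lives in src/audio/yamnet_detector.py:

```python
import numpy as np
import tensorflow_hub as hub

# YAMNet expects a 16 kHz mono float32 waveform in [-1, 1]
model = hub.load("https://tfhub.dev/google/yamnet/1")

def top_k_per_frame(waveform: np.ndarray, k: int = 3):
    """Return the top-k (class_index, score) pairs for each YAMNet frame."""
    scores, _embeddings, _spectrogram = model(waveform)  # scores: [frames, 521]
    scores = scores.numpy()
    order = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return [[(int(c), float(scores[i, c])) for c in row]
            for i, row in enumerate(order)]
```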


Key Innovations

  1. Temporal Reaction Windows — Extracts frames 300ms–1500ms after the sound (when reactions actually happen), not at the midpoint (timing sketch after this list)
  2. Category-Aware Fusion — Different sound types use different weights (explosions don't need visual confirmation; doorbells do)
  3. Scene Cut Detection — Skips visual analysis at edit points to prevent false positive reactions
  4. Top-3 High-Impact Priority — When a dangerous sound (gunshot, explosion) appears in YAMNet's top 3 predictions, it's selected even if not the #1 class
  5. Multi-Person Detection — Analyzes up to 4 people per frame, takes peak reaction score
  6. Overcaption Prevention — Primary design goal is to filter ambient/insignificant sounds, not just detect everything (90% filter rate on real content)
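
A minimal sketch of the reaction-window timing from Innovation 1, assuming the 5 frames are spaced evenly across the window (the production logic is src/visual/frame_extractor.py, and the bounds are configurable in default.yaml):

```python
def reaction_window_timestamps(onset_s: float, start_ms: int = 300,
                               end_ms: int = 1500, n_frames: int = 5):
    """Return n_frames timestamps spread evenly across the reaction window."""
    start = onset_s + start_ms / 1000.0
    end = onset_s + end_ms / 1000.0
    step = (end - start) / (n_frames - 1)
    return [start + i * step for i in range(n_frames)]

# Example: a gunshot at t=12.48s yields frames at
# 12.78s, 13.08s, 13.38s, 13.68s, 13.98s — after the sound, not at it.
```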

Setup

# One-command setup (installs deps + downloads models)
chmod +x setup.sh && ./setup.sh

# Or manually:
pip install -r requirements.txt
sudo apt install ffmpeg            # optional but recommended

The setup script downloads MediaPipe model files to models/.

Usage

CLI (Command Line)

# Basic — produces <video>_cc.srt
python main.py video.mp4

# With options
python main.py video.mp4 -o captions.srt --verbose

# Override fusion threshold
python main.py video.mp4 --threshold 0.35

# Evaluation mode — compares output against ground truth
python main.py video.mp4 --evaluate --ground-truth eval/ground_truth/clip.json

Web UI

python web/app.py
# Open http://localhost:8000

The web interface provides:

  • Upload — Drag-and-drop video files
  • Processing — Real-time progress with pipeline stage updates
  • Review — Video player, interactive timeline, event cards with accept/reject toggles
  • Live CC Overlay — Captions appear on the video player in real-time during playback, styled by category
  • Caption Style Customizer — Change font, size, color, position, and background opacity of captions in real-time
  • Keyboard Shortcuts — Space play/pause, ←/→ seek ±5s, J/K jump between events
  • Export — Download SRT or SLS with only accepted captions

Output

The CLI produces:

  • <video>_cc.srt — Standard SRT subtitle file with CC annotations
  • <video>_cc.sls — SLS (Same Language Subtitling) format with score metadata
  • <video>_cc_summary.txt — Human-readable report showing accepted/rejected events with scores

Example SRT Output

1
00:00:12,480 --> 00:00:13,440
[gunshot]

2
00:00:28,320 --> 00:00:28,800
[glass breaking]
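
A minimal sketch of the timestamp formatting that produces entries like the ones above (the actual writer is src/output/srt_writer.py):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_entry(index: int, start: float, end: float, label: str) -> str:
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n[{label}]\n"

# srt_entry(1, 12.48, 13.44, "gunshot") reproduces entry 1 above.
```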

Configuration

All thresholds are tunable via YAML config — zero hardcoded magic numbers.

  • config/default.yaml — Pipeline settings (confidence thresholds, reaction window timing, fusion weights)
  • config/sound_categories.yaml — Category-aware weights per sound type
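
As a sketch of how such a config is consumed with PyYAML (the key names below are illustrative placeholders, not the actual schema of config/default.yaml):

```python
import yaml

with open("config/default.yaml") as f:
    cfg = yaml.safe_load(f)

# e.g. read a fusion threshold with a CLI-style fallback default
threshold = cfg.get("fusion", {}).get("threshold", 0.35)
```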

Sound Categories

| Category | Examples | Behavior |
|---|---|---|
| high_impact | Gunshot, Explosion, Scream | Caption even without visual reaction (α=0.85) |
| interactive | Doorbell, Knock, Dog bark | Only caption if someone visibly reacts (β=0.60) |
| social | Laughter, Applause, Crying | Context-dependent (balanced weights) |
| ambient | Music, Rain, Traffic | Almost never caption (threshold=0.70) |
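
A minimal sketch of category-aware fusion, assuming a convex blend of audio and visual scores per category. Only the α=0.85, β=0.60, and ambient threshold=0.70 values come from the table above; the other weights and thresholds here are assumed, and the exact formula lives in src/fusion/decision_engine.py:

```python
WEIGHTS = {
    # category: (audio_weight, visual_weight, decision_threshold)
    "high_impact": (0.85, 0.15, 0.35),  # caption even without visual reaction
    "interactive": (0.40, 0.60, 0.35),  # needs a visible reaction
    "social":      (0.50, 0.50, 0.35),  # balanced weights
    "ambient":     (0.50, 0.50, 0.70),  # high bar: almost never caption
}

def should_caption(category: str, audio_score: float, visual_score: float) -> bool:
    a, v, threshold = WEIGHTS[category]
    return a * audio_score + v * visual_score >= threshold
```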

Project Structure

├── config/
│   ├── default.yaml             # Pipeline settings
│   └── sound_categories.yaml    # Category-aware weights
├── src/
│   ├── pipeline.py              # Full orchestrator
│   ├── config_loader.py         # YAML config loading
│   ├── audio/
│   │   ├── extractor.py         # ffmpeg audio extraction (+ OpenCV fallback)
│   │   ├── yamnet_detector.py   # YAMNet sound event detection (521 classes)
│   │   └── speech_filter.py     # WebRTC VAD + energy-based fallback
│   ├── visual/
│   │   ├── scene_cut.py         # Histogram-based cut detection
│   │   ├── frame_extractor.py   # Temporal reaction window (300-1500ms)
│   │   ├── pose_analyzer.py     # MediaPipe Pose (flinch, head turn, multi-person)
│   │   └── face_analyzer.py     # MediaPipe Face (surprise/gasp, multi-face)
│   ├── fusion/
│   │   ├── category_mapper.py   # YAMNet class → behavioral category
│   │   └── decision_engine.py   # Category-aware score fusion + CC decision
│   └── output/
│       ├── srt_writer.py        # SRT file generation
│       └── label_mapper.py      # YAMNet class → CC label (India-specific)
├── eval/
│   ├── evaluator.py             # IoU-based P/R/F1 + overcaption rate
│   └── ground_truth/            # Manual annotations (JSON)
├── web/
│   ├── app.py                   # FastAPI backend
│   └── static/                  # Monochrome web UI
├── tests/
│   ├── test_all.py              # 30-test suite
│   └── generate_test_data.py    # Synthetic video/audio generator
├── main.py                      # CLI entry point
├── setup.sh                     # One-command setup
└── requirements.txt

Testing

# Run all tests (30 tests)
python -m pytest tests/test_all.py -v

# Generate synthetic test data
python tests/generate_test_data.py

# Full end-to-end pipeline test
python main.py samples/test_clip.avi --verbose

# Evaluation test
python main.py samples/test_clip.avi --evaluate --ground-truth eval/ground_truth/test_clip.json

Tech Stack

| Component | Tool |
|---|---|
| Audio extraction | ffmpeg (with moviepy fallback) |
| Sound detection | YAMNet (TensorFlow Hub, 521 classes) |
| Speech filtering | WebRTC VAD (with energy-based fallback) |
| Pose detection | MediaPipe PoseLandmarker (Tasks API) |
| Face analysis | MediaPipe FaceLandmarker (Tasks API) |
| Scene cuts | OpenCV histogram comparison |
| Config | YAML (all thresholds tunable) |
| Output | Standard SRT + SLS (PlanetRead) |
| Web UI | FastAPI + Vanilla JS |
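
A minimal sketch of the histogram-based scene-cut check from the table above, assuming a correlation threshold (CUT_THRESHOLD here is an illustrative value; the real detector is src/visual/scene_cut.py):

```python
import cv2

CUT_THRESHOLD = 0.6  # correlation below this suggests a cut (assumed value)

def is_scene_cut(prev_frame, frame) -> bool:
    """Compare hue/saturation histograms of consecutive frames."""
    hists = []
    for img in (prev_frame, frame):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(h, h)
        hists.append(h)
    similarity = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return similarity < CUT_THRESHOLD
```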

Evaluation Metrics

| Metric | Target | Description |
|---|---|---|
| Precision | ≥ 0.75 | Fraction of suggestions that are correct |
| Recall | ≥ 0.65 | Fraction of important events caught |
| Overcaption Rate | ≤ 0.15 | Fraction of suggestions that are unnecessary |
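
A minimal sketch of IoU-based matching over temporal (start, end) intervals, as used for precision/recall (the full evaluator is eval/evaluator.py; the 0.5 IoU threshold is an assumed value):

```python
def temporal_iou(a, b) -> float:
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(predicted, ground_truth, iou_thresh=0.5):
    matched = sum(any(temporal_iou(p, g) >= iou_thresh for g in ground_truth)
                  for p in predicted)
    hit = sum(any(temporal_iou(g, p) >= iou_thresh for p in predicted)
              for g in ground_truth)
    precision = matched / len(predicted) if predicted else 0.0
    recall = hit / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```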

Hindi/Regional Content Support

  • Dense dialogue handling — WebRTC VAD at aggressiveness=3 for Hindi speech (usage sketch after this list)
  • India-specific sounds — Fireworks→[firecrackers], Drum→[drums], Bell→[bell]
  • SLS workflow compatible — Standard SRT format overlays with karaoke subtitles
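
A minimal sketch of the WebRTC VAD usage at aggressiveness=3, assuming 16 kHz, 16-bit mono PCM in 30 ms frames (one of the formats webrtcvad accepts); the production filter with its energy-based fallback is src/audio/speech_filter.py:

```python
import webrtcvad

vad = webrtcvad.Vad(3)  # 0 = least aggressive, 3 = most aggressive
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples -> 2 bytes each

def speech_ratio(pcm: bytes) -> float:
    """Fraction of 30 ms frames that WebRTC VAD classifies as speech."""
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    if not frames:
        return 0.0
    return sum(vad.is_speech(f, SAMPLE_RATE) for f in frames) / len(frames)
```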

Known Limitations

  1. YAMNet is AudioSet-trained (English/Western-centric) — Indian-specific sounds (dhol, pressure cooker whistle, temple bells) may classify under generic labels. Mitigation: substring-based label mapper handles this, and PANNs can be swapped in via the fixed data contract.
  2. Single-frame vs. multi-frame tradeoff — We extract 5 frames in the reaction window (300–1500ms). For very fast reactions (<300ms) or slow dramatic reactions (>1500ms), the window may miss. The window is configurable in default.yaml.
  3. No GPU required but slower on CPU — YAMNet + MediaPipe run on CPU. A 10s video processes in ~4s. Longer videos scale linearly.
  4. ffmpeg preferred for audio — Without system ffmpeg, moviepy (bundled ffmpeg) handles extraction. Both produce full-fidelity audio.
  5. WebRTC VAD may not install on all platforms — Falls back to energy-based VAD automatically, which is less accurate for dense Hindi dialogue.
  6. Confidence calibration — YAMNet softmax scores are not true probabilities. Per-class calibration on representative Hindi content would improve threshold accuracy.

What I'd Improve Next

  1. Benchmark on real PlanetRead content — Tune thresholds and category weights on actual Hindi/regional videos with editor feedback
  2. PANNs backend — Swap in PANNs for finer-grained classification (the data contract makes this a drop-in)
  3. Confidence calibration — Per-class percentile normalization on a representative sample
  4. Persistent job storage — Move from in-memory to SQLite/Redis for multi-user web deployment
  5. VTT output format — Trivially derivable from SRT, not yet implemented
  6. Threshold tuning UI — Expose category weights in the web interface for real-time editor adjustment

Ashutoshx7 added 28 commits May 8, 2026 00:57
Goals 1, 2 & 3 — full end-to-end pipeline:

Goal 1 — Sound Event Detection:
- YAMNet-based audio classification (521 AudioSet classes)
- WebRTC VAD / energy-based speech filtering
- Consecutive event merging with peak confidence

Goal 2 — Speaker Reaction Detection:
- Temporal reaction windows (300ms-1500ms after sound onset)
- Scene cut detection (histogram comparison) to prevent false positives
- MediaPipe PoseLandmarker (flinch/head turn via shoulder/ear/nose landmarks)
- MediaPipe FaceLandmarker (surprise via mouth openness)
- Multi-person scoring (max reaction across all detected people)

Goal 3 — CC Decision Engine + SRT Output:
- Category-aware fusion weights (high_impact, interactive, social, ambient)
- Speech-pause bonus for interrupted dialogue
- Scene-cut fallback to audio-only scoring
- Standard SRT output with human-readable summary
- IoU-based evaluation framework (P/R/F1/overcaption rate)

19/19 tests passing. Full pipeline tested end-to-end.
- FastAPI backend: upload, async pipeline processing, event toggle, SRT export
- Minimal black & white design: Inter + JetBrains Mono, 1px borders, subtle surfaces
- Results page: stats bar, video player, interactive timeline, event cards with toggles
- SRT live preview, accept/reject per event, download export
- Audio extractor: OpenCV fallback for environments without ffmpeg
…on support

- Created eval/ground_truth/test_clip.json with 3 annotated events
- Added eval/__init__.py for clean package imports
- Verified --evaluate mode runs end-to-end with P/R/F1 output
- requirements.txt: added fastapi, uvicorn, python-multipart, pytest
- .gitignore: added web/uploads/, stopped ignoring .avi/.mp4 source files
- README.md: documented web UI, setup, testing, Hindi support
- setup.sh: auto-generates test data, prints web UI usage
- pipeline.py: fixed WAV cleanup to preserve pre-existing test audio
- generate_demo_data.py: 15s video with siren, alarm, bell, knock
- speech_filter.py: raised energy thresholds to prevent non-speech sounds from being filtered as speech
- demo pipeline now shows 15 detected → 6 accepted with category filtering
New features:
- report_generator.py: professional HTML report with dark monochrome design
  (stats grid, category distribution, color-coded event table, SRT preview)
- report_generator.py: JSON report with full event data for integration
- label_mapper.py: expanded from 30 to 120+ AudioSet class mappings
  (high impact, interactive, social, transport, physical, India-specific, nature)
- Pipeline now auto-generates _report.html and _report.json alongside SRT

Tests: 30 passing (was 19)
- TestReportGenerator: JSON structure, HTML elements, filter rate
- TestExtendedLabels: India-specific, high impact, social, transport, nature
- TestEnergyVAD: threshold behavior, silent/loud detection
…eyboard shortcuts (Space/arrows/J/K) + SLS export button
Ashutoshx7 (Author) commented May 8, 2026

Hi @abinash-sketch,

I’m available for the interview.
Please find my detailed proposal/demo attached. I would appreciate your thoughts and feedback on it.

I’m happy to connect at any time that works best for you.

Thank you.
