[DMP 2026] Intelligent CC Suggestion Tool Complete Pipeline (Goals 1, 2 & 3) #8

Open · Ashutoshx7 wants to merge 28 commits into PlanetRead:main from Ashutoshx7:main

Conversation

Ashutoshx7 commented May 7, 2026

Intelligent CC Suggestion Tool

fixes #2

DMP 2026 · PlanetRead · C4GT

Proposal document: https://docs.google.com/document/d/18DmqvkRuaiw3bRKe-c_-T5-KrA5auziadDFyorAlWG8/edit?usp=sharing

AI-powered tool that identifies moments in a video where a Closed Caption (CC) annotation is genuinely necessary — such as when a non-speech audio event meaningfully affects the speakers or the scene — and suggests contextually relevant CC text, without over-captioning routine or low-impact sounds.

Demo video: PLanet.Read.mp4

YouTube link: https://youtu.be/zOPK43g-OwQ?si=iPUpVk_uKhRCEQDV

Architecture

Video → Audio Extraction → YAMNet Detection → Speech Filtering
     → Scene Cut Detection → Reaction Window Frame Extraction
     → Pose Analysis (flinch, head turn) + Face Analysis (surprise)
     → Category-Aware Fusion Engine → SRT Output
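
The sound-detection stage rides on YAMNet's TensorFlow Hub interface. Below is a minimal sketch, assuming 16 kHz mono float32 input (YAMNet's required format); the top-k handling shown here is illustrative, and the real implementation lives in src/audio/yamnet_detector.py:

```python
import numpy as np
import tensorflow_hub as hub

# YAMNet expects a 16 kHz mono float32 waveform in [-1, 1]
model = hub.load("https://tfhub.dev/google/yamnet/1")

def top_k_per_frame(waveform: np.ndarray, k: int = 3):
    """Return the top-k (class_index, score) pairs for each YAMNet frame."""
    scores, _embeddings, _spectrogram = model(waveform)  # scores: [frames, 521]
    scores = scores.numpy()
    order = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return [[(int(c), float(scores[i, c])) for c in row]
            for i, row in enumerate(order)]
```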


Key Innovations

  1. Temporal Reaction Windows — Extracts frames 300ms–1500ms after the sound (when reactions actually happen), not at the midpoint (timing sketch after this list)
  2. Category-Aware Fusion — Different sound types use different weights (explosions don't need visual confirmation; doorbells do)
  3. Scene Cut Detection — Skips visual analysis at edit points to prevent false positive reactions
  4. Top-3 High-Impact Priority — When a dangerous sound (gunshot, explosion) appears in YAMNet's top 3 predictions, it's selected even if not the #1 class
  5. Multi-Person Detection — Analyzes up to 4 people per frame, takes peak reaction score
  6. Overcaption Prevention — Primary design goal is to filter ambient/insignificant sounds, not just detect everything (90% filter rate on real content)
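
A minimal sketch of the reaction-window timing from Innovation 1, assuming the 5 frames are spaced evenly across the window (the production logic is src/visual/frame_extractor.py, and the bounds are configurable in default.yaml):

```python
def reaction_window_timestamps(onset_s: float, start_ms: int = 300,
                               end_ms: int = 1500, n_frames: int = 5):
    """Return n_frames timestamps spread evenly across the reaction window."""
    start = onset_s + start_ms / 1000.0
    end = onset_s + end_ms / 1000.0
    step = (end - start) / (n_frames - 1)
    return [start + i * step for i in range(n_frames)]

# Example: a gunshot at t=12.48s yields frames at
# 12.78s, 13.08s, 13.38s, 13.68s, 13.98s — after the sound, not at it.
```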

Setup

# One-command setup (installs deps + downloads models)
chmod +x setup.sh && ./setup.sh

# Or manually:
pip install -r requirements.txt
sudo apt install ffmpeg            # optional but recommended

The setup script downloads MediaPipe model files to models/.

Usage

CLI (Command Line)

# Basic — produces <video>_cc.srt
python main.py video.mp4

# With options
python main.py video.mp4 -o captions.srt --verbose

# Override fusion threshold
python main.py video.mp4 --threshold 0.35

# Evaluation mode — compares output against ground truth
python main.py video.mp4 --evaluate --ground-truth eval/ground_truth/clip.json

Web UI

python web/app.py
# Open http://localhost:8000

The web interface provides:

  • Upload — Drag-and-drop video files
  • Processing — Real-time progress with pipeline stage updates
  • Review — Video player, interactive timeline, event cards with accept/reject toggles
  • Live CC Overlay — Captions appear on the video player in real-time during playback, styled by category
  • Caption Style Customizer — Change font, size, color, position, and background opacity of captions in real-time
  • Keyboard Shortcuts — Space play/pause, ←/→ seek ±5s, J/K jump between events
  • Export — Download SRT or SLS with only accepted captions

Output

The CLI produces:

  • <video>_cc.srt — Standard SRT subtitle file with CC annotations
  • <video>_cc.sls — SLS (Same Language Subtitling) format with score metadata
  • <video>_cc_summary.txt — Human-readable report showing accepted/rejected events with scores

Example SRT Output

1
00:00:12,480 --> 00:00:13,440
[gunshot]

2
00:00:28,320 --> 00:00:28,800
[glass breaking]
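
A minimal sketch of the timestamp formatting that produces entries like the ones above (the actual writer is src/output/srt_writer.py):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_entry(index: int, start: float, end: float, label: str) -> str:
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n[{label}]\n"

# srt_entry(1, 12.48, 13.44, "gunshot") reproduces entry 1 above.
```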

Configuration

All thresholds are tunable via YAML config — zero hardcoded magic numbers.

  • config/default.yaml — Pipeline settings (confidence thresholds, reaction window timing, fusion weights)
  • config/sound_categories.yaml — Category-aware weights per sound type
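
As a sketch of how such a config is consumed with PyYAML (the key names below are illustrative placeholders, not the actual schema of config/default.yaml):

```python
import yaml

with open("config/default.yaml") as f:
    cfg = yaml.safe_load(f)

# e.g. read a fusion threshold with a CLI-style fallback default
threshold = cfg.get("fusion", {}).get("threshold", 0.35)
```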

Sound Categories

| Category | Examples | Behavior |
|---|---|---|
| high_impact | Gunshot, Explosion, Scream | Caption even without visual reaction (α=0.85) |
| interactive | Doorbell, Knock, Dog bark | Only caption if someone visibly reacts (β=0.60) |
| social | Laughter, Applause, Crying | Context-dependent (balanced weights) |
| ambient | Music, Rain, Traffic | Almost never caption (threshold=0.70) |
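
A minimal sketch of category-aware fusion, assuming a convex blend of audio and visual scores per category. Only the α=0.85, β=0.60, and ambient threshold=0.70 values come from the table above; the other weights and thresholds here are assumed, and the exact formula lives in src/fusion/decision_engine.py:

```python
WEIGHTS = {
    # category: (audio_weight, visual_weight, decision_threshold)
    "high_impact": (0.85, 0.15, 0.35),  # caption even without visual reaction
    "interactive": (0.40, 0.60, 0.35),  # needs a visible reaction
    "social":      (0.50, 0.50, 0.35),  # balanced weights
    "ambient":     (0.50, 0.50, 0.70),  # high bar: almost never caption
}

def should_caption(category: str, audio_score: float, visual_score: float) -> bool:
    a, v, threshold = WEIGHTS[category]
    return a * audio_score + v * visual_score >= threshold
```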

Project Structure

├── config/
│   ├── default.yaml             # Pipeline settings
│   └── sound_categories.yaml    # Category-aware weights
├── src/
│   ├── pipeline.py              # Full orchestrator
│   ├── config_loader.py         # YAML config loading
│   ├── audio/
│   │   ├── extractor.py         # ffmpeg audio extraction (+ OpenCV fallback)
│   │   ├── yamnet_detector.py   # YAMNet sound event detection (521 classes)
│   │   └── speech_filter.py     # WebRTC VAD + energy-based fallback
│   ├── visual/
│   │   ├── scene_cut.py         # Histogram-based cut detection
│   │   ├── frame_extractor.py   # Temporal reaction window (300-1500ms)
│   │   ├── pose_analyzer.py     # MediaPipe Pose (flinch, head turn, multi-person)
│   │   └── face_analyzer.py     # MediaPipe Face (surprise/gasp, multi-face)
│   ├── fusion/
│   │   ├── category_mapper.py   # YAMNet class → behavioral category
│   │   └── decision_engine.py   # Category-aware score fusion + CC decision
│   └── output/
│       ├── srt_writer.py        # SRT file generation
│       └── label_mapper.py      # YAMNet class → CC label (India-specific)
├── eval/
│   ├── evaluator.py             # IoU-based P/R/F1 + overcaption rate
│   └── ground_truth/            # Manual annotations (JSON)
├── web/
│   ├── app.py                   # FastAPI backend
│   └── static/                  # Monochrome web UI
├── tests/
│   ├── test_all.py              # 30-test suite
│   └── generate_test_data.py    # Synthetic video/audio generator
├── main.py                      # CLI entry point
├── setup.sh                     # One-command setup
└── requirements.txt

Testing

# Run all tests (30 tests)
python -m pytest tests/test_all.py -v

# Generate synthetic test data
python tests/generate_test_data.py

# Full end-to-end pipeline test
python main.py samples/test_clip.avi --verbose

# Evaluation test
python main.py samples/test_clip.avi --evaluate --ground-truth eval/ground_truth/test_clip.json

Tech Stack

| Component | Tool |
|---|---|
| Audio extraction | ffmpeg (with moviepy fallback) |
| Sound detection | YAMNet (TensorFlow Hub, 521 classes) |
| Speech filtering | WebRTC VAD (with energy-based fallback) |
| Pose detection | MediaPipe PoseLandmarker (Tasks API) |
| Face analysis | MediaPipe FaceLandmarker (Tasks API) |
| Scene cuts | OpenCV histogram comparison |
| Config | YAML (all thresholds tunable) |
| Output | Standard SRT + SLS (PlanetRead) |
| Web UI | FastAPI + Vanilla JS |
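
A minimal sketch of the histogram-based scene-cut check from the table above, assuming a correlation threshold (CUT_THRESHOLD here is an illustrative value; the real detector is src/visual/scene_cut.py):

```python
import cv2

CUT_THRESHOLD = 0.6  # correlation below this suggests a cut (assumed value)

def is_scene_cut(prev_frame, frame) -> bool:
    """Compare hue/saturation histograms of consecutive frames."""
    hists = []
    for img in (prev_frame, frame):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(h, h)
        hists.append(h)
    similarity = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return similarity < CUT_THRESHOLD
```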

Evaluation Metrics

| Metric | Target | Description |
|---|---|---|
| Precision | ≥ 0.75 | Fraction of suggestions that are correct |
| Recall | ≥ 0.65 | Fraction of important events caught |
| Overcaption Rate | ≤ 0.15 | Fraction of suggestions that are unnecessary |
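
A minimal sketch of IoU-based matching over temporal (start, end) intervals, as used for precision/recall (the full evaluator is eval/evaluator.py; the 0.5 IoU threshold is an assumed value):

```python
def temporal_iou(a, b) -> float:
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(predicted, ground_truth, iou_thresh=0.5):
    matched = sum(any(temporal_iou(p, g) >= iou_thresh for g in ground_truth)
                  for p in predicted)
    hit = sum(any(temporal_iou(g, p) >= iou_thresh for p in predicted)
              for g in ground_truth)
    precision = matched / len(predicted) if predicted else 0.0
    recall = hit / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```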

Hindi/Regional Content Support

  • Dense dialogue handling — WebRTC VAD at aggressiveness=3 for Hindi speech (usage sketch after this list)
  • India-specific sounds — Fireworks→[firecrackers], Drum→[drums], Bell→[bell]
  • SLS workflow compatible — Standard SRT format overlays with karaoke subtitles
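
A minimal sketch of the WebRTC VAD usage at aggressiveness=3, assuming 16 kHz, 16-bit mono PCM in 30 ms frames (one of the formats webrtcvad accepts); the production filter with its energy-based fallback is src/audio/speech_filter.py:

```python
import webrtcvad

vad = webrtcvad.Vad(3)  # 0 = least aggressive, 3 = most aggressive
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit samples -> 2 bytes each

def speech_ratio(pcm: bytes) -> float:
    """Fraction of 30 ms frames that WebRTC VAD classifies as speech."""
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    if not frames:
        return 0.0
    return sum(vad.is_speech(f, SAMPLE_RATE) for f in frames) / len(frames)
```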

Known Limitations

  1. YAMNet is AudioSet-trained (English/Western-centric) — Indian-specific sounds (dhol, pressure cooker whistle, temple bells) may classify under generic labels. Mitigation: substring-based label mapper handles this, and PANNs can be swapped in via the fixed data contract.
  2. Single-frame vs. multi-frame tradeoff — We extract 5 frames in the reaction window (300–1500ms). For very fast reactions (<300ms) or slow dramatic reactions (>1500ms), the window may miss. The window is configurable in default.yaml.
  3. No GPU required but slower on CPU — YAMNet + MediaPipe run on CPU. A 10s video processes in ~4s. Longer videos scale linearly.
  4. ffmpeg preferred for audio — Without system ffmpeg, moviepy (bundled ffmpeg) handles extraction. Both produce full-fidelity audio.
  5. WebRTC VAD may not install on all platforms — Falls back to energy-based VAD automatically, which is less accurate for dense Hindi dialogue.
  6. Confidence calibration — YAMNet softmax scores are not true probabilities. Per-class calibration on representative Hindi content would improve threshold accuracy.

What I'd Improve Next

  1. Benchmark on real PlanetRead content — Tune thresholds and category weights on actual Hindi/regional videos with editor feedback
  2. PANNs backend — Swap in PANNs for finer-grained classification (the data contract makes this a drop-in)
  3. Confidence calibration — Per-class percentile normalization on a representative sample
  4. Persistent job storage — Move from in-memory to SQLite/Redis for multi-user web deployment
  5. VTT output format — Trivially derivable from SRT, not yet implemented
  6. Threshold tuning UI — Expose category weights in the web interface for real-time editor adjustment

Ashutoshx7 added 28 commits May 8, 2026 00:57
Goals 1, 2 & 3 — full end-to-end pipeline:

Goal 1 — Sound Event Detection:
- YAMNet-based audio classification (521 AudioSet classes)
- WebRTC VAD / energy-based speech filtering
- Consecutive event merging with peak confidence

Goal 2 — Speaker Reaction Detection:
- Temporal reaction windows (300ms-1500ms after sound onset)
- Scene cut detection (histogram comparison) to prevent false positives
- MediaPipe PoseLandmarker (flinch/head turn via shoulder/ear/nose landmarks)
- MediaPipe FaceLandmarker (surprise via mouth openness)
- Multi-person scoring (max reaction across all detected people)

Goal 3 — CC Decision Engine + SRT Output:
- Category-aware fusion weights (high_impact, interactive, social, ambient)
- Speech-pause bonus for interrupted dialogue
- Scene-cut fallback to audio-only scoring
- Standard SRT output with human-readable summary
- IoU-based evaluation framework (P/R/F1/overcaption rate)

19/19 tests passing. Full pipeline tested end-to-end.
- FastAPI backend: upload, async pipeline processing, event toggle, SRT export
- Minimal black & white design: Inter + JetBrains Mono, 1px borders, subtle surfaces
- Results page: stats bar, video player, interactive timeline, event cards with toggles
- SRT live preview, accept/reject per event, download export
- Audio extractor: OpenCV fallback for environments without ffmpeg
…on support

- Created eval/ground_truth/test_clip.json with 3 annotated events
- Added eval/__init__.py for clean package imports
- Verified --evaluate mode runs end-to-end with P/R/F1 output
- requirements.txt: added fastapi, uvicorn, python-multipart, pytest
- .gitignore: added web/uploads/, stopped ignoring .avi/.mp4 source files
- README.md: documented web UI, setup, testing, Hindi support
- setup.sh: auto-generates test data, prints web UI usage
- pipeline.py: fixed WAV cleanup to preserve pre-existing test audio
- generate_demo_data.py: 15s video with siren, alarm, bell, knock
- speech_filter.py: raised energy thresholds to prevent non-speech sounds from being filtered as speech
- demo pipeline now shows 15 detected → 6 accepted with category filtering
New features:
- report_generator.py: professional HTML report with dark monochrome design
  (stats grid, category distribution, color-coded event table, SRT preview)
- report_generator.py: JSON report with full event data for integration
- label_mapper.py: expanded from 30 to 120+ AudioSet class mappings
  (high impact, interactive, social, transport, physical, India-specific, nature)
- Pipeline now auto-generates _report.html and _report.json alongside SRT

Tests: 30 passing (was 19)
- TestReportGenerator: JSON structure, HTML elements, filter rate
- TestExtendedLabels: India-specific, high impact, social, transport, nature
- TestEnergyVAD: threshold behavior, silent/loud detection
…eyboard shortcuts (Space/arrows/J/K) + SLS export button
Ashutoshx7 (Author) commented May 8, 2026

Hi @abinash-sketch,

I’m available for the interview.
Please find my detailed proposal/demo attached. I would appreciate your thoughts and feedback on it.

I’m happy to connect at any time that works best for you.

Thank you.
