Changes from all commits · 28 commits
282734b
feat: implement intelligent CC suggestion pipeline (DMP 2026 PlanetRead)
Ashutoshx7 May 4, 2026
30e918d
feat: add tasteful monochrome web UI for CC review
Ashutoshx7 May 6, 2026
5e08061
fix: add ground truth sample + eval/__init__.py for complete evaluati…
Ashutoshx7 May 6, 2026
f228b81
fix: complete all remaining gaps
Ashutoshx7 May 6, 2026
2bb5def
feat: add demo script + limitations + complete README
Ashutoshx7 May 6, 2026
7ea5d57
feat: add rich demo clip generator + fix energy VAD thresholds
Ashutoshx7 May 6, 2026
fc8de87
feat: add HTML/JSON report generators, 120+ label mappings, 30 tests
Ashutoshx7 May 6, 2026
adbbfb4
docs: add PROPOSAL.md with complete PR description
Ashutoshx7 May 6, 2026
de7541f
docs: comprehensive DMP 2026 proposal with full architecture + techni…
Ashutoshx7 May 6, 2026
3d28b13
chore: add test_clip report outputs from evaluation run
Ashutoshx7 May 6, 2026
02960ca
docs: rewrite PROPOSAL.md in full GSoC proposal style with motivation…
Ashutoshx7 May 7, 2026
f495038
fix: add moviepy audio extraction — real audio from any video without…
Ashutoshx7 May 7, 2026
6e38deb
feat: live CC overlay on video player — captions appear as subtitles …
Ashutoshx7 May 7, 2026
6226bae
feat: high-impact captions pulse red + 2s linger time + auto-scroll t…
Ashutoshx7 May 7, 2026
181b364
fix: add no-cache headers + cache-busting to prevent stale HTML serving
Ashutoshx7 May 7, 2026
b018fc7
feat: cinematic CC overlay — glassmorphism badge, category icons, sha…
Ashutoshx7 May 7, 2026
acddf3a
feat: top-3 high-impact priority detection + expanded ambient filter …
Ashutoshx7 May 7, 2026
79f68ff
feat: tasteful CC overlay (pill shape, subtle glow, auto cache-bust) …
Ashutoshx7 May 7, 2026
ea3798c
fix: Fire no longer mapped to [gunshot] + raised visual thresholds to…
Ashutoshx7 May 7, 2026
3c8e3e9
docs: update README with latest features (30 tests, moviepy, live CC …
Ashutoshx7 May 7, 2026
65af104
feat: add SLS (Same Language Subtitling) output format — pipeline + w…
Ashutoshx7 May 7, 2026
45eb7ab
docs: update PROPOSAL.md with Mermaid diagram, SLS output, live CC ov…
Ashutoshx7 May 7, 2026
4773b1c
feat: caption style customizer (font/size/color/position/opacity) + k…
Ashutoshx7 May 7, 2026
2ac17fa
docs: update README with caption customizer, keyboard shortcuts, dual…
Ashutoshx7 May 7, 2026
c979061
fix: caption customizer now works — font/size/color/position/opacity …
Ashutoshx7 May 7, 2026
f531221
fix: resolve CSS specificity issues preventing caption customization …
Ashutoshx7 May 7, 2026
f262b31
fix: bump cache versions for CSS/JS to ensure customization fixes apply
Ashutoshx7 May 7, 2026
6ed7c51
docs: finalize PROPOSAL.md with caption customizer, keyboard shortcut…
Ashutoshx7 May 7, 2026
40 changes: 40 additions & 0 deletions .gitignore
@@ -0,0 +1,40 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
*.egg-info/
dist/
build/
*.egg

# Virtual environments
.venv/
venv/
env/

# IDE
.vscode/
.idea/
*.swp
*.swo

# OS
.DS_Store
Thumbs.db

# Test/build artifacts
.pytest_cache/

# Generated audio/video (not source test clips)
*.wav
*.mkv

# Models (large files — download via setup script)
models/*.task

# Web UI uploads (user data)
web/uploads/

# Temp
get-pip.py
508 changes: 508 additions & 0 deletions PROPOSAL.md

Large diffs are not rendered by default.

209 changes: 209 additions & 0 deletions README.md
@@ -0,0 +1,209 @@
# Intelligent CC Suggestion Tool

> **DMP 2026 · PlanetRead · C4GT**

AI-powered tool that identifies moments in a video where a Closed Caption (CC) annotation is genuinely necessary — such as when a non-speech audio event meaningfully affects the speakers or the scene — and suggests contextually relevant CC text, without over-captioning routine or low-impact sounds.

## Architecture

```
Video → Audio Extraction → YAMNet Detection → Speech Filtering
→ Scene Cut Detection → Reaction Window Frame Extraction
→ Pose Analysis (flinch, head turn) + Face Analysis (surprise)
→ Category-Aware Fusion Engine → SRT Output
```

### Key Innovations

1. **Temporal Reaction Windows** — Extracts frames 300ms–1500ms *after* the sound (when reactions actually happen), not at the midpoint
2. **Category-Aware Fusion** — Different sound types use different weights (explosions don't need visual confirmation; doorbells do); both this and the reaction windows from item 1 are sketched after this list
3. **Scene Cut Detection** — Skips visual analysis at edit points to prevent false positive reactions
4. **Top-3 High-Impact Priority** — When a dangerous sound (gunshot, explosion) appears in YAMNet's top 3 predictions, it's selected even if not the #1 class
5. **Multi-Person Detection** — Analyzes up to 4 people per frame, takes peak reaction score
6. **Overcaption Prevention** — Primary design goal is to filter ambient/insignificant sounds, not just detect everything (90% filter rate on real content)
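
To make innovations 1 and 2 concrete, here is a minimal sketch. The function names are hypothetical (the real logic lives in `src/visual/frame_extractor.py` and `src/fusion/decision_engine.py`); the α/β values quoted in the comments come from the Sound Categories table below, while the complementary weights and the unlisted thresholds are assumptions filled in from the defaults in `config/default.yaml`.

```python
import numpy as np

REACTION_WINDOW = (0.3, 1.5)  # seconds after event onset (config/default.yaml)
NUM_FRAMES = 5

def reaction_frame_times(onset_s: float) -> np.ndarray:
    """Sample frame timestamps inside the post-onset reaction window."""
    start, end = REACTION_WINDOW
    return onset_s + np.linspace(start, end, NUM_FRAMES)

# alpha = audio weight, beta = visual weight. Only high_impact's alpha,
# interactive's beta, and ambient's threshold appear in this README;
# the remaining numbers are assumed from the config defaults (0.6/0.4/0.4).
CATEGORY_WEIGHTS = {
    "high_impact": {"alpha": 0.85, "beta": 0.15, "threshold": 0.40},
    "interactive": {"alpha": 0.40, "beta": 0.60, "threshold": 0.40},
    "social":      {"alpha": 0.50, "beta": 0.50, "threshold": 0.40},
    "ambient":     {"alpha": 0.60, "beta": 0.40, "threshold": 0.70},
}

def should_caption(category: str, audio_conf: float, visual_score: float) -> bool:
    """Fuse audio and visual evidence with category-specific weights."""
    w = CATEGORY_WEIGHTS[category]
    return w["alpha"] * audio_conf + w["beta"] * visual_score >= w["threshold"]
```

With these numbers, a gunshot at audio confidence 0.5 is captioned with no visible reaction at all (0.85 × 0.5 = 0.425 ≥ 0.40), while a doorbell at the same confidence needs a visual reaction score of roughly 0.33 or more.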

## Setup

```bash
# One-command setup (installs deps + downloads models)
chmod +x setup.sh && ./setup.sh

# Or manually:
pip install -r requirements.txt
sudo apt install ffmpeg # optional but recommended
```

The setup script downloads MediaPipe model files to `models/`.

## Usage

### CLI (Command Line)

```bash
# Basic — produces <video>_cc.srt
python main.py video.mp4

# With options
python main.py video.mp4 -o captions.srt --verbose

# Override fusion threshold
python main.py video.mp4 --threshold 0.35

# Evaluation mode — compares output against ground truth
python main.py video.mp4 --evaluate --ground-truth eval/ground_truth/clip.json
```

### Web UI

```bash
python web/app.py
# Open http://localhost:8000
```

The web interface provides:
- **Upload** — Drag-and-drop video files
- **Processing** — Real-time progress with pipeline stage updates
- **Review** — Video player, interactive timeline, event cards with accept/reject toggles
- **Live CC Overlay** — Captions appear on the video player in real-time during playback, styled by category
- **Caption Style Customizer** — Change font, size, color, position, and background opacity of captions in real-time
- **Keyboard Shortcuts** — `Space` play/pause, `←→` seek ±5s, `J/K` jump between events
- **Export** — Download SRT or SLS with only accepted captions

## Output

The CLI produces:
- `<video>_cc.srt` — Standard SRT subtitle file with CC annotations
- `<video>_cc.sls` — SLS (Same Language Subtitling) format with score metadata
- `<video>_cc_summary.txt` — Human-readable report showing accepted/rejected events with scores

### Example SRT Output

```
1
00:00:12,480 --> 00:00:13,440
[gunshot]

2
00:00:28,320 --> 00:00:28,800
[glass breaking]
```
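
For reference, the timestamp style above can be produced as follows; `srt_timestamp` is a hypothetical helper in the spirit of `src/output/srt_writer.py`, not its confirmed API.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT HH:MM:SS,mmm style shown above."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(12.48))  # 00:00:12,480 (cue 1 above)
```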

## Configuration

All thresholds are tunable via YAML config — zero hardcoded magic numbers.

- `config/default.yaml` — Pipeline settings (confidence thresholds, reaction window timing, fusion weights)
- `config/sound_categories.yaml` — Category-aware weights per sound type
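
Reading these files is straightforward; a minimal sketch in the spirit of `src/config_loader.py` (the function name, and the absence of any merging or validation logic, are assumptions):

```python
import yaml  # PyYAML

def load_config(path: str = "config/default.yaml") -> dict:
    """Load the tunable pipeline settings from YAML."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

cfg = load_config()
print(cfg["fusion"]["threshold"])  # 0.4 in the shipped default.yaml
```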

### Sound Categories

| Category | Examples | Behavior |
|---|---|---|
| **high_impact** | Gunshot, Explosion, Scream | Caption even without visual reaction (α=0.85) |
| **interactive** | Doorbell, Knock, Dog bark | Only caption if someone visibly reacts (β=0.60) |
| **social** | Laughter, Applause, Crying | Context-dependent (balanced weights) |
| **ambient** | Music, Rain, Traffic | Almost never caption (threshold=0.70) |

## Project Structure

```
├── config/
│   ├── default.yaml            # Pipeline settings
│   └── sound_categories.yaml   # Category-aware weights
├── src/
│   ├── pipeline.py             # Full orchestrator
│   ├── config_loader.py        # YAML config loading
│   ├── audio/
│   │   ├── extractor.py        # ffmpeg audio extraction (+ OpenCV fallback)
│   │   ├── yamnet_detector.py  # YAMNet sound event detection (521 classes)
│   │   └── speech_filter.py    # WebRTC VAD + energy-based fallback
│   ├── visual/
│   │   ├── scene_cut.py        # Histogram-based cut detection
│   │   ├── frame_extractor.py  # Temporal reaction window (300-1500ms)
│   │   ├── pose_analyzer.py    # MediaPipe Pose (flinch, head turn, multi-person)
│   │   └── face_analyzer.py    # MediaPipe Face (surprise/gasp, multi-face)
│   ├── fusion/
│   │   ├── category_mapper.py  # YAMNet class → behavioral category
│   │   └── decision_engine.py  # Category-aware score fusion + CC decision
│   └── output/
│       ├── srt_writer.py       # SRT file generation
│       └── label_mapper.py     # YAMNet class → CC label (India-specific)
├── eval/
│   ├── evaluator.py            # IoU-based P/R/F1 + overcaption rate
│   └── ground_truth/           # Manual annotations (JSON)
├── web/
│   ├── app.py                  # FastAPI backend
│   └── static/                 # Monochrome web UI
├── tests/
│   ├── test_all.py             # 30-test suite
│   └── generate_test_data.py   # Synthetic video/audio generator
├── main.py                     # CLI entry point
├── setup.sh                    # One-command setup
└── requirements.txt
```

## Testing

```bash
# Run all tests (30 tests)
python -m pytest tests/test_all.py -v

# Generate synthetic test data
python tests/generate_test_data.py

# Full end-to-end pipeline test
python main.py samples/test_clip.avi --verbose

# Evaluation test
python main.py samples/test_clip.avi --evaluate --ground-truth eval/ground_truth/test_clip.json
```

## Tech Stack

| Component | Tool |
|---|---|
| Audio extraction | ffmpeg (with moviepy fallback) |
| Sound detection | YAMNet (TensorFlow Hub, 521 classes) |
| Speech filtering | WebRTC VAD (with energy-based fallback) |
| Pose detection | MediaPipe PoseLandmarker (Tasks API) |
| Face analysis | MediaPipe FaceLandmarker (Tasks API) |
| Scene cuts | OpenCV histogram comparison |
| Config | YAML (all thresholds tunable) |
| Output | Standard SRT + SLS (PlanetRead) |
| Web UI | FastAPI + Vanilla JS |
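
One row worth unpacking: the scene-cut detector compares color histograms of consecutive frames. A minimal sketch, assuming hue/saturation histograms (the actual parameters in `src/visual/scene_cut.py` are not shown in this diff); the 0.4 cutoff matches `scene_cut_threshold` in `config/default.yaml`.

```python
import cv2

def is_scene_cut(prev_frame, frame, threshold: float = 0.4) -> bool:
    """Flag a cut when consecutive-frame histograms diverge sharply."""
    hists = []
    for f in (prev_frame, frame):
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(h, h)
        hists.append(h)
    # Bhattacharyya distance: 0 = identical, 1 = maximally different
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_BHATTACHARYYA) > threshold
```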

## Evaluation Metrics

| Metric | Target | Description |
|---|---|---|
| Precision | ≥ 0.75 | Fraction of suggestions that are correct |
| Recall | ≥ 0.65 | Fraction of important events caught |
| Overcaption Rate | ≤ 0.15 | Fraction of suggestions that are unnecessary |
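
A minimal sketch of the temporal-IoU matching behind these metrics. `eval/evaluator.py` implements the real scoring; the 0.5 match cutoff here is an assumption, not confirmed by this diff.

```python
def interval_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(suggested, ground_truth, min_iou=0.5):
    """Count a suggestion as correct if it overlaps some ground-truth event."""
    tp = sum(any(interval_iou(s, g) >= min_iou for g in ground_truth)
             for s in suggested)
    matched = sum(any(interval_iou(s, g) >= min_iou for s in suggested)
                  for g in ground_truth)
    precision = tp / len(suggested) if suggested else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```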

## Hindi/Regional Content Support

- **Dense dialogue handling** — WebRTC VAD at aggressiveness=3 for Hindi speech
- **India-specific sounds** — Fireworks→[firecrackers], Drum→[drums], Bell→[bell]
- **SLS workflow compatible** — standard SRT output overlays cleanly alongside karaoke-style SLS subtitles
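
A minimal sketch of the substring-based mapping (the real `src/output/label_mapper.py` carries 120+ mappings; only the three entries below come from this README, and the bracket-the-raw-name fallback is an assumption):

```python
# Checked in order; first substring hit wins.
LABEL_MAP = [
    ("firework", "[firecrackers]"),
    ("drum", "[drums]"),
    ("bell", "[bell]"),
]

def cc_label(yamnet_class: str) -> str:
    """Map a YAMNet class name to a CC label via substring matching."""
    name = yamnet_class.lower()
    for needle, label in LABEL_MAP:
        if needle in name:
            return label
    return f"[{name}]"  # assumed fallback for unmapped classes
```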

## Known Limitations

1. **YAMNet is AudioSet-trained (English/Western-centric)** — Indian-specific sounds (dhol, pressure cooker whistle, temple bells) may classify under generic labels. Mitigation: a substring-based label mapper handles this, and PANNs can be swapped in via the fixed data contract.
2. **Single-frame vs. multi-frame tradeoff** — We extract 5 frames in the reaction window (300–1500ms). Very fast reactions (<300ms) or slow dramatic reactions (>1500ms) can fall outside it entirely. The window bounds are configurable in `default.yaml`.
3. **No GPU required but slower on CPU** — YAMNet + MediaPipe run on CPU. A 10s video processes in ~4s. Longer videos scale linearly.
4. **ffmpeg preferred for audio** — Without system ffmpeg, moviepy (bundled ffmpeg) handles extraction. Both produce full-fidelity audio.
5. **WebRTC VAD may not install on all platforms** — Falls back to energy-based VAD automatically, which is less accurate for dense Hindi dialogue.
6. **Confidence calibration** — YAMNet softmax scores are not true probabilities. Per-class calibration on representative Hindi content would improve threshold accuracy.

## What I'd Improve Next

1. **Benchmark on real PlanetRead content** — Tune thresholds and category weights on actual Hindi/regional videos with editor feedback
2. **PANNs backend** — Swap in PANNs for finer-grained classification (the data contract makes this a drop-in)
3. **Confidence calibration** — Per-class percentile normalization on a representative sample
4. **Persistent job storage** — Move from in-memory to SQLite/Redis for multi-user web deployment
5. **VTT output format** — Trivially derivable from SRT, not yet implemented
6. **Threshold tuning UI** — Expose category weights in the web interface for real-time editor adjustment

## License

MIT
35 changes: 35 additions & 0 deletions config/default.yaml
@@ -0,0 +1,35 @@
audio:
  backend: "yamnet"
  sample_rate: 16000
  confidence_threshold: 0.3
  speech_class_indices: [0,1,2,3,4,5,6]
  vad_aggressiveness: 3        # 0-3, higher = more aggressive. Use 3 for Hindi.
  merge_gap_seconds: 0.1       # merge events within this gap

visual:
  reaction_window_start: 0.3   # seconds after event onset
  reaction_window_end: 1.5     # seconds after event onset
  num_reaction_frames: 5       # frames to sample in reaction window
  pose_model_complexity: 1     # 0=lite, 1=full, 2=heavy
  max_num_poses: 4             # multi-person detection
  max_num_faces: 4
  min_detection_confidence: 0.5
  flinch_threshold: 0.08       # shoulder Y-diff to count as flinch (raised from 0.05)
  flinch_ceiling: 0.18         # normalize to 1.0 at this value
  head_turn_threshold: 0.20    # nose-ratio deviation to count (raised from 0.15)
  head_turn_ceiling: 0.40
  mouth_open_threshold: 0.045  # normalized lip gap (raised from 0.02 — normal speech ~0.03)
  mouth_open_ceiling: 0.10     # genuine gasp/shock
  scene_cut_threshold: 0.4     # Bhattacharyya distance
  scene_cut_tolerance: 0.5     # seconds around cut to flag

fusion:
  audio_weight: 0.6            # default alpha (overridden per-category)
  visual_weight: 0.4           # default beta
  threshold: 0.4               # default combined score cutoff
  speech_pause_bonus: 0.15

output:
  format: "srt"
  max_cc_duration: 3.0         # seconds — subtitle standard
  encoding: "utf-8"