50 changes: 50 additions & 0 deletions README.md
@@ -0,0 +1,50 @@
# Intelligent CC Suggestion Tool — Proof of Concept

**PlanetRead | DMP 2026**

A Python pipeline that identifies moments in a video where a non-speech sound warrants a closed-caption (CC) annotation, then generates SRT/SLS output without over-captioning routine ambient sounds.

## Quick start

Open `poc_demo.ipynb` in Jupyter. The notebook is self-contained: it needs only `numpy` and walks through the full pipeline end to end using realistic sample data.

```bash
pip install numpy jupyter
jupyter notebook poc_demo.ipynb
```

## What the notebook covers

1. **Goal 1 — Audio event detection**: YAMNet patch scoring, speech filtering, confidence thresholding, adjacent-event merging
2. **Goal 2 — Visual reaction analysis**: Optical flow motion score + MediaPipe face-shift score from reaction-window frames
3. **Goal 3 — Decision engine + output**: Category-aware score fusion, SRT and SLS file generation
4. **Evaluation**: IoU-based precision / recall / F1 + overcaption rate
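The post-processing steps listed under Goal 1 (speech filtering, confidence thresholding, adjacent-event merging) can be sketched as follows. This is a minimal illustration, not the notebook's code: the `AudioEvent` type, the `SPEECH_LABELS` subset, and the `min_confidence` and `max_gap` values are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AudioEvent:
    label: str         # class name, e.g. a YAMNet label
    start: float       # seconds
    end: float         # seconds
    confidence: float  # 0..1


# Assumed subset of speech-like labels to drop; the real list may differ.
SPEECH_LABELS = {"Speech", "Conversation", "Narration, monologue"}


def filter_and_merge(events, min_confidence=0.3, max_gap=0.5):
    """Drop speech and low-confidence events, then merge same-label
    events whose gap is at most `max_gap` seconds (both values assumed)."""
    kept = sorted(
        (e for e in events
         if e.label not in SPEECH_LABELS and e.confidence >= min_confidence),
        key=lambda e: (e.label, e.start),
    )
    merged = []
    for e in kept:
        last = merged[-1] if merged else None
        if (last is not None and last.label == e.label
                and e.start - last.end <= max_gap):
            # Extend the previous event and keep the stronger confidence.
            last.end = max(last.end, e.end)
            last.confidence = max(last.confidence, e.confidence)
        else:
            merged.append(AudioEvent(e.label, e.start, e.end, e.confidence))
    return sorted(merged, key=lambda e: e.start)


sample = [
    AudioEvent("Laughter", 1.0, 1.96, 0.6),
    AudioEvent("Laughter", 1.48, 2.44, 0.8),
    AudioEvent("Speech", 0.0, 3.0, 0.9),
    AudioEvent("Doorbell", 5.0, 5.96, 0.2),
]
events = filter_and_merge(sample)
```

Merging keeps the maximum confidence of the merged patches; averaging would be an equally plausible choice.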

## Full pipeline stack

| Stage | Tool |
|---|---|
| Audio extraction | ffmpeg (subprocess) |
| Sound detection | YAMNet (TensorFlow Hub, 521 classes) |
| Speech filtering | label-based + energy VAD fallback |
| Visual reactions | OpenCV Farneback optical flow + MediaPipe FaceMesh |
| Decision fusion | category-aware weighted sum |
| Output | SRT (standard) + SLS (PlanetRead JSON) |
| Evaluation | IoU-based P/R/F1 |
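The evaluation stage in the table above can be illustrated with a small IoU matcher. This is a sketch under stated assumptions: events are plain `(start, end)` tuples in seconds, labels are ignored, matching is greedy one-to-one, and the `iou_threshold` of 0.5 is a placeholder rather than a value taken from the notebook. The overcaption rate is left out because its definition is not given here.

```python
def interval_iou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(predicted, reference, iou_threshold=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f1)."""
    unmatched = list(reference)
    true_pos = 0
    for p in predicted:
        # Pick the still-unmatched reference interval with the best overlap.
        best = max(unmatched, key=lambda r: interval_iou(p, r), default=None)
        if best is not None and interval_iou(p, best) >= iou_threshold:
            true_pos += 1
            unmatched.remove(best)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = [(1.0, 3.0), (10.0, 11.0)]
reference = [(1.5, 3.0), (20.0, 21.0)]
precision, recall, f1 = evaluate(predicted, reference)
```

Here the first prediction overlaps its reference with an IoU of 0.75 and counts as a match, while the second has no counterpart, so precision, recall, and F1 all come out at 0.5.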

## CC decision logic

```
score = audio_weight  × audio_confidence
      + visual_weight × reaction_confidence
      + 0.12            (only if the label is high-impact)
```

| Category | Audio w | Visual w | Examples |
|---|---|---|---|
| high_impact | 0.85 | 0.15 | Gunshot, explosion, alarm, siren, firecrackers |
| social | 0.55 | 0.45 | Laughter, applause, cheering, crying |
| interactive | 0.45 | 0.55 | Doorbell, dog bark, phone |
| ambient | 0.30 | 0.70 | Music, rain, traffic |

Events with a fused score below 0.50 are suppressed. India-specific labels (Tabla, Dhol, Fireworks) are mapped to their regional CC text equivalents.
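The decision logic above can be written out as a short function. The weights, the 0.12 bonus, and the 0.50 cutoff come from this README; the function names are hypothetical, and the sketch assumes that "high-impact label" means the event's category is `high_impact` and that the score is not clamped to 1.0.

```python
# (audio_weight, visual_weight) per category, from the table above.
WEIGHTS = {
    "high_impact": (0.85, 0.15),
    "social": (0.55, 0.45),
    "interactive": (0.45, 0.55),
    "ambient": (0.30, 0.70),
}
HIGH_IMPACT_BONUS = 0.12
SUPPRESS_BELOW = 0.50


def cc_score(category, audio_confidence, reaction_confidence):
    """Category-aware weighted sum of audio and visual confidence."""
    audio_w, visual_w = WEIGHTS[category]
    score = audio_w * audio_confidence + visual_w * reaction_confidence
    if category == "high_impact":  # assumption: bonus is keyed on category
        score += HIGH_IMPACT_BONUS
    return score


def should_caption(category, audio_confidence, reaction_confidence):
    """True when the fused score reaches the suppression cutoff."""
    return cc_score(category, audio_confidence, reaction_confidence) >= SUPPRESS_BELOW


# Loud rain that nobody on screen reacts to: 0.30*0.9 + 0.70*0.1 = 0.34
rain = should_caption("ambient", 0.9, 0.1)
# A moderately confident gunshot: 0.85*0.6 + 0.15*0.1 + 0.12 = 0.645
gunshot = should_caption("high_impact", 0.6, 0.1)
```

The two examples show the intended asymmetry: an ambient sound needs a visible reaction to be captioned, while a high-impact sound passes on audio evidence alone.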