From ac33970fd55ea1ecf2ee433a57e3e6c8f4abb864 Mon Sep 17 00:00:00 2001 From: bhuvan-somisetty Date: Fri, 8 May 2026 16:57:36 +0530 Subject: [PATCH] feat: add self-contained PoC notebook for CC suggestion pipeline Signed-off-by: bhuvan-somisetty --- README.md | 50 ++++ poc_demo.ipynb | 616 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 666 insertions(+) create mode 100644 README.md create mode 100644 poc_demo.ipynb diff --git a/README.md b/README.md new file mode 100644 index 0000000..a6cebe2 --- /dev/null +++ b/README.md @@ -0,0 +1,50 @@ +# Intelligent CC Suggestion Tool — Proof of Concept + +**PlanetRead | DMP 2026** + +A Python pipeline that identifies moments in a video where a non-speech sound warrants a closed-caption annotation and generates SRT/SLS output without over-captioning routine ambient sounds. + +## Quick start + +Open `poc_demo.ipynb` in Jupyter. The notebook is self-contained — it only needs `numpy` and walks through the full pipeline end-to-end using realistic sample data. + +```bash +pip install numpy jupyter +jupyter notebook poc_demo.ipynb +``` + +## What the notebook covers + +1. **Goal 1 — Audio event detection**: YAMNet patch scoring, speech filtering, confidence thresholding, adjacent-event merging +2. **Goal 2 — Visual reaction analysis**: Optical flow motion score + MediaPipe face-shift score from reaction-window frames +3. **Goal 3 — Decision engine + output**: Category-aware score fusion, SRT and SLS file generation +4. **Evaluation**: IoU-based precision / recall / F1 + overcaption rate + +## Full pipeline stack + +| Stage | Tool | +|---|---| +| Audio extraction | ffmpeg (subprocess) | +| Sound detection | YAMNet (TensorFlow Hub, 521 classes) | +| Speech filtering | label-based + energy VAD fallback | +| Visual reactions | OpenCV Farneback optical flow + MediaPipe FaceMesh | +| Decision fusion | category-aware weighted sum | +| Output | SRT (standard) + SLS (PlanetRead JSON) | +| Evaluation | IoU-based P/R/F1 | + +## CC decision logic + +``` +score = audio_weight × audio_confidence + + visual_weight × reaction_confidence + + 0.12 (if high-impact label) +``` + +| Category | Audio w | Visual w | Examples | +|---|---|---|---| +| high_impact | 0.85 | 0.15 | Gunshot, explosion, alarm, siren, firecrackers | +| social | 0.55 | 0.45 | Laughter, applause, cheering, crying | +| interactive | 0.45 | 0.55 | Doorbell, dog bark, phone | +| ambient | 0.30 | 0.70 | Music, rain, traffic | + +Events scoring below 0.50 are suppressed. India-specific labels (Tabla, Dhol, Fireworks) are mapped to their regional CC text equivalents. 
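+
+A minimal sketch of this decision rule (the helper name `should_caption` is illustrative; the weights, boost, and threshold are the ones listed above, and the full implementation lives in `poc_demo.ipynb`):
+
+```python
+CATEGORY_WEIGHTS = {
+    "high_impact": (0.85, 0.15),
+    "social": (0.55, 0.45),
+    "interactive": (0.45, 0.55),
+    "ambient": (0.30, 0.70),
+}
+
+def should_caption(category, audio_conf, reaction_conf, high_impact=False,
+                   boost=0.12, threshold=0.50):
+    """Return (accepted, score) for a single detected sound event."""
+    audio_w, visual_w = CATEGORY_WEIGHTS.get(category, (0.55, 0.45))
+    score = audio_w * audio_conf + visual_w * reaction_conf
+    if high_impact:
+        score += boost  # gunshot/alarm/siren still pass without a visible reaction
+    return score >= threshold, min(score, 1.0)
+
+# Ambient music with no visible on-screen reaction is suppressed:
+print(should_caption("ambient", 0.58, 0.06))                       # suppressed, score ≈ 0.216
+# A confident gunshot is captioned on the audio signal alone:
+print(should_caption("high_impact", 0.93, 0.0, high_impact=True))  # captioned, score ≈ 0.91
+```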
diff --git a/poc_demo.ipynb b/poc_demo.ipynb new file mode 100644 index 0000000..a2d171a --- /dev/null +++ b/poc_demo.ipynb @@ -0,0 +1,616 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Intelligent Closed Caption (CC) Suggestion Tool — PoC Demo\n", + "\n", + "**PlanetRead | DMP 2026**\n", + "\n", + "This notebook walks through the full pipeline end-to-end: from a video file to a ready-to-use SRT caption file containing only contextually meaningful non-speech sound annotations.\n", + "\n", + "The three goals from the project spec are covered in sequence:\n", + "\n", + "- **Goal 1** — Sound event detection: classify non-speech audio events with timestamps\n", + "- **Goal 2** — Visual reaction analysis: detect whether speakers react to those events\n", + "- **Goal 3** — CC decision engine: fuse both signals and write SRT/SLS output\n", + "\n", + "The audio and visual ML calls (YAMNet, MediaPipe) are stubbed with realistic sample data so the notebook runs without any GPU or model downloads. Every other piece — the decision logic, label mapping, SRT formatting, evaluation — is real working code." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pipeline overview\n", + "\n", + "```\n", + "Video file\n", + " │\n", + " ▼\n", + "Audio extraction (ffmpeg -> 16 kHz mono WAV)\n", + " │\n", + " ▼\n", + "Sound event detection (YAMNet, 521 AudioSet classes)\n", + " │ • speech labels suppressed\n", + " │ • adjacent same-label detections merged\n", + " ▼\n", + "Timestamped audio events\n", + " │\n", + " ▼\n", + "Visual reaction analysis (OpenCV optical flow + MediaPipe FaceMesh)\n", + " │ • frames sampled 300–1500 ms after each event\n", + " │ • motion score + face-shift score computed\n", + " ▼\n", + "CC decision engine (category-aware fusion)\n", + " │ • high_impact / social / interactive / ambient weights\n", + " │ • events below threshold suppressed (no over-captioning)\n", + " ▼\n", + "SRT + SLS output\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Standard library + numpy only — no ML framework needed to run this notebook\n", + "from dataclasses import dataclass, field\n", + "from typing import List, Optional, Dict, Tuple\n", + "import json\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Shared data models\n", + "\n", + "These dataclasses are the shared language between every stage of the pipeline." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@dataclass\n", + "class AudioEvent:\n", + " label: str\n", + " confidence: float\n", + " start: float # seconds\n", + " end: float # seconds\n", + " category: str = \"ambient\"\n", + "\n", + "@dataclass\n", + "class ReactionSignal:\n", + " event: AudioEvent\n", + " motion_score: float # 0–1, from optical flow\n", + " face_shift_score: float # 0–1, from MediaPipe nose landmark displacement\n", + " frame_count: int\n", + "\n", + " @property\n", + " def reaction_confidence(self) -> float:\n", + " \"\"\"Weighted combination of motion and face-shift signals.\"\"\"\n", + " return 0.6 * self.motion_score + 0.4 * self.face_shift_score\n", + "\n", + "@dataclass\n", + "class CaptionSuggestion:\n", + " label: str\n", + " text: str\n", + " start: float\n", + " end: float\n", + " audio_confidence: float\n", + " reaction_confidence: float\n", + " decision_score: float\n", + " reason: str # \"audio+visual\" | \"high-impact-audio\"\n", + " index: int = 0\n", + "\n", + "print(\"Data models defined.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Goal 1 — Sound event detection\n", + "\n", + "In production, audio is extracted from the video via ffmpeg and passed to **YAMNet** (a TensorFlow model trained on 521 AudioSet classes). YAMNet scores each 0.96-second patch of the waveform.\n", + "\n", + "Here we simulate that with realistic sample scores to show the full processing logic:\n", + "- speech labels are filtered out\n", + "- events below the confidence threshold are dropped\n", + "- adjacent same-label detections are merged into a single event" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# India-specific label -> category and CC text mapping\n", + "CATEGORY_MAP: Dict[str, List[str]] = {\n", + " \"high_impact\": [\"Gunshot\", \"Gun\", \"Explosion\", \"Blast\", \"Glass\", \"Alarm\",\n", + " \"Siren\", \"Scream\", \"Fireworks\", \"Cracker\"],\n", + " \"social\": [\"Laughter\", \"Applause\", \"Cheering\", \"Crying\", \"Crowd\", \"Clapping\"],\n", + " \"interactive\": [\"Doorbell\", \"Knock\", \"Dog\", \"Cat\", \"Bell\", \"Whistle\", \"Horn\"],\n", + " \"ambient\": [\"Music\", \"Rain\", \"Thunder\", \"Wind\", \"Traffic\", \"Drum\",\n", + " \"Tabla\", \"Dhol\"],\n", + "}\n", + "\n", + "CC_TEXT_MAP: Dict[str, str] = {\n", + " \"Gunshot\": \"[gunshot]\", \"Explosion\": \"[explosion]\",\n", + " \"Glass\": \"[glass breaking]\", \"Alarm\": \"[alarm]\", \"Siren\": \"[siren]\",\n", + " \"Scream\": \"[scream]\", \"Fireworks\": \"[firecrackers]\",\n", + " \"Laughter\": \"[laughter]\", \"Applause\": \"[applause]\", \"Cheering\": \"[cheering]\",\n", + " \"Clapping\": \"[applause]\", \"Crying\": \"[crying]\", \"Crowd\": \"[crowd noise]\",\n", + " \"Doorbell\": \"[doorbell]\", \"Dog\": \"[dog barking]\",\n", + " \"Music\": \"[music]\", \"Drum\": \"[drums]\",\n", + " \"Tabla\": \"[tabla]\", \"Dhol\": \"[dhol]\", \"Bell\": \"[bell]\",\n", + "}\n", + "\n", + "SPEECH_CLASSES = {\n", + " \"Speech\", \"Male speech, man speaking\", \"Female speech, woman speaking\",\n", + " \"Child speech, kid speaking\", \"Conversation\", \"Narration, monologue\",\n", + "}\n", + "\n", + "def get_category(label: str) -> str:\n", + " lower = label.lower()\n", + " for cat, keywords in CATEGORY_MAP.items():\n", + " if any(kw.lower() in lower for kw in keywords):\n", + " return cat\n", + " return \"ambient\"\n", + "\n", + 
"def get_cc_text(label: str) -> str:\n", + " for key, text in CC_TEXT_MAP.items():\n", + " if key.lower() in label.lower():\n", + " return text\n", + " clean = label.lower().split(\",\")[0].strip()\n", + " return f\"[{clean}]\"\n", + "\n", + "print(\"Label utilities defined.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Simulated YAMNet output — each tuple is (top_label, confidence, patch_start_sec)\n", + "# In production this comes from running the TF Hub model on extracted audio patches.\n", + "SIMULATED_YAMNET_PATCHES = [\n", + " (\"Speech\", 0.94, 0.00), # speech — will be filtered\n", + " (\"Speech\", 0.91, 0.96),\n", + " (\"Music\", 0.62, 1.92), # low-confidence ambient — below threshold\n", + " (\"Gunshot, gunfire\", 0.89, 5.76),\n", + " (\"Gunshot, gunfire\", 0.93, 6.72), # adjacent — will be merged\n", + " (\"Glass\", 0.21, 8.64), # confidence 0.21 < threshold 0.35 — dropped\n", + " (\"Laughter\", 0.77, 12.48),\n", + " (\"Laughter\", 0.81, 13.44), # adjacent — will be merged\n", + " (\"Speech\", 0.88, 15.36), # speech — filtered\n", + " (\"Applause\", 0.74, 22.08),\n", + " (\"Fireworks\", 0.86, 31.68),\n", + " (\"Music\", 0.58, 38.40), # ambient, no strong reaction expected\n", + " (\"Dog\", 0.69, 44.16),\n", + " (\"Tabla\", 0.55, 51.84), # India-specific\n", + "]\n", + "\n", + "CONFIDENCE_THRESHOLD = 0.35\n", + "MERGE_GAP = 1.0 # seconds\n", + "PATCH_DURATION = 0.96\n", + "\n", + "def detect_events(patches, confidence_threshold=CONFIDENCE_THRESHOLD, merge_gap=MERGE_GAP):\n", + " events = []\n", + " for label, conf, start in patches:\n", + " if label in SPEECH_CLASSES:\n", + " continue\n", + " if conf < confidence_threshold:\n", + " continue\n", + " events.append(AudioEvent(\n", + " label=label, confidence=conf,\n", + " start=start, end=start + PATCH_DURATION,\n", + " category=get_category(label),\n", + " ))\n", + " return _merge_adjacent(events, merge_gap)\n", + "\n", + "def _merge_adjacent(events, merge_gap):\n", + " if not events:\n", + " return []\n", + " merged = [events[0]]\n", + " for ev in events[1:]:\n", + " prev = merged[-1]\n", + " if ev.label == prev.label and ev.start - prev.end <= merge_gap:\n", + " merged[-1] = AudioEvent(\n", + " label=prev.label,\n", + " confidence=max(prev.confidence, ev.confidence),\n", + " start=prev.start,\n", + " end=ev.end,\n", + " category=prev.category,\n", + " )\n", + " else:\n", + " merged.append(ev)\n", + " return merged\n", + "\n", + "audio_events = detect_events(SIMULATED_YAMNET_PATCHES)\n", + "\n", + "print(f\"Detected {len(audio_events)} non-speech events after filtering and merging:\\n\")\n", + "print(f\"{'Label':<30} {'Category':<12} {'Conf':>6} {'Start':>6} -> {'End':>6}\")\n", + "print(\"-\" * 72)\n", + "for ev in audio_events:\n", + " print(f\"{ev.label:<30} {ev.category:<12} {ev.confidence:>6.2f} {ev.start:>6.2f}s -> {ev.end:>6.2f}s\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Goal 2 — Visual reaction analysis\n", + "\n", + "For each detected audio event, we extract video frames from the **reaction window** (300–1500 ms after the event start). Two signals are computed:\n", + "\n", + "1. **Motion score** — Farneback Optical Flow measures pixel-level movement magnitude between consecutive frames. A sudden head turn or flinch shows up as high motion.\n", + "2. **Face-shift score** — MediaPipe FaceMesh tracks the nose landmark position. 
Displacement between frames indicates a head turn or startle.\n", + "\n", + "Below we simulate realistic scores for each detected event." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Simulated visual reaction scores — keyed by event label + start time.\n", + "# In production these come from running OpenCV optical flow and MediaPipe\n", + "# on frames extracted from the reaction window (300–1500 ms after the event).\n", + "SIMULATED_REACTIONS: Dict[Tuple[str, float], Tuple[float, float]] = {\n", + " (\"Gunshot, gunfire\", 5.76): (0.82, 0.71), # strong motion, clear head turn\n", + " (\"Laughter\", 12.48): (0.55, 0.48), # moderate reaction\n", + " (\"Applause\", 22.08): (0.61, 0.53), # moderate reaction\n", + " (\"Fireworks\", 31.68): (0.77, 0.65), # visible flinch\n", + " (\"Music\", 38.40): (0.08, 0.04), # no visible reaction\n", + " (\"Dog\", 44.16): (0.31, 0.22), # slight reaction\n", + " (\"Tabla\", 51.84): (0.19, 0.14), # minimal reaction\n", + "}\n", + "\n", + "def analyze_reactions(events: List[AudioEvent]) -> List[ReactionSignal]:\n", + " signals = []\n", + " for ev in events:\n", + " key = (ev.label, ev.start)\n", + " motion, face = SIMULATED_REACTIONS.get(key, (0.0, 0.0))\n", + " signals.append(ReactionSignal(\n", + " event=ev,\n", + " motion_score=min(motion, 1.0),\n", + " face_shift_score=min(face, 1.0),\n", + " frame_count=5,\n", + " ))\n", + " return signals\n", + "\n", + "reaction_signals = analyze_reactions(audio_events)\n", + "\n", + "print(f\"{'Label':<30} {'Motion':>7} {'Face':>7} {'Reaction conf':>14}\")\n", + "print(\"-\" * 65)\n", + "for sig in reaction_signals:\n", + " print(f\"{sig.event.label:<30} {sig.motion_score:>7.2f} \"\n", + " f\"{sig.face_shift_score:>7.2f} {sig.reaction_confidence:>14.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Goal 3 — CC decision engine\n", + "\n", + "The fusion formula is:\n", + "\n", + "```\n", + "score = audio_weight × audio_confidence\n", + " + visual_weight × reaction_confidence\n", + " + 0.12 (if high-impact label)\n", + "```\n", + "\n", + "Weights are **category-aware** — a gunshot does not need a visible reaction to get captioned, but ambient music almost always does:\n", + "\n", + "| Category | Audio weight | Visual weight | Rationale |\n", + "|---|---|---|---|\n", + "| high_impact | 0.85 | 0.15 | Alarm/explosion — trust the audio |\n", + "| social | 0.55 | 0.45 | Laughter/applause — balanced |\n", + "| interactive | 0.45 | 0.55 | Doorbell/knock — needs confirmation |\n", + "| ambient | 0.30 | 0.70 | Background music — suppress unless reacted to |\n", + "\n", + "Events with `score < 0.50` are suppressed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "CATEGORY_WEIGHTS: Dict[str, Tuple[float, float]] = {\n", + " \"high_impact\": (0.85, 0.15),\n", + " \"social\": (0.55, 0.45),\n", + " \"interactive\": (0.45, 0.55),\n", + " \"ambient\": (0.30, 0.70),\n", + "}\n", + "\n", + "HIGH_IMPACT_LABELS = (\"alarm\", \"explosion\", \"glass\", \"gunshot\", \"scream\", \"siren\", \"fireworks\")\n", + "HIGH_IMPACT_BOOST = 0.12\n", + "DECISION_THRESHOLD = 0.50\n", + "\n", + "def decide(signal: ReactionSignal, index: int) -> Optional[CaptionSuggestion]:\n", + " ev = signal.event\n", + " audio_w, visual_w = CATEGORY_WEIGHTS.get(ev.category, (0.55, 0.45))\n", + "\n", + " score = audio_w * ev.confidence + visual_w * signal.reaction_confidence\n", + "\n", + " is_high_impact = any(kw in ev.label.lower() for kw in HIGH_IMPACT_LABELS)\n", + " if is_high_impact:\n", + " score += HIGH_IMPACT_BOOST\n", + "\n", + " if score < DECISION_THRESHOLD:\n", + " return None\n", + "\n", + " reason = (\n", + " \"high-impact-audio\"\n", + " if is_high_impact and signal.reaction_confidence < 0.3\n", + " else \"audio+visual\"\n", + " )\n", + "\n", + " return CaptionSuggestion(\n", + " label=ev.label,\n", + " text=get_cc_text(ev.label),\n", + " start=ev.start,\n", + " end=ev.end,\n", + " audio_confidence=ev.confidence,\n", + " reaction_confidence=signal.reaction_confidence,\n", + " decision_score=min(score, 1.0),\n", + " reason=reason,\n", + " index=index,\n", + " )\n", + "\n", + "suggestions: List[CaptionSuggestion] = []\n", + "idx = 1\n", + "for sig in reaction_signals:\n", + " result = decide(sig, idx)\n", + " if result:\n", + " suggestions.append(result)\n", + " idx += 1\n", + "\n", + "print(f\"{len(suggestions)}/{len(reaction_signals)} events accepted as CC suggestions\\n\")\n", + "print(f\"{'#':<4} {'Label':<30} {'Score':>6} {'Reason':<20} {'CC text'}\")\n", + "print(\"-\" * 80)\n", + "for s in suggestions:\n", + " print(f\"{s.index:<4} {s.label:<30} {s.decision_score:>6.3f} {s.reason:<20} {s.text}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Suppressed events\n", + "\n", + "Let's see what was filtered out and why." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "accepted_labels = {s.label for s in suggestions}\n", + "suppressed = [sig for sig in reaction_signals if sig.event.label not in accepted_labels]\n", + "\n", + "print(\"Suppressed events:\")\n", + "print(f\"{'Label':<30} {'Audio conf':>10} {'Reaction conf':>14} {'Score':>6}\")\n", + "print(\"-\" * 68)\n", + "for sig in suppressed:\n", + " ev = sig.event\n", + " audio_w, visual_w = CATEGORY_WEIGHTS.get(ev.category, (0.55, 0.45))\n", + " score = audio_w * ev.confidence + visual_w * sig.reaction_confidence\n", + " is_hi = any(kw in ev.label.lower() for kw in HIGH_IMPACT_LABELS)\n", + " if is_hi:\n", + " score += HIGH_IMPACT_BOOST\n", + " print(f\"{ev.label:<30} {ev.confidence:>10.2f} {sig.reaction_confidence:>14.2f} {score:>6.3f} ← below {DECISION_THRESHOLD}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Output generation — SRT and SLS\n", + "\n", + "**SRT** is the standard subtitle format supported by every video player and editing tool. **SLS** is a structured JSON variant used in PlanetRead's karaoke/same-language-subtitling workflow, containing full metadata per suggestion." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def _ts(seconds: float) -> str:\n", + " \"\"\"Format seconds as SRT timestamp HH:MM:SS,mmm.\"\"\"\n", + " h = int(seconds // 3600)\n", + " m = int((seconds % 3600) // 60)\n", + " s = int(seconds % 60)\n", + " ms = int(round((seconds % 1) * 1000))\n", + " return f\"{h:02d}:{m:02d}:{s:02d},{ms:03d}\"\n", + "\n", + "def to_srt(suggestions: List[CaptionSuggestion]) -> str:\n", + " blocks = []\n", + " for i, s in enumerate(suggestions, 1):\n", + " blocks.append(f\"{i}\\n{_ts(s.start)} --> {_ts(s.end)}\\n{s.text}\\n\")\n", + " return \"\\n\".join(blocks)\n", + "\n", + "def to_sls(suggestions: List[CaptionSuggestion], video_path=\"sample_video.mp4\") -> str:\n", + " data = {\n", + " \"video\": video_path,\n", + " \"total_accepted\": len(suggestions),\n", + " \"captions\": [\n", + " {\n", + " \"index\": s.index,\n", + " \"label\": s.label,\n", + " \"text\": s.text,\n", + " \"start\": round(s.start, 3),\n", + " \"end\": round(s.end, 3),\n", + " \"audio_confidence\": round(s.audio_confidence, 4),\n", + " \"reaction_confidence\": round(s.reaction_confidence, 4),\n", + " \"decision_score\": round(s.decision_score, 4),\n", + " \"reason\": s.reason,\n", + " }\n", + " for s in suggestions\n", + " ],\n", + " }\n", + " return json.dumps(data, indent=2, ensure_ascii=False)\n", + "\n", + "srt_output = to_srt(suggestions)\n", + "print(\"=== SRT output ===\")\n", + "print(srt_output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sls_output = to_sls(suggestions)\n", + "print(\"=== SLS output (first two captions) ===\")\n", + "sls_data = json.loads(sls_output)\n", + "preview = {**sls_data, \"captions\": sls_data[\"captions\"][:2]}\n", + "print(json.dumps(preview, indent=2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Evaluation framework\n", + "\n", + "To measure quality we use **IoU-based matching** against a ground-truth annotation file. 
An accepted suggestion is a True Positive if it overlaps the correct event by IoU ≥ 0.3.\n", + "\n", + "Metrics reported:\n", + "- **Precision** — fraction of accepted suggestions that are correct\n", + "- **Recall** — fraction of actual events that were caught\n", + "- **F1** — harmonic mean\n", + "- **Overcaption rate** — fraction of accepted suggestions that are unnecessary (FP / total accepted)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Ground truth for the same synthetic video\n", + "GROUND_TRUTH = [\n", + " {\"label\": \"gunshot\", \"start\": 5.76, \"end\": 7.68},\n", + " {\"label\": \"laughter\", \"start\": 12.48, \"end\": 14.40},\n", + " {\"label\": \"applause\", \"start\": 22.08, \"end\": 23.04},\n", + " {\"label\": \"firecrackers\", \"start\": 31.68, \"end\": 32.64},\n", + " # dog bark at 44s is debatable — not in ground truth (editor chose to skip it)\n", + "]\n", + "\n", + "def iou(a_start, a_end, b_start, b_end):\n", + " inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))\n", + " union = (a_end - a_start) + (b_end - b_start) - inter\n", + " return inter / union if union > 0 else 0.0\n", + "\n", + "def evaluate(suggestions, ground_truth, iou_threshold=0.3):\n", + " matched_gt = set()\n", + " tp = fp = 0\n", + " for s in suggestions:\n", + " hit = False\n", + " for i, g in enumerate(ground_truth):\n", + " if i in matched_gt:\n", + " continue\n", + " if iou(s.start, s.end, g[\"start\"], g[\"end\"]) >= iou_threshold:\n", + " matched_gt.add(i)\n", + " hit = True\n", + " break\n", + " if hit:\n", + " tp += 1\n", + " else:\n", + " fp += 1\n", + " fn = len(ground_truth) - len(matched_gt)\n", + " precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0\n", + " recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0\n", + " f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0\n", + " overcap = fp / len(suggestions) if suggestions else 0.0\n", + " return dict(precision=precision, recall=recall, f1=f1,\n", + " overcaption_rate=overcap, TP=tp, FP=fp, FN=fn)\n", + "\n", + "metrics = evaluate(suggestions, GROUND_TRUTH)\n", + "print(\"Evaluation results\")\n", + "print(\"-\" * 40)\n", + "for k, v in metrics.items():\n", + " if isinstance(v, float):\n", + " print(f\" {k:<20} {v:.3f}\")\n", + " else:\n", + " print(f\" {k:<20} {v}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Summary\n", + "\n", + "This notebook demonstrates a working proof of concept for all three project goals:\n", + "\n", + "| Goal | Module | Status |\n", + "|---|---|---|\n", + "| Sound event detection | YAMNet + speech filter + event merge | ✓ shown |\n", + "| Visual reaction analysis | Optical flow + MediaPipe face-shift | ✓ shown |\n", + "| CC decision + output | Category-aware fusion + SRT/SLS writer | ✓ shown |\n", + "| Evaluation | IoU-based P/R/F1 + overcaption rate | ✓ shown |\n", + "\n", + "Key design choices:\n", + "- **Category-aware weights** prevent over-captioning ambient sounds while ensuring high-impact events (gunshot, explosion, alarm) are never missed\n", + "- **India-specific label mapping** handles AudioSet classes that correspond to regional sounds (fireworks -> firecrackers, Tabla, Dhol)\n", + "- **Reaction window timing** (300–1500 ms after event) captures the moment speakers react, not the moment the sound occurs\n", + "- **SLS output** preserves full metadata per suggestion, compatible with PlanetRead's karaoke subtitle workflow\n", + "\n", + 
"### What the full pipeline adds\n", + "\n", + "The production version replaces the simulated scores above with:\n", + "- `ffmpeg` subprocess call to extract 16 kHz mono WAV\n", + "- TensorFlow Hub `yamnet/1` model for real patch-level scores\n", + "- OpenCV `calcOpticalFlowFarneback` on extracted frames\n", + "- MediaPipe `FaceMesh` for nose-landmark displacement tracking\n", + "- OpenCV Haar cascade fallback when MediaPipe is unavailable\n", + "\n", + "All of these are drop-in replacements for the stub functions above — the data contracts (AudioEvent, ReactionSignal, CaptionSuggestion) and the decision logic remain identical." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}