From ac33970fd55ea1ecf2ee433a57e3e6c8f4abb864 Mon Sep 17 00:00:00 2001 From: bhuvan-somisetty Date: Fri, 8 May 2026 16:57:36 +0530 Subject: [PATCH] feat: add self-contained PoC notebook for CC suggestion pipeline Signed-off-by: bhuvan-somisetty --- README.md | 50 ++++ poc_demo.ipynb | 616 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 666 insertions(+) create mode 100644 README.md create mode 100644 poc_demo.ipynb diff --git a/README.md b/README.md new file mode 100644 index 0000000..a6cebe2 --- /dev/null +++ b/README.md @@ -0,0 +1,50 @@ +# Intelligent CC Suggestion Tool — Proof of Concept + +**PlanetRead | DMP 2026** + +A Python pipeline that identifies moments in a video where a non-speech sound warrants a closed-caption annotation and generates SRT/SLS output without over-captioning routine ambient sounds. + +## Quick start + +Open `poc_demo.ipynb` in Jupyter. The notebook is self-contained — it only needs `numpy` and walks through the full pipeline end-to-end using realistic sample data. + +```bash +pip install numpy jupyter +jupyter notebook poc_demo.ipynb +``` + +## What the notebook covers + +1. **Goal 1 — Audio event detection**: YAMNet patch scoring, speech filtering, confidence thresholding, adjacent-event merging +2. **Goal 2 — Visual reaction analysis**: Optical flow motion score + MediaPipe face-shift score from reaction-window frames +3. **Goal 3 — Decision engine + output**: Category-aware score fusion, SRT and SLS file generation +4. **Evaluation**: IoU-based precision / recall / F1 + overcaption rate + +## Full pipeline stack + +| Stage | Tool | +|---|---| +| Audio extraction | ffmpeg (subprocess) | +| Sound detection | YAMNet (TensorFlow Hub, 521 classes) | +| Speech filtering | label-based + energy VAD fallback | +| Visual reactions | OpenCV Farneback optical flow + MediaPipe FaceMesh | +| Decision fusion | category-aware weighted sum | +| Output | SRT (standard) + SLS (PlanetRead JSON) | +| Evaluation | IoU-based P/R/F1 | + +## CC decision logic + +``` +score = audio_weight × audio_confidence + + visual_weight × reaction_confidence + + 0.12 (if high-impact label) +``` + +| Category | Audio w | Visual w | Examples | +|---|---|---|---| +| high_impact | 0.85 | 0.15 | Gunshot, explosion, alarm, siren, firecrackers | +| social | 0.55 | 0.45 | Laughter, applause, cheering, crying | +| interactive | 0.45 | 0.55 | Doorbell, dog bark, phone | +| ambient | 0.30 | 0.70 | Music, rain, traffic | + +Events scoring below 0.50 are suppressed. India-specific labels (Tabla, Dhol, Fireworks) are mapped to their regional CC text equivalents. 
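+
+A minimal sketch of this decision rule (the helper name `should_caption` is illustrative; the weights, boost, and threshold are the ones listed above, and the full implementation lives in `poc_demo.ipynb`):
+
+```python
+CATEGORY_WEIGHTS = {
+    "high_impact": (0.85, 0.15),
+    "social": (0.55, 0.45),
+    "interactive": (0.45, 0.55),
+    "ambient": (0.30, 0.70),
+}
+
+def should_caption(category, audio_conf, reaction_conf, high_impact=False,
+                   boost=0.12, threshold=0.50):
+    """Return (accepted, score) for a single detected sound event."""
+    audio_w, visual_w = CATEGORY_WEIGHTS.get(category, (0.55, 0.45))
+    score = audio_w * audio_conf + visual_w * reaction_conf
+    if high_impact:
+        score += boost  # gunshot/alarm/siren still pass without a visible reaction
+    return score >= threshold, min(score, 1.0)
+
+# Ambient music with no visible on-screen reaction is suppressed:
+print(should_caption("ambient", 0.58, 0.06))                       # suppressed, score ≈ 0.216
+# A confident gunshot is captioned on the audio signal alone:
+print(should_caption("high_impact", 0.93, 0.0, high_impact=True))  # captioned, score ≈ 0.91
+```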
diff --git a/poc_demo.ipynb b/poc_demo.ipynb new file mode 100644 index 0000000..a2d171a --- /dev/null +++ b/poc_demo.ipynb @@ -0,0 +1,616 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Intelligent Closed Caption (CC) Suggestion Tool — PoC Demo\n", + "\n", + "**PlanetRead | DMP 2026**\n", + "\n", + "This notebook walks through the full pipeline end-to-end: from a video file to a ready-to-use SRT caption file containing only contextually meaningful non-speech sound annotations.\n", + "\n", + "The three goals from the project spec are covered in sequence:\n", + "\n", + "- **Goal 1** — Sound event detection: classify non-speech audio events with timestamps\n", + "- **Goal 2** — Visual reaction analysis: detect whether speakers react to those events\n", + "- **Goal 3** — CC decision engine: fuse both signals and write SRT/SLS output\n", + "\n", + "The audio and visual ML calls (YAMNet, MediaPipe) are stubbed with realistic sample data so the notebook runs without any GPU or model downloads. Every other piece — the decision logic, label mapping, SRT formatting, evaluation — is real working code." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pipeline overview\n", + "\n", + "```\n", + "Video file\n", + " │\n", + " ▼\n", + "Audio extraction (ffmpeg -> 16 kHz mono WAV)\n", + " │\n", + " ▼\n", + "Sound event detection (YAMNet, 521 AudioSet classes)\n", + " │ • speech labels suppressed\n", + " │ • adjacent same-label detections merged\n", + " ▼\n", + "Timestamped audio events\n", + " │\n", + " ▼\n", + "Visual reaction analysis (OpenCV optical flow + MediaPipe FaceMesh)\n", + " │ • frames sampled 300–1500 ms after each event\n", + " │ • motion score + face-shift score computed\n", + " ▼\n", + "CC decision engine (category-aware fusion)\n", + " │ • high_impact / social / interactive / ambient weights\n", + " │ • events below threshold suppressed (no over-captioning)\n", + " ▼\n", + "SRT + SLS output\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Standard library + numpy only — no ML framework needed to run this notebook\n", + "from dataclasses import dataclass, field\n", + "from typing import List, Optional, Dict, Tuple\n", + "import json\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Shared data models\n", + "\n", + "These dataclasses are the shared language between every stage of the pipeline." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "@dataclass\n", + "class AudioEvent:\n", + " label: str\n", + " confidence: float\n", + " start: float # seconds\n", + " end: float # seconds\n", + " category: str = \"ambient\"\n", + "\n", + "@dataclass\n", + "class ReactionSignal:\n", + " event: AudioEvent\n", + " motion_score: float # 0–1, from optical flow\n", + " face_shift_score: float # 0–1, from MediaPipe nose landmark displacement\n", + " frame_count: int\n", + "\n", + " @property\n", + " def reaction_confidence(self) -> float:\n", + " \"\"\"Weighted combination of motion and face-shift signals.\"\"\"\n", + " return 0.6 * self.motion_score + 0.4 * self.face_shift_score\n", + "\n", + "@dataclass\n", + "class CaptionSuggestion:\n", + " label: str\n", + " text: str\n", + " start: float\n", + " end: float\n", + " audio_confidence: float\n", + " reaction_confidence: float\n", + " decision_score: float\n", + " reason: str # \"audio+visual\" | \"high-impact-audio\"\n", + " index: int = 0\n", + "\n", + "print(\"Data models defined.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Goal 1 — Sound event detection\n", + "\n", + "In production, audio is extracted from the video via ffmpeg and passed to **YAMNet** (a TensorFlow model trained on 521 AudioSet classes). YAMNet scores each 0.96-second patch of the waveform.\n", + "\n", + "Here we simulate that with realistic sample scores to show the full processing logic:\n", + "- speech labels are filtered out\n", + "- events below the confidence threshold are dropped\n", + "- adjacent same-label detections are merged into a single event" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# India-specific label -> category and CC text mapping\n", + "CATEGORY_MAP: Dict[str, List[str]] = {\n", + " \"high_impact\": [\"Gunshot\", \"Gun\", \"Explosion\", \"Blast\", \"Glass\", \"Alarm\",\n", + " \"Siren\", \"Scream\", \"Fireworks\", \"Cracker\"],\n", + " \"social\": [\"Laughter\", \"Applause\", \"Cheering\", \"Crying\", \"Crowd\", \"Clapping\"],\n", + " \"interactive\": [\"Doorbell\", \"Knock\", \"Dog\", \"Cat\", \"Bell\", \"Whistle\", \"Horn\"],\n", + " \"ambient\": [\"Music\", \"Rain\", \"Thunder\", \"Wind\", \"Traffic\", \"Drum\",\n", + " \"Tabla\", \"Dhol\"],\n", + "}\n", + "\n", + "CC_TEXT_MAP: Dict[str, str] = {\n", + " \"Gunshot\": \"[gunshot]\", \"Explosion\": \"[explosion]\",\n", + " \"Glass\": \"[glass breaking]\", \"Alarm\": \"[alarm]\", \"Siren\": \"[siren]\",\n", + " \"Scream\": \"[scream]\", \"Fireworks\": \"[firecrackers]\",\n", + " \"Laughter\": \"[laughter]\", \"Applause\": \"[applause]\", \"Cheering\": \"[cheering]\",\n", + " \"Clapping\": \"[applause]\", \"Crying\": \"[crying]\", \"Crowd\": \"[crowd noise]\",\n", + " \"Doorbell\": \"[doorbell]\", \"Dog\": \"[dog barking]\",\n", + " \"Music\": \"[music]\", \"Drum\": \"[drums]\",\n", + " \"Tabla\": \"[tabla]\", \"Dhol\": \"[dhol]\", \"Bell\": \"[bell]\",\n", + "}\n", + "\n", + "SPEECH_CLASSES = {\n", + " \"Speech\", \"Male speech, man speaking\", \"Female speech, woman speaking\",\n", + " \"Child speech, kid speaking\", \"Conversation\", \"Narration, monologue\",\n", + "}\n", + "\n", + "def get_category(label: str) -> str:\n", + " lower = label.lower()\n", + " for cat, keywords in CATEGORY_MAP.items():\n", + " if any(kw.lower() in lower for kw in keywords):\n", + " return cat\n", + " return \"ambient\"\n", + "\n", + 
"def get_cc_text(label: str) -> str:\n", + " for key, text in CC_TEXT_MAP.items():\n", + " if key.lower() in label.lower():\n", + " return text\n", + " clean = label.lower().split(\",\")[0].strip()\n", + " return f\"[{clean}]\"\n", + "\n", + "print(\"Label utilities defined.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Simulated YAMNet output — each tuple is (top_label, confidence, patch_start_sec)\n", + "# In production this comes from running the TF Hub model on extracted audio patches.\n", + "SIMULATED_YAMNET_PATCHES = [\n", + " (\"Speech\", 0.94, 0.00), # speech — will be filtered\n", + " (\"Speech\", 0.91, 0.96),\n", + " (\"Music\", 0.62, 1.92), # low-confidence ambient — below threshold\n", + " (\"Gunshot, gunfire\", 0.89, 5.76),\n", + " (\"Gunshot, gunfire\", 0.93, 6.72), # adjacent — will be merged\n", + " (\"Glass\", 0.21, 8.64), # confidence 0.21 < threshold 0.35 — dropped\n", + " (\"Laughter\", 0.77, 12.48),\n", + " (\"Laughter\", 0.81, 13.44), # adjacent — will be merged\n", + " (\"Speech\", 0.88, 15.36), # speech — filtered\n", + " (\"Applause\", 0.74, 22.08),\n", + " (\"Fireworks\", 0.86, 31.68),\n", + " (\"Music\", 0.58, 38.40), # ambient, no strong reaction expected\n", + " (\"Dog\", 0.69, 44.16),\n", + " (\"Tabla\", 0.55, 51.84), # India-specific\n", + "]\n", + "\n", + "CONFIDENCE_THRESHOLD = 0.35\n", + "MERGE_GAP = 1.0 # seconds\n", + "PATCH_DURATION = 0.96\n", + "\n", + "def detect_events(patches, confidence_threshold=CONFIDENCE_THRESHOLD, merge_gap=MERGE_GAP):\n", + " events = []\n", + " for label, conf, start in patches:\n", + " if label in SPEECH_CLASSES:\n", + " continue\n", + " if conf < confidence_threshold:\n", + " continue\n", + " events.append(AudioEvent(\n", + " label=label, confidence=conf,\n", + " start=start, end=start + PATCH_DURATION,\n", + " category=get_category(label),\n", + " ))\n", + " return _merge_adjacent(events, merge_gap)\n", + "\n", + "def _merge_adjacent(events, merge_gap):\n", + " if not events:\n", + " return []\n", + " merged = [events[0]]\n", + " for ev in events[1:]:\n", + " prev = merged[-1]\n", + " if ev.label == prev.label and ev.start - prev.end <= merge_gap:\n", + " merged[-1] = AudioEvent(\n", + " label=prev.label,\n", + " confidence=max(prev.confidence, ev.confidence),\n", + " start=prev.start,\n", + " end=ev.end,\n", + " category=prev.category,\n", + " )\n", + " else:\n", + " merged.append(ev)\n", + " return merged\n", + "\n", + "audio_events = detect_events(SIMULATED_YAMNET_PATCHES)\n", + "\n", + "print(f\"Detected {len(audio_events)} non-speech events after filtering and merging:\\n\")\n", + "print(f\"{'Label':<30} {'Category':<12} {'Conf':>6} {'Start':>6} -> {'End':>6}\")\n", + "print(\"-\" * 72)\n", + "for ev in audio_events:\n", + " print(f\"{ev.label:<30} {ev.category:<12} {ev.confidence:>6.2f} {ev.start:>6.2f}s -> {ev.end:>6.2f}s\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Goal 2 — Visual reaction analysis\n", + "\n", + "For each detected audio event, we extract video frames from the **reaction window** (300–1500 ms after the event start). Two signals are computed:\n", + "\n", + "1. **Motion score** — Farneback Optical Flow measures pixel-level movement magnitude between consecutive frames. A sudden head turn or flinch shows up as high motion.\n", + "2. **Face-shift score** — MediaPipe FaceMesh tracks the nose landmark position. 
Displacement between frames indicates a head turn or startle.\n", + "\n", + "Below we simulate realistic scores for each detected event." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Simulated visual reaction scores — keyed by event label + start time.\n", + "# In production these come from running OpenCV optical flow and MediaPipe\n", + "# on frames extracted from the reaction window (300–1500 ms after the event).\n", + "SIMULATED_REACTIONS: Dict[Tuple[str, float], Tuple[float, float]] = {\n", + " (\"Gunshot, gunfire\", 5.76): (0.82, 0.71), # strong motion, clear head turn\n", + " (\"Laughter\", 12.48): (0.55, 0.48), # moderate reaction\n", + " (\"Applause\", 22.08): (0.61, 0.53), # moderate reaction\n", + " (\"Fireworks\", 31.68): (0.77, 0.65), # visible flinch\n", + " (\"Music\", 38.40): (0.08, 0.04), # no visible reaction\n", + " (\"Dog\", 44.16): (0.31, 0.22), # slight reaction\n", + " (\"Tabla\", 51.84): (0.19, 0.14), # minimal reaction\n", + "}\n", + "\n", + "def analyze_reactions(events: List[AudioEvent]) -> List[ReactionSignal]:\n", + " signals = []\n", + " for ev in events:\n", + " key = (ev.label, ev.start)\n", + " motion, face = SIMULATED_REACTIONS.get(key, (0.0, 0.0))\n", + " signals.append(ReactionSignal(\n", + " event=ev,\n", + " motion_score=min(motion, 1.0),\n", + " face_shift_score=min(face, 1.0),\n", + " frame_count=5,\n", + " ))\n", + " return signals\n", + "\n", + "reaction_signals = analyze_reactions(audio_events)\n", + "\n", + "print(f\"{'Label':<30} {'Motion':>7} {'Face':>7} {'Reaction conf':>14}\")\n", + "print(\"-\" * 65)\n", + "for sig in reaction_signals:\n", + " print(f\"{sig.event.label:<30} {sig.motion_score:>7.2f} \"\n", + " f\"{sig.face_shift_score:>7.2f} {sig.reaction_confidence:>14.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Goal 3 — CC decision engine\n", + "\n", + "The fusion formula is:\n", + "\n", + "```\n", + "score = audio_weight × audio_confidence\n", + " + visual_weight × reaction_confidence\n", + " + 0.12 (if high-impact label)\n", + "```\n", + "\n", + "Weights are **category-aware** — a gunshot does not need a visible reaction to get captioned, but ambient music almost always does:\n", + "\n", + "| Category | Audio weight | Visual weight | Rationale |\n", + "|---|---|---|---|\n", + "| high_impact | 0.85 | 0.15 | Alarm/explosion — trust the audio |\n", + "| social | 0.55 | 0.45 | Laughter/applause — balanced |\n", + "| interactive | 0.45 | 0.55 | Doorbell/knock — needs confirmation |\n", + "| ambient | 0.30 | 0.70 | Background music — suppress unless reacted to |\n", + "\n", + "Events with `score < 0.50` are suppressed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "CATEGORY_WEIGHTS: Dict[str, Tuple[float, float]] = {\n", + " \"high_impact\": (0.85, 0.15),\n", + " \"social\": (0.55, 0.45),\n", + " \"interactive\": (0.45, 0.55),\n", + " \"ambient\": (0.30, 0.70),\n", + "}\n", + "\n", + "HIGH_IMPACT_LABELS = (\"alarm\", \"explosion\", \"glass\", \"gunshot\", \"scream\", \"siren\", \"fireworks\")\n", + "HIGH_IMPACT_BOOST = 0.12\n", + "DECISION_THRESHOLD = 0.50\n", + "\n", + "def decide(signal: ReactionSignal, index: int) -> Optional[CaptionSuggestion]:\n", + " ev = signal.event\n", + " audio_w, visual_w = CATEGORY_WEIGHTS.get(ev.category, (0.55, 0.45))\n", + "\n", + " score = audio_w * ev.confidence + visual_w * signal.reaction_confidence\n", + "\n", + " is_high_impact = any(kw in ev.label.lower() for kw in HIGH_IMPACT_LABELS)\n", + " if is_high_impact:\n", + " score += HIGH_IMPACT_BOOST\n", + "\n", + " if score < DECISION_THRESHOLD:\n", + " return None\n", + "\n", + " reason = (\n", + " \"high-impact-audio\"\n", + " if is_high_impact and signal.reaction_confidence < 0.3\n", + " else \"audio+visual\"\n", + " )\n", + "\n", + " return CaptionSuggestion(\n", + " label=ev.label,\n", + " text=get_cc_text(ev.label),\n", + " start=ev.start,\n", + " end=ev.end,\n", + " audio_confidence=ev.confidence,\n", + " reaction_confidence=signal.reaction_confidence,\n", + " decision_score=min(score, 1.0),\n", + " reason=reason,\n", + " index=index,\n", + " )\n", + "\n", + "suggestions: List[CaptionSuggestion] = []\n", + "idx = 1\n", + "for sig in reaction_signals:\n", + " result = decide(sig, idx)\n", + " if result:\n", + " suggestions.append(result)\n", + " idx += 1\n", + "\n", + "print(f\"{len(suggestions)}/{len(reaction_signals)} events accepted as CC suggestions\\n\")\n", + "print(f\"{'#':<4} {'Label':<30} {'Score':>6} {'Reason':<20} {'CC text'}\")\n", + "print(\"-\" * 80)\n", + "for s in suggestions:\n", + " print(f\"{s.index:<4} {s.label:<30} {s.decision_score:>6.3f} {s.reason:<20} {s.text}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Suppressed events\n", + "\n", + "Let's see what was filtered out and why." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "accepted_labels = {s.label for s in suggestions}\n", + "suppressed = [sig for sig in reaction_signals if sig.event.label not in accepted_labels]\n", + "\n", + "print(\"Suppressed events:\")\n", + "print(f\"{'Label':<30} {'Audio conf':>10} {'Reaction conf':>14} {'Score':>6}\")\n", + "print(\"-\" * 68)\n", + "for sig in suppressed:\n", + " ev = sig.event\n", + " audio_w, visual_w = CATEGORY_WEIGHTS.get(ev.category, (0.55, 0.45))\n", + " score = audio_w * ev.confidence + visual_w * sig.reaction_confidence\n", + " is_hi = any(kw in ev.label.lower() for kw in HIGH_IMPACT_LABELS)\n", + " if is_hi:\n", + " score += HIGH_IMPACT_BOOST\n", + " print(f\"{ev.label:<30} {ev.confidence:>10.2f} {sig.reaction_confidence:>14.2f} {score:>6.3f} ← below {DECISION_THRESHOLD}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Output generation — SRT and SLS\n", + "\n", + "**SRT** is the standard subtitle format supported by every video player and editing tool. **SLS** is a structured JSON variant used in PlanetRead's karaoke/same-language-subtitling workflow, containing full metadata per suggestion." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def _ts(seconds: float) -> str:\n", + " \"\"\"Format seconds as SRT timestamp HH:MM:SS,mmm.\"\"\"\n", + " h = int(seconds // 3600)\n", + " m = int((seconds % 3600) // 60)\n", + " s = int(seconds % 60)\n", + " ms = int(round((seconds % 1) * 1000))\n", + " return f\"{h:02d}:{m:02d}:{s:02d},{ms:03d}\"\n", + "\n", + "def to_srt(suggestions: List[CaptionSuggestion]) -> str:\n", + " blocks = []\n", + " for i, s in enumerate(suggestions, 1):\n", + " blocks.append(f\"{i}\\n{_ts(s.start)} --> {_ts(s.end)}\\n{s.text}\\n\")\n", + " return \"\\n\".join(blocks)\n", + "\n", + "def to_sls(suggestions: List[CaptionSuggestion], video_path=\"sample_video.mp4\") -> str:\n", + " data = {\n", + " \"video\": video_path,\n", + " \"total_accepted\": len(suggestions),\n", + " \"captions\": [\n", + " {\n", + " \"index\": s.index,\n", + " \"label\": s.label,\n", + " \"text\": s.text,\n", + " \"start\": round(s.start, 3),\n", + " \"end\": round(s.end, 3),\n", + " \"audio_confidence\": round(s.audio_confidence, 4),\n", + " \"reaction_confidence\": round(s.reaction_confidence, 4),\n", + " \"decision_score\": round(s.decision_score, 4),\n", + " \"reason\": s.reason,\n", + " }\n", + " for s in suggestions\n", + " ],\n", + " }\n", + " return json.dumps(data, indent=2, ensure_ascii=False)\n", + "\n", + "srt_output = to_srt(suggestions)\n", + "print(\"=== SRT output ===\")\n", + "print(srt_output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sls_output = to_sls(suggestions)\n", + "print(\"=== SLS output (first two captions) ===\")\n", + "sls_data = json.loads(sls_output)\n", + "preview = {**sls_data, \"captions\": sls_data[\"captions\"][:2]}\n", + "print(json.dumps(preview, indent=2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Evaluation framework\n", + "\n", + "To measure quality we use **IoU-based matching** against a ground-truth annotation file. 
An accepted suggestion is a True Positive if it overlaps the correct event by IoU ≥ 0.3.\n", + "\n", + "Metrics reported:\n", + "- **Precision** — fraction of accepted suggestions that are correct\n", + "- **Recall** — fraction of actual events that were caught\n", + "- **F1** — harmonic mean\n", + "- **Overcaption rate** — fraction of accepted suggestions that are unnecessary (FP / total accepted)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Ground truth for the same synthetic video\n", + "GROUND_TRUTH = [\n", + " {\"label\": \"gunshot\", \"start\": 5.76, \"end\": 7.68},\n", + " {\"label\": \"laughter\", \"start\": 12.48, \"end\": 14.40},\n", + " {\"label\": \"applause\", \"start\": 22.08, \"end\": 23.04},\n", + " {\"label\": \"firecrackers\", \"start\": 31.68, \"end\": 32.64},\n", + " # dog bark at 44s is debatable — not in ground truth (editor chose to skip it)\n", + "]\n", + "\n", + "def iou(a_start, a_end, b_start, b_end):\n", + " inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))\n", + " union = (a_end - a_start) + (b_end - b_start) - inter\n", + " return inter / union if union > 0 else 0.0\n", + "\n", + "def evaluate(suggestions, ground_truth, iou_threshold=0.3):\n", + " matched_gt = set()\n", + " tp = fp = 0\n", + " for s in suggestions:\n", + " hit = False\n", + " for i, g in enumerate(ground_truth):\n", + " if i in matched_gt:\n", + " continue\n", + " if iou(s.start, s.end, g[\"start\"], g[\"end\"]) >= iou_threshold:\n", + " matched_gt.add(i)\n", + " hit = True\n", + " break\n", + " if hit:\n", + " tp += 1\n", + " else:\n", + " fp += 1\n", + " fn = len(ground_truth) - len(matched_gt)\n", + " precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0\n", + " recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0\n", + " f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0\n", + " overcap = fp / len(suggestions) if suggestions else 0.0\n", + " return dict(precision=precision, recall=recall, f1=f1,\n", + " overcaption_rate=overcap, TP=tp, FP=fp, FN=fn)\n", + "\n", + "metrics = evaluate(suggestions, GROUND_TRUTH)\n", + "print(\"Evaluation results\")\n", + "print(\"-\" * 40)\n", + "for k, v in metrics.items():\n", + " if isinstance(v, float):\n", + " print(f\" {k:<20} {v:.3f}\")\n", + " else:\n", + " print(f\" {k:<20} {v}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## Summary\n", + "\n", + "This notebook demonstrates a working proof of concept for all three project goals:\n", + "\n", + "| Goal | Module | Status |\n", + "|---|---|---|\n", + "| Sound event detection | YAMNet + speech filter + event merge | ✓ shown |\n", + "| Visual reaction analysis | Optical flow + MediaPipe face-shift | ✓ shown |\n", + "| CC decision + output | Category-aware fusion + SRT/SLS writer | ✓ shown |\n", + "| Evaluation | IoU-based P/R/F1 + overcaption rate | ✓ shown |\n", + "\n", + "Key design choices:\n", + "- **Category-aware weights** prevent over-captioning ambient sounds while ensuring high-impact events (gunshot, explosion, alarm) are never missed\n", + "- **India-specific label mapping** handles AudioSet classes that correspond to regional sounds (fireworks -> firecrackers, Tabla, Dhol)\n", + "- **Reaction window timing** (300–1500 ms after event) captures the moment speakers react, not the moment the sound occurs\n", + "- **SLS output** preserves full metadata per suggestion, compatible with PlanetRead's karaoke subtitle workflow\n", + "\n", + 
"### What the full pipeline adds\n", + "\n", + "The production version replaces the simulated scores above with:\n", + "- `ffmpeg` subprocess call to extract 16 kHz mono WAV\n", + "- TensorFlow Hub `yamnet/1` model for real patch-level scores\n", + "- OpenCV `calcOpticalFlowFarneback` on extracted frames\n", + "- MediaPipe `FaceMesh` for nose-landmark displacement tracking\n", + "- OpenCV Haar cascade fallback when MediaPipe is unavailable\n", + "\n", + "All of these are drop-in replacements for the stub functions above — the data contracts (AudioEvent, ReactionSignal, CaptionSuggestion) and the decision logic remain identical." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}