diff --git a/README.md b/README.md new file mode 100644 index 0000000..21af4fa --- /dev/null +++ b/README.md @@ -0,0 +1,204 @@ +# Intelligent CC Suggestion Tool — DMP 2026 Demo + +**Contributor:** Naitik Gupta +**Organisation:** PlanetRead +**Issue:** [#2 — Intelligent CC Generation](https://github.com/PlanetRead/Intelligent-cc-generation/issues/2) +**Mentors:** @abinash-sketch, @keerthiseelan-planetread + +--- + +## What This Demo Covers + +This is a **complete end-to-end working demo** of all three pipeline goals described in the ticket: + +| Module | Goal | Status | +|--------|------|--------| +| Module 1 | Sound Event Detection (YAMNet + confidence scores + timestamps) | ✅ Complete | +| Module 2 | Speaker Reaction Detection (MediaPipe Face Mesh + OpenCV) | ✅ Complete | +| Module 3 | CC Decision Engine + SRT/SLS Output | ✅ Complete | + +--- + +## How It Works + +``` +Video File + │ + ├─► Module 1: SoundEventDetector + │ YAMNet classifies non-speech audio events + │ Output: [{sound, confidence, start_time, end_time}] + │ + ├─► Module 2: SpeakerReactionDetector + │ MediaPipe Face Mesh tracks head velocity + mouth openness + │ around each audio event timestamp + │ Output: [reaction_confidence_score per event] + │ + └─► Module 3: CCDecisionEngine + Combined score = 0.45 × audio_conf + 0.55 × visual_conf + If combined ≥ threshold → CC approved → written to SRT + Output: .srt file + .json report +``` + +### Decision Formula + +``` +combined = 0.45 × audio_confidence + 0.55 × visual_confidence +``` + +Visual reaction is weighted slightly higher because a visible speaker reaction +is a stronger signal of narrative significance than audio confidence alone. +This prevents over-captioning ambient sounds the speaker ignores. + +### Visual Reaction Signals (Module 2) + +Four signals are scored independently and summed: + +| Signal | Score | Condition | +|--------|-------|-----------| +| Velocity spike | +0.40 | Head moves >2σ above baseline for ≥2 frames | +| Sustained movement | +0.25 | Mean post-event velocity > 1.5× baseline | +| Freeze response | +0.15 | Sudden stillness after event (startle) | +| Mouth opens | +0.20 | Mouth openness increases >1.6× (gasp) | +| Scene diff fallback | +0.25–0.50 | Used when no face is detected | + +--- + +## Installation + +```bash +pip install tensorflow tensorflow-hub librosa moviepy mediapipe opencv-python srt numpy +``` + +**Tested on:** Python 3.10+, TensorFlow 2.19, CPU-only machine +**Note:** YAMNet is downloaded automatically on first run from TensorFlow Hub (~25MB). 
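+
+As a quick, optional sanity check after installing, YAMNet can be loaded directly to
+confirm TensorFlow Hub access works. This is an illustrative snippet only; it mirrors
+what Module 1 does internally when it starts up:
+
+```python
+import csv
+import tensorflow as tf
+import tensorflow_hub as hub
+
+# Downloads ~25MB on first run, then served from the local TF Hub cache.
+model = hub.load("https://tfhub.dev/google/yamnet/1")
+class_map = model.class_map_path().numpy().decode("utf-8")
+with tf.io.gfile.GFile(class_map) as f:
+    n_classes = sum(1 for _ in csv.DictReader(f))
+print(f"YAMNet ready with {n_classes} AudioSet classes")  # expected: 521
+```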
+ +--- + +## Usage + +### Basic (English CC labels) +```bash +python intelligent_cc_pipeline.py --video sample.mp4 +``` + +### Hindi CC labels +```bash +python intelligent_cc_pipeline.py --video sample.mp4 --lang hi +``` + +### Custom thresholds +```bash +python intelligent_cc_pipeline.py --video sample.mp4 \ + --audio-thresh 0.4 \ + --fusion-thresh 0.55 +``` + +### All options +``` +--video Path to input video (required) +--output Output .srt path (auto-named if omitted) +--audio-thresh YAMNet confidence threshold (default: 0.35) +--fusion-thresh Combined score threshold to approve CC (default: 0.50) +--lang 'en' or 'hi' (default: 'en') +--no-json Skip saving the JSON report +``` + +--- + +## Output Files + +**`_cc_suggestions.srt`** — Standard SRT subtitle file +``` +1 +00:00:03,200 --> 00:00:04,680 +[GLASS BREAKING] + +2 +00:00:11,040 --> 00:00:12,520 +[APPLAUSE] +``` + +**`_cc_suggestions_report.json`** — Full pipeline report +```json +{ + "total_events": 8, + "approved_cc": 3, + "audio_events": [...], + "visual_scores": [...], + "accepted_cc": [...] +} +``` + +--- + +## Design Decisions + +### Why YAMNet? +YAMNet is pretrained on Google's AudioSet (2M+ clips, 521 classes) and runs +efficiently on CPU. It requires no fine-tuning for common sound events and +handles the wide range of events relevant to PlanetRead content (applause, +laughter, alarms, music, impacts). PANNs was evaluated as an alternative — +YAMNet was chosen for its lightweight inference and TensorFlow Hub availability. + +### Why MediaPipe Face Mesh? +MediaPipe runs in real-time on CPU, provides 468 landmark points per face, +and is well-suited for the edge/server environments PlanetRead works with. +The 4-signal scoring approach (velocity spike, sustained movement, freeze, +mouth opening) captures different types of startle/reaction responses without +requiring a trained classifier. + +### Why weight visual higher (0.55 vs 0.45)? +A speaker visibly reacting to a sound is unambiguous evidence that the sound +affects the narrative. High audio confidence alone (e.g., distant music) does +not necessarily warrant a CC. This weighting was determined empirically and is +easily tunable via `--fusion-thresh`. + +### Consolidation logic +Consecutive YAMNet frames detecting the same sound class within 1.0 seconds +are merged into a single event. This prevents the same sound from generating +dozens of overlapping CC annotations. + +--- + +## Known Limitations and Future Work + +1. **YAMNet class coverage:** Some culturally specific Indian sounds (dhol, + shehnai, specific street sounds) may not be in YAMNet's 521-class vocabulary. + A fine-tuned model on Indian audio content would improve recall for regional + content. + +2. **Single-face tracking:** Module 2 currently tracks only the primary face. + Multi-speaker scenes (talk shows, debates) would benefit from tracking all + visible speakers and triggering CC if any one of them reacts. + +3. **No GPU acceleration:** The pipeline runs on CPU. GPU inference would + reduce processing time significantly for long-form content. + +4. **SLS format:** The current output is standard SRT. PlanetRead's SLS format + has specific timing and encoding requirements that should be confirmed with + mentors and implemented as a post-processing step. + +5. **Threshold tuning:** The default thresholds (audio=0.35, fusion=0.50) were + set conservatively. Optimal values should be determined through systematic + evaluation with PlanetRead editors on a labeled Hindi/regional video dataset. 
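+
+For that kind of evaluation, the pipeline can also be driven programmatically instead of
+through the CLI. A minimal sketch of a threshold sweep follows; it assumes
+`intelligent_cc_pipeline.py` is importable from the working directory and uses
+`sample.mp4` as a placeholder input:
+
+```python
+from intelligent_cc_pipeline import run_pipeline
+
+# Compare how many CCs each fusion threshold approves on the same video.
+for fusion in (0.40, 0.50, 0.60):
+    result = run_pipeline(
+        video_path="sample.mp4",                    # placeholder input video
+        output_path=f"sample_cc_{fusion:.2f}.srt",
+        fusion_threshold=fusion,
+        lang="hi",
+        save_json=False,
+    )
+    print(f"fusion={fusion}: {len(result['accepted_cc'])} CCs approved")
+```
+
+Note that each call reloads YAMNet and MediaPipe, so a sweep over a long video is slow;
+for systematic tuning it may be preferable to reuse the cached JSON report instead.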
+ +--- + +## Demo Video + +> 📹 https://youtu.be/zn3huIukfiY + +The demo video shows the pipeline running on a sample video, with terminal +output for each module and the final SRT file being generated. + +--- + +## Repository Structure + +``` +. +├── intelligent_cc_pipeline.py # Main pipeline (all 3 modules) +├── README.md # This file +├── sample_output.srt # Example SRT output +└── sample_report.json # Example JSON report +``` \ No newline at end of file diff --git a/canva.mp4 b/canva.mp4 new file mode 100644 index 0000000..27e7704 Binary files /dev/null and b/canva.mp4 differ diff --git a/canva_cc_suggestions.srt b/canva_cc_suggestions.srt new file mode 100644 index 0000000..9fda099 --- /dev/null +++ b/canva_cc_suggestions.srt @@ -0,0 +1,36 @@ +1 +00:00:00,000 --> 00:00:03,840 +[कांच टूटना] + +2 +00:00:03,840 --> 00:00:04,840 +[कांच टूटना] + +3 +00:00:04,800 --> 00:00:05,800 +[कांच टूटना] + +4 +00:00:06,240 --> 00:00:12,000 +[तालियाँ] + +5 +00:00:12,000 --> 00:00:13,000 +[अलार्म] + +6 +00:00:13,440 --> 00:00:14,440 +[गोलीबारी] + +7 +00:00:13,920 --> 00:00:14,920 +[विस्फोट] + +8 +00:00:14,400 --> 00:00:15,400 +[सायरन] + +9 +00:00:14,880 --> 00:00:15,880 +[सायरन] + diff --git a/canva_cc_suggestions_report.json b/canva_cc_suggestions_report.json new file mode 100644 index 0000000..5f6090c --- /dev/null +++ b/canva_cc_suggestions_report.json @@ -0,0 +1,259 @@ +{ + "video": "canva.mp4", + "srt_output": "canva_cc_suggestions.srt", + "lang": "hi", + "audio_threshold": 0.35, + "fusion_threshold": 0.42, + "audio_only_threshold": null, + "total_events": 12, + "approved_cc": 9, + "audio_events": [ + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "confidence": 0.967, + "start_time": 0.0, + "end_time": 3.84, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "confidence": 0.668, + "start_time": 3.84, + "end_time": 4.32, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Liquid", + "label_en": "[LIQUID]", + "confidence": 0.634, + "start_time": 4.32, + "end_time": 4.8, + "label_out": "[पानी]" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "confidence": 0.921, + "start_time": 4.8, + "end_time": 5.76, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Liquid", + "label_en": "[LIQUID]", + "confidence": 0.489, + "start_time": 5.76, + "end_time": 6.24, + "label_out": "[पानी]" + }, + { + "sound": "Applause", + "label_en": "[APPLAUSE]", + "confidence": 0.996, + "start_time": 6.24, + "end_time": 12.0, + "label_out": "[तालियाँ]" + }, + { + "sound": "Alarm", + "label_en": "[ALARM]", + "confidence": 0.363, + "start_time": 12.0, + "end_time": 12.48, + "label_out": "[अलार्म]" + }, + { + "sound": "Gunshot, gunfire", + "label_en": "[GUNSHOT]", + "confidence": 0.954, + "start_time": 13.44, + "end_time": 13.92, + "label_out": "[गोलीबारी]" + }, + { + "sound": "Explosion", + "label_en": "[EXPLOSION]", + "confidence": 0.97, + "start_time": 13.92, + "end_time": 14.4, + "label_out": "[विस्फोट]" + }, + { + "sound": "Police car (siren)", + "label_en": "[SIREN]", + "confidence": 0.572, + "start_time": 14.4, + "end_time": 14.88, + "label_out": "[सायरन]" + }, + { + "sound": "Emergency vehicle", + "label_en": "[SIREN]", + "confidence": 0.799, + "start_time": 14.88, + "end_time": 15.84, + "label_out": "[सायरन]" + }, + { + "sound": "Vehicle", + "label_en": "[VEHICLE]", + "confidence": 0.495, + "start_time": 16.8, + "end_time": 17.28, + "label_out": "[वाहन]" + } + ], + "visual_scores": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.5, + 0.0, + 0.5, + 0.0, 
+ 0.0, + 0.0 + ], + "accepted_cc": [ + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "start_time": 0.0, + "label_out": "[कांच टूटना]", + "end_time": 3.84, + "audio_conf": 0.967, + "visual_conf": 0.0, + "combined": 0.822, + "combined_pre_boost": 0.435, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "start_time": 3.84, + "label_out": "[कांच टूटना]", + "end_time": 4.32, + "audio_conf": 0.668, + "visual_conf": 0.0, + "combined": 0.568, + "combined_pre_boost": 0.301, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "start_time": 4.8, + "label_out": "[कांच टूटना]", + "end_time": 5.76, + "audio_conf": 0.921, + "visual_conf": 0.0, + "combined": 0.783, + "combined_pre_boost": 0.414, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Applause", + "label_en": "[APPLAUSE]", + "start_time": 6.24, + "label_out": "[तालियाँ]", + "end_time": 12.0, + "audio_conf": 0.996, + "visual_conf": 0.0, + "combined": 0.448, + "combined_pre_boost": 0.448, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Alarm", + "label_en": "[ALARM]", + "start_time": 12.0, + "label_out": "[अलार्म]", + "end_time": 12.48, + "audio_conf": 0.363, + "visual_conf": 0.5, + "combined": 0.438, + "combined_pre_boost": 0.438, + "high_impact": true, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Gunshot, gunfire", + "label_en": "[GUNSHOT]", + "start_time": 13.44, + "label_out": "[गोलीबारी]", + "end_time": 13.92, + "audio_conf": 0.954, + "visual_conf": 0.0, + "combined": 0.811, + "combined_pre_boost": 0.429, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Explosion", + "label_en": "[EXPLOSION]", + "start_time": 13.92, + "label_out": "[विस्फोट]", + "end_time": 14.4, + "audio_conf": 0.97, + "visual_conf": 0.5, + "combined": 0.712, + "combined_pre_boost": 0.712, + "high_impact": true, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Police car (siren)", + "label_en": "[SIREN]", + "start_time": 14.4, + "label_out": "[सायरन]", + "end_time": 14.88, + "audio_conf": 0.572, + "visual_conf": 0.0, + "combined": 0.486, + "combined_pre_boost": 0.257, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Emergency vehicle", + "label_en": "[SIREN]", + "start_time": 14.88, + "label_out": "[सायरन]", + "end_time": 15.84, + "audio_conf": 0.799, + "visual_conf": 0.0, + "combined": 0.679, + "combined_pre_boost": 0.36, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + } + ] +} \ No newline at end of file diff --git a/intelligent_cc_pipeline.py b/intelligent_cc_pipeline.py new file mode 100644 index 0000000..eb9a215 --- /dev/null +++ b/intelligent_cc_pipeline.py @@ -0,0 +1,940 @@ +""" +============================================================================= +Intelligent Closed Caption (CC) Suggestion Tool +PlanetRead — DMP 2026 
Demo Submission + +Author : Naitik +GitHub : https://github.com/naitik120gupta +Ticket : https://github.com/PlanetRead/Intelligent-cc-generation/issues/2 + +Description +----------- +End-to-end pipeline that accepts a video file and produces a ready-to-use +SRT file containing only contextually meaningful non-speech CC annotations. + +Pipeline stages: + Module 1 — Sound Event Detection (YAMNet via TensorFlow Hub) + Module 2 — Speaker Reaction Detection (MediaPipe Face Mesh + OpenCV) + Module 3 — CC Decision Engine + SRT Output + +Usage +----- + python intelligent_cc_pipeline.py --video sample.mp4 + python intelligent_cc_pipeline.py --video sample.mp4 --output my_cc.srt + python intelligent_cc_pipeline.py --video sample.mp4 --audio-thresh 0.4 --fusion-thresh 0.5 + python intelligent_cc_pipeline.py --video sample.mp4 --lang hi # Hindi CC labels + +Requirements +------------ + pip install "setuptools<82" tensorflow tensorflow-hub librosa moviepy mediapipe opencv-python srt numpy +============================================================================= +""" + +import os +import sys +import csv +import math +import json +import argparse +import datetime +import warnings +import tempfile +import subprocess +from pathlib import Path +import urllib.request + +warnings.filterwarnings("ignore") +os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" # Suppress TF CUDA warnings +os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0" + +import numpy as np +import cv2 +import srt +import librosa +import tensorflow as tf +import tensorflow_hub as hub +import mediapipe as mp + + +# ============================================================================= +# CC LABEL DICTIONARIES +# Mapping YAMNet class names → human-readable CC bracket labels +# Extend these to support more events or languages. 
+# ============================================================================= + +CC_LABELS_EN = { + # Vehicles + "Vehicle horn, car horn, honking": "[HONKING]", + "Beep, bleep": "[BEEPING]", + "Car alarm": "[CAR ALARM]", + "Tire squeal": "[TIRES SCREECHING]", + "Helicopter": "[HELICOPTER]", + "Emergency vehicle": "[SIREN]", + "Siren": "[SIREN]", + # Alarms / alerts + "Fire alarm": "[FIRE ALARM]", + "Alarm": "[ALARM]", + "Bell": "[BELL]", + "Telephone bell ringing": "[PHONE RINGING]", + "Doorbell": "[DOORBELL]", + # Human non-speech sounds + "Laughter": "[LAUGHTER]", + "Crying, sobbing": "[CRYING]", + "Screaming": "[SCREAMING]", + "Applause": "[APPLAUSE]", + "Clapping": "[CLAPPING]", + "Cheering": "[CHEERING]", + "Crowd": "[CROWD NOISE]", + "Whispering": "[WHISPERING]", + "Cough": "[COUGHING]", + "Sneeze": "[SNEEZE]", + # Impacts / sudden events + "Gunshot, gunfire": "[GUNSHOT]", + "Explosion": "[EXPLOSION]", + "Glass breaking": "[GLASS BREAKING]", + "Glass": "[GLASS BREAKING]", + "Shatter": "[GLASS BREAKING]", + "Slam": "[DOOR SLAM]", + "Knock": "[KNOCKING]", + "Crash": "[CRASH]", + "Thud": "[THUD]", + # Other common AudioSet/YAMNet labels + "Police car (siren)": "[SIREN]", + "Vehicle": "[VEHICLE]", + "Liquid": "[LIQUID]", + # Nature + "Thunder": "[THUNDER]", + "Rain": "[RAIN]", + "Wind": "[WIND]", + # Music + "Musical instrument": "[MUSIC]", + "Drum": "[DRUMBEAT]", + "Guitar": "[GUITAR]", +} + +CC_LABELS_HI = { + # Hindi translations of the same labels + "[HONKING]": "[हॉर्न]", + "[BEEPING]": "[बीप]", + "[CAR ALARM]": "[कार अलार्म]", + "[TIRES SCREECHING]": "[टायर चीखना]", + "[HELICOPTER]": "[हेलीकॉप्टर]", + "[SIREN]": "[सायरन]", + "[FIRE ALARM]": "[आग अलार्म]", + "[ALARM]": "[अलार्म]", + "[BELL]": "[घंटी]", + "[PHONE RINGING]": "[फ़ोन बज रहा है]", + "[DOORBELL]": "[दरवाज़े की घंटी]", + "[LAUGHTER]": "[हँसी]", + "[CRYING]": "[रोना]", + "[SCREAMING]": "[चीखना]", + "[APPLAUSE]": "[तालियाँ]", + "[CLAPPING]": "[ताली बजाना]", + "[CHEERING]": "[जयकार]", + "[CROWD NOISE]": "[भीड़ का शोर]", + "[WHISPERING]": "[फुसफुसाना]", + "[COUGHING]": "[खाँसी]", + "[SNEEZE]": "[छींक]", + "[GUNSHOT]": "[गोलीबारी]", + "[EXPLOSION]": "[विस्फोट]", + "[GLASS BREAKING]": "[कांच टूटना]", + "[GLASS]": "[कांच टूटना]", + "[SHATTER]": "[टूटने की आवाज़]", + "[DOOR SLAM]": "[दरवाज़ा बंद]", + "[KNOCKING]": "[दस्तक]", + "[CRASH]": "[टक्कर]", + "[THUD]": "[धड़ाम]", + "[THUNDER]": "[गर्जना]", + "[RAIN]": "[बारिश]", + "[WIND]": "[हवा]", + "[MUSIC]": "[संगीत]", + "[DRUMBEAT]": "[ड्रम]", + "[GUITAR]": "[गिटार]", + "[LIQUID]": "[पानी]", + "[VEHICLE]": "[वाहन]", + "[POLICE CAR (SIREN)]":"[सायरन]", +} + +# YAMNet class names that we always exclude (speech, ambient, silence) +EXCLUDED_CLASSES = { + "Speech", "Male speech, man speaking", "Female speech, woman speaking", + "Child speech, kid speaking", "Silence", "Inside, small room", + "Inside, large room or hall", "Outside, urban or manmade", + "Outside, rural or natural", "Noise", "Environmental noise", + "White noise", "Pink noise", "Background noise", +} + +# High-impact CC labels that should not depend on visible reaction. +# These are bracket labels (post-mapping), not raw YAMNet class names. 
+HIGH_IMPACT_LABELS = { + "[GUNSHOT]", + "[EXPLOSION]", + "[SIREN]", + "[FIRE ALARM]", + "[ALARM]", + "[GLASS BREAKING]", + "[SCREAMING]", + "[CRASH]", +} + + +# ============================================================================= +# MODULE 1 — SOUND EVENT DETECTION +# Uses YAMNet (Google AudioSet classifier) to detect and classify non-speech +# audio events with confidence scores and timestamps. +# ============================================================================= + +class SoundEventDetector: + """ + Detects and classifies non-speech audio events in a video file. + + YAMNet processes audio in 0.96s windows with 0.48s hop, producing one + prediction vector (521 AudioSet classes) per frame. We filter by + confidence threshold and exclude speech/silence/ambient classes. + """ + + YAMNET_URL = "https://tfhub.dev/google/yamnet/1" + FRAME_HOP = 0.48 # YAMNet hop duration in seconds + + def __init__(self): + print("[Module 1] Loading YAMNet model from TensorFlow Hub...") + self.model = hub.load(self.YAMNET_URL) + self.class_names = self._load_class_names() + print(f"[Module 1] YAMNet loaded — {len(self.class_names)} AudioSet classes available.\n") + + def _load_class_names(self): + class_map_path = self.model.class_map_path().numpy().decode("utf-8") + names = [] + with tf.io.gfile.GFile(class_map_path) as f: + reader = csv.DictReader(f) + for row in reader: + names.append(row["display_name"]) + return names + + def _extract_audio(self, video_path: str) -> np.ndarray: + """ + Extracts mono 16 kHz audio from a video file using ffmpeg subprocess. + Falls back to moviepy if ffmpeg is not on PATH. + + Returns a float32 numpy array of waveform samples. + """ + tmp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False) + tmp_wav = tmp_file.name + tmp_file.close() + try: + # Prefer ffmpeg — much faster and no Python overhead + cmd = [ + "ffmpeg", "-y", "-i", video_path, + "-ac", "1", "-ar", "16000", + "-vn", tmp_wav, "-loglevel", "error" + ] + subprocess.run(cmd, check=True, capture_output=True, text=True) + wav, _ = librosa.load(tmp_wav, sr=16000, mono=True) + except FileNotFoundError as e: + # ffmpeg not available — use moviepy + print("[Module 1] ffmpeg not available, using moviepy fallback...") + try: + from moviepy.editor import VideoFileClip + except ModuleNotFoundError as ie: + raise RuntimeError( + "Audio extraction requires either 'ffmpeg' on PATH or the 'moviepy' Python package. " + "Install ffmpeg (recommended) or run: pip install moviepy" + ) from ie + + clip = VideoFileClip(video_path) + if clip.audio is None: + clip.close() + raise RuntimeError(f"No audio track found in video: {video_path}") + + clip.audio.write_audiofile( + tmp_wav, + fps=16000, + nbytes=2, + codec="pcm_s16le", + logger=None, + ) + wav, _ = librosa.load(tmp_wav, sr=16000, mono=True) + clip.close() + except subprocess.CalledProcessError as e: + stderr = (e.stderr or "").strip() + details = f"ffmpeg failed extracting audio from '{video_path}' (exit code {e.returncode})." + if stderr: + details += f"\nffmpeg stderr:\n{stderr}" + details += ( + "\n\nThis usually means the input video is invalid/corrupt or uses an unsupported codec. " + "For example, 'moov atom not found' typically indicates an incomplete MP4 file." + ) + raise RuntimeError(details) from e + finally: + if os.path.exists(tmp_wav): + os.remove(tmp_wav) + return wav.astype(np.float32) + + def detect_events(self, video_path: str, + confidence_threshold: float = 0.35) -> list[dict]: + """ + Runs the full sound event detection pipeline. 
+ + Parameters + ---------- + video_path : path to the input video + confidence_threshold : minimum YAMNet score to keep an event + + Returns + ------- + List of dicts: [{sound, label_en, confidence, start_time, end_time}] + """ + print(f"[Module 1] Analysing audio from: {video_path}") + wav = self._extract_audio(video_path) + scores, _, _ = self.model(wav) # shape: (n_frames, 521) + scores_np = scores.numpy() + + raw_events = [] + for frame_idx, frame_scores in enumerate(scores_np): + top_idx = int(np.argmax(frame_scores)) + top_score = float(frame_scores[top_idx]) + class_name = self.class_names[top_idx] + + if top_score < confidence_threshold: + continue + if class_name in EXCLUDED_CLASSES: + continue + + timestamp = frame_idx * self.FRAME_HOP + # Map to a bracket label; use the raw class name if not in dict + label_en = CC_LABELS_EN.get(class_name, f"[{class_name.upper()}]") + + raw_events.append({ + "sound": class_name, + "label_en": label_en, + "confidence": round(top_score, 3), + "start_time": round(timestamp, 3), + "end_time": round(timestamp + self.FRAME_HOP, 3), + }) + + consolidated = self._consolidate(raw_events) + print(f"[Module 1] Detected {len(consolidated)} non-speech audio events.\n") + return consolidated + + def _consolidate(self, events: list[dict], + gap_threshold: float = 1.0) -> list[dict]: + """ + Merges consecutive detections of the same sound class that are + within `gap_threshold` seconds of each other into a single event. + This prevents the same sound producing dozens of separate CC entries. + """ + if not events: + return [] + merged = [events[0].copy()] + for ev in events[1:]: + last = merged[-1] + same_sound = ev["sound"] == last["sound"] + close_enough = (ev["start_time"] - last["end_time"]) < gap_threshold + if same_sound and close_enough: + last["end_time"] = ev["end_time"] + last["confidence"] = max(last["confidence"], ev["confidence"]) + else: + merged.append(ev.copy()) + return merged + + +# ============================================================================= +# MODULE 2 — SPEAKER REACTION DETECTION +# Uses MediaPipe Face Mesh to measure head/face movement and mouth-openness +# changes around an audio event timestamp, producing a reaction confidence score. +# ============================================================================= + +class SpeakerReactionDetector: + """ + Determines whether a visible speaker reacts to a detected audio event + by analysing changes in facial landmark dynamics before and after the event. + + Reaction signals used: + 1. Head velocity spike — sudden rapid head movement after the event + 2. Sustained movement — elevated mean head velocity after the event + 3. Stillness (freeze) — speaker freezes momentarily (startle response) + 4. Mouth open — sudden mouth opening (gasp, exclamation) + + If no face is detected, falls back to a pixel-level frame-difference + heuristic to capture scene-level visual disruption. 
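+
+    Example
+    -------
+    Illustrative call ("sample.mp4" is a placeholder path)::
+
+        detector = SpeakerReactionDetector()
+        score = detector.analyze_reaction("sample.mp4", event_time=3.6)
+        # score is a float in [0.0, 1.0]; higher means a stronger visible reaction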
+ """ + + # MediaPipe facial landmark indices + NOSE_TIP = 1 + LEFT_EYE = 33 + RIGHT_EYE = 263 + UPPER_LIP = 13 + LOWER_LIP = 14 + + # MediaPipe Tasks face landmarker model (used when mp.solutions is unavailable) + FACE_LANDMARKER_TASK_URL = ( + "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task" + ) + + def __init__(self): + print("[Module 2] Initialising MediaPipe Face Mesh...") + self._backend = None + self.mp_face_mesh = None + self.face_mesh = None + self.face_landmarker = None + + # Legacy API (older MediaPipe): mp.solutions.face_mesh.FaceMesh + if hasattr(mp, "solutions") and hasattr(mp.solutions, "face_mesh"): + self._backend = "solutions" + self.mp_face_mesh = mp.solutions.face_mesh + self.face_mesh = self.mp_face_mesh.FaceMesh( + static_image_mode=False, + max_num_faces=1, + refine_landmarks=True, + min_detection_confidence=0.5, + min_tracking_confidence=0.5, + ) + else: + # Newer MediaPipe (0.10.35+ in some builds): Tasks API only + self._backend = "tasks" + # MediaPipe Tasks VIDEO mode requires monotonically increasing timestamps + # across *all* detect_for_video() calls for the lifetime of the landmarker. + # Our pipeline analyzes multiple overlapping windows per video, so we + # maintain an internal timestamp counter instead of using video-time. + self._tasks_timestamp_ms = 0 + from mediapipe.tasks.python.core.base_options import BaseOptions + from mediapipe.tasks.python.vision import face_landmarker + from mediapipe.tasks.python.vision.core.vision_task_running_mode import VisionTaskRunningMode + + model_path = self._ensure_face_landmarker_task_model() + options = face_landmarker.FaceLandmarkerOptions( + base_options=BaseOptions(model_asset_path=model_path), + running_mode=VisionTaskRunningMode.VIDEO, + num_faces=1, + min_face_detection_confidence=0.5, + min_face_presence_confidence=0.5, + min_tracking_confidence=0.5, + ) + self.face_landmarker = face_landmarker.FaceLandmarker.create_from_options(options) + print("[Module 2] MediaPipe ready.\n") + + def _ensure_face_landmarker_task_model(self) -> str: + cache_dir = Path.home() / ".cache" / "planetread" / "mediapipe" + cache_dir.mkdir(parents=True, exist_ok=True) + model_path = cache_dir / "face_landmarker.task" + if model_path.exists() and model_path.stat().st_size > 0: + return str(model_path) + + print("[Module 2] Downloading MediaPipe face_landmarker.task model...") + try: + urllib.request.urlretrieve(self.FACE_LANDMARKER_TASK_URL, model_path) + except Exception as e: + raise RuntimeError( + "Failed to download the MediaPipe face landmarker model. 
" + "Check your internet connection or manually download the model and place it at: " + f"{model_path}" + ) from e + + return str(model_path) + + def __del__(self): + # Best-effort cleanup for MediaPipe resources + try: + if self.face_mesh is not None: + self.face_mesh.close() + except Exception: + pass + try: + if self.face_landmarker is not None: + self.face_landmarker.close() + except Exception: + pass + + def _face_scale(self, lm) -> float: + """Inter-ocular distance — used to normalise head movement magnitude.""" + dx = lm[self.LEFT_EYE].x - lm[self.RIGHT_EYE].x + dy = lm[self.LEFT_EYE].y - lm[self.RIGHT_EYE].y + return max(math.hypot(dx, dy), 1e-6) + + def _mouth_openness(self, lm) -> float: + return abs(lm[self.UPPER_LIP].y - lm[self.LOWER_LIP].y) + + def _frame_diff(self, f1: np.ndarray, f2: np.ndarray) -> float: + """Mean absolute pixel difference between two greyscale frames.""" + g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY).astype(np.float32) + g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY).astype(np.float32) + return float(np.mean(np.abs(g1 - g2))) + + def analyze_reaction(self, video_path: str, + event_time: float, + window_before: float = 1.5, + window_after: float = 2.0) -> float: + """ + Analyses frames in [event_time - window_before, event_time + window_after] + and returns a reaction confidence score in [0.0, 1.0]. + + Parameters + ---------- + video_path : path to the video file + event_time : audio event timestamp (seconds) + window_before : seconds of baseline frames to analyse before event + window_after : seconds of reaction frames to analyse after event + + Returns + ------- + reaction_confidence : float in [0.0, 1.0] + """ + print(f"[Module 2] Checking visual reaction at t={event_time:.2f}s ...") + + cap = cv2.VideoCapture(video_path) + fps = cap.get(cv2.CAP_PROP_FPS) or 25.0 + dt = 1.0 / fps + + start_frame = int(max(0, event_time - window_before) * fps) + end_frame = int((event_time + window_after) * fps) + event_frame = int(event_time * fps) + + cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame) + + before_vel, after_vel = [], [] + before_mouth, after_mouth = [], [] + prev_nose = None + prev_vel = 0.0 + prev_frame = None + scene_diffs_before, scene_diffs_after = [], [] + face_detected_any = False + + cur = start_frame + tasks_step_ms = max(1, int(round(dt * 1000))) + while cur <= end_frame: + ok, frame = cap.read() + if not ok: + break + + rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + landmarks = None + if self._backend == "solutions": + results = self.face_mesh.process(rgb) + if results.multi_face_landmarks: + landmarks = results.multi_face_landmarks[0].landmark + else: + # Tasks API expects a MediaPipe Image + monotonically increasing timestamp (ms) + mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb) + timestamp_ms = self._tasks_timestamp_ms + self._tasks_timestamp_ms += tasks_step_ms + + results = self.face_landmarker.detect_for_video(mp_image, timestamp_ms) + if results.face_landmarks: + landmarks = results.face_landmarks[0] + + if landmarks: + face_detected_any = True + lm = landmarks + scale = self._face_scale(lm) + nose = (lm[self.NOSE_TIP].x, lm[self.NOSE_TIP].y) + mouth = self._mouth_openness(lm) + + if prev_nose is not None: + raw_vel = math.hypot(nose[0] - prev_nose[0], + nose[1] - prev_nose[1]) / (dt * scale) + # Exponential smoothing — reduces noise from jitter + vel = 0.6 * prev_vel + 0.4 * raw_vel + prev_vel = vel + if cur < event_frame: + before_vel.append(vel) + before_mouth.append(mouth) + else: + after_vel.append(vel) + 
after_mouth.append(mouth) + prev_nose = nose + else: + # No face — accumulate scene-level pixel diffs as fallback + if prev_frame is not None: + diff = self._frame_diff(prev_frame, frame) + if cur < event_frame: + scene_diffs_before.append(diff) + else: + scene_diffs_after.append(diff) + prev_nose, prev_vel = None, 0.0 + + prev_frame = frame.copy() + cur += 1 + + cap.release() + + # --- Score computation --- + score = 0.0 + + if face_detected_any and before_vel and after_vel: + mu_b = np.mean(before_vel) + std_b = np.std(before_vel) + + # Signal 1: velocity spike (>2σ above baseline for >2 frames) + spike_frames = np.sum(np.array(after_vel) > mu_b + 2 * std_b) + if spike_frames >= 2: + score += 0.40 + + # Signal 2: sustained elevated movement + if np.mean(after_vel) > 1.5 * max(mu_b, 1e-6): + score += 0.25 + + # Signal 3: freeze response (sudden stillness) + if np.mean(after_vel) < 0.5 * max(mu_b, 1e-6) and mu_b > 0.01: + score += 0.15 + + # Signal 4: mouth opens (gasp / exclamation) + if before_mouth and after_mouth: + if np.mean(after_mouth) > 1.6 * max(np.mean(before_mouth), 1e-6): + score += 0.20 + + elif scene_diffs_before and scene_diffs_after: + # Fallback: scene-level visual disruption + mean_before_diff = np.mean(scene_diffs_before) + mean_after_diff = np.mean(scene_diffs_after) + if mean_after_diff > 2.0 * max(mean_before_diff, 1e-6): + score += 0.50 + elif mean_after_diff > 1.3 * max(mean_before_diff, 1e-6): + score += 0.25 + + score = round(min(1.0, score), 3) + print(f"[Module 2] Reaction confidence: {score:.2f}\n") + return score + + +# ============================================================================= +# MODULE 3 — CC DECISION ENGINE + SRT OUTPUT +# Combines audio event confidence and visual reaction score to decide +# whether to generate a CC annotation, then writes the SRT/SLS file. +# ============================================================================= + +class CCDecisionEngine: + """ + Fusion layer that combines Module 1 (audio) and Module 2 (visual) signals + and generates a standard SRT (or plain-text SLS) file. + + Decision formula + ---------------- + combined = audio_weight * audio_conf + visual_weight * visual_conf + + We weight visual higher (0.55 vs 0.45) because a visible speaker reaction + is a stronger signal of narrative significance than audio confidence alone. + If combined >= fusion_threshold → CC is generated. + + The decision logic is intentionally simple and interpretable so that + PlanetRead editors can easily understand and tune thresholds. + """ + + AUDIO_WEIGHT = 0.45 + VISUAL_WEIGHT = 0.55 + + def __init__(self, + fusion_threshold: float = 0.42, + audio_only_threshold: float | None = 0.75, + lang: str = "en"): + """ + Parameters + ---------- + fusion_threshold : combined score threshold above which a CC is generated + lang : 'en' for English labels, 'hi' for Hindi labels + """ + self.fusion_threshold = fusion_threshold + self.audio_only_threshold = audio_only_threshold + self.lang = lang + + def _get_cc_text(self, label_en: str) -> str: + if self.lang == "hi": + return CC_LABELS_HI.get(label_en, label_en) + return label_en + + def _to_timedelta(self, seconds: float) -> datetime.timedelta: + return datetime.timedelta(seconds=seconds) + + def decide_and_generate(self, + audio_events: list[dict], + visual_scores: list[float], + video_path: str, + output_path: str, + min_duration: float = 1.0) -> list[dict]: + """ + Runs the decision engine and writes the SRT file. 
+ + Parameters + ---------- + audio_events : output of Module 1 + visual_scores : output of Module 2 (one score per audio event) + video_path : original video path (used for metadata only) + output_path : path to write the .srt file + min_duration : minimum CC subtitle duration in seconds + + Returns + ------- + List of accepted CC annotations (dicts). + """ + print("[Module 3] Running CC Decision Engine...") + + accepted = [] + rejected = [] + subtitles = [] + + for idx, (event, vis_score) in enumerate(zip(audio_events, visual_scores)): + audio_conf = event["confidence"] + combined = (self.AUDIO_WEIGHT * audio_conf + + self.VISUAL_WEIGHT * vis_score) + combined = round(combined, 3) + + # --- High-impact bypass/boost --- + # If a critical sound happens off-camera (no visual reaction), we still want + # to consider it for CC based on audio strength alone. + label_en = event.get("label_en", "") + is_high_impact = label_en in HIGH_IMPACT_LABELS + high_impact_boost_applied = False + combined_pre_boost = combined + if is_high_impact and vis_score <= 0.0 and audio_conf >= 0.55: + boosted_floor = round(audio_conf * 0.85, 3) + if boosted_floor > combined: + combined = boosted_floor + high_impact_boost_applied = True + + approved_by_fusion = combined >= self.fusion_threshold + approved_by_audio_only = ( + self.audio_only_threshold is not None and + audio_conf >= self.audio_only_threshold + ) + approved = approved_by_fusion or approved_by_audio_only + + decision_info = { + "sound": event["sound"], + "label_en": event["label_en"], + "start_time": event["start_time"], + "label_out": self._get_cc_text(event["label_en"]), + "end_time": event["end_time"], + "audio_conf": audio_conf, + "visual_conf": vis_score, + "combined": combined, + "combined_pre_boost": combined_pre_boost, + "high_impact": is_high_impact, + "high_impact_boost_applied": high_impact_boost_applied, + "decision": "APPROVED" if approved else "REJECTED", + "decision_basis": ( + "AUDIO_ONLY" if approved_by_audio_only else + ("HIGH_IMPACT" if (approved_by_fusion and high_impact_boost_applied) else + ("FUSION" if approved_by_fusion else "NONE")) + ), + } + + if approved: + cc_text = self._get_cc_text(event["label_en"]) + # Ensure subtitle has at least min_duration on screen + end_t = max(event["end_time"], event["start_time"] + min_duration) + sub = srt.Subtitle( + index=len(subtitles) + 1, + start=self._to_timedelta(event["start_time"]), + end=self._to_timedelta(end_t), + content=cc_text, + ) + subtitles.append(sub) + accepted.append(decision_info) + print(f" ✅ APPROVED | {event['sound'][:35]:<35} " + f"| t={event['start_time']:.2f}s " + f"| audio={audio_conf:.2f} vis={vis_score:.2f} " + f"→ combined={combined:.2f} " + f"({decision_info['decision_basis']}) " + f"→ {cc_text}") + else: + rejected.append(decision_info) + print(f" ❌ REJECTED | {event['sound'][:35]:<35} " + f"| t={event['start_time']:.2f}s " + f"| audio={audio_conf:.2f} vis={vis_score:.2f} " + f"→ combined={combined:.2f} (below threshold {self.fusion_threshold})") + + # Write SRT file + srt_content = srt.compose(subtitles) + Path(output_path).write_text(srt_content, encoding="utf-8") + + print(f"\n[Module 3] Complete.") + print(f" Approved : {len(accepted)}") + print(f" Rejected : {len(rejected)}") + print(f" SRT file : {output_path}\n") + + return accepted + + +# ============================================================================= +# PIPELINE ORCHESTRATOR +# Ties all three modules together into a single callable function. 
+# ============================================================================= + +def run_pipeline(video_path: str, + output_path: str = None, + audio_threshold: float = 0.35, + fusion_threshold: float = 0.42, + audio_only_threshold: float | None = 0.75, + lang: str = "en", + save_json: bool = True) -> dict: + """ + Runs the full end-to-end Intelligent CC Suggestion pipeline. + + Parameters + ---------- + video_path : path to input video file + output_path : path to write .srt (auto-generated if None) + audio_threshold : YAMNet confidence threshold for Module 1 + fusion_threshold : combined score threshold for Module 3 + lang : 'en' or 'hi' + save_json : if True, also saves a JSON report alongside the SRT + + Returns + ------- + Dictionary with keys: audio_events, visual_scores, accepted_cc, srt_path + """ + if not os.path.exists(video_path): + raise FileNotFoundError(f"Video file not found: {video_path}") + + stem = Path(video_path).stem + if output_path is None: + output_path = f"{stem}_cc_suggestions.srt" + + print("=" * 65) + print(" INTELLIGENT CC SUGGESTION TOOL — PlanetRead DMP 2026") + print("=" * 65) + print(f" Input video : {video_path}") + print(f" Output SRT : {output_path}") + print(f" Language : {'Hindi' if lang == 'hi' else 'English'}") + print(f" Thresholds : audio={audio_threshold}, fusion={fusion_threshold}") + print("=" * 65 + "\n") + + # --- Module 1 --- + sed = SoundEventDetector() + events = sed.detect_events(video_path, confidence_threshold=audio_threshold) + + if not events: + print("No significant non-speech audio events detected. No SRT generated.") + return {"audio_events": [], "visual_scores": [], + "accepted_cc": [], "srt_path": None} + + # --- Module 2 --- + vrd = SpeakerReactionDetector() + scores = [] + for ev in events: + mid_time = (ev["start_time"] + ev["end_time"]) / 2.0 + score = vrd.analyze_reaction(video_path, event_time=mid_time) + scores.append(score) + + # --- Module 3 --- + engine = CCDecisionEngine( + fusion_threshold=fusion_threshold, + audio_only_threshold=audio_only_threshold, + lang=lang, + ) + accepted = engine.decide_and_generate( + audio_events=events, + visual_scores=scores, + video_path=video_path, + output_path=output_path, + ) + + # Optional JSON report + json_path = None + if save_json: + json_path = output_path.replace(".srt", "_report.json") + + # Add language-specific label alongside label_en for easier consumption + events_with_labels = [ + { + **ev, + "label_out": engine._get_cc_text(ev.get("label_en", "")), + } + for ev in events + ] + report = { + "video": video_path, + "srt_output": output_path, + "lang": lang, + "audio_threshold": audio_threshold, + "fusion_threshold":fusion_threshold, + "audio_only_threshold": audio_only_threshold, + "total_events": len(events), + "approved_cc": len(accepted), + "audio_events": events_with_labels, + "visual_scores": scores, + "accepted_cc": accepted, + } + Path(json_path).write_text( + json.dumps(report, indent=2, ensure_ascii=False), + encoding="utf-8", + ) + print(f" JSON report: {json_path}") + + # --- Final summary --- + print("\n" + "=" * 65) + print(" PIPELINE SUMMARY") + print("=" * 65) + print(f" Non-speech events detected : {len(events)}") + print(f" CCs approved : {len(accepted)}") + print(f" CCs rejected : {len(events) - len(accepted)}") + print(f" Output SRT : {output_path}") + if json_path: + print(f" JSON report : {json_path}") + print("=" * 65 + "\n") + + if accepted: + print(" Generated CC annotations:") + print(" " + "-" * 55) + for cc in accepted: + print(f" 
{cc['start_time']:>7.2f}s {cc.get('label_out', cc['label_en'])}") + print() + + return { + "audio_events": events, + "visual_scores": scores, + "accepted_cc": accepted, + "srt_path": output_path, + "json_path": json_path, + } + + +# ============================================================================= +# CLI ENTRY POINT +# ============================================================================= + +def parse_args(): + p = argparse.ArgumentParser( + description="Intelligent CC Suggestion Tool — PlanetRead DMP 2026", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + python intelligent_cc_pipeline.py --video input.mp4 + python intelligent_cc_pipeline.py --video input.mp4 --lang hi + python intelligent_cc_pipeline.py --video input.mp4 --audio-thresh 0.4 --fusion-thresh 0.55 + python intelligent_cc_pipeline.py --video input.mp4 --output my_captions.srt + """ + ) + p.add_argument("--video", required=True, help="Path to input video file") + p.add_argument("--output", default=None, help="Output SRT file path (auto-named if omitted)") + p.add_argument("--audio-thresh", type=float, default=0.35, + help="YAMNet confidence threshold (default: 0.35)") + p.add_argument("--fusion-thresh", type=float, default=0.42, + help="Combined audio+visual threshold to approve CC (default: 0.42)") + p.add_argument( + "--audio-only-thresh", + type=float, + default=0.75, + help=( + "Approve events purely on audio confidence if >= this value (default: 0.75). " + "Set to a negative value to disable audio-only approvals." + ), + ) + p.add_argument("--lang", choices=["en", "hi"], default="en", + help="CC label language: 'en' (English) or 'hi' (Hindi)") + p.add_argument("--no-json", action="store_true", + help="Skip saving the JSON report") + return p.parse_args() + + +if __name__ == "__main__": + args = parse_args() + audio_only_threshold = None if args.audio_only_thresh < 0 else args.audio_only_thresh + run_pipeline( + video_path=args.video, + output_path=args.output, + audio_threshold=args.audio_thresh, + fusion_threshold=args.fusion_thresh, + audio_only_threshold=audio_only_threshold, + lang=args.lang, + save_json=not args.no_json, + ) \ No newline at end of file diff --git a/reaction_detector.py b/reaction_detector.py new file mode 100644 index 0000000..6b6b8bb --- /dev/null +++ b/reaction_detector.py @@ -0,0 +1,337 @@ +"""Module 2 demo — Speaker/Scene Reaction Detection. + +This script demonstrates the *visual reaction* module of the Intelligent CC pipeline. +Given a video and one or more event timestamps (seconds), it extracts frames around +each event and returns a reaction confidence score in [0, 1]. + +What counts as a "reaction"? +- Sudden head movement (landmark motion) after the event +- Sustained movement elevation after the event +- Freeze response (drop in movement) +- Sudden mouth opening (gasp) + +If no face is detected, it falls back to a simple scene-level frame-difference +heuristic as a proxy for visual disruption. + +Notes +----- +- Supports both MediaPipe backends: + - mp.solutions.face_mesh.FaceMesh (legacy) + - mediapipe.tasks FaceLandmarker VIDEO mode (newer builds) +- MediaPipe Tasks VIDEO mode requires monotonically increasing timestamps across + all detect_for_video() calls; we maintain an internal counter to satisfy that. 
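+
+Usage
+-----
+    python reaction_detector.py --video sample.mp4 --times 0.96,3.6,39.12
+    python reaction_detector.py --video sample.mp4 --from-report sample_cc_suggestions_report.json
+
+(`sample.mp4` and the report path above are placeholders; event times are in seconds.)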
+""" + +from __future__ import annotations + +import argparse +import json +import math +import urllib.request +from pathlib import Path + +import cv2 +import mediapipe as mp +import numpy as np + + +class SpeakerReactionDetector: + """Visual reaction detector (Module 2).""" + + # MediaPipe facial landmark indices + NOSE_TIP = 1 + LEFT_EYE = 33 + RIGHT_EYE = 263 + UPPER_LIP = 13 + LOWER_LIP = 14 + + FACE_LANDMARKER_TASK_URL = ( + "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task" + ) + + def __init__(self): + self._backend = None + self.face_mesh = None + self.face_landmarker = None + self._tasks_timestamp_ms = 0 + + if hasattr(mp, "solutions") and hasattr(mp.solutions, "face_mesh"): + self._backend = "solutions" + self.face_mesh = mp.solutions.face_mesh.FaceMesh( + static_image_mode=False, + max_num_faces=1, + refine_landmarks=True, + min_detection_confidence=0.5, + min_tracking_confidence=0.5, + ) + else: + self._backend = "tasks" + from mediapipe.tasks.python.core.base_options import BaseOptions + from mediapipe.tasks.python.vision import face_landmarker + from mediapipe.tasks.python.vision.core.vision_task_running_mode import ( + VisionTaskRunningMode, + ) + + model_path = self._ensure_face_landmarker_task_model() + options = face_landmarker.FaceLandmarkerOptions( + base_options=BaseOptions(model_asset_path=model_path), + running_mode=VisionTaskRunningMode.VIDEO, + num_faces=1, + min_face_detection_confidence=0.5, + min_face_presence_confidence=0.5, + min_tracking_confidence=0.5, + ) + self.face_landmarker = face_landmarker.FaceLandmarker.create_from_options(options) + + def __del__(self): + try: + if self.face_mesh is not None: + self.face_mesh.close() + except Exception: + pass + try: + if self.face_landmarker is not None: + self.face_landmarker.close() + except Exception: + pass + + def _ensure_face_landmarker_task_model(self) -> str: + cache_dir = Path.home() / ".cache" / "planetread" / "mediapipe" + cache_dir.mkdir(parents=True, exist_ok=True) + model_path = cache_dir / "face_landmarker.task" + if model_path.exists() and model_path.stat().st_size > 0: + return str(model_path) + + urllib.request.urlretrieve(self.FACE_LANDMARKER_TASK_URL, model_path) + return str(model_path) + + def _face_scale(self, lm) -> float: + dx = lm[self.LEFT_EYE].x - lm[self.RIGHT_EYE].x + dy = lm[self.LEFT_EYE].y - lm[self.RIGHT_EYE].y + return max(math.hypot(dx, dy), 1e-6) + + def _mouth_openness(self, lm) -> float: + return abs(lm[self.UPPER_LIP].y - lm[self.LOWER_LIP].y) + + def _frame_diff(self, f1: np.ndarray, f2: np.ndarray) -> float: + g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY).astype(np.float32) + g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY).astype(np.float32) + return float(np.mean(np.abs(g1 - g2))) + + def analyze_reaction( + self, + video_path: str, + event_time: float, + window_before: float = 1.5, + window_after: float = 2.0, + ) -> dict: + """Return reaction confidence + diagnostics for one event.""" + cap = cv2.VideoCapture(video_path) + if not cap.isOpened(): + raise RuntimeError(f"Failed to open video: {video_path}") + + fps = cap.get(cv2.CAP_PROP_FPS) or 25.0 + dt = 1.0 / fps + + start_frame = int(max(0.0, event_time - window_before) * fps) + end_frame = int((event_time + window_after) * fps) + event_frame = int(event_time * fps) + + cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame) + + before_vel: list[float] = [] + after_vel: list[float] = [] + before_mouth: list[float] = [] + after_mouth: list[float] = [] + + prev_nose = None + 
prev_vel = 0.0 + prev_frame = None + scene_diffs_before: list[float] = [] + scene_diffs_after: list[float] = [] + face_detected_frames = 0 + total_frames = 0 + + tasks_step_ms = max(1, int(round(dt * 1000))) + + cur = start_frame + while cur <= end_frame: + ok, frame = cap.read() + if not ok: + break + total_frames += 1 + + rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + landmarks = None + + if self._backend == "solutions": + results = self.face_mesh.process(rgb) + if results.multi_face_landmarks: + landmarks = results.multi_face_landmarks[0].landmark + else: + mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb) + timestamp_ms = self._tasks_timestamp_ms + self._tasks_timestamp_ms += tasks_step_ms + results = self.face_landmarker.detect_for_video(mp_image, timestamp_ms) + if results.face_landmarks: + landmarks = results.face_landmarks[0] + + if landmarks: + face_detected_frames += 1 + lm = landmarks + scale = self._face_scale(lm) + nose = (lm[self.NOSE_TIP].x, lm[self.NOSE_TIP].y) + mouth = self._mouth_openness(lm) + + if prev_nose is not None: + raw_vel = math.hypot(nose[0] - prev_nose[0], nose[1] - prev_nose[1]) / ( + dt * scale + ) + vel = 0.6 * prev_vel + 0.4 * raw_vel + prev_vel = vel + if cur < event_frame: + before_vel.append(vel) + before_mouth.append(mouth) + else: + after_vel.append(vel) + after_mouth.append(mouth) + prev_nose = nose + else: + if prev_frame is not None: + diff = self._frame_diff(prev_frame, frame) + if cur < event_frame: + scene_diffs_before.append(diff) + else: + scene_diffs_after.append(diff) + prev_nose, prev_vel = None, 0.0 + + prev_frame = frame.copy() + cur += 1 + + cap.release() + + score = 0.0 + basis = "NONE" + + if face_detected_frames > 0 and before_vel and after_vel: + mu_b = float(np.mean(before_vel)) + std_b = float(np.std(before_vel)) + + spike_frames = int(np.sum(np.array(after_vel) > mu_b + 2 * std_b)) + if spike_frames >= 2: + score += 0.40 + + if float(np.mean(after_vel)) > 1.5 * max(mu_b, 1e-6): + score += 0.25 + + if float(np.mean(after_vel)) < 0.5 * max(mu_b, 1e-6) and mu_b > 0.01: + score += 0.15 + + if before_mouth and after_mouth: + if float(np.mean(after_mouth)) > 1.6 * max(float(np.mean(before_mouth)), 1e-6): + score += 0.20 + basis = "FACE" + + elif scene_diffs_before and scene_diffs_after: + mean_before_diff = float(np.mean(scene_diffs_before)) + mean_after_diff = float(np.mean(scene_diffs_after)) + if mean_after_diff > 2.0 * max(mean_before_diff, 1e-6): + score += 0.50 + elif mean_after_diff > 1.3 * max(mean_before_diff, 1e-6): + score += 0.25 + basis = "SCENE_DIFF" + + score = round(min(1.0, score), 3) + return { + "event_time": round(float(event_time), 3), + "reaction_confidence": score, + "basis": basis, + "backend": self._backend, + "face_detected_frames": face_detected_frames, + "total_frames": total_frames, + } + + +def _parse_event_times(times: str | None) -> list[float]: + if not times: + return [] + out: list[float] = [] + for part in times.split(","): + part = part.strip() + if not part: + continue + out.append(float(part)) + return out + + +def _event_times_from_report(report_path: str) -> list[float]: + data = json.loads(Path(report_path).read_text(encoding="utf-8")) + events = data.get("audio_events") or [] + times: list[float] = [] + for ev in events: + start = float(ev.get("start_time", 0.0)) + end = float(ev.get("end_time", start)) + times.append((start + end) / 2.0) + return times + + +def main() -> int: + p = argparse.ArgumentParser(description="Module 2 demo — visual reaction detection") + 
p.add_argument("--video", required=True, help="Path to input video") + p.add_argument( + "--times", + default=None, + help="Comma-separated event times in seconds (e.g. '0.96,3.6,39.12')", + ) + p.add_argument( + "--from-report", + default=None, + help="Optional JSON report with audio_events (uses midpoints as event times)", + ) + p.add_argument("--out", default=None, help="Output JSON path") + p.add_argument("--window-before", type=float, default=1.5) + p.add_argument("--window-after", type=float, default=2.0) + args = p.parse_args() + + times = _parse_event_times(args.times) + if args.from_report: + times = _event_times_from_report(args.from_report) + if not times: + raise SystemExit("No event times provided. Use --times or --from-report") + + det = SpeakerReactionDetector() + results = [ + det.analyze_reaction( + args.video, + event_time=t, + window_before=args.window_before, + window_after=args.window_after, + ) + for t in times + ] + + payload = { + "video": args.video, + "num_events": len(results), + "window_before": args.window_before, + "window_after": args.window_after, + "results": results, + } + + out_path = args.out + if out_path is None: + out_path = str(Path(args.video).with_suffix("")) + "_module2_reaction_report.json" + Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8") + + print("Event time | reaction | basis | face_frames/total") + print("-" * 55) + for r in results: + print( + f"{r['event_time']:>8.2f}s | {r['reaction_confidence']:<8.2f} | {r['basis']:<9} | {r['face_detected_frames']}/{r['total_frames']}" + ) + print(f"\nWrote: {out_path}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) \ No newline at end of file diff --git a/spider.mp4 b/spider.mp4 new file mode 100644 index 0000000..799e246 Binary files /dev/null and b/spider.mp4 differ diff --git a/spider_cc_suggestions.srt b/spider_cc_suggestions.srt new file mode 100644 index 0000000..033185f --- /dev/null +++ b/spider_cc_suggestions.srt @@ -0,0 +1,20 @@ +1 +00:00:00,000 --> 00:00:01,920 +[हेलीकॉप्टर] + +2 +00:00:38,880 --> 00:00:39,880 +[कांच टूटना] + +3 +00:01:30,720 --> 00:01:31,720 +[संगीत] + +4 +00:01:37,440 --> 00:01:38,440 +[संगीत] + +5 +00:01:39,360 --> 00:01:40,800 +[संगीत] + diff --git a/spider_cc_suggestions_report.json b/spider_cc_suggestions_report.json new file mode 100644 index 0000000..e35754a --- /dev/null +++ b/spider_cc_suggestions_report.json @@ -0,0 +1,163 @@ +{ + "video": "spider.mp4", + "srt_output": "spider_cc_suggestions.srt", + "lang": "hi", + "audio_threshold": 0.35, + "fusion_threshold": 0.42, + "audio_only_threshold": 0.75, + "total_events": 8, + "approved_cc": 5, + "audio_events": [ + { + "sound": "Helicopter", + "label_en": "[HELICOPTER]", + "confidence": 0.782, + "start_time": 0.0, + "end_time": 1.92, + "label_out": "[हेलीकॉप्टर]" + }, + { + "sound": "Helicopter", + "label_en": "[HELICOPTER]", + "confidence": 0.535, + "start_time": 3.36, + "end_time": 3.84, + "label_out": "[हेलीकॉप्टर]" + }, + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "confidence": 0.766, + "start_time": 38.88, + "end_time": 39.36, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Whispering", + "label_en": "[WHISPERING]", + "confidence": 0.408, + "start_time": 63.36, + "end_time": 63.84, + "label_out": "[फुसफुसाना]" + }, + { + "sound": "Animal", + "label_en": "[ANIMAL]", + "confidence": 0.386, + "start_time": 70.56, + "end_time": 71.04, + "label_out": "[ANIMAL]" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "confidence": 0.602, + "start_time": 
90.72, + "end_time": 91.2, + "label_out": "[संगीत]" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "confidence": 0.798, + "start_time": 97.44, + "end_time": 97.92, + "label_out": "[संगीत]" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "confidence": 0.978, + "start_time": 99.36, + "end_time": 100.8, + "label_out": "[संगीत]" + } + ], + "visual_scores": [ + 0.0, + 0.0, + 0.4, + 0.4, + 0.4, + 0.65, + 0.2, + 0.15 + ], + "accepted_cc": [ + { + "sound": "Helicopter", + "label_en": "[HELICOPTER]", + "start_time": 0.0, + "label_out": "[हेलीकॉप्टर]", + "end_time": 1.92, + "audio_conf": 0.782, + "visual_conf": 0.0, + "combined": 0.352, + "combined_pre_boost": 0.352, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + }, + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "start_time": 38.88, + "label_out": "[कांच टूटना]", + "end_time": 39.36, + "audio_conf": 0.766, + "visual_conf": 0.4, + "combined": 0.565, + "combined_pre_boost": 0.565, + "high_impact": true, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "start_time": 90.72, + "label_out": "[संगीत]", + "end_time": 91.2, + "audio_conf": 0.602, + "visual_conf": 0.65, + "combined": 0.628, + "combined_pre_boost": 0.628, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "start_time": 97.44, + "label_out": "[संगीत]", + "end_time": 97.92, + "audio_conf": 0.798, + "visual_conf": 0.2, + "combined": 0.469, + "combined_pre_boost": 0.469, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "start_time": 99.36, + "label_out": "[संगीत]", + "end_time": 100.8, + "audio_conf": 0.978, + "visual_conf": 0.15, + "combined": 0.523, + "combined_pre_boost": 0.523, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + } + ] +} \ No newline at end of file diff --git a/spider_module2_reaction_report.json b/spider_module2_reaction_report.json new file mode 100644 index 0000000..94c3426 --- /dev/null +++ b/spider_module2_reaction_report.json @@ -0,0 +1,72 @@ +{ + "video": "spider.mp4", + "num_events": 8, + "window_before": 1.5, + "window_after": 2.0, + "results": [ + { + "event_time": 0.96, + "reaction_confidence": 0.0, + "basis": "SCENE_DIFF", + "backend": "tasks", + "face_detected_frames": 1, + "total_frames": 71 + }, + { + "event_time": 3.6, + "reaction_confidence": 0.0, + "basis": "SCENE_DIFF", + "backend": "tasks", + "face_detected_frames": 0, + "total_frames": 85 + }, + { + "event_time": 39.12, + "reaction_confidence": 0.4, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 39, + "total_frames": 85 + }, + { + "event_time": 63.6, + "reaction_confidence": 0.4, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 84, + "total_frames": 85 + }, + { + "event_time": 70.8, + "reaction_confidence": 0.4, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 84, + "total_frames": 85 + }, + { + "event_time": 90.96, + "reaction_confidence": 0.65, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 81, + "total_frames": 85 + }, + { + "event_time": 97.68, + "reaction_confidence": 0.2, + "basis": "FACE", + "backend": 
"tasks", + "face_detected_frames": 82, + "total_frames": 84 + }, + { + "event_time": 100.08, + "reaction_confidence": 0.15, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 55, + "total_frames": 57 + } + ] +} \ No newline at end of file