diff --git a/README.md b/README.md new file mode 100644 index 0000000..21af4fa --- /dev/null +++ b/README.md @@ -0,0 +1,204 @@ +# Intelligent CC Suggestion Tool — DMP 2026 Demo + +**Contributor:** Naitik Gupta +**Organisation:** PlanetRead +**Issue:** [#2 — Intelligent CC Generation](https://github.com/PlanetRead/Intelligent-cc-generation/issues/2) +**Mentors:** @abinash-sketch, @keerthiseelan-planetread + +--- + +## What This Demo Covers + +This is a **complete end-to-end working demo** of all three pipeline goals described in the ticket: + +| Module | Goal | Status | +|--------|------|--------| +| Module 1 | Sound Event Detection (YAMNet + confidence scores + timestamps) | ✅ Complete | +| Module 2 | Speaker Reaction Detection (MediaPipe Face Mesh + OpenCV) | ✅ Complete | +| Module 3 | CC Decision Engine + SRT/SLS Output | ✅ Complete | + +--- + +## How It Works + +``` +Video File + │ + ├─► Module 1: SoundEventDetector + │ YAMNet classifies non-speech audio events + │ Output: [{sound, confidence, start_time, end_time}] + │ + ├─► Module 2: SpeakerReactionDetector + │ MediaPipe Face Mesh tracks head velocity + mouth openness + │ around each audio event timestamp + │ Output: [reaction_confidence_score per event] + │ + └─► Module 3: CCDecisionEngine + Combined score = 0.45 × audio_conf + 0.55 × visual_conf + If combined ≥ threshold → CC approved → written to SRT + Output: .srt file + .json report +``` + +### Decision Formula + +``` +combined = 0.45 × audio_confidence + 0.55 × visual_confidence +``` + +Visual reaction is weighted slightly higher because a visible speaker reaction +is a stronger signal of narrative significance than audio confidence alone. +This prevents over-captioning ambient sounds the speaker ignores. + +### Visual Reaction Signals (Module 2) + +Four signals are scored independently and summed: + +| Signal | Score | Condition | +|--------|-------|-----------| +| Velocity spike | +0.40 | Head moves >2σ above baseline for ≥2 frames | +| Sustained movement | +0.25 | Mean post-event velocity > 1.5× baseline | +| Freeze response | +0.15 | Sudden stillness after event (startle) | +| Mouth opens | +0.20 | Mouth openness increases >1.6× (gasp) | +| Scene diff fallback | +0.25–0.50 | Used when no face is detected | + +--- + +## Installation + +```bash +pip install tensorflow tensorflow-hub librosa moviepy mediapipe opencv-python srt numpy +``` + +**Tested on:** Python 3.10+, TensorFlow 2.19, CPU-only machine +**Note:** YAMNet is downloaded automatically on first run from TensorFlow Hub (~25MB). 
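+
+As a quick, optional sanity check after installing, YAMNet can be loaded directly to
+confirm TensorFlow Hub access works. This is an illustrative snippet only; it mirrors
+what Module 1 does internally when it starts up:
+
+```python
+import csv
+import tensorflow as tf
+import tensorflow_hub as hub
+
+# Downloads ~25MB on first run, then served from the local TF Hub cache.
+model = hub.load("https://tfhub.dev/google/yamnet/1")
+class_map = model.class_map_path().numpy().decode("utf-8")
+with tf.io.gfile.GFile(class_map) as f:
+    n_classes = sum(1 for _ in csv.DictReader(f))
+print(f"YAMNet ready with {n_classes} AudioSet classes")  # expected: 521
+```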
+ +--- + +## Usage + +### Basic (English CC labels) +```bash +python intelligent_cc_pipeline.py --video sample.mp4 +``` + +### Hindi CC labels +```bash +python intelligent_cc_pipeline.py --video sample.mp4 --lang hi +``` + +### Custom thresholds +```bash +python intelligent_cc_pipeline.py --video sample.mp4 \ + --audio-thresh 0.4 \ + --fusion-thresh 0.55 +``` + +### All options +``` +--video Path to input video (required) +--output Output .srt path (auto-named if omitted) +--audio-thresh YAMNet confidence threshold (default: 0.35) +--fusion-thresh Combined score threshold to approve CC (default: 0.50) +--lang 'en' or 'hi' (default: 'en') +--no-json Skip saving the JSON report +``` + +--- + +## Output Files + +**`_cc_suggestions.srt`** — Standard SRT subtitle file +``` +1 +00:00:03,200 --> 00:00:04,680 +[GLASS BREAKING] + +2 +00:00:11,040 --> 00:00:12,520 +[APPLAUSE] +``` + +**`_cc_suggestions_report.json`** — Full pipeline report +```json +{ + "total_events": 8, + "approved_cc": 3, + "audio_events": [...], + "visual_scores": [...], + "accepted_cc": [...] +} +``` + +--- + +## Design Decisions + +### Why YAMNet? +YAMNet is pretrained on Google's AudioSet (2M+ clips, 521 classes) and runs +efficiently on CPU. It requires no fine-tuning for common sound events and +handles the wide range of events relevant to PlanetRead content (applause, +laughter, alarms, music, impacts). PANNs was evaluated as an alternative — +YAMNet was chosen for its lightweight inference and TensorFlow Hub availability. + +### Why MediaPipe Face Mesh? +MediaPipe runs in real-time on CPU, provides 468 landmark points per face, +and is well-suited for the edge/server environments PlanetRead works with. +The 4-signal scoring approach (velocity spike, sustained movement, freeze, +mouth opening) captures different types of startle/reaction responses without +requiring a trained classifier. + +### Why weight visual higher (0.55 vs 0.45)? +A speaker visibly reacting to a sound is unambiguous evidence that the sound +affects the narrative. High audio confidence alone (e.g., distant music) does +not necessarily warrant a CC. This weighting was determined empirically and is +easily tunable via `--fusion-thresh`. + +### Consolidation logic +Consecutive YAMNet frames detecting the same sound class within 1.0 seconds +are merged into a single event. This prevents the same sound from generating +dozens of overlapping CC annotations. + +--- + +## Known Limitations and Future Work + +1. **YAMNet class coverage:** Some culturally specific Indian sounds (dhol, + shehnai, specific street sounds) may not be in YAMNet's 521-class vocabulary. + A fine-tuned model on Indian audio content would improve recall for regional + content. + +2. **Single-face tracking:** Module 2 currently tracks only the primary face. + Multi-speaker scenes (talk shows, debates) would benefit from tracking all + visible speakers and triggering CC if any one of them reacts. + +3. **No GPU acceleration:** The pipeline runs on CPU. GPU inference would + reduce processing time significantly for long-form content. + +4. **SLS format:** The current output is standard SRT. PlanetRead's SLS format + has specific timing and encoding requirements that should be confirmed with + mentors and implemented as a post-processing step. + +5. **Threshold tuning:** The default thresholds (audio=0.35, fusion=0.50) were + set conservatively. Optimal values should be determined through systematic + evaluation with PlanetRead editors on a labeled Hindi/regional video dataset. 
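+
+For that kind of evaluation, the pipeline can also be driven programmatically instead of
+through the CLI. A minimal sketch of a threshold sweep follows; it assumes
+`intelligent_cc_pipeline.py` is importable from the working directory and uses
+`sample.mp4` as a placeholder input:
+
+```python
+from intelligent_cc_pipeline import run_pipeline
+
+# Compare how many CCs each fusion threshold approves on the same video.
+for fusion in (0.40, 0.50, 0.60):
+    result = run_pipeline(
+        video_path="sample.mp4",                    # placeholder input video
+        output_path=f"sample_cc_{fusion:.2f}.srt",
+        fusion_threshold=fusion,
+        lang="hi",
+        save_json=False,
+    )
+    print(f"fusion={fusion}: {len(result['accepted_cc'])} CCs approved")
+```
+
+Note that each call reloads YAMNet and MediaPipe, so a sweep over a long video is slow;
+for systematic tuning it may be preferable to reuse the cached JSON report instead.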
+ +--- + +## Demo Video + +> 📹 https://youtu.be/zn3huIukfiY + +The demo video shows the pipeline running on a sample video, with terminal +output for each module and the final SRT file being generated. + +--- + +## Repository Structure + +``` +. +├── intelligent_cc_pipeline.py # Main pipeline (all 3 modules) +├── README.md # This file +├── sample_output.srt # Example SRT output +└── sample_report.json # Example JSON report +``` \ No newline at end of file diff --git a/canva.mp4 b/canva.mp4 new file mode 100644 index 0000000..27e7704 Binary files /dev/null and b/canva.mp4 differ diff --git a/canva_cc_suggestions.srt b/canva_cc_suggestions.srt new file mode 100644 index 0000000..9fda099 --- /dev/null +++ b/canva_cc_suggestions.srt @@ -0,0 +1,36 @@ +1 +00:00:00,000 --> 00:00:03,840 +[कांच टूटना] + +2 +00:00:03,840 --> 00:00:04,840 +[कांच टूटना] + +3 +00:00:04,800 --> 00:00:05,800 +[कांच टूटना] + +4 +00:00:06,240 --> 00:00:12,000 +[तालियाँ] + +5 +00:00:12,000 --> 00:00:13,000 +[अलार्म] + +6 +00:00:13,440 --> 00:00:14,440 +[गोलीबारी] + +7 +00:00:13,920 --> 00:00:14,920 +[विस्फोट] + +8 +00:00:14,400 --> 00:00:15,400 +[सायरन] + +9 +00:00:14,880 --> 00:00:15,880 +[सायरन] + diff --git a/canva_cc_suggestions_report.json b/canva_cc_suggestions_report.json new file mode 100644 index 0000000..5f6090c --- /dev/null +++ b/canva_cc_suggestions_report.json @@ -0,0 +1,259 @@ +{ + "video": "canva.mp4", + "srt_output": "canva_cc_suggestions.srt", + "lang": "hi", + "audio_threshold": 0.35, + "fusion_threshold": 0.42, + "audio_only_threshold": null, + "total_events": 12, + "approved_cc": 9, + "audio_events": [ + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "confidence": 0.967, + "start_time": 0.0, + "end_time": 3.84, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "confidence": 0.668, + "start_time": 3.84, + "end_time": 4.32, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Liquid", + "label_en": "[LIQUID]", + "confidence": 0.634, + "start_time": 4.32, + "end_time": 4.8, + "label_out": "[पानी]" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "confidence": 0.921, + "start_time": 4.8, + "end_time": 5.76, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Liquid", + "label_en": "[LIQUID]", + "confidence": 0.489, + "start_time": 5.76, + "end_time": 6.24, + "label_out": "[पानी]" + }, + { + "sound": "Applause", + "label_en": "[APPLAUSE]", + "confidence": 0.996, + "start_time": 6.24, + "end_time": 12.0, + "label_out": "[तालियाँ]" + }, + { + "sound": "Alarm", + "label_en": "[ALARM]", + "confidence": 0.363, + "start_time": 12.0, + "end_time": 12.48, + "label_out": "[अलार्म]" + }, + { + "sound": "Gunshot, gunfire", + "label_en": "[GUNSHOT]", + "confidence": 0.954, + "start_time": 13.44, + "end_time": 13.92, + "label_out": "[गोलीबारी]" + }, + { + "sound": "Explosion", + "label_en": "[EXPLOSION]", + "confidence": 0.97, + "start_time": 13.92, + "end_time": 14.4, + "label_out": "[विस्फोट]" + }, + { + "sound": "Police car (siren)", + "label_en": "[SIREN]", + "confidence": 0.572, + "start_time": 14.4, + "end_time": 14.88, + "label_out": "[सायरन]" + }, + { + "sound": "Emergency vehicle", + "label_en": "[SIREN]", + "confidence": 0.799, + "start_time": 14.88, + "end_time": 15.84, + "label_out": "[सायरन]" + }, + { + "sound": "Vehicle", + "label_en": "[VEHICLE]", + "confidence": 0.495, + "start_time": 16.8, + "end_time": 17.28, + "label_out": "[वाहन]" + } + ], + "visual_scores": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.5, + 0.0, + 0.5, + 0.0, 
+ 0.0, + 0.0 + ], + "accepted_cc": [ + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "start_time": 0.0, + "label_out": "[कांच टूटना]", + "end_time": 3.84, + "audio_conf": 0.967, + "visual_conf": 0.0, + "combined": 0.822, + "combined_pre_boost": 0.435, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "start_time": 3.84, + "label_out": "[कांच टूटना]", + "end_time": 4.32, + "audio_conf": 0.668, + "visual_conf": 0.0, + "combined": 0.568, + "combined_pre_boost": 0.301, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Shatter", + "label_en": "[GLASS BREAKING]", + "start_time": 4.8, + "label_out": "[कांच टूटना]", + "end_time": 5.76, + "audio_conf": 0.921, + "visual_conf": 0.0, + "combined": 0.783, + "combined_pre_boost": 0.414, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Applause", + "label_en": "[APPLAUSE]", + "start_time": 6.24, + "label_out": "[तालियाँ]", + "end_time": 12.0, + "audio_conf": 0.996, + "visual_conf": 0.0, + "combined": 0.448, + "combined_pre_boost": 0.448, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Alarm", + "label_en": "[ALARM]", + "start_time": 12.0, + "label_out": "[अलार्म]", + "end_time": 12.48, + "audio_conf": 0.363, + "visual_conf": 0.5, + "combined": 0.438, + "combined_pre_boost": 0.438, + "high_impact": true, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Gunshot, gunfire", + "label_en": "[GUNSHOT]", + "start_time": 13.44, + "label_out": "[गोलीबारी]", + "end_time": 13.92, + "audio_conf": 0.954, + "visual_conf": 0.0, + "combined": 0.811, + "combined_pre_boost": 0.429, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Explosion", + "label_en": "[EXPLOSION]", + "start_time": 13.92, + "label_out": "[विस्फोट]", + "end_time": 14.4, + "audio_conf": 0.97, + "visual_conf": 0.5, + "combined": 0.712, + "combined_pre_boost": 0.712, + "high_impact": true, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Police car (siren)", + "label_en": "[SIREN]", + "start_time": 14.4, + "label_out": "[सायरन]", + "end_time": 14.88, + "audio_conf": 0.572, + "visual_conf": 0.0, + "combined": 0.486, + "combined_pre_boost": 0.257, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + }, + { + "sound": "Emergency vehicle", + "label_en": "[SIREN]", + "start_time": 14.88, + "label_out": "[सायरन]", + "end_time": 15.84, + "audio_conf": 0.799, + "visual_conf": 0.0, + "combined": 0.679, + "combined_pre_boost": 0.36, + "high_impact": true, + "high_impact_boost_applied": true, + "decision": "APPROVED", + "decision_basis": "HIGH_IMPACT" + } + ] +} \ No newline at end of file diff --git a/intelligent_cc_pipeline.py b/intelligent_cc_pipeline.py new file mode 100644 index 0000000..eb9a215 --- /dev/null +++ b/intelligent_cc_pipeline.py @@ -0,0 +1,940 @@ +""" +============================================================================= +Intelligent Closed Caption (CC) Suggestion Tool +PlanetRead — DMP 2026 
Demo Submission + +Author : Naitik +GitHub : https://github.com/naitik120gupta +Ticket : https://github.com/PlanetRead/Intelligent-cc-generation/issues/2 + +Description +----------- +End-to-end pipeline that accepts a video file and produces a ready-to-use +SRT file containing only contextually meaningful non-speech CC annotations. + +Pipeline stages: + Module 1 — Sound Event Detection (YAMNet via TensorFlow Hub) + Module 2 — Speaker Reaction Detection (MediaPipe Face Mesh + OpenCV) + Module 3 — CC Decision Engine + SRT Output + +Usage +----- + python intelligent_cc_pipeline.py --video sample.mp4 + python intelligent_cc_pipeline.py --video sample.mp4 --output my_cc.srt + python intelligent_cc_pipeline.py --video sample.mp4 --audio-thresh 0.4 --fusion-thresh 0.5 + python intelligent_cc_pipeline.py --video sample.mp4 --lang hi # Hindi CC labels + +Requirements +------------ + pip install "setuptools<82" tensorflow tensorflow-hub librosa moviepy mediapipe opencv-python srt numpy +============================================================================= +""" + +import os +import sys +import csv +import math +import json +import argparse +import datetime +import warnings +import tempfile +import subprocess +from pathlib import Path +import urllib.request + +warnings.filterwarnings("ignore") +os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" # Suppress TF CUDA warnings +os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0" + +import numpy as np +import cv2 +import srt +import librosa +import tensorflow as tf +import tensorflow_hub as hub +import mediapipe as mp + + +# ============================================================================= +# CC LABEL DICTIONARIES +# Mapping YAMNet class names → human-readable CC bracket labels +# Extend these to support more events or languages. 
+# ============================================================================= + +CC_LABELS_EN = { + # Vehicles + "Vehicle horn, car horn, honking": "[HONKING]", + "Beep, bleep": "[BEEPING]", + "Car alarm": "[CAR ALARM]", + "Tire squeal": "[TIRES SCREECHING]", + "Helicopter": "[HELICOPTER]", + "Emergency vehicle": "[SIREN]", + "Siren": "[SIREN]", + # Alarms / alerts + "Fire alarm": "[FIRE ALARM]", + "Alarm": "[ALARM]", + "Bell": "[BELL]", + "Telephone bell ringing": "[PHONE RINGING]", + "Doorbell": "[DOORBELL]", + # Human non-speech sounds + "Laughter": "[LAUGHTER]", + "Crying, sobbing": "[CRYING]", + "Screaming": "[SCREAMING]", + "Applause": "[APPLAUSE]", + "Clapping": "[CLAPPING]", + "Cheering": "[CHEERING]", + "Crowd": "[CROWD NOISE]", + "Whispering": "[WHISPERING]", + "Cough": "[COUGHING]", + "Sneeze": "[SNEEZE]", + # Impacts / sudden events + "Gunshot, gunfire": "[GUNSHOT]", + "Explosion": "[EXPLOSION]", + "Glass breaking": "[GLASS BREAKING]", + "Glass": "[GLASS BREAKING]", + "Shatter": "[GLASS BREAKING]", + "Slam": "[DOOR SLAM]", + "Knock": "[KNOCKING]", + "Crash": "[CRASH]", + "Thud": "[THUD]", + # Other common AudioSet/YAMNet labels + "Police car (siren)": "[SIREN]", + "Vehicle": "[VEHICLE]", + "Liquid": "[LIQUID]", + # Nature + "Thunder": "[THUNDER]", + "Rain": "[RAIN]", + "Wind": "[WIND]", + # Music + "Musical instrument": "[MUSIC]", + "Drum": "[DRUMBEAT]", + "Guitar": "[GUITAR]", +} + +CC_LABELS_HI = { + # Hindi translations of the same labels + "[HONKING]": "[हॉर्न]", + "[BEEPING]": "[बीप]", + "[CAR ALARM]": "[कार अलार्म]", + "[TIRES SCREECHING]": "[टायर चीखना]", + "[HELICOPTER]": "[हेलीकॉप्टर]", + "[SIREN]": "[सायरन]", + "[FIRE ALARM]": "[आग अलार्म]", + "[ALARM]": "[अलार्म]", + "[BELL]": "[घंटी]", + "[PHONE RINGING]": "[फ़ोन बज रहा है]", + "[DOORBELL]": "[दरवाज़े की घंटी]", + "[LAUGHTER]": "[हँसी]", + "[CRYING]": "[रोना]", + "[SCREAMING]": "[चीखना]", + "[APPLAUSE]": "[तालियाँ]", + "[CLAPPING]": "[ताली बजाना]", + "[CHEERING]": "[जयकार]", + "[CROWD NOISE]": "[भीड़ का शोर]", + "[WHISPERING]": "[फुसफुसाना]", + "[COUGHING]": "[खाँसी]", + "[SNEEZE]": "[छींक]", + "[GUNSHOT]": "[गोलीबारी]", + "[EXPLOSION]": "[विस्फोट]", + "[GLASS BREAKING]": "[कांच टूटना]", + "[GLASS]": "[कांच टूटना]", + "[SHATTER]": "[टूटने की आवाज़]", + "[DOOR SLAM]": "[दरवाज़ा बंद]", + "[KNOCKING]": "[दस्तक]", + "[CRASH]": "[टक्कर]", + "[THUD]": "[धड़ाम]", + "[THUNDER]": "[गर्जना]", + "[RAIN]": "[बारिश]", + "[WIND]": "[हवा]", + "[MUSIC]": "[संगीत]", + "[DRUMBEAT]": "[ड्रम]", + "[GUITAR]": "[गिटार]", + "[LIQUID]": "[पानी]", + "[VEHICLE]": "[वाहन]", + "[POLICE CAR (SIREN)]":"[सायरन]", +} + +# YAMNet class names that we always exclude (speech, ambient, silence) +EXCLUDED_CLASSES = { + "Speech", "Male speech, man speaking", "Female speech, woman speaking", + "Child speech, kid speaking", "Silence", "Inside, small room", + "Inside, large room or hall", "Outside, urban or manmade", + "Outside, rural or natural", "Noise", "Environmental noise", + "White noise", "Pink noise", "Background noise", +} + +# High-impact CC labels that should not depend on visible reaction. +# These are bracket labels (post-mapping), not raw YAMNet class names. 
+HIGH_IMPACT_LABELS = { + "[GUNSHOT]", + "[EXPLOSION]", + "[SIREN]", + "[FIRE ALARM]", + "[ALARM]", + "[GLASS BREAKING]", + "[SCREAMING]", + "[CRASH]", +} + + +# ============================================================================= +# MODULE 1 — SOUND EVENT DETECTION +# Uses YAMNet (Google AudioSet classifier) to detect and classify non-speech +# audio events with confidence scores and timestamps. +# ============================================================================= + +class SoundEventDetector: + """ + Detects and classifies non-speech audio events in a video file. + + YAMNet processes audio in 0.96s windows with 0.48s hop, producing one + prediction vector (521 AudioSet classes) per frame. We filter by + confidence threshold and exclude speech/silence/ambient classes. + """ + + YAMNET_URL = "https://tfhub.dev/google/yamnet/1" + FRAME_HOP = 0.48 # YAMNet hop duration in seconds + + def __init__(self): + print("[Module 1] Loading YAMNet model from TensorFlow Hub...") + self.model = hub.load(self.YAMNET_URL) + self.class_names = self._load_class_names() + print(f"[Module 1] YAMNet loaded — {len(self.class_names)} AudioSet classes available.\n") + + def _load_class_names(self): + class_map_path = self.model.class_map_path().numpy().decode("utf-8") + names = [] + with tf.io.gfile.GFile(class_map_path) as f: + reader = csv.DictReader(f) + for row in reader: + names.append(row["display_name"]) + return names + + def _extract_audio(self, video_path: str) -> np.ndarray: + """ + Extracts mono 16 kHz audio from a video file using ffmpeg subprocess. + Falls back to moviepy if ffmpeg is not on PATH. + + Returns a float32 numpy array of waveform samples. + """ + tmp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False) + tmp_wav = tmp_file.name + tmp_file.close() + try: + # Prefer ffmpeg — much faster and no Python overhead + cmd = [ + "ffmpeg", "-y", "-i", video_path, + "-ac", "1", "-ar", "16000", + "-vn", tmp_wav, "-loglevel", "error" + ] + subprocess.run(cmd, check=True, capture_output=True, text=True) + wav, _ = librosa.load(tmp_wav, sr=16000, mono=True) + except FileNotFoundError as e: + # ffmpeg not available — use moviepy + print("[Module 1] ffmpeg not available, using moviepy fallback...") + try: + from moviepy.editor import VideoFileClip + except ModuleNotFoundError as ie: + raise RuntimeError( + "Audio extraction requires either 'ffmpeg' on PATH or the 'moviepy' Python package. " + "Install ffmpeg (recommended) or run: pip install moviepy" + ) from ie + + clip = VideoFileClip(video_path) + if clip.audio is None: + clip.close() + raise RuntimeError(f"No audio track found in video: {video_path}") + + clip.audio.write_audiofile( + tmp_wav, + fps=16000, + nbytes=2, + codec="pcm_s16le", + logger=None, + ) + wav, _ = librosa.load(tmp_wav, sr=16000, mono=True) + clip.close() + except subprocess.CalledProcessError as e: + stderr = (e.stderr or "").strip() + details = f"ffmpeg failed extracting audio from '{video_path}' (exit code {e.returncode})." + if stderr: + details += f"\nffmpeg stderr:\n{stderr}" + details += ( + "\n\nThis usually means the input video is invalid/corrupt or uses an unsupported codec. " + "For example, 'moov atom not found' typically indicates an incomplete MP4 file." + ) + raise RuntimeError(details) from e + finally: + if os.path.exists(tmp_wav): + os.remove(tmp_wav) + return wav.astype(np.float32) + + def detect_events(self, video_path: str, + confidence_threshold: float = 0.35) -> list[dict]: + """ + Runs the full sound event detection pipeline. 
+ + Parameters + ---------- + video_path : path to the input video + confidence_threshold : minimum YAMNet score to keep an event + + Returns + ------- + List of dicts: [{sound, label_en, confidence, start_time, end_time}] + """ + print(f"[Module 1] Analysing audio from: {video_path}") + wav = self._extract_audio(video_path) + scores, _, _ = self.model(wav) # shape: (n_frames, 521) + scores_np = scores.numpy() + + raw_events = [] + for frame_idx, frame_scores in enumerate(scores_np): + top_idx = int(np.argmax(frame_scores)) + top_score = float(frame_scores[top_idx]) + class_name = self.class_names[top_idx] + + if top_score < confidence_threshold: + continue + if class_name in EXCLUDED_CLASSES: + continue + + timestamp = frame_idx * self.FRAME_HOP + # Map to a bracket label; use the raw class name if not in dict + label_en = CC_LABELS_EN.get(class_name, f"[{class_name.upper()}]") + + raw_events.append({ + "sound": class_name, + "label_en": label_en, + "confidence": round(top_score, 3), + "start_time": round(timestamp, 3), + "end_time": round(timestamp + self.FRAME_HOP, 3), + }) + + consolidated = self._consolidate(raw_events) + print(f"[Module 1] Detected {len(consolidated)} non-speech audio events.\n") + return consolidated + + def _consolidate(self, events: list[dict], + gap_threshold: float = 1.0) -> list[dict]: + """ + Merges consecutive detections of the same sound class that are + within `gap_threshold` seconds of each other into a single event. + This prevents the same sound producing dozens of separate CC entries. + """ + if not events: + return [] + merged = [events[0].copy()] + for ev in events[1:]: + last = merged[-1] + same_sound = ev["sound"] == last["sound"] + close_enough = (ev["start_time"] - last["end_time"]) < gap_threshold + if same_sound and close_enough: + last["end_time"] = ev["end_time"] + last["confidence"] = max(last["confidence"], ev["confidence"]) + else: + merged.append(ev.copy()) + return merged + + +# ============================================================================= +# MODULE 2 — SPEAKER REACTION DETECTION +# Uses MediaPipe Face Mesh to measure head/face movement and mouth-openness +# changes around an audio event timestamp, producing a reaction confidence score. +# ============================================================================= + +class SpeakerReactionDetector: + """ + Determines whether a visible speaker reacts to a detected audio event + by analysing changes in facial landmark dynamics before and after the event. + + Reaction signals used: + 1. Head velocity spike — sudden rapid head movement after the event + 2. Sustained movement — elevated mean head velocity after the event + 3. Stillness (freeze) — speaker freezes momentarily (startle response) + 4. Mouth open — sudden mouth opening (gasp, exclamation) + + If no face is detected, falls back to a pixel-level frame-difference + heuristic to capture scene-level visual disruption. 
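+
+    Example
+    -------
+    Illustrative call ("sample.mp4" is a placeholder path)::
+
+        detector = SpeakerReactionDetector()
+        score = detector.analyze_reaction("sample.mp4", event_time=3.6)
+        # score is a float in [0.0, 1.0]; higher means a stronger visible reaction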
+ """ + + # MediaPipe facial landmark indices + NOSE_TIP = 1 + LEFT_EYE = 33 + RIGHT_EYE = 263 + UPPER_LIP = 13 + LOWER_LIP = 14 + + # MediaPipe Tasks face landmarker model (used when mp.solutions is unavailable) + FACE_LANDMARKER_TASK_URL = ( + "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task" + ) + + def __init__(self): + print("[Module 2] Initialising MediaPipe Face Mesh...") + self._backend = None + self.mp_face_mesh = None + self.face_mesh = None + self.face_landmarker = None + + # Legacy API (older MediaPipe): mp.solutions.face_mesh.FaceMesh + if hasattr(mp, "solutions") and hasattr(mp.solutions, "face_mesh"): + self._backend = "solutions" + self.mp_face_mesh = mp.solutions.face_mesh + self.face_mesh = self.mp_face_mesh.FaceMesh( + static_image_mode=False, + max_num_faces=1, + refine_landmarks=True, + min_detection_confidence=0.5, + min_tracking_confidence=0.5, + ) + else: + # Newer MediaPipe (0.10.35+ in some builds): Tasks API only + self._backend = "tasks" + # MediaPipe Tasks VIDEO mode requires monotonically increasing timestamps + # across *all* detect_for_video() calls for the lifetime of the landmarker. + # Our pipeline analyzes multiple overlapping windows per video, so we + # maintain an internal timestamp counter instead of using video-time. + self._tasks_timestamp_ms = 0 + from mediapipe.tasks.python.core.base_options import BaseOptions + from mediapipe.tasks.python.vision import face_landmarker + from mediapipe.tasks.python.vision.core.vision_task_running_mode import VisionTaskRunningMode + + model_path = self._ensure_face_landmarker_task_model() + options = face_landmarker.FaceLandmarkerOptions( + base_options=BaseOptions(model_asset_path=model_path), + running_mode=VisionTaskRunningMode.VIDEO, + num_faces=1, + min_face_detection_confidence=0.5, + min_face_presence_confidence=0.5, + min_tracking_confidence=0.5, + ) + self.face_landmarker = face_landmarker.FaceLandmarker.create_from_options(options) + print("[Module 2] MediaPipe ready.\n") + + def _ensure_face_landmarker_task_model(self) -> str: + cache_dir = Path.home() / ".cache" / "planetread" / "mediapipe" + cache_dir.mkdir(parents=True, exist_ok=True) + model_path = cache_dir / "face_landmarker.task" + if model_path.exists() and model_path.stat().st_size > 0: + return str(model_path) + + print("[Module 2] Downloading MediaPipe face_landmarker.task model...") + try: + urllib.request.urlretrieve(self.FACE_LANDMARKER_TASK_URL, model_path) + except Exception as e: + raise RuntimeError( + "Failed to download the MediaPipe face landmarker model. 
" + "Check your internet connection or manually download the model and place it at: " + f"{model_path}" + ) from e + + return str(model_path) + + def __del__(self): + # Best-effort cleanup for MediaPipe resources + try: + if self.face_mesh is not None: + self.face_mesh.close() + except Exception: + pass + try: + if self.face_landmarker is not None: + self.face_landmarker.close() + except Exception: + pass + + def _face_scale(self, lm) -> float: + """Inter-ocular distance — used to normalise head movement magnitude.""" + dx = lm[self.LEFT_EYE].x - lm[self.RIGHT_EYE].x + dy = lm[self.LEFT_EYE].y - lm[self.RIGHT_EYE].y + return max(math.hypot(dx, dy), 1e-6) + + def _mouth_openness(self, lm) -> float: + return abs(lm[self.UPPER_LIP].y - lm[self.LOWER_LIP].y) + + def _frame_diff(self, f1: np.ndarray, f2: np.ndarray) -> float: + """Mean absolute pixel difference between two greyscale frames.""" + g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY).astype(np.float32) + g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY).astype(np.float32) + return float(np.mean(np.abs(g1 - g2))) + + def analyze_reaction(self, video_path: str, + event_time: float, + window_before: float = 1.5, + window_after: float = 2.0) -> float: + """ + Analyses frames in [event_time - window_before, event_time + window_after] + and returns a reaction confidence score in [0.0, 1.0]. + + Parameters + ---------- + video_path : path to the video file + event_time : audio event timestamp (seconds) + window_before : seconds of baseline frames to analyse before event + window_after : seconds of reaction frames to analyse after event + + Returns + ------- + reaction_confidence : float in [0.0, 1.0] + """ + print(f"[Module 2] Checking visual reaction at t={event_time:.2f}s ...") + + cap = cv2.VideoCapture(video_path) + fps = cap.get(cv2.CAP_PROP_FPS) or 25.0 + dt = 1.0 / fps + + start_frame = int(max(0, event_time - window_before) * fps) + end_frame = int((event_time + window_after) * fps) + event_frame = int(event_time * fps) + + cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame) + + before_vel, after_vel = [], [] + before_mouth, after_mouth = [], [] + prev_nose = None + prev_vel = 0.0 + prev_frame = None + scene_diffs_before, scene_diffs_after = [], [] + face_detected_any = False + + cur = start_frame + tasks_step_ms = max(1, int(round(dt * 1000))) + while cur <= end_frame: + ok, frame = cap.read() + if not ok: + break + + rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + landmarks = None + if self._backend == "solutions": + results = self.face_mesh.process(rgb) + if results.multi_face_landmarks: + landmarks = results.multi_face_landmarks[0].landmark + else: + # Tasks API expects a MediaPipe Image + monotonically increasing timestamp (ms) + mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb) + timestamp_ms = self._tasks_timestamp_ms + self._tasks_timestamp_ms += tasks_step_ms + + results = self.face_landmarker.detect_for_video(mp_image, timestamp_ms) + if results.face_landmarks: + landmarks = results.face_landmarks[0] + + if landmarks: + face_detected_any = True + lm = landmarks + scale = self._face_scale(lm) + nose = (lm[self.NOSE_TIP].x, lm[self.NOSE_TIP].y) + mouth = self._mouth_openness(lm) + + if prev_nose is not None: + raw_vel = math.hypot(nose[0] - prev_nose[0], + nose[1] - prev_nose[1]) / (dt * scale) + # Exponential smoothing — reduces noise from jitter + vel = 0.6 * prev_vel + 0.4 * raw_vel + prev_vel = vel + if cur < event_frame: + before_vel.append(vel) + before_mouth.append(mouth) + else: + after_vel.append(vel) + 
after_mouth.append(mouth) + prev_nose = nose + else: + # No face — accumulate scene-level pixel diffs as fallback + if prev_frame is not None: + diff = self._frame_diff(prev_frame, frame) + if cur < event_frame: + scene_diffs_before.append(diff) + else: + scene_diffs_after.append(diff) + prev_nose, prev_vel = None, 0.0 + + prev_frame = frame.copy() + cur += 1 + + cap.release() + + # --- Score computation --- + score = 0.0 + + if face_detected_any and before_vel and after_vel: + mu_b = np.mean(before_vel) + std_b = np.std(before_vel) + + # Signal 1: velocity spike (>2σ above baseline for >2 frames) + spike_frames = np.sum(np.array(after_vel) > mu_b + 2 * std_b) + if spike_frames >= 2: + score += 0.40 + + # Signal 2: sustained elevated movement + if np.mean(after_vel) > 1.5 * max(mu_b, 1e-6): + score += 0.25 + + # Signal 3: freeze response (sudden stillness) + if np.mean(after_vel) < 0.5 * max(mu_b, 1e-6) and mu_b > 0.01: + score += 0.15 + + # Signal 4: mouth opens (gasp / exclamation) + if before_mouth and after_mouth: + if np.mean(after_mouth) > 1.6 * max(np.mean(before_mouth), 1e-6): + score += 0.20 + + elif scene_diffs_before and scene_diffs_after: + # Fallback: scene-level visual disruption + mean_before_diff = np.mean(scene_diffs_before) + mean_after_diff = np.mean(scene_diffs_after) + if mean_after_diff > 2.0 * max(mean_before_diff, 1e-6): + score += 0.50 + elif mean_after_diff > 1.3 * max(mean_before_diff, 1e-6): + score += 0.25 + + score = round(min(1.0, score), 3) + print(f"[Module 2] Reaction confidence: {score:.2f}\n") + return score + + +# ============================================================================= +# MODULE 3 — CC DECISION ENGINE + SRT OUTPUT +# Combines audio event confidence and visual reaction score to decide +# whether to generate a CC annotation, then writes the SRT/SLS file. +# ============================================================================= + +class CCDecisionEngine: + """ + Fusion layer that combines Module 1 (audio) and Module 2 (visual) signals + and generates a standard SRT (or plain-text SLS) file. + + Decision formula + ---------------- + combined = audio_weight * audio_conf + visual_weight * visual_conf + + We weight visual higher (0.55 vs 0.45) because a visible speaker reaction + is a stronger signal of narrative significance than audio confidence alone. + If combined >= fusion_threshold → CC is generated. + + The decision logic is intentionally simple and interpretable so that + PlanetRead editors can easily understand and tune thresholds. + """ + + AUDIO_WEIGHT = 0.45 + VISUAL_WEIGHT = 0.55 + + def __init__(self, + fusion_threshold: float = 0.42, + audio_only_threshold: float | None = 0.75, + lang: str = "en"): + """ + Parameters + ---------- + fusion_threshold : combined score threshold above which a CC is generated + lang : 'en' for English labels, 'hi' for Hindi labels + """ + self.fusion_threshold = fusion_threshold + self.audio_only_threshold = audio_only_threshold + self.lang = lang + + def _get_cc_text(self, label_en: str) -> str: + if self.lang == "hi": + return CC_LABELS_HI.get(label_en, label_en) + return label_en + + def _to_timedelta(self, seconds: float) -> datetime.timedelta: + return datetime.timedelta(seconds=seconds) + + def decide_and_generate(self, + audio_events: list[dict], + visual_scores: list[float], + video_path: str, + output_path: str, + min_duration: float = 1.0) -> list[dict]: + """ + Runs the decision engine and writes the SRT file. 
+ + Parameters + ---------- + audio_events : output of Module 1 + visual_scores : output of Module 2 (one score per audio event) + video_path : original video path (used for metadata only) + output_path : path to write the .srt file + min_duration : minimum CC subtitle duration in seconds + + Returns + ------- + List of accepted CC annotations (dicts). + """ + print("[Module 3] Running CC Decision Engine...") + + accepted = [] + rejected = [] + subtitles = [] + + for idx, (event, vis_score) in enumerate(zip(audio_events, visual_scores)): + audio_conf = event["confidence"] + combined = (self.AUDIO_WEIGHT * audio_conf + + self.VISUAL_WEIGHT * vis_score) + combined = round(combined, 3) + + # --- High-impact bypass/boost --- + # If a critical sound happens off-camera (no visual reaction), we still want + # to consider it for CC based on audio strength alone. + label_en = event.get("label_en", "") + is_high_impact = label_en in HIGH_IMPACT_LABELS + high_impact_boost_applied = False + combined_pre_boost = combined + if is_high_impact and vis_score <= 0.0 and audio_conf >= 0.55: + boosted_floor = round(audio_conf * 0.85, 3) + if boosted_floor > combined: + combined = boosted_floor + high_impact_boost_applied = True + + approved_by_fusion = combined >= self.fusion_threshold + approved_by_audio_only = ( + self.audio_only_threshold is not None and + audio_conf >= self.audio_only_threshold + ) + approved = approved_by_fusion or approved_by_audio_only + + decision_info = { + "sound": event["sound"], + "label_en": event["label_en"], + "start_time": event["start_time"], + "label_out": self._get_cc_text(event["label_en"]), + "end_time": event["end_time"], + "audio_conf": audio_conf, + "visual_conf": vis_score, + "combined": combined, + "combined_pre_boost": combined_pre_boost, + "high_impact": is_high_impact, + "high_impact_boost_applied": high_impact_boost_applied, + "decision": "APPROVED" if approved else "REJECTED", + "decision_basis": ( + "AUDIO_ONLY" if approved_by_audio_only else + ("HIGH_IMPACT" if (approved_by_fusion and high_impact_boost_applied) else + ("FUSION" if approved_by_fusion else "NONE")) + ), + } + + if approved: + cc_text = self._get_cc_text(event["label_en"]) + # Ensure subtitle has at least min_duration on screen + end_t = max(event["end_time"], event["start_time"] + min_duration) + sub = srt.Subtitle( + index=len(subtitles) + 1, + start=self._to_timedelta(event["start_time"]), + end=self._to_timedelta(end_t), + content=cc_text, + ) + subtitles.append(sub) + accepted.append(decision_info) + print(f" ✅ APPROVED | {event['sound'][:35]:<35} " + f"| t={event['start_time']:.2f}s " + f"| audio={audio_conf:.2f} vis={vis_score:.2f} " + f"→ combined={combined:.2f} " + f"({decision_info['decision_basis']}) " + f"→ {cc_text}") + else: + rejected.append(decision_info) + print(f" ❌ REJECTED | {event['sound'][:35]:<35} " + f"| t={event['start_time']:.2f}s " + f"| audio={audio_conf:.2f} vis={vis_score:.2f} " + f"→ combined={combined:.2f} (below threshold {self.fusion_threshold})") + + # Write SRT file + srt_content = srt.compose(subtitles) + Path(output_path).write_text(srt_content, encoding="utf-8") + + print(f"\n[Module 3] Complete.") + print(f" Approved : {len(accepted)}") + print(f" Rejected : {len(rejected)}") + print(f" SRT file : {output_path}\n") + + return accepted + + +# ============================================================================= +# PIPELINE ORCHESTRATOR +# Ties all three modules together into a single callable function. 
+# ============================================================================= + +def run_pipeline(video_path: str, + output_path: str = None, + audio_threshold: float = 0.35, + fusion_threshold: float = 0.42, + audio_only_threshold: float | None = 0.75, + lang: str = "en", + save_json: bool = True) -> dict: + """ + Runs the full end-to-end Intelligent CC Suggestion pipeline. + + Parameters + ---------- + video_path : path to input video file + output_path : path to write .srt (auto-generated if None) + audio_threshold : YAMNet confidence threshold for Module 1 + fusion_threshold : combined score threshold for Module 3 + lang : 'en' or 'hi' + save_json : if True, also saves a JSON report alongside the SRT + + Returns + ------- + Dictionary with keys: audio_events, visual_scores, accepted_cc, srt_path + """ + if not os.path.exists(video_path): + raise FileNotFoundError(f"Video file not found: {video_path}") + + stem = Path(video_path).stem + if output_path is None: + output_path = f"{stem}_cc_suggestions.srt" + + print("=" * 65) + print(" INTELLIGENT CC SUGGESTION TOOL — PlanetRead DMP 2026") + print("=" * 65) + print(f" Input video : {video_path}") + print(f" Output SRT : {output_path}") + print(f" Language : {'Hindi' if lang == 'hi' else 'English'}") + print(f" Thresholds : audio={audio_threshold}, fusion={fusion_threshold}") + print("=" * 65 + "\n") + + # --- Module 1 --- + sed = SoundEventDetector() + events = sed.detect_events(video_path, confidence_threshold=audio_threshold) + + if not events: + print("No significant non-speech audio events detected. No SRT generated.") + return {"audio_events": [], "visual_scores": [], + "accepted_cc": [], "srt_path": None} + + # --- Module 2 --- + vrd = SpeakerReactionDetector() + scores = [] + for ev in events: + mid_time = (ev["start_time"] + ev["end_time"]) / 2.0 + score = vrd.analyze_reaction(video_path, event_time=mid_time) + scores.append(score) + + # --- Module 3 --- + engine = CCDecisionEngine( + fusion_threshold=fusion_threshold, + audio_only_threshold=audio_only_threshold, + lang=lang, + ) + accepted = engine.decide_and_generate( + audio_events=events, + visual_scores=scores, + video_path=video_path, + output_path=output_path, + ) + + # Optional JSON report + json_path = None + if save_json: + json_path = output_path.replace(".srt", "_report.json") + + # Add language-specific label alongside label_en for easier consumption + events_with_labels = [ + { + **ev, + "label_out": engine._get_cc_text(ev.get("label_en", "")), + } + for ev in events + ] + report = { + "video": video_path, + "srt_output": output_path, + "lang": lang, + "audio_threshold": audio_threshold, + "fusion_threshold":fusion_threshold, + "audio_only_threshold": audio_only_threshold, + "total_events": len(events), + "approved_cc": len(accepted), + "audio_events": events_with_labels, + "visual_scores": scores, + "accepted_cc": accepted, + } + Path(json_path).write_text( + json.dumps(report, indent=2, ensure_ascii=False), + encoding="utf-8", + ) + print(f" JSON report: {json_path}") + + # --- Final summary --- + print("\n" + "=" * 65) + print(" PIPELINE SUMMARY") + print("=" * 65) + print(f" Non-speech events detected : {len(events)}") + print(f" CCs approved : {len(accepted)}") + print(f" CCs rejected : {len(events) - len(accepted)}") + print(f" Output SRT : {output_path}") + if json_path: + print(f" JSON report : {json_path}") + print("=" * 65 + "\n") + + if accepted: + print(" Generated CC annotations:") + print(" " + "-" * 55) + for cc in accepted: + print(f" 
{cc['start_time']:>7.2f}s {cc.get('label_out', cc['label_en'])}") + print() + + return { + "audio_events": events, + "visual_scores": scores, + "accepted_cc": accepted, + "srt_path": output_path, + "json_path": json_path, + } + + +# ============================================================================= +# CLI ENTRY POINT +# ============================================================================= + +def parse_args(): + p = argparse.ArgumentParser( + description="Intelligent CC Suggestion Tool — PlanetRead DMP 2026", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + python intelligent_cc_pipeline.py --video input.mp4 + python intelligent_cc_pipeline.py --video input.mp4 --lang hi + python intelligent_cc_pipeline.py --video input.mp4 --audio-thresh 0.4 --fusion-thresh 0.55 + python intelligent_cc_pipeline.py --video input.mp4 --output my_captions.srt + """ + ) + p.add_argument("--video", required=True, help="Path to input video file") + p.add_argument("--output", default=None, help="Output SRT file path (auto-named if omitted)") + p.add_argument("--audio-thresh", type=float, default=0.35, + help="YAMNet confidence threshold (default: 0.35)") + p.add_argument("--fusion-thresh", type=float, default=0.42, + help="Combined audio+visual threshold to approve CC (default: 0.42)") + p.add_argument( + "--audio-only-thresh", + type=float, + default=0.75, + help=( + "Approve events purely on audio confidence if >= this value (default: 0.75). " + "Set to a negative value to disable audio-only approvals." + ), + ) + p.add_argument("--lang", choices=["en", "hi"], default="en", + help="CC label language: 'en' (English) or 'hi' (Hindi)") + p.add_argument("--no-json", action="store_true", + help="Skip saving the JSON report") + return p.parse_args() + + +if __name__ == "__main__": + args = parse_args() + audio_only_threshold = None if args.audio_only_thresh < 0 else args.audio_only_thresh + run_pipeline( + video_path=args.video, + output_path=args.output, + audio_threshold=args.audio_thresh, + fusion_threshold=args.fusion_thresh, + audio_only_threshold=audio_only_threshold, + lang=args.lang, + save_json=not args.no_json, + ) \ No newline at end of file diff --git a/reaction_detector.py b/reaction_detector.py new file mode 100644 index 0000000..6b6b8bb --- /dev/null +++ b/reaction_detector.py @@ -0,0 +1,337 @@ +"""Module 2 demo — Speaker/Scene Reaction Detection. + +This script demonstrates the *visual reaction* module of the Intelligent CC pipeline. +Given a video and one or more event timestamps (seconds), it extracts frames around +each event and returns a reaction confidence score in [0, 1]. + +What counts as a "reaction"? +- Sudden head movement (landmark motion) after the event +- Sustained movement elevation after the event +- Freeze response (drop in movement) +- Sudden mouth opening (gasp) + +If no face is detected, it falls back to a simple scene-level frame-difference +heuristic as a proxy for visual disruption. + +Notes +----- +- Supports both MediaPipe backends: + - mp.solutions.face_mesh.FaceMesh (legacy) + - mediapipe.tasks FaceLandmarker VIDEO mode (newer builds) +- MediaPipe Tasks VIDEO mode requires monotonically increasing timestamps across + all detect_for_video() calls; we maintain an internal counter to satisfy that. 
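+
+Usage
+-----
+    python reaction_detector.py --video sample.mp4 --times 0.96,3.6,39.12
+    python reaction_detector.py --video sample.mp4 --from-report sample_cc_suggestions_report.json
+
+(`sample.mp4` and the report path above are placeholders; event times are in seconds.)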
+""" + +from __future__ import annotations + +import argparse +import json +import math +import urllib.request +from pathlib import Path + +import cv2 +import mediapipe as mp +import numpy as np + + +class SpeakerReactionDetector: + """Visual reaction detector (Module 2).""" + + # MediaPipe facial landmark indices + NOSE_TIP = 1 + LEFT_EYE = 33 + RIGHT_EYE = 263 + UPPER_LIP = 13 + LOWER_LIP = 14 + + FACE_LANDMARKER_TASK_URL = ( + "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task" + ) + + def __init__(self): + self._backend = None + self.face_mesh = None + self.face_landmarker = None + self._tasks_timestamp_ms = 0 + + if hasattr(mp, "solutions") and hasattr(mp.solutions, "face_mesh"): + self._backend = "solutions" + self.face_mesh = mp.solutions.face_mesh.FaceMesh( + static_image_mode=False, + max_num_faces=1, + refine_landmarks=True, + min_detection_confidence=0.5, + min_tracking_confidence=0.5, + ) + else: + self._backend = "tasks" + from mediapipe.tasks.python.core.base_options import BaseOptions + from mediapipe.tasks.python.vision import face_landmarker + from mediapipe.tasks.python.vision.core.vision_task_running_mode import ( + VisionTaskRunningMode, + ) + + model_path = self._ensure_face_landmarker_task_model() + options = face_landmarker.FaceLandmarkerOptions( + base_options=BaseOptions(model_asset_path=model_path), + running_mode=VisionTaskRunningMode.VIDEO, + num_faces=1, + min_face_detection_confidence=0.5, + min_face_presence_confidence=0.5, + min_tracking_confidence=0.5, + ) + self.face_landmarker = face_landmarker.FaceLandmarker.create_from_options(options) + + def __del__(self): + try: + if self.face_mesh is not None: + self.face_mesh.close() + except Exception: + pass + try: + if self.face_landmarker is not None: + self.face_landmarker.close() + except Exception: + pass + + def _ensure_face_landmarker_task_model(self) -> str: + cache_dir = Path.home() / ".cache" / "planetread" / "mediapipe" + cache_dir.mkdir(parents=True, exist_ok=True) + model_path = cache_dir / "face_landmarker.task" + if model_path.exists() and model_path.stat().st_size > 0: + return str(model_path) + + urllib.request.urlretrieve(self.FACE_LANDMARKER_TASK_URL, model_path) + return str(model_path) + + def _face_scale(self, lm) -> float: + dx = lm[self.LEFT_EYE].x - lm[self.RIGHT_EYE].x + dy = lm[self.LEFT_EYE].y - lm[self.RIGHT_EYE].y + return max(math.hypot(dx, dy), 1e-6) + + def _mouth_openness(self, lm) -> float: + return abs(lm[self.UPPER_LIP].y - lm[self.LOWER_LIP].y) + + def _frame_diff(self, f1: np.ndarray, f2: np.ndarray) -> float: + g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY).astype(np.float32) + g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY).astype(np.float32) + return float(np.mean(np.abs(g1 - g2))) + + def analyze_reaction( + self, + video_path: str, + event_time: float, + window_before: float = 1.5, + window_after: float = 2.0, + ) -> dict: + """Return reaction confidence + diagnostics for one event.""" + cap = cv2.VideoCapture(video_path) + if not cap.isOpened(): + raise RuntimeError(f"Failed to open video: {video_path}") + + fps = cap.get(cv2.CAP_PROP_FPS) or 25.0 + dt = 1.0 / fps + + start_frame = int(max(0.0, event_time - window_before) * fps) + end_frame = int((event_time + window_after) * fps) + event_frame = int(event_time * fps) + + cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame) + + before_vel: list[float] = [] + after_vel: list[float] = [] + before_mouth: list[float] = [] + after_mouth: list[float] = [] + + prev_nose = None + 
prev_vel = 0.0 + prev_frame = None + scene_diffs_before: list[float] = [] + scene_diffs_after: list[float] = [] + face_detected_frames = 0 + total_frames = 0 + + tasks_step_ms = max(1, int(round(dt * 1000))) + + cur = start_frame + while cur <= end_frame: + ok, frame = cap.read() + if not ok: + break + total_frames += 1 + + rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) + landmarks = None + + if self._backend == "solutions": + results = self.face_mesh.process(rgb) + if results.multi_face_landmarks: + landmarks = results.multi_face_landmarks[0].landmark + else: + mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb) + timestamp_ms = self._tasks_timestamp_ms + self._tasks_timestamp_ms += tasks_step_ms + results = self.face_landmarker.detect_for_video(mp_image, timestamp_ms) + if results.face_landmarks: + landmarks = results.face_landmarks[0] + + if landmarks: + face_detected_frames += 1 + lm = landmarks + scale = self._face_scale(lm) + nose = (lm[self.NOSE_TIP].x, lm[self.NOSE_TIP].y) + mouth = self._mouth_openness(lm) + + if prev_nose is not None: + raw_vel = math.hypot(nose[0] - prev_nose[0], nose[1] - prev_nose[1]) / ( + dt * scale + ) + vel = 0.6 * prev_vel + 0.4 * raw_vel + prev_vel = vel + if cur < event_frame: + before_vel.append(vel) + before_mouth.append(mouth) + else: + after_vel.append(vel) + after_mouth.append(mouth) + prev_nose = nose + else: + if prev_frame is not None: + diff = self._frame_diff(prev_frame, frame) + if cur < event_frame: + scene_diffs_before.append(diff) + else: + scene_diffs_after.append(diff) + prev_nose, prev_vel = None, 0.0 + + prev_frame = frame.copy() + cur += 1 + + cap.release() + + score = 0.0 + basis = "NONE" + + if face_detected_frames > 0 and before_vel and after_vel: + mu_b = float(np.mean(before_vel)) + std_b = float(np.std(before_vel)) + + spike_frames = int(np.sum(np.array(after_vel) > mu_b + 2 * std_b)) + if spike_frames >= 2: + score += 0.40 + + if float(np.mean(after_vel)) > 1.5 * max(mu_b, 1e-6): + score += 0.25 + + if float(np.mean(after_vel)) < 0.5 * max(mu_b, 1e-6) and mu_b > 0.01: + score += 0.15 + + if before_mouth and after_mouth: + if float(np.mean(after_mouth)) > 1.6 * max(float(np.mean(before_mouth)), 1e-6): + score += 0.20 + basis = "FACE" + + elif scene_diffs_before and scene_diffs_after: + mean_before_diff = float(np.mean(scene_diffs_before)) + mean_after_diff = float(np.mean(scene_diffs_after)) + if mean_after_diff > 2.0 * max(mean_before_diff, 1e-6): + score += 0.50 + elif mean_after_diff > 1.3 * max(mean_before_diff, 1e-6): + score += 0.25 + basis = "SCENE_DIFF" + + score = round(min(1.0, score), 3) + return { + "event_time": round(float(event_time), 3), + "reaction_confidence": score, + "basis": basis, + "backend": self._backend, + "face_detected_frames": face_detected_frames, + "total_frames": total_frames, + } + + +def _parse_event_times(times: str | None) -> list[float]: + if not times: + return [] + out: list[float] = [] + for part in times.split(","): + part = part.strip() + if not part: + continue + out.append(float(part)) + return out + + +def _event_times_from_report(report_path: str) -> list[float]: + data = json.loads(Path(report_path).read_text(encoding="utf-8")) + events = data.get("audio_events") or [] + times: list[float] = [] + for ev in events: + start = float(ev.get("start_time", 0.0)) + end = float(ev.get("end_time", start)) + times.append((start + end) / 2.0) + return times + + +def main() -> int: + p = argparse.ArgumentParser(description="Module 2 demo — visual reaction detection") + 
p.add_argument("--video", required=True, help="Path to input video") + p.add_argument( + "--times", + default=None, + help="Comma-separated event times in seconds (e.g. '0.96,3.6,39.12')", + ) + p.add_argument( + "--from-report", + default=None, + help="Optional JSON report with audio_events (uses midpoints as event times)", + ) + p.add_argument("--out", default=None, help="Output JSON path") + p.add_argument("--window-before", type=float, default=1.5) + p.add_argument("--window-after", type=float, default=2.0) + args = p.parse_args() + + times = _parse_event_times(args.times) + if args.from_report: + times = _event_times_from_report(args.from_report) + if not times: + raise SystemExit("No event times provided. Use --times or --from-report") + + det = SpeakerReactionDetector() + results = [ + det.analyze_reaction( + args.video, + event_time=t, + window_before=args.window_before, + window_after=args.window_after, + ) + for t in times + ] + + payload = { + "video": args.video, + "num_events": len(results), + "window_before": args.window_before, + "window_after": args.window_after, + "results": results, + } + + out_path = args.out + if out_path is None: + out_path = str(Path(args.video).with_suffix("")) + "_module2_reaction_report.json" + Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8") + + print("Event time | reaction | basis | face_frames/total") + print("-" * 55) + for r in results: + print( + f"{r['event_time']:>8.2f}s | {r['reaction_confidence']:<8.2f} | {r['basis']:<9} | {r['face_detected_frames']}/{r['total_frames']}" + ) + print(f"\nWrote: {out_path}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) \ No newline at end of file diff --git a/spider.mp4 b/spider.mp4 new file mode 100644 index 0000000..799e246 Binary files /dev/null and b/spider.mp4 differ diff --git a/spider_cc_suggestions.srt b/spider_cc_suggestions.srt new file mode 100644 index 0000000..033185f --- /dev/null +++ b/spider_cc_suggestions.srt @@ -0,0 +1,20 @@ +1 +00:00:00,000 --> 00:00:01,920 +[हेलीकॉप्टर] + +2 +00:00:38,880 --> 00:00:39,880 +[कांच टूटना] + +3 +00:01:30,720 --> 00:01:31,720 +[संगीत] + +4 +00:01:37,440 --> 00:01:38,440 +[संगीत] + +5 +00:01:39,360 --> 00:01:40,800 +[संगीत] + diff --git a/spider_cc_suggestions_report.json b/spider_cc_suggestions_report.json new file mode 100644 index 0000000..e35754a --- /dev/null +++ b/spider_cc_suggestions_report.json @@ -0,0 +1,163 @@ +{ + "video": "spider.mp4", + "srt_output": "spider_cc_suggestions.srt", + "lang": "hi", + "audio_threshold": 0.35, + "fusion_threshold": 0.42, + "audio_only_threshold": 0.75, + "total_events": 8, + "approved_cc": 5, + "audio_events": [ + { + "sound": "Helicopter", + "label_en": "[HELICOPTER]", + "confidence": 0.782, + "start_time": 0.0, + "end_time": 1.92, + "label_out": "[हेलीकॉप्टर]" + }, + { + "sound": "Helicopter", + "label_en": "[HELICOPTER]", + "confidence": 0.535, + "start_time": 3.36, + "end_time": 3.84, + "label_out": "[हेलीकॉप्टर]" + }, + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "confidence": 0.766, + "start_time": 38.88, + "end_time": 39.36, + "label_out": "[कांच टूटना]" + }, + { + "sound": "Whispering", + "label_en": "[WHISPERING]", + "confidence": 0.408, + "start_time": 63.36, + "end_time": 63.84, + "label_out": "[फुसफुसाना]" + }, + { + "sound": "Animal", + "label_en": "[ANIMAL]", + "confidence": 0.386, + "start_time": 70.56, + "end_time": 71.04, + "label_out": "[ANIMAL]" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "confidence": 0.602, + "start_time": 
90.72, + "end_time": 91.2, + "label_out": "[संगीत]" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "confidence": 0.798, + "start_time": 97.44, + "end_time": 97.92, + "label_out": "[संगीत]" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "confidence": 0.978, + "start_time": 99.36, + "end_time": 100.8, + "label_out": "[संगीत]" + } + ], + "visual_scores": [ + 0.0, + 0.0, + 0.4, + 0.4, + 0.4, + 0.65, + 0.2, + 0.15 + ], + "accepted_cc": [ + { + "sound": "Helicopter", + "label_en": "[HELICOPTER]", + "start_time": 0.0, + "label_out": "[हेलीकॉप्टर]", + "end_time": 1.92, + "audio_conf": 0.782, + "visual_conf": 0.0, + "combined": 0.352, + "combined_pre_boost": 0.352, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + }, + { + "sound": "Glass", + "label_en": "[GLASS BREAKING]", + "start_time": 38.88, + "label_out": "[कांच टूटना]", + "end_time": 39.36, + "audio_conf": 0.766, + "visual_conf": 0.4, + "combined": 0.565, + "combined_pre_boost": 0.565, + "high_impact": true, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "start_time": 90.72, + "label_out": "[संगीत]", + "end_time": 91.2, + "audio_conf": 0.602, + "visual_conf": 0.65, + "combined": 0.628, + "combined_pre_boost": 0.628, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "FUSION" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "start_time": 97.44, + "label_out": "[संगीत]", + "end_time": 97.92, + "audio_conf": 0.798, + "visual_conf": 0.2, + "combined": 0.469, + "combined_pre_boost": 0.469, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + }, + { + "sound": "Music", + "label_en": "[MUSIC]", + "start_time": 99.36, + "label_out": "[संगीत]", + "end_time": 100.8, + "audio_conf": 0.978, + "visual_conf": 0.15, + "combined": 0.523, + "combined_pre_boost": 0.523, + "high_impact": false, + "high_impact_boost_applied": false, + "decision": "APPROVED", + "decision_basis": "AUDIO_ONLY" + } + ] +} \ No newline at end of file diff --git a/spider_module2_reaction_report.json b/spider_module2_reaction_report.json new file mode 100644 index 0000000..94c3426 --- /dev/null +++ b/spider_module2_reaction_report.json @@ -0,0 +1,72 @@ +{ + "video": "spider.mp4", + "num_events": 8, + "window_before": 1.5, + "window_after": 2.0, + "results": [ + { + "event_time": 0.96, + "reaction_confidence": 0.0, + "basis": "SCENE_DIFF", + "backend": "tasks", + "face_detected_frames": 1, + "total_frames": 71 + }, + { + "event_time": 3.6, + "reaction_confidence": 0.0, + "basis": "SCENE_DIFF", + "backend": "tasks", + "face_detected_frames": 0, + "total_frames": 85 + }, + { + "event_time": 39.12, + "reaction_confidence": 0.4, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 39, + "total_frames": 85 + }, + { + "event_time": 63.6, + "reaction_confidence": 0.4, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 84, + "total_frames": 85 + }, + { + "event_time": 70.8, + "reaction_confidence": 0.4, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 84, + "total_frames": 85 + }, + { + "event_time": 90.96, + "reaction_confidence": 0.65, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 81, + "total_frames": 85 + }, + { + "event_time": 97.68, + "reaction_confidence": 0.2, + "basis": "FACE", + "backend": 
"tasks", + "face_detected_frames": 82, + "total_frames": 84 + }, + { + "event_time": 100.08, + "reaction_confidence": 0.15, + "basis": "FACE", + "backend": "tasks", + "face_detected_frames": 55, + "total_frames": 57 + } + ] +} \ No newline at end of file