
[DMP 2026] Implement Intelligent CC Suggestion Pipeline — Goals 1, 2 & 3#3

Open
Jeevanjot19 wants to merge 5 commits into PlanetRead:main from Jeevanjot19:main

Conversation

@Jeevanjot19


# [DMP 2026] Intelligent CC Suggestion Pipeline — Complete Implementation (Goals 1, 2 & 3)

Closes #2

Why I Built This Before Submitting

When I read this issue, the core problem immediately stood out to me: detecting every sound is easy — deciding which sounds are narratively significant enough to warrant a CC is genuinely hard. A dog barking in the background is irrelevant; the same bark causing a speaker to visibly flinch on screen demands a CC. That distinction requires multi-modal reasoning, and I wanted to prove I could build it before writing a single word of proposal.

So instead of commenting "I'm interested," I built the full pipeline. This PR is that result.


Architecture

The pipeline is three independently testable modules connected by a shared Event dataclass. Each module is swappable without touching the others — the heuristic audio detector and YAMNet detector are interchangeable, as are the OpenCV and MediaPipe visual backends. This modularity was intentional: it means the pipeline can run zero-dependency on any machine today, and swap in better models as they become available.

Input Video / WAV
       │
       ▼
┌─────────────────────────────────────────┐
│  audio.py — Sound Event Detection       │
│  ┌─────────────────┐ ┌───────────────┐  │
│  │ Heuristic (RMS) │ │ YAMNet via    │  │
│  │ adaptive noise  │ │ MediaPipe     │  │
│  │ floor + merging │ │ 521 classes   │  │
│  └────────┬────────┘ └──────┬────────┘  │
│           └────────┬────────┘           │
│        Event candidates                 │
│   {t_start, t_end, class, confidence}  │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  visual.py — Reaction Scoring           │
│  ┌──────────────────┐ ┌──────────────┐  │
│  │ OpenCV frame     │ │ MediaPipe    │  │
│  │ diff + scene-cut │ │ Pose + Face  │  │
│  │ detection        │ │ landmark     │  │
│  └────────┬─────────┘ └──────┬───────┘  │
│           └────────┬─────────┘          │
│          reaction_score ∈ [0, 1]        │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  pipeline.py — Fusion & Decision        │
│                                         │
│  score = α·audio_conf + β·react_score  │
│  CC if score ≥ θ  ∨  audio ≥ 0.92      │
│               ∨  reaction ≥ 0.88       │
│                                         │
│  label ← taxonomy[audio_class]         │
└────────────────────┬────────────────────┘
                     │
                     ▼
          SRT / SLS / JSON / HTML

Goal 1 — Sound Event Detection (cc_suggester/audio.py)

Heuristic backend (model: heuristic)

The heuristic detector does not use a fixed energy threshold. Instead it computes the median RMS energy across all frames as an adaptive noise floor, then sets the detection threshold as max(config_threshold, noise_floor × noise_ratio). This means the same config works on a quiet interview and a loud action scene without manual tuning.

Key implementation details:

  • Per-frame RMS computed over configurable frame_seconds windows with hop_seconds stride
  • Adjacent high-energy spans merged within gap_tolerance to avoid fragmented events
  • Events shorter than min_event_duration discarded (eliminates transient noise spikes)
  • Classification by duration + peak energy: sharp_impact (short, high energy), sustained_sound (long duration), loud_sound (everything else)
  • Confidence scored as base + energy_normalized_delta, bounded to [0.45, 0.95]
  • Zero external dependencies — runs on standard library WAV only
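
Putting the adaptive threshold and the merging rules together, a minimal sketch (function and parameter names here are illustrative, not the module's actual API, and the default values are placeholders):

```python
import statistics

def detect_loud_spans(rms, frame_seconds=0.05, hop_seconds=0.025,
                      config_threshold=0.02, noise_ratio=2.5,
                      gap_tolerance=0.3, min_event_duration=0.15):
    """Toy heuristic detector: adaptive noise floor, span merging, min-duration filter."""
    # Adaptive noise floor: median per-frame RMS across the whole clip
    noise_floor = statistics.median(rms)
    threshold = max(config_threshold, noise_floor * noise_ratio)

    # Merge consecutive high-energy frames into spans, bridging short gaps
    spans = []
    for i, value in enumerate(rms):
        if value < threshold:
            continue
        t0 = i * hop_seconds
        t1 = t0 + frame_seconds
        if spans and t0 - spans[-1][1] <= gap_tolerance:
            spans[-1][1] = t1  # extend the previous span
        else:
            spans.append([t0, t1])

    # Discard transient spikes shorter than min_event_duration
    return [(start, end) for start, end in spans if end - start >= min_event_duration]
```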

YAMNet backend (model: yamnet)

  • MediaPipe AudioClassifier with yamnet.tflite — 521 AudioSet class labels
  • Speech/silence/music blocklist prevents these from ever becoming CC candidates (the most important filter for the over-captioning problem)
  • Critical fix: timestamps use chunk_idx × hop_seconds instead of result.timestamp_ms — in AUDIO_CLIPS mode, MediaPipe's timestamp field reflects the classify() call time, not the position within the audio, so without this fix all events cluster at near-zero timestamps (sketched below)
  • WebRTC VAD pre-filter (aggressiveness 0–3, configurable) zeros out speech frames before YAMNet inference — hard guarantee that loud speech never becomes a CC candidate even if it bypasses the blocklist
  • Full label taxonomy: 30+ YAMNet class names map directly to CC labels (Gunshot, gunfire → [gunshot], Applause → [applause], Laughter → [laughter], etc.) with [Sound effect] as fallback
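
The timestamp fix above is easiest to see in isolation. A sketch of the chunking loop, with the MediaPipe classifier call abstracted behind a classify_chunk stand-in rather than the real API; the 0.975 s window matches the documented default, while the 0.5 s hop is an assumption:

```python
BLOCKLIST = {"Speech", "Silence", "Music"}  # never allowed to become CC candidates

def yamnet_candidates(samples, sample_rate, classify_chunk,
                      window_seconds=0.975, hop_seconds=0.5):
    """Derive event times from the chunk index, never from result.timestamp_ms."""
    window = int(window_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    events = []
    for chunk_idx, start in enumerate(range(0, max(1, len(samples) - window + 1), hop)):
        top = classify_chunk(samples[start:start + window])  # -> (class_name, score) or None
        if top is None or top[0] in BLOCKLIST:
            continue
        # In AUDIO_CLIPS mode the result's timestamp reflects when classify() was
        # called, not the position in the audio, so compute the time ourselves.
        t_start = chunk_idx * hop_seconds
        events.append({"t_start": t_start, "t_end": t_start + window_seconds,
                       "class": top[0], "confidence": top[1]})
    return events
```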

Goal 2 — Speaker Reaction Detection (cc_suggester/visual.py)

Visual analysis runs only on frames within ±context_window seconds of each detected audio event — not the full video. This keeps processing time linear in the number of audio candidates rather than video duration.
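
In other words, frame selection per event is simple arithmetic. A sketch, assuming a constant frame rate (names are illustrative):

```python
def frames_to_score(event_start, event_end, fps, total_frames,
                    context_window=1.0, stride=2):
    """Indices of the frames within ±context_window seconds of one audio event."""
    first = max(0, int((event_start - context_window) * fps))
    last = min(total_frames - 1, int((event_end + context_window) * fps))
    return range(first, last + 1, stride)
```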

OpenCV motion backend (backend: opencv_motion)

  • Frame-diff scoring via mean absolute pixel difference, normalised through sigmoid: score = 2/(1 + exp(-raw)) - 1
  • Sigmoid chosen deliberately over a hard ceiling — it avoids the saturation problem where scene cuts and genuine reactions both scored 1.0 under linear normalisation
  • Scene-cut detection: is_cut = peak_diff > avg_diff × 3.0 — a hard scene cut produces a single extreme frame diff with low average, while genuine motion produces sustained elevated diffs. Cuts are detected and reaction_score discounted 80%, with reaction_type = "scene_cut" and a diagnostic note on the Event
  • Configurable frame stride, resolution downscale (default 64×36 for speed), and context window
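
A sketch of that scoring path using OpenCV and NumPy (the 3.0 cut ratio, 80% discount and 64×36 downscale are the documented defaults; the scaling of the raw diff before the sigmoid is simplified here, and the function name is illustrative):

```python
import cv2
import numpy as np

def motion_reaction(frames, downscale=(64, 36), cut_ratio=3.0, cut_discount=0.8):
    """Frame-diff reaction score with sigmoid normalisation and scene-cut discount."""
    # Frames are assumed to be BGR images, e.g. read via cv2.VideoCapture
    grey = [cv2.resize(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), downscale) for f in frames]
    diffs = [float(np.mean(cv2.absdiff(prev, cur))) for prev, cur in zip(grey, grey[1:])]
    if not diffs:
        return 0.0, None

    peak = max(diffs)
    score = 2.0 / (1.0 + np.exp(-peak)) - 1.0  # sigmoid instead of a hard ceiling

    # A single extreme diff against a low average looks like a hard cut, not a reaction
    if peak > (sum(diffs) / len(diffs)) * cut_ratio:
        return score * (1.0 - cut_discount), "scene_cut"
    return score, "motion"
```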

MediaPipe backend (backend: mediapipe)

  • PoseLandmarker (nose, left shoulder, right shoulder) + FaceLandmarker (chin, lips, eye corners) in IMAGE mode
  • Pose and face landmark sets normalised independently — centroid subtracted, divided by mean spread — before concatenation (sketched below). An earlier version normalised them together, which caused skew when only one detector fired (e.g. face out of frame)
  • Reaction scored as 0.65 × peak_landmark_displacement + 0.35 × max_inter_frame_velocity from baseline frame
  • Three reaction type tiers: landmark_reaction (≥0.65), subtle_landmark_motion (≥0.35), None
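
The independent normalisation and the 0.65/0.35 weighting reduce to a few lines of NumPy. A sketch that assumes the pose and face landmarks have already been extracted as (N, 2) arrays per frame; the exact aggregation over landmarks is simplified and the function names are illustrative:

```python
import numpy as np

def normalise(landmarks: np.ndarray) -> np.ndarray:
    """Centre one landmark set on its centroid and divide by its mean spread."""
    centred = landmarks - landmarks.mean(axis=0)
    spread = float(np.mean(np.linalg.norm(centred, axis=1))) or 1.0
    return centred / spread

def landmark_reaction(pose_frames, face_frames):
    """0.65 * peak displacement from the baseline frame + 0.35 * max inter-frame velocity."""
    # Pose and face are normalised independently, then concatenated per frame
    frames = [np.concatenate([normalise(p), normalise(f)])
              for p, f in zip(pose_frames, face_frames)]
    baseline = frames[0]
    displacement = max(float(np.mean(np.linalg.norm(f - baseline, axis=1))) for f in frames)
    velocity = max((float(np.mean(np.linalg.norm(b - a, axis=1)))
                    for a, b in zip(frames, frames[1:])), default=0.0)
    # Clamp to [0, 1] to match the reaction_score range in the diagram
    return min(1.0, 0.65 * displacement + 0.35 * velocity)
```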

Goal 3 — Fusion Engine & Output (cc_suggester/pipeline.py)

Decision formula

fusion_score  =  α · audio_confidence  +  β · reaction_score

CC accepted  iff  fusion_score  ≥  θ
             OR   audio_confidence  ≥  0.92   (audio override)
             OR   reaction_score    ≥  0.88   (visual override)

Default: α=0.60, β=0.40, θ=0.55. All three values are first-class config parameters — no magic numbers in source code.

The override thresholds exist for unambiguous single-signal cases: a 95% confidence gunshot detection warrants a CC regardless of whether a face is visible in frame. Conversely, a speaker visibly flinching at 90% reaction score warrants investigation even if the audio was ambiguous.
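
As code, the whole decision is a few lines (defaults as documented above; the function name is illustrative):

```python
def fuse(audio_confidence: float, reaction_score: float,
         alpha: float = 0.60, beta: float = 0.40, theta: float = 0.55,
         audio_override: float = 0.92, visual_override: float = 0.88):
    """Weighted fusion with single-signal overrides for unambiguous cases."""
    fusion_score = alpha * audio_confidence + beta * reaction_score
    accept = (fusion_score >= theta
              or audio_confidence >= audio_override
              or reaction_score >= visual_override)
    return fusion_score, accept
```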

Caption quality

  • Duration splitting: events longer than max_caption_duration (default 3 s) are split into equal parts — a professional subtitle convention that avoids a single CC spanning 6+ seconds (see the sketch after this list)
  • Label taxonomy lookup: audio_class → cc_label with fallback to [Sound effect]
  • All split parts inherit the parent event's scores and decision for JSON/HTML auditability
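
The duration split referenced above is the only arithmetic involved. A sketch:

```python
import math

def split_caption(start: float, end: float, max_caption_duration: float = 3.0):
    """Split an event into equal parts no longer than max_caption_duration seconds."""
    parts = max(1, math.ceil((end - start) / max_caption_duration))
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]

# e.g. split_caption(10.0, 17.0) -> three parts of roughly 2.33 s each
```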

Output formats

  • SRT — standards-compliant, direct import into any subtitle editor
  • SLS — PlanetRead's native format with pipe-separated fields including score metadata
  • JSON — full per-event dump: timestamps, audio class, audio confidence, reaction score, reaction type, fusion score, CC decision, CC label, diagnostic notes
  • HTML — professional report with metrics panel, per-event score table, accept/reject badges
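
For reference, the SRT writer needs nothing beyond the standard HH:MM:SS,mmm timestamps. A sketch of such a formatter (not necessarily the one in output.py):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as SRT's HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

# e.g. srt_timestamp(83.5) -> "00:01:23,500"
```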

Evaluation Framework (cc_suggester/eval.py)

Built an IoU-based evaluation framework that directly measures the acceptance criteria from the issue:

  • Overcaption rate = FP / total predictions — target ≤ 10%
  • Undercaption rate = FN / total ground truth
  • Precision, Recall, F1 via IoU matching at configurable threshold (default 0.30)
  • Compliance assessment: PASS/FAIL per criterion with percentage readout
  • Ground truth loaded from CSV (start, end, label columns) — compatible with manual VLC annotation

Usage during coding period: annotate 3–5 Hindi content samples → run grid search over (θ, α, β) → replace defaults with empirically validated values.
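
The core of the evaluation is 1-D interval IoU. A sketch of the metric computation; the greedy matching here is an assumption and may differ from eval.py's exact logic:

```python
def interval_iou(pred, truth):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def caption_metrics(predictions, ground_truth, iou_threshold=0.30):
    """Precision, recall, overcaption and undercaption rates via IoU matching."""
    unmatched = list(ground_truth)
    tp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: interval_iou(pred, gt), default=None)
        if best is not None and interval_iou(pred, best) >= iou_threshold:
            unmatched.remove(best)  # one-to-one matching
            tp += 1
    fp, fn = len(predictions) - tp, len(unmatched)
    return {
        "precision": tp / len(predictions) if predictions else 0.0,
        "recall": tp / len(ground_truth) if ground_truth else 0.0,
        "overcaption_rate": fp / len(predictions) if predictions else 0.0,   # target <= 10%
        "undercaption_rate": fn / len(ground_truth) if ground_truth else 0.0,
    }
```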


Configuration System (cc_suggester/config.py)

Every parameter is a first-class config field. Four ready-to-use profiles ship with the PR:

| Profile | Audio | Visual | Use case |
| --- | --- | --- | --- |
| config/default.json | Heuristic | OpenCV | Zero dependencies, instant demo |
| config/yamnet.json | YAMNet | OpenCV | Better audio classification |
| config/mediapipe.json | Heuristic | MediaPipe | Better reaction detection |
| config/full_ml.json | YAMNet | MediaPipe | Full pipeline, best quality |

The FusionConfig docstring explicitly flags the default thresholds as unvalidated and links to the evaluation workflow — I think it is important to be honest about what has and has not been empirically tested.


Real Video Test Results

Tested on [JUMPER — Suspense Thriller Short Film](https://www.youtube.com/watch?v=VOJsld2_oeI) (3 minutes, English, action-heavy sound design):

| Backend | Candidates | Accepted | Time | Labels |
| --- | --- | --- | --- | --- |
| Heuristic + OpenCV | 27 | 4 | 5.6s | [Loud sound], [Sustained sound] |
| YAMNet + OpenCV | 20 | 2 | 20.9s | Rich class names |

Heuristic accepted events at 0:23, 1:10, 1:46, and 2:54 — spanning the full video, aligned with actual dramatic moments. The fusion engine correctly rejected 23 ambient sound candidates that had audio signal but low visual reaction.

Next step during coding period: repeat on Hindi/regional content samples with ground truth annotation to get real precision/recall numbers and tune thresholds accordingly.


How to Run

# Install
pip install -r requirements.txt

# Zero-dependency demo (no video, no models)
python -m cc_suggester.demo_data --output samples/demo.wav
python -m cc_suggester --input samples/demo.wav \
  --output out/demo.srt \
  --events-json out/events.json \
  --report-html out/report.html

# Real video — heuristic (no model files needed)
python -m cc_suggester --input video.mp4 --output captions.srt

# YAMNet backend
python scripts/download_models.py --select yamnet.tflite
python -m cc_suggester --input video.mp4 --output captions.srt \
  --config config/yamnet.json

# Full ML pipeline
python scripts/download_models.py
python -m cc_suggester --input video.mp4 --output captions.srt \
  --config config/full_ml.json

# Evaluate against annotated ground truth
python -m cc_suggester.eval \
  --predictions out/events.json \
  --ground-truth ground_truth/video.csv \
  --output out/metrics.json

# Editor review dashboard
streamlit run streamlit_app.py

What I Plan to Do During the Coding Period

This PR is a foundation, not a finished product. The work I genuinely want to do during DMP:

  1. Benchmark on real PlanetRead Hindi content — test YAMNet's AudioSet classes against Hindi-specific sounds (dhol, firecrackers, devotional music) and identify gaps where PANNs or a custom fine-tune would help
  2. Ground truth annotation and threshold validation — the fusion thresholds are currently documented as unvalidated defaults; I want to annotate enough real videos to replace them with empirically justified values
  3. Collect editor feedback — the HTML report and Streamlit dashboard exist precisely for this; I want to put them in front of actual PlanetRead accessibility editors and iterate on what makes a CC suggestion useful vs distracting
  4. Evaluate PANNs as a YAMNet alternative — PANNs gives finer-grained classifications that may handle Indian content better; worth a rigorous benchmark
  5. Improve the label taxonomy — the current taxonomy covers ~30 YAMNet classes; a full mapping of all 521 classes to meaningful CC labels (or "ignore") is genuinely useful work

I am not applying to C4GT to pad a resume. PlanetRead's mission — making content accessible to regional-language audiences across India — is something I care about, and the CC annotation problem is a real bottleneck for accessibility editors. I want to build something they can actually use.


C4GT DMP 2026 — PlanetRead Intelligent CC Suggestion Tool
Submitted by Jeevanjot Singh | GitHub: https://github.com/Jeevanjot19

…ad#2)

- Goal 1: audio_detector.py with heuristic (RMS + adaptive noise floor)
  and YAMNet/MediaPipe AudioClassifier backends
- Goal 2: visual.py with OpenCV motion diff and MediaPipe Pose+FaceMesh
  landmark-delta scoring backends
- Goal 3: pipeline.py fusion engine (alpha*audio + beta*visual),
  SRT/SLS/JSON output, HTML report
- eval.py: IoU-based precision/recall/F1/overcaption-rate framework
- config/: YAML/JSON config system, all thresholds tuneable
- Streamlit reviewer dashboard
- 14 pytest tests passing
- Tested on real video (JUMPER short film): heuristic 27->4, YAMNet 20->2

Closes PlanetRead#2
Copilot AI review requested due to automatic review settings May 2, 2026 13:15

Copilot AI left a comment


Pull request overview

This PR adds an end-to-end “Intelligent CC Suggestion” pipeline that detects non-speech audio events, scores visual reactions, fuses both signals into a CC/no-CC decision, and exports results (SRT/SLS/JSON/HTML) with evaluation + review tooling.

Changes:

  • Introduces core pipeline modules (audio, visual, pipeline, output, report) built around a shared Event dataclass and JSON/YAML configuration profiles.
  • Adds evaluation tooling (IoU-based metrics), a Streamlit reviewer dashboard, and multiple scripts for real-video workflows + model/video utilities.
  • Adds pytest coverage for core pipeline behaviors and basic dependency/error-path handling.

Reviewed changes

Copilot reviewed 31 out of 35 changed files in this pull request and generated 10 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/test_pipeline.py | End-to-end and unit tests for timestamps, pipeline outputs, config override, dependency error paths, and evaluation helpers. |
| scripts/video_utils.py | FFmpeg-based video probing, validation, extraction, and conversion helpers. |
| scripts/test_yamnet_integration.py | Benchmark script comparing heuristic vs YAMNet detection and generating an HTML report. |
| scripts/test_real_videos.py | Real-video workflow runner (validate → extract → pipeline → templates). |
| scripts/run_full_test.py | One-command workflow runner that also generates example ground truth and runs evaluation. |
| scripts/full_test_workflow.ps1 | PowerShell automation for the full workflow on Windows. |
| scripts/download_youtube_videos.py | Utility to download videos/audio via yt-dlp for annotation/eval. |
| scripts/download_models.py | Utility to download optional model assets (YAMNet + MediaPipe tasks). |
| scripts/annotation_tool.py | CLI annotation helper (template + interactive + conversion/merge helpers). |
| requirements.txt | Declares Python dependencies for tests/UI/ML backends. |
| config/yamnet.json | Config profile for YAMNet audio + OpenCV visual scoring. |
| config/mediapipe.json | Config profile for heuristic audio + MediaPipe visual scoring. |
| config/full_ml.json | Config profile for YAMNet audio + MediaPipe visual scoring. |
| config/default.yaml | Default YAML configuration profile. |
| config/default.json | Default JSON configuration profile. |
| cc_suggester/visual.py | Visual scoring backends (OpenCV motion + MediaPipe landmarks) and backend dispatch. |
| cc_suggester/report.py | HTML reporting for events, decisions, and optional metrics panel. |
| cc_suggester/pipeline.py | Orchestrates audio detection, visual scoring, fusion decisions, caption splitting, and output writing + metrics. |
| cc_suggester/output.py | SRT/SLS writers + timestamp formatting + events JSON writer. |
| cc_suggester/media.py | FFmpeg dependency checks and audio extraction to WAV. |
| cc_suggester/event.py | Shared Event dataclass + serialization helpers. |
| cc_suggester/eval.py | IoU-based evaluation CLI + metrics computation helpers. |
| cc_suggester/demo_data.py | Synthetic WAV generator used by tests/demo. |
| cc_suggester/dashboard.py | Streamlit reviewer dashboard data loading + UI. |
| cc_suggester/config.py | Typed config dataclasses + JSON/YAML loader/merging + default taxonomy. |
| cc_suggester/cli.py | Main CLI entry for running the pipeline. |
| cc_suggester/audio.py | Heuristic RMS detector + YAMNet (MediaPipe) detector + VAD filtering + backend dispatch. |
| cc_suggester/__init__.py | Package metadata/version. |
| REAL_VIDEO_TEST_RESULTS.md | Documented real-video validation notes and outputs. |
| REAL_VIDEO_TESTING.md | Step-by-step guide for real-video workflows, annotation, and evaluation. |
| README.md | Project overview, usage, workflows, and documentation links. |
| FFMPEG_SETUP.md | FFmpeg install/setup instructions. |


Comment thread cc_suggester/visual.py
Comment on lines +23 to +27
diffs: list[float] = []
for previous, current in zip(frames, frames[1:]):
import cv2
import numpy as np

Comment thread cc_suggester/eval.py
"false_negative": false_negative,
"precision": round(precision, 3),
"recall": round(recall, 3),
"f1": round(f1, 3),
Comment thread cc_suggester/eval.py
Comment on lines +118 to +140
def _assess_compliance(metrics: dict[str, Any]) -> dict[str, str]:
    """Check if metrics meet proposal acceptance criteria.

    Acceptance Criteria from GitHub issue #2:
    1. Avoid over-captioning -> overcaption_rate should be <= 10%
    2. Detect non-speech audio events -> recall should be >= 80%
    """
    results = {}

    # Criterion 1: Avoid over-captioning (FP rate)
    overcaption = metrics.get("overcaption_rate", 1.0)
    if overcaption <= 0.10:
        results["avoid_overcaption"] = f"PASS ({overcaption:.1%} false positives <= 10% target)"
    else:
        results["avoid_overcaption"] = f"FAIL ({overcaption:.1%} false positives > 10% target)"

    # Criterion 2: Detect events (recall)
    recall = metrics.get("recall", 0.0)
    if recall >= 0.80:
        results["detect_events"] = f"PASS ({recall:.1%} detection rate >= 80% target)"
    else:
        results["detect_events"] = f"WARN ({recall:.1%} detection rate < 80% target)"

Comment on lines +22 to +31
def parse_timestamp(ts_str: str) -> float:
    """Parse HH:MM:SS.mmm format to seconds."""
    try:
        parts = ts_str.split(':')
        hours = int(parts[0])
        minutes = int(parts[1])
        seconds_parts = parts[2].split('.')
        seconds = int(seconds_parts[0])
        milliseconds = int(seconds_parts[1]) if len(seconds_parts) > 1 else 0

Comment thread scripts/video_utils.py
Comment on lines +106 to +115
# Parse FPS: "24 fps", "30000/1001 fps"
fps_match = re.search(r"(\d+\.?\d*)\s*fps", line)
if fps_match:
    fps = float(fps_match.group(1))
else:
    # Try fractional format
    fps_frac = re.search(r"(\d+)/(\d+)\s*fps", line)
    if fps_frac:
        fps = float(fps_frac.group(1)) / float(fps_frac.group(2))
break
Comment thread README.md
Comment on lines +71 to +73
- YAMNet inference window: `config.yamnet_inference_window` (was hardcoded 0.975)
- Motion reaction threshold: `config.reaction_threshold` (was hardcoded 0.4)
- VAD aggressiveness: `config.vad_aggressiveness` (configurable 0-3)
Comment on lines +24 to +30
**Fix:** Moved to `config.yamnet_inference_window`
**Result:** ✅ Configurable via `config/yamnet.json`

### 3. Magic Number (0.4) Threshold Extracted ✓
**Issue:** Hardcoded reaction threshold
**Fix:** Moved to `config.reaction_threshold`
**Result:** ✅ OpenCV motion detection using configurable threshold
Comment thread cc_suggester/pipeline.py
Comment on lines +26 to +41

if not logger.handlers:
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if log_file:
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

Comment on lines +16 to +20
Environment:
- Requires internet connection
- Creates models/ directory if not exists
- Validates checksums after download
"""
Comment on lines +128 to +130
# Download only specific model
python scripts/download_models.py --select yamnet
""",
@Jeevanjot19
Author

@copilot apply changes based on the comments in this thread

@abinash-sketch

Let me know when we can connect.

@Jeevanjot19
Author

Jeevanjot19 commented May 7, 2026 via email


Development

Successfully merging this pull request may close these issues.

[DMP 2026]: Create Intelligent Closed Caption (CC) Suggestion Tool

3 participants