
[DMP 2026] Implement Intelligent CC Suggestion Pipeline — Goals 1, 2 & 3#3

Open
Jeevanjot19 wants to merge 5 commits into PlanetRead:main from Jeevanjot19:main

Conversation

@Jeevanjot19


# [DMP 2026] Intelligent CC Suggestion Pipeline — Complete Implementation (Goals 1, 2 & 3)

Closes #2

Why I Built This Before Submitting

When I read this issue, the core problem immediately stood out to me: detecting every sound is easy — deciding which sounds are narratively significant enough to warrant a CC is genuinely hard. A dog barking in the background is irrelevant; the same bark causing a speaker to visibly flinch on screen demands a CC. That distinction requires multi-modal reasoning, and I wanted to prove I could build it before writing a single word of proposal.

So instead of commenting "I'm interested," I built the full pipeline. This PR is that result.


Architecture

The pipeline is three independently testable modules connected by a shared Event dataclass. Each module is swappable without touching the others — the heuristic audio detector and YAMNet detector are interchangeable, as are the OpenCV and MediaPipe visual backends. This modularity was intentional: it means the pipeline can run zero-dependency on any machine today, and swap in better models as they become available.

Input Video / WAV
       │
       ▼
┌─────────────────────────────────────────┐
│  audio.py — Sound Event Detection       │
│  ┌─────────────────┐ ┌───────────────┐  │
│  │ Heuristic (RMS) │ │ YAMNet via    │  │
│  │ adaptive noise  │ │ MediaPipe     │  │
│  │ floor + merging │ │ 521 classes   │  │
│  └────────┬────────┘ └──────┬────────┘  │
│           └────────┬────────┘           │
│        Event candidates                 │
│   {t_start, t_end, class, confidence}  │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  visual.py — Reaction Scoring           │
│  ┌──────────────────┐ ┌──────────────┐  │
│  │ OpenCV frame     │ │ MediaPipe    │  │
│  │ diff + scene-cut │ │ Pose + Face  │  │
│  │ detection        │ │ landmark     │  │
│  └────────┬─────────┘ └──────┬───────┘  │
│           └────────┬─────────┘          │
│          reaction_score ∈ [0, 1]        │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│  pipeline.py — Fusion & Decision        │
│                                         │
│  score = α·audio_conf + β·react_score  │
│  CC if score ≥ θ  ∨  audio ≥ 0.92      │
│               ∨  reaction ≥ 0.88       │
│                                         │
│  label ← taxonomy[audio_class]         │
└────────────────────┬────────────────────┘
                     │
                     ▼
          SRT / SLS / JSON / HTML

Goal 1 — Sound Event Detection (cc_suggester/audio.py)

Heuristic backend (model: heuristic)

The heuristic detector does not use a fixed energy threshold. Instead it computes the median RMS energy across all frames as an adaptive noise floor, then sets the detection threshold as max(config_threshold, noise_floor × noise_ratio). This means the same config works on a quiet interview and a loud action scene without manual tuning.

Key implementation details:

  • Per-frame RMS computed over configurable frame_seconds windows with hop_seconds stride
  • Adjacent high-energy spans merged within gap_tolerance to avoid fragmented events
  • Events shorter than min_event_duration discarded (eliminates transient noise spikes)
  • Classification by duration + peak energy: sharp_impact (short, high energy), sustained_sound (long duration), loud_sound (everything else)
  • Confidence scored as base + energy_normalized_delta, bounded to [0.45, 0.95]
  • Zero external dependencies — runs on standard library WAV only
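
Putting the adaptive threshold and the merging rules together, a minimal sketch (function and parameter names here are illustrative, not the module's actual API, and the default values are placeholders):

```python
import statistics

def detect_loud_spans(rms, frame_seconds=0.05, hop_seconds=0.025,
                      config_threshold=0.02, noise_ratio=2.5,
                      gap_tolerance=0.3, min_event_duration=0.15):
    """Toy heuristic detector: adaptive noise floor, span merging, min-duration filter."""
    # Adaptive noise floor: median per-frame RMS across the whole clip
    noise_floor = statistics.median(rms)
    threshold = max(config_threshold, noise_floor * noise_ratio)

    # Merge consecutive high-energy frames into spans, bridging short gaps
    spans = []
    for i, value in enumerate(rms):
        if value < threshold:
            continue
        t0 = i * hop_seconds
        t1 = t0 + frame_seconds
        if spans and t0 - spans[-1][1] <= gap_tolerance:
            spans[-1][1] = t1  # extend the previous span
        else:
            spans.append([t0, t1])

    # Discard transient spikes shorter than min_event_duration
    return [(start, end) for start, end in spans if end - start >= min_event_duration]
```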

YAMNet backend (model: yamnet)

  • MediaPipe AudioClassifier with yamnet.tflite — 521 AudioSet class labels
  • Speech/silence/music blocklist prevents these from ever becoming CC candidates (the most important filter for the over-captioning problem)
  • Critical fix: timestamps use chunk_idx × hop_seconds instead of result.timestamp_ms — in AUDIO_CLIPS mode, MediaPipe's timestamp field reflects the classify() call time, not the position within the audio, so without this fix all events cluster at near-zero timestamps (sketched below)
  • WebRTC VAD pre-filter (aggressiveness 0–3, configurable) zeros out speech frames before YAMNet inference — hard guarantee that loud speech never becomes a CC candidate even if it bypasses the blocklist
  • Full label taxonomy: 30+ YAMNet class names map directly to CC labels (Gunshot, gunfire → [gunshot], Applause → [applause], Laughter → [laughter], etc.) with [Sound effect] as fallback
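
The timestamp fix above is easiest to see in isolation. A sketch of the chunking loop, with the MediaPipe classifier call abstracted behind a classify_chunk stand-in rather than the real API; the 0.975 s window matches the documented default, while the 0.5 s hop is an assumption:

```python
BLOCKLIST = {"Speech", "Silence", "Music"}  # never allowed to become CC candidates

def yamnet_candidates(samples, sample_rate, classify_chunk,
                      window_seconds=0.975, hop_seconds=0.5):
    """Derive event times from the chunk index, never from result.timestamp_ms."""
    window = int(window_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)
    events = []
    for chunk_idx, start in enumerate(range(0, max(1, len(samples) - window + 1), hop)):
        top = classify_chunk(samples[start:start + window])  # -> (class_name, score) or None
        if top is None or top[0] in BLOCKLIST:
            continue
        # In AUDIO_CLIPS mode the result's timestamp reflects when classify() was
        # called, not the position in the audio, so compute the time ourselves.
        t_start = chunk_idx * hop_seconds
        events.append({"t_start": t_start, "t_end": t_start + window_seconds,
                       "class": top[0], "confidence": top[1]})
    return events
```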

Goal 2 — Speaker Reaction Detection (cc_suggester/visual.py)

Visual analysis runs only on frames within ±context_window seconds of each detected audio event — not the full video. This keeps processing time linear in the number of audio candidates rather than video duration.
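
In other words, frame selection per event is simple arithmetic. A sketch, assuming a constant frame rate (names are illustrative):

```python
def frames_to_score(event_start, event_end, fps, total_frames,
                    context_window=1.0, stride=2):
    """Indices of the frames within ±context_window seconds of one audio event."""
    first = max(0, int((event_start - context_window) * fps))
    last = min(total_frames - 1, int((event_end + context_window) * fps))
    return range(first, last + 1, stride)
```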

OpenCV motion backend (backend: opencv_motion)

  • Frame-diff scoring via mean absolute pixel difference, normalised through sigmoid: score = 2/(1 + exp(-raw)) - 1
  • Sigmoid chosen deliberately over a hard ceiling — it avoids the saturation problem where scene cuts and genuine reactions both scored 1.0 under linear normalisation
  • Scene-cut detection: is_cut = peak_diff > avg_diff × 3.0 — a hard scene cut produces a single extreme frame diff with low average, while genuine motion produces sustained elevated diffs. Cuts are detected and reaction_score discounted 80%, with reaction_type = "scene_cut" and a diagnostic note on the Event
  • Configurable frame stride, resolution downscale (default 64×36 for speed), and context window
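
A sketch of that scoring path using OpenCV and NumPy (the 3.0 cut ratio, 80% discount and 64×36 downscale are the documented defaults; the scaling of the raw diff before the sigmoid is simplified here, and the function name is illustrative):

```python
import cv2
import numpy as np

def motion_reaction(frames, downscale=(64, 36), cut_ratio=3.0, cut_discount=0.8):
    """Frame-diff reaction score with sigmoid normalisation and scene-cut discount."""
    # Frames are assumed to be BGR images, e.g. read via cv2.VideoCapture
    grey = [cv2.resize(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), downscale) for f in frames]
    diffs = [float(np.mean(cv2.absdiff(prev, cur))) for prev, cur in zip(grey, grey[1:])]
    if not diffs:
        return 0.0, None

    peak = max(diffs)
    score = 2.0 / (1.0 + np.exp(-peak)) - 1.0  # sigmoid instead of a hard ceiling

    # A single extreme diff against a low average looks like a hard cut, not a reaction
    if peak > (sum(diffs) / len(diffs)) * cut_ratio:
        return score * (1.0 - cut_discount), "scene_cut"
    return score, "motion"
```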

MediaPipe backend (backend: mediapipe)

  • PoseLandmarker (nose, left shoulder, right shoulder) + FaceLandmarker (chin, lips, eye corners) in IMAGE mode
  • Pose and face landmark sets normalised independently — centroid subtracted, divided by mean spread — before concatenation (sketched below). An earlier version normalised them together, which caused skew when only one detector fired (e.g. face out of frame)
  • Reaction scored as 0.65 × peak_landmark_displacement + 0.35 × max_inter_frame_velocity from baseline frame
  • Three reaction type tiers: landmark_reaction (≥0.65), subtle_landmark_motion (≥0.35), None
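
The independent normalisation and the 0.65/0.35 weighting reduce to a few lines of NumPy. A sketch that assumes the pose and face landmarks have already been extracted as (N, 2) arrays per frame; the exact aggregation over landmarks is simplified and the function names are illustrative:

```python
import numpy as np

def normalise(landmarks: np.ndarray) -> np.ndarray:
    """Centre one landmark set on its centroid and divide by its mean spread."""
    centred = landmarks - landmarks.mean(axis=0)
    spread = float(np.mean(np.linalg.norm(centred, axis=1))) or 1.0
    return centred / spread

def landmark_reaction(pose_frames, face_frames):
    """0.65 * peak displacement from the baseline frame + 0.35 * max inter-frame velocity."""
    # Pose and face are normalised independently, then concatenated per frame
    frames = [np.concatenate([normalise(p), normalise(f)])
              for p, f in zip(pose_frames, face_frames)]
    baseline = frames[0]
    displacement = max(float(np.mean(np.linalg.norm(f - baseline, axis=1))) for f in frames)
    velocity = max((float(np.mean(np.linalg.norm(b - a, axis=1)))
                    for a, b in zip(frames, frames[1:])), default=0.0)
    # Clamp to [0, 1] to match the reaction_score range in the diagram
    return min(1.0, 0.65 * displacement + 0.35 * velocity)
```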

Goal 3 — Fusion Engine & Output (cc_suggester/pipeline.py)

Decision formula

fusion_score  =  α · audio_confidence  +  β · reaction_score

CC accepted  iff  fusion_score  ≥  θ
             OR   audio_confidence  ≥  0.92   (audio override)
             OR   reaction_score    ≥  0.88   (visual override)

Default: α=0.60, β=0.40, θ=0.55. All three values are first-class config parameters — no magic numbers in source code.

The override thresholds exist for unambiguous single-signal cases: a 95% confidence gunshot detection warrants a CC regardless of whether a face is visible in frame. Conversely, a speaker visibly flinching at 90% reaction score warrants investigation even if the audio was ambiguous.
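
As code, the whole decision is a few lines (defaults as documented above; the function name is illustrative):

```python
def fuse(audio_confidence: float, reaction_score: float,
         alpha: float = 0.60, beta: float = 0.40, theta: float = 0.55,
         audio_override: float = 0.92, visual_override: float = 0.88):
    """Weighted fusion with single-signal overrides for unambiguous cases."""
    fusion_score = alpha * audio_confidence + beta * reaction_score
    accept = (fusion_score >= theta
              or audio_confidence >= audio_override
              or reaction_score >= visual_override)
    return fusion_score, accept
```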

Caption quality

  • Duration splitting: events longer than max_caption_duration (default 3 s) are split into equal parts — a professional subtitle convention that avoids a single CC spanning 6+ seconds (see the sketch after this list)
  • Label taxonomy lookup: audio_class → cc_label with fallback to [Sound effect]
  • All split parts inherit the parent event's scores and decision for JSON/HTML auditability
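
The duration split referenced above is the only arithmetic involved. A sketch:

```python
import math

def split_caption(start: float, end: float, max_caption_duration: float = 3.0):
    """Split an event into equal parts no longer than max_caption_duration seconds."""
    parts = max(1, math.ceil((end - start) / max_caption_duration))
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]

# e.g. split_caption(10.0, 17.0) -> three parts of roughly 2.33 s each
```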

Output formats

  • SRT — standards-compliant, direct import into any subtitle editor
  • SLS — PlanetRead's native format with pipe-separated fields including score metadata
  • JSON — full per-event dump: timestamps, audio class, audio confidence, reaction score, reaction type, fusion score, CC decision, CC label, diagnostic notes
  • HTML — professional report with metrics panel, per-event score table, accept/reject badges
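
For reference, the SRT writer needs nothing beyond the standard HH:MM:SS,mmm timestamps. A sketch of such a formatter (not necessarily the one in output.py):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as SRT's HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

# e.g. srt_timestamp(83.5) -> "00:01:23,500"
```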

Evaluation Framework (cc_suggester/eval.py)

Built an IoU-based evaluation framework that directly measures the acceptance criteria from the issue:

  • Overcaption rate = FP / total predictions — target ≤ 10%
  • Undercaption rate = FN / total ground truth
  • Precision, Recall, F1 via IoU matching at configurable threshold (default 0.30)
  • Compliance assessment: PASS/FAIL per criterion with percentage readout
  • Ground truth loaded from CSV (start, end, label columns) — compatible with manual VLC annotation

Usage during coding period: annotate 3–5 Hindi content samples → run grid search over (θ, α, β) → replace defaults with empirically validated values.
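
The core of the evaluation is 1-D interval IoU. A sketch of the metric computation; the greedy matching here is an assumption and may differ from eval.py's exact logic:

```python
def interval_iou(pred, truth):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def caption_metrics(predictions, ground_truth, iou_threshold=0.30):
    """Precision, recall, overcaption and undercaption rates via IoU matching."""
    unmatched = list(ground_truth)
    tp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: interval_iou(pred, gt), default=None)
        if best is not None and interval_iou(pred, best) >= iou_threshold:
            unmatched.remove(best)  # one-to-one matching
            tp += 1
    fp, fn = len(predictions) - tp, len(unmatched)
    return {
        "precision": tp / len(predictions) if predictions else 0.0,
        "recall": tp / len(ground_truth) if ground_truth else 0.0,
        "overcaption_rate": fp / len(predictions) if predictions else 0.0,   # target <= 10%
        "undercaption_rate": fn / len(ground_truth) if ground_truth else 0.0,
    }
```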


Configuration System (cc_suggester/config.py)

Every parameter is a first-class config field. Four ready-to-use profiles ship with the PR:

| Profile | Audio | Visual | Use case |
| --- | --- | --- | --- |
| config/default.json | Heuristic | OpenCV | Zero dependencies, instant demo |
| config/yamnet.json | YAMNet | OpenCV | Better audio classification |
| config/mediapipe.json | Heuristic | MediaPipe | Better reaction detection |
| config/full_ml.json | YAMNet | MediaPipe | Full pipeline, best quality |

The FusionConfig docstring explicitly flags the default thresholds as unvalidated and links to the evaluation workflow — I think it is important to be honest about what has and has not been empirically tested.


Real Video Test Results

Tested on [JUMPER — Suspense Thriller Short Film](https://www.youtube.com/watch?v=VOJsld2_oeI) (3 minutes, English, action-heavy sound design):

| Backend | Candidates | Accepted | Time | Labels |
| --- | --- | --- | --- | --- |
| Heuristic + OpenCV | 27 | 4 | 5.6s | [Loud sound], [Sustained sound] |
| YAMNet + OpenCV | 20 | 2 | 20.9s | Rich class names |

Heuristic accepted events at 0:23, 1:10, 1:46, and 2:54 — spanning the full video, aligned with actual dramatic moments. The fusion engine correctly rejected 23 ambient sound candidates that had audio signal but low visual reaction.

Next step during coding period: repeat on Hindi/regional content samples with ground truth annotation to get real precision/recall numbers and tune thresholds accordingly.


How to Run

# Install
pip install -r requirements.txt

# Zero-dependency demo (no video, no models)
python -m cc_suggester.demo_data --output samples/demo.wav
python -m cc_suggester --input samples/demo.wav \
  --output out/demo.srt \
  --events-json out/events.json \
  --report-html out/report.html

# Real video — heuristic (no model files needed)
python -m cc_suggester --input video.mp4 --output captions.srt

# YAMNet backend
python scripts/download_models.py --select yamnet.tflite
python -m cc_suggester --input video.mp4 --output captions.srt \
  --config config/yamnet.json

# Full ML pipeline
python scripts/download_models.py
python -m cc_suggester --input video.mp4 --output captions.srt \
  --config config/full_ml.json

# Evaluate against annotated ground truth
python -m cc_suggester.eval \
  --predictions out/events.json \
  --ground-truth ground_truth/video.csv \
  --output out/metrics.json

# Editor review dashboard
streamlit run streamlit_app.py

What I Plan to Do During the Coding Period

This PR is a foundation, not a finished product. The work I genuinely want to do during DMP:

  1. Benchmark on real PlanetRead Hindi content — test YAMNet's AudioSet classes against Hindi-specific sounds (dhol, firecrackers, devotional music) and identify gaps where PANNs or a custom fine-tune would help
  2. Ground truth annotation and threshold validation — the fusion thresholds are currently documented as unvalidated defaults; I want to annotate enough real videos to replace them with empirically justified values
  3. Collect editor feedback — the HTML report and Streamlit dashboard exist precisely for this; I want to put them in front of actual PlanetRead accessibility editors and iterate on what makes a CC suggestion useful vs distracting
  4. Evaluate PANNs as a YAMNet alternative — PANNs gives finer-grained classifications that may handle Indian content better; worth a rigorous benchmark
  5. Improve the label taxonomy — the current taxonomy covers ~30 YAMNet classes; a full mapping of all 521 classes to meaningful CC labels (or "ignore") is genuinely useful work

I am not applying to C4GT to pad a resume. PlanetRead's mission — making content accessible to regional-language audiences across India — is something I care about, and the CC annotation problem is a real bottleneck for accessibility editors. I want to build something they can actually use.


C4GT DMP 2026 — PlanetRead Intelligent CC Suggestion Tool
Submitted by Jeevanjot Singh | GitHub: https://github.com/Jeevanjot19

…ad#2)

- Goal 1: audio_detector.py with heuristic (RMS + adaptive noise floor)
  and YAMNet/MediaPipe AudioClassifier backends
- Goal 2: visual.py with OpenCV motion diff and MediaPipe Pose+FaceMesh
  landmark-delta scoring backends
- Goal 3: pipeline.py fusion engine (alpha*audio + beta*visual),
  SRT/SLS/JSON output, HTML report
- eval.py: IoU-based precision/recall/F1/overcaption-rate framework
- config/: YAML/JSON config system, all thresholds tuneable
- Streamlit reviewer dashboard
- 14 pytest tests passing
- Tested on real video (JUMPER short film): heuristic 27->4, YAMNet 20->2

Closes PlanetRead#2
Copilot AI review requested due to automatic review settings May 2, 2026 13:15

Copilot AI left a comment


Pull request overview

This PR adds an end-to-end “Intelligent CC Suggestion” pipeline that detects non-speech audio events, scores visual reactions, fuses both signals into a CC/no-CC decision, and exports results (SRT/SLS/JSON/HTML) with evaluation + review tooling.

Changes:

  • Introduces core pipeline modules (audio, visual, pipeline, output, report) built around a shared Event dataclass and JSON/YAML configuration profiles.
  • Adds evaluation tooling (IoU-based metrics), a Streamlit reviewer dashboard, and multiple scripts for real-video workflows + model/video utilities.
  • Adds pytest coverage for core pipeline behaviors and basic dependency/error-path handling.

Reviewed changes

Copilot reviewed 31 out of 35 changed files in this pull request and generated 10 comments.

Summary per file:

| File | Description |
| --- | --- |
| tests/test_pipeline.py | End-to-end and unit tests for timestamps, pipeline outputs, config override, dependency error paths, and evaluation helpers. |
| scripts/video_utils.py | FFmpeg-based video probing, validation, extraction, and conversion helpers. |
| scripts/test_yamnet_integration.py | Benchmark script comparing heuristic vs YAMNet detection and generating an HTML report. |
| scripts/test_real_videos.py | Real-video workflow runner (validate → extract → pipeline → templates). |
| scripts/run_full_test.py | One-command workflow runner that also generates example ground truth and runs evaluation. |
| scripts/full_test_workflow.ps1 | PowerShell automation for the full workflow on Windows. |
| scripts/download_youtube_videos.py | Utility to download videos/audio via yt-dlp for annotation/eval. |
| scripts/download_models.py | Utility to download optional model assets (YAMNet + MediaPipe tasks). |
| scripts/annotation_tool.py | CLI annotation helper (template + interactive + conversion/merge helpers). |
| requirements.txt | Declares Python dependencies for tests/UI/ML backends. |
| config/yamnet.json | Config profile for YAMNet audio + OpenCV visual scoring. |
| config/mediapipe.json | Config profile for heuristic audio + MediaPipe visual scoring. |
| config/full_ml.json | Config profile for YAMNet audio + MediaPipe visual scoring. |
| config/default.yaml | Default YAML configuration profile. |
| config/default.json | Default JSON configuration profile. |
| cc_suggester/visual.py | Visual scoring backends (OpenCV motion + MediaPipe landmarks) and backend dispatch. |
| cc_suggester/report.py | HTML reporting for events, decisions, and optional metrics panel. |
| cc_suggester/pipeline.py | Orchestrates audio detection, visual scoring, fusion decisions, caption splitting, and output writing + metrics. |
| cc_suggester/output.py | SRT/SLS writers + timestamp formatting + events JSON writer. |
| cc_suggester/media.py | FFmpeg dependency checks and audio extraction to WAV. |
| cc_suggester/event.py | Shared Event dataclass + serialization helpers. |
| cc_suggester/eval.py | IoU-based evaluation CLI + metrics computation helpers. |
| cc_suggester/demo_data.py | Synthetic WAV generator used by tests/demo. |
| cc_suggester/dashboard.py | Streamlit reviewer dashboard data loading + UI. |
| cc_suggester/config.py | Typed config dataclasses + JSON/YAML loader/merging + default taxonomy. |
| cc_suggester/cli.py | Main CLI entry for running the pipeline. |
| cc_suggester/audio.py | Heuristic RMS detector + YAMNet (MediaPipe) detector + VAD filtering + backend dispatch. |
| cc_suggester/__init__.py | Package metadata/version. |
| REAL_VIDEO_TEST_RESULTS.md | Documented real-video validation notes and outputs. |
| REAL_VIDEO_TESTING.md | Step-by-step guide for real-video workflows, annotation, and evaluation. |
| README.md | Project overview, usage, workflows, and documentation links. |
| FFMPEG_SETUP.md | FFmpeg install/setup instructions. |


Comment thread cc_suggester/visual.py
Comment on lines +23 to +27
diffs: list[float] = []
for previous, current in zip(frames, frames[1:]):
import cv2
import numpy as np

Comment thread cc_suggester/eval.py
"false_negative": false_negative,
"precision": round(precision, 3),
"recall": round(recall, 3),
"f1": round(f1, 3),
Comment thread cc_suggester/eval.py
Comment on lines +118 to +140
def _assess_compliance(metrics: dict[str, Any]) -> dict[str, str]:
    """Check if metrics meet proposal acceptance criteria.

    Acceptance Criteria from GitHub issue #2:
    1. Avoid over-captioning -> overcaption_rate should be <= 10%
    2. Detect non-speech audio events -> recall should be >= 80%
    """
    results = {}

    # Criterion 1: Avoid over-captioning (FP rate)
    overcaption = metrics.get("overcaption_rate", 1.0)
    if overcaption <= 0.10:
        results["avoid_overcaption"] = f"PASS ({overcaption:.1%} false positives <= 10% target)"
    else:
        results["avoid_overcaption"] = f"FAIL ({overcaption:.1%} false positives > 10% target)"

    # Criterion 2: Detect events (recall)
    recall = metrics.get("recall", 0.0)
    if recall >= 0.80:
        results["detect_events"] = f"PASS ({recall:.1%} detection rate >= 80% target)"
    else:
        results["detect_events"] = f"WARN ({recall:.1%} detection rate < 80% target)"

Comment on lines +22 to +31
def parse_timestamp(ts_str: str) -> float:
    """Parse HH:MM:SS.mmm format to seconds."""
    try:
        parts = ts_str.split(':')
        hours = int(parts[0])
        minutes = int(parts[1])
        seconds_parts = parts[2].split('.')
        seconds = int(seconds_parts[0])
        milliseconds = int(seconds_parts[1]) if len(seconds_parts) > 1 else 0

Comment thread scripts/video_utils.py
Comment on lines +106 to +115
# Parse FPS: "24 fps", "30000/1001 fps"
fps_match = re.search(r"(\d+\.?\d*)\s*fps", line)
if fps_match:
    fps = float(fps_match.group(1))
else:
    # Try fractional format
    fps_frac = re.search(r"(\d+)/(\d+)\s*fps", line)
    if fps_frac:
        fps = float(fps_frac.group(1)) / float(fps_frac.group(2))
break
Comment thread README.md
Comment on lines +71 to +73
- YAMNet inference window: `config.yamnet_inference_window` (was hardcoded 0.975)
- Motion reaction threshold: `config.reaction_threshold` (was hardcoded 0.4)
- VAD aggressiveness: `config.vad_aggressiveness` (configurable 0-3)
Comment on lines +24 to +30
**Fix:** Moved to `config.yamnet_inference_window`
**Result:** ✅ Configurable via `config/yamnet.json`

### 3. Magic Number (0.4) Threshold Extracted ✓
**Issue:** Hardcoded reaction threshold
**Fix:** Moved to `config.reaction_threshold`
**Result:** ✅ OpenCV motion detection using configurable threshold
Comment thread cc_suggester/pipeline.py
Comment on lines +26 to +41

if not logger.handlers:
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if log_file:
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

Comment on lines +16 to +20
Environment:
- Requires internet connection
- Creates models/ directory if not exists
- Validates checksums after download
"""
Comment on lines +128 to +130
# Download only specific model
python scripts/download_models.py --select yamnet
""",
@Jeevanjot19
Author

@copilot apply changes based on the comments in this thread

@abinash-sketch

Let me know when we can connect.

@Jeevanjot19
Author

Jeevanjot19 commented May 7, 2026 via email


Development

Successfully merging this pull request may close these issues.

[DMP 2026]: Create Intelligent Closed Caption (CC) Suggestion Tool

3 participants