[DMP 2026] Implement Intelligent CC Suggestion Pipeline — Goals 1, 2 & 3 #3
Open
Jeevanjot19 wants to merge 5 commits into
Conversation
…ad#2)
- Goal 1: audio_detector.py with heuristic (RMS + adaptive noise floor) and YAMNet/MediaPipe AudioClassifier backends
- Goal 2: visual.py with OpenCV motion diff and MediaPipe Pose+FaceMesh landmark-delta scoring backends
- Goal 3: pipeline.py fusion engine (alpha*audio + beta*visual), SRT/SLS/JSON output, HTML report
- eval.py: IoU-based precision/recall/F1/overcaption-rate framework
- config/: YAML/JSON config system, all thresholds tuneable
- Streamlit reviewer dashboard
- 14 pytest tests passing
- Tested on real video (JUMPER short film): heuristic 27->4, YAMNet 20->2

Closes PlanetRead#2
Pull request overview
This PR adds an end-to-end “Intelligent CC Suggestion” pipeline that detects non-speech audio events, scores visual reactions, fuses both signals into a CC/no-CC decision, and exports results (SRT/SLS/JSON/HTML) with evaluation + review tooling.
Changes:
- Introduces core pipeline modules (`audio`, `visual`, `pipeline`, `output`, `report`) built around a shared `Event` dataclass and JSON/YAML configuration profiles.
- Adds evaluation tooling (IoU-based metrics), a Streamlit reviewer dashboard, and multiple scripts for real-video workflows + model/video utilities.
- Adds pytest coverage for core pipeline behaviors and basic dependency/error-path handling.
Reviewed changes
Copilot reviewed 31 out of 35 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/test_pipeline.py | End-to-end and unit tests for timestamps, pipeline outputs, config override, dependency error paths, and evaluation helpers. |
| scripts/video_utils.py | FFmpeg-based video probing, validation, extraction, and conversion helpers. |
| scripts/test_yamnet_integration.py | Benchmark script comparing heuristic vs YAMNet detection and generating an HTML report. |
| scripts/test_real_videos.py | Real-video workflow runner (validate → extract → pipeline → templates). |
| scripts/run_full_test.py | One-command workflow runner that also generates example ground truth and runs evaluation. |
| scripts/full_test_workflow.ps1 | PowerShell automation for the full workflow on Windows. |
| scripts/download_youtube_videos.py | Utility to download videos/audio via yt-dlp for annotation/eval. |
| scripts/download_models.py | Utility to download optional model assets (YAMNet + MediaPipe tasks). |
| scripts/annotation_tool.py | CLI annotation helper (template + interactive + conversion/merge helpers). |
| requirements.txt | Declares Python dependencies for tests/UI/ML backends. |
| config/yamnet.json | Config profile for YAMNet audio + OpenCV visual scoring. |
| config/mediapipe.json | Config profile for heuristic audio + MediaPipe visual scoring. |
| config/full_ml.json | Config profile for YAMNet audio + MediaPipe visual scoring. |
| config/default.yaml | Default YAML configuration profile. |
| config/default.json | Default JSON configuration profile. |
| cc_suggester/visual.py | Visual scoring backends (OpenCV motion + MediaPipe landmarks) and backend dispatch. |
| cc_suggester/report.py | HTML reporting for events, decisions, and optional metrics panel. |
| cc_suggester/pipeline.py | Orchestrates audio detection, visual scoring, fusion decisions, caption splitting, and output writing + metrics. |
| cc_suggester/output.py | SRT/SLS writers + timestamp formatting + events JSON writer. |
| cc_suggester/media.py | FFmpeg dependency checks and audio extraction to WAV. |
| cc_suggester/event.py | Shared Event dataclass + serialization helpers. |
| cc_suggester/eval.py | IoU-based evaluation CLI + metrics computation helpers. |
| cc_suggester/demo_data.py | Synthetic WAV generator used by tests/demo. |
| cc_suggester/dashboard.py | Streamlit reviewer dashboard data loading + UI. |
| cc_suggester/config.py | Typed config dataclasses + JSON/YAML loader/merging + default taxonomy. |
| cc_suggester/cli.py | Main CLI entry for running the pipeline. |
| cc_suggester/audio.py | Heuristic RMS detector + YAMNet (MediaPipe) detector + VAD filtering + backend dispatch. |
| cc_suggester/__init__.py | Package metadata/version. |
| REAL_VIDEO_TEST_RESULTS.md | Documented real-video validation notes and outputs. |
| REAL_VIDEO_TESTING.md | Step-by-step guide for real-video workflows, annotation, and evaluation. |
| README.md | Project overview, usage, workflows, and documentation links. |
| FFMPEG_SETUP.md | FFmpeg install/setup instructions. |
Comment on lines +23 to +27

```python
diffs: list[float] = []
for previous, current in zip(frames, frames[1:]):
import cv2
import numpy as np
```
```python
"false_negative": false_negative,
"precision": round(precision, 3),
"recall": round(recall, 3),
"f1": round(f1, 3),
```
Comment on lines +118 to +140

```python
def _assess_compliance(metrics: dict[str, Any]) -> dict[str, str]:
    """Check if metrics meet proposal acceptance criteria.

    Acceptance Criteria from GitHub issue #2:
    1. Avoid over-captioning -> overcaption_rate should be <= 10%
    2. Detect non-speech audio events -> recall should be >= 80%
    """
    results = {}

    # Criterion 1: Avoid over-captioning (FP rate)
    overcaption = metrics.get("overcaption_rate", 1.0)
    if overcaption <= 0.10:
        results["avoid_overcaption"] = f"PASS ({overcaption:.1%} false positives <= 10% target)"
    else:
        results["avoid_overcaption"] = f"FAIL ({overcaption:.1%} false positives > 10% target)"

    # Criterion 2: Detect events (recall)
    recall = metrics.get("recall", 0.0)
    if recall >= 0.80:
        results["detect_events"] = f"PASS ({recall:.1%} detection rate >= 80% target)"
    else:
        results["detect_events"] = f"WARN ({recall:.1%} detection rate < 80% target)"
```
Comment on lines +22 to +31

```python
def parse_timestamp(ts_str: str) -> float:
    """Parse HH:MM:SS.mmm format to seconds."""
    try:
        parts = ts_str.split(':')
        hours = int(parts[0])
        minutes = int(parts[1])
        seconds_parts = parts[2].split('.')
        seconds = int(seconds_parts[0])
        milliseconds = int(seconds_parts[1]) if len(seconds_parts) > 1 else 0
```
Comment on lines +106 to +115

```python
# Parse FPS: "24 fps", "30000/1001 fps"
fps_match = re.search(r"(\d+\.?\d*)\s*fps", line)
if fps_match:
    fps = float(fps_match.group(1))
else:
    # Try fractional format
    fps_frac = re.search(r"(\d+)/(\d+)\s*fps", line)
    if fps_frac:
        fps = float(fps_frac.group(1)) / float(fps_frac.group(2))
break
```
Comment on lines +71 to +73

```markdown
- YAMNet inference window: `config.yamnet_inference_window` (was hardcoded 0.975)
- Motion reaction threshold: `config.reaction_threshold` (was hardcoded 0.4)
- VAD aggressiveness: `config.vad_aggressiveness` (configurable 0-3)
```
Comment on lines +24 to +30

```markdown
**Fix:** Moved to `config.yamnet_inference_window`
**Result:** ✅ Configurable via `config/yamnet.json`

### 3. Magic Number (0.4) Threshold Extracted ✓
**Issue:** Hardcoded reaction threshold
**Fix:** Moved to `config.reaction_threshold`
**Result:** ✅ OpenCV motion detection using configurable threshold
```
Comment on lines +26 to +41

```python
if not logger.handlers:
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if log_file:
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
```
Comment on lines +16 to +20

```python
    Environment:
    - Requires internet connection
    - Creates models/ directory if not exists
    - Validates checksums after download
    """
```
Comment on lines +128 to +130

```python
    # Download only specific model
    python scripts/download_models.py --select yamnet
    """,
```
Author (Jeevanjot19)

@copilot apply changes based on the comments in this thread

abinash-sketch

let me know when can we connect.
Author (Jeevanjot19)
We can connect anytime after 3 pm today
# [DMP 2026] Intelligent CC Suggestion Pipeline — Complete Implementation (Goals 1, 2 & 3)

Closes #2
## Why I Built This Before Submitting
When I read this issue, the core problem immediately stood out to me: detecting every sound is easy — deciding which sounds are narratively significant enough to warrant a CC is genuinely hard. A dog barking in the background is irrelevant; the same bark causing a speaker to visibly flinch on screen demands a CC. That distinction requires multi-modal reasoning, and I wanted to prove I could build it before writing a single word of proposal.
So instead of commenting "I'm interested," I built the full pipeline. This PR is that result.
## Architecture
The pipeline is three independently testable modules connected by a shared `Event` dataclass. Each module is swappable without touching the others — the heuristic audio detector and YAMNet detector are interchangeable, as are the OpenCV and MediaPipe visual backends. This modularity was intentional: it means the pipeline can run zero-dependency on any machine today, and swap in better models as they become available.

## Goal 1 — Sound Event Detection (`cc_suggester/audio.py`)
### Heuristic backend (`model: heuristic`)

The heuristic detector does not use a fixed energy threshold. Instead it computes the median RMS energy across all frames as an adaptive noise floor, then sets the detection threshold as `max(config_threshold, noise_floor × noise_ratio)`. This means the same config works on a quiet interview and a loud action scene without manual tuning.

Key implementation details:
- RMS is computed over `frame_seconds` windows with `hop_seconds` stride
- Adjacent detections within `gap_tolerance` are merged to avoid fragmented events
- Events shorter than `min_event_duration` are discarded (eliminates transient noise spikes)
- Classification: `sharp_impact` (short, high energy), `sustained_sound` (long duration), `loud_sound` (everything else)
- Confidence: `base + energy_normalized_delta`, bounded to [0.45, 0.95]
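A minimal sketch of the adaptive-threshold idea (illustrative only — the parameter names mirror the description above, not necessarily the exact code in `cc_suggester/audio.py`):

```python
import numpy as np

def detect_loud_frames(samples: np.ndarray, sr: int,
                       frame_seconds: float = 0.5,
                       hop_seconds: float = 0.25,
                       config_threshold: float = 0.02,
                       noise_ratio: float = 2.0) -> list[tuple[float, float]]:
    """Return (start_time, rms) pairs for frames above an adaptive threshold."""
    frame_len = int(frame_seconds * sr)
    hop_len = int(hop_seconds * sr)
    rms_values = []
    for start in range(0, max(len(samples) - frame_len, 1), hop_len):
        frame = samples[start:start + frame_len]
        rms_values.append((start / sr, float(np.sqrt(np.mean(frame ** 2)))))
    # The median RMS of the whole clip acts as the adaptive noise floor.
    noise_floor = float(np.median([r for _, r in rms_values]))
    threshold = max(config_threshold, noise_floor * noise_ratio)
    return [(t, r) for t, r in rms_values if r >= threshold]
```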
### YAMNet backend (`model: yamnet`)

- MediaPipe AudioClassifier with `yamnet.tflite` — 521 AudioSet class labels
- Event timestamps computed as `chunk_idx × hop_seconds` instead of `result.timestamp_ms` — in AUDIO_CLIPS mode, MediaPipe's timestamp field reflects the classify() call time, not the position within the audio, causing all events to cluster at near-zero timestamps without this fix
- AudioSet labels mapped to CC text (`Gunshot, gunfire` → `[gunshot]`, `Applause` → `[applause]`, `Laughter` → `[laughter]`, etc.) with `[Sound effect]` as fallback
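To make the timestamp fix concrete, here is a hedged sketch of the chunk-index arithmetic; `classify_chunk` is a placeholder for the real YAMNet/MediaPipe inference call, not an actual API:

```python
# Hypothetical sketch: deriving event timestamps from the chunk index.
def events_from_chunks(audio_chunks, hop_seconds: float, score_threshold: float):
    events = []
    for chunk_idx, chunk in enumerate(audio_chunks):
        label, score = classify_chunk(chunk)  # placeholder for YAMNet inference
        if score < score_threshold:
            continue
        # Correct: position of this chunk within the audio stream.
        start = chunk_idx * hop_seconds
        # Incorrect (in AUDIO_CLIPS mode): result.timestamp_ms reflects when
        # classify() was called, so every event would land near 0:00.
        events.append({"start": start, "label": label, "score": score})
    return events
```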
## Goal 2 — Speaker Reaction Detection (`cc_suggester/visual.py`)

Visual analysis runs only on frames within ±`context_window` seconds of each detected audio event — not the full video. This keeps processing time linear in the number of audio candidates rather than video duration.
### OpenCV motion backend (`backend: opencv_motion`)
- Reaction score squashed with `score = 2/(1 + exp(-raw)) - 1`
- Scene-cut guard: `is_cut = peak_diff > avg_diff × 3.0` — a hard scene cut produces a single extreme frame diff with low average, while genuine motion produces sustained elevated diffs. Cuts are detected and `reaction_score` discounted 80%, with `reaction_type = "scene_cut"` and a diagnostic note on the Event
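A rough sketch of the motion scoring and scene-cut discount using standard OpenCV/NumPy calls — the exact `raw` aggregation in `visual.py` may differ; this only illustrates the formulas quoted above:

```python
import cv2
import numpy as np

def motion_reaction_score(frames: list[np.ndarray]) -> tuple[float, bool]:
    """Score motion in a window of BGR frames; flag probable scene cuts."""
    diffs = []
    for previous, current in zip(frames, frames[1:]):
        prev_gray = cv2.cvtColor(previous, cv2.COLOR_BGR2GRAY)
        curr_gray = cv2.cvtColor(current, cv2.COLOR_BGR2GRAY)
        diffs.append(float(np.mean(cv2.absdiff(prev_gray, curr_gray))) / 255.0)
    if not diffs:
        return 0.0, False
    avg_diff, peak_diff = float(np.mean(diffs)), float(np.max(diffs))
    # One extreme diff against a low average looks like a hard cut,
    # whereas genuine motion keeps diffs elevated across the window.
    is_cut = peak_diff > avg_diff * 3.0
    raw = sum(diffs)  # assumption: aggregate motion as the sum of frame diffs
    score = 2.0 / (1.0 + np.exp(-raw)) - 1.0  # squash into [0, 1)
    if is_cut:
        score *= 0.2  # discount scene cuts by 80%
    return float(score), is_cut
```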
### MediaPipe backend (`backend: mediapipe`)

- Reaction score: `0.65 × peak_landmark_displacement + 0.35 × max_inter_frame_velocity`, measured from the baseline frame
- Reaction types: `landmark_reaction` (≥ 0.65), `subtle_landmark_motion` (≥ 0.35), otherwise `None`
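For illustration, the landmark-delta arithmetic can be expressed as below; obtaining the landmark arrays from MediaPipe Pose/FaceMesh is omitted, and the array shape is an assumption:

```python
import numpy as np

def landmark_reaction_score(landmark_frames: np.ndarray) -> float:
    """landmark_frames: (num_frames, num_landmarks, 2) normalized coordinates."""
    baseline = landmark_frames[0]
    # Peak displacement of any landmark relative to the baseline frame.
    displacement = np.linalg.norm(landmark_frames - baseline, axis=-1)
    peak_displacement = float(displacement.max())
    # Maximum inter-frame velocity (landmark movement between consecutive frames).
    velocity = np.linalg.norm(np.diff(landmark_frames, axis=0), axis=-1)
    max_velocity = float(velocity.max()) if len(velocity) else 0.0
    return 0.65 * peak_displacement + 0.35 * max_velocity
```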
## Goal 3 — Fusion Engine & Output (`cc_suggester/pipeline.py`)

### Decision formula
Default: α=0.60, β=0.40, θ=0.55. All three values are first-class config parameters — no magic numbers in source code.
The override thresholds exist for unambiguous single-signal cases: a 95% confidence gunshot detection warrants a CC regardless of whether a face is visible in frame. Conversely, a speaker visibly flinching at 90% reaction score warrants investigation even if the audio was ambiguous.
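A sketch of the fusion rule implied by the description above (the override parameter names and exact override behaviour are assumptions — in the real pipeline the high-visual case may be flagged for review rather than auto-accepted):

```python
def decide_cc(audio_confidence: float, reaction_score: float,
              alpha: float = 0.60, beta: float = 0.40, theta: float = 0.55,
              audio_override: float = 0.95, visual_override: float = 0.90) -> bool:
    """Fuse audio and visual signals into a single CC / no-CC decision."""
    # Single-signal overrides for unambiguous cases (e.g. a near-certain gunshot).
    if audio_confidence >= audio_override or reaction_score >= visual_override:
        return True
    fused = alpha * audio_confidence + beta * reaction_score
    return fused >= theta
```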
### Caption quality
- Events longer than `max_caption_duration` (default 3 s) are split into equal parts — professional subtitle standard, avoids a single CC spanning 6+ seconds
- Labels mapped via `audio_class → cc_label` with fallback to `[Sound effect]`

### Output formats

SRT, SLS, and events JSON files, plus an HTML review report (as listed in the PR overview above).
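Illustrative helpers for the caption-splitting rule and the SRT timestamp format (function names are mine, not necessarily those in `output.py`):

```python
def split_event(start: float, end: float, max_caption_duration: float = 3.0):
    """Split a long event into equal-length captions no longer than the max."""
    duration = end - start
    parts = max(1, int(-(-duration // max_caption_duration)))  # ceiling division
    step = duration / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]

def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

# Example: a 7-second event becomes three ~2.33 s captions.
print(split_event(10.0, 17.0))
print(srt_timestamp(83.25))  # -> 00:01:23,250
```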
## Evaluation Framework (`cc_suggester/eval.py`)
Built an IoU-based evaluation framework that directly measures the acceptance criteria from the issue: over-captioning rate ≤ 10% and non-speech event recall ≥ 80%.
Usage during coding period: annotate 3–5 Hindi content samples → run grid search over (θ, α, β) → replace defaults with empirically validated values.
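A compact sketch of IoU-based interval matching and the derived metrics — the exact matching strategy in `eval.py` may differ, but the metric names follow the ones reported by the tool:

```python
def interval_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(predicted, ground_truth, iou_threshold: float = 0.5) -> dict:
    """Greedy one-to-one matching of predictions to ground truth by IoU."""
    matched, true_positive = set(), 0
    for pred in predicted:
        best = max(
            ((i, interval_iou(pred, gt)) for i, gt in enumerate(ground_truth) if i not in matched),
            key=lambda pair: pair[1], default=(None, 0.0),
        )
        if best[0] is not None and best[1] >= iou_threshold:
            matched.add(best[0])
            true_positive += 1
    false_positive = len(predicted) - true_positive
    false_negative = len(ground_truth) - true_positive
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    overcaption_rate = false_positive / len(predicted) if predicted else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "overcaption_rate": overcaption_rate,
            "false_positive": false_positive, "false_negative": false_negative}
```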
## Configuration System (`cc_suggester/config.py`)
Every parameter is a first-class config field. Four ready-to-use profiles ship with the PR:
- `config/default.json` — default (heuristic audio + OpenCV visual)
- `config/yamnet.json` — YAMNet audio + OpenCV visual
- `config/mediapipe.json` — heuristic audio + MediaPipe visual
- `config/full_ml.json` — YAMNet audio + MediaPipe visual
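For illustration, a profile could look roughly like this (field names and nesting are indicative only — `cc_suggester/config.py` defines the authoritative schema):

```json
{
  "audio": {
    "model": "yamnet",
    "yamnet_inference_window": 0.975,
    "vad_aggressiveness": 2
  },
  "visual": {
    "backend": "opencv_motion",
    "reaction_threshold": 0.4,
    "context_window": 2.0
  },
  "fusion": {
    "alpha": 0.60,
    "beta": 0.40,
    "theta": 0.55
  },
  "output": {
    "max_caption_duration": 3.0
  }
}
```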
The `FusionConfig` docstring explicitly flags the default thresholds as unvalidated and links to the evaluation workflow — I think it is important to be honest about what has and has not been empirically tested.

## Real Video Test Results
Tested on [JUMPER — Suspense Thriller Short Film](https://www.youtube.com/watch?v=VOJsld2_oeI) (3 minutes, English, action-heavy sound design): the heuristic profile narrowed 27 audio candidates down to 4 accepted CCs, and the YAMNet profile narrowed 20 down to 2.
Heuristic accepted events at 0:23, 1:10, 1:46, and 2:54 — spanning the full video, aligned with actual dramatic moments. The fusion engine correctly rejected 23 ambient sound candidates that had audio signal but low visual reaction.
Next step during coding period: repeat on Hindi/regional content samples with ground truth annotation to get real precision/recall numbers and tune thresholds accordingly.
## How to Run
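The authoritative flags live in `cc_suggester/cli.py`; the commands below are a sketch of the intended workflow, and the flag names are assumptions rather than verified CLI options:

```bash
# Zero-dependency run with the heuristic audio + OpenCV visual backends
python -m cc_suggester.cli input_video.mp4 --config config/default.json --out-dir output/

# Full ML profile (YAMNet audio + MediaPipe visual), after downloading the optional models
python scripts/download_models.py
python -m cc_suggester.cli input_video.mp4 --config config/full_ml.json --out-dir output/

# Evaluate predictions against ground-truth annotations (flag names assumed)
python -m cc_suggester.eval --predictions output/events.json --ground-truth annotations.json
```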
## What I Plan to Do During the Coding Period
This PR is a foundation, not a finished product. The work I genuinely want to do during DMP builds directly on the evaluation plan above: annotating Hindi/regional content with ground truth, grid-searching the fusion thresholds, and replacing the default values with empirically validated ones.
I am not applying to C4GT to pad a resume. PlanetRead's mission — making content accessible to regional-language audiences across India — is something I care about, and the CC annotation problem is a real bottleneck for accessibility editors. I want to build something they can actually use.
C4GT DMP 2026 — PlanetRead Intelligent CC Suggestion Tool
Submitted by Jeevanjot | GitHub: https://github.com/Jeevanjot19