feat: intelligent CC suggestion pipeline — all 3 goals + Hindi support #13

Open
Naitik120gupta wants to merge 1 commit into PlanetRead:main from Naitik120gupta:feat/intelligent-cc-pipeline-naitik

Naitik120gupta commented May 9, 2026

Intelligent CC Suggestion Tool — DMP 2026 Demo

Contributor: Naitik | Issue: #2


What this PR contains

A single-file, end-to-end working pipeline covering all 3 goals from the ticket. No complex file structure is needed.

Architecture

Screenshot from 2026-05-09 15-54-43
Video → [Module 1: YAMNet SED] → audio events + timestamps → [Module 2: MediaPipe Reaction] → visual confidence scores → [Module 3: Decision Engine] → SRT / JSON output
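The last stage of the flow above emits SRT. As a rough illustration of that output step, the sketch below turns approved events into SRT cues. The function names and the fixed 2-second display duration are assumptions made for this example, not details taken from intelligent_cc_pipeline.py.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(events, duration=2.0):
    """Render (start_seconds, label) pairs as SRT text, ordered by start time.

    `duration` is a hypothetical fixed on-screen time for each caption.
    """
    cues = []
    for index, (start, label) in enumerate(sorted(events), start=1):
        end = start + duration
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n[{label}]\n"
        )
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)
```

Calling `to_srt([(13.44, "GUNSHOT"), (6.24, "APPLAUSE")])` yields two numbered cues, with the applause cue first because events are sorted by start time.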

What makes this submission different

1. Hindi CC label support (ticket requirement — others missed this)
The ticket explicitly targets "Hindi and regional-language content."
This pipeline is the only submission with native Hindi output:

python intelligent_cc_pipeline.py --video input.mp4 --lang hi
# Output: [तालियाँ], [विस्फोट], [गोलीबारी], [सायरन]
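One simple way to produce labels like those above is a lookup table keyed by the English class name. The sketch below is illustrative only: the dictionary holds just the four labels shown in this PR, and the function name and the fallback to English for unmapped classes are assumptions, not the pipeline's confirmed behaviour.

```python
# Hindi labels taken from the sample output in this PR description.
HINDI_LABELS = {
    "APPLAUSE": "तालियाँ",
    "EXPLOSION": "विस्फोट",
    "GUNSHOT": "गोलीबारी",
    "SIREN": "सायरन",
}


def localize_label(label: str, lang: str = "en") -> str:
    """Return a bracketed CC label in the requested language.

    Falls back to the English label when no Hindi entry exists
    (an assumed behaviour for this sketch).
    """
    key = label.upper()
    if lang == "hi" and key in HINDI_LABELS:
        return f"[{HINDI_LABELS[key]}]"
    return f"[{key}]"
```

Keeping the mapping in one dictionary means further regional languages could be added as extra tables without touching the detection modules.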

2. Audio-only bypass for high-impact events
Safety-critical sounds (gunshot, explosion, siren, alarm) get approved
on strong audio confidence alone (≥ 0.75), even without a visible face reaction.
Rationale: a gunshot off-camera still warrants a CC. The threshold is exposed
as the --audio-only-thresh flag, which the user can tune or disable.
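The bypass rule can be sketched as a small predicate. The class set and the 0.75 default come from the description above; the function name and the convention that `None` means "bypass disabled" are assumptions for illustration, since the PR text does not say how the flag is turned off.

```python
from typing import Optional

# Safety-critical classes named in the PR description.
HIGH_IMPACT = {"gunshot", "explosion", "siren", "alarm"}


def audio_only_bypass(
    label: str, audio_conf: float, thresh: Optional[float] = 0.75
) -> bool:
    """Return True when an event may be approved on audio confidence alone.

    `thresh=None` models a disabled bypass (hypothetical convention).
    """
    if thresh is None:
        return False
    return label.lower() in HIGH_IMPACT and audio_conf >= thresh
```

With the default threshold, `audio_only_bypass("gunshot", 0.95)` is true, while `audio_only_bypass("applause", 1.0)` is false because applause is not in the high-impact set.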

3. Freeze response detection in Module 2
Most implementations only detect motion spikes. This pipeline also scores
sudden stillness after an event — the startle freeze response — as a
reaction signal. This catches a class of reactions other visual models miss.
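To make the spike versus freeze distinction concrete, here is a minimal sketch that compares average motion before and after an event. It assumes a per-frame motion magnitude series (for example, mean face-landmark displacement) is already available; the function name, window size, and scoring formula are hypothetical and not taken from Module 2.

```python
from statistics import fmean


def reaction_scores(motion, event_idx, window=5, eps=1e-6):
    """Score a motion spike and a freeze around frame `event_idx`.

    `motion` is a list of non-negative per-frame motion magnitudes.
    Both scores lie in [0, 1]; a high freeze score means motion dropped
    sharply after the event relative to the frames before it.
    """
    before = motion[max(0, event_idx - window):event_idx]
    after = motion[event_idx:event_idx + window]
    if not before or not after:
        return {"spike": 0.0, "freeze": 0.0}
    b, a = fmean(before), fmean(after)
    # Relative increase in motion after the event.
    spike = max(0.0, (a - b) / (a + eps))
    # Relative decrease in motion after the event (sudden stillness).
    freeze = max(0.0, (b - a) / (b + eps))
    return {"spike": min(spike, 1.0), "freeze": min(freeze, 1.0)}
```

A subject who is moving and then goes still right after a loud sound gets a high `freeze` score and a zero `spike` score, which a spike-only detector would report as no reaction.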

4. Transparent decision basis in JSON output
Every accepted CC includes a decision_basis field:
"audio+visual", "audio_only_high_confidence", or "high_impact_bypass"
so editors know exactly why each CC was approved.
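A possible shape for such a record is sketched below. Only the three decision_basis strings and the 0.75 bypass threshold come from this PR; the rule order, the other thresholds, the equal-weight fusion, and the remaining field names are assumptions, so the combined scores it produces will not necessarily match the pipeline's own figures.

```python
import json

HIGH_IMPACT = {"gunshot", "explosion", "siren", "alarm"}


def decide(label, audio, visual, audio_only_thresh=0.75,
           strong_audio_thresh=0.95, combined_thresh=0.6, audio_weight=0.5):
    """Return a report record with a decision_basis, or None if rejected.

    All thresholds except audio_only_thresh, and the weighted-average
    fusion, are hypothetical values chosen for this sketch.
    """
    combined = audio_weight * audio + (1 - audio_weight) * visual
    if visual > 0 and combined >= combined_thresh:
        basis = "audio+visual"
    elif label.lower() in HIGH_IMPACT and audio >= audio_only_thresh:
        basis = "high_impact_bypass"
    elif audio >= strong_audio_thresh:
        basis = "audio_only_high_confidence"
    else:
        return None
    return {"label": label, "audio": audio, "visual": visual,
            "combined": round(combined, 2), "decision_basis": basis}


if __name__ == "__main__":
    # Example: serialize one accepted event for the JSON report.
    print(json.dumps(decide("gunshot", 0.95, 0.0), ensure_ascii=False))
```

Recording the basis alongside the raw scores lets an editor filter the report, for example to review every caption that was approved without any visual confirmation.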

Sample output (Hindi mode, canva.mp4)

| Time | Label (EN) | Label (HI) | Audio | Visual | Combined |
|---|---|---|---|---|---|
| 6.24s | [APPLAUSE] | [तालियाँ] | 1.00 | 0.00 | approved |
| 13.92s | [EXPLOSION] | [विस्फोट] | 0.97 | 0.50 | 0.71 |
| 13.44s | [GUNSHOT] | [गोलीबारी] | 0.95 | 0.00 | bypass |

Files

  • intelligent_cc_pipeline.py — full pipeline, all 3 modules
  • README.md — installation, usage, design decisions, known limitations
  • sample_output_en.srt — English CC output
  • sample_output_hi.srt — Hindi CC output
  • sample_report.json — full JSON report with decision_basis per event

Demo video

intelligent_cc_pipeline.mp4

YouTube link to the demo video: https://youtu.be/zn3huIukfiY

Known limitations

  1. YAMNet has no culturally-specific Indian sound classes (dhol, shehnai)
  2. Module 2 tracks one face — multi-speaker scenes need extension
  3. CPU-only; GPU would significantly reduce processing time
  4. SLS format needs PlanetRead's exact spec to finalize byte-level output
