intelligent-cc-generation-mvp-with-demo#16
Open
Anamikarajesh wants to merge 3 commits into
PR: Intelligent Closed Caption Suggestion Tool
Summary
This PR delivers the end-to-end Intelligent Closed Caption (CC) Suggestion Tool — an AI-powered Python backend and Streamlit editor-review workspace that detects non-speech sound events in raw video, analyzes visual speaker reactions, and exports reviewer-accepted captions as SRT files. It targets accessibility editors working with Hindi and Indian regional-language content.
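The SRT export mentioned above is small enough to sketch. The following is a minimal, self-contained approximation of that step, not the PR's actual `output/srt.py`:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a float second offset as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(captions) -> str:
    """Render an iterable of (start, end, text) triples as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(captions, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

Only reviewer-accepted captions would be passed to such a function; rejected candidates stay in the JSON/CSV debug reports.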
What this PR includes
- A CLI exposing the `analyze`, `doctor`, `export`, `labels`, and `web` commands

Demo & Drive Link
Screenshots
Architecture & Diagrams
1. System Architecture Overview
```mermaid
flowchart TB
  subgraph Inputs["Video Input"]
    VIDEO["Raw Video\n(.mp4, .mov, .mkv)"]
  end
  subgraph Audio["Audio Analysis"]
    direction TB
    EXTRACT["Audio Extraction\n(ffmpeg)"]
    DSP["DSP Baseline\n(RMS, STFT, Onsets)"]
    A_MODELS["Audio ML Backends\n(YAMNet / PANNs / AST)"]
    SMOOTH["Event Smoothing\n(Merge, Filter, Normalize)"]
    EXTRACT --> DSP --> A_MODELS --> SMOOTH
  end
  subgraph Vision["Visual Reaction"]
    direction TB
    FRAMES["Frame Sampler\n(before / during / after)"]
    FLOW["Optical Flow\n(OpenCV)"]
    V_MODELS["Vision ML Backends\n(MediaPipe / MMPose)"]
    REACT["Reaction Scoring"]
    FRAMES --> FLOW --> REACT
    FRAMES --> V_MODELS --> REACT
  end
  subgraph Decision["Decision Engine"]
    direction TB
    SCORER["Scorer\n(audio + reaction + importance - ambient penalty)"]
    LABELS["Caption Labels\n(Glossary per language)"]
    SCORER --> LABELS
  end
  subgraph Outputs["Exports"]
    direction LR
    SRT["SRT\n(accepted captions)"]
    JSON["JSON\n(full debug report)"]
    CSV["CSV\n(reviewer spreadsheet)"]
  end
  subgraph Clients["User Interfaces"]
    direction LR
    CLI["CLI\n(ccs analyze / doctor / export)"]
    WEB["Web UI\n(Streamlit editor workspace)"]
  end
  VIDEO --> EXTRACT
  SMOOTH --> AudioEvents["Audio Event\nCandidates"]
  AudioEvents --> FRAMES
  AudioEvents --> SCORER
  REACT --> SCORER
  LABELS --> SRT
  SCORER --> JSON
  SCORER --> CSV
  CLI --> VIDEO
  CLI --> SCORER
  WEB --> VIDEO
  WEB --> SCORER
```

2. Pipeline Flow
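One step of the pipeline flow, smoothing raw backend detections into event candidates, lends itself to a short sketch. This is an illustrative simplification (the thresholds are hypothetical, not the PR's defaults in `audio/events.py`):

```python
def merge_events(events, max_gap=0.3, min_duration=0.1):
    """Merge overlapping or near-adjacent (start, end, confidence)
    detections into event candidates, dropping very short blips.

    max_gap: merge events separated by less than this many seconds.
    min_duration: drop merged events shorter than this.
    """
    merged = []
    for start, end, conf in sorted(events):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous event and keep the stronger confidence.
            prev_start, prev_end, prev_conf = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_conf, conf))
        else:
            merged.append((start, end, conf))
    return [e for e in merged if e[1] - e[0] >= min_duration]
```

Smoothing like this is what turns frame-level model outputs into the discrete "Audio Event Candidates" fed to the vision stage and the scorer.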
```mermaid
flowchart TD
  A["Raw video input"] --> B{"Valid video?"}
  B -- "No" --> B_ERR["Friendly error\nsuggest inspect/doctor command"]
  B -- "Yes" --> C["Extract metadata\nfps, duration, resolution"]
  C --> D["Extract audio with ffmpeg"]
  D --> E["Compute DSP features\nRMS, STFT, spectral flux"]
  E --> F["Run audio backend\nYAMNet first, PANNs/AST later"]
  F --> G["Smooth + merge detections"]
  G --> H["Audio event candidates"]
  H --> I["Sample frames around each event"]
  I --> J["Run visual backends\nMediaPipe face/pose + optical flow"]
  J --> K["Reaction confidence per event"]
  H --> L["Decision engine"]
  K --> L
  L --> M{"Caption warranted?"}
  M -- "No" --> N["Rejected candidate\nkept in JSON/CSV debug report"]
  M -- "Yes" --> O["Accepted caption suggestion"]
  O --> P["Language label mapping"]
  P --> Q["Export SRT"]
  L --> R["Export full JSON report"]
  L --> S["Export CSV review report"]
```

3. Data Model (Class Diagram)
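The three records in the class diagram below translate naturally into Python dataclasses. This is a sketch of the shapes only, with field names taken from the diagram; it is not necessarily the PR's actual `core/types.py`:

```python
from dataclasses import dataclass, field

@dataclass
class AudioEventCandidate:
    event_id: str
    label: str
    start_time: float
    end_time: float
    audio_confidence: float
    audio_backend: str
    raw_class_name: str
    debug_info: dict = field(default_factory=dict)

@dataclass
class ReactionResult:
    event_id: str
    start_time: float
    end_time: float
    reaction_confidence: float
    reaction_signals: dict
    frames_sampled: int
    vision_backend: str
    debug_info: dict = field(default_factory=dict)

@dataclass
class CaptionSuggestion:
    event_id: str
    start_time: float
    end_time: float
    audio_confidence: float
    reaction_confidence: float
    decision_score: float
    accepted: bool
    reason: str
    caption_text: str
    language: str
    requires_review: bool
    debug_info: dict = field(default_factory=dict)
```

The shared `event_id` is what lets the decision engine join audio evidence with the visual reaction measured at the same timestamp.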
```mermaid
classDiagram
  class AudioEventCandidate {
    string event_id
    string label
    float start_time
    float end_time
    float audio_confidence
    string audio_backend
    string raw_class_name
    dict debug_info
  }
  class ReactionResult {
    string event_id
    float start_time
    float end_time
    float reaction_confidence
    dict reaction_signals
    int frames_sampled
    string vision_backend
    dict debug_info
  }
  class CaptionSuggestion {
    string event_id
    float start_time
    float end_time
    float audio_confidence
    float reaction_confidence
    float decision_score
    bool accepted
    string reason
    string caption_text
    string language
    bool requires_review
    dict debug_info
  }
  AudioEventCandidate --> ReactionResult : analyzed visually at event timestamp
  AudioEventCandidate --> CaptionSuggestion : contributes audio evidence
  ReactionResult --> CaptionSuggestion : contributes visual evidence
```

4. Decision Engine Scoring
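The scoring combination in the diagram below can be sketched in a few lines. The weights and thresholds here are hypothetical placeholders; the PR's `decision/scorer.py` defines the real ones:

```python
def decision_score(audio_conf, reaction_conf, importance=0.0,
                   ambient_penalty=0.0, pause_bonus=0.0):
    """Combine audio confidence, reaction confidence, an importance prior,
    a speech-pause/scene-impact bonus, and an ambient-sound penalty into
    one score. Weights are illustrative only."""
    return (0.5 * audio_conf + 0.3 * reaction_conf
            + importance + pause_bonus - ambient_penalty)

def decide(score, accept_at=0.6, review_at=0.45):
    """Map a score to accept / review / reject (thresholds illustrative).
    The borderline band is what feeds the editor-review queue."""
    if score >= accept_at:
        return "accept"
    if score >= review_at:
        return "review"
    return "reject"
```

An ambient penalty can flip an otherwise strong detection to a reject, which is why rejected candidates keep their reason in the debug output.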
```mermaid
flowchart LR
  A["Audio confidence"] --> E["Decision scorer"]
  B["Reaction confidence"] --> E
  C["Event importance prior"] --> E
  D["Ambient sound penalty"] --> E
  P["Speech pause / scene impact bonus"] --> E
  E --> F{"Decision score >= threshold?"}
  F -- "Yes" --> G["Accept caption"]
  F -- "Borderline" --> H["Needs editor review"]
  F -- "No" --> I["Reject candidate"]
  G --> J["Generate caption text"]
  H --> J
  I --> K["Keep reason in debug output"]
  J --> L["SRT / JSON / CSV"]
```

5. Editor Review Flow (Sequence Diagram)
```mermaid
sequenceDiagram
  actor Editor
  participant UI as Web UI
  participant Pipeline as Core Pipeline
  participant Review as Review State
  participant Export as Exporter
  Editor->>UI: Select video, language, device
  Editor->>UI: Click Start Caption
  UI->>Pipeline: Run analysis with config
  Pipeline-->>UI: Caption suggestions + diagnostics
  UI->>Review: Load suggestions into timeline and panel
  Editor->>Review: Jump to event marker
  Editor->>Review: Edit caption text
  Editor->>Review: Accept or reject suggestion
  Editor->>UI: Export final SRT
  UI->>Export: Export accepted captions
  Export-->>Editor: SRT / JSON / CSV files
```

6. Web UI Layout
```mermaid
flowchart TB
  subgraph Top["Top Bar"]
    T1["Device Mode"]
    T2["Language"]
    T3["Audio Backend"]
    T4["Vision Backend"]
    T5["Run Doctor"]
  end
  subgraph Left["Left Panel"]
    L1["Video Dropdown / Upload"]
    L2["Video Metadata"]
    L3["Start Caption"]
    L4["Export SRT / JSON / CSV"]
  end
  subgraph Center["Center Panel"]
    C1["Video Player"]
    C2["Play / Pause"]
    C3["Draggable Timeline"]
    C4["Event Markers"]
    C5["Previous / Next Event"]
  end
  subgraph Right["Right Review Panel"]
    R1["SRT Suggestions"]
    R2["Editable Caption Text"]
    R3["Accept / Reject"]
    R4["Confidence Scores"]
    R5["Decision Reason"]
    R6["Error / Warning Badges"]
  end
  subgraph Bottom["Bottom Panel"]
    B1["Event Table"]
    B2["Timestamps"]
    B3["Audio + Reaction Scores"]
    B4["Status"]
  end
  L1 --> L3
  T1 --> L3
  T2 --> L3
  T3 --> L3
  T4 --> L3
  L3 --> C1
  L3 --> R1
  L3 --> B1
  C4 --> R1
  B1 --> R1
  R3 --> L4
  R2 --> L4
```

7. High-Level Module Architecture
```mermaid
flowchart TB
  subgraph Clients["User-Facing Clients"]
    CLI["CLI\nccs analyze / doctor / export"]
    WEB["Web UI\neditor review workspace"]
    VLC["Future VLC Plugin"]
    API["Future Local API"]
  end
  subgraph Core["Reusable Core Pipeline"]
    PIPE["Pipeline Orchestrator"]
    CONFIG["Config + Thresholds"]
    DIAG["Diagnostics + Friendly Errors"]
    TYPES["Shared Data Models"]
  end
  subgraph Audio["Audio Analysis"]
    EXTRACT["Audio Extraction"]
    DSP2["DSP Features\nFFT / STFT / RMS / Onsets"]
    A_BACKENDS["Audio Backends\nYAMNet / PANNs / AST / BEATs"]
    EVENTS["Event Smoothing\nMerge / Filter / Normalize"]
  end
  subgraph Vision["Visual Reaction Analysis"]
    FRAMES2["Frame Sampler"]
    FLOW2["Optical Flow"]
    V_BACKENDS["Vision Backends\nMediaPipe / MMPose / MMAction2"]
    REACT2["Reaction Scoring"]
  end
  subgraph Decision["Caption Decision"]
    SCORE["Decision Scorer"]
    RULES["Importance Rules\nAmbient Penalties"]
    LABELS2["Caption Labels\nGlossary + Translation"]
  end
  subgraph Outputs["Exports"]
    SRT2["SRT"]
    JSON2["JSON Debug Report"]
    CSV2["CSV Review Report"]
  end
  CLI --> PIPE
  WEB --> PIPE
  VLC --> API
  API --> PIPE
  PIPE --> CONFIG
  PIPE --> DIAG
  PIPE --> TYPES
  PIPE --> EXTRACT
  EXTRACT --> DSP2
  DSP2 --> A_BACKENDS
  A_BACKENDS --> EVENTS
  EVENTS --> FRAMES2
  FRAMES2 --> FLOW2
  FRAMES2 --> V_BACKENDS
  FLOW2 --> REACT2
  V_BACKENDS --> REACT2
  EVENTS --> SCORE
  REACT2 --> SCORE
  RULES --> SCORE
  SCORE --> LABELS2
  LABELS2 --> SRT2
  SCORE --> JSON2
  SCORE --> CSV2
```

8. Device / GPU Handling
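The device-resolution logic in the diagram below amounts to a small decision function. A sketch under the diagram's rules (the actual handling lives in the PR's config/diagnostics code; function and return shape are hypothetical):

```python
def resolve_device(mode: str, gpu_available: bool):
    """Resolve the user's device choice (auto / cpu / cuda).

    Returns (device, note); raises on an impossible request so the CLI
    can print a diagnostic suggesting `--device cpu` or `ccs doctor`.
    """
    if mode == "cpu":
        return "cpu", "forced by user"
    if mode == "cuda":
        if not gpu_available:
            raise RuntimeError(
                "CUDA requested but no GPU found; "
                "retry with --device cpu or run ccs doctor")
        return "cuda", "requested and available"
    if mode == "auto":
        if gpu_available:
            return "cuda", "auto-selected GPU"
        # Silent fallback, but the reason is recorded in the metadata.
        return "cpu", "fallback: no GPU available"
    raise ValueError(f"unknown device mode: {mode!r}")
```

The note string stands in for the "record fallback reason" / "save device metadata" steps in the diagram.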
```mermaid
flowchart TD
  A["User selects device mode"] --> B{"Mode"}
  B -- "auto" --> C{"GPU available?"}
  C -- "Yes" --> D["Use GPU"]
  C -- "No" --> E["Fallback to CPU\nrecord fallback reason"]
  B -- "cpu" --> F["Force CPU"]
  B -- "cuda" --> G{"GPU available?"}
  G -- "Yes" --> D
  G -- "No" --> H["Stop with clear diagnostic"]
  H --> I["Suggest: retry with --device cpu"]
  H --> J["Suggest: run ccs doctor"]
  D --> K["Save device metadata"]
  E --> K
  F --> K
```

9. Project Roadmap
```mermaid
flowchart LR
  P1["Phase 1\nProject Foundation"] --> P2["Phase 2\nAudio Detection"]
  P2 --> P3["Phase 3\nVisual Reaction Detection"]
  P3 --> P4["Phase 4\nDecision + Output"]
  P4 --> P5["Phase 5\nCLI Productization"]
  P5 --> P6["Phase 6\nWeb Editor UI"]
  P6 --> P7["Phase 7\nAdvanced Backends"]
  P7 --> P8["Phase 8\nEvaluation + Packaging"]
  P2 -.->|Midpoint Goal 1| M["Midpoint\nGoals 1 + 2 complete"]
  P3 -.->|Midpoint Goal 2| M
```

Files Changed
Core Pipeline
- main/cc_suggester/__init__.py
- main/cc_suggester/__main__.py
- main/cc_suggester/core/__init__.py
- main/cc_suggester/core/config.py
- main/cc_suggester/core/diagnostics.py
- main/cc_suggester/core/errors.py
- main/cc_suggester/core/media.py
- main/cc_suggester/core/pipeline.py
- main/cc_suggester/core/types.py

Audio Module
- main/cc_suggester/audio/__init__.py
- main/cc_suggester/audio/extractor.py
- main/cc_suggester/audio/wav.py
- main/cc_suggester/audio/dsp.py
- main/cc_suggester/audio/events.py
- main/cc_suggester/audio/vad.py
- main/cc_suggester/audio/label_mapping.py
- main/cc_suggester/audio/backends/__init__.py
- main/cc_suggester/audio/backends/base.py
- main/cc_suggester/audio/backends/dsp.py
- main/cc_suggester/audio/backends/mock.py
- main/cc_suggester/audio/backends/unavailable.py
- main/cc_suggester/audio/backends/yamnet.py

Vision Module
- main/cc_suggester/vision/__init__.py
- main/cc_suggester/vision/frame_sampler.py
- main/cc_suggester/vision/optical_flow.py
- main/cc_suggester/vision/reactions.py
- main/cc_suggester/vision/backends/__init__.py
- main/cc_suggester/vision/backends/base.py
- main/cc_suggester/vision/backends/mediapipe.py
- main/cc_suggester/vision/backends/mock.py
- main/cc_suggester/vision/backends/opencv.py

Decision Engine
- main/cc_suggester/decision/__init__.py
- main/cc_suggester/decision/labels.py
- main/cc_suggester/decision/rules.py
- main/cc_suggester/decision/scorer.py

Output / Export
- main/cc_suggester/output/__init__.py
- main/cc_suggester/output/csv_report.py
- main/cc_suggester/output/json_report.py
- main/cc_suggester/output/review_export.py
- main/cc_suggester/output/srt.py

CLI
- main/cc_suggester/cli/__init__.py
- main/cc_suggester/cli/app.py

Web UI (Streamlit)
- main/cc_suggester/ui/__init__.py
- main/cc_suggester/ui/streamlit_app.py

Translation
- main/cc_suggester/translation/__init__.py
- main/cc_suggester/translation/glossary.py

Config
- main/configs/default.json

Tests
- main/tests/test_config_cli.py
- main/tests/test_dsp_backend.py
- main/tests/test_outputs.py
- main/tests/test_real_video_integration.py
- main/tests/test_review_export.py
- main/tests/test_vision_pipeline.py
- main/tests/test_yamnet_backend.py

Scripts
- main/scripts/generate_sample_video.py

Build & Dependencies
- main/pyproject.toml
- main/requirements.txt
- main/requirements-audio.txt
- main/requirements-dev.txt
- main/requirements-translate.txt
- main/requirements-ui.txt
- main/requirements-vision.txt

Documentation & Assets
- README.md
- main/README.md
- main/.gitignore
- .gitignore
- mockups/hindi.png
- mockups/telugu.png
- mockups/mallu.png
- mockups/web-ui.html
- demo_videos/drivelink
- demo_videos/output/vid1.reviewed.en.srt
- demo_videos/output/vid2.reviewed.en.srt
- demo_videos/output/vid3.reviewed.en.srt
- demo_videos/output/vid4.reviewed.en.srt
- demo_videos/output/vid5.reviewed.en.srt
- demo_videos/output/vid10.reviewed.en.srt
- demo_videos/output/vid11.reviewed.en.srt

Testing
```shell
cd main
pip install -r requirements.txt -r requirements-dev.txt
python -m pytest tests -v
```

Test Coverage
Usage Quickstart
Checklist
- … (`--device cpu`) without errors
- … (`--device auto`)
- `doctor` reports environment correctly
- `analyze` runs end-to-end
- `python -m pytest tests -v`

Future Work (Out of Scope for This PR)