intelligent-cc-generation-mvp-with-demo#16

Open
Anamikarajesh wants to merge 3 commits into PlanetRead:main from Anamikarajesh:main


PR: Intelligent Closed Caption Suggestion Tool

Author: @Anamikarajesh
Status: Open
Related Issue: (link)


Summary

This PR delivers the end-to-end Intelligent Closed Caption (CC) Suggestion Tool — an AI-powered Python backend and Streamlit editor-review workspace that detects non-speech sound events in raw video, analyzes visual speaker reactions, and exports reviewer-accepted captions as SRT files. It targets accessibility editors working with Hindi and Indian regional-language content.

What this PR includes

  • Modular Python backend with pluggable audio (DSP / YAMNet / mock) and vision (OpenCV / MediaPipe / mock) backends
  • Decision engine that combines audio confidence, reaction confidence, event importance, and ambient penalties
  • Multilingual caption label glossary (English, Hindi, Tamil, Telugu, Bengali, Marathi, Malayalam)
  • CLI with analyze, doctor, export, labels, and web commands
  • Streamlit Web UI with glassmorphism design, video player, draggable timeline, and review panel
  • SRT / JSON / CSV exports, review/re-export pipeline, and diagnostic tooling
  • CPU/GPU device selection with graceful fallback and error suggestions
  • Sample video generator for local testing without system ffmpeg
  • Full README with architecture diagrams, data models, pipeline flow, and roadmap

Demo & Drive Link

Google Drive Folder:
https://drive.google.com/drive/folders/1Ti5aqztP9VHas_5AbrH7utSn-G27HZXW?usp=sharing

Contains: demo videos, sample recordings, and SRT outputs.


Screenshots

Hindi · Malayalam · Telugu

Architecture & Diagrams

1. System Architecture Overview

flowchart TB
    subgraph Inputs["Video Input"]
        VIDEO["Raw Video\n(.mp4, .mov, .mkv)"]
    end

    subgraph Audio["Audio Analysis"]
        direction TB
        EXTRACT["Audio Extraction\n(ffmpeg)"]
        DSP["DSP Baseline\n(RMS, STFT, Onsets)"]
        A_MODELS["Audio ML Backends\n(YAMNet / PANNs / AST)"]
        SMOOTH["Event Smoothing\n(Merge, Filter, Normalize)"]
        EXTRACT --> DSP --> A_MODELS --> SMOOTH
    end

    subgraph Vision["Visual Reaction"]
        direction TB
        FRAMES["Frame Sampler\n(before / during / after)"]
        FLOW["Optical Flow\n(OpenCV)"]
        V_MODELS["Vision ML Backends\n(MediaPipe / MMPose)"]
        REACT["Reaction Scoring"]
        FRAMES --> FLOW --> REACT
        FRAMES --> V_MODELS --> REACT
    end

    subgraph Decision["Decision Engine"]
        direction TB
        SCORER["Scorer\n(audio + reaction + importance - ambient penalty)"]
        LABELS["Caption Labels\n(Glossary per language)"]
        SCORER --> LABELS
    end

    subgraph Outputs["Exports"]
        direction LR
        SRT["SRT\n(accepted captions)"]
        JSON["JSON\n(full debug report)"]
        CSV["CSV\n(reviewer spreadsheet)"]
    end

    subgraph Clients["User Interfaces"]
        direction LR
        CLI["CLI\n(ccs analyze / doctor / export)"]
        WEB["Web UI\n(Streamlit editor workspace)"]
    end

    VIDEO --> EXTRACT
    SMOOTH --> AudioEvents["Audio Event\nCandidates"]
    AudioEvents --> FRAMES
    AudioEvents --> SCORER
    REACT --> SCORER
    LABELS --> SRT
    SCORER --> JSON
    SCORER --> CSV

    CLI --> VIDEO
    CLI --> SCORER
    WEB --> VIDEO
    WEB --> SCORER

2. Pipeline Flow

flowchart TD
    A["Raw video input"] --> B{"Valid video?"}
    B -- "No" --> B_ERR["Friendly error\nsuggest inspect/doctor command"]
    B -- "Yes" --> C["Extract metadata\nfps, duration, resolution"]
    C --> D["Extract audio with ffmpeg"]
    D --> E["Compute DSP features\nRMS, STFT, spectral flux"]
    E --> F["Run audio backend\nYAMNet first, PANNs/AST later"]
    F --> G["Smooth + merge detections"]
    G --> H["Audio event candidates"]
    H --> I["Sample frames around each event"]
    I --> J["Run visual backends\nMediaPipe face/pose + optical flow"]
    J --> K["Reaction confidence per event"]
    H --> L["Decision engine"]
    K --> L
    L --> M{"Caption warranted?"}
    M -- "No" --> N["Rejected candidate\nkept in JSON/CSV debug report"]
    M -- "Yes" --> O["Accepted caption suggestion"]
    O --> P["Language label mapping"]
    P --> Q["Export SRT"]
    L --> R["Export full JSON report"]
    L --> S["Export CSV review report"]
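The "Smooth + merge detections" step above could look roughly like the sketch below: adjacent detections of the same label whose gap is below a small tolerance are collapsed into one event. The function name `merge_events` and the `gap` parameter are illustrative, not the PR's actual API (see main/cc_suggester/audio/events.py for the real implementation).

```python
def merge_events(events, gap=0.25):
    """Merge (start, end, label, conf) tuples of the same label
    whose time gap is at most `gap` seconds."""
    merged = []
    for start, end, label, conf in sorted(events):
        if merged and label == merged[-1][2] and start - merged[-1][1] <= gap:
            # Extend the previous event and keep the stronger confidence.
            prev = merged[-1]
            merged[-1] = (prev[0], max(prev[1], end), label, max(prev[3], conf))
        else:
            merged.append((start, end, label, conf))
    return merged
```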

3. Data Model (Class Diagram)

classDiagram
    class AudioEventCandidate {
        string event_id
        string label
        float start_time
        float end_time
        float audio_confidence
        string audio_backend
        string raw_class_name
        dict debug_info
    }

    class ReactionResult {
        string event_id
        float start_time
        float end_time
        float reaction_confidence
        dict reaction_signals
        int frames_sampled
        string vision_backend
        dict debug_info
    }

    class CaptionSuggestion {
        string event_id
        float start_time
        float end_time
        float audio_confidence
        float reaction_confidence
        float decision_score
        bool accepted
        string reason
        string caption_text
        string language
        bool requires_review
        dict debug_info
    }

    AudioEventCandidate --> ReactionResult : analyzed visually at event timestamp
    AudioEventCandidate --> CaptionSuggestion : contributes audio evidence
    ReactionResult --> CaptionSuggestion : contributes visual evidence
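The `CaptionSuggestion` record in the diagram maps naturally onto a Python dataclass. This is a sketch that follows the diagram's field names; the real definitions live in main/cc_suggester/core/types.py and may differ in detail.

```python
from dataclasses import dataclass, field

@dataclass
class CaptionSuggestion:
    """Final decision record per event, per the class diagram above."""
    event_id: str
    start_time: float
    end_time: float
    audio_confidence: float
    reaction_confidence: float
    decision_score: float
    accepted: bool
    reason: str
    caption_text: str
    language: str
    requires_review: bool
    debug_info: dict = field(default_factory=dict)
```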

4. Decision Engine Scoring

flowchart LR
    A["Audio confidence"] --> E["Decision scorer"]
    B["Reaction confidence"] --> E
    C["Event importance prior"] --> E
    D["Ambient sound penalty"] --> E
    P["Speech pause / scene impact bonus"] --> E

    E --> F{"Decision score >= threshold?"}
    F -- "Yes" --> G["Accept caption"]
    F -- "Borderline" --> H["Needs editor review"]
    F -- "No" --> I["Reject candidate"]

    G --> J["Generate caption text"]
    H --> J
    I --> K["Keep reason in debug output"]
    J --> L["SRT / JSON / CSV"]
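A minimal sketch of the additive scoring and three-way thresholding drawn above. The combination and the threshold values (`accept_at`, `review_at`) are illustrative; the tuned values ship in main/configs/default.json.

```python
def decision_score(audio_conf, reaction_conf, importance,
                   ambient_penalty, bonus=0.0):
    # Additive combination per the diagram: audio + reaction +
    # importance prior - ambient penalty + speech-pause/impact bonus.
    return audio_conf + reaction_conf + importance - ambient_penalty + bonus

def decide(score, accept_at=1.5, review_at=1.0):
    """Map a score to accept / needs-review / reject bands."""
    if score >= accept_at:
        return "accept"
    if score >= review_at:
        return "needs_review"
    return "reject"
```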

5. Editor Review Flow (Sequence Diagram)

sequenceDiagram
    actor Editor
    participant UI as Web UI
    participant Pipeline as Core Pipeline
    participant Review as Review State
    participant Export as Exporter

    Editor->>UI: Select video, language, device
    Editor->>UI: Click Start Caption
    UI->>Pipeline: Run analysis with config
    Pipeline-->>UI: Caption suggestions + diagnostics
    UI->>Review: Load suggestions into timeline and panel
    Editor->>Review: Jump to event marker
    Editor->>Review: Edit caption text
    Editor->>Review: Accept or reject suggestion
    Editor->>UI: Export final SRT
    UI->>Export: Export accepted captions
    Export-->>Editor: SRT / JSON / CSV files

6. Web UI Layout

flowchart TB
    subgraph Top["Top Bar"]
        T1["Device Mode"]
        T2["Language"]
        T3["Audio Backend"]
        T4["Vision Backend"]
        T5["Run Doctor"]
    end

    subgraph Left["Left Panel"]
        L1["Video Dropdown / Upload"]
        L2["Video Metadata"]
        L3["Start Caption"]
        L4["Export SRT / JSON / CSV"]
    end

    subgraph Center["Center Panel"]
        C1["Video Player"]
        C2["Play / Pause"]
        C3["Draggable Timeline"]
        C4["Event Markers"]
        C5["Previous / Next Event"]
    end

    subgraph Right["Right Review Panel"]
        R1["SRT Suggestions"]
        R2["Editable Caption Text"]
        R3["Accept / Reject"]
        R4["Confidence Scores"]
        R5["Decision Reason"]
        R6["Error / Warning Badges"]
    end

    subgraph Bottom["Bottom Panel"]
        B1["Event Table"]
        B2["Timestamps"]
        B3["Audio + Reaction Scores"]
        B4["Status"]
    end

    L1 --> L3
    T1 --> L3
    T2 --> L3
    T3 --> L3
    T4 --> L3
    L3 --> C1
    L3 --> R1
    L3 --> B1
    C4 --> R1
    B1 --> R1
    R3 --> L4
    R2 --> L4

7. High-Level Module Architecture

flowchart TB
    subgraph Clients["User-Facing Clients"]
        CLI["CLI\nccs analyze / doctor / export"]
        WEB["Web UI\neditor review workspace"]
        VLC["Future VLC Plugin"]
        API["Future Local API"]
    end

    subgraph Core["Reusable Core Pipeline"]
        PIPE["Pipeline Orchestrator"]
        CONFIG["Config + Thresholds"]
        DIAG["Diagnostics + Friendly Errors"]
        TYPES["Shared Data Models"]
    end

    subgraph Audio["Audio Analysis"]
        EXTRACT["Audio Extraction"]
        DSP2["DSP Features\nFFT / STFT / RMS / Onsets"]
        A_BACKENDS["Audio Backends\nYAMNet / PANNs / AST / BEATs"]
        EVENTS["Event Smoothing\nMerge / Filter / Normalize"]
    end

    subgraph Vision["Visual Reaction Analysis"]
        FRAMES2["Frame Sampler"]
        FLOW2["Optical Flow"]
        V_BACKENDS["Vision Backends\nMediaPipe / MMPose / MMAction2"]
        REACT2["Reaction Scoring"]
    end

    subgraph Decision["Caption Decision"]
        SCORE["Decision Scorer"]
        RULES["Importance Rules\nAmbient Penalties"]
        LABELS2["Caption Labels\nGlossary + Translation"]
    end

    subgraph Outputs["Exports"]
        SRT2["SRT"]
        JSON2["JSON Debug Report"]
        CSV2["CSV Review Report"]
    end

    CLI --> PIPE
    WEB --> PIPE
    VLC --> API
    API --> PIPE

    PIPE --> CONFIG
    PIPE --> DIAG
    PIPE --> TYPES
    PIPE --> EXTRACT
    EXTRACT --> DSP2
    DSP2 --> A_BACKENDS
    A_BACKENDS --> EVENTS
    EVENTS --> FRAMES2
    FRAMES2 --> FLOW2
    FRAMES2 --> V_BACKENDS
    FLOW2 --> REACT2
    V_BACKENDS --> REACT2
    EVENTS --> SCORE
    REACT2 --> SCORE
    RULES --> SCORE
    SCORE --> LABELS2
    LABELS2 --> SRT2
    SCORE --> JSON2
    SCORE --> CSV2

8. Device / GPU Handling

flowchart TD
    A["User selects device mode"] --> B{"Mode"}
    B -- "auto" --> C{"GPU available?"}
    C -- "Yes" --> D["Use GPU"]
    C -- "No" --> E["Fallback to CPU\nrecord fallback reason"]
    B -- "cpu" --> F["Force CPU"]
    B -- "cuda" --> G{"GPU available?"}
    G -- "Yes" --> D
    G -- "No" --> H["Stop with clear diagnostic"]
    H --> I["Suggest: retry with --device cpu"]
    H --> J["Suggest: run ccs doctor"]
    D --> K["Save device metadata"]
    E --> K
    F --> K
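The device-resolution flow above can be sketched as a single function. `resolve_device` and its `(device, fallback_note)` return shape are illustrative names, not the PR's actual API.

```python
def resolve_device(mode, gpu_available):
    """Resolve a requested device mode per the flowchart.

    Returns (device, fallback_note); raises with a clear diagnostic
    when CUDA is forced but unavailable.
    """
    if mode == "cpu":
        return "cpu", None
    if mode == "auto":
        if gpu_available:
            return "cuda", None
        # Fall back silently but record why, per "record fallback reason".
        return "cpu", "no GPU found; fell back to CPU"
    if mode == "cuda":
        if gpu_available:
            return "cuda", None
        raise RuntimeError(
            "CUDA requested but no GPU is available; "
            "retry with --device cpu or run `ccs doctor`"
        )
    raise ValueError(f"unknown device mode: {mode}")
```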

9. Project Roadmap

flowchart LR
    P1["Phase 1\nProject Foundation"] --> P2["Phase 2\nAudio Detection"]
    P2 --> P3["Phase 3\nVisual Reaction Detection"]
    P3 --> P4["Phase 4\nDecision + Output"]
    P4 --> P5["Phase 5\nCLI Productization"]
    P5 --> P6["Phase 6\nWeb Editor UI"]
    P6 --> P7["Phase 7\nAdvanced Backends"]
    P7 --> P8["Phase 8\nEvaluation + Packaging"]

    P2 -.->|Midpoint Goal 1| M["Midpoint\nGoals 1 + 2 complete"]
    P3 -.->|Midpoint Goal 2| M

Files Changed

Core Pipeline
  • main/cc_suggester/__init__.py
  • main/cc_suggester/__main__.py
  • main/cc_suggester/core/__init__.py
  • main/cc_suggester/core/config.py
  • main/cc_suggester/core/diagnostics.py
  • main/cc_suggester/core/errors.py
  • main/cc_suggester/core/media.py
  • main/cc_suggester/core/pipeline.py
  • main/cc_suggester/core/types.py
Audio Module
  • main/cc_suggester/audio/__init__.py
  • main/cc_suggester/audio/extractor.py
  • main/cc_suggester/audio/wav.py
  • main/cc_suggester/audio/dsp.py
  • main/cc_suggester/audio/events.py
  • main/cc_suggester/audio/vad.py
  • main/cc_suggester/audio/label_mapping.py
  • main/cc_suggester/audio/backends/__init__.py
  • main/cc_suggester/audio/backends/base.py
  • main/cc_suggester/audio/backends/dsp.py
  • main/cc_suggester/audio/backends/mock.py
  • main/cc_suggester/audio/backends/unavailable.py
  • main/cc_suggester/audio/backends/yamnet.py
Vision Module
  • main/cc_suggester/vision/__init__.py
  • main/cc_suggester/vision/frame_sampler.py
  • main/cc_suggester/vision/optical_flow.py
  • main/cc_suggester/vision/reactions.py
  • main/cc_suggester/vision/backends/__init__.py
  • main/cc_suggester/vision/backends/base.py
  • main/cc_suggester/vision/backends/mediapipe.py
  • main/cc_suggester/vision/backends/mock.py
  • main/cc_suggester/vision/backends/opencv.py
Decision Engine
  • main/cc_suggester/decision/__init__.py
  • main/cc_suggester/decision/labels.py
  • main/cc_suggester/decision/rules.py
  • main/cc_suggester/decision/scorer.py
Output / Export
  • main/cc_suggester/output/__init__.py
  • main/cc_suggester/output/csv_report.py
  • main/cc_suggester/output/json_report.py
  • main/cc_suggester/output/review_export.py
  • main/cc_suggester/output/srt.py
CLI
  • main/cc_suggester/cli/__init__.py
  • main/cc_suggester/cli/app.py
Web UI (Streamlit)
  • main/cc_suggester/ui/__init__.py
  • main/cc_suggester/ui/streamlit_app.py
Translation
  • main/cc_suggester/translation/__init__.py
  • main/cc_suggester/translation/glossary.py
Config
  • main/configs/default.json
Tests
  • main/tests/test_config_cli.py
  • main/tests/test_dsp_backend.py
  • main/tests/test_outputs.py
  • main/tests/test_real_video_integration.py
  • main/tests/test_review_export.py
  • main/tests/test_vision_pipeline.py
  • main/tests/test_yamnet_backend.py
Scripts
  • main/scripts/generate_sample_video.py
Build & Dependencies
  • main/pyproject.toml
  • main/requirements.txt
  • main/requirements-audio.txt
  • main/requirements-dev.txt
  • main/requirements-translate.txt
  • main/requirements-ui.txt
  • main/requirements-vision.txt
Documentation & Assets
  • README.md
  • main/README.md
  • main/.gitignore
  • .gitignore
  • mockups/hindi.png
  • mockups/telugu.png
  • mockups/mallu.png
  • mockups/web-ui.html
  • demo_videos/drivelink
  • demo_videos/output/vid1.reviewed.en.srt
  • demo_videos/output/vid2.reviewed.en.srt
  • demo_videos/output/vid3.reviewed.en.srt
  • demo_videos/output/vid4.reviewed.en.srt
  • demo_videos/output/vid5.reviewed.en.srt
  • demo_videos/output/vid10.reviewed.en.srt
  • demo_videos/output/vid11.reviewed.en.srt

Testing

cd main
pip install -r requirements.txt -r requirements-dev.txt
python -m pytest tests -v

Test Coverage

  • DSP backend: RMS, STFT, onset detection unit tests
  • YAMNet backend: Model load, inference, label mapping tests
  • Vision pipeline: Optical flow, MediaPipe, mock backend tests
  • Output: SRT formatting, JSON/CSV export correctness
  • CLI: Config loading, flag override, error suggestions
  • Review export: Accept/reject/re-export flow
  • Integration: End-to-end pipeline on real/synthetic video
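The SRT-formatting tests above exercise the standard SubRip layout: a numeric index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line (comma before milliseconds), then the caption text. A minimal sketch, with `srt_timestamp` and `srt_block` as illustrative names rather than the exporter's real functions (see main/cc_suggester/output/srt.py):

```python
def srt_timestamp(seconds):
    # SubRip uses HH:MM:SS,mmm with a comma separating milliseconds.
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_block(index, start, end, text):
    """One SubRip cue: index, timing line, caption text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```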

Usage Quickstart

# Check environment
python -m cc_suggester doctor

# List available caption labels
python -m cc_suggester labels

# Run full analysis
python -m cc_suggester analyze video.mp4 --lang hi --device auto --out outputs/

# Export from JSON results
python -m cc_suggester export outputs/result.json --format srt --lang ta

# Launch Web UI
python -m cc_suggester web

Checklist

  • Pipeline runs on CPU (--device cpu) without errors
  • Pipeline runs with GPU detection (--device auto)
  • Audio extraction and DSP baseline produce event candidates
  • YAMNet backend produces confidence-scored classifications
  • Vision backends produce reaction scores on sampled frames
  • Decision engine correctly accepts/rejects based on thresholds
  • SRT output contains only accepted captions
  • JSON output contains full debug report (accepted + rejected)
  • CSV output is reviewer-friendly
  • CLI doctor reports environment correctly
  • CLI analyze runs end-to-end
  • Web UI launches and loads video
  • Web UI timeline markers and review panel work
  • Multilingual labels render correctly across all supported languages
  • Edge cases: missing video, invalid format, corrupt audio, no events detected
  • Tests pass: python -m pytest tests -v
  • README is up to date with current implementation

Future Work (Out of Scope for This PR)

  • PANNs / AST / BEATs advanced audio backends
  • MMPose / MMAction2 advanced vision backends
  • VLC plugin integration
  • Docker containerization
  • IndicTrans2 machine translation fallback
  • Real-time / live captioning
  • Speaker diarization
