
Stream-Polyglot

Seamless multilingual video/audio translation and subtitle generation tool

Stream-Polyglot is a cross-platform video and audio translation application that leverages the SeamlessM4T API to provide high-quality speech-to-text translation, subtitle generation, and audio dubbing capabilities across 100+ languages.

Features

  • Video Translation: Extract audio from video files and translate speech to text in the target language
  • Subtitle Generation: Automatically generate SRT/VTT subtitle files with accurate timestamps
  • Bilingual Subtitles: Generate subtitles with both source and target languages
  • Audio Dubbing: Replace original audio with translated speech (text-to-speech in target language)
  • Voice Cloning Translation: Generate voice-cloned audio from bilingual SRT files using GPT-SoVITS
  • Multi-format Support: Works with MP4, MKV, WebM, AVI, and various audio formats (WAV, MP3, FLAC, M4A, OGG)
  • 100+ Languages: Supports speech input in 101 languages and text output in 96 languages via SeamlessM4T
  • Cross-platform: Runs on Windows, macOS, and Linux

Architecture

┌─────────────────┐
│  Video/Audio    │
│  Input Files    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────────┐
│  FFmpeg         │────▶│  m4t API Server  │
│  Audio Extract  │     │  (SeamlessM4T)   │
└─────────┬───────┘     └────────┬─────────┘
          │                      │
          │                      │
          ▼                      ▼
┌─────────────────┐     ┌──────────────────┐
│  Subtitle       │     │  TTS (Optional)  │
│  Generator      │     │  Audio Dubbing   │
└─────────┬───────┘     └────────┬─────────┘
          │                      │
          └──────────┬───────────┘
                     ▼
          ┌──────────────────┐
          │  Output:         │
          │  - Subtitles     │
          │  - Dubbed Audio  │
          │  - Merged Video  │
          └──────────────────┘
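
The pipeline above maps onto a few small stages. Below is a minimal sketch of the first two (FFmpeg extraction and the m4t call), assuming a hypothetical /translate endpoint and payload shape for illustration; the real client lives in stream_polyglot/m4t_client.py:

import subprocess
import requests

M4T_API_URL = "http://localhost:8000"  # default from the Installation section

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract a mono 16 kHz WAV track from the video with FFmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )

def translate_audio(wav_path: str, src_lang: str, tgt_lang: str) -> dict:
    """Send extracted audio to the m4t server for speech-to-text translation."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            f"{M4T_API_URL}/translate",  # hypothetical endpoint name
            files={"audio": f},
            data={"src_lang": src_lang, "tgt_lang": tgt_lang},
            timeout=300,
        )
    resp.raise_for_status()
    return resp.json()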

Technology Stack

Core Dependencies

  • Python 3.10+: Main programming language
  • FFmpeg: Video/audio processing and manipulation
  • m4t API: SeamlessM4T translation backend (required)

Python Libraries

  • python-dotenv: Environment variable management
  • requests: HTTP client for m4t API communication
  • ffmpeg-python: FFmpeg wrapper for video/audio operations (planned)
  • pysrt: SRT subtitle file parsing and generation (planned)
  • webvtt-py: WebVTT subtitle format support (planned)
  • soundfile: Audio file I/O (planned)
  • numpy: Numerical operations for audio processing (planned)

Optional

  • MoviePy: Alternative video editing library (user-friendly)
  • PyAV: Low-level FFmpeg bindings for advanced control

Installation

Prerequisites

  1. FFmpeg: Install FFmpeg on your system

    # Ubuntu/Debian
    sudo apt-get install ffmpeg
    
    # macOS
    brew install ffmpeg
    
    # Windows
    # Download from https://ffmpeg.org/download.html
  2. m4t API Server: Start the SeamlessM4T API server with Docker (at least 8 GB of GPU memory required)

    # Pull the m4t Docker image
    docker pull kllambda/m4t:v1.0.0
    
    # Run the m4t server (requires GPU)
    docker run -d \
      --name m4t-server \
      --gpus all \
      -p 8000:8000 \
      kllambda/m4t:v1.0.0
    
    # Check server status
    curl http://localhost:8000/health

    The server will be available at http://localhost:8000 by default.
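
From Python, the same health check can gate processing until the server is ready. A small sketch using requests (the /health path is taken from the commands above; model loading can take a while after the container starts):

import time
import requests

def wait_for_m4t(url: str = "http://localhost:8000", timeout: float = 60.0) -> bool:
    """Poll the m4t /health endpoint until the server answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{url}/health", timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server still starting up
        time.sleep(2)
    return False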

Setup

  1. Clone the repository:

    git clone https://github.com/k-l-lambda/stream-polyglot.git
    cd stream-polyglot
  2. Create virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure m4t API endpoint (optional):

    Option 1: Using .env file (recommended)

    # Copy the example file
    cp .env.example .env
    
    # Edit .env and set M4T_API_URL
    # M4T_API_URL=http://localhost:8000

    Option 2: Using environment variable

    export M4T_API_URL=http://localhost:8000

    Option 3: Using command-line argument

    python main.py test --api-url http://localhost:8000
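
All three options can coexist. A sketch of a plausible precedence (CLI flag over environment/.env over the built-in default), using python-dotenv from the dependency list; the exact resolution order in the project may differ:

import os
from dotenv import load_dotenv

def resolve_api_url(cli_api_url: str | None = None) -> str:
    """Resolve M4T_API_URL: CLI flag > environment / .env > built-in default."""
    load_dotenv()  # merges .env into os.environ without overriding variables already set
    if cli_api_url:
        return cli_api_url
    return os.environ.get("M4T_API_URL", "http://localhost:8000")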

Usage

Generate Subtitles

Generate Chinese subtitles for an English video:

python -m main video.mp4 --lang eng:cmn --subtitle

Generate Bilingual Subtitles

Generate bilingual subtitles (English + Chinese):

python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-source-lang

Output format (video.eng-cmn.srt):

1
00:00:01,000 --> 00:00:04,000
你好,今天怎么样?
Hello, how are you today?

2
00:00:05,000 --> 00:00:08,000
我很好,谢谢!
I'm doing great, thank you!

Automatic Subtitle Refinement with LLM

NEW FEATURE: Automatically refine generated subtitles with an LLM (subtitle-refiner) to improve translation quality:

# Generate and automatically refine subtitles (bilingual mode enabled by default)
python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-refiner

# Explicitly specify bilingual mode (same result as above)
python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-source-lang --subtitle-refiner

Important: Using --subtitle-refiner automatically enables --subtitle-source-lang to generate bilingual subtitles, as the refiner works best with both source and target languages for context.

How it works:

  1. Generates bilingual subtitle file (source + target languages)
  2. Automatically runs subtitle-refiner on the generated SRT file
  3. An LLM reviews and improves translations for accuracy and naturalness (a windowed-review sketch follows the requirements below)
  4. Outputs refined subtitle file (e.g., video.eng-cmn.refined.srt)

Benefits:

  • Better translations: LLM refines machine translations for naturalness
  • Context awareness: Reviews subtitles in windowed context for coherence
  • Bilingual by default: Both languages help LLM understand context better
  • Automated workflow: No manual refinement step needed
  • Preserves timing: Only improves text, keeps original timestamps

Requirements:

  • subtitle-refiner must be installed at ../stream-polyglot-refiner/subtitle-refiner/
  • OpenAI API key configured in subtitle-refiner environment
  • Additional processing time: ~1-2 minutes per 100 subtitle entries
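
subtitle-refiner is a separate project, but the windowed review in step 3 can be sketched: feed the LLM a small window of consecutive bilingual entries so each correction sees its neighbours. The prompt, model name, and window handling below are illustrative assumptions, not the refiner's actual code:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment, as required above

def refine_window(entries: list[tuple[str, str]]) -> str:
    """Improve target-language lines, given source lines for context.

    entries: (source_text, target_text) pairs forming one context window.
    """
    numbered = "\n".join(
        f"{i + 1}. SRC: {src}\n   TGT: {tgt}" for i, (src, tgt) in enumerate(entries)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Improve the target-language translations for accuracy and naturalness. Keep the numbering; do not merge or split entries."},
            {"role": "user", "content": numbered},
        ],
    )
    return resp.choices[0].message.content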

Generate Audio Dubbing

Replace audio with translated speech:

python -m main video.mp4 --lang eng:jpn --audio

Split Audio into Vocals and Accompaniment

NEW FEATURE: Use --split to separate vocals from background music before timeline segmentation:

# Split audio before generating subtitles
python -m main video.mp4 --lang eng:cmn --subtitle --split

# Split audio before generating audio dubbing
python -m main video.mp4 --lang eng:jpn --audio --split

# Works with both subtitle and audio generation
python -m main video.mp4 --lang eng:cmn --subtitle --audio --split

How it works:

  1. Extracts audio from video using FFmpeg
  2. Detects audio duration and splits into chunks if needed (>5 minutes)
  3. Starts audio splitting in background thread (non-blocking)
    • For short audio (<5 min): Single API call
    • For long audio (>5 min): Client-side chunking, multiple API calls, automatic concatenation (sketched after this list)
  4. Continues immediately with VAD segmentation and ASR processing
  5. Split audio is saved asynchronously to cache directory (.stream-polyglot-cache/[video_name]/split/)
    • vocals.wav: Clean human voice/speech
    • accompaniment.wav: Background music and other sounds
  6. Note: Original audio (not separated vocals) is used for VAD segmentation and ASR to maintain compatibility
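
A sketch of the client-side chunking from step 3, assuming a hypothetical /split endpoint that accepts at most five minutes of audio per request and returns the vocals chunk as WAV; soundfile and numpy (both on the planned dependency list) handle the slicing and concatenation:

import io
import numpy as np
import requests
import soundfile as sf

CHUNK_SECONDS = 300  # 5-minute per-request limit noted above

def split_vocals(wav_path: str, api_url: str = "http://localhost:8000") -> np.ndarray:
    """Separate vocals by sending <= 5-minute chunks and concatenating the results."""
    audio, sr = sf.read(wav_path)
    samples_per_chunk = CHUNK_SECONDS * sr
    parts = []
    for start in range(0, len(audio), samples_per_chunk):
        buf = io.BytesIO()
        sf.write(buf, audio[start:start + samples_per_chunk], sr, format="WAV")
        buf.seek(0)
        # Hypothetical endpoint; assume the response body is the separated vocals as WAV
        resp = requests.post(f"{api_url}/split", files={"audio": buf}, timeout=600)
        resp.raise_for_status()
        parts.append(sf.read(io.BytesIO(resp.content))[0])
    return np.concatenate(parts)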

Benefits:

  • No duration limit: Client-side chunking handles audio of any length
  • No network timeout: Each chunk is sent separately (5-minute max per request)
  • No performance impact: Audio splitting runs in background, doesn't delay subtitle/audio generation
  • Audio archival: Saves separated vocals and accompaniment for later use
  • Audio editing: Provides clean vocal and accompaniment tracks for video editing
  • Cached for reuse: Split audio is saved for future processing
  • Optional feature: Only used when --split flag is set (normal processing without it)

Use cases:

  • Archive separate vocal and music tracks from videos
  • Provide clean audio sources for video editing workflows
  • Extract vocals or accompaniment for remixing purposes

Requirements:

  • m4t server must have Spleeter installed (pip install spleeter)
  • Audio splitting runs asynchronously in background (no processing delay)
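
On the server side, the separation presumably uses Spleeter's standard two-stem model; a sketch of that general technique (not the m4t server's actual code):

from spleeter.separator import Separator

# Two-stem model: vocals + accompaniment, matching the cached file names above
separator = Separator("spleeter:2stems")

# Writes vocals.wav and accompaniment.wav under output/<input_basename>/
separator.separate_to_file("input.wav", "output")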

Generate Voice-Cloned Audio from Bilingual Subtitles

NEW FEATURE: Generate voice-cloned audio using bilingual SRT subtitles with GPT-SoVITS voice cloning:

# Option 1: Use existing cached timeline (fast)
# First generate bilingual subtitle to create cache
python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-source-lang

# Then generate voice-cloned audio using cache
python -m main video.mp4 --lang eng:cmn --trans-voice video.eng-cmn.srt

# Option 2: Direct voice cloning (automatic segmentation if no cache)
# If cache doesn't exist, it will automatically extract audio and segment it
python -m main video.mp4 --lang eng:cmn --trans-voice video.eng-cmn.srt

# Option 3: Infer language from SRT filename
python -m main --trans-voice video.eng-cmn.srt

# Option 4: Use fixed seed for reproducible voice cloning
python -m main video.mp4 --lang eng:cmn --trans-voice video.eng-cmn.srt --seed 42

Random Seed for Reproducibility:

  • The --seed parameter controls the randomness in voice generation (a sketch of the seeding logic follows this list)
  • Default (no --seed): Generates one random seed at the start and uses it for ALL segments in that generation
    • This ensures consistency across all voice cloned segments in a single run
    • Different runs will produce different (but internally consistent) results
  • Fixed seed (--seed 42): Uses the same seed across runs for fully reproducible results
    • Same input + same seed = identical output audio
    • Useful for A/B testing, debugging, or when consistent output is required
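
A sketch of the seeding behaviour described above: one seed is drawn (or taken from --seed) before the loop and passed to every segment, so all clones in a run share the same randomness. clone_voice stands in for the real GPT-SoVITS call:

import random

def clone_voice(segment, seed: int) -> None:
    """Placeholder for the GPT-SoVITS inference call."""
    ...

def clone_all_segments(segments, seed: int | None = None) -> None:
    """One seed for every segment: consistent within a run, reproducible with --seed."""
    run_seed = seed if seed is not None else random.randint(0, 2**32 - 1)
    for segment in segments:
        clone_voice(segment, seed=run_seed)  # same seed for ALL segments in this run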

How it works:

  1. Reads bilingual SRT file (target language + source language)
  2. Checks for cached timeline; if not found, automatically extracts audio and segments it
  3. Matches subtitle timing with cached audio fragments
  4. Uses cached fragment audio as reference for voice cloning
  5. Generates target language speech with cloned voice characteristics (using the same seed for all segments)
  6. Concatenates all segments into final audio track

Benefits:

  • Preserves original speaker's voice characteristics
  • Better than generic TTS (more natural and expressive)
  • Reuses cached timeline data for fast processing
  • Automatic segmentation if cache doesn't exist
  • Perfect for dubbing videos while maintaining voice identity
  • Consistent voice characteristics across all segments (same seed used for all cloning operations)

Generate Both Subtitles and Audio

Create both subtitle file and dubbed audio:

python -m main video.mp4 --lang eng:cmn --subtitle --audio

Specify Custom Output Directory

python -m main video.mp4 --lang jpn:eng --subtitle --output ./translated/

Use Custom API Server

python -m main video.mp4 --lang eng:cmn --subtitle --api-url http://192.168.1.100:8000

View All Options

python -m main --help

Supported Languages

SeamlessM4T supports 101 languages for speech input and 96 languages for text output.

Popular Languages

Language                Code        Speech Input    Text Output
English                 eng         ✓               ✓
Chinese (Simplified)    cmn         ✓               ✓
Chinese (Traditional)   cmn_Hant    —               ✓
Japanese                jpn         ✓               ✓
Korean                  kor         ✓               ✓
French                  fra         ✓               ✓
German                  deu         ✓               ✓
Spanish                 spa         ✓               ✓
Russian                 rus         ✓               ✓
Arabic                  arb         ✓               ✓

(Traditional Chinese is a text-script variant; speech input uses the cmn code.)

See m4t documentation for the complete language list.

Subtitle Formats

SRT (SubRip)

1
00:00:01,000 --> 00:00:04,000
Hello, how are you today?

2
00:00:05,000 --> 00:00:08,000
I'm doing great, thank you!

VTT (WebVTT)

WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, how are you today?

00:00:05.000 --> 00:00:08.000
I'm doing great, thank you!
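
Both formats can be produced with the libraries on the planned dependency list (pysrt and webvtt-py); a sketch writing the first cue above in each format:

import pysrt
from webvtt import WebVTT, Caption

# SRT via pysrt: comma-separated milliseconds, numbered cues
srt = pysrt.SubRipFile()
srt.append(pysrt.SubRipItem(
    index=1,
    start=pysrt.SubRipTime(0, 0, 1, 0),   # hours, minutes, seconds, milliseconds
    end=pysrt.SubRipTime(0, 0, 4, 0),
    text="Hello, how are you today?",
))
srt.save("out.srt", encoding="utf-8")

# WebVTT via webvtt-py: dot-separated milliseconds, no cue numbers required
vtt = WebVTT()
vtt.captions.append(Caption("00:00:01.000", "00:00:04.000", "Hello, how are you today?"))
vtt.save("out.vtt")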

Container Format Support

Format   Extension   Subtitle Support                        Audio Tracks   Recommended Use Case
MP4      .mp4        Limited (mov_text)                      Single         Universal compatibility
MKV      .mkv        Excellent (SRT, ASS, multiple tracks)   Multiple       Professional archiving
WebM     .webm       WebVTT                                  Single         Web streaming
AVI      .avi        Poor (external only)                    Single         Legacy support
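
Muxing a generated subtitle track into a container is a remux, not a re-encode, which is why the Performance section below lists it as near real-time. A sketch with subprocess and FFmpeg's stream-copy flags, picking the subtitle codec per container:

import subprocess

def mux_subtitles(video: str, srt: str, output: str) -> None:
    """Remux a subtitle track without re-encoding the audio/video streams."""
    # MP4 only carries mov_text subtitles; MKV accepts SRT directly
    subtitle_codec = "mov_text" if output.endswith(".mp4") else "srt"
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-i", srt,
         "-c:v", "copy", "-c:a", "copy", "-c:s", subtitle_codec, output],
        check=True,
    )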

Performance

Processing Speed

  • Subtitle Generation: ~0.2-0.5x real-time (5-minute video → 10-25 minutes)
  • Audio Dubbing: ~0.1-0.3x real-time (requires TTS for entire audio)
  • Video Remuxing: Near real-time (no re-encoding)

Hardware Requirements

  • CPU: Multi-core recommended for parallel processing
  • RAM: 4GB minimum, 8GB+ recommended
  • GPU: Optional (m4t server uses GPU for inference)
  • Storage: ~10x video file size for temporary processing

Project Structure

stream-polyglot/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── setup.py                  # Package installation
├── stream_polyglot/          # Main package
│   ├── __init__.py
│   ├── video_processor.py    # FFmpeg integration
│   ├── m4t_client.py         # m4t API client
│   ├── subtitle_generator.py # SRT/VTT generation
│   ├── translator.py         # Main orchestration
│   └── utils.py              # Helper functions
├── tests/                    # Unit tests
│   ├── test_video_processor.py
│   ├── test_m4t_client.py
│   └── test_subtitle_generator.py
└── examples/                 # Example scripts
    ├── basic_translation.py
    ├── batch_processing.py
    └── advanced_dubbing.py

Roadmap

v0.1.0 (Current)

  • Project initialization
  • FFmpeg video processor
  • m4t API client
  • Basic subtitle generation (SRT)
  • CLI interface

v0.2.0

  • Audio dubbing functionality
  • WebVTT subtitle support
  • Batch processing
  • Progress tracking

v0.3.0

  • Multi-track subtitle support (MKV)
  • Subtitle timing adjustment
  • Audio/video synchronization
  • Configuration file support

v1.0.0

  • GUI interface (Streamlit/Qt)
  • Advanced subtitle editing
  • Plugin system
  • Performance optimizations

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Troubleshooting

Common Issues

Q: "FFmpeg not found" error

# Verify FFmpeg installation
ffmpeg -version

# Add FFmpeg to PATH (Windows)
# Add C:\path\to\ffmpeg\bin to System Environment Variables

Q: "m4t API connection refused"

# Check if m4t server is running
curl http://localhost:8000/health

# Start the m4t server (Docker install)
docker start m4t-server

# Or, for a local development checkout:
cd ~/work/m4t
./start_dev.sh

Q: "Audio/subtitle out of sync"

# Use the --sync-offset parameter to adjust timing
python -m main video.mp4 --lang eng:cmn --subtitle \
  --sync-offset 0.5  # Delay subtitles by 0.5 seconds
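
If your build lacks the flag, the same adjustment can be applied directly to the SRT file with pysrt:

import pysrt

subs = pysrt.open("video.eng-cmn.srt")
subs.shift(milliseconds=500)  # delay by 0.5 s; use a negative value to advance
subs.save("video.eng-cmn.synced.srt", encoding="utf-8")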

License

MIT License - see LICENSE file for details

Acknowledgments

  • SeamlessM4T (Meta AI): Multilingual translation model
  • FFmpeg: Video/audio processing framework
  • Community contributors and testers


Stream-Polyglot - Breaking language barriers, one frame at a time.
