Seamless multilingual video/audio translation and subtitle generation tool
Stream-Polyglot is a cross-platform video and audio translation application that leverages the SeamlessM4T API to provide high-quality speech-to-text translation, subtitle generation, and audio dubbing capabilities across 100+ languages.
- Video Translation: Extract audio from video files and translate speech to text in the target language
- Subtitle Generation: Automatically generate SRT/VTT subtitle files with accurate timestamps
- Bilingual Subtitles: Generate subtitles with both source and target languages
- Audio Dubbing: Replace original audio with translated speech (text-to-speech in target language)
- Voice Cloning Translation: Generate voice-cloned audio from bilingual SRT files using GPT-SoVITS
- Multi-format Support: Works with MP4, MKV, WebM, AVI, and various audio formats (WAV, MP3, FLAC, M4A, OGG)
- 100+ Languages: Supports speech input in 101 languages and text output in 96 languages via SeamlessM4T
- Cross-platform: Runs on Windows, macOS, and Linux
```
┌─────────────────┐
│   Video/Audio   │
│   Input Files   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────────┐
│     FFmpeg      │────▶│  m4t API Server  │
│  Audio Extract  │     │  (SeamlessM4T)   │
└────────┬────────┘     └────────┬─────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌──────────────────┐
│    Subtitle     │     │  TTS (Optional)  │
│    Generator    │     │  Audio Dubbing   │
└────────┬────────┘     └────────┬─────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
          ┌──────────────────┐
          │     Output:      │
          │  - Subtitles     │
          │  - Dubbed Audio  │
          │  - Merged Video  │
          └──────────────────┘
```
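A minimal sketch of this pipeline in Python; the `/translate` endpoint, its payload, and its response shape are assumptions for illustration, not the server's documented API:

```python
# Minimal pipeline sketch: extract audio with FFmpeg, send it to the
# m4t server, and print the translated text. Endpoint and payload are
# hypothetical; adjust to the real m4t API.
import subprocess
import requests

def extract_audio(video_path: str, wav_path: str) -> None:
    # 16 kHz mono WAV is a common input format for speech models
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )

def translate_audio(wav_path: str, src: str, tgt: str,
                    api_url: str = "http://localhost:8000") -> str:
    with open(wav_path, "rb") as f:
        resp = requests.post(
            f"{api_url}/translate",            # hypothetical endpoint
            files={"audio": f},
            data={"src_lang": src, "tgt_lang": tgt},
            timeout=300,
        )
    resp.raise_for_status()
    return resp.json()["text"]                 # assumed response field

if __name__ == "__main__":
    extract_audio("video.mp4", "audio.wav")
    print(translate_audio("audio.wav", "eng", "cmn"))
```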
- Python 3.10+: Main programming language
- FFmpeg: Video/audio processing and manipulation
- m4t API: SeamlessM4T translation backend (required)
- python-dotenv: Environment variable management
- requests: HTTP client for m4t API communication
- ffmpeg-python: FFmpeg wrapper for video/audio operations (planned)
- pysrt: SRT subtitle file parsing and generation (planned)
- webvtt-py: WebVTT subtitle format support (planned)
- soundfile: Audio file I/O (planned)
- numpy: Numerical operations for audio processing (planned)
- MoviePy: Alternative video editing library (user-friendly)
- PyAV: Low-level FFmpeg bindings for advanced control
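For reference, a plausible requirements.txt matching the lists above (version pins are illustrative; planned packages stay commented out until implemented):

```
python-dotenv>=1.0
requests>=2.31

# Planned
# ffmpeg-python>=0.2
# pysrt>=1.1
# webvtt-py>=0.4
# soundfile>=0.12
# numpy>=1.26
```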
- FFmpeg: Install FFmpeg on your system

  ```bash
  # Ubuntu/Debian
  sudo apt-get install ffmpeg

  # macOS
  brew install ffmpeg

  # Windows
  # Download from https://ffmpeg.org/download.html
  ```
- m4t API Server: Start the SeamlessM4T API server via Docker (at least 8 GB of GPU memory required)

  ```bash
  # Pull the m4t Docker image
  docker pull kllambda/m4t:v1.0.0

  # Run the m4t server (requires GPU)
  docker run -d \
    --name m4t-server \
    --gpus all \
    -p 8000:8000 \
    kllambda/m4t:v1.0.0

  # Check server status
  curl http://localhost:8000/health
  ```

  The server will be available at `http://localhost:8000` by default.
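The same health check can also be done from Python, using the `requests` dependency:

```python
# Quick health check against the m4t server before starting a job
import requests

def m4t_is_up(api_url: str = "http://localhost:8000") -> bool:
    try:
        return requests.get(f"{api_url}/health", timeout=5).ok
    except requests.ConnectionError:
        return False

print("m4t server reachable:", m4t_is_up())
```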
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/stream-polyglot.git
  cd stream-polyglot
  ```
- Create a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure the m4t API endpoint (optional):

  Option 1: Using a .env file (recommended)

  ```bash
  # Copy the example file
  cp .env.example .env

  # Edit .env and set M4T_API_URL
  # M4T_API_URL=http://localhost:8000
  ```

  Option 2: Using an environment variable

  ```bash
  export M4T_API_URL=http://localhost:8000
  ```

  Option 3: Using a command-line argument

  ```bash
  python main.py test --api-url http://localhost:8000
  ```
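A sketch of how the three options could be reconciled with `python-dotenv` and `argparse` (CLI flag over environment over `.env` over default; the actual precedence in `main.py` may differ):

```python
# Resolve M4T_API_URL: --api-url > environment > .env > default
import argparse
import os
from dotenv import load_dotenv

load_dotenv()  # merges .env values into os.environ without overriding existing ones

parser = argparse.ArgumentParser()
parser.add_argument("--api-url", default=None)
args = parser.parse_args()

api_url = args.api_url or os.getenv("M4T_API_URL", "http://localhost:8000")
print("Using m4t endpoint:", api_url)
```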
Generate Chinese subtitles for an English video:

```bash
python -m main video.mp4 --lang eng:cmn --subtitle
```

Generate bilingual subtitles (English + Chinese):

```bash
python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-source-lang
```

Output format (video.eng-cmn.srt):
```
1
00:00:01,000 --> 00:00:04,000
你好,今天怎么样?
Hello, how are you today?

2
00:00:05,000 --> 00:00:08,000
我很好,谢谢!
I'm doing great, thank you!
```
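The bilingual layout above (translated line first, source line second) could be produced with the planned `pysrt` dependency; a sketch using the sample data:

```python
# Build a bilingual SRT entry: target language first, source language second
import pysrt

subs = pysrt.SubRipFile()
subs.append(
    pysrt.SubRipItem(
        index=1,
        start=pysrt.SubRipTime(0, 0, 1, 0),   # 00:00:01,000
        end=pysrt.SubRipTime(0, 0, 4, 0),     # 00:00:04,000
        text="你好,今天怎么样?\nHello, how are you today?",
    )
)
subs.save("video.eng-cmn.srt", encoding="utf-8")
```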
NEW FEATURE: Automatically refine generated subtitles with an LLM (subtitle-refiner) to improve translation quality:

```bash
# Generate and automatically refine subtitles (bilingual mode enabled by default)
python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-refiner

# Explicitly specify bilingual mode (same result as above)
python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-source-lang --subtitle-refiner
```

Important: Using `--subtitle-refiner` automatically enables `--subtitle-source-lang` to generate bilingual subtitles, since the refiner works best with both source and target languages for context.
How it works:
- Generates bilingual subtitle file (source + target languages)
- Automatically runs subtitle-refiner on the generated SRT file
- LLM reviews and improves translations for accuracy and naturalness
- Outputs a refined subtitle file (e.g., `video.eng-cmn.refined.srt`)
Benefits:
- Better translations: LLM refines machine translations for naturalness
- Context awareness: Reviews subtitles in windowed context for coherence
- Bilingual by default: Both languages help LLM understand context better
- Automated workflow: No manual refinement step needed
- Preserves timing: Only improves text, keeps original timestamps
Requirements:
- subtitle-refiner must be installed at `../stream-polyglot-refiner/subtitle-refiner/`
- OpenAI API key configured in the subtitle-refiner environment
- Additional processing time: ~1-2 minutes per 100 subtitle entries
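subtitle-refiner lives in a separate repository, but the windowed refinement described above can be sketched as follows, with `refine_entry` as a placeholder for the real LLM call:

```python
# Sketch of windowed refinement: each entry is revised together with its
# neighbours so the LLM sees local context. refine_entry() is a stand-in
# for the actual LLM call inside subtitle-refiner.
import pysrt

def refine_entry(window: list[str], center: int) -> str:
    # Placeholder: prompt an LLM with the whole window and return the
    # improved text for window[center]. Here it is returned unchanged.
    return window[center]

def refine_srt(path: str, out_path: str, radius: int = 2) -> None:
    subs = pysrt.open(path)
    texts = [item.text for item in subs]
    for i, item in enumerate(subs):
        lo = max(0, i - radius)
        window = texts[lo:i + radius + 1]
        item.text = refine_entry(window, i - lo)   # timestamps stay untouched
    subs.save(out_path, encoding="utf-8")

refine_srt("video.eng-cmn.srt", "video.eng-cmn.refined.srt")
```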
Replace audio with translated speech:

```bash
python -m main video.mp4 --lang eng:jpn --audio
```

NEW FEATURE: Use `--split` to separate vocals from background music before timeline segmentation:
```bash
# Split audio before generating subtitles
python -m main video.mp4 --lang eng:cmn --subtitle --split

# Split audio before generating audio dubbing
python -m main video.mp4 --lang eng:jpn --audio --split

# Works with both subtitle and audio generation
python -m main video.mp4 --lang eng:cmn --subtitle --audio --split
```

How it works:
- Extracts audio from video using FFmpeg
- Detects audio duration and splits into chunks if needed (>5 minutes)
- Starts audio splitting in background thread (non-blocking)
- For short audio (<5 min): Single API call
- For long audio (>5 min): Client-side chunking, multiple API calls, automatic concatenation
- Continues immediately with VAD segmentation and ASR processing
- Split audio is saved asynchronously to the cache directory (`.stream-polyglot-cache/[video_name]/split/`):
  - `vocals.wav`: clean human voice/speech
  - `accompaniment.wav`: background music and other sounds
- Note: Original audio (not separated vocals) is used for VAD segmentation and ASR to maintain compatibility
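A sketch of the non-blocking flow described above, assuming a hypothetical `/split` route on the m4t server (the actual API may differ):

```python
# Fire-and-forget vocal separation: chunk long audio client-side, post
# each chunk to a /split endpoint (hypothetical name) in a background
# thread, and let the main pipeline continue with VAD/ASR immediately.
import math
import threading
import requests

CHUNK_SECONDS = 300  # 5-minute cap per request

def split_in_background(wav_path: str, duration: float,
                        api_url: str = "http://localhost:8000") -> threading.Thread:
    def worker() -> None:
        for i in range(math.ceil(duration / CHUNK_SECONDS)):
            with open(wav_path, "rb") as f:
                requests.post(
                    f"{api_url}/split",  # hypothetical endpoint
                    files={"audio": f},
                    data={"offset": i * CHUNK_SECONDS, "length": CHUNK_SECONDS},
                    timeout=600,
                )
        # Chunk results would then be concatenated into vocals.wav and
        # accompaniment.wav under .stream-polyglot-cache/<video_name>/split/

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # caller continues immediately; join() only if tracks are needed
```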
Benefits:
- No duration limit: Client-side chunking handles audio of any length
- No network timeout: Each chunk is sent separately (5-minute max per request)
- No performance impact: Audio splitting runs in background, doesn't delay subtitle/audio generation
- Audio archival: Saves separated vocals and accompaniment for later use
- Audio editing: Provides clean vocal and accompaniment tracks for video editing
- Cached for reuse: Split audio is saved for future processing
- Optional feature: Only used when the `--split` flag is set (normal processing without it)
Use cases:
- Archive separate vocal and music tracks from videos
- Provide clean audio sources for video editing workflows
- Extract vocals or accompaniment for remixing purposes
Requirements:
- m4t server must have Spleeter installed (`pip install spleeter`)
- Audio splitting runs asynchronously in the background (no processing delay)
NEW FEATURE: Generate voice-cloned audio using bilingual SRT subtitles with GPT-SoVITS voice cloning:
```bash
# Option 1: Use existing cached timeline (fast)
# First generate bilingual subtitles to create the cache
python -m main video.mp4 --lang eng:cmn --subtitle --subtitle-source-lang
# Then generate voice-cloned audio using the cache
python -m main video.mp4 --lang eng:cmn --trans-voice video.eng-cmn.srt

# Option 2: Direct voice cloning (automatic segmentation if no cache)
# If the cache doesn't exist, audio is automatically extracted and segmented
python -m main video.mp4 --lang eng:cmn --trans-voice video.eng-cmn.srt

# Option 3: Infer languages from the SRT filename
python -m main --trans-voice video.eng-cmn.srt

# Option 4: Use a fixed seed for reproducible voice cloning
python -m main video.mp4 --lang eng:cmn --trans-voice video.eng-cmn.srt --seed 42
```

Random Seed for Reproducibility:
- The `--seed` parameter controls the randomness in voice generation
- Default (no `--seed`): Generates one random seed at the start and uses it for ALL segments in that generation
  - This ensures consistency across all voice-cloned segments in a single run
  - Different runs will produce different (but internally consistent) results
- Fixed seed (`--seed 42`): Uses the same seed across runs for fully reproducible results
  - Same input + same seed = identical output audio
  - Useful for A/B testing, debugging, or when consistent output is required
How it works:
- Reads bilingual SRT file (target language + source language)
- Checks for cached timeline; if not found, automatically extracts audio and segments it
- Matches subtitle timing with cached audio fragments
- Uses cached fragment audio as reference for voice cloning
- Generates target language speech with cloned voice characteristics (using the same seed for all segments)
- Concatenates all segments into final audio track
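A minimal sketch of the seed behaviour, with `clone_voice` standing in for the GPT-SoVITS call:

```python
# One seed for the whole run: drawn once at the start unless --seed fixes it.
import random

def clone_voice(reference_wav: str, text: str, seed: int) -> bytes:
    # Stand-in for the GPT-SoVITS inference call; a real implementation
    # would return synthesized audio for `text` in the cloned voice.
    return b""

def clone_all_segments(segments: list[dict], fixed_seed: int | None = None) -> list[bytes]:
    seed = fixed_seed if fixed_seed is not None else random.randrange(2**32)
    # The same seed is reused for every segment, so voice characteristics
    # stay consistent within one run; a new run draws a new seed.
    return [clone_voice(s["reference_wav"], s["text"], seed=seed) for s in segments]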
Benefits:
- Preserves original speaker's voice characteristics
- Better than generic TTS (more natural and expressive)
- Reuses cached timeline data for fast processing
- Automatic segmentation if cache doesn't exist
- Perfect for dubbing videos while maintaining voice identity
- Consistent voice characteristics across all segments (same seed used for all cloning operations)
Create both a subtitle file and dubbed audio:

```bash
# Generate subtitles and dubbed audio in one run
python -m main video.mp4 --lang eng:cmn --subtitle --audio

# Write output to a custom directory
python -m main video.mp4 --lang jpn:eng --subtitle --output ./translated/

# Point at a remote m4t API server
python -m main video.mp4 --lang eng:cmn --subtitle --api-url http://192.168.1.100:8000

# Show all available options
python -m main --help
```

SeamlessM4T supports 101 languages for speech input and 96 languages for text output.
| Language | Code | Speech Input | Text Output |
|---|---|---|---|
| English | eng | ✓ | ✓ |
| Chinese (Simplified) | cmn | ✓ | ✓ |
| Chinese (Traditional) | cmn_Hant | ✓ | ✓ |
| Japanese | jpn | ✓ | ✓ |
| Korean | kor | ✓ | ✓ |
| French | fra | ✓ | ✓ |
| German | deu | ✓ | ✓ |
| Spanish | spa | ✓ | ✓ |
| Russian | rus | ✓ | ✓ |
| Arabic | arb | ✓ | ✓ |
See the m4t documentation for the complete language list.
SRT example:

```
1
00:00:01,000 --> 00:00:04,000
Hello, how are you today?

2
00:00:05,000 --> 00:00:08,000
I'm doing great, thank you!
```

WebVTT example:

```
WEBVTT

00:00:01.000 --> 00:00:04.000
Hello, how are you today?

00:00:05.000 --> 00:00:08.000
I'm doing great, thank you!
```

| Format | Extension | Subtitle Support | Audio Tracks | Recommended Use Case |
|---|---|---|---|---|
| MP4 | .mp4 | Limited (mov_text) | Single | Universal compatibility |
| MKV | .mkv | Excellent (SRT, ASS, multiple tracks) | Multiple | Professional archiving |
| WebM | .webm | WebVTT | Single | Web streaming |
| AVI | .avi | Poor (external only) | Single | Legacy support |
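The two subtitle formats differ mainly in the header and the millisecond separator (comma in SRT, dot in WebVTT), so a basic conversion fits in a few lines; a minimal sketch (the planned `webvtt-py` dependency would handle edge cases properly):

```python
# Minimal SRT -> WebVTT conversion: add the WEBVTT header and switch the
# millisecond separator from comma to dot in timestamp lines.
import re

def srt_to_vtt(srt_path: str, vtt_path: str) -> None:
    with open(srt_path, encoding="utf-8") as f:
        srt = f.read()
    vtt = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt)
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n" + vtt)

srt_to_vtt("video.eng-cmn.srt", "video.eng-cmn.vtt")
```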
- Subtitle Generation: ~0.2-0.5x real-time (5-minute video → 10-25 minutes)
- Audio Dubbing: ~0.1-0.3x real-time (requires TTS for entire audio)
- Video Remuxing: Near real-time (no re-encoding)
- CPU: Multi-core recommended for parallel processing
- RAM: 4GB minimum, 8GB+ recommended
- GPU: Optional (m4t server uses GPU for inference)
- Storage: ~10x video file size for temporary processing
```
stream-polyglot/
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── setup.py                      # Package installation
├── stream_polyglot/              # Main package
│   ├── __init__.py
│   ├── video_processor.py        # FFmpeg integration
│   ├── m4t_client.py             # m4t API client
│   ├── subtitle_generator.py     # SRT/VTT generation
│   ├── translator.py             # Main orchestration
│   └── utils.py                  # Helper functions
├── tests/                        # Unit tests
│   ├── test_video_processor.py
│   ├── test_m4t_client.py
│   └── test_subtitle_generator.py
└── examples/                     # Example scripts
    ├── basic_translation.py
    ├── batch_processing.py
    └── advanced_dubbing.py
```
- Project initialization
- FFmpeg video processor
- m4t API client
- Basic subtitle generation (SRT)
- CLI interface
- Audio dubbing functionality
- WebVTT subtitle support
- Batch processing
- Progress tracking
- Multi-track subtitle support (MKV)
- Subtitle timing adjustment
- Audio/video synchronization
- Configuration file support
- GUI interface (Streamlit/Qt)
- Advanced subtitle editing
- Plugin system
- Performance optimizations
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Q: "FFmpeg not found" error
# Verify FFmpeg installation
ffmpeg -version
# Add FFmpeg to PATH (Windows)
# Add C:\path\to\ffmpeg\bin to System Environment VariablesQ: "m4t API connection refused"
```bash
# Check if the m4t server is running
curl http://localhost:8000/health

# Start the m4t server
cd ~/work/m4t
./start_dev.sh
```

Q: "Audio/subtitle out of sync"
```bash
# Use the --sync-offset parameter to adjust timing
python stream-polyglot.py translate \
  --input video.mp4 \
  --sync-offset 0.5  # Delay subtitles by 0.5 seconds
```
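If the offset flag isn't available in your build, the same adjustment can be made directly with the planned `pysrt` dependency:

```python
# Shift all subtitle timestamps by +0.5 s using pysrt
import pysrt

subs = pysrt.open("video.eng-cmn.srt")
subs.shift(seconds=0.5)   # positive delays subtitles, negative advances them
subs.save("video.eng-cmn.synced.srt", encoding="utf-8")
```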
MIT License - see LICENSE file for details

- SeamlessM4T (Meta AI): Multilingual translation model
- FFmpeg: Video/audio processing framework
- Community contributors and testers
- Issues: https://github.com/yourusername/stream-polyglot/issues
- Discussions: https://github.com/yourusername/stream-polyglot/discussions
Stream-Polyglot - Breaking language barriers, one frame at a time.