Local • Free • Offline
Professional-grade voice processing in a single tool.
VODER brings together 10 processing modes under one interface — speech-to-text, text-to-speech, voice conversion, music generation, speech enhancement, sound effects, vocal separation, language dubbing, speaker diarization, and multi-speaker separation. It runs entirely on your machine, needs no subscription, and works with or without a GPU.
- Multi-Speaker Dialogue System — Write scripts with multiple characters, each with a distinct voice. Control per-line timing, volume, and duration with script directives. Embed sound effects directly into dialogue lines and generate automatic background music that matches the spoken duration.
- Voice Design & Cloning — Describe a voice in plain English and VODER generates it, or provide a reference clip to clone a speaker's voice. Mix designed and cloned voices within the same dialogue.
- Speaker Separation — Extract individual speakers from multi-speaker recordings into separate audio files, each with a speaker-labeled transcript.
- Voice Conversion with Video I/O — Transform one voice into another while preserving words, emotion, and timing. Drop in an MP4 and get back a video with the converted voice.
- Music Generation & Manipulation — Generate full songs from lyrics and style descriptions. Remix, repaint, complete, extract stems, build individual instrument tracks, or replace background music in existing audio/video. Output up to 12 separate instrument tracks.
- Speech-to-Text with Intelligence — Transcribe audio, video, images, or direct URLs. Translate to English from 99 languages. Identify who spoke when with speaker diarization. Batch process multiple files.
- Language Dubbing — Translate speech from one language to another while preserving the original speaker's voice identity.
- Smart Input Pipeline — Paste a YouTube, Bilibili, or TikTok URL directly as input. Feed an image and VODER extracts text via OCR. Automatically extract voice clips from multi-speaker audio for one-click voice cloning.
Describe a voice in plain English — "deep male voice, authoritative" — and VODER generates speech that matches. Or provide a reference audio clip and VODER clones the speaker's voice from it. Both approaches can be mixed in the same dialogue: some characters designed, others cloned.
Write scripts with multiple characters, each with a distinct voice. VODER assembles the full dialogue into a single audio file with per-line control over timing, volume, and duration via script directives (/time, /level, /duration). Embed sound effects directly into dialogue lines using the special sfx: character — door creaks, applause, rain — generated on the fly from text descriptions.
When generating dialogue, VODER can produce a background music track that exactly matches the spoken duration, mixed at a configurable volume with fade transitions. An optional reference audio clip can be provided for stylistic guidance; the reference is first processed through SVS to extract a clean instrumental. No manual editing or external tools needed.
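The mixing step described above (music fitted to the spoken duration, attenuated, and faded in and out) can be illustrated with a small dependency-free sketch. This is not VODER's implementation: the function name `mix_background`, the dB gain parameter, the looping of short music, and the linear fade shape are all assumptions chosen for illustration.

```python
def mix_background(speech, music, music_gain_db=-18.0, fade_samples=4):
    """Mix `music` under `speech`, matching the speech length.

    Hypothetical helper: samples are plain floats in [-1.0, 1.0].
    """
    n = len(speech)
    if n == 0:
        return []
    if not music:
        return list(speech)
    # Loop or trim the music so it covers exactly the spoken duration.
    bed = [music[i % len(music)] for i in range(n)]
    # Convert the dB gain to a linear amplitude factor.
    gain = 10 ** (music_gain_db / 20.0)
    # A fade cannot be longer than half the track.
    fade = max(0, min(fade_samples, n // 2))
    mixed = []
    for i in range(n):
        # Linear fade-in at the start, linear fade-out at the end.
        if fade and i < fade:
            env = i / fade
        elif fade and i >= n - fade:
            env = (n - 1 - i) / fade
        else:
            env = 1.0
        sample = speech[i] + bed[i] * gain * env
        # Hard-clip so the result stays in range.
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed
```

A real pipeline would work on sample arrays at the audio sample rate and would likely use a smoother fade curve; the point here is only the order of operations: fit length, apply gain, apply fade envelope, sum.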
Transform one voice into another while preserving the original words, emotion, and timing. Video input and output are supported: drop in an MP4 and get back a video with the converted voice. For music, VODER switches to a high-fidelity 44.1 kHz model. A mimic mode transfers not just the voice timbre but also the accent and speaking style.
Generate full songs from lyrics and style descriptions. Beyond basic generation, VODER supports 6 sub-tasks: remix (style transfer with bias control), repaint (restyle a specific time range), complete (add missing instruments), lego (build individual tracks), extract (isolate specific stems), and bgm (replace background music in existing audio/video with generated music at a configurable volume). Output up to 12 individual instrument tracks for post-production. A three-tier quality system lets you trade speed for output quality.
Isolate clean vocals from any song, or extract the instrumental. Works with audio files, videos, and direct YouTube URLs. This separation engine also runs automatically behind the scenes in TTS (to clean voice cloning references), STS (to improve conversion quality), and STT (to pre-clean audio before transcription).
Transcribe audio, video, images, or direct URLs to text. Supports translation to English from 99 languages, speaker diarization (who spoke when), and batch processing of multiple files. An overdose mode using Microsoft VibeVoice ASR delivers higher-quality transcription with built-in speaker identification.
Remove noise, reduce room echo, and restore clarity from degraded recordings. Works on audio and video files alike.
Extract individual speakers from multi-speaker recordings into separate audio files, each with a speaker-labeled transcript.
Translate speech from one language to another while preserving the original speaker's voice identity. Accepts audio files and YouTube URLs.
Paste a YouTube, Bilibili, or TikTok URL directly as input — VODER downloads and processes it automatically. Feed an image containing text and VODER extracts it via OCR for TTS processing. Automatic voice clip extraction from multi-speaker audio enables one-click voice cloning for dialogue characters.
git clone https://github.com/HAKORADev/VODER.git && cd VODER
pip install -r requirements.txt && pip install --upgrade protobuf==5.29.6
# GUI
python src/voder.py
# CLI (interactive)
python src/voder.py cli
# One-liner examples
python src/voder.py tts script "Hello world" voice "female, cheerful"
python src/voder.py stt "audio.wav" timestamp dialogue
python src/voder.py sts base "input.wav" target "voice.wav"
python src/voder.py ttm lyrics "Walking down the street" styling "upbeat pop" 30
python src/voder.py svs "song.mp3" voice
python src/voder.py ss "meeting.wav"
python src/voder.py slc "spanish_speech.wav"
python src/voder.py se "noisy_recording.wav"
python src/voder.py sfx sound "thunder rumbling" duration 10
Run in Colab, no installation needed: Open in Google Colab
FFmpeg is required for audio processing. Install via your system package manager. See READ.md for all setup details.
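For scripted use, the one-liners above can be driven from Python. The sketch below checks that FFmpeg is on the PATH and builds one `se` command per file, using only the argument form shown above. The helper names (`ffmpeg_available`, `build_se_command`, `enhance_all`) are hypothetical, and the script path assumes the commands run from the repository root.

```python
import shutil
import subprocess
import sys


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def build_se_command(audio_path, script="src/voder.py"):
    """Build the argv list for: python src/voder.py se "<file>"."""
    return [sys.executable, script, "se", str(audio_path)]


def enhance_all(paths, dry_run=True):
    """Run speech enhancement on each file.

    With dry_run=True, nothing is executed and the commands are only returned.
    """
    commands = [build_se_command(p) for p in paths]
    if dry_run:
        return commands
    if not ffmpeg_available():
        raise RuntimeError("FFmpeg not found on PATH; install it first.")
    for cmd in commands:
        # check=True raises CalledProcessError if the process exits with an error.
        subprocess.run(cmd, check=True)
    return commands
```

Calling `enhance_all(["a.wav", "b.wav"], dry_run=False)` would then process the files one after another. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.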
| Mode | What It Does | Input | Output |
|---|---|---|---|
| TTS | Generate speech from text, design or clone voices | Text / Image / URL | Audio |
| STS | Convert one voice to another | Audio / Video | Audio / Video |
| TTM | Generate, remix, repaint, bgm, and manipulate music | Text + Audio | Audio / Stems |
| STT | Transcribe audio, translate, identify speakers | Audio / Video / Image / URL | Text |
| SE | Denoise, dereverb, restore speech | Audio / Video | Audio / Video |
| SFX | Generate sound effects from text | Text | Audio |
| SVS | Isolate vocals from music | Audio / Video / URL | Audio |
| SLC | Dub speech to another language, keep voice | Audio / URL | Audio |
| SS | Extract individual speakers | Audio / Video | Audio per speaker |
| STT+TTS | Transcribe, edit, resynthesize (interactive) | Audio | Audio |
VODER orchestrates state-of-the-art open-source models — each selected for quality:
| Capability | Model |
|---|---|
| Speech Recognition | Whisper |
| Voice Synthesis & Cloning | Qwen3-TTS |
| Voice Conversion | Seed-VC |
| Music Generation | ACE-Step |
| Sound Effects | TangoFlux |
| Speech Enhancement | UniSE |
| Vocal / Music Separation | BS-RoFormer |
| Advanced ASR & Diarization | VibeVoice |
| Speaker Diarization | pyannote |
| Image Text Extraction | EasyOCR |
| Component | Minimum |
|---|---|
| CPU | 4-6 cores |
| RAM | 12 GB |
| GPU | Optional — all modes run on CPU |
| VRAM | 4 GB (6 GB recommended, 16 GB for music modes) |
| Storage | SSD recommended |
Some modes (SS, TTM overdose, ACE-Step complete) benefit from 24-32 GB VRAM or 48 GB+ system memory. See Guide.md for the full per-mode breakdown.
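As a rough preflight, the minimums in the table can be encoded as data and compared against a machine's specs. This is an illustration only: the function name `check_hardware` is hypothetical, treating the lower bound of "4-6 cores" as the threshold is an assumption, and VODER itself may not perform any such check.

```python
# Minimums taken from the requirements table; the GPU itself is optional.
MINIMUMS = {"cpu_cores": 4, "ram_gb": 12, "vram_gb": 4}


def check_hardware(cpu_cores, ram_gb, vram_gb=None):
    """Return a list of warnings; an empty list means all minimums are met.

    Pass vram_gb=None for a CPU-only machine, which skips the VRAM check.
    """
    warnings = []
    if cpu_cores < MINIMUMS["cpu_cores"]:
        warnings.append(
            f"CPU: {cpu_cores} cores, minimum is {MINIMUMS['cpu_cores']}"
        )
    if ram_gb < MINIMUMS["ram_gb"]:
        warnings.append(f"RAM: {ram_gb} GB, minimum is {MINIMUMS['ram_gb']} GB")
    if vram_gb is not None and vram_gb < MINIMUMS["vram_gb"]:
        warnings.append(
            f"VRAM: {vram_gb} GB, minimum is {MINIMUMS['vram_gb']} GB"
        )
    return warnings
```

For example, `check_hardware(8, 16)` describes a CPU-only machine that meets the table, while `check_hardware(2, 8, 2)` reports three shortfalls. The heavier modes listed above would need their own, higher thresholds.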
Speaker diarization requires a free Hugging Face token: set the HF_TOKEN environment variable or provide it in HF_TOKEN.txt. See READ.md for details.
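The token lookup described above might look like the following sketch. The precedence (environment variable first, then file) and the file location (current working directory) are assumptions, as is the function name `load_hf_token`; READ.md is the authority on where VODER actually expects the file.

```python
import os
from pathlib import Path


def load_hf_token(token_file="HF_TOKEN.txt", environ=None):
    """Return the Hugging Face token, or None if neither source provides one."""
    env = os.environ if environ is None else environ
    # Assumed precedence: the environment variable wins over the file.
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    path = Path(token_file)
    if path.is_file():
        # Strip whitespace so a trailing newline in the file is harmless.
        token = path.read_text(encoding="utf-8").strip()
        if token:
            return token
    return None
```

Returning `None` instead of raising lets the caller decide whether diarization is needed at all for the current run.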
| Document | What's Inside |
|---|---|
| READ.md | Mode descriptions, CLI examples, setup details, technical notes |
| Guide.md | Architecture deep-dives, creative techniques, tips & tricks |
| COMMAND_CATALOG.md | Complete one-liner reference for every mode, flag, and keyword |
| Languages.md | Language support across all components (99+ languages) |
| Bots.md | AI agent & automation usage guide |
| CHANGELOG.md | Development history |
VODER is open-source under AGPL-3.0. Contributions are welcome — new modes, model integrations, UI improvements, bug fixes, or documentation.
Built for the open-source AI voice community.
