Text-to-speech, speech-to-text, and voice cloning CLI for Apple Silicon, built on mlx-audio.
- macOS with Apple Silicon (M-series)
- uv
- ffmpeg (for non-WAV audio conversion in the API server)
```
uv sync
```

```
# Use default model (Kokoro) and voice
uv run voice.py say "Hello world!"

# Choose a model and voice
uv run voice.py say "Bonjour !" -m voxtral -v fr_male

# Save without playing
uv run voice.py say "Hello" -o greeting.wav --no-play
```

```
uv run voice.py clone "Text to speak" reference.wav
uv run voice.py clone "Text to speak" reference.wav -m voxtral
```

```
uv run voice.py transcribe audio.wav
uv run voice.py transcribe audio.wav --stream
```

```
uv run voice.py voices            # all voices
uv run voice.py voices -m kokoro  # kokoro voices only
```
```
uv run voice.py models            # available model shortcuts
```

| Shortcut | Model ID |
|---|---|
| kokoro, kokoro-tts (default) | mlx-community/Kokoro-82M-bf16 |
| voxtral, voxtral-tts | mlx-community/Voxtral-4B-TTS-2603-mlx-4bit |
You can also pass any full Hugging Face model ID with -m.
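The resolution described above can be sketched as follows. This is a hypothetical helper for illustration, not the actual voice.py internals: known shortcuts map to the full Hugging Face IDs from the table, and any other string is passed through unchanged as a full model ID.

```python
# Hypothetical sketch of -m shortcut resolution (not voice.py's actual code).
# Shortcut aliases map to full Hugging Face model IDs; anything unrecognized
# is assumed to already be a full model ID and is returned as-is.
SHORTCUTS = {
    "kokoro": "mlx-community/Kokoro-82M-bf16",
    "kokoro-tts": "mlx-community/Kokoro-82M-bf16",
    "voxtral": "mlx-community/Voxtral-4B-TTS-2603-mlx-4bit",
    "voxtral-tts": "mlx-community/Voxtral-4B-TTS-2603-mlx-4bit",
}

def resolve_model(name: str = "kokoro") -> str:
    """Return the full model ID for a shortcut, or the input unchanged."""
    return SHORTCUTS.get(name, name)
```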
Starts an OpenAI-compatible transcription API on port 4444.
```
uv run voice.py serve
uv run voice.py serve -p 8080  # custom port
```

`POST /v1/audio/transcriptions`
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | UploadFile | required | Audio file (WAV, WebM, MP3, MP4, OGG, FLAC, AAC) |
| model | string | "base" | Model identifier |
| language | string | "en" | Language code |
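The parameters above travel as multipart/form-data. A minimal sketch of how such a request body can be assembled with only the Python standard library; `build_multipart` is a hypothetical helper written for illustration, not part of voice.py.

```python
# Sketch: assemble a multipart/form-data body carrying the endpoint's
# fields (model, language) plus the audio file. Standard library only.
import io
import uuid

def build_multipart(fields, file_name, file_bytes):
    """Encode text fields and one file part; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(value.encode() + b"\r\n")
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        (
            f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
    )
    buf.write(file_bytes + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

body, content_type = build_multipart(
    {"model": "base", "language": "en"}, "recording.wav", b"RIFF...."
)
# Send with e.g. urllib.request.Request(
#     "http://localhost:4444/v1/audio/transcriptions",
#     data=body, headers={"Content-Type": content_type}, method="POST")
```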
Response:

```
{"text": "transcribed text"}
```

Example:
```
curl -X POST http://localhost:4444/v1/audio/transcriptions \
  -F "file=@recording.webm" \
  -F "language=en"
```

Use ./cli to manage a persistent background service via launchd.
```
./cli install    # install and start on login
./cli status     # check if running
./cli logs       # tail logs
./cli restart    # restart the service
./cli stop       # stop the service
./cli uninstall  # stop and remove the service
```

Logs are written to /tmp/voice-tts.log and /tmp/voice-tts.err.