feat(stt): local Whisper transcription backend via whisper-rs #105

Merged
jamiepine merged 3 commits into spacedriveapp:voice from Marenz:stt-local on Feb 21, 2026
Conversation

Marenz (Collaborator) commented Feb 21, 2026

Adds a local Whisper speech-to-text backend as an alternative to routing transcription through an LLM provider. Based on #98.

How it works

Set routing.voice = "whisper-local://<spec>" in config. When the channel processes an audio attachment and the voice route starts with whisper-local://, it bypasses the HTTP provider path entirely and runs inference locally via whisper-rs.
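A minimal sketch of the dispatch described above, with hypothetical names (`local_whisper_spec` is illustrative, not the PR's actual function): a voice route beginning with `whisper-local://` yields a model spec and skips the HTTP provider path.

```rust
// Prefix that selects the local Whisper backend instead of an HTTP provider.
const PREFIX: &str = "whisper-local://";

/// Returns the model spec if the route selects the local backend,
/// or None so the caller falls through to the provider path.
fn local_whisper_spec(route: &str) -> Option<&str> {
    route.strip_prefix(PREFIX)
}

fn main() {
    assert_eq!(local_whisper_spec("whisper-local://base"), Some("base"));
    assert_eq!(local_whisper_spec("openai://whisper-1"), None);
    println!("dispatch ok");
}
```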

<spec> is either a known model size name (tiny, base, small, medium, large, large-v3, …) or an absolute path to a GGML model file. Models for known size names are downloaded automatically via hf-hub from ggerganov/whisper.cpp on HuggingFace and cached in ~/.cache/huggingface/hub.
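The two spec forms could be resolved along these lines (type and function names are assumptions for illustration; the `ggml-<size>.bin` filename follows the naming used in the ggerganov/whisper.cpp model repository):

```rust
use std::path::{Path, PathBuf};

// Sketch of spec resolution; names are illustrative, not the PR's code.
#[derive(Debug)]
enum ModelSource {
    /// Known size name, fetched from ggerganov/whisper.cpp as ggml-<size>.bin.
    HubSize(String),
    /// Absolute path to a local GGML model file.
    LocalPath(PathBuf),
}

fn resolve_spec(spec: &str) -> ModelSource {
    if Path::new(spec).is_absolute() {
        ModelSource::LocalPath(PathBuf::from(spec))
    } else {
        ModelSource::HubSize(spec.to_string())
    }
}

fn main() {
    assert!(matches!(resolve_spec("base"), ModelSource::HubSize(_)));
    assert!(matches!(
        resolve_spec("/models/ggml-base.bin"),
        ModelSource::LocalPath(_)
    ));
    println!("spec resolution ok");
}
```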

The WhisperContext is loaded once and cached for the process lifetime via OnceLock. Switching models requires a restart.
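The OnceLock pattern can be sketched as follows (the `WhisperContext` here is a stand-in struct, not the real whisper-rs type, so the example is self-contained). It also shows why switching models needs a restart: a later call with a different path is silently ignored.

```rust
use std::sync::OnceLock;

// Stand-in for whisper_rs::WhisperContext so the sketch compiles alone.
struct WhisperContext {
    model: String,
}

static CONTEXT: OnceLock<WhisperContext> = OnceLock::new();

/// First call loads the model; every later call returns the cached
/// context, ignoring `model_path` entirely.
fn context(model_path: &str) -> &'static WhisperContext {
    CONTEXT.get_or_init(|| WhisperContext {
        model: model_path.to_string(),
    })
}

fn main() {
    let first = context("ggml-base.bin");
    let second = context("ggml-large-v3.bin"); // ignored: already initialized
    assert_eq!(first.model, "ggml-base.bin");
    assert_eq!(second.model, "ggml-base.bin");
    println!("context cached for process lifetime");
}
```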

Audio decoding

Telegram voice messages arrive as Ogg/Opus. These are decoded via the ogg + opus crates. All other formats (mp3, flac, wav, aac, …) fall through to symphonia. Both paths resample to 16 kHz mono f32 before passing to Whisper.
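The commit message mentions linear resampling to 16 kHz mono f32; a simplified sketch of that final step (decoding itself is omitted, and this assumes the input is already mono):

```rust
/// Linearly resample mono f32 samples from `src_rate` to the 16 kHz
/// Whisper expects. Simplified sketch of what both decode paths feed in.
fn resample_to_16k(samples: &[f32], src_rate: u32) -> Vec<f32> {
    const TARGET: u32 = 16_000;
    if src_rate == TARGET {
        return samples.to_vec();
    }
    let ratio = src_rate as f64 / TARGET as f64;
    let out_len = (samples.len() as f64 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            // Interpolate between the two nearest source samples.
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = samples[idx];
            let b = *samples.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // 48 kHz (Opus native rate) down to 16 kHz: a third of the samples.
    let input = vec![0.5_f32; 480];
    assert_eq!(resample_to_16k(&input, 48_000).len(), 160);
    assert_eq!(resample_to_16k(&input, 16_000).len(), 480);
    println!("resample ok");
}
```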

GPU acceleration

whisper-rs is built with the vulkan feature, so inference runs on the GPU where available. CUDA was not used: nvcc's incompatibility with GCC 14+/glibc on modern distros makes it impractical.

Feature flag

Everything is behind the stt-whisper cargo feature. Builds without it are unaffected.
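The Cargo wiring for such an optional feature might look like this (crate and feature names come from the PR text; version numbers and the exact dependency lists are assumptions):

```toml
# Hypothetical Cargo.toml fragment: the stt-whisper feature pulls in
# all three optional dependencies; builds without it skip them entirely.
[features]
stt-whisper = ["dep:whisper-rs", "dep:hf-hub", "dep:symphonia"]

[dependencies]
whisper-rs = { version = "0.12", optional = true, features = ["vulkan"] }
hf-hub = { version = "0.3", optional = true }
symphonia = { version = "0.5", optional = true }
```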

Related

While testing this, a bug was found where CompletionError mid-turn leaves dangling tool-call messages in conversation history, making the channel permanently unresponsive. Fixed separately in #104.

When routing.voice = "whisper-local://<spec>", audio attachments are
transcribed locally instead of via the LLM provider HTTP path.

<spec> is either:
- A known size name (tiny/base/small/medium/large) — fetched from
  ggerganov/whisper.cpp on HuggingFace via hf-hub, using the existing
  HF cache if already present
- An absolute path to a GGML model file

The WhisperContext is loaded once and cached in a OnceLock for the
process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is
handled by symphonia with linear resampling to 16 kHz mono f32.

All three deps (whisper-rs, hf-hub, symphonia) are optional behind the
stt-whisper feature flag.
jamiepine (Member) commented
Amazing, thanks for building on top of my PR, makes my life much easier!
