feat(stt): local Whisper transcription backend via whisper-rs #105
Merged
jamiepine merged 3 commits into spacedriveapp:voice on Feb 21, 2026
Conversation
When `routing.voice = "whisper-local://<spec>"`, audio attachments are transcribed locally instead of via the LLM provider HTTP path. `<spec>` is either:

- A known size name (tiny/base/small/medium/large), fetched from ggerganov/whisper.cpp on HuggingFace via hf-hub, reusing the existing HF cache if already present
- An absolute path to a GGML model file

The `WhisperContext` is loaded once and cached in a `OnceLock` for the process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is handled by symphonia with linear resampling to 16 kHz mono f32. All three deps (whisper-rs, hf-hub, symphonia) are optional behind the `stt-whisper` feature flag.
Member
Amazing, thanks for building on top of my PR, makes my life much easier!
Adds a local Whisper speech-to-text backend as an alternative to routing transcription through an LLM provider. Based on #98.
How it works
Set `routing.voice = "whisper-local://<spec>"` in config. When the channel processes an audio attachment and the voice route starts with `whisper-local://`, it bypasses the HTTP provider path entirely and runs inference locally via whisper-rs. `<spec>` is either a known model size name (`tiny`, `base`, `small`, `medium`, `large`, `large-v3`, …) or an absolute path to a GGML model file. Size names are downloaded automatically via hf-hub from `ggerganov/whisper.cpp` on HuggingFace and cached in `~/.cache/huggingface/hub`.

The `WhisperContext` is loaded once and cached for the process lifetime via `OnceLock`. Switching models requires a restart.
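The route parsing described above can be sketched as follows. This is a minimal, std-only illustration with hypothetical names (`ModelSpec`, `resolve_spec`), not the PR's actual code: in the real backend, a hub-sized model would be fetched via hf-hub and the resulting path used to build the `OnceLock`-cached `WhisperContext`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical sketch of `<spec>` resolution for `whisper-local://<spec>`;
/// names are illustrative, not the PR's actual code.
#[derive(Debug, PartialEq)]
enum ModelSpec {
    /// Absolute path to a GGML model file, used as-is.
    LocalFile(PathBuf),
    /// Known size name, mapped to a `ggml-<size>.bin` file on HuggingFace.
    HubFile(String),
}

fn resolve_spec(route: &str) -> Option<ModelSpec> {
    // Only handle the local-Whisper scheme; any other route keeps the
    // existing HTTP provider path.
    let spec = route.strip_prefix("whisper-local://")?;
    if Path::new(spec).is_absolute() {
        return Some(ModelSpec::LocalFile(PathBuf::from(spec)));
    }
    match spec {
        "tiny" | "base" | "small" | "medium" | "large" | "large-v3" => {
            Some(ModelSpec::HubFile(format!("ggml-{spec}.bin")))
        }
        _ => None,
    }
}
```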
Audio decoding

Telegram voice messages arrive as Ogg/Opus. These are decoded via the `ogg` + `opus` crates. All other formats (mp3, flac, wav, aac, …) fall through to symphonia. Both paths resample to 16 kHz mono f32 before passing to Whisper.
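The shared resampling step can be sketched as plain linear interpolation. A minimal sketch with an illustrative function name (`resample_to_16k_mono`), not the PR's actual code:

```rust
/// Downmix interleaved f32 samples to mono, then linearly interpolate
/// to the 16 kHz rate Whisper expects. Illustrative sketch only.
fn resample_to_16k_mono(samples: &[f32], channels: usize, src_rate: u32) -> Vec<f32> {
    const TARGET_RATE: u32 = 16_000;

    // Downmix: average the channels of each interleaved frame.
    let mono: Vec<f32> = samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect();

    if src_rate == TARGET_RATE {
        return mono;
    }

    // Linear interpolation between the two nearest source samples.
    let ratio = src_rate as f64 / TARGET_RATE as f64;
    let out_len = (mono.len() as f64 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = mono[idx];
            let b = *mono.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac
        })
        .collect()
}
```

Linear interpolation is the simplest adequate choice here; 16 kHz speech input is tolerant of the slight aliasing a windowed-sinc resampler would avoid.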
GPU acceleration

whisper-rs is built with the `vulkan` feature, so inference runs on the GPU where available. CUDA was not used: GCC 14+/glibc incompatibility with nvcc on modern distros makes it impractical.
Feature flag

Everything is behind the `stt-whisper` cargo feature. Builds without it are unaffected.
Related

While testing this, a bug was found where a `CompletionError` mid-turn leaves dangling tool-call messages in conversation history, making the channel permanently unresponsive. Fixed separately in #104.