feat(stt): local Whisper transcription backend via whisper-rs #105

Merged
jamiepine merged 3 commits into spacedriveapp:voice from Marenz:stt-local on Feb 21, 2026
Conversation

Marenz (Collaborator) commented Feb 21, 2026

Adds a local Whisper speech-to-text backend as an alternative to routing transcription through an LLM provider. Based on #98.

How it works

Set routing.voice = "whisper-local://<spec>" in config. When the channel processes an audio attachment and the voice route starts with whisper-local://, it bypasses the HTTP provider path entirely and runs inference locally via whisper-rs.
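A minimal sketch of the dispatch described above, with hypothetical names (`local_whisper_spec` is illustrative, not the PR's actual function): a voice route beginning with `whisper-local://` yields a model spec and skips the HTTP provider path.

```rust
// Prefix that selects the local Whisper backend instead of an HTTP provider.
const PREFIX: &str = "whisper-local://";

/// Returns the model spec if the route selects the local backend,
/// or None so the caller falls through to the provider path.
fn local_whisper_spec(route: &str) -> Option<&str> {
    route.strip_prefix(PREFIX)
}

fn main() {
    assert_eq!(local_whisper_spec("whisper-local://base"), Some("base"));
    assert_eq!(local_whisper_spec("openai://whisper-1"), None);
    println!("dispatch ok");
}
```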

<spec> is either a known model size name (tiny, base, small, medium, large, large-v3, …) or an absolute path to a GGML model file. Models for known size names are downloaded automatically via hf-hub from ggerganov/whisper.cpp on HuggingFace and cached in ~/.cache/huggingface/hub.
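The two spec forms could be resolved along these lines (type and function names are assumptions for illustration; the `ggml-<size>.bin` filename follows the naming used in the ggerganov/whisper.cpp model repository):

```rust
use std::path::{Path, PathBuf};

// Sketch of spec resolution; names are illustrative, not the PR's code.
#[derive(Debug)]
enum ModelSource {
    /// Known size name, fetched from ggerganov/whisper.cpp as ggml-<size>.bin.
    HubSize(String),
    /// Absolute path to a local GGML model file.
    LocalPath(PathBuf),
}

fn resolve_spec(spec: &str) -> ModelSource {
    if Path::new(spec).is_absolute() {
        ModelSource::LocalPath(PathBuf::from(spec))
    } else {
        ModelSource::HubSize(spec.to_string())
    }
}

fn main() {
    assert!(matches!(resolve_spec("base"), ModelSource::HubSize(_)));
    assert!(matches!(
        resolve_spec("/models/ggml-base.bin"),
        ModelSource::LocalPath(_)
    ));
    println!("spec resolution ok");
}
```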

The WhisperContext is loaded once and cached for the process lifetime via OnceLock. Switching models requires a restart.
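The OnceLock pattern can be sketched as follows (the `WhisperContext` here is a stand-in struct, not the real whisper-rs type, so the example is self-contained). It also shows why switching models needs a restart: a later call with a different path is silently ignored.

```rust
use std::sync::OnceLock;

// Stand-in for whisper_rs::WhisperContext so the sketch compiles alone.
struct WhisperContext {
    model: String,
}

static CONTEXT: OnceLock<WhisperContext> = OnceLock::new();

/// First call loads the model; every later call returns the cached
/// context, ignoring `model_path` entirely.
fn context(model_path: &str) -> &'static WhisperContext {
    CONTEXT.get_or_init(|| WhisperContext {
        model: model_path.to_string(),
    })
}

fn main() {
    let first = context("ggml-base.bin");
    let second = context("ggml-large-v3.bin"); // ignored: already initialized
    assert_eq!(first.model, "ggml-base.bin");
    assert_eq!(second.model, "ggml-base.bin");
    println!("context cached for process lifetime");
}
```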

Audio decoding

Telegram voice messages arrive as Ogg/Opus. These are decoded via the ogg + opus crates. All other formats (mp3, flac, wav, aac, …) fall through to symphonia. Both paths resample to 16 kHz mono f32 before passing to Whisper.
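The commit message mentions linear resampling to 16 kHz mono f32; a simplified sketch of that final step (decoding itself is omitted, and this assumes the input is already mono):

```rust
/// Linearly resample mono f32 samples from `src_rate` to the 16 kHz
/// Whisper expects. Simplified sketch of what both decode paths feed in.
fn resample_to_16k(samples: &[f32], src_rate: u32) -> Vec<f32> {
    const TARGET: u32 = 16_000;
    if src_rate == TARGET {
        return samples.to_vec();
    }
    let ratio = src_rate as f64 / TARGET as f64;
    let out_len = (samples.len() as f64 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            // Interpolate between the two nearest source samples.
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = samples[idx];
            let b = *samples.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // 48 kHz (Opus native rate) down to 16 kHz: a third of the samples.
    let input = vec![0.5_f32; 480];
    assert_eq!(resample_to_16k(&input, 48_000).len(), 160);
    assert_eq!(resample_to_16k(&input, 16_000).len(), 480);
    println!("resample ok");
}
```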

GPU acceleration

whisper-rs is built with the vulkan feature, so inference runs on the GPU where available. CUDA was not used: nvcc's incompatibility with GCC 14+/glibc on modern distros makes it impractical.

Feature flag

Everything is behind the stt-whisper cargo feature. Builds without it are unaffected.
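The Cargo wiring for such an optional feature might look like this (crate and feature names come from the PR text; version numbers and the exact dependency lists are assumptions):

```toml
# Hypothetical Cargo.toml fragment: the stt-whisper feature pulls in
# all three optional dependencies; builds without it skip them entirely.
[features]
stt-whisper = ["dep:whisper-rs", "dep:hf-hub", "dep:symphonia"]

[dependencies]
whisper-rs = { version = "0.12", optional = true, features = ["vulkan"] }
hf-hub = { version = "0.3", optional = true }
symphonia = { version = "0.5", optional = true }
```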

Related

While testing this, a bug was found where CompletionError mid-turn leaves dangling tool-call messages in conversation history, making the channel permanently unresponsive. Fixed separately in #104.

When routing.voice = "whisper-local://<spec>", audio attachments are
transcribed locally instead of via the LLM provider HTTP path.

<spec> is either:
- A known size name (tiny/base/small/medium/large) — fetched from
  ggerganov/whisper.cpp on HuggingFace via hf-hub, using the existing
  HF cache if already present
- An absolute path to a GGML model file

The WhisperContext is loaded once and cached in a OnceLock for the
process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is
handled by symphonia with linear resampling to 16 kHz mono f32.

All three deps (whisper-rs, hf-hub, symphonia) are optional behind the
stt-whisper feature flag.
jamiepine (Member) commented
Amazing, thanks for building on top of my PR, makes my life much easier!
