Skip to content

feat(stt): local Whisper transcription backend + transcribe_audio worker tool#177

Open
Marenz wants to merge 4 commits intospacedriveapp:mainfrom
Marenz:feat/stt-local-v2
Open

feat(stt): local Whisper transcription backend + transcribe_audio worker tool#177
Marenz wants to merge 4 commits intospacedriveapp:mainfrom
Marenz:feat/stt-local-v2

Conversation

@Marenz
Copy link
Collaborator

@Marenz Marenz commented Feb 23, 2026

Re-submission of #105, which was merged into the voice branch before that branch landed in main — the commits never reached main.

Local Whisper backend

Set routing.voice = "whisper-local://<spec>" in config. When the channel processes an audio attachment, it bypasses the HTTP provider path and runs inference locally via whisper-rs.

<spec> is a known size name (tiny, base, small, medium, large, large-v3) — downloaded from ggerganov/whisper.cpp on HuggingFace — or an absolute path to a GGML model file. The WhisperContext is cached for the process lifetime; switching models requires a restart.

Audio decoding: Ogg/Opus (Telegram voice messages) is handled via the ogg + opus crates. All other formats fall through to symphonia. Both paths resample to 16 kHz mono f32 before Whisper.

GPU acceleration via the vulkan feature (CUDA excluded due to GCC 14+/nvcc incompatibility on modern distros).

Everything is behind the stt-whisper cargo feature. Builds without it are unaffected.

STT unification + transcribe_audio worker tool

The transcription logic is extracted from channel.rs into a shared stt::transcribe_bytes() function that handles both the local Whisper path and any OpenAI-compatible HTTP provider. Workers now get a transcribe_audio tool that reads a local audio file and transcribes it using whatever is configured in routing.voice — no need to shell out to the whisper CLI.

@Marenz Marenz changed the title feat(stt): local Whisper transcription backend via whisper-rs feat(stt): local Whisper transcription backend + transcribe_audio worker tool Feb 23, 2026
Marenz added 4 commits March 2, 2026 02:07
When routing.voice = "whisper-local://<spec>", audio attachments are
transcribed locally instead of via the LLM provider HTTP path.

<spec> is either:
- A known size name (tiny/base/small/medium/large) — fetched from
  ggerganov/whisper.cpp on HuggingFace via hf-hub, using the existing
  HF cache if already present
- An absolute path to a GGML model file

The WhisperContext is loaded once and cached in a OnceLock for the
process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is
handled by symphonia with linear resampling to 16 kHz mono f32.

All three deps (whisper-rs, hf-hub, symphonia) are optional behind the
stt-whisper feature flag.
Workers can now call transcribe_audio(path) to transcribe local audio
files. The tool uses whatever is configured in routing.voice — local
Whisper (whisper-local://<spec>) or any OpenAI-compatible HTTP provider.

The transcription logic is extracted from channel.rs into stt.rs as
transcribe_bytes(), shared by both the channel attachment handler and
the new tool. The stt module is now always compiled (not gated on
stt-whisper) since it handles all provider paths.
@Marenz Marenz force-pushed the feat/stt-local-v2 branch from 37d9e0d to 89b57ea Compare March 2, 2026 01:46
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 2, 2026

Walkthrough

Introduces Speech-to-Text (STT) transcription feature supporting local Whisper and OpenAI-compatible providers, along with extensive agent channel enhancements including worker/branch spawning APIs, concurrent worker limits, attachment download/transcription, history management, and user message formatting with attribution context.

Changes

Cohort / File(s) Summary
Documentation & Configuration
README.md, prompts/en/tools/transcribe_audio_description.md.j2
Adds detailed Voice Transcription section documenting configuration options (routing.voice), provider-based and local Whisper setups, model caching, supported formats, and GPU acceleration notes; introduces tool description template.
Speech-to-Text Core
src/stt.rs
New STT module with unified transcribe_bytes API supporting both local Whisper (when feature-gated) and HTTP-based OpenAI-compatible backends, audio format inference, error handling, and internal helpers for model resolution, audio decoding, and resampling.
Transcribe Audio Tool
src/tools/transcribe_audio.rs, src/tools.rs
New tool implementing audio-to-text transcription with file I/O, MIME type inference, and delegation to STT backend; integrates into tool server when voice_model is configured.
Agent Channel Enhancement
src/agent/channel.rs
Extensive additions for spawning branches/workers/OpenCode agents with concurrent limits, attachment handling (download and transcription), user message formatting with attribution, event filtering, history rollback/commit management, and related utilities.
Worker & Tool Server Wiring
src/agent/worker.rs, Cargo.toml
Routes voice_model from runtime configuration into worker tool server creation; passes LlmManager and HTTP client dependencies enabling STT integration; updates function signature of create_worker_tool_server.
Library & Feature Export
src/lib.rs, src/prompts/text.rs
Conditionally exports new stt module under "stt-whisper" feature gate; adds transcribe_audio tool description to prompt lookup table.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • jamiepine
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main changes: addition of a local Whisper transcription backend and a transcribe_audio worker tool, accurately reflecting the primary purpose of the PR.
Description check ✅ Passed The description is directly related to the changeset, providing comprehensive context about the local Whisper backend, STT unification, audio decoding strategy, GPU support, and the transcribe_audio tool implementation.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/agent/channel.rs (1)

3-9: ⚠️ Potential issue | 🔴 Critical

Remove the duplicated helper block and fix stale ProcessEvent payloads in channel.rs.

This file imports from channel_attachments, channel_dispatch, and channel_history modules (lines 3–9) but also redefines those same helpers locally: spawn_memory_persistence_branch (1830), extract_reply_from_tool_syntax (2223), format_user_message (2277), event_is_for_channel (2343), download_attachments (2381), and apply_history_after_turn (2585). Additionally, ProcessEvent payloads at lines 2051 and 2149 are missing the required worker_type field, and line 2208 is missing the required success field—these will fail to compile. Delete the entire duplicated helper block (1800–2843) and keep the canonical implementations in the separate modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/channel.rs` around lines 3 - 9, Remove the entire duplicated helper
block that reimplements spawn_memory_persistence_branch,
extract_reply_from_tool_syntax, format_user_message (and
format_batched_user_message if present), event_is_for_channel,
download_attachments, and apply_history_after_turn so the file uses the
canonical implementations imported at the top; after deleting that block, verify
there are no remaining local definitions or duplicate symbols and update any
call sites to use the imported functions. Also scan this file for any
ProcessEvent payload constructions and ensure they include the required fields
(add worker_type and add success where missing) so payload shapes match the
canonical ProcessEvent type, and run a compile to catch any unresolved
references from the deletion.
🧹 Nitpick comments (2)
src/stt.rs (2)

310-313: Use Path::is_absolute() for absolute-path detection.

spec.starts_with('/') is Unix-specific. std::path::Path::is_absolute() is the correct path check across platforms.

♻️ Suggested fix
-        if spec.starts_with('/') {
-            if std::path::Path::new(spec).exists() {
+        if std::path::Path::new(spec).is_absolute() {
+            if std::path::Path::new(spec).exists() {
                 return Ok(spec.to_owned());
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/stt.rs` around lines 310 - 313, Replace the Unix-specific check
spec.starts_with('/') with the cross-platform Path::is_absolute() idiom: use
std::path::Path::new(spec).is_absolute() in the same conditional that also calls
Path::new(spec).exists(), updating the branch in the function/block that handles
the spec variable so it returns Ok(spec.to_owned()) when the path is absolute
and exists (look for the spec.starts_with('/') check near the
Path::new(spec).exists() call in src/stt.rs).

251-523: Expand abbreviated variable names in local decoding/transcription paths.

Names like ctx, n, dec, and mss reduce readability in already dense logic. Prefer explicit names in this module.

As per coding guidelines, "Do not abbreviate variable names; use queue not q, message not msg, channel not ch; common abbreviations like config are acceptable".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/stt.rs` around lines 251 - 523, The review asks to replace abbreviated
local variable names with explicit, descriptive names to improve readability: in
transcribe_blocking rename ctx -> context and n -> segment_count (and any other
short names in that function); in get_or_load_context keep variable names
explicit (e.g., ctx -> context) when caching/loading; in decode_to_f32 rename
mss -> media_source_stream, decoded -> decoded_packet_or_buffer (or
decoded_buffer), and track-related locals to descriptive names (e.g., track_id,
sample_rate, channel_count are fine); in decode_ogg_opus rename dec ->
opus_decoder and pcm -> pcm_buffer and samples -> mono_samples; and in resample
keep names clear (e.g., ratio, output_len, output). Update all usages
consistently within each function (transcribe_blocking, get_or_load_context,
resolve_model_path, decode_to_f32, is_ogg_opus, decode_ogg_opus, resample) and
run cargo check to ensure no missing renames.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib.rs`:
- Around line 27-28: The crate root currently conditionally declares the stt
module with #[cfg(feature = "stt-whisper")], but other files (e.g., references
to crate::stt in transcribe_audio.rs and channel.rs) import it unconditionally;
remove the feature gate so the module is always declared (change #[cfg(feature =
"stt-whisper")] pub mod stt; to pub mod stt;), and if needed keep internal cfgs
inside the stt module implementation to select whisper-specific code paths while
preserving the unconditional module export.

In `@src/stt.rs`:
- Line 300: The call to CONTEXT.set((model_spec.to_owned(), ctx)) currently
discards the Result with let _ =; instead explicitly handle the failure from
OnceLock::set by checking its return (or catching the Err) and logging a
debug-level message that a race occurred and the context was already initialized
(include model_spec info), or propagate the error if appropriate; reference
CONTEXT, OnceLock::set, model_spec and ctx to locate the site and ensure the
branch still retrieves the existing value when set fails.
- Around line 454-499: The loop currently swallows reader.read_packet() errors
and uses decoder.as_mut().unwrap() and abbreviated names; change the packet loop
to fully match reader.read_packet() and on Err(e) return
Err(WhisperError::Decode(e.to_string())), on Ok(None) break; replace
decoder.as_mut().unwrap() with a checked access that returns
Err(WhisperError::Decode("missing decoder".into())) if decoder is None; rename
local variables: dec -> decoder_mut or decoder_ref and n -> num_samples, and
rename inner ch -> channel to avoid abbreviations; ensure the decode_float error
is propagated as WhisperError::Decode and update subsequent uses to the new
variable names when extending samples or mixing down to mono.

In `@src/tools.rs`:
- Around line 434-440: Whitespace-only voice_model values cause
TranscribeAudioTool to be registered then fail; trim the voice_model before the
emptiness check and use the trimmed value when constructing TranscribeAudioTool.
Specifically, replace the current check of voice_model.is_empty() with a check
of a trimmed string (e.g., let trimmed = voice_model.trim(); if
trimmed.is_empty() { ... }) and pass that trimmed/owned value into
TranscribeAudioTool::new (the server.tool(TranscribeAudioTool::new(...)) call)
so only non-empty, non-whitespace model names register.

In `@src/tools/transcribe_audio.rs`:
- Around line 24-35: TranscribeAudioTool currently allows the model to supply
arbitrary host paths; update the file-reading logic in the transcribe_audio
method (and the similar code at the other site referenced around lines 83-87) to
enforce workspace boundaries by canonicalizing the requested path and the
workspace root (resolve symlinks), then ensure the canonicalized requested path
is inside the canonicalized workspace root (starts_with); if not, return an
error/deny access and do not read the file. Use the existing TranscribeAudioTool
method and struct members (or add a workspace_root field if needed) to perform
this check before any filesystem access.
- Around line 84-87: Before reading the audio file into memory in
transcribe_audio.rs, check the file size via tokio::fs::metadata(&args.path) (or
equivalent) and enforce a configurable size cap (e.g., MAX_AUDIO_BYTES
constant); if metadata.len() exceeds the cap return a TranscribeAudioError
indicating the file is too large instead of calling tokio::fs::read. Update the
code paths around args.path and the TranscribeAudioError usage so you only read
into memory after the size check and ensure the error message includes the
actual file size and the allowed limit.

---

Outside diff comments:
In `@src/agent/channel.rs`:
- Around line 3-9: Remove the entire duplicated helper block that reimplements
spawn_memory_persistence_branch, extract_reply_from_tool_syntax,
format_user_message (and format_batched_user_message if present),
event_is_for_channel, download_attachments, and apply_history_after_turn so the
file uses the canonical implementations imported at the top; after deleting that
block, verify there are no remaining local definitions or duplicate symbols and
update any call sites to use the imported functions. Also scan this file for any
ProcessEvent payload constructions and ensure they include the required fields
(add worker_type and add success where missing) so payload shapes match the
canonical ProcessEvent type, and run a compile to catch any unresolved
references from the deletion.

---

Nitpick comments:
In `@src/stt.rs`:
- Around line 310-313: Replace the Unix-specific check spec.starts_with('/')
with the cross-platform Path::is_absolute() idiom: use
std::path::Path::new(spec).is_absolute() in the same conditional that also calls
Path::new(spec).exists(), updating the branch in the function/block that handles
the spec variable so it returns Ok(spec.to_owned()) when the path is absolute
and exists (look for the spec.starts_with('/') check near the
Path::new(spec).exists() call in src/stt.rs).
- Around line 251-523: The review asks to replace abbreviated local variable
names with explicit, descriptive names to improve readability: in
transcribe_blocking rename ctx -> context and n -> segment_count (and any other
short names in that function); in get_or_load_context keep variable names
explicit (e.g., ctx -> context) when caching/loading; in decode_to_f32 rename
mss -> media_source_stream, decoded -> decoded_packet_or_buffer (or
decoded_buffer), and track-related locals to descriptive names (e.g., track_id,
sample_rate, channel_count are fine); in decode_ogg_opus rename dec ->
opus_decoder and pcm -> pcm_buffer and samples -> mono_samples; and in resample
keep names clear (e.g., ratio, output_len, output). Update all usages
consistently within each function (transcribe_blocking, get_or_load_context,
resolve_model_path, decode_to_f32, is_ogg_opus, decode_ogg_opus, resample) and
run cargo check to ensure no missing renames.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ebd1e1 and 89b57ea.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock, !**/*.lock
  • Cargo.toml is excluded by !**/*.toml
📒 Files selected for processing (9)
  • README.md
  • prompts/en/tools/transcribe_audio_description.md.j2
  • src/agent/channel.rs
  • src/agent/worker.rs
  • src/lib.rs
  • src/prompts/text.rs
  • src/stt.rs
  • src/tools.rs
  • src/tools/transcribe_audio.rs

Comment on lines +27 to +28
#[cfg(feature = "stt-whisper")]
pub mod stt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check for the stt module declaration and feature gates in src/lib.rs
rg -n 'pub mod stt|cfg\(feature = "stt-whisper"\)' src/lib.rs

Repository: spacedriveapp/spacebot

Length of output: 116


🏁 Script executed:

# Find all references to crate::stt throughout the codebase
rg -n 'crate::stt' --type rust -A 2

Repository: spacedriveapp/spacebot

Length of output: 1258


🏁 Script executed:

# Also check for any use statements that might reference stt
rg -n 'use.*stt' --type rust

Repository: spacedriveapp/spacebot

Length of output: 48


🏁 Script executed:

# Check the context around the call sites to see if they're gated
sed -n '85,100p' src/tools/transcribe_audio.rs

Repository: spacedriveapp/spacebot

Length of output: 652


🏁 Script executed:

# Check if the files containing call sites have any feature gating
head -20 src/tools/transcribe_audio.rs
head -30 src/agent/channel.rs

Repository: spacedriveapp/spacebot

Length of output: 1835


🏁 Script executed:

# Verify if there are conditional compilations around those call sites
sed -n '2490,2515p' src/agent/channel.rs

Repository: spacedriveapp/spacebot

Length of output: 1269


🏁 Script executed:

# Check if the files with call sites are themselves feature-gated
head -1 src/tools/transcribe_audio.rs
grep -n '#\[cfg' src/tools/transcribe_audio.rs | head -5

Repository: spacedriveapp/spacebot

Length of output: 104


🏁 Script executed:

# Check if transcribe_audio.rs is conditionally compiled
grep -B5 'pub struct TranscribeAudioTool' src/tools/transcribe_audio.rs

Repository: spacedriveapp/spacebot

Length of output: 201


🏁 Script executed:

# Check the tools.rs file to see how transcribe_audio is included
cat src/tools.rs | head -50

Repository: spacedriveapp/spacebot

Length of output: 1689


🏁 Script executed:

# Check if the containing functions/methods are feature-gated
sed -n '2480,2520p' src/agent/channel.rs | grep -B20 'crate::stt'

Repository: spacedriveapp/spacebot

Length of output: 1206


🏁 Script executed:

# Check tools.rs to see if transcribe_audio is included unconditionally
grep -n 'transcribe_audio' src/tools.rs

Repository: spacedriveapp/spacebot

Length of output: 126


🏁 Script executed:

# Check if there's a Cargo.toml to understand the stt-whisper feature
cat Cargo.toml | head -100

Repository: spacedriveapp/spacebot

Length of output: 2464


🏁 Script executed:

# Check if transcribe_audio tool usage is conditionally compiled elsewhere
grep -rn 'TranscribeAudioTool' --type rust src/

Repository: spacedriveapp/spacebot

Length of output: 183


🏁 Script executed:

# Check if the stt module itself is properly structured to handle the feature gate gracefully
head -30 src/stt.rs 2>/dev/null || echo "stt.rs not found, checking for stt/"

Repository: spacedriveapp/spacebot

Length of output: 1207


🏁 Script executed:

# Check if there's a stt/ directory (module)
ls -la src/ | grep -i stt

Repository: spacedriveapp/spacebot

Length of output: 124


🏁 Script executed:

# Check the full transcribe_bytes function signature in stt.rs
sed -n '30,80p' src/stt.rs

Repository: spacedriveapp/spacebot

Length of output: 1745


🏁 Script executed:

# Check how transcribe_bytes handles the case when stt-whisper is not enabled
grep -n 'pub.*fn transcribe_bytes' src/stt.rs -A 20

Repository: spacedriveapp/spacebot

Length of output: 773


🏁 Script executed:

# Verify that the HTTP provider path is unconditionally available
grep -n 'http\|Http\|provider' src/stt.rs | head -20

Repository: spacedriveapp/spacebot

Length of output: 1172


🏁 Script executed:

# Verify that stt.rs itself is NOT feature-gated at the module level
head -5 src/stt.rs

Repository: spacedriveapp/spacebot

Length of output: 289


🏁 Script executed:

# Check if there's a mod declaration for stt in src/lib.rs and if it matches the cfg gate
sed -n '20,35p' src/lib.rs

Repository: spacedriveapp/spacebot

Length of output: 340


🏁 Script executed:

# Confirm that transcribe_audio.rs has no feature gates around the call site
sed -n '1,15p' src/tools/transcribe_audio.rs && sed -n '80,100p' src/tools/transcribe_audio.rs

Repository: spacedriveapp/spacebot

Length of output: 1217


🏁 Script executed:

# Double check: try to understand if tools.rs or tools/transcribe_audio.rs is itself feature gated
grep -n '#\[cfg' src/tools.rs
grep -n '#\[cfg' src/tools/transcribe_audio.rs

Repository: spacedriveapp/spacebot

Length of output: 82


🏁 Script executed:

# Check what happens in agent/channel.rs - is that file feature gated?
head -50 src/agent/channel.rs | grep -n '#\[cfg'

Repository: spacedriveapp/spacebot

Length of output: 48


🏁 Script executed:

# Final verification: confirm build would fail without stt-whisper
# by checking if crate::stt is definitely called in non-gated contexts
echo "=== Checking transcribe_audio.rs context ===" && \
sed -n '1,20p' src/tools/transcribe_audio.rs | grep -c 'cfg' && \
echo "=== Checking if TranscribeAudioTool is exposed in tools.rs ===" && \
sed -n '50,110p' src/tools.rs | grep -i transcribe

Repository: spacedriveapp/spacebot

Length of output: 112


Remove the feature gate from the stt module declaration at the crate root.

The stt module is gated with #[cfg(feature = "stt-whisper")], but call sites in src/tools/transcribe_audio.rs and src/agent/channel.rs reference crate::stt unconditionally. Since the stt module itself supports builds without the feature (dispatching to HTTP providers), the module should be unconditionally available.

🔧 Proposed fix
-#[cfg(feature = "stt-whisper")]
 pub mod stt;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#[cfg(feature = "stt-whisper")]
pub mod stt;
pub mod stt;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib.rs` around lines 27 - 28, The crate root currently conditionally
declares the stt module with #[cfg(feature = "stt-whisper")], but other files
(e.g., references to crate::stt in transcribe_audio.rs and channel.rs) import it
unconditionally; remove the feature gate so the module is always declared
(change #[cfg(feature = "stt-whisper")] pub mod stt; to pub mod stt;), and if
needed keep internal cfgs inside the stt module implementation to select
whisper-specific code paths while preserving the unconditional module export.

let ctx = WhisperContext::new_with_params(&model_path, params)
.map_err(|e| WhisperError::Load(e.to_string()))?;

let _ = CONTEXT.set((model_spec.to_owned(), ctx));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find and examine the context around line 300 in src/stt.rs
head -n 310 src/stt.rs | tail -n 30

Repository: spacedriveapp/spacebot

Length of output: 1124


🏁 Script executed:

# Search for CONTEXT definition in the file to understand its type
rg "CONTEXT" src/stt.rs -B 2 -A 2

Repository: spacedriveapp/spacebot

Length of output: 835


🏁 Script executed:

# Check for other uses of OnceLock or similar patterns in the codebase
rg "OnceLock" src/ -A 2 -B 2

Repository: spacedriveapp/spacebot

Length of output: 2028


Explicitly handle OnceLock::set failure instead of silently discarding it.

Line 300 silently discards the result of CONTEXT.set() via let _ =. Per coding guidelines, errors should be handled, logged, or propagated—not silently discarded. While the code handles the failure implicitly by retrieving the already-set value, the race condition should be acknowledged explicitly at debug level.

🔧 Suggested fix
-        let _ = CONTEXT.set((model_spec.to_owned(), ctx));
+        if CONTEXT.set((model_spec.to_owned(), ctx)).is_err() {
+            tracing::debug!("Whisper context was initialized concurrently; using existing context");
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let _ = CONTEXT.set((model_spec.to_owned(), ctx));
if CONTEXT.set((model_spec.to_owned(), ctx)).is_err() {
tracing::debug!("Whisper context was initialized concurrently; using existing context");
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/stt.rs` at line 300, The call to CONTEXT.set((model_spec.to_owned(),
ctx)) currently discards the Result with let _ =; instead explicitly handle the
failure from OnceLock::set by checking its return (or catching the Err) and
logging a debug-level message that a race occurred and the context was already
initialized (include model_spec info), or propagate the error if appropriate;
reference CONTEXT, OnceLock::set, model_spec and ctx to locate the site and
ensure the branch still retrieves the existing value when set fails.

Comment on lines +454 to +499
while let Ok(Some(packet)) = reader.read_packet() {
if header_packets < 2 {
if header_packets == 0 {
// Parse OpusHead to get channel count and pre-skip.
if packet.data.len() >= 11 && &packet.data[0..8] == b"OpusHead" {
channels = packet.data[9] as usize;
// Output sample rate is always 48000 for libopus.
sample_rate = 48000;
}
decoder = Some(
opus::Decoder::new(
sample_rate,
if channels == 2 {
opus::Channels::Stereo
} else {
opus::Channels::Mono
},
)
.map_err(|e| WhisperError::Decode(e.to_string()))?,
);
}
header_packets += 1;
continue;
}

let dec = decoder.as_mut().unwrap();
// Max Opus frame: 120ms at 48kHz = 5760 samples per channel.
let max_samples = 5760 * channels;
let mut pcm = vec![0f32; max_samples];
let n = dec
.decode_float(&packet.data, &mut pcm, false)
.map_err(|e| WhisperError::Decode(e.to_string()))?;

// Mix down to mono.
if channels == 1 {
samples.extend_from_slice(&pcm[..n]);
} else {
for frame in 0..n {
let mut sum = 0f32;
for ch in 0..channels {
sum += pcm[frame * channels + ch];
}
samples.push(sum / channels as f32);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual code around lines 454-499 in src/stt.rs
wc -l src/stt.rs

Repository: spacedriveapp/spacebot

Length of output: 80


🏁 Script executed:

# Read the specific section mentioned in the review
sed -n '440,510p' src/stt.rs

Repository: spacedriveapp/spacebot

Length of output: 2858


🏁 Script executed:

# Also check the OnceLock issue mentioned at line 300
sed -n '295,305p' src/stt.rs

Repository: spacedriveapp/spacebot

Length of output: 448


🏁 Script executed:

# Check for abbreviated variable names throughout the file
rg '\b(ctx|dec|n|mss)\b' src/stt.rs -n

Repository: spacedriveapp/spacebot

Length of output: 866


Propagate Ogg packet read failures and remove panic path in decoder access.

Line 454 silently drops read_packet() errors by exiting the loop, and line 479 can panic on malformed headers via unwrap(). Both violate error-handling requirements and should return WhisperError::Decode instead. Additionally, variable names dec and n use disallowed abbreviations.

Suggested fix
-        while let Ok(Some(packet)) = reader.read_packet() {
+        loop {
+            let packet = match reader.read_packet() {
+                Ok(Some(packet)) => packet,
+                Ok(None) => break,
+                Err(error) => return Err(WhisperError::Decode(error.to_string())),
+            };
             if header_packets < 2 {
                 if header_packets == 0 {
                     // Parse OpusHead to get channel count and pre-skip.
                     if packet.data.len() >= 11 && &packet.data[0..8] == b"OpusHead" {
                         channels = packet.data[9] as usize;
                         // Output sample rate is always 48000 for libopus.
                         sample_rate = 48000;
+                    } else {
+                        return Err(WhisperError::Decode("invalid Opus header".into()));
                     }
                     decoder = Some(
                         opus::Decoder::new(
                             sample_rate,
                             if channels == 2 {
                                 opus::Channels::Stereo
                             } else {
                                 opus::Channels::Mono
                             },
                         )
                         .map_err(|e| WhisperError::Decode(e.to_string()))?,
                     );
                 }
                 header_packets += 1;
                 continue;
             }
 
-            let dec = decoder.as_mut().unwrap();
+            let decoder = decoder
+                .as_mut()
+                .ok_or_else(|| WhisperError::Decode("Opus decoder not initialized".into()))?;
             // Max Opus frame: 120ms at 48kHz = 5760 samples per channel.
             let max_samples = 5760 * channels;
             let mut pcm = vec![0f32; max_samples];
-            let n = dec
+            let samples_decoded = decoder
                 .decode_float(&packet.data, &mut pcm, false)
                 .map_err(|e| WhisperError::Decode(e.to_string()))?;

Per coding guidelines: "Don't silently discard errors; use let _ = only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors" and "Do not abbreviate variable names; use queue not q, message not msg, channel not ch."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/stt.rs` around lines 454 - 499, The loop currently swallows
reader.read_packet() errors and uses decoder.as_mut().unwrap() and abbreviated
names; change the packet loop to fully match reader.read_packet() and on Err(e)
return Err(WhisperError::Decode(e.to_string())), on Ok(None) break; replace
decoder.as_mut().unwrap() with a checked access that returns
Err(WhisperError::Decode("missing decoder".into())) if decoder is None; rename
local variables: dec -> decoder_mut or decoder_ref and n -> num_samples, and
rename inner ch -> channel to avoid abbreviations; ensure the decode_float error
is propagated as WhisperError::Decode and update subsequent uses to the new
variable names when extending samples or mixing down to mono.

Comment on lines +434 to +440
if !voice_model.is_empty() {
server = server.tool(TranscribeAudioTool::new(
voice_model,
llm_manager,
http,
));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Trim voice_model before deciding tool registration.

Whitespace-only values currently register transcribe_audio and then fail at call time.

🔧 Proposed fix
-    if !voice_model.is_empty() {
+    let trimmed_voice_model = voice_model.trim().to_string();
+    if !trimmed_voice_model.is_empty() {
         server = server.tool(TranscribeAudioTool::new(
-            voice_model,
+            trimmed_voice_model,
             llm_manager,
             http,
         ));
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if !voice_model.is_empty() {
server = server.tool(TranscribeAudioTool::new(
voice_model,
llm_manager,
http,
));
}
let trimmed_voice_model = voice_model.trim().to_string();
if !trimmed_voice_model.is_empty() {
server = server.tool(TranscribeAudioTool::new(
trimmed_voice_model,
llm_manager,
http,
));
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools.rs` around lines 434 - 440, Whitespace-only voice_model values
cause TranscribeAudioTool to be registered then fail; trim the voice_model
before the emptiness check and use the trimmed value when constructing
TranscribeAudioTool. Specifically, replace the current check of
voice_model.is_empty() with a check of a trimmed string (e.g., let trimmed =
voice_model.trim(); if trimmed.is_empty() { ... }) and pass that trimmed/owned
value into TranscribeAudioTool::new (the
server.tool(TranscribeAudioTool::new(...)) call) so only non-empty,
non-whitespace model names register.

Comment on lines +24 to +35
impl TranscribeAudioTool {
/// Create a new transcribe audio tool.
pub fn new(
voice_model: impl Into<String>,
llm_manager: Arc<LlmManager>,
http: reqwest::Client,
) -> Self {
Self {
voice_model: voice_model.into(),
llm_manager,
http,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Enforce workspace path boundaries before reading files.

transcribe_audio currently reads any host path supplied by the model. This bypasses workspace-scoped file safety.

🔒 Proposed fix
 use std::sync::Arc;
+use std::path::PathBuf;

 pub struct TranscribeAudioTool {
     /// The configured voice model spec (full `routing.voice` value).
     voice_model: String,
+    workspace: PathBuf,
     llm_manager: Arc<LlmManager>,
     http: reqwest::Client,
 }

 impl TranscribeAudioTool {
     pub fn new(
         voice_model: impl Into<String>,
+        workspace: PathBuf,
         llm_manager: Arc<LlmManager>,
         http: reqwest::Client,
     ) -> Self {
         Self {
             voice_model: voice_model.into(),
+            workspace,
             llm_manager,
             http,
         }
     }
 }
@@
     async fn call(&self, args: Self::Args) -> Result<Self::Output, Self::Error> {
-        let audio = tokio::fs::read(&args.path)
+        let requested = PathBuf::from(&args.path);
+        let candidate = if requested.is_absolute() {
+            requested
+        } else {
+            self.workspace.join(requested)
+        };
+
+        let workspace = std::fs::canonicalize(&self.workspace)
+            .map_err(|e| TranscribeAudioError(format!("invalid workspace: {e}")))?;
+        let canonical = std::fs::canonicalize(&candidate)
+            .map_err(|e| TranscribeAudioError(format!("failed to resolve {}: {e}", args.path)))?;
+        if !canonical.starts_with(&workspace) {
+            return Err(TranscribeAudioError(format!(
+                "path escapes workspace: {}",
+                args.path
+            )));
+        }
+
+        let audio = tokio::fs::read(&canonical)
             .await
             .map_err(|e| TranscribeAudioError(format!("failed to read {}: {}", args.path, e)))?;

Also applies to: 83-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/transcribe_audio.rs` around lines 24 - 35, TranscribeAudioTool
currently allows the model to supply arbitrary host paths; update the
file-reading logic in the transcribe_audio method (and the similar code at the
other site referenced around lines 83-87) to enforce workspace boundaries by
canonicalizing the requested path and the workspace root (resolve symlinks),
then ensure the canonicalized requested path is inside the canonicalized
workspace root (starts_with); if not, return an error/deny access and do not
read the file. Use the existing TranscribeAudioTool method and struct members
(or add a workspace_root field if needed) to perform this check before any
filesystem access.

Comment on lines +84 to +87
let audio = tokio::fs::read(&args.path)
.await
.map_err(|e| TranscribeAudioError(format!("failed to read {}: {}", args.path, e)))?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add a file-size cap before reading audio into memory.

Reading arbitrary-size files into memory can cause avoidable memory pressure.

🔧 Proposed fix
+const MAX_AUDIO_BYTES: u64 = 25 * 1024 * 1024; // 25MB
@@
-        let audio = tokio::fs::read(&canonical)
+        let metadata = tokio::fs::metadata(&canonical)
+            .await
+            .map_err(|e| TranscribeAudioError(format!("failed to stat {}: {}", args.path, e)))?;
+        if metadata.len() > MAX_AUDIO_BYTES {
+            return Err(TranscribeAudioError(format!(
+                "file too large: {} bytes (max {})",
+                metadata.len(),
+                MAX_AUDIO_BYTES
+            )));
+        }
+
+        let audio = tokio::fs::read(&canonical)
             .await
             .map_err(|e| TranscribeAudioError(format!("failed to read {}: {}", args.path, e)))?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/transcribe_audio.rs` around lines 84 - 87, Before reading the audio
file into memory in transcribe_audio.rs, check the file size via
tokio::fs::metadata(&args.path) (or equivalent) and enforce a configurable size
cap (e.g., MAX_AUDIO_BYTES constant); if metadata.len() exceeds the cap return a
TranscribeAudioError indicating the file is too large instead of calling
tokio::fs::read. Update the code paths around args.path and the
TranscribeAudioError usage so you only read into memory after the size check and
ensure the error message includes the actual file size and the allowed limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant