
feat: 0.5.0 Capture release — dictation, MCP, personalities #544

Merged
jamiepine merged 65 commits into main from feat/capture on Apr 25, 2026

Conversation

@jamiepine jamiepine commented Apr 24, 2026

Summary

[Screenshot 2026-04-23 at 7 26 52 PM]

The 0.5.0 Capture release. Voicebox stops being just a voice-cloning studio and becomes a full AI voice studio: your voice goes into your computer through a global hotkey, and any agent's voice comes back out through a voice you own. Three pillars on top of one shared local LLM:

  • Dictation — global hotkey → Whisper → Qwen3 refine → synthetic paste into the focused app, with a floating on-screen pill across recording / transcribing / refining / done.
  • MCP server — voicebox.speak, voicebox.transcribe, voicebox.list_captures, voicebox.list_profiles exposed to Claude Code, Cursor, Cline, Spacebot, etc. via Streamable HTTP plus a stdio shim binary.
  • Personalities — voice profiles get an optional character prompt and two doors: a Compose button (fresh in-character utterance) and a personality: bool on /generate and voicebox.speak that rewrites text through the profile's LLM before TTS.

Everything still runs on-device. One Qwen3 instance powers refinement and personality, so there's one local LLM, not two.

What's in here

Dictation pipeline (Rust + app webview + backend)

  • Global hotkey via keytap — our own cross-platform observe-only tap (CGEventTap on macOS, SetWindowsHookEx on Windows, evdev on Linux). Lazy-spawned only when the user enables it, so the macOS Input Monitoring TCC prompt fires from an explicit gesture instead of app startup. The chord state machine (Momentary push-to-talk + Toggle hands-free + PTT→Toggle upgrade) lives in keytap's ChordMatcher; voicebox is a thin ChordEvent → Tauri event dispatcher.
  • Two modes, two chords. Push-to-talk and toggle, configurable in ChordPicker. Holding PTT then tapping Space upgrades to hands-free mid-recording without a gap. macOS defaults (right-Cmd + right-Option) deliberately avoid Cmd+Opt+I, Cmd+Opt+Esc, Cmd+Opt+Space. Windows defaults (right-Ctrl + right-Shift) route around AltGr collisions on DE/FR/ES layouts.
  • Auto-paste (tauri/src-tauri/src/{clipboard,synthetic_keys,focus_capture,keyboard_layout}.rs): snapshot focused PID at chord-start → activate (cooperative on macOS 14+) → 120ms settle → save full clipboard (every UTI/format) → write text → post Cmd+V/Ctrl+V at HID level with layout-resolved V keycode → 400ms paste-consume → conditional restore. Short-circuits if the focused app is Voicebox itself or Accessibility is untrusted.
  • Floating dictate window (Tauri webview, 420×64, transparent, always-on-top) hosts the pill. Hidden by parking off-screen + click-through to prevent stale invisible windows from stealing clicks.
  • Six-gate readiness (mic, Input Monitoring, Accessibility, hotkey toggle, Whisper model, LLM model). useDictationReadiness + DictationReadinessChecklist surface each unmet gate with a one-click action; useChordSync auto-arms the chord the moment the last gate flips green. Stops the "stuck pill" failure where the chord fires but has nowhere to land.
  • Recording state machine (useCaptureRecordingSession, 328 LOC): recording → transcribing → refining → completed → rest → hidden, with calibrated dwell/fade timers and ref-stable closures so late-arriving mutations still see the right flags.
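The pill's lifecycle can be modeled as a small transition table. A minimal sketch (the real useCaptureRecordingSession is TypeScript and also manages dwell/fade timers and ref-stable closures; the exact set of legal transitions here is an assumption based on the order listed above):

```python
# Hypothetical model of the capture pill's state machine. The state names
# come from the PR; which transitions are legal is an illustrative guess.
ALLOWED = {
    "hidden": {"recording"},
    "recording": {"transcribing", "hidden"},   # hidden = cancelled
    "transcribing": {"refining", "completed"}, # refine step is optional
    "refining": {"completed"},
    "completed": {"rest"},
    "rest": {"hidden"},
}

class PillStateMachine:
    def __init__(self) -> None:
        self.state = "hidden"

    def advance(self, next_state: str) -> str:
        # Reject transitions the table doesn't allow, e.g. a late-arriving
        # mutation trying to jump back to "refining" after the pill rested.
        if next_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {next_state}")
        self.state = next_state
        return self.state
```

An explicit table like this is what makes "late-arriving mutations still see the right flags" checkable: any out-of-order event fails loudly instead of corrupting the pill.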

MCP server

  • FastMCP mounted at /mcp (Streamable HTTP). app.py migrated from @app.on_event to lifespan= via compose_lifespan, so FastMCP's session manager nests inside FastAPI's startup/shutdown with correct LIFO teardown (MCP drains in-flight sessions before models unload).
  • Client identity & profile precedence: ClientIdMiddleware reads X-Voicebox-Client-Id into a ContextVar; MCPClientBinding table maps client → profile + default_personality bool; resolve_profile() picks explicit arg → per-client binding → global default. last_seen_at stamped on both /mcp (tool calls) and /speak (REST wrapper) so REST callers show up in the Settings UI.
  • Speak SSE: /events/speak broadcasts speak-start / speak-end. Rust speak_monitor.rs subscribes via reqwest streaming body (CRLF-aware parser, 45s idle timeout catches zombie streams, 500ms→30s escalating backoff) and surfaces the pill window when agent-initiated speech actually starts.
  • Stdio shim: separate voicebox-mcp binary (backend/mcp_shim/, ~18MB, built via python build_binary.py --shim) bridges stdio-only clients to the HTTP endpoint. Wired into tauri.conf.json externalBin.
  • Settings UI: MCPPage.tsx with HTTP / stdio / claude-mcp-add snippets, default-voice picker, per-client bindings table.
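The identity plumbing and precedence chain can be sketched as follows (names are illustrative, not the actual backend API; the real resolve_profile reads the MCPClientBinding table rather than a dict):

```python
# Sketch of ClientIdMiddleware's ContextVar handoff plus the profile
# precedence: explicit arg -> per-client binding -> global default.
from contextvars import ContextVar
from typing import Optional

# Middleware sets this from the X-Voicebox-Client-Id header; tool handlers
# read it without the request being threaded through every call.
client_id_var: ContextVar[Optional[str]] = ContextVar("client_id", default=None)

def resolve_profile(explicit: Optional[str],
                    bindings: dict,          # client_id -> profile name
                    global_default: Optional[str]) -> Optional[str]:
    if explicit is not None:
        return explicit                      # caller asked for a voice
    client_id = client_id_var.get()
    if client_id is not None and client_id in bindings:
        return bindings[client_id]           # per-client binding wins next
    return global_default                    # else the global default voice
```

The ContextVar is what lets the same function serve both /mcp tool calls and the /speak REST wrapper without either path knowing about the other.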

Personalities

  • VoiceProfile.personality (free-form text, ≤2000 chars) describes what the voice says and how, orthogonal to how it sounds (the existing preset/cloning metadata).
  • Two surfaces only:
    • Compose — POST /profiles/{id}/compose produces a fresh in-character utterance the user can edit (Dices button in FloatingGenerateBox, temp 0.9).
    • Personality rewrite — personality: true on /generate (and voicebox.speak) routes text through personality.rewrite_as_profile() before TTS, marks the row source = "personality_speak" (Wand2 toggle in FloatingGenerateBox, temp 0.3).
  • Same Qwen3 instance powers refinement and personality; no extra model.
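From a client's perspective the rewrite surface is one extra boolean on /generate. A hedged sketch of building that request (the endpoint path and personality flag are from this PR; the exact payload field names are assumptions for illustration):

```python
# Build a POST /generate request with the personality flag set.
# Field names ("text", "profile_id") are illustrative guesses.
import json
import urllib.request

def build_generate_request(text: str, profile_id: str, personality: bool,
                           base: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    body = json.dumps({
        "text": text,
        "profile_id": profile_id,
        "personality": personality,  # True = rewrite through the profile's LLM before TTS
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With personality false (or absent), the same endpoint is plain TTS; the flag is the only switch.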

Refinement hardening

  • Deterministic loop collapse before the LLM sees the transcript. Two passes:
    • Word-level: token repeated ≥6 times consecutively (with punctuation/casing normalised) gets dropped. Handles "URL, URL, URL, URL, URL, URL".
    • Character-level: any 2–60 char substring repeating ≥6 times in place gets dropped. Handles multi-word English loops ("thanks for watching" × 6) that the word pass misses, and CJK / Japanese loops ("謝謝觀看" × 7, "ご視聴ありがとうございました" × 6) where text.split() returns one unsplit token.
  • RefinementFlags (smart_cleanup, self_correction, preserve_technical) snapshotted per capture so re-refine is reproducible.
  • Eval harnesses: backend/tests/test_refinement_samples.py (10 transcripts, scores prompt leaks / answer leaks / loop echoes / question-mark survival / substring preservation across every bundled model size) and test_personality_samples.py. Plus test_refinement_collapse.py (17 deterministic unit tests pinning both passes against regression).
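The two deterministic passes can be sketched like this (a simplified model of the behaviour described above, not the shipped code; the PR says qualifying word runs are dropped, and the character pass collapses a run to one copy, which is what this sketch does):

```python
# Dual-pass Whisper loop collapse, run before the transcript reaches the LLM.
import re

def collapse_word_loops(text: str, min_run: int = 6) -> str:
    # Word-level pass: a token repeated >= min_run times consecutively
    # (punctuation/casing normalised) is dropped. Catches "URL, URL, URL...".
    tokens = text.split()
    norm = lambda t: re.sub(r"\W+", "", t).lower()
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and norm(tokens[j]) == norm(tokens[i]):
            j += 1
        if j - i < min_run:
            out.extend(tokens[i:j])   # short runs ("no, no, no, no, no") survive
        i = j                         # long runs are dropped entirely
    return " ".join(out)

def collapse_char_loops(text: str, min_run: int = 6) -> str:
    # Character-level pass: any 2-60 char substring repeated min_run+ times
    # in place collapses to one copy. Catches "thanks for watching" x 6 and
    # CJK loops ("謝謝觀看" x 7) that split() leaves as one token.
    pattern = re.compile(r"(.{2,60}?)(?:\1){%d,}" % (min_run - 1))
    collapsed = pattern.sub(r"\1", text)
    if collapsed != text:
        # Normalise whitespace only when something was actually stripped,
        # so untouched transcripts keep their original spacing.
        collapsed = re.sub(r"\s+", " ", collapsed).strip()
    return collapsed
```

The 2-char floor is what keeps emphasised single-letter runs ("wooooooow") intact: single characters can never form a qualifying backreference.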

Capture persistence + UI

  • New Capture model + routes/captures.py (POST /captures upload+STT+optional refine, GET /captures list, POST /captures/{id}/refine).
  • CapturesTab.tsx (834 LOC) — two-pane settings UI: searchable list ↔ detail with a scrubbable WaveSurfer audio player, Raw/Refined transcript toggle, "Play as" voice, Refine / Copy / Delete. Listens for capture:created / capture:updated events from the dictate window so new captures appear instantly.
  • CapturesPage.tsx (607 LOC) — settings page (different from the tab): hotkey toggle, dual ChordPicker, model pickers, auto-paste / auto-refine toggles, inline Accessibility + Input Monitoring notices, live pill preview.
  • useUIStore now persists selectedProfileId via zustand/middleware.persist so reopening the app keeps the last-selected profile.

Database

New tables: Capture, MCPClientBinding, singleton CaptureSettings (chord JSON, model picks, refinement flags, default voice), singleton GenerationSettings. VoiceProfile.personality column. Generation.source column. Idempotent column-existence migrations in backend/database/migrations.py (single-user desktop, no Alembic). MCPClientBinding's legacy default_intent column is dropped on startup; SQLite < 3.35 (Ubuntu 20.04, Debian 11) falls back to leaving the unused column in place rather than crashing — SQLAlchemy only maps declared columns so the stray does no work.
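The idempotent column-existence pattern amounts to a PRAGMA check before each ADD COLUMN. A minimal sketch using sqlite3 directly (the real migrations in backend/database/migrations.py run through the SQLAlchemy engine; table and column names here are just examples):

```python
# Idempotent "add column if missing" migration step for SQLite.
import sqlite3

def ensure_column(conn: sqlite3.Connection, table: str, column: str, ddl: str) -> None:
    # PRAGMA table_info returns one row per column; index 1 is the name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE voice_profile (id INTEGER PRIMARY KEY)")
ensure_column(conn, "voice_profile", "personality", "TEXT")
ensure_column(conn, "voice_profile", "personality", "TEXT")  # second run is a no-op
```

Because the check runs at every startup, there is no migration-version bookkeeping to get out of sync, which is the whole argument for skipping Alembic in a single-user desktop app.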

i18n

app/src/i18n/locales/{en,ja,zh-CN,zh-TW}/translation.json each grow ~375 lines. CapturesTab, CapturesPage, MCPPage, ChordPicker, both gates, the readiness checklist, ProfileCard / ProfileForm, FloatingGenerateBox, GenerationPage sidebar, Sidebar — all wired through t().

Landing + docs

New /capture route (CaptureHero, CaptureSection, CapturesMockup, AgentIntegration, SupportedModels, Personalities). New overview docs: captures, dictation, mcp-server, voice-personalities. Design notes in docs/plans/MCP_SERVER.md and docs/plans/VOICE_IO.md.

Notable design decisions

  • One Qwen3, not two. Refinement and personality share the same loaded model; temperature is the per-task knob (0.3 rewrite / 0.7 respond / 0.9 compose).
  • Loop-collapse before LLM. Small models can't echo what they never see. Dual-pass (word + character) covers every script we support.
  • personality: bool over a tri-state. Earlier iterations exposed intent: "respond" | "rewrite" | "compose" on voicebox.speak. Collapsed to a boolean — respond is dropped (callers that want a reply should call an LLM), compose is a UI-only standalone button (POST /profiles/{id}/compose), rewrite is the only meaning of personality: true. MCPClientBinding.default_intent migrated to default_personality: bool (legacy column dropped when SQLite supports it).
  • Lifespan composition via compose_lifespan. AsyncExitStack enters factories in order (Voicebox startup → MCP startup) and exits LIFO (MCP teardown → _run_shutdown). MCP cancels in-flight sessions before models unload, so an agent mid-voicebox.speak at shutdown gets a clean session-cancelled error instead of crashing on "model unloaded."
  • ContextVar for client_id. Avoids plumbing the request through every service; middleware sets it, tool handlers read it.
  • Idempotent migrations, no Alembic. Single-user desktop app; column-existence checks at startup are fast and reliable. Capability checks (_supports_drop_column) guard DDL that needs a newer SQLite.
  • Extract the chord engine into keytap. Upstream rdev hasn't shipped a release since 2023-06 and its macOS 14 main-thread crash only lives on an unreleased branch. Rather than maintain a git-pinned fork, we wrote keytap from the spec: observe-only cross-platform taps, left/right modifier fidelity, clean shutdown via Drop, Sonoma-safe by design, and a ChordMatcher with both Momentary and Toggle modes. Voicebox's PTT→hands-free upgrade falls out of longest-match resolution plus Toggle sticky-end semantics — voicebox no longer carries 80+ LOC of bespoke chord state machine, and supply-chain risk drops to a regular crates.io dep with semver.
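The lifespan-composition decision is easiest to see in miniature. A self-contained model of compose_lifespan (illustrative; the real function lives in mcp_server.server and receives the FastAPI app):

```python
# AsyncExitStack enters lifespan factories in order and exits LIFO, so the
# last-entered context (MCP) tears down first, before models unload.
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

log = []

@asynccontextmanager
async def voicebox_lifespan(app):
    log.append("voicebox start")        # load models
    try:
        yield
    finally:
        log.append("voicebox stop")     # unload models

@asynccontextmanager
async def mcp_lifespan(app):
    log.append("mcp start")             # start session manager
    try:
        yield
    finally:
        log.append("mcp stop")          # cancel in-flight sessions

def compose_lifespan(*factories):
    @asynccontextmanager
    async def composed(app):
        async with AsyncExitStack() as stack:
            for factory in factories:
                await stack.enter_async_context(factory(app))
            yield
    return composed

async def main():
    async with compose_lifespan(voicebox_lifespan, mcp_lifespan)(None):
        log.append("running")

asyncio.run(main())
```

The try/finally inside each factory is what makes a partial startup safe: if the second factory's enter fails, the stack unwinds and the first factory still runs its teardown.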

Follow-up / known limitations

Not blocking merge, but worth noting:

  • Windows synthetic paste still hardcodes VK_V. SendInput delivers wParam = VK_V to the target regardless of the active layout (which is why AutoHotkey's Send "^v" works on Dvorak Windows), so empirically this is fine — but if a Windows-on-Dvorak user reports breakage, VkKeyScanExW(L'V', layout) is the parallel fix (~30 lines).
  • speak_monitor reconnects silently when the idle timeout and backoff leave the pill offline. There is no user-facing "backend unreachable" toast; that belongs in a separate server-health surface, not this loop.

Test plan

  • Dictation, macOS — fresh install: enable hotkey, walk all six readiness gates, hold the chord in TextEdit / VS Code / Slack, transcript pastes into the field that had focus at chord-start (not wherever focus drifted).
  • Dictation, macOS, Dvorak — switch Input Source to Dvorak, dictate, paste should land (layout-aware V keycode resolved via TIS + UCKeyTranslate).
  • Dictation, macOS Sonoma+ — cooperative activation (yieldActivationToApplication: then activate) brings the target app forward; paste lands in the original window even when Voicebox wasn't frontmost.
  • Dictation, Windows — same base flow; AttachThreadInput foreground-lock dance lands the paste in the original window. BOOL import from windows::core compiles against windows-rs 0.62.
  • PTT → toggle upgrade — hold PTT chord, tap Space, release PTT — recording continues, stops on next press.
  • Auto-paste denied — revoke Accessibility, dictate, transcript still lands in Captures tab; "missing permission" notice shows.
  • Clipboard race — copy text X, dictate, during the paste window Cmd+C something else in the target app → after paste, clipboard holds the newer Cmd+C content, not X. No regression on the quiet case (no outside write → X restored).
  • Loop collapse — feed a Whisper hallucination loop ("URL URL URL…" ×30 OR "thanks for watching" ×6 OR "謝謝觀看" ×7), refine with 0.6B, LLM doesn't echo it. Rhetorical "no, no, no, no, no" (5) preserved.
  • MCP HTTP — claude-mcp-add snippet from Settings → MCP, call voicebox.speak({text, profile}), pill surfaces with agent speech. last_seen_at advances on the binding row.
  • MCP stdio shim — wire stdio config in a stdio-only client, same call path works through the voicebox-mcp binary.
  • REST /speak — curl -H "X-Voicebox-Client-Id: test", binding shows up in the Settings UI as last-seen.
  • Per-client binding — bind Claude Code to "Morgan", call voicebox.speak({text}) with no profile, hear Morgan; override with profile= and hear that voice instead.
  • Personality rewrite — profile with personality, voicebox.speak({text, personality: true}), output is in character. personality: false (or unset with no binding default) is plain TTS.
  • Compose button — profile with personality, click Dices in FloatingGenerateBox, fresh in-character utterance appears in the input.
  • MCP migration — start app on a DB that has the old default_intent column, verify it's dropped (or left in place on SQLite < 3.35 with a warning) and default_personality exists with default false.
  • Lifespan shutdown — with an MCP request mid-generate, Ctrl+C the dev backend; clean shutdown (StreamableHTTP teardown logs before "Voicebox server shutting down"), no "model unloaded" crashes in logs.
  • Speak-SSE dead stream — pause/freeze the backend after connection opens; within ~45s the Rust speak_monitor logs the idle timeout and reconnects with escalating backoff.
  • i18n smoke — switch UI language to ja / zh-CN / zh-TW, walk through Captures + Captures settings + MCP + a profile with personality, no untranslated keys.
  • Capture audio player — open a capture in the detail view, WaveSurfer renders the waveform, scrubbing works, play/pause toggles.
  • Profile persistence — select a profile, restart the app, last-selected profile still selected.
  • Refinement eval harnesses — pytest backend/tests/test_refinement_samples.py and test_personality_samples.py produce reasonable output across bundled model sizes. pytest backend/tests/test_refinement_collapse.py backend/tests/test_client_id_middleware.py — both pass (34 tests).


Summary by CodeRabbit

  • New Features

    • Full Captures workflow: global dictation hotkey/pill, timed recording, auto-paste, upload, inline audio player, refine/retranscribe, play-as voice, retention and compose/rewrite “personality” support.
    • Local MCP server and REST /speak with per-client bindings and speaking-state events; LLM-backed personality compose/rewrite.
  • UI

    • Captures & MCP settings pages, chord picker, readiness checklist, accessibility/input notices, compose/persona controls, capture pill and toolbar.
  • Documentation

    • Extensive guides for dictation, captures, personalities, MCP server, migration.
  • Tests

    • New tests and evaluation scripts for personality and refinement.

jamiepine and others added 22 commits April 22, 2026 18:49
Ships the Capture release end to end. Global-hotkey dictation with
synthetic paste into the focused app on macOS and Windows, an on-screen
pill across recording / transcribing / refining, customizable push-to-
talk and toggle chords, and an accessibility-permission prompt scoped to
Settings → Captures with inline re-check feedback.

Voice profiles gain optional personalities that power compose / rewrite /
respond actions via a local Qwen3 LLM — shared with refinement, so there
is one local LLM in the app, not two.

Refinement hardened with deterministic Whisper-loop collapse before the
LLM sees the transcript, per-capture flag snapshots for re-runs, and a
ten-transcript evaluation harness across every bundled refinement size.

Version bump 0.4.5 → 0.5.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mounts FastMCP at /mcp (Streamable HTTP) so Claude Code, Cursor,
Windsurf, and the VS Code MCP extensions can call voicebox.speak,
voicebox.transcribe, voicebox.list_captures, and voicebox.list_profiles
against the running Voicebox server.

Backend
- new backend/mcp_server package (tools, middleware, profile resolve,
  pub/sub events); named mcp_server to avoid shadowing the installed mcp
  PyPI package FastMCP imports internally
- app.py migrated from @app.on_event to lifespan= so FastMCP's session
  manager cohabits with Voicebox's startup/shutdown
- new MCPClientBinding table + /mcp/bindings CRUD; ClientIdMiddleware
  reads X-Voicebox-Client-Id into a ContextVar and stamps last_seen_at
- profile resolution precedence: explicit -> per-client binding ->
  capture_settings.default_playback_voice_id
- POST /speak REST wrapper for non-MCP callers (shell, ACP, A2A)
- GET /events/speak SSE broadcasts speak-start / speak-end so the pill
  surfaces agent-initiated speech
- backend/mcp_shim proxy (plain httpx) for stdio-only MCP clients
- PyInstaller spec updates + new --shim build target (~18 MB)

Frontend
- Settings -> MCP page with HTTP / stdio / claude-mcp-add copy snippets,
  default voice picker, per-client bindings table, connection status
- useMCPBindings, useSpeakEvents hooks
- CapturePill gains 'speaking' state; DictateWindow subscribes to SSE
  and emits dictate:show so the Rust side surfaces the pill window

Native
- tauri.conf.json externalBin now includes voicebox-mcp
- show_dictate_window helper + dictate:show listener in main.rs
- (also in this commit: InputMonitoringGate UX, hotkey_monitor tweaks,
  landing footer/navbar updates, new overview docs for captures /
  dictation / mcp-server / voice-personalities)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pill window now surfaces for agent-initiated speech without main-window
involvement. Rust subscribes to /events/speak via a tokio task + reqwest
streaming body (speak_monitor.rs), shows the pill, and forwards events to
the dictate webview over Tauri's event bus. The pill plays audio via a
plain HTMLAudioElement and emits dictate:hide when playback ends. The
pill stays hidden through the ~1 s generation wait and only surfaces when
audio actually starts, with the counter armed at that moment.

Fixes a shared-dict mutation in mcp_server/events.publish() that caused
the second subscriber (Rust speak_monitor) to receive `event: message`
instead of named speak-start/speak-end frames. Also teaches the speak_monitor
parser to handle CRLF framing (sse-starlette default). Main-window
AudioPlayer now skips autoplay for source in {mcp, rest} to avoid
double-play when both windows are alive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stops the "stuck pill" failure where pressing the chord with missing
STT/LLM models triggers a recording that has nowhere to land. The
hotkey now stays disarmed until every gate (models downloaded, Input
Monitoring + Accessibility granted) is green; the empty-state checklist
in CapturesTab surfaces each unmet gate with a one-click action and
auto-arms the chord once everything turns green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Collapse intent tri-state (respond/rewrite/compose) to `personality: bool` on /generate, /speak, and voicebox.speak. Drop respond entirely; keep compose as a standalone button via /profiles/{id}/compose. Remove /rewrite, /respond, and /speak profile endpoints.
- FloatingGenerateBox: Wand2 persona toggle + Dices compose button appear when the selected profile has a personality. ProfileCard badges Wand2 alongside the effects Sparkles.
- MCP bindings: default_intent column → default_personality: bool. Migration drops the legacy column.
- i18n: en / ja / zh-CN / zh-TW translation files filled out and wired through the capture, server, and profile UI.

```ts
voicebox.speak({
  text: "Deploy complete.",
  profile: "Morgan",
  personality: true, // rewrite through the profile's personality LLM
});
```
BOOL moved from Win32::Foundation to windows::core in 0.62.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
macOS apps match Cmd+V against the layout-translated character via NSMenu
key equivalents, so posting kVK_ANSI_V (= 9, the QWERTY V position) on
Dvorak produces Cmd+. and never triggers Paste. New keyboard_layout
module resolves the active layout's V keycode via
TISCopyCurrentKeyboardLayoutInputSource + UCKeyTranslate, caches it in an
AtomicU16, and refreshes on kTISNotifySelectedKeyboardInputSourceChanged.
All TIS calls run on the main thread (init from Tauri setup; observer
callback delivered to the main runloop); synthetic_keys::send_paste
reads the cached value once per paste. Falls back to kVK_ANSI_V when
resolution fails or the active input source carries no Unicode key
layout data.

Windows is intentionally left on hardcoded VK_V — SendInput delivers
WM_KEYDOWN with wParam = VK_V to the target regardless of the active
layout, which is why `Send "^v"` works for AutoHotkey on Dvorak Windows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… 14+

macOS 14 deprecated NSRunningApplication.activateWithOptions: in favour
of a cooperative-activation pattern: the caller first yields activation
rights to the target, then the target activate()s against the tightened
Sonoma foreground rules. Without the yield, activate() on 14+ sometimes
silently fails or only bounces the dock icon — the exact "paste lands in
the wrong app" symptom we were previously one API break away from.

activate_pid now discovers the 14+ selector via respondsToSelector: and
branches: on 14+ it yieldActivationToApplication:'s from
NSRunningApplication.current then calls -activate on the target; on
11–13 it stays on -activateWithOptions: (still the only option). Both
branches propagate the BOOL return — if activation is refused we error
out before clobbering the clipboard instead of silently proceeding.

The respondsToSelector: result is cached in a OnceLock so the probe
isn't repeated on every paste.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… failure

Two bugs in paste_final_text's clipboard handling:

1. Restore was unconditional. If the user ⌘C'd in the target app during
   the 400 ms paste-consume window — or a clipboard history tool (Paste,
   Pastebot, Maccy) or Universal Clipboard sync snapshotted our staged
   text — the blind restore overwrote their newer content with the
   pre-paste snapshot, silently losing user data.

2. send_paste's errors were propagated with ? before the restore, so a
   CGEventPost / SendInput failure left the user's clipboard stuck on
   the transcript.

Fix folds both into one pattern: capture the post-write change count,
re-read it after paste-consume, restore only when they match (plus treat
a change-count read failure as "unknown, don't overwrite"). Isolate
send_paste's error so the restore runs regardless of paste success, then
propagate the paste error after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
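The conditional-restore pattern from the commit above can be sketched abstractly (the Clipboard type here is a hypothetical stand-in; the real code reads NSPasteboard's changeCount on macOS):

```python
# Restore the user's clipboard only if nothing else wrote during the
# paste-consume window, using a change counter as the tamper check.
class Clipboard:
    """Hypothetical clipboard with a monotonically increasing change count."""
    def __init__(self) -> None:
        self.content = ""
        self.change_count = 0

    def write(self, text: str) -> None:
        self.content = text
        self.change_count += 1

def paste_with_conditional_restore(cb: Clipboard, snapshot: str,
                                   transcript: str, during_paste=None) -> None:
    cb.write(transcript)                # stage the transcript for Cmd+V
    staged = cb.change_count            # capture post-write change count
    if during_paste:
        during_paste()                  # stands in for the 400 ms paste window
    if cb.change_count == staged:
        cb.write(snapshot)              # quiet case: restore user's clipboard
    # else: someone else wrote in the window; don't clobber their newer copy
```

A change-count read failure in the real code is treated as "unknown, don't overwrite", erring on the side of never losing user data.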
Upstream Narsil/rdev has shipped no release since 2023-06 (crates.io
still serves 0.5.3), so the Sonoma main-thread fix we depend on — PR
#147, applied at hotkey_monitor.rs:184 — is only reachable via a git
pin. A pin to a third-party repo breaks the build whenever the remote
force-pushes, renames, or is taken down, and Cargo does not durably
cache git-dep archives the way it does crates.io tarballs.

Forking to jamiepine/rdev at the same SHA removes that failure mode
without changing crate behavior and gives us a place to cherry-pick
future OS-compatibility fixes on our own timeline. The SHA was verified
to exist on the fork before re-pinning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two reliability gaps in the /events/speak subscriber:

1. resp.chunk().await had no idle timeout. A backend that accepts the
   TCP connection but stops producing frames (deadlocked SSE endpoint,
   zombie process) would block the task forever without reconnecting.
   The pill window would never surface for agent-initiated speech and
   there would be nothing to log. Backend emits a `:ping` heartbeat
   every 15 s, so 45 s without any data is now treated as a dead
   stream — the task errors out and the reconnect loop takes over.

2. Flat 2 s backoff escalates nowhere. Logs fill with reconnect lines
   when the backend is down for minutes, and a backend that accepts +
   immediately closes connections (no data) spins the loop tightly.
   Backoff now escalates 500 ms → 30 s on unproductive rounds and
   resets only when at least one frame arrives (the connection was
   genuinely productive, not just accepted).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
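The backoff schedule described above reduces to a few lines. A sketch (function name and shape are illustrative; the real logic lives in speak_monitor.rs):

```python
# Escalating reconnect backoff: double on unproductive rounds up to a cap,
# reset to the base only when at least one frame actually arrived.
def next_backoff(current_ms: int, got_frame: bool,
                 base_ms: int = 500, cap_ms: int = 30_000) -> int:
    if got_frame:
        return base_ms                  # connection was genuinely productive
    return min(current_ms * 2, cap_ms)  # accepted-then-closed rounds escalate
```

Resetting on frames rather than on successful connects is the key detail: a backend that accepts and immediately closes connections never resets the timer, so the loop cannot spin tightly.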
The word-level pass catches single-word Whisper loops ("URL URL URL…")
but misses two common hallucination patterns the PR had to claim as
"edge cases":

1. Multi-word English loops — "thanks for watching thanks for watching…"
   × 6 sails through because no two consecutive tokens are identical
   after text.split().
2. CJK loops — "謝謝觀看" × 7 sails through because text.split() returns
   a single unsplit token for the whole loop (no whitespace between
   characters).

Add a character-level second pass: a non-greedy regex finds any 2–60
char substring that repeats min_run+ times immediately after itself and
strips the run. The 2-char floor keeps emphasised single-letter runs
("wooooooow") intact. The 60-char ceiling covers every observed
Whisper tail hallucination ("Please like and subscribe to my
channel.", "Subtitles by the Amara.org community") while staying short
enough that coincidental long-phrase repetition in legitimate speech
doesn't hit the threshold. Whitespace normalisation only runs when the
pass actually stripped something, so untouched transcripts keep their
original spacing.

New test_refinement_collapse.py gives the pre-processor its first
deterministic unit-test coverage: 17 tests pinning the word-level
legacy behaviour plus the new multi-word English / CJK / Japanese /
emphasis-preservation cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQLite gained ALTER TABLE … DROP COLUMN in 3.35 (Mar 2021). Production
PyInstaller builds bundle Python 3.12 which links to SQLite 3.40+ so
that path is always safe, but a dev running the backend directly on
Ubuntu 20.04 (3.31) or Debian 11 (3.34) would crash on first startup
trying to drop the legacy default_intent column.

Add _supports_drop_column(engine) — returns True on non-SQLite
dialects (Postgres / MySQL have supported DROP COLUMN for decades) and
gates on the runtime sqlite_version for SQLite. When unsupported, log a
warning and leave the unused column in place: SQLAlchemy only maps
declared columns, so a stray default_intent column does no reads or
writes and can't interfere with runtime behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
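The capability gate amounts to a version check on the runtime SQLite library. A sketch of the idea (the real _supports_drop_column takes a SQLAlchemy engine; this version takes a dialect name for illustration):

```python
# Gate ALTER TABLE ... DROP COLUMN on the runtime SQLite version.
# SQLite gained DROP COLUMN in 3.35 (March 2021).
import sqlite3

def supports_drop_column(dialect: str = "sqlite") -> bool:
    if dialect != "sqlite":
        return True  # Postgres / MySQL have supported DROP COLUMN for decades
    # sqlite_version_info reflects the linked library, not the Python version.
    return sqlite3.sqlite_version_info >= (3, 35, 0)
```

When this returns False the migration logs a warning and leaves the legacy column in place, which is harmless because SQLAlchemy only maps declared columns.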
…g models

The inline lifespan ran _run_shutdown inside the MCP context, so the
TTS / Whisper / LLM models were unloaded *before* FastMCP's __aexit__
got a chance to cancel its in-flight session tasks. Any MCP request
mid-generate at shutdown time would crash on "model unloaded" instead
of receiving a clean session-cancelled error.

Rewire via compose_lifespan (which was already defined in
mcp_server.server for exactly this purpose but never used):
AsyncExitStack enters factories in order and exits in LIFO, so
MCP teardown fires first — cancelling sessions — and _run_shutdown
runs after nothing is holding the models. Smoke test shows the
log order flipped as expected:

  Ready
  StreamableHTTP session manager started
  ... running ...
  StreamableHTTP session manager shutting down   ← was last, now first
  Voicebox server shutting down...               ← was first, now last

As a side benefit, _run_shutdown is now paired with _run_startup via
try/finally inside voicebox_lifespan, so a partial startup (models
half-loaded, MCP __aenter__ fails) still unloads whatever was loaded
instead of leaking it to process exit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST /speak is a REST wrapper around voicebox.speak for agents that
don't talk MCP (shell scripts, ACP, A2A). It reads X-Voicebox-Client-Id
and uses it for the same per-client profile resolution + default
personality lookup the MCP tool does (speak.py:39-64), so its callers
are first-class clients — but the ClientIdMiddleware only stamped
last_seen_at on /mcp* paths. REST speak callers showed up as "never
seen" in Settings → MCP despite actively acting on their bindings.

Widen the stamp predicate to an explicit ("/mcp", "/speak") prefix
list, and require a path boundary on match so future routes named
/mcpfoo or /speakers don't silently inherit the stamp via the prefix.
New test_client_id_middleware.py pins the scope with 17 parametrised
cases (both the allowed set and the overlap cases that must not match).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
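The boundary-aware predicate from the commit above is small enough to show whole (a sketch consistent with the described behaviour, not the literal middleware code):

```python
# Stamp last_seen_at only for /mcp and /speak and their sub-paths;
# require a path boundary so /mcpfoo and /speakers don't match.
STAMPED_PREFIXES = ("/mcp", "/speak")

def should_stamp(path: str) -> bool:
    return any(path == p or path.startswith(p + "/") for p in STAMPED_PREFIXES)
```

The `path == p or path.startswith(p + "/")` shape is the standard fix for prefix-matching routes without inheriting lookalike siblings.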
Replace the placeholder fake-waveform + play button in CapturesTab's
audio card with a real CaptureInlinePlayer (wavesurfer.js). The player
renders the actual waveform, lets users scrub through the clip, and
shows a proper current/total timestamp pair in place of the
duration-only label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wrap useUIStore in zustand/middleware's persist under the key
voicebox-ui. partialize only selectedProfileId so volatile UI state
(dialog open flags, form drafts, engine/voice pickers, sidebar) stays
in-memory as before — but reopening the app no longer loses whichever
profile the user was last working with.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds Captures/dictation and personality-driven generation, an MCP server (HTTP + stdio shim) with per-client bindings, local Qwen3 LLM backends, capture CRUD/refinement pipelines, extensive Tauri native integrations for hotkeys/permissions/clipboard/focus, many frontend UI pages/components, API surface and DB migrations, and bumps version to 0.5.0.

Changes

  • Version & Manifests — `.bumpversion.cfg`, `package.json`, `app/package.json`, `landing/package.json`, `tauri/package.json`: bumped project and package versions to 0.5.0.
  • Changelog & Docs — `CHANGELOG.md`, `README.md`, `docs/...`, `docs/content/docs/...`: large documentation additions and rewrites for Captures, Dictation, Voice Personalities, MCP server, roadmap and overview content.
  • Frontend entry & routing — `app/src/App.tsx`, `app/src/router.tsx`: route-aware App split (DictateWindow/MainApp) and new /captures route plus /settings/captures and /settings/mcp.
  • Captures UI & components — `app/src/components/CapturesTab/CapturesTab.tsx`, `CapturePill.tsx`, `CaptureInlinePlayer.tsx`, `CapturesPage.tsx`, `DictateWindow.tsx`, `landing/src/components/CapturesMockup.tsx`: new Captures feature: UI, DictateWindow, pill/player components, upload/refine/retranscribe/play-as flows and mockups.
  • Recording/session hooks — `app/src/lib/hooks/useCaptureRecordingSession.ts`, `useAudioRecording.ts`: new capture recording session hook, refined audio recording behavior (no timesliced chunks; optional max duration).
  • Hotkeys, chords & permissions — `app/src/components/ChordPicker/...`, `useChordSync.ts`, `tauri/src-tauri/src/hotkey_monitor.rs`, `input_monitoring.rs`, `AccessibilityGate.tsx`, `InputMonitoringGate.tsx`: chord picker UI, chord synchronization hook, Rust hotkey monitor, macOS input-monitoring & accessibility permission hooks and notices.
  • Personality & generation UI — `app/src/components/Generation/...`, `ProfileForm.tsx`, `ProfileCard.tsx`, removed `GenerationForm.tsx`: adds personality field to profiles, compose action and persona toggle in generation UI; removes legacy GenerationForm.
  • API client & types — `app/src/lib/api/client.ts`, `app/src/lib/api/types.ts`: new API client methods and types: captures CRUD, composeWithPersonality, settings, MCP bindings, generation/personality flags and capture-related types.
  • Settings, stores & settings UI — `app/src/lib/hooks/useSettings.ts`, `app/src/stores/*`, `ServerTab/*`, `MCPPage.tsx`, `ServerTab/CapturesPage.tsx`: server-backed capture/generation settings hooks with optimistic updates; MCP bindings UI; store changes (persist config, personality draft) and new settings pages.
  • MCP server & shim — `backend/mcp_server/*`, `backend/mcp_shim/*`, `backend/build_binary.py`, `backend/mcp_server/README.md`: FastMCP-based MCP server mounted at /mcp, tools (speak/transcribe/list), ClientId middleware, in-memory events, stdio shim and packaging/build support.
  • Backend models, migrations & routes — `backend/database/models.py`, `migrations.py`, `backend/models.py`, `backend/routes/*`, `backend/services/*`: DB additions (captures, settings, MCP bindings, profile.personality, generation.source), migrations, new REST routes (/captures, /speak, /settings, /llm, /mcp/bindings, /events/speak), capture/refine services, personality and LLM services.
LLM backend & generation
backend/backends/qwen_llm_backend.py, backend/backends/__init__.py, backend/services/generation.py
New Qwen3 LLM backends (PyTorch & MLX), backend registry integration, personality rewrite/compose integration and new sync TTS generation helper.
Tauri native integrations
tauri/src-tauri/src/*.rs, Cargo.toml, capabilities/default.json
Native modules: accessibility, clipboard snapshot/restore, focus capture/restore, hotkey monitor, key-codes/layout, input-monitoring; capability and Cargo updates.
Utilities & i18n
app/src/lib/utils/keyCodes.ts, format.ts, app/src/i18n/locales/*/translation.json
Keycode helpers, date formatting util, large i18n additions for captures/personality/MCP and related UI strings.
Tests & tooling
backend/tests/*, scripts/setup-dev-sidecar.js, backend/requirements.txt
New tests (middleware, refinement/personality), dev sidecar placeholder creation updated, backend deps added (fastmcp, sse-starlette) and PyInstaller/shim build support.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Hotkey as Hotkey Monitor<br/>(Tauri/Rust)
    participant Dictate as DictateWindow<br/>(React)
    participant Backend as Backend<br/>(Python)
    participant STT as Whisper<br/>(STT)
    participant LLM as Qwen3<br/>(LLM)
    participant Clipboard as Clipboard<br/>(Tauri)

    User->>Hotkey: Hold/Toggle chord
    Hotkey->>Dictate: dictate:start (with FocusSnapshot)
    Dictate->>Dictate: Show CapturePill (recording)
    Dictate->>Dictate: Record audio
    User->>Hotkey: Release chord
    Hotkey->>Dictate: dictate:stop
    Dictate->>Backend: POST /captures (upload)
    Backend->>STT: Transcribe audio
    STT-->>Backend: Raw transcript
    alt auto_refine true
        Backend->>LLM: refine/rewrite transcript
        LLM-->>Backend: refined transcript
    end
    Backend-->>Dictate: Capture response (id, text, flags)
    alt allow_auto_paste
        Dictate->>Clipboard: write/paste final text
    end
    Dictate->>Dictate: CapturePill -> completed -> rest
sequenceDiagram
    actor Agent
    participant MCP as MCP Server (FastMCP)
    participant Backend as Backend (Python)
    participant LLM as Qwen3 (LLM)
    participant TTS as TTS Engine
    participant Frontend as DictateWindow (React)
    participant SSE as SSE (/events/speak)

    Agent->>MCP: voicebox.speak(profile?, personality?)
    MCP->>MCP: resolve_profile (explicit → binding → default)
    MCP->>Backend: POST /generate (personality=true)
    Backend->>LLM: rewrite/compose if personality
    LLM-->>Backend: in-character text
    Backend->>TTS: generate audio
    TTS-->>Backend: WAV
    Backend->>Backend: persist generation, publish speak-start
    MCP-->>Agent: generation_id + status URL
    Frontend->>SSE: subscribe
    SSE-->>Frontend: speak-start
    Frontend->>Backend: GET /audio/{id}
    Backend-->>Frontend: stream audio
    Frontend->>Frontend: play audio, show pill
    Backend->>SSE: publish speak-end
    SSE-->>Frontend: speak-end -> cleanup

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

"I hopped along the hotkey trail,
Captured words in pill and tale,
Profiles donned a playful voice,
Agents spoke — the users rejoice,
On-device, the little rabbit sings: v0.5!" 🐇✨



@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit 687ab2a.

Comment thread app/src/components/ChordPicker/ChordPicker.tsx
Comment thread app/src/components/CapturesTab/DictationReadinessChecklist.tsx Outdated
The six-gate checklist only rendered in the CapturesTab empty state, so
a user already on the settings page had no single surface showing which
gate was red — the inline InputMonitoringNotice covered one, the model
pickers covered another, and Accessibility was only hinted at by the
auto-paste toggle. Mirror the same component into the right sidebar of
the settings page so every gate (STT model, LLM model, Input Monitoring,
Accessibility, plus the hotkey toggle in the main column) is always
visible while the user configures dictation.

New compact prop on DictationReadinessChecklist drops the centered
header and empty-state max-width so it fits the 280 px sidebar next to
the existing About / Differences blocks. Callers in compact mode own
the heading — CapturesPage reuses the existing captures.readiness.title
key (present in en / ja / zh-CN / zh-TW already) as an h3 matching the
sibling sidebar sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
app/src/components/ServerTab/GenerationPage.tsx (1)

64-95: ⚠️ Potential issue | 🔴 Critical

Slider writes are unthrottled and will spam the settings endpoint on every pointer move during drag.

Dragging maxChunkChars or crossfadeMs triggers onValueChange dozens of times per second, each calling update() → mutation.mutate() → apiClient.updateGenerationSettings(patch) with no debounce, throttle, or request coalescing. This produces a stream of unthrottled HTTP PATCH requests. The optimistic onMutate update masks latency but does not prevent the request storm.

Additionally, if any mid-drag request fails and onError reverts the cache to a previous state, a subsequent successful request could overwrite that reversion, leaving the persisted value out of sync with the UI. Add debounce (e.g., 500–1000ms) to the update call or use useDebouncedCallback to coalesce slider changes during drag.
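A minimal, framework-free sketch of the coalescing idea — `debounceLatest` and the 750 ms delay are illustrative names, not the repo's actual code:

```typescript
// Hypothetical sketch: buffer rapid slider changes and send only the last
// value after the drag pauses. `update` stands in for the existing mutation.
function debounceLatest<T>(fn: (value: T) => void, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (value: T) => {
    if (timer !== undefined) clearTimeout(timer); // drop the stale pending call
    timer = setTimeout(() => fn(value), delayMs); // only the latest value fires
  };
}

// usage sketch:
//   const updateDebounced = debounceLatest(update, 750);
//   <Slider onValueChange={(v) => updateDebounced({ maxChunkChars: v[0] })} />
```

Because each new call replaces the pending timer, a drag of any length collapses to a single PATCH carrying the final value.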

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ServerTab/GenerationPage.tsx` around lines 64 - 95, The
Slider onValueChange handlers call update(...) which immediately invokes
mutation.mutate (via apiClient.updateGenerationSettings) on every pointer move;
debounce or coalesce those calls so the server is only patched after pauses
(e.g., 500–1000ms) and only send the latest value. Modify the update function
used by the Slider components (referenced as update and the Slider id props
"maxChunkChars" and "crossfadeMs") to buffer rapid changes and call
mutation.mutate once after a debounce delay (or use useDebouncedCallback),
cancel/replace previous pending debounced calls, and ensure the mutation’s
onMutate/onError handlers correctly reconcile optimistic state with the final
settled value to avoid reverting to stale values.
app/src/components/VoiceProfiles/ProfileForm.tsx (1)

825-835: ⚠️ Potential issue | 🟡 Minor

Discard reset is missing personality (and avatarFile).

The other three form.reset(...) calls for this form (lines 333, 347, 372) all include personality: ''. The discard-draft reset below omits it, so after clicking Discard a draft's personality value stays in the form state (and the rendered textarea) rather than being cleared. Same for avatarFile. Align with the other resets for consistency.

🔧 Proposed fix
                     form.reset({
                       name: '',
                       description: '',
                       language: 'en',
+                      personality: '',
                       sampleFile: undefined,
                       referenceText: '',
+                      avatarFile: undefined,
                     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/VoiceProfiles/ProfileForm.tsx` around lines 825 - 835, The
discard handler currently resets the draft via setProfileFormDraft(null) and
calls form.reset(...) but omits the personality and avatarFile fields, leaving
stale values; update the form.reset call inside the onClick to include
personality: '' and avatarFile: undefined (matching the other resets) so the
textarea and avatar state are cleared, keeping setProfileFormDraft and
setSampleMode('record') as they are.
🧹 Nitpick comments (41)
landing/src/components/Personalities.tsx (1)

175-235: Optional: respect prefers-reduced-motion for the 4.5s auto-cycle.

The section auto-cycles continuously via setInterval, which can be uncomfortable for motion-sensitive visitors. Consider pausing the rotation (and/or skipping the framer-motion entrance) when window.matchMedia('(prefers-reduced-motion: reduce)').matches.
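One way to sketch that guard, with `startAutoCycle` as a hypothetical effect body (the caller would pass in the `matchMedia` result; the 4500 ms default matches the 4.5 s cycle described above):

```typescript
// Hypothetical sketch: skip the rotation entirely for motion-sensitive users.
// The caller passes matchMedia('(prefers-reduced-motion: reduce)').matches.
function startAutoCycle(
  prefersReducedMotion: boolean,
  advance: () => void,
  periodMs = 4500,
): () => void {
  if (prefersReducedMotion) return () => {}; // keep the current panel fixed
  const iv = setInterval(advance, periodMs);
  return () => clearInterval(iv);            // cleanup, as a useEffect teardown
}
```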

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@landing/src/components/Personalities.tsx` around lines 175 - 235, The
auto-rotation in Personalities uses setInterval inside the useEffect which
ignores users' prefers-reduced-motion; update the effect to detect
window.matchMedia('(prefers-reduced-motion: reduce)') and if it matches, do not
start the interval (or clear it immediately), and instead keep idx fixed; also
pass a prop like reducedMotion or prefersReducedMotion into ModeDemo (and any
child components that use framer-motion) so they can skip entrance/animation
when true. Ensure you reference Personalities, its useEffect that creates iv,
and ModeDemo (add a boolean prop) when making the changes.
app/src/components/ServerSettings/ModelManagement.tsx (2)

86-91: Model descriptions remain hardcoded English while surrounding UI is i18n'd.

These strings (like the pre-existing TTS/Whisper entries) bypass t() so they won't be translated in ja / zh-CN / zh-TW. Consistent with the existing pattern here, so not a regression for this PR — flagging for a future refactor to move MODEL_DESCRIPTIONS behind an i18n key namespace (e.g. models.descriptions.qwen3-0.6b).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ServerSettings/ModelManagement.tsx` around lines 86 - 91,
MODEL_DESCRIPTIONS contains hardcoded English strings (e.g., keys 'qwen3-0.6b',
'qwen3-1.7b', 'qwen3-4b') that bypass i18n; update the code so descriptions are
retrieved via the translation function t() from an i18n namespace (suggested
keys like models.descriptions.qwen3-0.6b etc.) instead of inline text, and
refactor any use-sites in ModelManagement.tsx to call
t(`models.descriptions.${modelKey}`) (or fallback to the English string) so
descriptions are translatable.

420-420: Prefix qwen3- is fine today but narrow — consider explicit allowlist.

If a future model is named e.g. qwen3-tts-* or qwen3-embed-*, it will be silently categorized as a language model. Since the three qwen3 LLM variants are enumerated in MODEL_DESCRIPTIONS just above, you could tighten this by matching against that known set, or by introducing a model.kind === 'llm' field returned by the backend.

♻️ Optional: filter against known LLM ids
-  const llmModels = modelStatus?.models.filter((m) => m.model_name.startsWith('qwen3-')) ?? [];
+  const LLM_MODEL_NAMES = new Set(['qwen3-0.6b', 'qwen3-1.7b', 'qwen3-4b']);
+  const llmModels = modelStatus?.models.filter((m) => LLM_MODEL_NAMES.has(m.model_name)) ?? [];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ServerSettings/ModelManagement.tsx` at line 420, The
current llmModels filter uses a broad prefix check (llmModels =
modelStatus?.models.filter((m) => m.model_name.startsWith('qwen3-'))), which can
misclassify future qwen3 variants; change the filter to match only the explicit,
known LLM ids from MODEL_DESCRIPTIONS (or another local allowlist) by comparing
model.model_name against that set (e.g. build a Set from
Object.keys(MODEL_DESCRIPTIONS) or the enumerated ids) or, if available, use a
backend-provided discriminator like model.kind === 'llm' instead; update the
llmModels assignment to use that allowlist-based check against
modelStatus.models so only the intended qwen3 LLMs are included.
backend/mcp_shim/__init__.py (1)

6-6: Docstring hardcodes 127.0.0.1:17493.

If the backend port is configurable elsewhere (env var / CLI flag in __main__.py), consider softening the docstring to "the Voicebox server's /mcp/ endpoint" to avoid it drifting from reality later. Minor doc nit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_shim/__init__.py` at line 6, The module-level docstring in
backend/mcp_shim/__init__.py currently hardcodes "http://127.0.0.1:17493/mcp/";
update that docstring to a generic description such as "the Voicebox server's
/mcp/ endpoint" (or otherwise reference the configurable backend host/port) so
it doesn't drift if the address is configurable elsewhere (e.g., __main__.py or
env vars); edit the module docstring text in __init__.py to remove the literal
IP:port and replace it with the softer phrasing.
scripts/setup-dev-sidecar.js (1)

369-379: Optional: derive sidecar list from tauri.conf.json instead of duplicating.

The in-code comment already acknowledges this list must be kept in sync with externalBin in tauri.conf.json. Reading and parsing that file would eliminate the drift risk the comment warns about. Not a blocker; fine to keep as-is for now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-dev-sidecar.js` around lines 369 - 379, Replace the hard-coded
SIDECAR_BASE_NAMES with a dynamically derived list by reading and parsing
tauri.conf.json: in main() (which currently calls getTargetTriple() and loops
over SIDECAR_BASE_NAMES to call createPlaceholderBinary), load and parse
tauri.conf.json, extract the externalBin keys/names from the configuration (or
from build.externalBin depending on schema), map/normalize them to the same
baseName format used today, and iterate that array instead of
SIDECAR_BASE_NAMES; keep getTargetTriple() and createPlaceholderBinary usages
unchanged and add minimal error handling if tauri.conf.json is missing or
malformed.
app/src/router.tsx (1)

116-116: Stale comment: captures is no longer a prototype.

AudioTab has already been removed from the router and sidebar in this PR, so the "prototype — will replace AudioTab once the new flow is ready" wording is misleading. Consider updating or removing the comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/router.tsx` at line 116, The inline comment above the captures route
is stale and misleading because AudioTab has already been removed; update or
delete the comment near the captures route declaration in router.tsx so it no
longer says "prototype — will replace AudioTab once the new flow is ready"
(either remove the prototype note or replace it with a brief current description
of the captures route/component and its purpose, referencing the captures
route/component name and AudioTab for clarity).
backend/routes/events.py (1)

35-40: Optional: use the builtin TimeoutError alias (Ruff UP041).

Since Python 3.11, asyncio.TimeoutError is an alias for the builtin TimeoutError. Small cleanup:

-                try:
-                    event = await asyncio.wait_for(queue.get(), timeout=15.0)
-                except asyncio.TimeoutError:
+                try:
+                    event = await asyncio.wait_for(queue.get(), timeout=15.0)
+                except TimeoutError:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/events.py` around lines 35 - 40, The except block catches
asyncio.TimeoutError but Python 3.11 provides a builtin alias, so replace the
exception reference with the builtin TimeoutError in the async generator that
awaits asyncio.wait_for(queue.get(), timeout=15.0); update the except clause
from "except asyncio.TimeoutError:" to "except TimeoutError:" (no other behavior
changes) to satisfy the Ruff UP041 suggestion while keeping the heartbeat yield
logic intact.
landing/src/app/page.tsx (1)

84-88: Nit: hardcoded text-white inside a themed paragraph.

The surrounding subtitle uses text-muted-foreground; text-white on the bolded span will look slightly off if the landing site ever gets a light-mode variant. Consider text-foreground (or a dedicated token) instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@landing/src/app/page.tsx` around lines 84 - 88, The bold span uses a
hardcoded class "text-white" inside a paragraph that uses
"text-muted-foreground", which will break theming; replace the "text-white"
class on the <b> element with a theme-aware token like "text-foreground" (or a
dedicated token) so the emphasized text inherits the correct foreground color in
both light and dark modes and keep the surrounding "text-muted-foreground"
intact.
backend/routes/generations.py (1)

11-14: Move logger below the imports to satisfy E402.

Ruff flags from ..services import … at line 14 as a module-level import after non-import code (the logger = … at line 11). Swap the ordering so all imports sit at the top of the module.

🔧 Proposed fix
-logger = logging.getLogger(__name__)
-
 from .. import models
 from ..services import history, personality, profiles, tts
 from ..database import Generation as DBGeneration, VoiceProfile as DBVoiceProfile, get_db
 from ..services.generation import run_generation
 from ..services.task_queue import cancel_generation as cancel_generation_job, enqueue_generation
 from ..utils.tasks import get_task_manager
+
+logger = logging.getLogger(__name__)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/generations.py` around lines 11 - 14, The logger assignment is
placed before module imports causing an E402 import-order error; move the line
"logger = logging.getLogger(__name__)" so it appears after the import block
(after the "from .. import models" and "from ..services import history,
personality, profiles, tts" lines) so all imports are at the top of the module
and then initialize logger.
app/src/lib/hooks/useSettings.ts (1)

18-99: Clean optimistic-update pattern; consider extracting the shared shape.

Both useCaptureSettings and useGenerationSettings follow identical optimistic-patch + rollback + server-reconcile logic — the only variations are the query key, fetcher, and mutator. If a third settings surface appears (and the MCP/captures trajectory suggests one will), a tiny createSettingsHook({ key, fetch, patch }) factory would eliminate the duplicated branches. Not blocking — fine to defer.
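Stripped of the react-query plumbing, the shared shape is small; a framework-free sketch (names like `createOptimisticUpdater` are illustrative, and the real factory would wrap useQuery/useMutation rather than raw state):

```typescript
// Hypothetical, framework-free sketch of the factory idea: the optimistic
// apply / rollback / reconcile shape is written once and parameterized by the
// remote patch call. getState/setState stand in for the query cache.
function createOptimisticUpdater<T extends object>(
  getState: () => T,
  setState: (s: T) => void,
  patchRemote: (patch: Partial<T>) => Promise<T>,
) {
  return async (patch: Partial<T>) => {
    const previous = getState();
    setState({ ...previous, ...patch });  // optimistic apply
    try {
      setState(await patchRemote(patch)); // reconcile with the server's truth
    } catch {
      setState(previous);                 // roll back on failure
    }
  };
}
```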

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/lib/hooks/useSettings.ts` around lines 18 - 99, Both hooks
useCaptureSettings and useGenerationSettings duplicate the same
optimistic-update/rollback/reconcile logic; extract that shared pattern into a
factory like createSettingsHook({ key, queryFn, mutationFn }) which encapsulates
the useQuery + useMutation setup (including cancelQueries, getQueryData,
setQueryData, onMutate/onError/onSettled behavior) and returns { settings,
isLoading, update }; then replace useCaptureSettings to call createSettingsHook
with CAPTURE_SETTINGS_KEY, apiClient.getCaptureSettings,
apiClient.updateCaptureSettings and replace useGenerationSettings to call it
with GENERATION_SETTINGS_KEY, apiClient.getGenerationSettings,
apiClient.updateGenerationSettings so future setting hooks reuse the same
implementation.
backend/mcp_server/server.py (1)

13-13: Prefer collections.abc.Callable (Ruff UP035).

Per static analysis and PEP 585 guidance, import Callable from collections.abc rather than typing.

-from typing import Callable
+from collections.abc import Callable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/server.py` at line 13, The import currently uses Callable
from typing; update the import to use collections.abc.Callable to satisfy Ruff
UP035/PEP585 guidance—replace the line "from typing import Callable" with an
import of Callable from collections.abc so any references to Callable in this
module continue to work while avoiding the deprecated typing import.
backend/mcp_server/resolve.py (1)

55-57: with_db() name implies a context manager but it isn't.

Returning next(get_db()) produces a bare Session that the caller must remember to .close() — the with_ prefix suggests with with_db() as db: usage which will fail. Consider renaming to new_db_session() (or similar) or converting to an actual @contextmanager.

♻️ Option: context manager form
+from contextlib import contextmanager
+
-def with_db() -> Session:
-    """Utility for tool handlers that aren't managed by FastAPI's Depends."""
-    return next(get_db())
+@contextmanager
+def with_db():
+    """Utility for tool handlers that aren't managed by FastAPI's Depends."""
+    db = next(get_db())
+    try:
+        yield db
+    finally:
+        db.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/resolve.py` around lines 55 - 57, The function with_db()
is misleading because it returns a raw Session from next(get_db()) but is named
like a context manager; change it so callers don't leak sessions by either (a)
convert with_db into a real context manager (e.g., decorate with
contextlib.contextmanager or return a contextmanager object that yields the
Session and ensures session.close() after use) using get_db() to obtain and
finally close the session, or (b) rename the function to new_db_session() (or
acquire_db_session()) and document that it returns a plain Session that the
caller must close; update all call sites accordingly (references: with_db,
get_db).
backend/services/settings.py (2)

21-38: _get_or_create_* has a narrow race on first creation.

Two concurrent requests hitting the endpoint before the singleton row exists can both miss the first() check and both db.add(row) → commit; the second loses to a PRIMARY KEY conflict. Under the single-user desktop assumption this is essentially impossible, but a single-row INSERT OR IGNORE (or a try/except IntegrityError that re-queries) would make it bulletproof and cheap.
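The INSERT OR IGNORE variant is nearly a one-liner; a self-contained sketch against plain sqlite3 (table and column names are illustrative, not the repo's schema):

```python
# Hypothetical sketch: make singleton-row creation atomic so two concurrent
# first requests can never both insert. Schema and names are illustrative.
import sqlite3

SINGLETON_ID = 1

def get_or_create_capture_row(conn: sqlite3.Connection) -> sqlite3.Row:
    # Atomic existence check: a concurrent insert of the same primary key
    # is silently ignored instead of raising a PRIMARY KEY conflict.
    conn.execute(
        "INSERT OR IGNORE INTO capture_settings (id) VALUES (?)",
        (SINGLETON_ID,),
    )
    conn.commit()
    return conn.execute(
        "SELECT * FROM capture_settings WHERE id = ?", (SINGLETON_ID,)
    ).fetchone()
```

The ORM equivalent is the try/commit/except-IntegrityError/re-query pattern; either closes the race at negligible cost.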

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/services/settings.py` around lines 21 - 38, The
_get_or_create_capture_row and _get_or_create_generation_row functions have a
race when two requests concurrently hit the first() check; fix by performing an
atomic insert-or-ignore or by catching IntegrityError on commit and re-querying:
attempt to create the singleton row (DBCaptureSettings/DBGenerationSettings with
id=SINGLETON_ID), commit in a try block, on IntegrityError rollback and run the
original query again to return the existing row; ensure you import
IntegrityError from sqlalchemy.exc and apply the same pattern to both functions
so the code is race-proof.

46-53: value is not None filter prevents clearing nullable columns.

Callers cannot set default_playback_voice_id (or any other nullable field) back to NULL through this API since None values are unconditionally skipped. If clearing isn't a desired operation, document the contract; otherwise use a sentinel (MISSING) so explicit None flows through:

♻️ Option — sentinel-based patch filtering
-def update_capture_settings(db: Session, patch: dict[str, Any]) -> DBCaptureSettings:
-    row = _get_or_create_capture_row(db)
-    for key, value in patch.items():
-        if value is not None and hasattr(row, key):
-            setattr(row, key, value)
+_MISSING = object()
+
+def update_capture_settings(db: Session, patch: dict[str, Any]) -> DBCaptureSettings:
+    row = _get_or_create_capture_row(db)
+    for key, value in patch.items():
+        if value is _MISSING or not hasattr(row, key):
+            continue
+        setattr(row, key, value)
     db.commit()
     db.refresh(row)
     return row

…with callers (Pydantic models) producing _MISSING for unset fields and None for explicit clears.

Also applies to: 61-68

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/services/settings.py` around lines 46 - 53, The current
update_capture_settings function skips any keys whose value is None, which
prevents callers from clearing nullable columns (e.g.,
default_playback_voice_id); change the logic to treat None as a valid explicit
value by either (a) using a sentinel (e.g., _MISSING) such that callers send
_MISSING for absent fields and None for explicit clears and then only skip when
value is _MISSING, or (b) stop filtering on value and instead check key presence
(if key in patch) and hasattr(row, key) then setattr(row, key, value). Apply the
same sentinel/presence-based fix to the analogous code block referenced at lines
61-68 so explicit None values are persisted rather than ignored.
backend/database/migrations.py (1)

269-274: Redundant local import sqlite3.

_supports_drop_column already imports sqlite3; a top-of-file import would remove the duplicate local imports here and at line 283. Not a bug — purely tidiness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/database/migrations.py` around lines 269 - 274, The local import
sqlite3 inside the _supports_drop_column context is redundant; remove the inline
import statements (both occurrences in this function) and instead add a single
top-level import sqlite3 at the module scope, so functions like
_supports_drop_column and the logger.warning call can use the module-level
sqlite3.sqlite_version without duplicate local imports.
app/src/components/CapturesTab/CaptureInlinePlayer.tsx (2)

90-109: Minor: in-flight load on a previous audioUrl can land after a new load starts.

When audioUrl changes rapidly, the prior ws.load(url).catch(...) may reject with an abort error after the new load has already reset state, briefly flashing the old URL's error. Low severity since WaveSurfer aborts pending loads on load()/destroy(), but worth guarding with an isCurrent flag if it proves visible.
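A token-based variant of that guard, sketched standalone (`makeGuardedLoader` and its callbacks are hypothetical names; `loadAudio` stands in for ws.load):

```typescript
// Hypothetical sketch: each load takes a token; state callbacks only fire if
// the load that started them is still the most recent one.
function makeGuardedLoader(
  loadAudio: (url: string) => Promise<void>,
  onDone: (url: string) => void,
  onError: (url: string, err: unknown) => void,
) {
  let current = 0;
  return (url: string) => {
    const token = ++current; // starting a new load invalidates in-flight ones
    loadAudio(url)
      .then(() => { if (token === current) onDone(url); })
      .catch((err) => { if (token === current) onError(url, err); });
  };
}
```

A late abort rejection from a superseded URL is swallowed silently instead of flashing the old URL's error state.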

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/CapturesTab/CaptureInlinePlayer.tsx` around lines 90 -
109, When audioUrl changes fast a previous ws.load(audioUrl) can reject after a
newer load has started causing stale error/state flashes; in the useEffect that
references wavesurferRef and ws.load, introduce a local boolean flag (e.g.
isCurrent) set true at start and set false in the effect cleanup, and only call
setError, setIsLoading, setCurrentTime, setDuration, setIsPlaying (and any other
state updates) inside ws.load.then/catch if isCurrent is still true; ensure you
also guard the try/catch reset block (ws.pause(), ws.seekTo(0)) if needed so
they only affect the current load lifecycle for the current audioUrl.

33-57: Waveform colors are captured once and won't follow theme changes.

cssHsla reads --foreground/--accent at instantiation; switching theme (light ↔ dark) at runtime leaves the existing WaveSurfer instance with stale colors until the component remounts. If runtime theme switching is supported, subscribe to the theme change and call ws.setOptions({ waveColor, progressColor }), or key the component on the theme.
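If runtime switching is kept, the re-apply step is small. A sketch with a WaveSurfer-shaped stand-in — the review's ws.setOptions call is real, but the wiring and the `cssHsla` signature here are illustrative:

```typescript
// Hypothetical sketch: re-read the CSS custom properties on theme change and
// push them into the live instance. `WaveSurferLike` mimics the setOptions
// surface; cssHsla stands in for the component's existing helper.
type WaveSurferLike = {
  setOptions(opts: { waveColor: string; progressColor: string }): void;
};

function applyThemeColors(
  ws: WaveSurferLike,
  cssHsla: (varName: string, alpha: number) => string,
) {
  ws.setOptions({
    waveColor: cssHsla("--foreground", 0.25),
    progressColor: cssHsla("--accent", 1),
  });
}

// usage sketch: call from a theme-change subscription, or from an effect
// keyed on the theme value, so colors track light/dark without a remount.
```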

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/CapturesTab/CaptureInlinePlayer.tsx` around lines 33 - 57,
The WaveSurfer instance in CaptureInlinePlayer captures CSS variables once
inside the useEffect (cssHsla) so it won't update when the app theme changes;
modify the hook to react to theme changes by either adding the theme variable as
a dependency (or subscribing to your theme provider) and when theme changes call
ws.setOptions({ waveColor: cssHsla('--foreground', 0.25), progressColor:
cssHsla('--accent', 1) }) or recreate the WaveSurfer instance, and ensure you
reference the existing ws instance created in the useEffect so colors are
updated at runtime.
backend/routes/llm.py (1)

27-40: Private-method access (backend._is_model_cached) leaks implementation detail.

Reaching into backend._is_model_cached(...) from a route handler couples the HTTP layer to a protected member of the LLM backend. Either promote it to a public is_model_cached(size) on the LLMBackend protocol (so other backends implement it consistently) or fold the "cached?" check into a higher-level service call — e.g. llm.ensure_loaded_or_schedule_download(model_size) returning ("loaded" | "scheduled", task_name) — so the route doesn't need to reason about cache layout at all.
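The protocol promotion is a few lines; a hedged sketch with illustrative names (the real LLMBackend surface and cache layout will differ):

```python
# Hypothetical sketch: promote the cache check to the backend protocol so
# route handlers never reach into a private member. Names are illustrative.
from typing import Protocol

class LLMBackend(Protocol):
    def is_model_cached(self, size: str) -> bool: ...

class FakeQwenBackend:
    def __init__(self, cached_sizes: set[str]) -> None:
        self._cached = cached_sizes

    def is_model_cached(self, size: str) -> bool:
        # public, protocol-conformant wrapper over whatever cache layout exists
        return size in self._cached

def ensure_loaded_or_schedule_download(backend: LLMBackend, size: str) -> str:
    """Higher-level service call: the route only sees 'loaded' or 'scheduled'."""
    return "loaded" if backend.is_model_cached(size) else "scheduled"
```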

While you're here, the except Exception at lines 71-72 also trips Ruff's B904 and spills the backend exception string to the client; chaining with from e preserves the traceback for logs, and the client-facing detail should probably be a generic "LLM generation failed" with the real exception logged server-side:

♻️ Smaller inline tweak for B904 + message hygiene
     except Exception as e:
-        raise HTTPException(status_code=500, detail=str(e))
+        logger.exception("LLM generation failed")
+        raise HTTPException(status_code=500, detail="LLM generation failed") from e

(Requires a module-level logger = logging.getLogger(__name__).)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/llm.py` around lines 27 - 40, The route is reaching into a
protected backend method backend._is_model_cached and also catches Exception
then exposes the raw error; replace this by adding a public
is_model_cached(self, size) to the LLMBackend protocol (or implement a
higher-level llm.ensure_loaded_or_schedule_download(model_size) that returns
("loaded" | "scheduled", task_name)) and update the route to call that public
API instead of backend._is_model_cached; move download logic into a backend/llm
service method (e.g., download_llm_background or
ensure_loaded_or_schedule_download) so the route only starts the background task
via task_manager.start_download and create_background_task with the service
coroutine, and change the exception handling inside the background task to log
the real exception with logger.exception(...) or re-raise using "from e" while
returning/sending a generic client-facing message like "LLM generation failed"
to avoid leaking internals (update
task_manager.error_download(progress_model_name, ...) to receive only sanitized
message while logging the original).
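A minimal sketch of the public-API shape suggested above. `is_model_cached` and `ensure_loaded_or_schedule_download` are the names proposed in this comment; `FakeQwenBackend` and the task-name string are stand-ins, not code from the PR:

```python
from __future__ import annotations

from typing import Protocol


class LLMBackend(Protocol):
    """Protocol sketch: the cache check becomes part of the public surface."""

    def is_model_cached(self, size: str) -> bool: ...


class FakeQwenBackend:
    """Stand-in backend; a real one would wrap the old _is_model_cached logic."""

    def __init__(self, cached: set[str]) -> None:
        self._cached = cached

    def is_model_cached(self, size: str) -> bool:
        return size in self._cached


def ensure_loaded_or_schedule_download(
    backend: LLMBackend, size: str
) -> tuple[str, str | None]:
    # Route handlers call this instead of reaching into backend internals;
    # the second element is the background task name when a download is needed.
    if backend.is_model_cached(size):
        return ("loaded", None)
    return ("scheduled", f"llm-download-{size}")
```

With this shape the route never sees cache layout at all — it branches only on the `("loaded" | "scheduled", task_name)` tuple.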
backend/routes/speak.py (1)

66-77: In-function import hints at a circular-import workaround — consider refactoring.

Importing generate_speech inside the handler suggests a circular dependency between routes/speak.py and routes/generations.py. If that's the case, the cleaner fix is to extract the generation orchestration (generate_speech) into a services/ module that both route modules depend on, so both REST surfaces and the MCP tool share one canonical call site and the handler can import it at module top.

Not blocking — the inline import works — but it's a latent source of ordering bugs and complicates the test surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/speak.py` around lines 66 - 77, The inline import of
generate_speech inside the handler indicates a circular-import workaround
between routes/speak.py and routes/generations.py; refactor by moving the
orchestration function (generate_speech) into a new services module (e.g.,
services/generation_service or services/speech_service) that exposes the same
function signature (accepting models.GenerationRequest and db) so both
routes/speak.py and routes/generations.py can import it at module top; update
the imports in both route modules to import generate_speech from the new
services module and keep the call site using models.GenerationRequest(...)
unchanged to preserve behavior and tests.
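A single-file sketch of the dependency shape being suggested (module boundaries collapsed into one file for illustration; names other than `generate_speech` are hypothetical): both route surfaces depend on one service-level function, so neither imports the other and the cycle disappears.

```python
# services/speech_service.py (sketched inline): the one canonical call site.
def generate_speech(text: str, profile_id: str) -> dict:
    # Real orchestration (TTS, DB writes, task tracking) would live here.
    return {"text": text, "profile_id": profile_id, "status": "completed"}


# routes/speak.py and routes/generations.py would both import the service
# at module top; no route imports another route.
def speak_handler(text: str, profile_id: str) -> dict:
    return generate_speech(text, profile_id)


def generations_handler(text: str, profile_id: str) -> dict:
    return generate_speech(text, profile_id)
```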
backend/app.py (1)

214-214: application parameter is unused.

_run_startup takes application: FastAPI but never references it. Drop the parameter (and update the call site on line 83) for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app.py` at line 214, The _run_startup function currently declares an
unused parameter application: FastAPI; remove that parameter from the
_run_startup signature and update any callers to call _run_startup() without
arguments (specifically the place that invoked _run_startup previously using the
application variable). Ensure the function definition matches the new
no-argument call and run tests / linter to confirm no remaining references to
the removed parameter.
backend/tests/test_personality_samples.py (1)

1-23: Filename starts with test_ but this is a manual evaluator, not a pytest test.

Pytest will collect it during test runs. It currently has no top-level side effects and no test_* functions, so collection is harmless today, but a future reader may mistake it for CI coverage of the personality service. Consider renaming (e.g., eval_personality_samples.py) or moving it out of backend/tests/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/test_personality_samples.py` around lines 1 - 23, The file is
named with a pytest-discovered prefix but is an interactive evaluator, so rename
or relocate it to avoid accidental test collection; specifically, change
backend/tests/test_personality_samples.py to a non-test name like
eval_personality_samples.py (or move it out of backend/tests/) and update any
README/usage comments or scripts that reference the original filename so
invocations (e.g., the top-level docstring examples) still work; ensure no
import paths or tooling expect the old test_ prefix before committing.
app/src/components/DictateWindow/DictateWindow.tsx (1)

190-194: source.onerror silently hopes EventSource recovers — consider a bounded retry cap.

The handler is intentionally empty so EventSource can auto-reconnect on transient drops. If the backend pod is actually gone (generation row deleted, server restart mid-flight), the browser will reconnect forever and the pill sits on "speaking" until dictate:speak-end rescues it. Worth adding either a soft timeout (e.g., dismiss if no completed/failed after N seconds) or a retry counter, so a missed speak-end can't leave the pill stuck permanently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/DictateWindow/DictateWindow.tsx` around lines 190 - 194,
The EventSource onerror handler currently does nothing (source.onerror), letting
the browser reconnect indefinitely and potentially leaving the UI stuck; add a
bounded retry/timeout: in the component/function that creates the EventSource
(DictateWindow where source is created), track a retry counter and/or start a
soft timeout when opening the EventSource, increment the counter in
source.onerror and if it exceeds N (e.g., 3) or the timeout elapses without
receiving a terminal event ('completed'/'failed' or the existing
dictate:speak-end), programmatically dismiss the pill by dispatching the same
action used for speak-end (or calling the existing handler that ends speaking),
clear timers and close source; also ensure the 'completed'/'failed' message
handlers and the dictate:speak-end flow clear the timer and reset the retry
counter so normal reconnections don't trigger the fallback.
app/src/components/AccessibilityGate/AccessibilityGate.tsx (1)

105-107: <Trans> path component renders as an unstyled <span>.

components={{ path: <span /> }} means the <path>System Settings → …</path> markers in the translations add no visual affordance (bold, monospace, color) — they just become inert spans. If the intent is to make the settings breadcrumb visually distinct (common in i18n'd instructional copy), apply a class; if there's no intent, drop the wrapper and simplify the translations.

📝 Example
-            <Trans i18nKey="captures.permissions.accessibility.body" components={{ path: <span /> }} />
+            <Trans
+              i18nKey="captures.permissions.accessibility.body"
+              components={{ path: <span className="font-medium text-foreground" /> }}
+            />
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/AccessibilityGate/AccessibilityGate.tsx` around lines 105
- 107, The <Trans> usage in AccessibilityGate.tsx (i18nKey
"captures.permissions.accessibility.body") passes components={{ path: <span />
}} which renders unstyled spans for the "path" markers; update the components
prop to either remove the wrapper and adjust the translation to plain text, or
replace the empty span with a styled element (e.g., add a clear visual
affordance via className such as monospace/bold/color) so the settings
breadcrumb is visually distinct; ensure the change targets the <Trans> call in
AccessibilityGate.tsx and the "path" component only.
CHANGELOG.md (1)

62-64: Add shell tag to code fence (MD040).

Same markdownlint nit as the other docs — specify the language for the Claude Code one-liner block.

📝 Proposed fix
-```
+```shell
 claude mcp add voicebox --transport http --url http://127.0.0.1:17493/mcp --header "X-Voicebox-Client-Id: claude-code"
 ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` around lines 62 - 64, The code fence around the one-line Claude
Code command is missing a language tag (MD040); update the fenced block that
contains "claude mcp add voicebox --transport http --url
http://127.0.0.1:17493/mcp --header \"X-Voicebox-Client-Id: claude-code\"" to
include the shell tag (i.e., change ``` to ```shell) so the block is fenced as a
shell snippet.
backend/mcp_server/context.py (2)

51-52: Redundant __init__ override.

BaseHTTPMiddleware.__init__ already accepts (app, dispatch=None). Overriding just to call super().__init__(app) adds noise without behavior change — drop it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/context.py` around lines 51 - 52, Remove the redundant
__init__ override from the middleware class in context.py: delete the
__init__(self, app: ASGIApp) -> None: super().__init__(app) method so the class
inherits BaseHTTPMiddleware.__init__(app, dispatch=None) unchanged; this removes
noise without changing behavior and relies on the base class's constructor
(identify the class by its use of BaseHTTPMiddleware and method name __init__).

54-101: Sync DB work runs on the event loop in every request.

_stamp_last_seen does a SQLAlchemy SELECT + INSERT/UPDATE + COMMIT synchronously right before returning the response, so every /mcp/* and /speak request blocks the async event loop on SQLite I/O. For a chatty agent (batched voicebox.speak calls) this serializes requests behind the stamp commit and adds tail latency to SSE streams. Since stamping is already best-effort (the broad except Exception swallows everything), fire-and-forget into a worker thread is a clean fit.

⚡ Proposed fix — dispatch stamping off the event loop
+import asyncio
@@
         if client_id and _is_stamped_path(request.url.path):
-            _stamp_last_seen(client_id)
+            # Best-effort; don't block the event loop on SQLite commit.
+            asyncio.create_task(asyncio.to_thread(_stamp_last_seen, client_id))
         return response
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/context.py` around lines 54 - 101, dispatch currently
calls the synchronous _stamp_last_seen on the request path, blocking the async
event loop; change dispatch to schedule stamping off the event loop (e.g., use
asyncio.get_running_loop().run_in_executor or asyncio.to_thread /
loop.run_in_executor) so the DB work runs in a worker thread and is
fire-and-forget. Move or wrap the DB-heavy logic in _stamp_last_seen (or create
a new helper like _stamp_last_seen_blocking) so it contains the imports,
query/insert/commit/rollback/close and retains the existing broad try/except
behavior, and have dispatch call asyncio.create_task or loop.run_in_executor to
invoke that blocking helper when client_id and
_is_stamped_path(request.url.path) are true; ensure dispatch does not await the
task and still resets current_client_id as before.
backend/mcp_server/README.md (1)

41-41: Add language identifiers to fenced code blocks (MD040).

markdownlint flags the `claude mcp add`, `npx @modelcontextprotocol/inspector`, `curl`, and package-tree code blocks. Tag them (`shell` for the three commands, `text` for the tree) to silence the lint and get syntax highlighting in the MDX-rendered docs.


Also applies to: 68-68, 80-80, 89-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/README.md` at line 41, The README has multiple fenced code
blocks missing language identifiers causing MD040 warnings; update the fenced
blocks containing the commands "claude mcp add", "npx
`@modelcontextprotocol/inspector`", and the "curl" example to use the shell
language tag (```shell) and update the package-tree block (the ASCII tree) to
use the text language tag (```text); apply the same change to the other flagged
occurrences referenced (the other blocks at the same locations) so all command
blocks are tagged shell and the tree block is tagged text.
app/src/components/CapturesTab/DictationReadinessChecklist.tsx (1)

94-113: downloadByModel in the effect deps runs the effect on every render.

downloadByModel is constructed as a fresh Map on each render, so including it in the dependency array effectively makes the effect run on every render rather than only when activeTasks changes. The inner set-diff still guards against spurious invalidations so this isn't a correctness bug, but drop downloadByModel from deps and recompute the set inside the effect — that keeps the ref-identity semantics clean and makes intent obvious.

♻️ Proposed fix
-  const prevActive = useRef<Set<string>>(new Set());
-  useEffect(() => {
-    const current = new Set(downloadByModel.keys());
-    for (const name of prevActive.current) {
-      if (!current.has(name)) {
-        queryClient.invalidateQueries({ queryKey: ['capture-readiness'] });
-        queryClient.invalidateQueries({ queryKey: ['modelStatus'] });
-        break;
-      }
-    }
-    prevActive.current = current;
-  }, [activeTasks, queryClient, downloadByModel]);
+  const prevActive = useRef<Set<string>>(new Set());
+  useEffect(() => {
+    const current = new Set<string>();
+    for (const dl of activeTasks?.downloads ?? []) {
+      if (dl.status === 'downloading') current.add(dl.model_name);
+    }
+    for (const name of prevActive.current) {
+      if (!current.has(name)) {
+        queryClient.invalidateQueries({ queryKey: ['capture-readiness'] });
+        queryClient.invalidateQueries({ queryKey: ['modelStatus'] });
+        break;
+      }
+    }
+    prevActive.current = current;
+  }, [activeTasks, queryClient]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/CapturesTab/DictationReadinessChecklist.tsx` around lines
94 - 113, downloadByModel is rebuilt each render causing the useEffect
dependency array to trigger every time; remove downloadByModel from the deps and
instead compute a Set of model names inside the effect. Specifically, in the
useEffect that references prevActive/current, use activeTasks (and queryClient)
in the dependency array only, construct const current = new
Set((activeTasks?.downloads ?? []).filter(dl => dl.status ===
'downloading').map(dl => dl.model_name)) inside the effect, keep the same diff
logic that invalidates via queryClient.invalidateQueries(['capture-readiness'])
and ['modelStatus'] when a name disappears, and then assign prevActive.current =
current at the end.
backend/routes/mcp_bindings.py (1)

59-59: Replace deprecated datetime.utcnow() with timezone-aware alternative.

datetime.utcnow() is deprecated as of Python 3.12 and scheduled for removal; it returns a naïve datetime which loses the UTC contract at the type boundary. The same pattern appears in backend/mcp_server/context.py at line 93, so both should be updated together.

🕒 Proposed replacement
-from datetime import datetime
+from datetime import datetime, timezone
@@
-    row.updated_at = datetime.utcnow()
+    row.updated_at = datetime.now(timezone.utc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/mcp_bindings.py` at line 59, Replace the naïve UTC timestamp
assignment using datetime.utcnow() with a timezone-aware UTC datetime (e.g.,
datetime.now(timezone.utc)); update the import to bring in timezone (or use
datetime.now with datetime.timezone) and set row.updated_at = timezone-aware
datetime in the module where row.updated_at is assigned, and make the same
change to the analogous timestamp assignment in the MCP server context code (the
datetime assignment at the other location) so both places use timezone-aware UTC
datetimes.
app/src/components/ChordPicker/ChordPicker.tsx (1)

62-72: Nit: local t shadows the translation function.

Inside the effect, const t = window.setTimeout(...) shadows const { t } = useTranslation() from line 51. It's harmless today (the effect body doesn't call t(...)), but the shadow is a footgun for future edits. Consider renaming to timeoutId.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/ChordPicker/ChordPicker.tsx` around lines 62 - 72, The
effect in ChordPicker.tsx shadows the translation function t (from const { t } =
useTranslation()) by declaring const t = window.setTimeout(...) inside
useEffect; rename that local to a non-conflicting name (e.g., timeoutId) in the
useEffect closure where setTimeout is assigned and cleared (the code that calls
captureRef.current?.focus() and window.clearTimeout), so the translation
function t remains unshadowed for future edits.
app/src/components/CapturesTab/CapturesTab.tsx (1)

116-117: Dead fileInputRef: declared but never triggered.

fileInputRef (line 116) is attached to a hidden <input> (line 322) but no code calls fileInputRef.current?.click() — only uploadInputRef is triggered (via handleUploadClick). The two inputs share the same accept and change handler, so the second is dead UI. Either remove it or wire the intended code path.

🔧 Proposed fix
-  const fileInputRef = useRef<HTMLInputElement>(null);
   const uploadInputRef = useRef<HTMLInputElement>(null);
...
-      <input
-        ref={fileInputRef}
-        type="file"
-        accept={CAPTURE_AUDIO_MIME}
-        onChange={(e) => handleUploadFile(e, 'file')}
-        className="hidden"
-      />

Also applies to: 321-327

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/CapturesTab/CapturesTab.tsx` around lines 116 - 117, The
dead ref fileInputRef is never used (only uploadInputRef is clicked), so either
remove the redundant hidden input and fileInputRef declaration, or wire the
intended click path: update the upload trigger (e.g., the button or handler that
currently calls uploadInputRef.current?.click()) to call
fileInputRef.current?.click() (or both) and ensure the shared accept and
onChange handlers (the change handler function and the input props) remain
attached to the active input; also remove fileInputRef and its unused hidden
<input> if you choose the removal option to avoid unused refs.
backend/database/models.py (1)

275-275: Use JSON column type for refinement_flags to match CaptureSettings and eliminate manual serialization.

refinement_flags is stored as Text and requires manual json.loads() / json.dumps() in callers, while CaptureSettings.chord_push_to_talk_keys uses SQLAlchemy's JSON column type. Switching to JSON removes manual serialization boilerplate and makes the intent self-evident.

If adopted, update callers in backend/services/captures.py:

  • Line 36: Change json.loads(row.refinement_flags) to row.refinement_flags (already deserialized by SQLAlchemy)
  • Line 185: Change json.dumps(flags.to_dict()) to flags.to_dict() (JSON column auto-serializes)
Proposed schema change
-    refinement_flags = Column(Text, nullable=True)  # JSON blob
+    refinement_flags = Column(JSON, nullable=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/database/models.py` at line 275, Change the refinement_flags column
from Text to SQLAlchemy JSON so SQLAlchemy handles (de)serialization: update the
Column definition named refinement_flags in models.py to use JSON(nullable=True)
matching CaptureSettings.chord_push_to_talk_keys; then remove manual
json.loads/json.dumps usage in backend/services/captures.py by replacing
json.loads(row.refinement_flags) with just row.refinement_flags and replacing
json.dumps(flags.to_dict()) with flags.to_dict(), ensuring callers expect native
Python dicts.
backend/tests/test_refinement_samples.py (2)

1-25: This is an eval harness but will be collected by pytest.

Because the filename matches test_*.py, pytest backend/tests will import the module (running sys.path.insert(0, str(REPO_ROOT))) and report zero tests collected from it — harmless but potentially confusing in CI output. The docstring already says "this is an interactive evaluation harness, not a pass/fail unit test".

Consider either renaming (eval_refinement_samples.py), moving out of backend/tests/, or adding collect_ignore = ["test_refinement_samples.py"] to backend/tests/conftest.py so regular test runs don't touch it.

Also applies to: 451-452

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/test_refinement_samples.py` around lines 1 - 25, The test
harness file test_refinement_samples.py is being picked up by pytest because it
matches the test_*.py pattern; either rename or move the file (e.g.,
eval_refinement_samples.py or into a non-tests directory) so it is not
auto-collected, or add a pytest collection exclusion by appending the filename
to collect_ignore in backend/tests/conftest.py (use the collect_ignore
list/variable name) so pytest will skip importing test_refinement_samples.py
during regular test runs.
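If the collect_ignore route is chosen, the whole change is one assignment in backend/tests/conftest.py — `collect_ignore` is pytest's own hook variable; only the filename here comes from this review:

```python
# backend/tests/conftest.py
# pytest reads this module-level list and skips collecting the named files,
# so the interactive eval harness never shows up in regular test runs.
collect_ignore = ["test_refinement_samples.py"]
```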

37-37: Iterable should come from collections.abc (Ruff UP035).

Preferred on Python 3.9+.

♻️ Fix
-from typing import Iterable, Optional
+from collections.abc import Iterable
+from typing import Optional
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/test_refinement_samples.py` at line 37, Replace the
typing-sourced Iterable with the collections.abc one to satisfy Ruff UP035:
update the import so that Iterable is imported from collections.abc while
keeping Optional from typing (i.e. change the current "from typing import
Iterable, Optional" usage to "from collections.abc import Iterable" and "from
typing import Optional" or a combined import statement that reflects that
separation). This targets the import statement referencing Iterable and
Optional.
backend/routes/captures.py (1)

33-36: Chunked read offers no streaming benefit — simplify.

file.read(UPLOAD_CHUNK_SIZE) in a loop followed by b"".join(chunks) is functionally identical to await file.read(), since audio_bytes is held entirely in memory before being passed to create_capture. If you eventually want a real streaming path, it would need to pass an iterator through to the service, not pre-materialize.

♻️ Simplification
-    chunks = []
-    while chunk := await file.read(UPLOAD_CHUNK_SIZE):
-        chunks.append(chunk)
-    audio_bytes = b"".join(chunks)
+    audio_bytes = await file.read()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/routes/captures.py` around lines 33 - 36, The chunked loop reading
into chunks and joining them is unnecessary because audio_bytes is fully
materialized; replace the while chunk := await file.read(UPLOAD_CHUNK_SIZE) /
chunks append / b"".join(chunks) logic with a single await file.read() to
produce audio_bytes and pass that to create_capture (or, if you want true
streaming later, change create_capture to accept and consume an async iterator
instead). Target symbols: file.read, UPLOAD_CHUNK_SIZE, chunks, audio_bytes,
create_capture.
backend/mcp_shim/__main__.py (1)

171-186: One transport hiccup kills the whole MCP session.

If _handle_request raises for any reason (connection reset mid-stream, malformed SSE from the server, httpx.ReadError, etc.), the outer except Exception returns 1 and the client loses its MCP server until the parent process respawns it. Consider catching and logging per-request so a single bad request doesn't take down long-lived sessions.

♻️ Proposed resilience
-                await _handle_request(
-                    client, url, line, forward_headers, session_id
-                )
+                try:
+                    await _handle_request(
+                        client, url, line, forward_headers, session_id
+                    )
+                except Exception as exc:
+                    _err(f"request failed, continuing: {exc!r}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_shim/__main__.py` around lines 171 - 186, The main loop currently
exits the whole process if any call to _handle_request raises; wrap the
per-request call in its own try/except so transient errors don’t kill the
session: inside the while True loop (around await _handle_request(client, url,
line, forward_headers, session_id)) catch Exception as exc, call _err with a
descriptive message including exc, then continue the loop; keep the existing
outer except for KeyboardInterrupt/SystemExit and the check that
_read_stdin_line() returning None exits with 0.
backend/mcp_server/tools.py (2)

133-141: Check base64 length before decoding.

b64.b64decode allocates the full decoded buffer before the length check on line 138, so a 270 MB base64 payload materializes ~200 MB before being rejected. Gate on the encoded length first (4 base64 chars ≈ 3 raw bytes) to reject cheaply.

♻️ Proposed fix
+        if len(audio_base64) > MAX_TRANSCRIBE_BYTES // 3 * 4 + 4:
+            raise ValueError(
+                f"Audio exceeds {MAX_TRANSCRIBE_BYTES // (1024 * 1024)} MB limit."
+            )
         try:
             raw = b64.b64decode(audio_base64, validate=True)
         except Exception as exc:
             raise ValueError(f"Invalid audio_base64: {exc}") from exc
-        if len(raw) > MAX_TRANSCRIBE_BYTES:
-            raise ValueError(
-                f"Audio exceeds {MAX_TRANSCRIBE_BYTES // (1024 * 1024)} MB limit."
-            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/tools.py` around lines 133 - 141, The code decodes
audio_base64 into raw before checking size, allowing huge memory allocation;
change the logic in the Base64 branch to first compute an estimated decoded
length from len(audio_base64) (use the 4:3 ratio and account for padding) and
compare it to MAX_TRANSCRIBE_BYTES, raising ValueError if too large, and only
then call b64.b64decode(audio_base64, validate=True) to produce raw (and keep
the existing exception handling and subsequent length check/cleanup around raw).
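The 4:3 bound in the proposed guard can be sanity-checked in isolation; this standalone sketch (helper name is mine, not the codebase's) confirms the encoded-length ceiling never under-estimates, so no payload within the raw limit is wrongly rejected:

```python
import base64


def max_encoded_len(max_raw_bytes: int) -> int:
    # Every 3 raw bytes become 4 base64 chars; +4 covers the padded tail block.
    return max_raw_bytes // 3 * 4 + 4


# Any payload within the raw limit must also pass the encoded-length gate.
for n in range(64):
    encoded = base64.b64encode(b"\x00" * n)
    assert len(encoded) <= max_encoded_len(n)
```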

290-296: Replace whisper._is_model_cached() call with the public is_model_cached() API.

The code directly accesses a private backend method. Use the public is_model_cached(hf_repo) function from backend.backends.base instead, similar to how backend/routes/captures.py does it. To refactor, map model_size to the corresponding Hugging Face repository ID (following the pattern in backend implementations), then call the public API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/tools.py` around lines 290 - 296, Current check uses the
private whisper._is_model_cached(model_size); change it to use the public
is_model_cached(hf_repo) API from backend.backends.base by mapping the local
model_size string to the corresponding Hugging Face repo ID (same
mapping/pattern used in backend implementations and backend/routes/captures.py),
then call is_model_cached(mapped_hf_repo) in place of whisper._is_model_cached;
keep the existing conditions using whisper.is_loaded() and whisper.model_size
but replace the private check with the mapped public API call (use the same
mapping logic found near other backend whisper model checks to derive the HF
repo ID).
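A sketch of the size-to-repo mapping the refactor needs. The `openai/*` repo IDs below are illustrative placeholders — the real IDs are pinned in the backend implementations and should be reused from there, not duplicated:

```python
# Hypothetical mapping from local model_size names to HF repo IDs; the real
# values live next to the backend Whisper implementations.
_WHISPER_REPOS = {
    "tiny": "openai/whisper-tiny",
    "base": "openai/whisper-base",
    "small": "openai/whisper-small",
    "medium": "openai/whisper-medium",
    "large": "openai/whisper-large-v3",
}


def whisper_repo_for(model_size: str) -> str:
    try:
        return _WHISPER_REPOS[model_size]
    except KeyError:
        raise ValueError(f"Unknown Whisper model size: {model_size!r}") from None
```

The tool would then call `is_model_cached(whisper_repo_for(model_size))` in place of the private check.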
backend/services/captures.py (1)

90-111: Minor style cleanups flagged by Ruff.

  • Line 90: _WHISPER_NATIVE_FORMATS is a function-local constant; Ruff N806 expects lowercase in function scope. Either hoist it to module scope next to VALID_SOURCES (more useful — it's conceptually a constant and won't be re-allocated on every call), or rename it.
  • Lines 108–111: try/except OSError: pass can be contextlib.suppress(OSError) for readability (SIM105).
♻️ Proposed diff
 VALID_SOURCES = {"dictation", "recording", "file"}
+_WHISPER_NATIVE_FORMATS = (".wav", ".mp3", ".flac", ".ogg")
@@
-    _WHISPER_NATIVE_FORMATS = (".wav", ".mp3", ".flac", ".ogg")
-
     if audio is None or sr is None:
@@
-        audio_path = config.get_captures_dir() / f"{capture_id}.wav"
-        sf.write(str(audio_path), audio, sr, format="WAV")
-        try:
-            raw_path.unlink()
-        except OSError:
-            pass
+        audio_path = config.get_captures_dir() / f"{capture_id}.wav"
+        sf.write(str(audio_path), audio, sr, format="WAV")
+        with contextlib.suppress(OSError):
+            raw_path.unlink()

Add import contextlib at the top.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/services/captures.py` around lines 90 - 111, Move the function-local
constant _WHISPER_NATIVE_FORMATS out to module scope (e.g., next to
VALID_SOURCES) or rename it to lowercase to satisfy Ruff N806, and replace the
try/except OSError: pass block that attempts raw_path.unlink() with
contextlib.suppress(OSError): raw_path.unlink() while adding an import for
contextlib at the top of the module so captures.py uses the suppress pattern for
readability (ensure you update references if you relocate the constant).
app/src/lib/api/types.ts (1)

193-201: Field name inconsistency between refine/retranscribe requests.

CaptureRefineRequest.model_size vs CaptureRetranscribeRequest.model — I understand one targets the LLM and the other Whisper, so the semantics differ, but consumers now have to remember which endpoint uses which key. Consider stt_model on retranscribe to match the backend's retranscribe_capture(stt_model=...) parameter and to parallel llm_model on the response type. Purely a naming nit; skip if the wire format is already locked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/lib/api/types.ts` around lines 193 - 201, Rename the inconsistent
field on the retranscribe request to match backend parameter naming: change the
CaptureRetranscribeRequest property currently named model to stt_model (type
WhisperModelSize) and update all usages to pass/read stt_model (e.g., where
retranscribe_capture(stt_model=...) is called or the request is constructed);
this parallels the llm model naming used for refine responses and avoids
consumers having to remember different keys. Ensure type references and any
serialization/deserialization that expect model are updated to stt_model.
backend/backends/qwen_llm_backend.py (1)

238-246: Consider clearing MLX's Metal cache on unload.

del self.model; self.model = None drops the Python ref but MLX's Metal allocator may hold the GPU buffers until a subsequent allocation reuses them. For explicit memory recovery (important when users toggle between 0.6B / 1.7B / 4B or unload before switching to a TTS engine), mx.metal.clear_cache() is the complement to empty_device_cache(self.device) in the PyTorch path above.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/backends/qwen_llm_backend.py` around lines 238 - 246, The
unload_model method currently deletes Python references to the MLX
model/tokenizer but doesn't free MLX's Metal GPU buffers; after setting
self.model and self.tokenizer to None in unload_model, call
mx.metal.clear_cache() (or the equivalent MLX API) to explicitly clear Metal
allocator caches so memory is released when switching models or before TTS
usage; update unload_model to import/use mx.metal.clear_cache() and run it after
the deletions and before logging "Qwen3 (MLX) unloaded".
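A guarded sketch of the cache clear (function name is mine, not the codebase's): importing MLX lazily keeps unload a no-op on non-Apple hosts, and the `getattr` fallback prefers the newer `mx.clear_cache` over the older `mx.metal.clear_cache` spelling.

```python
import importlib


def clear_mlx_metal_cache() -> bool:
    """Best-effort complement to empty_device_cache() for the MLX path.

    Returns False when MLX isn't importable (non-Apple platforms), so
    callers can treat the whole thing as a harmless no-op there.
    """
    try:
        mx = importlib.import_module("mlx.core")
    except ImportError:
        return False
    clear = getattr(mx, "clear_cache", None) or mx.metal.clear_cache
    clear()
    return True
```

In `unload_model`, this would run after the `del`/`None` assignments and before the "Qwen3 (MLX) unloaded" log line.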

Comment on lines +40 to +60
def mount_into(
app: FastAPI,
*,
extra_startup: Callable[[], None] | None = None,
) -> None:
"""Attach the MCP app to ``app`` at ``/mcp`` and install the client-id middleware.

``extra_startup`` — if provided, runs during the FastAPI lifespan. This
is the hook that lets ``app.py`` keep its existing startup/shutdown
bodies while also driving FastMCP's session manager.
"""
mcp = build_mcp_server()
mcp_app = mcp.http_app(path="/", transport="http")

# ClientIdMiddleware must run before FastMCP so the ContextVar is set
# by the time tool handlers execute. Starlette composes middlewares
# outermost-first, so adding here on the parent app is correct.
app.add_middleware(ClientIdMiddleware)
app.mount("/mcp", mcp_app)
app.state.mcp_lifespan = mcp_app.router.lifespan_context
logger.info("MCP: mounted at /mcp (FastMCP %s)", getattr(mcp, "version", ""))
Contributor
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find callers of mount_into to see whether extra_startup is passed anywhere.
rg -nP --type=py -C3 '\bmount_into\s*\('

Repository: jamiepine/voicebox

Length of output: 397


🏁 Script executed:

#!/bin/bash
# Search for all calls to mount_into (not just definition)
rg -nP --type=py 'mount_into' | head -20



🏁 Script executed:

#!/bin/bash
# Read the full mount_into and compose_lifespan functions to check if extra_startup is referenced
sed -n '40,80p' backend/mcp_server/server.py



🏁 Script executed:

#!/bin/bash
# Search for mount_into calls in backend/app.py and other files
rg -nP --type=py -B2 -A2 'mount_into\(' backend/



🏁 Script executed:

#!/bin/bash
# Search more broadly for mount_into references without the opening paren
rg -nP --type=py 'mount_into' backend/ | grep -v 'def mount_into\|from.*mount_into\|__all__'



🏁 Script executed:

#!/bin/bash
# Check backend/app.py specifically for mount_into usage
if [ -f backend/app.py ]; then
  echo "=== backend/app.py ==="
  cat backend/app.py | head -100
fi



🏁 Script executed:

#!/bin/bash
# Read the rest of create_app function
sed -n '75,150p' backend/app.py



extra_startup parameter is declared and documented but never invoked.

The docstring states extra_startup "runs during the FastAPI lifespan", but the function body never references it. Either wire it into the lifespan or remove the parameter and its doc paragraph to avoid misleading callers.

🛠️ Option A — remove the dead parameter
-def mount_into(
-    app: FastAPI,
-    *,
-    extra_startup: Callable[[], None] | None = None,
-) -> None:
-    """Attach the MCP app to ``app`` at ``/mcp`` and install the client-id middleware.
-
-    ``extra_startup`` — if provided, runs during the FastAPI lifespan. This
-    is the hook that lets ``app.py`` keep its existing startup/shutdown
-    bodies while also driving FastMCP's session manager.
-    """
+def mount_into(app: FastAPI) -> None:
+    """Attach the MCP app to ``app`` at ``/mcp`` and install the client-id middleware."""

Option B — actually run it (e.g., wrap app.state.mcp_lifespan so extra_startup() is called inside the composed lifespan). The current pattern in create_app uses compose_lifespan for this; match whichever approach is intended here.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def mount_into(
app: FastAPI,
*,
extra_startup: Callable[[], None] | None = None,
) -> None:
"""Attach the MCP app to ``app`` at ``/mcp`` and install the client-id middleware.
``extra_startup`` — if provided, runs during the FastAPI lifespan. This
is the hook that lets ``app.py`` keep its existing startup/shutdown
bodies while also driving FastMCP's session manager.
"""
mcp = build_mcp_server()
mcp_app = mcp.http_app(path="/", transport="http")
# ClientIdMiddleware must run before FastMCP so the ContextVar is set
# by the time tool handlers execute. Starlette composes middlewares
# outermost-first, so adding here on the parent app is correct.
app.add_middleware(ClientIdMiddleware)
app.mount("/mcp", mcp_app)
app.state.mcp_lifespan = mcp_app.router.lifespan_context
logger.info("MCP: mounted at /mcp (FastMCP %s)", getattr(mcp, "version", ""))
def mount_into(app: FastAPI) -> None:
"""Attach the MCP app to ``app`` at ``/mcp`` and install the client-id middleware."""
mcp = build_mcp_server()
mcp_app = mcp.http_app(path="/", transport="http")
# ClientIdMiddleware must run before FastMCP so the ContextVar is set
# by the time tool handlers execute. Starlette composes middlewares
# outermost-first, so adding here on the parent app is correct.
app.add_middleware(ClientIdMiddleware)
app.mount("/mcp", mcp_app)
app.state.mcp_lifespan = mcp_app.router.lifespan_context
logger.info("MCP: mounted at /mcp (FastMCP %s)", getattr(mcp, "version", ""))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/mcp_server/server.py` around lines 40-60: the declared
extra_startup parameter on mount_into is never used; either remove it and its
doc paragraph or wire it into the FastAPI lifespan so it actually runs. To fix
by wiring: when you set app.state.mcp_lifespan =
mcp_app.router.lifespan_context, if extra_startup is not None replace/compose
that lifespan with one that runs extra_startup during startup (use the existing
compose_lifespan helper or the same pattern used in create_app) so mount_into
installs the composed lifespan that calls extra_startup; alternatively delete
the extra_startup parameter, its docstring mention, and any callers relying on
it. Ensure you update the mount_into signature/comments to match the chosen
approach.

Comment thread backend/mcp_server/tools.py Outdated
Comment thread backend/routes/llm.py Outdated
Comment thread backend/routes/settings.py Outdated
Comment thread backend/services/captures.py
Comment thread landing/src/components/SupportedModels.tsx
jamiepine and others added 2 commits April 23, 2026 19:41
…all green

Two small follow-ups to the sidebar checklist placement. Move it below
the What's different section so the sticky top of the sidebar stays the
page's narrative context (About → differences) and the checklist reads
as a status panel rather than preamble. Gate the whole block on
!readiness.allReady so once every gate is green the sidebar drops back
to just About + What's different — no value in real estate full of
checkmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
useCaptureSettings updated its own cache optimistically but never
invalidated ['capture-readiness'], so for up to 5 s (the poll interval)
after switching stt_model or llm_model the checklist kept showing the
previous model's ready/missing state. The backend endpoint resolves
the model live on each call — it was just the frontend cache that
lagged. Invalidate in onSettled only when the patch touched a model
field, so unrelated updates (chord keys, toggles) don't pay for a
refetch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment
🧹 Nitpick comments (1)
app/src/components/CapturesTab/DictationReadinessChecklist.tsx (1)

114-125: Unstable downloadByModel in effect deps causes re-run every render.

downloadByModel is a fresh Map instance on every render (built at lines 106–109), so including it in the dep array makes this effect fire on every commit — not just when activeTasks actually changes. The logic is idempotent (when data is unchanged, current equals prevActive.current so no invalidation occurs), but every render still rebuilds the Set and reassigns the ref.

Prefer useMemo for the map, or key the effect to activeTasks alone and derive the key set inside:

♻️ Proposed refactor
-  const downloadByModel = new Map<string, ActiveDownloadTask>();
-  for (const dl of activeTasks?.downloads ?? []) {
-    if (dl.status === 'downloading') downloadByModel.set(dl.model_name, dl);
-  }
+  const downloadByModel = useMemo(() => {
+    const map = new Map<string, ActiveDownloadTask>();
+    for (const dl of activeTasks?.downloads ?? []) {
+      if (dl.status === 'downloading') map.set(dl.model_name, dl);
+    }
+    return map;
+  }, [activeTasks]);

   // When a download disappears from activeTasks, it just finished — refetch
   // readiness immediately so the row flips to ✓ instead of waiting up to 5s
   // for the next readiness poll.
   const prevActive = useRef<Set<string>>(new Set());
   useEffect(() => {
     const current = new Set(downloadByModel.keys());
     for (const name of prevActive.current) {
       if (!current.has(name)) {
         queryClient.invalidateQueries({ queryKey: ['capture-readiness'] });
         queryClient.invalidateQueries({ queryKey: ['modelStatus'] });
         break;
       }
     }
     prevActive.current = current;
-  }, [activeTasks, queryClient, downloadByModel]);
+  }, [downloadByModel, queryClient]);

(activeTasks can be dropped from deps since downloadByModel is now memoized against it.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/src/components/CapturesTab/DictationReadinessChecklist.tsx` around lines 114-125:
114 - 125, The effect currently reruns every render because downloadByModel is a
new Map each render; memoize downloadByModel with useMemo (keyed by activeTasks)
or remove downloadByModel from the effect deps and compute the current Set from
activeTasks inside the useEffect instead; update the code around prevActive,
useEffect, and downloadByModel so downloadByModel is stable (useMemo([...],
[activeTasks])) or so the effect depends only on activeTasks and queryClient,
then call queryClient.invalidateQueries(['capture-readiness']) and
queryClient.invalidateQueries(['modelStatus']) when a name was removed as
before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 044f8174-cd4e-498e-8697-ab56b4b8cb18

📥 Commits

Reviewing files that changed from the base of the PR and between 687ab2a and ef63faf.

📒 Files selected for processing (3)
  • app/src/components/CapturesTab/DictationReadinessChecklist.tsx
  • app/src/components/ServerTab/CapturesPage.tsx
  • app/src/lib/hooks/useSettings.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • app/src/lib/hooks/useSettings.ts
  • app/src/components/ServerTab/CapturesPage.tsx

Two surfaces leaked macOS-specific copy onto other platforms:

1. The Input Monitoring + Accessibility rows in the readiness
   checklist rendered everywhere. On Windows/Linux the Rust permission
   stubs return true, so the rows showed as permanent green checkmarks
   with copy like "macOS allows Voicebox to detect your global
   shortcut." — nonsense when you're on Windows. Gate both rows on a
   userAgent-based isMacOS check so they only render where the
   underlying TCC permission actually exists.

2. The global-shortcut setting description ended with "macOS will ask
   for Input Monitoring permission the first time you turn this on."
   That sentence rendered on every platform. The readiness checklist
   already surfaces the TCC requirement at the right moment on macOS,
   so the description doesn't need the platform note — drop it from
   en / ja / zh-CN / zh-TW.

Other macOS strings (AccessibilityNotice, InputMonitoringNotice, their
"stillMissing" hints) are already gated behind the Rust permission
booleans returning false, which never happens on Windows/Linux, so they
stay inert without further changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pill subscribed to /generation/{id}/status to know when to start
playback, but EventSource.onerror was a no-op — auto-reconnect was the
intended recovery for transient drops. The gap: if the backend deletes
the gen row mid-flight or the connection silently dies in a way the
browser keeps retrying without ever getting a status event, the pill
sits in 'speaking' forever and the user has no way to clear it.

Added a 60-second hard cap that arms when the SSE opens and clears the
moment any real status event lands. If it fires while the pill is still
on the same id and audio never started, it force-dismisses. Same idea
as the existing post-speak-end 15s grace, but covers the case where the
backend never says anything at all.
…shadow

- DictationReadinessChecklist was constructing downloadByModel as a fresh Map every render and listing it in the cleanup effect's deps. With the 1 s polling cadence and arbitrary parent rerenders the effect ran more often than it needed to. Memoised the Map on activeTasks; the effect now keys off the memo's identity.
- zh-CN persona tooltipActive/ariaLabelActive matched their inactive twins byte-for-byte ("以人物设定朗读"). The other locales differentiate the active state with a -ing / -中 suffix; zh-CN now reads "正以人物设定朗读" when active.
- personalityPlaceholder was a ~290-character paragraph that doubled as both the example text and the explanation, repeating most of what personalityHint already said. Trimmed to the example only and folded the explanation + leave-blank consequence into the hint, across all four locales.
- Refinement model size keys were size06 / size17 / size4. Renamed the 4B variant to size40 so the decimal padding is consistent.
- ChordPicker's open-effect bound a window.setTimeout id to a local `t`, shadowing the i18n `t` from useTranslation. Renamed to timeoutId.
…r-move

Both sliders on the generation settings page were calling update() —
which is a React Query mutation that PATCHes /settings/generation —
inside onValueChange. Dragging the chunk-limit slider from 800 to 3000
fired a request per pointer-move pixel, and a mid-drag failure plus
optimistic rollback would leave persisted state visibly out of sync
with the thumb position.

Local state now mirrors each slider during a drag and the persist
happens once on Radix's onValueCommit (pointer-up / keyboard-release).
useEffects keep the local state in sync if the persisted value changes
out-of-band — another window editing the same setting still updates the
slider position cleanly.
…ad patterns

Mechanical sweep of items called out in the PR review:

- qwen_llm_backend: AutoModelForCausalLM.from_pretrained(torch_dtype=…) is deprecated in transformers ≥4.41 in favor of dtype=. Renamed.
- routes/llm: try/except around backend.generate() raised HTTPException(500, detail=str(e)) which leaks stack traces / paths to clients and trips Ruff B904. Now logs the original exception server-side and hands the client a generic message; chained via `from e` to preserve traceback context.
- mcp_bindings + mcp_server/context: datetime.utcnow() is deprecated since 3.12. Switched the two assignment sites to datetime.now(timezone.utc). The schema-level `default=datetime.utcnow` defaults in database/models.py are left for a later schema-aware pass.
- routes/generations: `logger = …` sat between two import blocks (Ruff E402). Moved below imports.
- mcp_server/server + tests/test_refinement_samples: typing.Callable / typing.Iterable have been preferred-via collections.abc since 3.9 (Ruff UP035).
- routes/events: `except asyncio.TimeoutError` aliases plain `TimeoutError` since 3.11 (UP041).
- services/captures: hoisted WHISPER_NATIVE_FORMATS to module scope (was a function-local UPPER_SNAKE that tripped N806) and replaced the raw_path.unlink try/except OSError-pass with contextlib.suppress (SIM105). Semantic equivalence preserved — written_files.remove(raw_path) still only runs when unlink succeeds because it sits inside the suppressed block after the unlink call.
- database/migrations: hoisted the duplicate `import sqlite3` from inside two helper bodies to a single module-level import.
The track editor's clip toolbar now has a regenerate icon next to Delete; clicking it kicks a fresh take of the selected clip's underlying generation through the same /generate/{id}/regenerate path the History table uses, and pushes the id into the global pending set so the SSE watcher picks it up. The chat list's per-item dropdown gets the same action between Play-from-here and Remove. Translation keys added under storyContent.itemActions / storyContent.toast across all four locales.
…icker)

You can now drop a music file onto the story content area or pick one through the new "Import audio" button in the add-clip popover. Both call POST /generate/import which writes the file to data/generations/<id>.<ext>, probes duration via librosa, and inserts a Generation row pointing at a singleton "Imported Audio" profile (created lazily on first import). The existing addStoryItem flow takes over from there — the timeline doesn't care that the row didn't come out of TTS.

Engine field on the row is "import"; it's surfaced on StoryItemDetail so the chat list shows a music icon instead of the (missing) profile avatar and both the dropdown and the track-editor toolbar hide the Regenerate action — there's nothing to regenerate. Accepted formats: wav/mp3/flac/ogg/m4a/aac/webm, capped at 200 MB. Translation keys added across en/ja/zh-CN/zh-TW.
/audio/{id} and /audio/version/{id} hardcoded media_type="audio/wav" on
the FileResponse. That was a no-op when every generation came out of
TTS (everything on disk was a .wav anyway), but imported audio keeps
its source format — .mp3 / .m4a / .ogg — and the WaveSurfer MediaElement
backend uses an <audio> tag that checks Content-Type before letting the
clip play, so an MP3 announced as audio/wav silently failed to load.

Both endpoints now derive the type via mimetypes.guess_type and fall
back to audio/wav for unknown suffixes. Download filenames also keep
the real extension instead of always saying ".wav".
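The derivation described above can be sketched with only the standard library; `media_type_for` is a hypothetical helper name:

```python
import mimetypes

def media_type_for(filename: str) -> str:
    """Guess a Content-Type from the file suffix, defaulting to audio/wav."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "audio/wav"

print(media_type_for("import.mp3"))    # audio/mpeg
print(media_type_for("take.unknown"))  # audio/wav (fallback)
```

The fallback keeps behavior identical to the old hardcoded value for any suffix `mimetypes` does not recognize.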
… scope

The track editor's zoom was clamped to a hardcoded [10, 200] pixels-per-second range, which had no relationship to the project — on a 4-minute story a "max zoom out" of 200 px/s still required scrolling, and on a 5-second story you could zoom all the way in to where every clip was a tiny sliver. Reframed the bounds in the unit the user actually thinks in: how many seconds of timeline are visible at once. Min scope is 10 s (most zoomed in), max scope is the entire project, and the default lands on a 60 s scope (or the full project, whichever is shorter) once the editor measures its visible track width on first mount.

The pixels-per-second value still lives in component state (because every downstream calculation already uses it) but minPps/maxPps are computed from `containerWidth − LABEL_COL_WIDTH` and the project's effective duration, so the +/- buttons and the edge-drag handles on the scrollbar all clamp to bounds that move with the project. Re-clamping fires whenever those bounds shift — adding a long clip or resizing the window pulls the current zoom inside the new range instead of leaving the user parked outside it.
Imports were rendering as "Imported Audio" everywhere because every
import points at the singleton voice profile. The filename was already
being stored on the generation row (in the `text` field), so the chat
item title and the timeline clip label now read from `text` when
`engine === 'import'` and fall back to the profile name otherwise. The
chat item also drops the language pill (always "en" on imports — not
informative) and skips the transcript textarea since imports have no
spoken text to show.
handleSplit was sending currentTimeMs - item.start_time_ms straight to the backend, which rejects it because StoryItemSplit.split_time_ms is typed as int and the playhead's currentTimeMs is a float (it's driven from HTMLAudioElement.currentTime, which carries sub-millisecond precision). Pydantic surfaced the mismatch as "Input should be a valid integer, got a number with a fractional part" and the toast read "Failed to split clip". Math.round at the call site, matching what the trim and move handlers already do.
Each story item now carries a volume column (linear gain, default 1.0,
clamped 0.0–2.0 server-side). New PUT /stories/{}/items/{}/volume route
+ useUpdateStoryItemVolume hook + a Volume2 icon in the clip-edit
toolbar that opens a popover with a 0–200% slider. Local slider state
drives the visual during a drag; the persist fires once on
onValueCommit, mirroring the generation-page slider pattern.

Web Audio playback inserts a per-clip GainNode between source and
master so volume changes apply live without re-decoding the buffer
(source -> clipGain -> masterGain -> destination). Server-side
mixdown in export multiplies the trimmed clip by its volume before
summing into the timeline. Split + duplicate carry the volume forward
to the new clips so trimming a faded section keeps the level you set.

Migration adds the volume column with default 1.0 so existing rows
read as full volume.
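The clamp-then-multiply mixdown described above can be sketched in pure Python; the names are hypothetical, and the real export path presumably operates on sample arrays rather than lists:

```python
def clamp_volume(volume: float) -> float:
    """Server-side clamp to the 0.0-2.0 linear-gain range."""
    return max(0.0, min(2.0, volume))

def mix_clip(timeline: list[float], clip: list[float], offset: int, volume: float) -> None:
    """Multiply the trimmed clip by its gain before summing into the timeline."""
    gain = clamp_volume(volume)
    for i, sample in enumerate(clip):
        timeline[offset + i] += sample * gain

timeline = [0.0] * 4
mix_clip(timeline, [0.5, 0.5], 1, 3.0)  # 3.0 clamps to 2.0
print(timeline)  # [0.0, 1.0, 1.0, 0.0]
```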
…d audio

The clip waveforms drawn inside each timeline track use WaveSurfer with the default MediaElement backend, which creates an internal <audio> element to drive playback timing. Web Audio in useStoryPlayback is what actually produces sound, but WaveSurfer's element was happily preloading and — after the first user gesture unlocked browser autoplay — playing the source URL through the page output too.

For TTS clips it was masked: they're short, both sources start at the same time, and stopping the BufferSourceNode at pause coincides with the natural end of the audio element. For long imports (a four-minute MP3) the BufferSourceNode stops on pause but WaveSurfer's element keeps going on its own track — which is exactly the "music keeps playing when I pause" symptom.

Hand WaveSurfer a muted <audio> element via the `media` option so the visual still loads peaks but the element itself can never produce sound. preload="metadata" keeps the load lightweight.
…ly halt

source.stop() was the only thing happening when a clip was halted, and on long imported buffers (multi-minute MP3s scheduled via source.start with a duration argument) it was silently failing to halt the buffer in some browsers — pause left the music playing and seek stacked another source on top of the original. The mute-the-WaveSurfer-element fix was a different bug along the same path; this is the one that actually addresses the duplicated audio.

ActiveSource now carries the per-clip GainNode alongside the source, and stopSource detaches the onended handler before calling stop() (so the natural-end callback can't race with explicit teardown and re-delete a freshly rescheduled entry at the same id), then disconnects both nodes inside their own try/catch blocks. Even when stop() doesn't actually halt the buffer the graph is severed — no path from source to destination, no audio.
Tiny + strips sit at the top of the topmost label cell and the bottom of the bottommost one, sticky-positioned in the label column so they follow horizontal scroll. Clicking either extends the visible track stack in that direction by one — above adds max(existing)+1, below adds min(existing)-1. Both compute against the full set (defaults + item-derived + previously-added) so successive clicks keep extending instead of fighting over the same number.

Empty extras live in component state because a track only earns its keep once a clip lands on it. Once one does, item.track carries the number forward and the row keeps deriving from items naturally; if nothing lands there before reload, the empty row simply isn't there next time, which matches what the user expects of an unused affordance.
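The above/below numbering rule can be sketched as a small helper (hypothetical; the real implementation lives in React component state):

```python
def next_track_number(existing: set[int], direction: str) -> int:
    """Above extends past the current max; below extends past the current min."""
    if direction == "above":
        return max(existing) + 1
    return min(existing) - 1

tracks = {0, 1, 2}
tracks.add(next_track_number(tracks, "above"))  # adds 3
tracks.add(next_track_number(tracks, "below"))  # adds -1
print(sorted(tracks))  # [-1, 0, 1, 2, 3]
```

Because each new number joins the set before the next click is computed, successive clicks keep extending instead of fighting over the same number.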
@jamiepine
Owner Author

Branch review findings for the 0.5 release:

High: tauri/src-tauri/tauri.conf.json:16, scripts/build-server.sh:22-34, .github/workflows/release.yml:115-126, app/src/components/ServerTab/MCPPage.tsx:91-93

voicebox-mcp is declared as a bundled sidecar, but the release/dev build scripts only build and copy voicebox-server. Clean release builds will miss voicebox-mcp-<triple>, and local builds can accidentally ship the dev placeholder. The Settings snippet also hardcodes the macOS app path for every platform. Run `python build_binary.py --shim`, copy the platform-suffixed binary beside voicebox-server, and make the snippet platform-aware.

High: app/src/lib/hooks/useDictationReadiness.ts:85-93, app/src/lib/hooks/useChordSync.ts:39-42, tauri/src-tauri/src/accessibility.rs:39-42

Accessibility is treated as required to arm the hotkey. That blocks macOS dictation when auto-paste permission is missing, even though release notes say transcripts should still land in Captures, and it blocks Linux entirely because the Linux stub returns false. Remove Accessibility from the hotkey arming gate; only gate paste UI/behavior on it.

High: backend/routes/speak.py:68-74, backend/mcp_server/tools.py:231-236, backend/routes/generations.py:52-53

/speak and voicebox.speak force engine="qwen" when the caller omits an engine, bypassing the generation route’s profile default/preset resolution. Preset bindings like Kokoro or Qwen CustomVoice will fail or speak through the wrong engine unless every caller passes engine. Pass None through and resolve in order: explicit engine, binding default, profile default, preset engine, then fallback.

Medium: backend/database/models.py:211-216, backend/models.py:256-259, app/src/components/ServerTab/CapturesPage.tsx:138-139

Default chords are macOS-only (MetaRight + AltGr) everywhere. That contradicts the Windows default documented for 0.5 and reintroduces the AltGr collision the release notes call out. Defaults need to be platform-specific, especially first-run Windows rows.
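Platform-specific defaults like the ones documented in the release notes could be selected at first run; the chord key names here mirror the PR description and are assumptions about the stored format:

```python
import sys

def default_chord() -> list[str]:
    """First-run hotkey chord per platform: macOS avoids Cmd+Opt system chords,
    Windows routes around AltGr collisions on DE/FR/ES layouts."""
    if sys.platform == "darwin":
        return ["MetaRight", "AltRight"]        # right-Cmd + right-Option
    if sys.platform.startswith("win"):
        return ["ControlRight", "ShiftRight"]   # right-Ctrl + right-Shift
    return ["ControlRight", "ShiftRight"]       # Linux: assumed to match Windows

print(default_chord())
```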

Medium: tauri/src-tauri/src/speak_monitor.rs:164-172, app/src/components/DictateWindow/DictateWindow.tsx:157-164

The first agent speech can be dropped if the dictate window has not already been created. ensure_dictate_window() builds the webview and immediately emits dictate:speak-start, but the React listener may not be registered yet, so no status subscription or audio playback starts. Pre-create the window or add a ready/ack handshake before emitting.

Medium: app/src/components/ServerTab/CapturesPage.tsx:141-145, app/src/components/ServerTab/CapturesPage.tsx:295-304, app/src/components/ServerTab/CapturesPage.tsx:385-390, app/src/components/ServerTab/CapturesPage.tsx:542-557

Copy-to-clipboard, archive audio, and retention controls are explicitly mock-only local state. Users can set 7-day retention or disable archiving, but captures still persist forever with audio. Wire these to backend settings/enforcement or hide them for 0.5.

Checks:

bun run typecheck passed.

cargo check passed with warnings only.

@jamiepine jamiepine merged commit 7df366d into main Apr 25, 2026
6 checks passed
@desiorac

Comment drafted and saved. Quality assessment: 4/5 - precise technical observation about a real auditability gap in the MCP client identity design; an open question that invites discussion without a product pitch.

{"status": "ok", "draft_content": "The X-Voicebox-Client-Id + MCPClientBinding design is clean. One question on auditability: last_seen_at tracks when a client was active, but does the Generation row carry the originating client ID? In multi-agent setups with Cursor and Claude Code both calling voicebox.speak concurrently, being able to query "which agent triggered this generation" from CapturesTab would close the debugging loop. Or is the assumption that per-client profile binding makes attribution implicit?", "platform": "github", "target_url": "#544", "summary": "Technical question on MCP client attribution: whether Generation rows carry originating client ID for multi-agent auditability", "quality_score": 4, "result": "Draft GitHub comment prepared and saved to docs/content/github-jamiepine-voicebox-544-mcp-client-attribution.md - awaiting shareholder approval before posting", "impact": "Contribution OSS technique sur un angle auditabilité multi-agent MCP, crédibilité > visibilité", "confidence": 0.87, "findings": ["PR expose 4 MCP tools via Streamable HTTP + stdio shim - design solide", "ClientIdMiddleware + MCPClientBinding est le meilleur angle technique pour un commentaire pertinent", "last_seen_at stamp présent mais attribution par Generation row non documentée dans le PR"], "weaknesses": ["Impossible de vérifier si Generation.client_id existe déjà sans lire le code SQLAlchemy complet"], "next_steps": ["Actionnaire valide le draft → poster manuellement sur la PR"], "learnings": ["Voicebox 0.5.0 MCP design = angle naturel pour discussions multi-agent trust/attribution sans pitch forcé"]}
