The voice layer for any coding agent — real barge-in, streaming latency, and the agents you already use.
한국어 · 日本語 · 中文 · Español · Français · Русский
VerbalCoding turns a Discord voice channel into a hands-free cockpit for any CLI coding agent. Hermes ships its own `/voice join` for Hermes; VerbalCoding is a thin, agent-agnostic layer that puts the same loop on top of Hermes, Claude Code, Codex, Gemini, OpenCode, OpenClaw, Aider, Cursor CLI, or any non-interactive shell command, and it smooths the rough edges that other voice frontends still have on their roadmap:
- True audio barge-in: interrupt the agent mid-sentence; Hermes' built-in voice pauses its listener during TTS.
- Streaming pipeline: first sentence plays while the agent is still writing (Hermes lists this as a future Phase-4 item).
- Smart progress narration: describes intent ("wiring the new login route"), not file lists.
- Voice plan mode: say "plan it first", edit by voice ("skip step 3"), say "approve" to execute.
- Cross-agent routing by voice: "ask Codex what it thinks" for a single turn, "switch to Aider" to make it sticky, "back to default" to restore. The plan can also emit a `which_agent` slot so the agent itself picks the next backend.
- Phone-down mode: push notification with a voice summary when a long task completes and the room is empty.
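
The routing phrases in the list above could be mapped to intents roughly as follows. This is a minimal sketch, not VerbalCoding's actual parser: the function name `parseRoutingIntent`, the intent shapes, and the `KNOWN_AGENTS` list are all assumptions made for illustration.

```javascript
// Hypothetical sketch: map a transcribed utterance to a routing intent.
// The agent list and intent shapes are illustrative assumptions,
// not VerbalCoding's real internals.
const KNOWN_AGENTS = ["hermes", "claude", "codex", "gemini", "opencode", "openclaw", "aider", "cursor"];

function parseRoutingIntent(utterance) {
  const text = utterance.trim();
  const agents = KNOWN_AGENTS.join("|");

  // "back to default" restores the configured default agent.
  if (/^back to default\b/i.test(text)) {
    return { kind: "reset" };
  }

  // "switch to Aider" makes the new agent sticky for later turns.
  const sticky = text.match(new RegExp(`^switch to (${agents})\\b`, "i"));
  if (sticky) {
    return { kind: "switch", agent: sticky[1].toLowerCase() };
  }

  // "ask Codex what it thinks" routes a single turn only.
  const single = text.match(new RegExp(`^ask (${agents})\\b[\\s,:]*(.*)$`, "is"));
  if (single) {
    return { kind: "ask", agent: single[1].toLowerCase(), prompt: single[2].trim() };
  }

  // Anything else goes to whichever agent is currently active.
  return { kind: "passthrough", prompt: text };
}
```

Keeping the three cases as distinct intent kinds lets the caller decide what "sticky" means (for example, storing the agent name on the session) without the parser knowing about session state.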
| Capability | Why it matters |
|---|---|
| Agent choice, first-class | Hermes Agent, Claude Code, Codex, Gemini CLI, OpenCode, OpenClaw, Aider, Cursor CLI, or any custom command. vc setup auto-detects what's installed. |
| Cross-agent voice routing | Say "ask Codex …" (single turn), "switch to Aider" (sticky), or "back to default". Missing binaries are detected and the bridge offers to fall back to the default agent. Handoff prompts carry recent utterances + last plan decisions to the new agent. |
| Real barge-in | VAD thresholds tuned for indoor and noisy rooms; cut in mid-utterance and resume the conversation. |
| Streaming end-to-end | STREAMING_TTS=1 plays sentences as the agent produces them; first audio in well under a second on a warm cache. |
| Smart progress | Optional LLM summarizer collapses raw events into one human sentence; falls back to the existing regex labels when no key is set. |
| Plan-mode by voice | Narrated, editable, voice-driven plans without touching the keyboard. |
| Phone-down handoff | Long task + empty VC = push notification (ntfy/pushover) with a redacted one-line summary and tap-to-rejoin link. |
| Local speech loop | Discord audio is transcribed by local whisper-cli; TTS via Edge, OpenVoice, SpeechSwift/CosyVoice, or Supertonic. |
| Real operations support | Doctor auto-fixes, Docker UDP guidance, latency metrics, multi-instance project rooms, redacted config checks. |
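
The streaming row above describes playing sentences while the agent is still producing text. One way to picture that step is a small buffer that collects streamed chunks and releases each sentence as soon as it is complete. The class name `SentenceBuffer` and the boundary rule are assumptions for illustration; the real pipeline may split text differently.

```javascript
// Hypothetical sketch: accumulate streamed agent text and release complete
// sentences so each one can be handed to TTS as soon as it is finished.
class SentenceBuffer {
  constructor() {
    this.pending = "";
  }

  // Add a streamed chunk; return any sentences that are now complete.
  push(chunk) {
    this.pending += chunk;
    const sentences = [];
    // Assumed rule: a sentence ends at ., !, or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call when the agent finishes to flush the trailing partial sentence.
  flush() {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}
```

This naive boundary rule will split after abbreviations such as "e.g. ", so a production splitter would likely need extra handling for those cases.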
Already using Hermes Agent? Hermes itself has a working Discord voice loop via `/voice join` / `/voice channel`. Use VerbalCoding when you want it agent-agnostic, want barge-in and streaming today, or want plan-mode, push handoff, and smart narration on top of the same loop. The two coexist: VerbalCoding can drive Hermes as its backend.
npm install -g verbalcoding@latest
vc setup # detects installed agents and lets you pick
vc doctor
vc start

`vc setup` is the normal human path. Keep the Discord Developer Portal open while it asks for your bot token, application/client ID, transcript target, and voice channel names.
Automation can skip prompts, then fill Discord details later:
vc setup --yes
vc setup token <bot-token> --client-id <discord-client-id>
vc setup channels "General,Team Voice"
vc doctor

Contributor clone path:
git clone https://github.com/ca1773130n/VerbalCoding.git
cd VerbalCoding
./scripts/install.sh
vc doctor
./run.sh

- Create a Discord application and bot in https://discord.com/developers/applications.
- Enable the Message Content privileged intent.
- Run `vc setup` and paste the bot token plus application/client ID when prompted.
- Enter exact voice channel names for auto-join.
- Invite the bot with:
vc bot invite <discord-client-id>
vc bot invite <discord-client-id> --guild <guild-id>

Secrets are stored in ignored local env files with mode 0600 and are not printed back by `vc doctor`.
vc setup # guided setup with agent auto-detection
vc setup --yes # non-interactive bootstrap/starter config
vc setup token # rotate or add Discord bot token/client ID later
vc setup channels "General,Team Voice" # update auto-join voice channel names
vc bot invite CLIENT_ID # generate a Discord bot invite URL
vc status # show active language, TTS, bridge settings, and resolved backend
vc language ko|en|auto # switch STT/progress/TTS language preset
vc doctor # redacted health check with auto-fix suggestions
vc start # start the default bridge
vc instance setup NAME # create an isolated project voice bot
vc instance start NAME # run that bot in the background

In Discord:
| Command | What it does |
|---|---|
| `!join` / `!leave` | Join or leave your current voice channel. |
| `!ask <prompt>` | Send text to the same selected agent backend. |
| `!verbose on\|off` | Toggle short progress updates. |
| `!latency` / `!metrics` | Summarize recent STT/agent/TTS latency. |
| `!sensitivity normal\|conservative` | Tune barge-in for indoor or noisy environments. |
| `!session new <name> <workdir> [context] --voice <voice-channel>` | Bind a project session to a voice room. |
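
To give a feel for what a sensitivity setting like `!sensitivity normal|conservative` might control, here is a rough sketch of an energy gate that only reports an interruption after several consecutive loud frames. Everything in it is an assumption for illustration: the profile values, the function names, and the use of a plain RMS level. VerbalCoding's actual voice-activity detection and tuned thresholds are not shown here.

```javascript
// Hypothetical sketch of an energy gate for barge-in. The threshold values
// and frame counts below are illustrative assumptions, not VerbalCoding's
// tuned settings.
const PROFILES = {
  normal: { rmsThreshold: 0.02, framesRequired: 3 },
  conservative: { rmsThreshold: 0.06, framesRequired: 6 },
};

// Root-mean-square level of one frame of 16-bit PCM samples, scaled to 0..1.
function frameRms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) {
    const x = s / 32768;
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Returns a function that is fed one frame at a time and reports true once
// enough consecutive loud frames have arrived to count as an interruption.
function createBargeInDetector(profileName) {
  const profile = PROFILES[profileName];
  if (!profile) throw new Error(`unknown profile: ${profileName}`);
  let loudFrames = 0;
  return function onFrame(samples) {
    loudFrames = frameRms(samples) >= profile.rmsThreshold ? loudFrames + 1 : 0;
    return loudFrames >= profile.framesRequired;
  };
}
```

In this sketch the conservative profile requires both a louder signal and a longer run of frames, which is one plausible way to avoid false interruptions in a noisy room at the cost of slightly slower barge-in.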
The differentiation push is tracked in docs/ROADMAP.md. Five phases land the claims above:
| # | Phase | What it adds |
|---|---|---|
| 1 | Streaming pipeline | Sentence-by-sentence TTS while the agent is still writing. |
| 2 | Agent-agnostic adapters | First-class Aider + Cursor CLI; vc setup auto-detects. |
| 6 | Smart progress | LLM-summarized narration. Falls back to today's regex labels. |
| 7 | Voice plan mode | Narrate plan, voice-edit, approve to execute. |
| 10 | Push notification handoff | ntfy/Pushover when a long task ends and the room is empty. |
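
The push-notification phase can be pictured as building one HTTP request to an ntfy server. The sketch below only constructs that request; it assumes ntfy's publish-by-POST interface with its `Title` and `Click` headers. The function names, the redaction pattern, the title text, and the example topic and rejoin URL are placeholders, not VerbalCoding's actual implementation.

```javascript
// Hypothetical sketch: build an ntfy publish request for a finished task.
// The redaction pattern and title text are illustrative placeholders.
function redactSummary(summary) {
  return summary
    .replace(/\s+/g, " ") // collapse to a single line
    .replace(/\b[A-Za-z0-9_\-]{24,}\b/g, "[redacted]") // long token-like strings
    .trim();
}

function buildNtfyRequest({ server, topic, summary, rejoinUrl }) {
  return {
    url: `${server.replace(/\/+$/, "")}/${encodeURIComponent(topic)}`,
    options: {
      method: "POST",
      headers: {
        Title: "Task finished",
        Click: rejoinUrl, // ntfy opens this URL when the notification is tapped
      },
      body: redactSummary(summary),
    },
  };
}

// Sending is then a single call on Node.js 18+ (global fetch), for example:
//   const { url, options } = buildNtfyRequest({ server, topic, summary, rejoinUrl });
//   await fetch(url, options);
```

Separating request construction from sending keeps the redaction step easy to test without touching the network.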
| Guide | What you get |
|---|---|
| Docs hub | One page linking every guide and localized doc set. |
| Roadmap | Differentiation plan and per-phase implementation plans. |
| Fresh Install | npm/global setup, Discord app setup, token/channel commands, first run. |
| Usage Guide | CLI commands, Discord commands, run modes, voice changes, latency metrics. |
| Hermes Built-in Voice vs VerbalCoding | What Hermes already supports and when VerbalCoding is worth adding. |
| Configuration | .env, agent backends, MCP server, TTS backends, operational notes. |
| Troubleshooting | Docker host networking, UDP voice failures, missing token/channel diagnostics. |
| Multi-Instance | One permanent Discord voice room per project. |
| Release Notes | Current capabilities, checks, and public-release gaps. |
| Layer | Default |
|---|---|
| Runtime | Node.js 20+ and npm; setup can install via Homebrew/apt/dnf/pacman where supported. |
| Audio | ffmpeg; setup/doctor can install it on supported OSes. |
| Speech recognition | Local whisper-cli from whisper.cpp plus models/ggml-small-q5_1.bin. |
| TTS | Edge TTS by default; optional OpenVoice, SpeechSwift/CosyVoice, Supertonic, OmniVoice, and Qwen3 TTS CLI paths. |
| Discord | Bot token, Message Content intent, voice permissions, matching auto-join channel names. |
| Agent | At least one CLI harness installed; vc setup auto-detects Hermes, Claude Code, Codex, Gemini, OpenCode, OpenClaw, Aider, Cursor CLI. |
| Platform focus | macOS / Apple Silicon most tested; Linux bootstrap is best-effort; Windows unsupported for now. |
Discord text login can work while voice join fails if outbound UDP is blocked. If logs show `Cannot perform IP discovery - socket closed`, use Linux host networking for the service that runs `vc start`:

services:
  verbalcoding:
    network_mode: "host"

Do not combine `network_mode: "host"` with `ports:`. Docker Desktop for macOS/Windows behaves differently; if UDP still fails there, run VerbalCoding directly on the host or a Linux VM.
Run lightweight checks before sending changes:
node --check app-node/main.mjs
npm test
bash -n run.sh scripts/install.sh scripts/bootstrap_prereqs.sh
npm pack --dry-run
vc doctor

Public-release oriented but still early. The roadmap above tracks live differentiation work. Demo video/GIF, broader Linux validation, CI, and deeper security review are still TODOs.