Problem
The bot already supports voice/audio input via STT, which is great for mobile use.
A natural follow-up for that workflow is optional TTS output: when a user has toggled TTS on with /tts and sends a Telegram voice/audio message, the bot could return the normal text reply plus an audio rendering of that same final assistant response.
This would improve hands-free/mobile usability without changing the normal text-first workflow (idea from OpenClaw).
Proposal
- Keep normal text prompts exactly as they are today: text reply only
- For Telegram voice/audio input:
  - transcribe with the existing STT flow
  - send the normal final text response
  - optionally send a TTS audio file of that exact final assistant text
- Make it opt-in: set up via env config and switched on with a /tts toggle, disabled by default
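As a rough illustration of the "disabled by default" toggle, here is a minimal Python sketch. The `TtsToggle` class, its in-memory storage, and the per-chat keying are all assumptions made for illustration; the actual bot may store this state differently or in another language.

```python
from dataclasses import dataclass, field


@dataclass
class TtsToggle:
    """Hypothetical in-memory /tts switch; chats not in the set have TTS off."""

    enabled_chats: set[int] = field(default_factory=set)

    def toggle(self, chat_id: int) -> bool:
        """Flip the switch for one chat and return the new state."""
        if chat_id in self.enabled_chats:
            self.enabled_chats.remove(chat_id)
            return False
        self.enabled_chats.add(chat_id)
        return True

    def is_enabled(self, chat_id: int) -> bool:
        """Report whether TTS is currently on for this chat (off by default)."""
        return chat_id in self.enabled_chats
```

A /tts command handler would call `toggle(chat_id)` and report the returned state back to the user.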
Proposed config
Something like:
- TTS_ENABLED=false (use the /tts toggle instead)
- TTS_API_URL= (falls back to STT_API_URL if unset)
- TTS_API_KEY= (falls back to STT_API_KEY if unset)
- TTS_MODEL=gpt-4o-mini-tts
- TTS_VOICE=alloy
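The fallback behaviour could be resolved along these lines. This is only a sketch: `TtsConfig` and `resolve_tts_config` are hypothetical names, treating an empty value as "unset" is an assumption, and the defaults simply mirror the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TtsConfig:
    api_url: Optional[str]
    api_key: Optional[str]
    model: str
    voice: str


def resolve_tts_config(env: Mapping[str, str] = os.environ) -> TtsConfig:
    """Read TTS_* settings, falling back to STT_* for the URL and key."""

    def first_set(*names: str) -> Optional[str]:
        # Assumption: an empty or whitespace-only value counts as unset.
        for name in names:
            value = env.get(name, "").strip()
            if value:
                return value
        return None

    return TtsConfig(
        api_url=first_set("TTS_API_URL", "STT_API_URL"),
        api_key=first_set("TTS_API_KEY", "STT_API_KEY"),
        model=first_set("TTS_MODEL") or "gpt-4o-mini-tts",
        voice=first_set("TTS_VOICE") or "alloy",
    )
```

With this shape, a deployment that already has STT configured would need no extra variables to try TTS.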
Scope / guardrails
To keep this small and low-risk:
- no change for text-origin prompts
- no streaming spoken output
- just one final audio file after the normal text reply
Why this seems aligned
This stays within the current single-chat / predictable interaction model in CONCEPT.md:
- it does not add parallelism or group-specific behavior
- it only extends the existing voice-input path
- it remains optional and disabled by default
Implementation notes
I already prototyped this locally in my fork and it was pretty contained:
- small TTS client modeled after the existing STT client
- lightweight tracking so only audio-origin prompts trigger TTS
- hook into the final assistant completion path after the normal text reply
- docs + tests included
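The tracking and completion hook described above might look roughly like this. It is a sketch, not the code from the fork: `AudioOriginTracker`, `on_final_reply`, and the injected `send_text`, `synthesize`, and `send_audio` callables are hypothetical stand-ins for the bot's real Telegram and TTS client functions.

```python
from typing import Awaitable, Callable


class AudioOriginTracker:
    """Remembers which chats have a pending prompt that came from voice/audio."""

    def __init__(self) -> None:
        self._pending: set[int] = set()

    def mark_audio_prompt(self, chat_id: int) -> None:
        self._pending.add(chat_id)

    def consume(self, chat_id: int) -> bool:
        """Return True once per audio-origin prompt, then forget it."""
        was_audio = chat_id in self._pending
        self._pending.discard(chat_id)
        return was_audio


async def on_final_reply(
    chat_id: int,
    text: str,
    *,
    tracker: AudioOriginTracker,
    tts_enabled: bool,
    send_text: Callable[[int, str], Awaitable[None]],
    synthesize: Callable[[str], Awaitable[bytes]],
    send_audio: Callable[[int, bytes], Awaitable[None]],
) -> bool:
    """Send the normal text reply, then one audio file if the prompt was audio."""
    await send_text(chat_id, text)  # the text reply always goes out first
    from_audio = tracker.consume(chat_id)  # cleared even when TTS is off
    if not (tts_enabled and from_audio):
        return False
    audio = await synthesize(text)  # render exactly the final assistant text
    await send_audio(chat_id, audio)
    return True
```

Consuming the marker before checking the toggle keeps a stale audio marker from triggering TTS on a later text prompt.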
Done criteria (optional)
- When sending a voice memo/file while TTS is enabled in .env and via /tts, the bot responds with the text output and then an audio file of that text.
- When sending a text input, the bot always replies with text as before, regardless of configuration.