Add audio input support for HTTP Gateway and CLI (Phase 2) #412

@yacosta738

Context

Follow-up from #246 (audio input support Phase 1). Phase 1 delivered core transcription infrastructure + Telegram channel. Phase 2 extends audio input to the remaining two entry points defined in the PRD.

Scope

HTTP Gateway

  • New POST /web/chat/audio endpoint accepting multipart/form-data
  • Fields: audio (file), session_id (optional), language (optional)
  • Body limit increased to 25 MiB for this endpoint only
  • Validate the file, stage it, transcribe it, then dispatch the resulting text through the existing text path
  • Return transcription + agent response
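
As a rough illustration of the "validate file" step, here is a minimal std-only sketch that enforces the 25 MiB limit and sniffs the container format from leading magic bytes. The names `MAX_AUDIO_BYTES`, `AudioFormat`, `AudioValidationError`, `sniff_format`, and `validate_audio` are hypothetical and not taken from the codebase, and the error enum does not attempt to mirror the six PRD error types. The byte checks are heuristics: an `ftyp` box matches the whole MP4 family, not only M4A, and the MP3 frame-sync test can also match other MPEG audio streams.

```rust
/// Upload limit named in the issue: 25 MiB for the audio endpoint only.
const MAX_AUDIO_BYTES: usize = 25 * 1024 * 1024;

/// Container formats listed in the acceptance criteria.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AudioFormat {
    Ogg,
    Mp3,
    Wav,
    M4a,
}

/// Hypothetical error type for this sketch (not the PRD's error list).
#[derive(Debug, PartialEq, Eq)]
enum AudioValidationError {
    Empty,
    TooLarge { size: usize, limit: usize },
    UnsupportedFormat,
}

/// Guess the container from its leading bytes. Heuristic only.
fn sniff_format(bytes: &[u8]) -> Option<AudioFormat> {
    // Ogg pages start with the capture pattern "OggS".
    if bytes.starts_with(b"OggS") {
        return Some(AudioFormat::Ogg);
    }
    // WAV is a RIFF file whose form type (offset 8) is "WAVE".
    if bytes.len() >= 12 && bytes.starts_with(b"RIFF") && bytes[8..].starts_with(b"WAVE") {
        return Some(AudioFormat::Wav);
    }
    // MP4-family files (including M4A) carry an "ftyp" box type at offset 4.
    if bytes.len() >= 8 && bytes[4..].starts_with(b"ftyp") {
        return Some(AudioFormat::M4a);
    }
    // MP3 either starts with an ID3v2 tag or with an 11-bit frame sync.
    if bytes.starts_with(b"ID3") {
        return Some(AudioFormat::Mp3);
    }
    if bytes.len() >= 2 && bytes[0] == 0xFF && (bytes[1] & 0xE0) == 0xE0 {
        return Some(AudioFormat::Mp3);
    }
    None
}

/// Check emptiness and size first, then the format.
fn validate_audio(bytes: &[u8]) -> Result<AudioFormat, AudioValidationError> {
    if bytes.is_empty() {
        return Err(AudioValidationError::Empty);
    }
    if bytes.len() > MAX_AUDIO_BYTES {
        return Err(AudioValidationError::TooLarge {
            size: bytes.len(),
            limit: MAX_AUDIO_BYTES,
        });
    }
    sniff_format(bytes).ok_or(AudioValidationError::UnsupportedFormat)
}
```

Sniffing content rather than trusting the multipart `Content-Type` or file extension is a design choice assumed here; a real handler would likely also need a duration check, which requires decoding and is out of scope for this sketch.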

CLI

  • /audio <path> command for local file transcription
  • Read local file, validate format/size/duration
  • Stage as StagedAudio, transcribe, and inject the transcribed text into the conversation
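
One possible shape for recognizing the `/audio <path>` command in the CLI input loop is sketched below. `CliInput`, `CommandError`, and `parse_cli_input` are hypothetical names for illustration; the actual command dispatch in the CLI may look quite different. The sketch only separates the command from ordinary text and extracts the path; reading, validating, and staging the file would follow.

```rust
use std::path::PathBuf;

/// What a line of CLI input turned out to be (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum CliInput {
    /// `/audio <path>`: a local file to transcribe.
    Audio(PathBuf),
    /// Anything else is treated as a normal text message.
    Text(String),
}

/// Hypothetical error for a malformed `/audio` command.
#[derive(Debug, PartialEq, Eq)]
enum CommandError {
    MissingPath,
}

fn parse_cli_input(line: &str) -> Result<CliInput, CommandError> {
    let trimmed = line.trim();
    match trimmed.strip_prefix("/audio") {
        // Bare "/audio" with no argument.
        Some(rest) if rest.is_empty() => Err(CommandError::MissingPath),
        // "/audio <path>": whitespace must separate the command from the
        // path, so "/audiobook" is not mistaken for the command.
        Some(rest) if rest.starts_with(char::is_whitespace) => {
            Ok(CliInput::Audio(PathBuf::from(rest.trim())))
        }
        // Everything else goes down the existing text path.
        _ => Ok(CliInput::Text(trimmed.to_string())),
    }
}
```

The remainder of the line is kept whole as the path, so paths containing spaces work without quoting; whether the CLI should also accept quoted paths is left open here.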

Optional

  • Embedded transcription via whisper-rs behind --features audio-transcription (no external dependencies)
  • Model auto-download tooling

Acceptance Criteria

  • An integration can send audio through the HTTP Gateway and receive a normal agent response based on the transcription
  • A user can pass an audio file through the CLI and get the same conversational behavior
  • All supported formats work (OGG/Opus, MP3, WAV, M4A)
  • Error handling matches the 6 error types defined in the PRD
  • Existing text and image flows unaffected

References
