Skip to content

feat: design spec for audio + video modes#24

Draft
marksverdhei wants to merge 1 commit intomainfrom
feat/audio-video-modes
Draft

feat: design spec for audio + video modes#24
marksverdhei wants to merge 1 commit intomainfrom
feat/audio-video-modes

Conversation

@marksverdhei
Copy link
Copy Markdown
Owner

Summary

Design proposal (no code yet) for extending Sorting Hat with audio and video modes, alongside the existing text and image paths. Triggered by the new local multimodal endpoint (Gemma 4 E4B via llama.cpp) that accepts input_audio parts and a frames+audio video shape.

Spec lives at docs/superpowers/specs/2026-04-22-audio-video-modes-design.md.

Highlights:

  • Magic-byte detection for audio (WAV/MP3/FLAC/OGG/AAC/M4A) and video (MP4/MOV/WebM/MKV/AVI), with --audio / --video overrides mirroring --image.
  • Audio mode: transcode to 16 kHz mono WAV via ffmpeg, send as input_audio.
  • Video mode: extract K=4 frames + optional audio track, send all in one chat completion.
  • New env vars HAT_VIDEO_FRAMES, HAT_VIDEO_INCLUDE_AUDIO. ffmpeg/ffprobe become a soft dep — skip with a clear message if missing.
  • Guard-clause + multi-turn flow stays unchanged.

Spec was reviewed by the spec-document-reviewer subagent: approved with five advisory notes (part ordering vs. existing image mode, process_file arg growth, guard-clause prompt wording, .ogv fallback, ID3v2 test coverage) — all to be folded into the implementation plan.

Test plan

  • Maintainer reviews spec doc and approves direction
  • Implementation plan written and reviewed (separate PR or pushed onto this branch)
  • Implementation lands with bats tests for detection + payload shape
  • Manual smoke test against Gemma 4 E4B endpoint with sample audio/video files

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant