feat: add comprehensive multimodal driver support#11
Conversation
Add support for new agi-driver multimodal features including audio, video, MCP servers, and tool choice configuration. ## Protocol Changes ### New Events - AudioTranscriptEvent: Audio transcript from buffer - VideoFrameEvent: Video frame from camera/screen - SpeechStartedEvent/SpeechFinishedEvent: TTS playback - TurnDetectedEvent: Voice turn detection ### New Commands - GetAudioTranscriptCommand: Request audio transcript - GetVideoFrameCommand: Request video frame ### Updated StartCommand - agent_identity: Agent identity (default: agi-2-claude) - tool_choice: Tool choice configuration - mcp_servers: MCP server configurations - audio_input_enabled, audio_buffer_seconds - turn_detection_enabled, turn_detection_silence_ms - speech_output_enabled, speech_voice - camera_enabled, camera_buffer_seconds - screen_recording_enabled, screen_recording_buffer_seconds ### New Interfaces - MCPServerConfig: MCP server configuration - AgentIdentity: Agent identity information - ToolChoice: Tool choice type ## Breaking Changes This is a breaking change with no backwards compatibility. StartCommand has many new optional fields. ## Related PRs - agi-api (driver): https://github.com/agi-inc/agents/pull/344 - agi-python: https://github.com/agi-inc/agi-python/pull/8 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Additional Comments (1)
Prompt To Fix With AIThis is a comment left during a code review.
Path: src/driver/protocol.ts
Line: 219:232
Comment:
**StartCommand now required**
`StartCommand` now requires `mode` and `agent_name` (`protocol.ts:228-231`), but `src/driver/driver.ts` still constructs `StartCommand` with `mode: mode ?? this.mode` where `this.mode` can be `undefined` and with `agent_name: this.agentName || undefined` (`driver.ts:247-251`). With the new types this becomes a type error, and at runtime it can emit invalid JSON to the driver. Either keep these fields optional in the protocol, or update `DriverOptions`/constructor defaults so `mode`/`agent_name` are always set before sending `start`.
How can I resolve this? If you propose a fix, please make it concise. |
Add documentation and reference implementation for new agi-driver multimodal features including audio, video, MCP servers, and tool choice configuration. ## Documentation Provided ### MULTIMODAL_UPDATES.md Complete guide for implementing multimodal features in C# SDK ### Protocol_Multimodal.cs Reference implementation of new protocol types: - New event classes (AudioTranscriptEvent, VideoFrameEvent, etc.) - New command classes (GetAudioTranscriptCommand, GetVideoFrameCommand) - Helper classes (MCPServerConfig, AgentIdentity, ToolChoice) - StartCommand extensions for multimodal features ## Changes Needed ### Protocol.cs - Add new event types to DriverEventType enum - Add new command types to DriverCommandType enum - Add new event/command classes - Add multimodal fields to StartCommand - Update AgentName default to "agi-2-claude" ### Driver.cs - Update event parsing for new event types - Add convenience methods for multimodal features ## Breaking Changes This is a breaking change with no backwards compatibility. StartCommand has many new fields (all with sensible defaults). ## Related PRs - agi-api (driver): https://github.com/agi-inc/agents/pull/344 - agi-python: https://github.com/agi-inc/agi-python/pull/8 - agi-node: agi-inc/agi-node#11 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add comprehensive multimodal features to AGI CLI: ## New CLI Options - --voice: Enable voice input/output (requires OPENAI_API_KEY) - --camera: Enable camera video feed - --screen: Enable screen recording - --mcp: Load MCP servers from config - --mcp-config: Custom MCP config path (default: ~/.agi/mcp.json) ## Features - Voice input with automatic turn detection - Text-to-speech output - Camera and screen video buffers - MCP server integration for extended tools - All features work together seamlessly ## Usage Examples agi --voice "What's the time?" agi --voice --screen "What's on my screen?" agi --voice --camera --screen --mcp "Help me with my work" ## Related PRs - agi-api (driver): https://github.com/agi-inc/agents/pull/344 - agi-python: https://github.com/agi-inc/agi-python/pull/8 - agi-node: agi-inc/agi-node#11 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…fix type issues - Added parseEvent() and serializeCommand() functions required by driver - Made agent_name, reason, message, question_id optional in commands - Fixed mode type to be literal union instead of string - Build now succeeds without type errors 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Additional Comments (1)
Prompt To Fix With AIThis is a comment left during a code review.
Path: src/driver/driver.ts
Line: 166:173
Comment:
**Docs/examples don’t match API**
`AgentDriver.start` is still a positional-args API (`start(goal, screenshot?, width?, height?, mode?)`) in `src/driver/driver.ts:166-173`, but `MULTIMODAL_UPDATES.md` documents calling `start({...})` with a `StartCommand`-shaped object. As-is, those examples will throw at runtime (object passed as `goal` string) and the multimodal options can’t be provided via the public API. Either update the docs to match the actual signature, or (if intended) change `AgentDriver.start` to accept an options object and forward the new multimodal fields.
How can I resolve this? If you propose a fix, please make it concise. |
- Add multimodal options to DriverOptions interface - Store voice, camera, screen, mcp, mcpConfig in AgentDriver - Implement loadMcpConfig() to read and parse MCP config files - Pass multimodal options to StartCommand: - audio_input_enabled, turn_detection_enabled, speech_output_enabled - camera_enabled, screen_recording_enabled - mcp_servers loaded from config file - Full implementation with no TODOs or shortcuts 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add missing switch cases in handleLine() for audio_transcript, video_frame, speech_started, speech_finished, and turn_detected events so they are properly emitted to listeners. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(ci): add extra-files config to bump Agi.csproj version release-please with release-type "simple" only bumps the manifest and CHANGELOG. The publish workflow reads the version from Agi.csproj, so we need extra-files to keep it in sync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: add comprehensive multimodal driver support documentation Add documentation and reference implementation for new agi-driver multimodal features including audio, video, MCP servers, and tool choice configuration. ## Documentation Provided ### MULTIMODAL_UPDATES.md Complete guide for implementing multimodal features in C# SDK ### Protocol_Multimodal.cs Reference implementation of new protocol types: - New event classes (AudioTranscriptEvent, VideoFrameEvent, etc.) - New command classes (GetAudioTranscriptCommand, GetVideoFrameCommand) - Helper classes (MCPServerConfig, AgentIdentity, ToolChoice) - StartCommand extensions for multimodal features ## Changes Needed ### Protocol.cs - Add new event types to DriverEventType enum - Add new command types to DriverCommandType enum - Add new event/command classes - Add multimodal fields to StartCommand - Update AgentName default to "agi-2-claude" ### Driver.cs - Update event parsing for new event types - Add convenience methods for multimodal features ## Breaking Changes This is a breaking change with no backwards compatibility. StartCommand has many new fields (all with sensible defaults). ## Related PRs - agi-api (driver): https://github.com/agi-inc/agents/pull/344 - agi-python: https://github.com/agi-inc/agi-python/pull/8 - agi-node: agi-inc/agi-node#11 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * feat(driver): wire multimodal options from DriverOptions to StartCommand - Add Voice, Camera, Screen, Mcp, McpConfig properties to DriverOptions - Store multimodal options in AgentDriver constructor - Add multimodal fields to StartCommand (audio, speech, camera, screen, MCP) - Pass options from DriverOptions to StartCommand in StartAsync(): Voice → AudioInputEnabled, TurnDetectionEnabled, SpeechOutputEnabled Camera → CameraEnabled Screen → ScreenRecordingEnabled Mcp → McpServers (loaded from config file) - Implement LoadMcpConfig() for reading MCP server configurations - Add multimodal event types to DriverEventType enum - Add multimodal command types to DriverCommandType enum - Add multimodal event parsing to DriverProtocol.ParseEvent() - Remove duplicate enum declarations from Protocol_Multimodal.cs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version to 0.5.0 for multimodal release Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add multimodal support (voice, camera, screen, MCP) Add comprehensive multimodal features to AGI CLI: ## New CLI Options - --voice: Enable voice input/output (requires OPENAI_API_KEY) - --camera: Enable camera video feed - --screen: Enable screen recording - --mcp: Load MCP servers from config - --mcp-config: Custom MCP config path (default: ~/.agi/mcp.json) ## Features - Voice input with automatic turn detection - Text-to-speech output - Camera and screen video buffers - MCP server integration for extended tools - All features work together seamlessly ## Usage Examples agi --voice "What's the time?" agi --voice --screen "What's on my screen?" agi --voice --camera --screen --mcp "Help me with my work" ## Related PRs - agi-api (driver): https://github.com/agi-inc/agents/pull/344 - agi-python: https://github.com/agi-inc/agi-python/pull/8 - agi-node: agi-inc/agi-node#11 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * feat(cli): add --api-url option for custom API endpoint Allows users to specify a custom AGI API endpoint URL: - Added apiUrl to CliArgs interface - Added --api-url CLI option - Pass apiUrl to useAgent hook Usage: agi --api-url http://localhost:8000 "your goal" 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * feat(cli): wire multimodal options through to driver - Update App.tsx to pass voice, camera, screen, mcp, mcpConfig to useAgent - Update UseAgentOptions interface to accept multimodal options - Pass all multimodal options to AgentDriver constructor - Complete end-to-end wiring: CLI args → App → useAgent → AgentDriver → API Now the --voice, --camera, --screen, --mcp flags are fully functional! 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * fix(hooks): add missing multimodal deps to useCallback array Add voice, camera, screen, mcp, mcpConfig to the start callback dependency array so React captures the correct values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: bump version to 0.6.0 for multimodal release Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(cli): remove unused imports and bump to 0.5.15 Remove unused mkdirSync, join, and color variable that caused ESLint failures in CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore(deps): require @agi_inc/agi-js ^0.5.0 for multimodal support CI will pass once agi-node 0.5.0 is published to npm. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Multimodal Driver Support - Node SDK Updates
This update adds comprehensive multimodal support to the Node SDK to match the new agi-driver capabilities.
Changes Made
Protocol Updates (
src/driver/protocol.ts)New Event Types
AudioTranscriptEvent: Audio transcript from bufferVideoFrameEvent: Video frame from camera/screenSpeechStartedEvent: TTS playback startedSpeechFinishedEvent: TTS playback finishedTurnDetectedEvent: Voice turn detectionNew Command Types
GetAudioTranscriptCommand: Request audio transcriptGetVideoFrameCommand: Request video frameNew Interfaces
MCPServerConfig: MCP server configurationAgentIdentity: Agent identity informationToolChoice: Tool choice configuration typeUpdated StartCommand
Added fields for multimodal configuration:
agent_identity?: AgentIdentity- Agent identity (default: agi-2-claude by AGI Company)tool_choice?: ToolChoice- Tool choice modemcp_servers?: MCPServerConfig[]- MCP server configurationsaudio_input_enabled?: boolean,audio_buffer_seconds?: numberturn_detection_enabled?: boolean,turn_detection_silence_ms?: numberspeech_output_enabled?: boolean,speech_voice?: stringcamera_enabled?: boolean,camera_buffer_seconds?: numberscreen_recording_enabled?: boolean,screen_recording_buffer_seconds?: numberExports (
src/driver/index.ts)Added exports for all new event and command types, plus helper interfaces.
Usage Examples
Basic Multimodal Session
Handling New Events
Voice-Only Mode
MCP Servers
Tool Choice Configuration
Breaking Changes
StartCommandinterface has many new optional fieldsagent_nameshould be set to"agi-2-claude"for new agentsTesting
Related PRs
Greptile Overview
Greptile Summary
This PR extends the driver protocol to support multimodal capabilities (audio transcript, video frames, speech/turn events) and adds corresponding command types and config interfaces. The
AgentDriveris updated to forward new event types and to optionally enable voice/camera/screen/MCP features by populating the newStartCommandfields.Key merge blockers are around MCP config handling and documentation accuracy: the default
~path expansion currently resolves to/.agi/...(root) rather than the user’s home, the MCP loader is typed asany[]and can silently disable MCP on malformed configs, and the new MULTIMODAL_UPDATES.md examples don’t match the actual exported JS/TS API (agentName+ positionalstart(...)).Confidence Score: 3/5
any[]and swallows parse/shape errors, making MCP silently not work in common cases. Additionally, documentation examples currently do not match the SDK API and will fail for users.Important Files Changed
agent_nameoption and object-basedstartsignature that doesn't matchAgentDriver.start.~expansion is incorrect and config loader is typed asany[]and swallows JSON errors.Sequence Diagram
sequenceDiagram autonumber participant App as User App participant SDK as AgentDriver participant FS as MCP Config File participant Driver as agi-driver App->>SDK: new AgentDriver(options) App->>SDK: start(goal, screenshot?, w?, h?, mode?) SDK->>Driver: spawn(binaryPath) Driver-->>SDK: stdout {event:"ready"} SDK->>SDK: build StartCommand (multimodal fields) alt MCP enabled SDK->>FS: read mcpConfig path FS-->>SDK: JSON config SDK->>SDK: map config -> mcp_servers[] end SDK->>Driver: stdin serializeCommand(StartCommand) loop driver emits events Driver-->>SDK: stdout DriverEvent (e.g., action/thinking/audio_transcript) SDK-->>App: emit(event.event) end Driver-->>SDK: stdout {event:"finished"} SDK-->>App: resolve start() with DriverResult