feat: add comprehensive multimodal driver support #11

Merged
JacobFV merged 6 commits into main from jacob/multimodal-driver-support
Feb 10, 2026
Conversation

@JacobFV
Contributor

@JacobFV JacobFV commented Feb 10, 2026

Multimodal Driver Support - Node SDK Updates

This update adds comprehensive multimodal support to the Node SDK to match the new agi-driver capabilities.

Changes Made

Protocol Updates (src/driver/protocol.ts)

New Event Types

  • AudioTranscriptEvent: Audio transcript from buffer
  • VideoFrameEvent: Video frame from camera/screen
  • SpeechStartedEvent: TTS playback started
  • SpeechFinishedEvent: TTS playback finished
  • TurnDetectedEvent: Voice turn detection

New Command Types

  • GetAudioTranscriptCommand: Request audio transcript
  • GetVideoFrameCommand: Request video frame

New Interfaces

  • MCPServerConfig: MCP server configuration
  • AgentIdentity: Agent identity information
  • ToolChoice: Tool choice configuration type

Updated StartCommand

Added fields for multimodal configuration:

  • agent_identity?: AgentIdentity - Agent identity (default: agi-2-claude by AGI Company)
  • tool_choice?: ToolChoice - Tool choice mode
  • mcp_servers?: MCPServerConfig[] - MCP server configurations
  • audio_input_enabled?: boolean, audio_buffer_seconds?: number
  • turn_detection_enabled?: boolean, turn_detection_silence_ms?: number
  • speech_output_enabled?: boolean, speech_voice?: string
  • camera_enabled?: boolean, camera_buffer_seconds?: number
  • screen_recording_enabled?: boolean, screen_recording_buffer_seconds?: number
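
As a rough sketch of how these fields might be typed: the field names and primitive types below come from the list above, while the shapes of `AgentIdentity`, `ToolChoice`, and `MCPServerConfig` are assumptions inferred from the usage examples in this description, not copied from `src/driver/protocol.ts`. The interface name `MultimodalStartFields` is hypothetical.

```typescript
// Assumed helper shapes (not copied from src/driver/protocol.ts).
interface AgentIdentity {
  name: string; // hypothetical field
  creator?: string; // hypothetical field
}

type ToolChoice = 'auto' | 'required' | 'none' | { type: 'tool'; name: string };

interface MCPServerConfig {
  name: string;
  command: string;
  args: string[];
  env?: Record<string, string>;
}

// Multimodal additions to StartCommand as listed in this PR (all optional).
interface MultimodalStartFields {
  agent_identity?: AgentIdentity;
  tool_choice?: ToolChoice;
  mcp_servers?: MCPServerConfig[];
  audio_input_enabled?: boolean;
  audio_buffer_seconds?: number;
  turn_detection_enabled?: boolean;
  turn_detection_silence_ms?: number;
  speech_output_enabled?: boolean;
  speech_voice?: string;
  camera_enabled?: boolean;
  camera_buffer_seconds?: number;
  screen_recording_enabled?: boolean;
  screen_recording_buffer_seconds?: number;
}

// Example value: a voice-only configuration.
const voiceOnlyFields: MultimodalStartFields = {
  audio_input_enabled: true,
  turn_detection_enabled: true,
  turn_detection_silence_ms: 1000,
  speech_output_enabled: true,
  speech_voice: 'alloy',
};
```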

Exports (src/driver/index.ts)

Added exports for all new event and command types, plus helper interfaces.

Usage Examples

Basic Multimodal Session

import { AgentDriver } from '@agi-inc/agi-node';

const driver = new AgentDriver({
  mode: 'local',
  agent_name: 'agi-2-claude'
});

// Start with multimodal features
await driver.start({
  goal: 'Help me with my computer',
  mode: 'local',
  agent_name: 'agi-2-claude',

  // Voice features
  audio_input_enabled: true,
  turn_detection_enabled: true,
  speech_output_enabled: true,
  speech_voice: 'alloy',

  // Video features
  camera_enabled: true,
  screen_recording_enabled: true,

  // MCP servers
  mcp_servers: [
    {
      name: 'filesystem',
      command: 'npx',
      args: ['-y', '@modelcontextprotocol/server-filesystem', '/path/to/dir'],
      env: {}
    }
  ],

  // Tool choice
  tool_choice: 'auto'
});

Handling New Events

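import type {
  AudioTranscriptEvent,
  VideoFrameEvent,
  SpeechStartedEvent,
  TurnDetectedEvent
} from '@agi-inc/agi-node';

driver.on('audio_transcript', (event: AudioTranscriptEvent) => {
  console.log(`Transcript: ${event.transcript}`);
});

driver.on('video_frame', (event: VideoFrameEvent) => {
  // event.frame_base64 contains a JPEG frame; saveFrame is an application-defined helper
  saveFrame(event.frame_base64);
});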

driver.on('speech_started', (event: SpeechStartedEvent) => {
  console.log(`🔊 Speaking: ${event.text}`);
});

driver.on('speech_finished', () => {
  console.log('✓ Finished speaking');
});

driver.on('turn_detected', (event: TurnDetectedEvent) => {
  console.log(`You said: ${event.transcript}`);
});

Voice-Only Mode

await driver.start({
  goal: '(voice input)',
  mode: 'local',
  audio_input_enabled: true,
  turn_detection_enabled: true,
  turn_detection_silence_ms: 1000,  // 1 second of silence = turn complete
  speech_output_enabled: true,
  speech_voice: 'alloy'  // or: echo, fable, onyx, nova, shimmer
});

MCP Servers

import type { MCPServerConfig } from '@agi-inc/agi-node';

const mcpServers: MCPServerConfig[] = [
  {
    name: 'filesystem',
    command: 'npx',
    args: ['-y', '@modelcontextprotocol/server-filesystem', '/Users/you/Documents']
  },
  {
    name: 'database',
    command: 'python',
    args: ['-m', 'my_db_server'],
    env: { DATABASE_URL: 'postgresql://...' }
  }
];

await driver.start({
  goal: 'Analyze my documents',
  mode: 'local',
  mcp_servers: mcpServers
});

Tool Choice Configuration

// Auto (default)
tool_choice: 'auto'

// Required - must use at least one tool
tool_choice: 'required'

// None - no tool use
tool_choice: 'none'

// Specific tool
tool_choice: { type: 'tool', name: 'filesystem__read_file' }
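
The four forms above could be modeled as a single union type. The following is a minimal sketch under that assumption; the actual `ToolChoice` definition in `src/driver/protocol.ts` may differ, and `describeToolChoice` is a hypothetical helper added only to show how the union narrows.

```typescript
// Assumed union covering the four forms shown above.
type ToolChoiceMode = 'auto' | 'required' | 'none';
type ToolChoice = ToolChoiceMode | { type: 'tool'; name: string };

// Descriptions mirror the comments in the example above.
const MODE_DESCRIPTIONS: Record<ToolChoiceMode, string> = {
  auto: 'auto (default)',
  required: 'must use at least one tool',
  none: 'no tool use',
};

// Hypothetical helper: returns a human-readable summary of a tool choice.
function describeToolChoice(choice: ToolChoice): string {
  if (typeof choice === 'object') {
    // Object form selects one specific tool by name.
    return `specific tool: ${choice.name}`;
  }
  return MODE_DESCRIPTIONS[choice];
}
```

For example, `describeToolChoice({ type: 'tool', name: 'filesystem__read_file' })` yields a string naming that tool, while the three string modes map to fixed descriptions.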

Breaking Changes

⚠️ This is a breaking change with no backwards compatibility.

  • StartCommand interface has many new optional fields
  • New event types may be emitted
  • agent_name should be set to "agi-2-claude" for new agents

Testing

# Install updated SDK
npm install

# Build TypeScript
npm run build

# Run tests
npm test

# Try a voice session
node -e "
const { AgentDriver } = require('./dist');

(async () => {
  const driver = new AgentDriver({ mode: 'local' });
  const result = await driver.start({
    goal: 'Test voice',
    mode: 'local',
    audio_input_enabled: true,
    speech_output_enabled: true
  });
  console.log(result);
})();
"

Related PRs

Greptile Overview

Greptile Summary

This PR extends the driver protocol to support multimodal capabilities (audio transcript, video frames, speech/turn events) and adds corresponding command types and config interfaces. The AgentDriver is updated to forward new event types and to optionally enable voice/camera/screen/MCP features by populating the new StartCommand fields.

Key merge blockers are around MCP config handling and documentation accuracy: the default ~ path expansion currently resolves to /.agi/... (root) rather than the user’s home, the MCP loader is typed as any[] and can silently disable MCP on malformed configs, and the new MULTIMODAL_UPDATES.md examples don’t match the actual exported JS/TS API (agentName + positional start(...)).
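
To illustrate the `~` expansion issue: stripping the tilde (rather than substituting the home directory) turns `~/.agi/mcp.json` into `/.agi/mcp.json`. The sketch below shows one way to expand it correctly. `expandHome` is a hypothetical helper, not the SDK's code, and the home directory is passed in explicitly to keep the sketch dependency-free; in Node it would typically come from `os.homedir()`. Only POSIX-style `~/` prefixes are handled.

```typescript
// Hypothetical helper: expand a leading "~" to the given home directory.
function expandHome(p: string, homeDir: string): string {
  if (p === '~') {
    return homeDir;
  }
  if (p.slice(0, 2) === '~/') {
    // Drop trailing slashes on homeDir so the result has exactly one separator.
    return homeDir.replace(/\/+$/, '') + p.slice(1);
  }
  // Paths without a leading "~" are returned unchanged.
  return p;
}

const expandedDefault = expandHome('~/.agi/mcp.json', '/home/you');
```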

Confidence Score: 3/5

  • This PR adds useful protocol support but has a couple of correctness issues that will break expected MCP behavior and user-facing docs.
  • Multimodal protocol additions look consistent, but the MCP config default path expansion resolves to filesystem root and the MCP config loader is typed as any[] and swallows parse/shape errors, making MCP silently not work in common cases. Additionally, documentation examples currently do not match the SDK API and will fail for users.
  • src/driver/driver.ts, MULTIMODAL_UPDATES.md
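
One way to avoid silently disabling MCP on a malformed config is to validate the parsed JSON and throw a descriptive error instead of returning `any[]`. The sketch below assumes the config file contains a JSON array of entries with `name`, `command`, `args`, and optional `env`; the real file layout is not shown in this PR, and `parseMcpConfig` is a hypothetical name.

```typescript
interface MCPServerConfig {
  name: string;
  command: string;
  args: string[];
  env?: Record<string, string>;
}

// Hypothetical validator: throws on malformed input instead of swallowing errors.
function parseMcpConfig(json: string): MCPServerConfig[] {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch (err) {
    throw new Error(`MCP config is not valid JSON: ${(err as Error).message}`);
  }
  if (!Array.isArray(raw)) {
    throw new Error('MCP config must be a JSON array of server entries');
  }
  return raw.map((entry: unknown, i: number): MCPServerConfig => {
    if (typeof entry !== 'object' || entry === null) {
      throw new Error(`MCP server #${i} must be an object`);
    }
    const e = entry as Record<string, unknown>;
    const name = e['name'];
    const command = e['command'];
    const rawArgs: unknown = e['args'] ?? [];
    if (typeof name !== 'string' || typeof command !== 'string') {
      throw new Error(`MCP server #${i} needs string "name" and "command"`);
    }
    if (!Array.isArray(rawArgs) || !rawArgs.every((a) => typeof a === 'string')) {
      throw new Error(`MCP server "${name}": "args" must be an array of strings`);
    }
    const server: MCPServerConfig = { name, command, args: rawArgs as string[] };
    const rawEnv = e['env'];
    if (rawEnv !== undefined) {
      if (typeof rawEnv !== 'object' || rawEnv === null || Array.isArray(rawEnv)) {
        throw new Error(`MCP server "${name}": "env" must be an object`);
      }
      const env: Record<string, string> = {};
      for (const key of Object.keys(rawEnv)) {
        const value = (rawEnv as Record<string, unknown>)[key];
        if (typeof value !== 'string') {
          throw new Error(`MCP server "${name}": env.${key} must be a string`);
        }
        env[key] = value;
      }
      server.env = env;
    }
    return server;
  });
}
```

A caller can then decide whether to surface the error or continue without MCP, rather than having that choice made implicitly.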

Important Files Changed

| Filename | Overview |
| --- | --- |
| MULTIMODAL_UPDATES.md | Adds multimodal usage doc, but examples use non-existent agent_name option and object-based start signature that doesn't match AgentDriver.start. |
| package.json | Bumps package version to 0.5.0; otherwise unchanged. |
| src/driver/driver.ts | Adds multimodal toggles and MCP config loading and wires into StartCommand; current MCP ~ expansion is incorrect and config loader is typed as any[] and swallows JSON errors. |
| src/driver/index.ts | Exports newly added protocol types and configs; no issues found. |
| src/driver/protocol.ts | Extends protocol with multimodal events/commands and helper types; parseEvent/serializeCommand remain exported. |
| package-lock.json | Lockfile updates for version bump / dependency metadata; no functional code changes. |

Sequence Diagram

sequenceDiagram
  autonumber
  participant App as User App
  participant SDK as AgentDriver
  participant FS as MCP Config File
  participant Driver as agi-driver

  App->>SDK: new AgentDriver(options)
  App->>SDK: start(goal, screenshot?, w?, h?, mode?)
  SDK->>Driver: spawn(binaryPath)
  Driver-->>SDK: stdout {event:"ready"}
  SDK->>SDK: build StartCommand (multimodal fields)
  alt MCP enabled
    SDK->>FS: read mcpConfig path
    FS-->>SDK: JSON config
    SDK->>SDK: map config -> mcp_servers[]
  end
  SDK->>Driver: stdin serializeCommand(StartCommand)
  loop driver emits events
    Driver-->>SDK: stdout DriverEvent (e.g., action/thinking/audio_transcript)
    SDK-->>App: emit(event.event)
  end
  Driver-->>SDK: stdout {event:"finished"}
  SDK-->>App: resolve start() with DriverResult
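
The stdin/stdout exchange in the diagram can be sketched as line-delimited JSON. This is an assumption: the diagram only shows JSON objects with an `event` field on stdout, so the one-object-per-line framing, the `command` discriminator, and the function names `parseEventLine` and `serializeCommandLine` are all hypothetical stand-ins for the exported `parseEvent` and `serializeCommand`.

```typescript
// Minimal event/command shapes; the real protocol types carry more fields.
interface DriverEvent {
  event: string;
  [key: string]: unknown;
}

interface DriverCommand {
  command: string; // hypothetical discriminator name
  [key: string]: unknown;
}

// Parse one stdout line into an event; blank lines yield null.
function parseEventLine(line: string): DriverEvent | null {
  const trimmed = line.trim();
  if (trimmed === '') {
    return null;
  }
  const value: unknown = JSON.parse(trimmed);
  if (typeof value !== 'object' || value === null) {
    throw new Error('driver event must be a JSON object');
  }
  const eventName = (value as Record<string, unknown>)['event'];
  if (typeof eventName !== 'string') {
    throw new Error('driver event is missing a string "event" field');
  }
  return value as DriverEvent;
}

// Serialize a command as a single line terminated by a newline.
function serializeCommandLine(command: DriverCommand): string {
  return JSON.stringify(command) + '\n';
}
```

Note that `JSON.stringify` omits properties whose value is `undefined`, so optional fields left unset simply do not appear on the wire.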

Add support for new agi-driver multimodal features including audio, video, MCP servers, and tool choice configuration.

## Protocol Changes

### New Events
- AudioTranscriptEvent: Audio transcript from buffer
- VideoFrameEvent: Video frame from camera/screen
- SpeechStartedEvent/SpeechFinishedEvent: TTS playback
- TurnDetectedEvent: Voice turn detection

### New Commands
- GetAudioTranscriptCommand: Request audio transcript
- GetVideoFrameCommand: Request video frame

### Updated StartCommand
- agent_identity: Agent identity (default: agi-2-claude)
- tool_choice: Tool choice configuration
- mcp_servers: MCP server configurations
- audio_input_enabled, audio_buffer_seconds
- turn_detection_enabled, turn_detection_silence_ms
- speech_output_enabled, speech_voice
- camera_enabled, camera_buffer_seconds
- screen_recording_enabled, screen_recording_buffer_seconds

### New Interfaces
- MCPServerConfig: MCP server configuration
- AgentIdentity: Agent identity information
- ToolChoice: Tool choice type

## Breaking Changes
This is a breaking change with no backwards compatibility. StartCommand has many new optional fields.

## Related PRs
- agi-api (driver): https://github.com/agi-inc/agents/pull/344
- agi-python: https://github.com/agi-inc/agi-python/pull/8

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@greptile-apps greptile-apps Bot left a comment

3 files reviewed, 3 comments

Comment thread src/driver/protocol.ts
Comment thread src/driver/protocol.ts
@greptile-apps
greptile-apps Bot commented Feb 10, 2026

Additional Comments (1)

src/driver/protocol.ts
StartCommand now required

StartCommand now requires mode and agent_name (protocol.ts:228-231), but src/driver/driver.ts still constructs StartCommand with mode: mode ?? this.mode where this.mode can be undefined and with agent_name: this.agentName || undefined (driver.ts:247-251). With the new types this becomes a type error, and at runtime it can emit invalid JSON to the driver. Either keep these fields optional in the protocol, or update DriverOptions/constructor defaults so mode/agent_name are always set before sending start.
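
The second option (resolving defaults in the constructor so `mode` and `agent_name` are always set) might look like the sketch below. The `'remote'` mode value and the `'local'` default are assumptions, since only `'local'` appears in this PR; the `'agi-2-claude'` default follows the PR description. `resolveStartDefaults` and `buildStartCommand` are hypothetical names.

```typescript
type DriverMode = 'local' | 'remote'; // 'remote' is an assumed second member

interface DriverOptions {
  mode?: DriverMode;
  agentName?: string;
}

interface ResolvedDefaults {
  mode: DriverMode;
  agentName: string;
}

// Resolve optional constructor options into always-defined values.
function resolveStartDefaults(options: DriverOptions = {}): ResolvedDefaults {
  return {
    mode: options.mode ?? 'local', // assumed default
    agentName: options.agentName || 'agi-2-claude',
  };
}

interface StartCommand {
  command: 'start'; // hypothetical discriminator
  goal: string;
  mode: DriverMode;
  agent_name: string;
}

// Build a StartCommand whose required fields can never be undefined.
function buildStartCommand(
  goal: string,
  defaults: ResolvedDefaults,
  modeOverride?: DriverMode,
): StartCommand {
  return {
    command: 'start',
    goal,
    mode: modeOverride ?? defaults.mode,
    agent_name: defaults.agentName,
  };
}
```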

JacobFV added a commit to agi-inc/agi-csharp that referenced this pull request Feb 10, 2026
Add documentation and reference implementation for new agi-driver multimodal features including audio, video, MCP servers, and tool choice configuration.

## Documentation Provided

### MULTIMODAL_UPDATES.md
Complete guide for implementing multimodal features in C# SDK

### Protocol_Multimodal.cs
Reference implementation of new protocol types:
- New event classes (AudioTranscriptEvent, VideoFrameEvent, etc.)
- New command classes (GetAudioTranscriptCommand, GetVideoFrameCommand)
- Helper classes (MCPServerConfig, AgentIdentity, ToolChoice)
- StartCommand extensions for multimodal features

## Changes Needed

### Protocol.cs
- Add new event types to DriverEventType enum
- Add new command types to DriverCommandType enum
- Add new event/command classes
- Add multimodal fields to StartCommand
- Update AgentName default to "agi-2-claude"

### Driver.cs
- Update event parsing for new event types
- Add convenience methods for multimodal features

## Breaking Changes
This is a breaking change with no backwards compatibility. StartCommand has many new fields (all with sensible defaults).

## Related PRs
- agi-api (driver): https://github.com/agi-inc/agents/pull/344
- agi-python: https://github.com/agi-inc/agi-python/pull/8
- agi-node: agi-inc/agi-node#11

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
JacobFV added a commit to agi-inc/agi-cli that referenced this pull request Feb 10, 2026
Add comprehensive multimodal features to AGI CLI:

## New CLI Options
- --voice: Enable voice input/output (requires OPENAI_API_KEY)
- --camera: Enable camera video feed
- --screen: Enable screen recording
- --mcp: Load MCP servers from config
- --mcp-config: Custom MCP config path (default: ~/.agi/mcp.json)

## Features
- Voice input with automatic turn detection
- Text-to-speech output
- Camera and screen video buffers
- MCP server integration for extended tools
- All features work together seamlessly

## Usage Examples
agi --voice "What's the time?"
agi --voice --screen "What's on my screen?"
agi --voice --camera --screen --mcp "Help me with my work"

## Related PRs
- agi-api (driver): https://github.com/agi-inc/agents/pull/344
- agi-python: https://github.com/agi-inc/agi-python/pull/8
- agi-node: agi-inc/agi-node#11

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…fix type issues

- Added parseEvent() and serializeCommand() functions required by driver
- Made agent_name, reason, message, question_id optional in commands
- Fixed mode type to be literal union instead of string
- Build now succeeds without type errors

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@greptile-apps greptile-apps Bot left a comment

5 files reviewed, 2 comments

Comment thread MULTIMODAL_UPDATES.md
@greptile-apps
greptile-apps Bot commented Feb 10, 2026

Additional Comments (1)

src/driver/driver.ts
Docs/examples don’t match API

AgentDriver.start is still a positional-args API (start(goal, screenshot?, width?, height?, mode?)) in src/driver/driver.ts:166-173, but MULTIMODAL_UPDATES.md documents calling start({...}) with a StartCommand-shaped object. As-is, those examples will throw at runtime (object passed as goal string) and the multimodal options can’t be provided via the public API. Either update the docs to match the actual signature, or (if intended) change AgentDriver.start to accept an options object and forward the new multimodal fields.
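
If the second option were chosen, `start` could accept either form by normalizing its arguments first. The sketch below is hypothetical: the positional parameter names come from the signature quoted above, but their types (for example, `screenshot` as a base64 string) and the helper `normalizeStartArgs` are assumptions.

```typescript
interface StartOptions {
  goal: string;
  screenshot?: string; // assumed to be a base64-encoded image
  width?: number;
  height?: number;
  mode?: string;
}

// Accept either start(goal, screenshot?, width?, height?, mode?) or start({ ... }).
function normalizeStartArgs(
  goalOrOptions: string | StartOptions,
  screenshot?: string,
  width?: number,
  height?: number,
  mode?: string,
): StartOptions {
  if (typeof goalOrOptions === 'string') {
    // Positional form: map arguments onto the options shape.
    return { goal: goalOrOptions, screenshot, width, height, mode };
  }
  if (typeof goalOrOptions.goal !== 'string') {
    // Guards untyped JavaScript callers that omit the goal.
    throw new TypeError('start(options) requires a string "goal"');
  }
  // Object form: copy so later mutation does not affect the caller's object.
  return { ...goalOrOptions };
}
```

Normalizing up front keeps existing positional callers working while letting the documented `start({...})` examples run.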

- Add multimodal options to DriverOptions interface
- Store voice, camera, screen, mcp, mcpConfig in AgentDriver
- Implement loadMcpConfig() to read and parse MCP config files
- Pass multimodal options to StartCommand:
  - audio_input_enabled, turn_detection_enabled, speech_output_enabled
  - camera_enabled, screen_recording_enabled
  - mcp_servers loaded from config file
- Full implementation with no TODOs or shortcuts
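
A minimal sketch of the mapping this commit describes, from the high-level `DriverOptions` toggles to `StartCommand` fields. The function name `toStartFields` and the choice to send explicit `false` values (rather than omitting disabled fields) are assumptions; the MCP server list is passed in already loaded.

```typescript
interface MCPServerConfig {
  name: string;
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface MultimodalOptions {
  voice?: boolean;
  camera?: boolean;
  screen?: boolean;
  mcp?: boolean;
}

interface MultimodalStartFields {
  audio_input_enabled: boolean;
  turn_detection_enabled: boolean;
  speech_output_enabled: boolean;
  camera_enabled: boolean;
  screen_recording_enabled: boolean;
  mcp_servers?: MCPServerConfig[];
}

// Map the driver toggles onto StartCommand fields.
function toStartFields(
  options: MultimodalOptions,
  loadedServers: MCPServerConfig[] = [],
): MultimodalStartFields {
  const voice = options.voice ?? false;
  const fields: MultimodalStartFields = {
    // A single voice toggle drives all three audio-related flags.
    audio_input_enabled: voice,
    turn_detection_enabled: voice,
    speech_output_enabled: voice,
    camera_enabled: options.camera ?? false,
    screen_recording_enabled: options.screen ?? false,
  };
  if (options.mcp) {
    // Only attach servers when MCP is explicitly enabled.
    fields.mcp_servers = loadedServers;
  }
  return fields;
}
```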

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@greptile-apps greptile-apps Bot left a comment

5 files reviewed, 2 comments

Comment thread src/driver/driver.ts
Comment thread src/driver/driver.ts
JacobFV and others added 3 commits February 10, 2026 09:29
Add missing switch cases in handleLine() for audio_transcript,
video_frame, speech_started, speech_finished, and turn_detected
events so they are properly emitted to listeners.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
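
The kind of dispatch this commit adds can be sketched as follows. `MiniDriver` is a hypothetical stand-in, not the real `AgentDriver`: it uses a small listener table instead of Node's `EventEmitter` to stay self-contained, and it assumes each line is one JSON object with an `event` field.

```typescript
interface DriverEvent {
  event: string;
  [key: string]: unknown;
}

type Listener = (event: DriverEvent) => void;

class MiniDriver {
  private readonly listeners: Record<string, Listener[]> = {};

  on(name: string, listener: Listener): void {
    this.listeners[name] = (this.listeners[name] ?? []).concat(listener);
  }

  private emit(name: string, event: DriverEvent): void {
    (this.listeners[name] ?? []).forEach((listener) => listener(event));
  }

  // Parse one line from the driver and forward multimodal events to listeners.
  handleLine(line: string): void {
    const event = JSON.parse(line) as DriverEvent;
    switch (event.event) {
      case 'audio_transcript':
      case 'video_frame':
      case 'speech_started':
      case 'speech_finished':
      case 'turn_detected':
        this.emit(event.event, event);
        break;
      default:
        // Other events (ready, action, thinking, finished, ...) are out of
        // scope for this sketch and are ignored here.
        break;
    }
  }
}
```

Without a matching `case`, an event falls through to `default` and no listener fires, which is the symptom the commit message describes.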
@greptile-apps greptile-apps Bot left a comment

6 files reviewed, 3 comments

Comment thread src/driver/driver.ts
Comment thread src/driver/driver.ts
Comment thread MULTIMODAL_UPDATES.md
Contributor

@NamanGarg20 NamanGarg20 left a comment

lgtm

@JacobFV JacobFV merged commit 0b11341 into main Feb 10, 2026
10 of 12 checks passed
JacobFV added a commit to agi-inc/agi-csharp that referenced this pull request Feb 10, 2026
* fix(ci): add extra-files config to bump Agi.csproj version

release-please with release-type "simple" only bumps the manifest
and CHANGELOG. The publish workflow reads the version from Agi.csproj,
so we need extra-files to keep it in sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add comprehensive multimodal driver support documentation

Add documentation and reference implementation for new agi-driver multimodal features including audio, video, MCP servers, and tool choice configuration.

## Documentation Provided

### MULTIMODAL_UPDATES.md
Complete guide for implementing multimodal features in C# SDK

### Protocol_Multimodal.cs
Reference implementation of new protocol types:
- New event classes (AudioTranscriptEvent, VideoFrameEvent, etc.)
- New command classes (GetAudioTranscriptCommand, GetVideoFrameCommand)
- Helper classes (MCPServerConfig, AgentIdentity, ToolChoice)
- StartCommand extensions for multimodal features

## Changes Needed

### Protocol.cs
- Add new event types to DriverEventType enum
- Add new command types to DriverCommandType enum
- Add new event/command classes
- Add multimodal fields to StartCommand
- Update AgentName default to "agi-2-claude"

### Driver.cs
- Update event parsing for new event types
- Add convenience methods for multimodal features

## Breaking Changes
This is a breaking change with no backwards compatibility. StartCommand has many new fields (all with sensible defaults).

## Related PRs
- agi-api (driver): https://github.com/agi-inc/agents/pull/344
- agi-python: https://github.com/agi-inc/agi-python/pull/8
- agi-node: agi-inc/agi-node#11

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(driver): wire multimodal options from DriverOptions to StartCommand

- Add Voice, Camera, Screen, Mcp, McpConfig properties to DriverOptions
- Store multimodal options in AgentDriver constructor
- Add multimodal fields to StartCommand (audio, speech, camera, screen, MCP)
- Pass options from DriverOptions to StartCommand in StartAsync():
  Voice → AudioInputEnabled, TurnDetectionEnabled, SpeechOutputEnabled
  Camera → CameraEnabled
  Screen → ScreenRecordingEnabled
  Mcp → McpServers (loaded from config file)
- Implement LoadMcpConfig() for reading MCP server configurations
- Add multimodal event types to DriverEventType enum
- Add multimodal command types to DriverCommandType enum
- Add multimodal event parsing to DriverProtocol.ParseEvent()
- Remove duplicate enum declarations from Protocol_Multimodal.cs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.5.0 for multimodal release

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
JacobFV added a commit to agi-inc/agi-cli that referenced this pull request Feb 10, 2026
* feat: add multimodal support (voice, camera, screen, MCP)

Add comprehensive multimodal features to AGI CLI:

## New CLI Options
- --voice: Enable voice input/output (requires OPENAI_API_KEY)
- --camera: Enable camera video feed
- --screen: Enable screen recording
- --mcp: Load MCP servers from config
- --mcp-config: Custom MCP config path (default: ~/.agi/mcp.json)

## Features
- Voice input with automatic turn detection
- Text-to-speech output
- Camera and screen video buffers
- MCP server integration for extended tools
- All features work together seamlessly

## Usage Examples
agi --voice "What's the time?"
agi --voice --screen "What's on my screen?"
agi --voice --camera --screen --mcp "Help me with my work"

## Related PRs
- agi-api (driver): https://github.com/agi-inc/agents/pull/344
- agi-python: https://github.com/agi-inc/agi-python/pull/8
- agi-node: agi-inc/agi-node#11

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(cli): add --api-url option for custom API endpoint

Allows users to specify a custom AGI API endpoint URL:
- Added apiUrl to CliArgs interface
- Added --api-url CLI option
- Pass apiUrl to useAgent hook

Usage: agi --api-url http://localhost:8000 "your goal"

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(cli): wire multimodal options through to driver

- Update App.tsx to pass voice, camera, screen, mcp, mcpConfig to useAgent
- Update UseAgentOptions interface to accept multimodal options
- Pass all multimodal options to AgentDriver constructor
- Complete end-to-end wiring: CLI args → App → useAgent → AgentDriver → API

Now the --voice, --camera, --screen, --mcp flags are fully functional!

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix(hooks): add missing multimodal deps to useCallback array

Add voice, camera, screen, mcp, mcpConfig to the start callback
dependency array so React captures the correct values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.6.0 for multimodal release

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(cli): remove unused imports and bump to 0.5.15

Remove unused mkdirSync, join, and color variable that caused
ESLint failures in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(deps): require @agi_inc/agi-js ^0.5.0 for multimodal support

CI will pass once agi-node 0.5.0 is published to npm.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

2 participants