Skip to content

DurableAgent: Support LanguageModelV3ToolResultOutput for multimodal tool results (images, files) #848

@Dinesh563

Description

@Dinesh563

Summary

Currently, DurableAgent.executeTool always wraps tool results as a text type by JSON stringifying them, even when the tool returns a properly typed LanguageModelV3ToolResultOutput with multimodal content (e.g., images, files).

This prevents tools from returning images or files that the LLM can "see" via vision capabilities.

Current Behavior

In durable-agent.js, the executeTool function unconditionally stringifies tool results:

const toolResult = await execute(parsedInput, { toolCallId, messages, experimental_context });
return {
    type: 'tool-result',
    toolCallId: toolCall.toolCallId,
    toolName: toolCall.toolName,
    output: {
        type: 'text',
        value: JSON.stringify(toolResult) ?? '',  // <-- Always stringified!
    },
};

This means if a tool returns:

return {
  type: 'content',
  value: [
    { type: 'text', text: 'Here is the image' },
    { type: 'file-data', data: base64ImageData, mediaType: 'image/jpeg' },
  ],
};

return { type: 'content', value: [ { type: 'text', text: 'Here is the image' }, { type: 'file-data', data: base64ImageData, mediaType: 'image/jpeg' }, ],};
It gets stringified to a text value, and the LLM never "sees" the image.

Expected Behavior

executeTool should detect if the tool result is already a valid LanguageModelV3ToolResultOutput and pass it through without modification.
Proposed Solution

const toolResult = await execute(parsedInput, { toolCallId, messages, experimental_context });

// Check if tool result is already a LanguageModelV3ToolResultOutput
if (isToolResultOutput(toolResult)) {
  return {
    type: 'tool-result',
    toolCallId: toolCall.toolCallId,
    toolName: toolCall.toolName,
    output: toolResult,  // Pass through as-is
  };
}

// Otherwise, wrap as text (current behavior)
return {
    type: 'tool-result',
    ...
};

function isToolResultOutput(result: unknown): result is LanguageModelV3ToolResultOutput {
  if (typeof result !== 'object' || result === null) return false;
  const r = result as { type?: string };
  return ['text', 'json', 'content', 'error-text', 'error-json', 'execution-denied'].includes(r.type ?? '');
}

Use Case
Building AI agents that can:
Generate images via tools (e.g., image generation APIs)
Fetch and display images to the LLM for vision analysis
Return file attachments (PDFs, etc.) in tool results

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions