Summary
Currently, DurableAgent.executeTool always wraps tool results as a text type by JSON stringifying them, even when the tool returns a properly typed LanguageModelV3ToolResultOutput with multimodal content (e.g., images, files).
This prevents tools from returning images or files that the LLM can "see" via vision capabilities.
Current Behavior
In durable-agent.js, the executeTool function unconditionally stringifies tool results:
const toolResult = await execute(parsedInput, { toolCallId, messages, experimental_context });
return {
type: 'tool-result',
toolCallId: toolCall.toolCallId,
toolName: toolCall.toolName,
output: {
type: 'text',
value: JSON.stringify(toolResult) ?? '', // <-- Always stringified!
},
};
This means if a tool returns:
return {
type: 'content',
value: [
{ type: 'text', text: 'Here is the image' },
{ type: 'file-data', data: base64ImageData, mediaType: 'image/jpeg' },
],
};
return { type: 'content', value: [ { type: 'text', text: 'Here is the image' }, { type: 'file-data', data: base64ImageData, mediaType: 'image/jpeg' }, ],};
It gets stringified to a text value, and the LLM never "sees" the image.
Expected Behavior
executeTool should detect if the tool result is already a valid LanguageModelV3ToolResultOutput and pass it through without modification.
Proposed Solution
const toolResult = await execute(parsedInput, { toolCallId, messages, experimental_context });
// Check if tool result is already a LanguageModelV3ToolResultOutput
if (isToolResultOutput(toolResult)) {
return {
type: 'tool-result',
toolCallId: toolCall.toolCallId,
toolName: toolCall.toolName,
output: toolResult, // Pass through as-is
};
}
// Otherwise, wrap as text (current behavior)
return {
type: 'tool-result',
...
};
function isToolResultOutput(result: unknown): result is LanguageModelV3ToolResultOutput {
if (typeof result !== 'object' || result === null) return false;
const r = result as { type?: string };
return ['text', 'json', 'content', 'error-text', 'error-json', 'execution-denied'].includes(r.type ?? '');
}
Use Case
Building AI agents that can:
Generate images via tools (e.g., image generation APIs)
Fetch and display images to the LLM for vision analysis
Return file attachments (PDFs, etc.) in tool results
Summary
Currently,
DurableAgent.executeToolalways wraps tool results as a text type by JSON stringifying them, even when the tool returns a properly typedLanguageModelV3ToolResultOutputwith multimodal content (e.g., images, files).This prevents tools from returning images or files that the LLM can "see" via vision capabilities.
Current Behavior
In
durable-agent.js, theexecuteToolfunction unconditionally stringifies tool results:This means if a tool returns:
return { type: 'content', value: [ { type: 'text', text: 'Here is the image' }, { type: 'file-data', data: base64ImageData, mediaType: 'image/jpeg' }, ],};It gets stringified to a text value, and the LLM never "sees" the image.
Expected Behavior
Use Case
Building AI agents that can:
Generate images via tools (e.g., image generation APIs)
Fetch and display images to the LLM for vision analysis
Return file attachments (PDFs, etc.) in tool results