[v2] Make ToolTaskHandler.getTask/getTaskResult optional and actually invoke them #1764

felixweinberger wants to merge 9 commits into main from fweinberger/tooltask-handlers
Conversation
🦋 Changeset detected

Latest commit: 72020e9

The changes in this PR will be included in the next version bump. This PR includes changesets to release 6 packages:

- @modelcontextprotocol/client
- @modelcontextprotocol/server
- @modelcontextprotocol/express
- @modelcontextprotocol/fastify
- @modelcontextprotocol/hono
- @modelcontextprotocol/node
…actually invoke them

These handlers were defined on the interface but never invoked — three code paths bypassed them and called TaskStore directly:

- TaskManager.handleGetTask
- TaskManager.handleGetTaskPayload
- McpServer.handleAutomaticTaskPolling

The handlers exist to support proxying external job systems (AWS Step Functions, CI/CD pipelines) where the external system is the source of truth for task state. But every test/example implementation was pure boilerplate delegation to the store.

This change makes the handlers optional: when omitted, TaskStore handles requests (zero boilerplate, the previous de-facto behavior). When provided, they're invoked for tasks/get, tasks/result, and automatic polling.

Dispatch logic is isolated in ExperimentalMcpServerTasks. Core gains a single setTaskOverrides() method on TaskManager — no changes to the public TaskManagerOptions type. McpServer itself gains ~5 lines.

Also drops the Args parameter from TaskRequestHandler since tool input arguments aren't available at tasks/get/tasks/result time.

BREAKING CHANGE: TaskRequestHandler signature changed from (args, ctx) to (ctx). ToolTaskHandler.getTask and getTaskResult are now optional.

Closes #1332

Co-authored-by: Luca Chang <lucalc@amazon.com>
Force-pushed from b106062 to 0f7ba7c
TypeDoc can't resolve `{@linkcode TaskStore}` from the server package since core isn't in the processed package set.
@claude review
This PR introduces an override mechanism into core TaskManager, has breaking changes to the experimental task API, and touches 13 files across core/server/tests — worth a human review for the design decisions involved.
Extended reasoning...
Overview
This PR makes getTask and getTaskResult optional on ToolTaskHandler and actually wires them into the tasks/get and tasks/result request paths, fixing a bug where these handlers were defined but never invoked. The changes span 13 files: core TaskManager gains a new TaskLookupOverrides type and setTaskOverrides() method; ExperimentalMcpServerTasks gains ~50 lines of dispatch logic (_taskToTool map, _recordTask, _installOverrides, _dispatch); McpServer is modified to eagerly construct the experimental module and install overrides at construction time; tests remove ~180 lines of boilerplate handlers; and migration docs are updated.
Security risks
No direct security risks. The override mechanism is internal (@internal annotation) and only invoked by McpServer. The _dispatch method validates that the tool exists and has the expected handler shape before calling it. No auth, crypto, or permissions code is affected.
Level of scrutiny
This warrants careful human review for several reasons:
- Core infrastructure change: `TaskManager` in `@modelcontextprotocol/core` gains a new internal method that alters the lookup flow for `tasks/get` and `tasks/result` — this affects all servers using tasks.
- Breaking API changes: the `TaskRequestHandler` signature changes from `(args, ctx)` to `(ctx)`, and `getTask`/`getTaskResult` become optional. Even though the API is experimental, downstream users may be affected.
- Design decisions: the override pattern (McpServer installing overrides into core TaskManager at construction time) is an architectural choice that should be validated by a maintainer.
- Major version bumps: the changeset triggers major bumps for the `node`, `express`, and `hono` packages.
Other factors
The PR has good test coverage with two new tests (handlers invoked when provided, TaskStore consulted when omitted) and all existing tests updated. The in-memory _taskToTool map limitation is documented. The code is well-structured with clear separation of concerns. The net LOC change is negative (removing boilerplate), which is positive. However, the scope and the core changes justify human sign-off.
…handlers

# Conflicts:
#	docs/migration-SKILL.md
…nto fweinberger/tooltask-handlers
…nput_required; restore migration-SKILL.md row
```diff
  while (task.status !== 'completed' && task.status !== 'failed' && task.status !== 'cancelled' && task.status !== 'input_required') {
      await new Promise(resolve => setTimeout(resolve, pollInterval));
-     const updatedTask = await ctx.task.store.getTask(taskId);
-     if (!updatedTask) {
-         throw new ProtocolError(ProtocolErrorCode.InternalError, `Task ${taskId} not found during polling`);
-     }
-     task = updatedTask;
+     task = handler?.getTask ? await handler.getTask(taskCtx) : await ctx.task.store.getTask(taskId);
  }
```
🔴 The polling loop in handleAutomaticTaskPolling (mcp.ts:336–340) uses a bare setTimeout with no abort signal wiring, so when the client disconnects and ctx.mcpReq.signal is aborted, the server continues calling handler.getTask(taskCtx) at every pollInterval until the external task reaches a terminal state. This PR introduces the path where handler.getTask calls external systems (AWS Step Functions, CI/CD APIs, etc.), turning the pre-existing missing check into a real external-API resource leak.
Extended reasoning...
What the bug is and how it manifests
The polling loop in handleAutomaticTaskPolling (mcp.ts:336-340) uses await new Promise(resolve => setTimeout(resolve, pollInterval)) with no abort signal wiring. ctx.mcpReq.signal is available but never consulted anywhere in the loop. Compare with TaskManager._waitForTaskUpdate() (taskManager.ts:857+), which correctly pre-checks signal.aborted and wires an abort listener to clearTimeout/reject. The handleAutomaticTaskPolling path is a separate implementation that bypasses this helper entirely.
The specific code path that triggers it
The polling path is entered when: a tool registers with taskSupport: 'optional' and provides a custom getTask handler (the primary new use case this PR introduces), and the client calls tools/call without a task hint. In that scenario the server holds the HTTP connection open and polls in a loop. If handler.getTask is omitted the loop falls back to the in-memory ctx.task.store.getTask — cheap and negligible. If handler.getTask is provided (the new external-proxy case), each iteration makes a real external API call.
Why existing code doesn't prevent it
TaskManager._waitForTaskUpdate() is the correct abort-aware helper, but handleAutomaticTaskPolling never calls it — the two code paths are completely independent. There are no other abort checks inside the loop; neither the setTimeout nor the handler.getTask call site checks the signal.
What the impact would be
If the client disconnects (HTTP proxy timeout, user cancel, network drop), ctx.mcpReq.signal is aborted but the loop continues running, making external API calls at every pollInterval until the remote system returns a terminal status. For a 2-minute SFN execution with a 5-second poll interval and a 10-second client timeout, the server makes roughly 22 unnecessary DescribeExecution calls after the client is gone, wasting external API quota and holding server resources.
Step-by-step proof
1. Tool `proxy-sfn` registered with `taskSupport: 'optional'` and `getTask` → AWS SFN `DescribeExecution`.
2. Client calls `tools/call` without a task hint → `handleAutomaticTaskPolling` entered.
3. SFN job is expected to run for 2 minutes; `pollInterval = 5000`.
4. After 10 seconds, the client's HTTP proxy kills the connection; `ctx.mcpReq.signal` fires `abort`.
5. The loop does not observe the abort event. It continues: `await new Promise(resolve => setTimeout(resolve, 5000))` resolves normally, and `handler.getTask(taskCtx)` calls `DescribeExecution` — no check of `signal.aborted`.
6. This repeats ~22 more times over the next ~110 seconds until SFN reaches a terminal state.
7. The server then calls `handler.getTaskResult` and attempts to return a result to the dead connection.
How to fix it
Replace the bare setTimeout with a signal-aware delay (matching the pattern in _waitForTaskUpdate), and add ctx.mcpReq.signal.throwIfAborted() after each handler.getTask call:
```typescript
// signal-aware delay
await new Promise<void>((resolve, reject) => {
    if (ctx.mcpReq.signal.aborted) { reject(new ProtocolError(...)); return; }
    const id = setTimeout(resolve, pollInterval);
    ctx.mcpReq.signal.addEventListener('abort', () => { clearTimeout(id); reject(new ProtocolError(...)); }, { once: true });
});
// check again after the external call
task = handler?.getTask ? await handler.getTask(taskCtx) : await ctx.task.store.getTask(taskId);
ctx.mcpReq.signal.throwIfAborted();
```

The second hunk under review (mcpServer.ts):

```typescript
// getTaskResult is terminal — drop the mapping only after the handler resolves
// so a transient throw doesn't orphan the task on retry.
if (method === 'getTaskResult') {
    this._taskToTool.delete(taskId);
}
return result;
```
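As a self-contained illustration of the abort-aware delay pattern the fix relies on — `abortableDelay` is a hypothetical helper name, not SDK API — a minimal sketch using only standard `AbortSignal`/`setTimeout` semantics:

```typescript
// Abort-aware delay: resolves after `ms`, rejects as soon as the signal
// aborts, and clears the pending timer so nothing keeps running.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
    return new Promise<void>((resolve, reject) => {
        // Pre-check: the signal may already be aborted before we start waiting.
        if (signal.aborted) {
            reject(new Error('aborted'));
            return;
        }
        const id = setTimeout(() => {
            signal.removeEventListener('abort', onAbort);
            resolve();
        }, ms);
        const onAbort = () => {
            clearTimeout(id); // stop the pending timer
            reject(new Error('aborted'));
        };
        signal.addEventListener('abort', onAbort, { once: true });
    });
}

// Usage: an abort after 50 ms cancels a 1-second delay almost immediately,
// instead of letting the loop run to completion.
async function demo(): Promise<string> {
    const controller = new AbortController();
    setTimeout(() => controller.abort(), 50);
    try {
        await abortableDelay(1000, controller.signal);
        return 'completed';
    } catch {
        return 'aborted';
    }
}

demo().then(result => console.log(result)); // → aborted
```

Node's built-in `setTimeout` from `timers/promises` accepts a `signal` option and achieves the same effect without a hand-rolled helper.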
🔴 After a successful getTaskResult call for an external-proxy tool, the _taskToTool entry is deleted at mcpServer.ts:111; if the HTTP response is lost before the client receives it, any retry of tasks/result finds no entry, falls through to the local TaskStore (which permanently shows 'working' for external-proxy tools since storeTaskResult is never called locally), and enters an infinite polling loop via _waitForTaskUpdate until the HTTP connection times out. To fix, cache the result in a short-lived map (keyed by taskId) so retries within a grace period can return the same result without re-invoking the handler, or alternatively re-invoke the handler idempotently on retry.
Extended reasoning...
What the bug is and how it manifests
This PR intentionally moves _taskToTool.delete(taskId) to after handler(taskCtx) resolves (mcpServer.ts line 111), correctly protecting against the transient-throw retry case. However, there is a second retry scenario that is not addressed: the handler succeeds, the entry is deleted, the result is assembled and returned to the HTTP layer — but the TCP connection drops (or a proxy times out) before the client receives the response. The client retries tasks/result. At that point _taskToTool has no entry for taskId, so _dispatch returns undefined for both getTask and getTaskResult. Both fall through to _requireTaskStore.
The specific code path that triggers it
1. `handleGetTaskPayload` calls `handleGetTask(taskId, ctx)` (taskManager.ts:432).
2. `handleGetTask` calls `_overrides.getTask(taskId, ctx)` → `_dispatch('getTask')` → no `_taskToTool` entry → returns `undefined` → falls through to `_requireTaskStore.getTask(taskId, sessionId)`.
3. The local `TaskStore` returns a `Task` with `status = 'working'`. For external-proxy tools, `storeTaskResult` is never called locally — the external system (e.g. AWS Step Functions) is the authoritative source of truth. The SDK never updates the local store for these tasks.
4. `isTerminal('working')` is `false` → `_waitForTaskUpdate(task.pollInterval, signal)` is called.
5. After waiting, `handleTaskResult()` is called recursively. Steps 2–4 repeat. The local store still says `'working'` — it will never say anything else.
6. The loop continues at `pollInterval`-ms intervals until `ctx.mcpReq.signal` fires (HTTP connection closes or the client gives up).
Why existing code does not prevent it
The comment on lines 108–109 of mcpServer.ts acknowledges the transient-throw retry case: "getTaskResult is terminal — drop the mapping only after the handler resolves so a transient throw doesn't orphan the task on retry." This reasoning is correct for handler throws. But once the handler succeeds and the entry is deleted, the SDK has no mechanism to distinguish a first-time call from a retry-after-lost-response. The local TaskStore for external-proxy tools permanently holds the initial 'working' status because storeTaskResult is only called by the tool's createTask background work or via the store's update methods — neither of which an external-proxy tool ever calls.
What the impact is
For the primary new use case introduced by this PR — proxying external job systems (AWS Step Functions, CI/CD pipelines) — any dropped HTTP connection after a completed tasks/result call causes the server to spin at pollInterval ms for the full HTTP timeout duration (potentially minutes). Each retry from the client starts a new polling loop. Under real network conditions (load balancers, proxies with short idle timeouts, mobile clients), this is not a theoretical edge case.
How to fix it
Option A (result cache with TTL): After _dispatch('getTaskResult') succeeds, store the result in a short-lived Map<taskId, {result, expiry}>. On retry, if _taskToTool has no entry but the result cache does, return the cached result immediately. Clear the cache entry after the TTL or after onClose().
Option B (idempotent re-invoke): Restore the _taskToTool entry on network errors. This is difficult because the entry deletion and the HTTP response write happen in different layers.
Option C (sentinel in TaskStore): After getTaskResult succeeds, write a sentinel 'completed' status to the local TaskStore before deleting the _taskToTool entry. Retries would then find the task terminal, fetch the cached local result, and return immediately. This requires external-proxy tools to also write a minimal result to the local store.
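Option A can be sketched as a small TTL cache. Everything here — the `ResultCache` name, its methods, the 30-second grace period — is illustrative, not SDK API:

```typescript
// Minimal sketch of Option A: a short-lived result cache keyed by taskId,
// consulted when the _taskToTool entry is already gone so a retry after a
// lost HTTP response still gets the same result.
class ResultCache<T> {
    private entries = new Map<string, { result: T; expiry: number }>();

    constructor(private ttlMs: number) {}

    set(taskId: string, result: T): void {
        this.entries.set(taskId, { result, expiry: Date.now() + this.ttlMs });
    }

    /** Returns the cached result if present and unexpired, else undefined. */
    get(taskId: string): T | undefined {
        const entry = this.entries.get(taskId);
        if (!entry) return undefined;
        if (Date.now() > entry.expiry) {
            this.entries.delete(taskId); // lazily evict expired entries
            return undefined;
        }
        return entry.result;
    }

    /** e.g. called from onClose() to drop everything at session teardown. */
    clear(): void {
        this.entries.clear();
    }
}

// Usage in the dispatch path (names hypothetical): cache after a successful
// getTaskResult, check the cache before falling through to the TaskStore.
const resultCache = new ResultCache<string>(30_000); // 30 s grace period
resultCache.set('T1', 'final-result');
console.log(resultCache.get('T1')); // → final-result
```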
Step-by-step proof
1. Tool `proxy-sfn` is registered with custom `getTask` + `getTaskResult` handlers (proxying AWS Step Functions). `_taskToTool` has entry `'T1' → 'proxy-sfn'`. Local `TaskStore` has task `T1` with `status='working'`.
2. Client calls `tasks/result` for `T1`. `handleGetTask` → `_dispatch('getTask')` → finds entry → calls SFN `DescribeExecution` → returns `{status:'completed'}`.
3. `isTerminal('completed') === true` → `handleGetTaskPayload` calls `_overrides.getTaskResult('T1', ctx)`.
4. `_dispatch('getTaskResult')`: calls SFN `GetExecutionHistory` → returns result → `_taskToTool.delete('T1')` → returns result.
5. Result is wrapped and returned to the HTTP layer. TCP connection drops before the client receives it.
6. Client retries `tasks/result` for `T1`.
7. `handleGetTask` → `_dispatch('getTask')` → no entry → falls through to `_requireTaskStore.getTask('T1')` → returns `{status:'working'}`.
8. `isTerminal('working') === false` → `_waitForTaskUpdate(pollInterval, signal)` → `handleTaskResult()` recurses.
9. Goto step 7. The loop runs at `pollInterval` ms until the HTTP connection times out.
### `ToolTaskHandler.getTask` and `getTaskResult` are now optional (experimental)

`getTask` and `getTaskResult` are now optional on `ToolTaskHandler`. When omitted, `tasks/get` and `tasks/result` are served directly from the configured `TaskStore`. Their signature has also changed — they no longer receive the tool's input arguments (which aren't available at `tasks/get`/`tasks/result` time).

If your handlers just delegated to the store, delete them:

**Before:**

```typescript
server.experimental.tasks.registerToolTask('long-task', config, {
  createTask: async (args, ctx) => { /* ... */ },
  getTask: async (args, ctx) => ctx.task.store.getTask(ctx.task.id),
  getTaskResult: async (args, ctx) => ctx.task.store.getTaskResult(ctx.task.id)
});
```

**After:**

```typescript
server.experimental.tasks.registerToolTask('long-task', config, {
  createTask: async (args, ctx) => { /* ... */ }
});
```

Keep them if you're proxying an external job system (AWS Step Functions, CI/CD pipelines, etc.) — the new signature takes only `ctx`:
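The example promised by the excerpt is cut off above. A self-contained sketch of what the ctx-only signature looks like — the `TaskServerContext` fields, the `ToolTaskHandler` shape, and the SFN stubs below are reconstructed from this PR's description, not the SDK's exact definitions:

```typescript
// Hypothetical reconstruction of an external-proxy ToolTaskHandler using the
// new ctx-only signature. The types and stub functions are assumptions.
type TaskStatus = 'working' | 'completed' | 'failed';

interface TaskServerContext {
    task: { id: string };
}

interface ToolTaskHandler {
    createTask: (args: unknown, ctx: TaskServerContext) => Promise<void>;
    getTask?: (ctx: TaskServerContext) => Promise<{ status: TaskStatus }>;
    getTaskResult?: (ctx: TaskServerContext) => Promise<string>;
}

// Stubs standing in for real AWS Step Functions API calls.
async function describeExecution(_taskId: string): Promise<TaskStatus> {
    return 'completed';
}
async function getExecutionOutput(taskId: string): Promise<string> {
    return `output-for-${taskId}`;
}

// External-proxy handlers: the external system is the source of truth, so
// getTask/getTaskResult consult it instead of the local TaskStore. Note that
// neither handler receives the tool's input args anymore — only ctx.
const handler: ToolTaskHandler = {
    createTask: async (_args, _ctx) => { /* start the external job */ },
    getTask: async (ctx) => ({ status: await describeExecution(ctx.task.id) }),
    getTaskResult: async (ctx) => getExecutionOutput(ctx.task.id)
};
```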
🟡 docs/migration-SKILL.md is missing the ToolTaskHandler breaking-change entry required by CLAUDE.md. Users or LLM tools relying on migration-SKILL.md for mechanical migration will not be guided to remove boilerplate getTask/getTaskResult handlers or update the TaskRequestHandler signature from (args, ctx) to (ctx), which will cause compile errors.
Extended reasoning...
CLAUDE.md requirement (lines 27-30): 'When making breaking changes, document them in both: docs/migration.md — human-readable guide with before/after code examples AND docs/migration-SKILL.md — LLM-optimized mapping tables for mechanical migration.' This is an explicit, unambiguous policy.
What the PR changed: A new section was added to docs/migration.md (lines 503–530) documenting two breaking changes to the experimental ToolTaskHandler API: (1) getTask and getTaskResult are now optional, and (2) the TaskRequestHandler signature changed from (args, ctx) to (ctx). This section correctly provides before/after code examples.
What was omitted: docs/migration-SKILL.md has zero mentions of ToolTaskHandler, getTask, getTaskResult, or TaskRequestHandler. The file already contains a section 12 ('Experimental: TaskCreationParams.ttl no longer accepts null') demonstrating that experimental-API breaking changes are expected there. The omission is inconsistent and violates the documented process.
Why this matters: docs/migration.md explicitly promotes migration-SKILL.md as the LLM-optimized guide for mechanical migration. Users or tools (like Claude Code) that load migration-SKILL.md and search for getTask/getTaskResult guidance will find nothing. They will retain boilerplate handlers with the old (args, ctx) signature that now causes a TypeScript compile error, because the TaskRequestHandler type was changed to (ctx: TaskServerContext) => ... with no args parameter.
Step-by-step proof of impact:
- Developer has an existing implementation with getTask: async (args, ctx) => ctx.task.store.getTask(ctx.task.id) and getTaskResult: async (args, ctx) => ctx.task.store.getTaskResult(ctx.task.id).
- Developer runs an LLM-assisted migration using migration-SKILL.md as context.
- The LLM finds no entry for getTask, getTaskResult, ToolTaskHandler, or TaskRequestHandler in migration-SKILL.md.
- The LLM does not remove the boilerplate handlers or update the signature.
- TypeScript compiler rejects the old (args, ctx) signature — compile error.
Suggested fix: Add a new row or section to migration-SKILL.md (parallel to the existing section 12) with a concise mapping table covering: (a) delete getTask/getTaskResult if they delegate to ctx.task.store, and (b) drop the args parameter if keeping custom handlers for external-system proxying.
Makes `getTask` and `getTaskResult` optional on `ToolTaskHandler` and wires them into the `tasks/get` and `tasks/result` request paths. Supersedes #1332.

Motivation and Context

These handlers were defined on the interface but never invoked — three code paths bypassed them and hit `TaskStore` directly:

- `TaskManager.handleGetTask` → `_requireTaskStore.getTask`
- `TaskManager.handleGetTaskPayload` → `_requireTaskStore.getTask`/`getTaskResult`
- `McpServer.handleAutomaticTaskPolling` → `ctx.task.store`

The handlers exist to support proxying external job systems (AWS Step Functions, CI/CD pipelines, etc.) where the external system is the source of truth for task state. But every test and example implementation was pure boilerplate delegation to `ctx.task.store`, which made the handlers look vestigial.

This PR makes them optional:

- When omitted, `TaskStore` handles everything. Zero boilerplate.
- When provided, they're invoked for `tasks/get`, `tasks/result`, and automatic polling. Useful for external-system proxies.

Isolation

All dispatch logic lives in `ExperimentalMcpServerTasks` (the experimental module that already owns `registerToolTask`):

- `_taskToTool` map, `_recordTask()`, `_installOverrides()`, `_dispatch()` — ~50 lines, all contained there
- Core gains `TaskManager.setTaskOverrides()` + two lookup checks. No changes to the public `TaskManagerOptions` type.

How Has This Been Tested?

- New tests verify custom handlers are invoked for `tasks/get` and `tasks/result`
- New tests verify `TaskStore` is consulted directly when handlers are omitted

Breaking Changes

Yes (experimental API):

- `TaskRequestHandler` signature changed from `(args, ctx)` to `(ctx)` — tool arguments aren't available at `tasks/get`/`tasks/result` time
- `getTask` and `getTaskResult` are now optional on `ToolTaskHandler`

See `docs/migration.md`.

Types of changes

Checklist

Additional context

Known limitation: the taskId → tool mapping is in-memory and doesn't survive restarts or span multiple server instances. In those scenarios, requests fall through to `TaskStore`. Documented on the `ToolTaskHandler` interface.

Also adds an `isToolTaskHandler` type guard replacing inline `'createTask' in handler` checks.

Closes #1332
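The type guard mentioned in the PR description might look roughly like this — the `ToolTaskHandler` shape is inferred from this PR and is an assumption, not the SDK's exact definition:

```typescript
// Hypothetical reconstruction of the isToolTaskHandler type guard.
interface TaskServerContext {
    task: { id: string };
}

interface ToolTaskHandler {
    createTask: (args: unknown, ctx: TaskServerContext) => Promise<unknown>;
    getTask?: (ctx: TaskServerContext) => Promise<unknown>;
    getTaskResult?: (ctx: TaskServerContext) => Promise<unknown>;
}

// Narrow an unknown handler object: a ToolTaskHandler is identified by a
// createTask function, since getTask/getTaskResult are now optional.
function isToolTaskHandler(handler: unknown): handler is ToolTaskHandler {
    return (
        typeof handler === 'object' &&
        handler !== null &&
        'createTask' in handler &&
        typeof (handler as ToolTaskHandler).createTask === 'function'
    );
}

// Usage: replaces inline `'createTask' in handler` checks at call sites.
console.log(isToolTaskHandler({ createTask: async () => ({}) })); // → true
console.log(isToolTaskHandler({ execute: () => undefined }));     // → false
```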