[v2] Make ToolTaskHandler.getTask/getTaskResult optional and actually invoke them#1764

Open
felixweinberger wants to merge 9 commits into main from fweinberger/tooltask-handlers

Conversation

@felixweinberger
Contributor

@felixweinberger felixweinberger commented Mar 25, 2026

Makes getTask and getTaskResult optional on ToolTaskHandler and wires them into the tasks/get and tasks/result request paths. Supersedes #1332.

Motivation and Context

These handlers were defined on the interface but never invoked — three code paths bypassed them and hit TaskStore directly:

  1. TaskManager.handleGetTask → _requireTaskStore.getTask
  2. TaskManager.handleGetTaskPayload → _requireTaskStore.getTask/getTaskResult
  3. McpServer.handleAutomaticTaskPolling → ctx.task.store

The handlers exist to support proxying external job systems (AWS Step Functions, CI/CD pipelines, etc.) where the external system is the source of truth for task state. But every test and example implementation was pure boilerplate delegation to ctx.task.store, which made the handlers look vestigial.

This PR makes them optional:

  • Omitted (the common case) → TaskStore handles everything. Zero boilerplate.
  • Provided → invoked for tasks/get, tasks/result, and automatic polling. Useful for external-system proxies.
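
The two modes above can be sketched as a single lookup. This is a minimal illustration, not the SDK's code: `Task`, `TaskStore`, `TaskCtx`, and `lookupTask` are simplified, hypothetical stand-ins for the real types.

```typescript
// Simplified, hypothetical shapes; the SDK's real types carry more fields.
type TaskStatus = 'working' | 'input_required' | 'completed' | 'failed' | 'cancelled';
interface Task { taskId: string; status: TaskStatus }
interface TaskStore { getTask(taskId: string): Promise<Task | undefined> }
interface TaskCtx { task: { id: string; store: TaskStore } }

interface ToolTaskHandler {
    createTask(args: unknown, ctx: TaskCtx): Promise<Task>;
    // Optional, and called with ctx only (no tool arguments).
    getTask?(ctx: TaskCtx): Promise<Task | undefined>;
}

// Prefer the tool's own handler; otherwise fall back to the store.
async function lookupTask(handler: ToolTaskHandler | undefined, ctx: TaskCtx): Promise<Task | undefined> {
    if (handler?.getTask) {
        return handler.getTask(ctx);
    }
    return ctx.task.store.getTask(ctx.task.id);
}

// A stand-in store that only ever knows the task as 'working'.
const store: TaskStore = {
    getTask: async (taskId: string) => ({ taskId, status: 'working' })
};
const ctx: TaskCtx = { task: { id: 'T1', store } };

// Common case: no getTask, so the store answers.
const plain: ToolTaskHandler = {
    createTask: async () => ({ taskId: 'T1', status: 'working' })
};

// Proxy case: a pretend external system is the source of truth.
const proxy: ToolTaskHandler = {
    createTask: async () => ({ taskId: 'T1', status: 'working' }),
    getTask: async (c: TaskCtx) => ({ taskId: c.task.id, status: 'completed' })
};
```

With these definitions, `lookupTask(plain, ctx)` resolves to the store's view of the task, while `lookupTask(proxy, ctx)` resolves to whatever the handler reports.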

Isolation

All dispatch logic lives in ExperimentalMcpServerTasks (the experimental module that already owns registerToolTask):

  • _taskToTool map, _recordTask(), _installOverrides(), _dispatch() — ~50 lines, all contained there
  • Core gains one internal method: TaskManager.setTaskOverrides() + two lookup checks. No changes to the public TaskManagerOptions type.
  • McpServer gains ~5 lines: construct experimental module eagerly, install overrides, record tasks on creation.
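
As a rough picture of the dispatch logic described above, the sketch below keeps a taskId → tool map and resolves to `undefined` when no override applies, so the caller can fall back to `TaskStore`. The class, method, and type names are hypothetical simplifications (handlers take a bare `taskId` and return strings); the real `_recordTask()` and `_dispatch()` signatures may differ.

```typescript
// Hypothetical, simplified handler shape: real handlers receive a ctx object.
type Method = 'getTask' | 'getTaskResult';
type Handlers = Partial<Record<Method, (taskId: string) => Promise<string>>>;

class TaskDispatcher {
    // taskId -> tool name; in-memory only, so it is lost on restart.
    private taskToTool = new Map<string, string>();
    private tools: Map<string, Handlers>;

    constructor(tools: Map<string, Handlers>) {
        this.tools = tools;
    }

    recordTask(taskId: string, toolName: string): void {
        this.taskToTool.set(taskId, toolName);
    }

    // Resolves to undefined when there is no mapping or no handler,
    // which tells the caller to consult the TaskStore instead.
    async dispatch(taskId: string, method: Method): Promise<string | undefined> {
        const toolName = this.taskToTool.get(taskId);
        if (toolName === undefined) return undefined;
        const handler = this.tools.get(toolName)?.[method];
        if (!handler) return undefined;
        const result = await handler(taskId);
        // Drop the mapping only after getTaskResult has resolved.
        if (method === 'getTaskResult') this.taskToTool.delete(taskId);
        return result;
    }
}
```

Returning `undefined` rather than throwing keeps the fallback decision in the caller, which matches the "two lookup checks" in core described above.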

How Has This Been Tested?

  • All existing task lifecycle tests pass (boilerplate handlers removed — net -180 LOC in tests)
  • New test: optional handlers are invoked when provided for tasks/get and tasks/result
  • New test: TaskStore is consulted directly when handlers are omitted

Breaking Changes

Yes (experimental API):

  • TaskRequestHandler signature changed from (args, ctx) to (ctx) — tool arguments aren't available at tasks/get/tasks/result time
  • getTask and getTaskResult are now optional on ToolTaskHandler

See docs/migration.md.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Known limitation: the taskId → tool mapping is in-memory and doesn't survive restarts or span multiple server instances. In those scenarios, requests fall through to TaskStore. Documented on the ToolTaskHandler interface.

Also adds an isToolTaskHandler type guard replacing inline 'createTask' in handler checks.
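
A type guard of this kind might look like the sketch below. The `ToolCallback` and `ToolTaskHandler` shapes are simplified assumptions, and the real `isToolTaskHandler` may check more than the presence of `createTask`.

```typescript
// Simplified, hypothetical shapes for the two kinds of tool handler.
type ToolCallback = (args: unknown) => Promise<unknown>;
interface ToolTaskHandler {
    createTask: (args: unknown) => Promise<unknown>;
    getTask?: () => Promise<unknown>;
    getTaskResult?: () => Promise<unknown>;
}

// Narrows a handler to ToolTaskHandler so callers avoid repeating
// inline `'createTask' in handler` checks.
function isToolTaskHandler(handler: ToolCallback | ToolTaskHandler): handler is ToolTaskHandler {
    return typeof handler === 'object' && 'createTask' in handler;
}

const plainTool: ToolCallback = async args => args;
const taskTool: ToolTaskHandler = { createTask: async () => ({ taskId: 'T1' }) };
```

Here `isToolTaskHandler(plainTool)` is false and `isToolTaskHandler(taskTool)` is true, and inside a guarded branch TypeScript treats the handler as a `ToolTaskHandler`.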

Closes #1332

@changeset-bot

changeset-bot bot commented Mar 25, 2026

🦋 Changeset detected

Latest commit: 72020e9

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 6 packages
Name Type
@modelcontextprotocol/core Minor
@modelcontextprotocol/server Minor
@modelcontextprotocol/node Major
@modelcontextprotocol/express Major
@modelcontextprotocol/fastify Major
@modelcontextprotocol/hono Major

@pkg-pr-new

pkg-pr-new bot commented Mar 25, 2026

@modelcontextprotocol/client

npm i https://pkg.pr.new/@modelcontextprotocol/client@1764

@modelcontextprotocol/server

npm i https://pkg.pr.new/@modelcontextprotocol/server@1764

@modelcontextprotocol/express

npm i https://pkg.pr.new/@modelcontextprotocol/express@1764

@modelcontextprotocol/fastify

npm i https://pkg.pr.new/@modelcontextprotocol/fastify@1764

@modelcontextprotocol/hono

npm i https://pkg.pr.new/@modelcontextprotocol/hono@1764

@modelcontextprotocol/node

npm i https://pkg.pr.new/@modelcontextprotocol/node@1764

commit: 72020e9

…actually invoke them

These handlers were defined on the interface but never invoked — three
code paths bypassed them and called TaskStore directly:

- TaskManager.handleGetTask
- TaskManager.handleGetTaskPayload
- McpServer.handleAutomaticTaskPolling

The handlers exist to support proxying external job systems (AWS Step
Functions, CI/CD pipelines) where the external system is the source of
truth for task state. But every test/example implementation was pure
boilerplate delegation to the store.

This change makes the handlers optional: when omitted, TaskStore handles
requests (zero boilerplate, previous de-facto behavior). When provided,
they're invoked for tasks/get, tasks/result, and automatic polling.

Dispatch logic is isolated in ExperimentalMcpServerTasks. Core gains a
single setTaskOverrides() method on TaskManager — no changes to public
TaskManagerOptions type. McpServer itself gains ~5 lines.

Also drops the Args parameter from TaskRequestHandler since tool input
arguments aren't available at tasks/get/tasks/result time.

BREAKING CHANGE: TaskRequestHandler signature changed from (args, ctx)
to (ctx). ToolTaskHandler.getTask and getTaskResult are now optional.

Closes #1332

Co-authored-by: Luca Chang <lucalc@amazon.com>
@felixweinberger felixweinberger force-pushed the fweinberger/tooltask-handlers branch from b106062 to 0f7ba7c Compare March 25, 2026 22:14
@felixweinberger felixweinberger changed the title [v2] Remove getTask/getTaskResult from ToolTaskHandler [v2] Make ToolTaskHandler.getTask/getTaskResult optional and actually invoke them Mar 25, 2026
@felixweinberger felixweinberger marked this pull request as ready for review March 25, 2026 22:16
@felixweinberger felixweinberger requested a review from a team as a code owner March 25, 2026 22:16
TypeDoc can't resolve {@linkcode TaskStore} from the server package
since core isn't in the processed package set.
@felixweinberger
Contributor Author

@claude review


@claude claude bot left a comment


This PR introduces an override mechanism into core TaskManager, has breaking changes to the experimental task API, and touches 13 files across core/server/tests — worth a human review for the design decisions involved.

Extended reasoning...

Overview

This PR makes getTask and getTaskResult optional on ToolTaskHandler and actually wires them into the tasks/get and tasks/result request paths, fixing a bug where these handlers were defined but never invoked. The changes span 13 files: core TaskManager gains a new TaskLookupOverrides type and setTaskOverrides() method; ExperimentalMcpServerTasks gains ~50 lines of dispatch logic (_taskToTool map, _recordTask, _installOverrides, _dispatch); McpServer is modified to eagerly construct the experimental module and install overrides at construction time; tests remove ~180 lines of boilerplate handlers; and migration docs are updated.

Security risks

No direct security risks. The override mechanism is internal (@internal annotation) and only invoked by McpServer. The _dispatch method validates that the tool exists and has the expected handler shape before calling it. No auth, crypto, or permissions code is affected.

Level of scrutiny

This warrants careful human review for several reasons:

  1. Core infrastructure change: TaskManager in @modelcontextprotocol/core gains a new internal method that alters the lookup flow for tasks/get and tasks/result — this affects all servers using tasks.
  2. Breaking API changes: TaskRequestHandler signature changes from (args, ctx) to (ctx), and getTask/getTaskResult become optional. Even though the API is experimental, downstream users may be affected.
  3. Design decisions: The override pattern (McpServer installing overrides into core TaskManager at construction time) is an architectural choice that should be validated by a maintainer.
  4. Major version bumps: The changeset triggers Major bumps for the node, express, fastify, and hono packages.

Other factors

The PR has good test coverage with two new tests (handlers invoked when provided, TaskStore consulted when omitted) and all existing tests updated. The in-memory _taskToTool map limitation is documented. The code is well-structured with clear separation of concerns. The net LOC change is negative (removing boilerplate), which is positive. However, the scope and the core changes justify human sign-off.

…nput_required; restore migration-SKILL.md row
Comment on lines +336 to 340
  while (task.status !== 'completed' && task.status !== 'failed' && task.status !== 'cancelled' && task.status !== 'input_required') {
      await new Promise(resolve => setTimeout(resolve, pollInterval));
-     const updatedTask = await ctx.task.store.getTask(taskId);
-     if (!updatedTask) {
-         throw new ProtocolError(ProtocolErrorCode.InternalError, `Task ${taskId} not found during polling`);
-     }
-     task = updatedTask;
+     task = handler?.getTask ? await handler.getTask(taskCtx) : await ctx.task.store.getTask(taskId);
  }

🔴 The polling loop in handleAutomaticTaskPolling (mcp.ts:336–340) uses a bare setTimeout with no abort signal wiring, so when the client disconnects and ctx.mcpReq.signal is aborted, the server continues calling handler.getTask(taskCtx) at every pollInterval until the external task reaches a terminal state. This PR introduces the path where handler.getTask calls external systems (AWS Step Functions, CI/CD APIs, etc.), turning the pre-existing missing check into a real external-API resource leak.

Extended reasoning...

What the bug is and how it manifests

The polling loop in handleAutomaticTaskPolling (mcp.ts:336-340) uses await new Promise(resolve => setTimeout(resolve, pollInterval)) with no abort signal wiring. ctx.mcpReq.signal is available but never consulted anywhere in the loop. Compare with TaskManager._waitForTaskUpdate() (taskManager.ts:857+), which correctly pre-checks signal.aborted and wires an abort listener to clearTimeout/reject. The handleAutomaticTaskPolling path is a separate implementation that bypasses this helper entirely.

The specific code path that triggers it

The polling path is entered when: a tool registers with taskSupport: 'optional' and provides a custom getTask handler (the primary new use case this PR introduces), and the client calls tools/call without a task hint. In that scenario the server holds the HTTP connection open and polls in a loop. If handler.getTask is omitted the loop falls back to the in-memory ctx.task.store.getTask — cheap and negligible. If handler.getTask is provided (the new external-proxy case), each iteration makes a real external API call.

Why existing code doesn't prevent it

TaskManager._waitForTaskUpdate() is the correct abort-aware helper, but handleAutomaticTaskPolling never calls it — the two code paths are completely independent. There are no other abort checks inside the loop; neither the setTimeout nor the handler.getTask call site checks the signal.

What the impact would be

If the client disconnects (HTTP proxy timeout, user cancel, network drop), ctx.mcpReq.signal is aborted but the loop continues running, making external API calls at every pollInterval until the remote system returns a terminal status. For a 2-minute SFN execution with a 5-second poll interval and a 10-second client timeout, the server makes roughly 22 unnecessary DescribeExecution calls after the client is gone, wasting external API quota and holding server resources.

Step-by-step proof

  1. Tool proxy-sfn registered with taskSupport: 'optional' and getTask → AWS SFN DescribeExecution.
  2. Client calls tools/call without a task hint → handleAutomaticTaskPolling entered.
  3. SFN job is expected to run for 2 minutes; pollInterval = 5000.
  4. After 10 seconds, the client's HTTP proxy kills the connection; ctx.mcpReq.signal fires abort.
  5. The loop does not observe the abort event. It continues: await new Promise(resolve => setTimeout(resolve, 5000)) resolves normally, handler.getTask(taskCtx) calls DescribeExecution — no check of signal.aborted.
  6. This repeats ~22 more times over the next ~110 seconds until SFN reaches a terminal state.
  7. Server then calls handler.getTaskResult and attempts to return a result to the dead connection.

How to fix it

Replace the bare setTimeout with a signal-aware delay (matching the pattern in _waitForTaskUpdate), and add ctx.mcpReq.signal.throwIfAborted() after each handler.getTask call:

// signal-aware delay
await new Promise<void>((resolve, reject) => {
  if (ctx.mcpReq.signal.aborted) { reject(new ProtocolError(...)); return; }
  const id = setTimeout(resolve, pollInterval);
  ctx.mcpReq.signal.addEventListener('abort', () => { clearTimeout(id); reject(new ProtocolError(...)); }, { once: true });
});
// check again after the external call
task = handler?.getTask ? await handler.getTask(taskCtx) : await ctx.task.store.getTask(taskId);
ctx.mcpReq.signal.throwIfAborted();
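
For reference, the same idea as a self-contained sketch that runs outside the SDK. `abortableDelay` and `pollUntilTerminal` are hypothetical names, a plain `Error` stands in for `ProtocolError`, and the status check is reduced to a single `'working'` comparison.

```typescript
// Resolves after `ms`, or rejects as soon as `signal` aborts.
function abortableDelay(ms: number, signal: AbortSignal): Promise<void> {
    return new Promise<void>((resolve, reject) => {
        if (signal.aborted) {
            reject(new Error('aborted before delay'));
            return;
        }
        const onAbort = () => {
            clearTimeout(timer);
            reject(new Error('aborted during delay'));
        };
        const timer = setTimeout(() => {
            // Remove the listener so it does not accumulate across iterations.
            signal.removeEventListener('abort', onAbort);
            resolve();
        }, ms);
        signal.addEventListener('abort', onAbort, { once: true });
    });
}

// Polls `getStatus` until it leaves 'working', stopping promptly on abort.
async function pollUntilTerminal(
    getStatus: () => Promise<string>,
    pollIntervalMs: number,
    signal: AbortSignal
): Promise<string> {
    let current = await getStatus();
    while (current === 'working') {
        await abortableDelay(pollIntervalMs, signal);
        current = await getStatus();
        // Re-check after the (possibly slow, external) status call.
        if (signal.aborted) throw new Error('aborted after status check');
    }
    return current;
}
```

Removing the abort listener once the timer fires matters in a polling loop: without it, every iteration would leave one more listener attached to the request's signal.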

Comment on lines +108 to +113
// getTaskResult is terminal — drop the mapping only after the handler resolves
// so a transient throw doesn't orphan the task on retry.
if (method === 'getTaskResult') {
    this._taskToTool.delete(taskId);
}
return result;

🔴 After a successful getTaskResult call for an external-proxy tool, the _taskToTool entry is deleted at mcpServer.ts:111; if the HTTP response is lost before the client receives it, any retry of tasks/result finds no entry, falls through to the local TaskStore (which permanently shows 'working' for external-proxy tools since storeTaskResult is never called locally), and enters an infinite polling loop via _waitForTaskUpdate until the HTTP connection times out. To fix, cache the result in a short-lived map (keyed by taskId) so retries within a grace period can return the same result without re-invoking the handler, or alternatively re-invoke the handler idempotently on retry.

Extended reasoning...

What the bug is and how it manifests

This PR intentionally moves _taskToTool.delete(taskId) to after handler(taskCtx) resolves (mcpServer.ts line 111), correctly protecting against the transient-throw retry case. However, there is a second retry scenario that is not addressed: the handler succeeds, the entry is deleted, the result is assembled and returned to the HTTP layer — but the TCP connection drops (or a proxy times out) before the client receives the response. The client retries tasks/result. At that point _taskToTool has no entry for taskId, so _dispatch returns undefined for both getTask and getTaskResult. Both fall through to _requireTaskStore.

The specific code path that triggers it

  1. handleGetTaskPayload calls handleGetTask(taskId, ctx) (taskManager.ts:432).
  2. handleGetTask calls _overrides.getTask(taskId, ctx) → _dispatch('getTask') → no _taskToTool entry → returns undefined → falls through to _requireTaskStore.getTask(taskId, sessionId).
  3. The local TaskStore returns a Task with status = 'working'. For external-proxy tools, storeTaskResult is never called locally — the external system (e.g. AWS Step Functions) is the authoritative source of truth. The SDK never updates the local store for these tasks.
  4. isTerminal('working') is false → _waitForTaskUpdate(task.pollInterval, signal) is called.
  5. After waiting, handleTaskResult() is called recursively. Steps 2-4 repeat. The local store still says 'working'; it will never say anything else.
  6. The loop continues at pollInterval-ms intervals until ctx.mcpReq.signal fires (HTTP connection closes or the client gives up).

Why existing code does not prevent it

The comment on lines 108–109 of mcpServer.ts acknowledges the transient-throw retry case: "getTaskResult is terminal — drop the mapping only after the handler resolves so a transient throw doesn't orphan the task on retry." This reasoning is correct for handler throws. But once the handler succeeds and the entry is deleted, the SDK has no mechanism to distinguish a first-time call from a retry-after-lost-response. The local TaskStore for external-proxy tools permanently holds the initial 'working' status because storeTaskResult is only called by the tool's createTask background work or via the store's update methods — neither of which an external-proxy tool ever calls.

What the impact is

For the primary new use case introduced by this PR, proxying external job systems (AWS Step Functions, CI/CD pipelines), any dropped HTTP connection after a completed tasks/result call causes the server to spin at pollInterval ms for the full HTTP timeout duration (potentially minutes). Each retry from the client starts a new polling loop on the server. Under real network conditions (load balancers, proxies with short idle timeouts, mobile clients), this is not a theoretical edge case.

How to fix it

Option A (result cache with TTL): After _dispatch('getTaskResult') succeeds, store the result in a short-lived Map<taskId, {result, expiry}>. On retry, if _taskToTool has no entry but the result cache does, return the cached result immediately. Clear the cache entry after the TTL or after onClose().

Option B (idempotent re-invoke): Restore the _taskToTool entry on network errors. This is difficult because the entry deletion and the HTTP response write happen in different layers.

Option C (sentinel in TaskStore): After getTaskResult succeeds, write a sentinel 'completed' status to the local TaskStore before deleting the _taskToTool entry. Retries would then find the task terminal, fetch the cached local result, and return immediately. This requires external-proxy tools to also write a minimal result to the local store.
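
Option A could look roughly like the sketch below. `ResultCache` and `getResultOnce` are hypothetical helpers, not part of the SDK, and the clock is injected only so that expiry can be exercised without real timers.

```typescript
// Hypothetical helper illustrating Option A; not part of the SDK.
class ResultCache<T> {
    private entries = new Map<string, { result: T; expiresAt: number }>();
    private ttlMs: number;
    private now: () => number;

    constructor(ttlMs: number, now: () => number = () => Date.now()) {
        this.ttlMs = ttlMs;
        this.now = now;
    }

    set(taskId: string, result: T): void {
        this.entries.set(taskId, { result, expiresAt: this.now() + this.ttlMs });
    }

    // Returns the cached result, or undefined if absent or past its TTL.
    get(taskId: string): T | undefined {
        const entry = this.entries.get(taskId);
        if (!entry) return undefined;
        if (this.now() >= entry.expiresAt) {
            this.entries.delete(taskId);
            return undefined;
        }
        return entry.result;
    }

    // Intended to be called when the server closes.
    clear(): void {
        this.entries.clear();
    }
}

// Serves a retry from the cache instead of re-invoking the handler.
async function getResultOnce<T>(
    cache: ResultCache<T>,
    taskId: string,
    fetchResult: () => Promise<T>
): Promise<T> {
    const cached = cache.get(taskId);
    if (cached !== undefined) return cached;
    const result = await fetchResult();
    cache.set(taskId, result);
    return result;
}
```

A retry that arrives within the TTL gets the same result without a second external call; once the entry expires, the lookup behaves as if the cache were empty, so the TTL bounds how long the lost-response window stays covered.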

Step-by-step proof

  1. Tool proxy-sfn is registered with custom getTask + getTaskResult handlers (proxying AWS Step Functions). _taskToTool has entry 'T1' → 'proxy-sfn'. Local TaskStore has task T1 with status='working'.
  2. Client calls tasks/result for T1. handleGetTask → _dispatch('getTask') → finds entry → calls SFN DescribeExecution → returns {status:'completed'}.
  3. isTerminal('completed') === true → handleGetTaskPayload calls _overrides.getTaskResult('T1', ctx).
  4. _dispatch('getTaskResult'): calls SFN GetExecutionHistory → returns result → _taskToTool.delete('T1') → returns result.
  5. Result is wrapped and returned to HTTP layer. TCP connection drops before client receives it.
  6. Client retries tasks/result for T1.
  7. handleGetTask → _dispatch('getTask') → no entry → falls through to _requireTaskStore.getTask('T1') → returns {status:'working'}.
  8. isTerminal('working') === false → _waitForTaskUpdate(pollInterval, signal) → handleTaskResult() recurses.
  9. Goto step 7. Loop runs at pollInterval ms until the HTTP connection times out.

Comment on lines 503 to +530
### `ToolTaskHandler.getTask` and `getTaskResult` are now optional (experimental)

`getTask` and `getTaskResult` are now optional on `ToolTaskHandler`. When omitted, `tasks/get` and `tasks/result` are served directly from the configured `TaskStore`. Their signature has also changed — they no longer receive the tool's input arguments (which aren't available at `tasks/get`/`tasks/result` time).

If your handlers just delegated to the store, delete them:

**Before:**

```typescript
server.experimental.tasks.registerToolTask('long-task', config, {
    createTask: async (args, ctx) => { /* ... */ },
    getTask: async (args, ctx) => ctx.task.store.getTask(ctx.task.id),
    getTaskResult: async (args, ctx) => ctx.task.store.getTaskResult(ctx.task.id)
});
```

**After:**

```typescript
server.experimental.tasks.registerToolTask('long-task', config, {
    createTask: async (args, ctx) => { /* ... */ }
});
```

Keep them if you're proxying an external job system (AWS Step Functions, CI/CD pipelines, etc.) — the new signature takes only `ctx`:

🟡 docs/migration-SKILL.md is missing the ToolTaskHandler breaking-change entry required by CLAUDE.md. Users or LLM tools relying on migration-SKILL.md for mechanical migration will not be guided to remove boilerplate getTask/getTaskResult handlers or update the TaskRequestHandler signature from (args, ctx) to (ctx), which will cause compile errors.

Extended reasoning...

CLAUDE.md requirement (lines 27-30): 'When making breaking changes, document them in both: docs/migration.md — human-readable guide with before/after code examples AND docs/migration-SKILL.md — LLM-optimized mapping tables for mechanical migration.' This is an explicit, unambiguous policy.

What the PR changed: A new section was added to docs/migration.md (lines 503–530) documenting two breaking changes to the experimental ToolTaskHandler API: (1) getTask and getTaskResult are now optional, and (2) the TaskRequestHandler signature changed from (args, ctx) to (ctx). This section correctly provides before/after code examples.

What was omitted: docs/migration-SKILL.md has zero mentions of ToolTaskHandler, getTask, getTaskResult, or TaskRequestHandler. The file already contains a section 12 ('Experimental: TaskCreationParams.ttl no longer accepts null') demonstrating that experimental-API breaking changes are expected there. The omission is inconsistent and violates the documented process.

Why this matters: docs/migration.md explicitly promotes migration-SKILL.md as the LLM-optimized guide for mechanical migration. Users or tools (like Claude Code) that load migration-SKILL.md and search for getTask/getTaskResult guidance will find nothing. They will retain boilerplate handlers with the old (args, ctx) signature that now causes a TypeScript compile error, because the TaskRequestHandler type was changed to (ctx: TaskServerContext) => ... with no args parameter.

Step-by-step proof of impact:

  1. Developer has an existing implementation with getTask: async (args, ctx) => ctx.task.store.getTask(ctx.task.id) and getTaskResult: async (args, ctx) => ctx.task.store.getTaskResult(ctx.task.id).
  2. Developer runs an LLM-assisted migration using migration-SKILL.md as context.
  3. The LLM finds no entry for getTask, getTaskResult, ToolTaskHandler, or TaskRequestHandler in migration-SKILL.md.
  4. The LLM does not remove the boilerplate handlers or update the signature.
  5. TypeScript compiler rejects the old (args, ctx) signature — compile error.

Suggested fix: Add a new row or section to migration-SKILL.md (parallel to the existing section 12) with a concise mapping table covering: (a) delete getTask/getTaskResult if they delegate to ctx.task.store, and (b) drop the args parameter if keeping custom handlers for external-system proxying.
