Skip to content

Fix agent-tool recovery finalization#1476

Merged
whoiskatrin merged 3 commits into
mainfrom
fix-agent-tool-reconcile-finalize
May 8, 2026
Merged

Fix agent-tool recovery finalization#1476
whoiskatrin merged 3 commits into
mainfrom
fix-agent-tool-reconcile-finalize

Conversation

@whoiskatrin
Copy link
Copy Markdown
Contributor

@whoiskatrin whoiskatrin commented May 7, 2026

Summary

Fixes #1475.

This PR makes recovered agent-tool runs go through the same terminal finalization path as live runAgentTool() executions.

Previously, _reconcileAgentToolRuns() updated cf_agent_tool_runs when a parent Agent woke up and found a child run had reached a terminal state, but it skipped two observable parts of the framework lifecycle:

  • onAgentToolFinish(...)
  • terminal agent-tool-event broadcasts via _broadcastAgentToolTerminal(...)

That meant application state mirrors, dashboards, structured logs, and chat UI state could silently drift from the framework’s durable run registry after parent DO eviction / hibernation.

What changed

Shared terminal finalization path

Added a private _finishAgentToolRun(...) helper that centralizes the terminal lifecycle invariant:

  1. update cf_agent_tool_runs
  2. broadcast terminal agent-tool event when a sequence is available
  3. invoke onAgentToolFinish(...)

Live runAgentTool() terminal paths now use this helper too, so recovery and live execution share the same behavior.

Reconciliation now restores observable lifecycle

_reconcileAgentToolRuns() now:

  • selects the full stored agent-tool row, not only run_id / agent_type
  • reconstructs AgentToolRunInfo from durable state
  • resolves the child by stored agent_type
  • inspects the child’s terminal state
  • best-effort replays stored child chunks before terminal broadcast
  • finalizes the run through _finishAgentToolRun(...)

Stored chunk replay helper

Added _broadcastAgentToolStoredChunks(...) to keep chunk replay behavior local and shared between:

  • recovery reconciliation
  • replay to connecting clients
  • interrupted replay path

This keeps sequencing behavior consistent: terminal events are emitted after any replayed stored chunks.

Why this architecture

The bug came from terminal finalization being spread across call sites. runAgentTool() knew to update durable state, broadcast, and call the lifecycle hook, while _reconcileAgentToolRuns() only updated durable state.

This PR deepens the internal finalization module: callers provide a run and terminal result, and the implementation owns the lifecycle ordering. That improves locality and makes it less likely future recovery paths drift from live execution.

Tests

Added regression coverage for parent recovery reconciliation using AIChatAgentToolParent / AIChatAgentToolChild.

The test simulates:

  1. a parent registry row stuck in running
  2. a child agent-tool run that has completed durably
  3. parent recovery reconciliation
  4. verification that:
    • onAgentToolFinish(...) fires
    • the terminal finished event is broadcast
    • reconstructed run metadata is passed to the lifecycle hook

Verification

Ran:

cd packages/ai-chat && npm run test:workers -- src/tests/agent-tools.test.ts
npm run check

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 7, 2026

🦋 Changeset detected

Latest commit: ee5d1e6

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
agents Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 7, 2026

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1476

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1476

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1476

hono-agents

npm i https://pkg.pr.new/hono-agents@1476

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1476

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1476

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1476

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1476

commit: ee5d1e6

@whoiskatrin whoiskatrin marked this pull request as ready for review May 8, 2026 08:25
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

Copy link
Copy Markdown

@aron-cf aron-cf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment regarding changeset.

"agents": patch
---

Ensure recovered agent-tool runs go through the same terminal lifecycle path as live runs. Parent recovery reconciliation now replays stored child chunks, broadcasts terminal agent-tool events, and invokes `onAgentToolFinish` after updating the parent run registry.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't clearly describe to the reader what actually changed from their perspective. I think it's something like "Fixed bug causing client state to drift from internal Durable Object state when tool calls spanned a Durable Object restart."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed

@whoiskatrin whoiskatrin merged commit 3c48858 into main May 8, 2026
5 checks passed
@whoiskatrin whoiskatrin deleted the fix-agent-tool-reconcile-finalize branch May 8, 2026 10:25
@github-actions github-actions Bot mentioned this pull request May 8, 2026
threepointone added a commit that referenced this pull request May 11, 2026
* fix(voice): harden Workers AI STT turn handling

Follow up on PR #1458 by preserving Flux turn transcripts across lifecycle events and using model-detected speech start for low-latency barge-in.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(chat): harden stream resume negotiation close races

Follow-up to PR #1463: route stream-resume negotiation sends through close-safe helpers so WebSocket close races do not crash resume handling in think and ai-chat.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(voice): parse raw NDJSON text streams

Follow-up to PR #1462: make the voice text stream parser honor its documented NDJSON support while preserving SSE parsing for AI text streams.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(agents): defer recovered agent-tool finish hooks (#1476)

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(voice): cover useVoiceAgent enabled lifecycle (#1478)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ai-chat): close resumed streams on disconnect (#1487)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(voice): invalidate playback on client interrupt (#1458)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(voice): invalidate playback when ending calls (#1458)

Co-authored-by: Cursor <cursoragent@cursor.com>

* Run deferred finish hooks after successful startup

Ensure recovered agent-tool finish hooks are only executed after a successful user onStart. Await _runDeferredAgentToolFinishHooks inside the onStart flow so deferred finishes are skipped when startup fails. Add a test and helper (reconcileCompletedChildWithFailedStartupForTest) to verify finish hooks are not run on failed startup and to cover lifecycle ordering and event emission.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_reconcileAgentToolRuns updates SQL terminal status but does not invoke onAgentToolFinish or broadcast — causes silent state desync after DO eviction

2 participants