
fix(tool): fix bash tool execution hangs and mitigate database locking#20999

Open
vhqtvn wants to merge 2 commits into anomalyco:dev from vhqtvn:dev

Conversation

@vhqtvn

@vhqtvn vhqtvn commented Apr 4, 2026

Issue for this PR

Closes #21000
Relates to #20096 (Fixes one cause of indefinite bash tool hangs, but does not implement the requested experimental.tool_timeout feature)
Relates to #19521 (Mitigates database is locked errors by heavily reducing SQLite write contention during stream processing, though multi-process concurrency may still lock the DB)

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

This PR fixes severe backend hangs during tool execution (often seen as "ghost processes" in the UI) and mitigates SQLiteError: database is locked exceptions that crash the execution context without bringing down the Node process.

Specifically, it addresses two root causes:

  1. Race Condition in Process Spawning (Fixes Hangs)
    Fast-exiting child processes (such as jq) could terminate so quickly that the spawn event was never emitted, or was delivered out of order by the OS/Node process bindings. This left the resume() callback in CrossSpawnSpawner blocked forever, hanging runPromiseExit. I added resilient event tracking (spawn, error, exit, close) to guarantee the execution promise always resolves regardless of event ordering.

  2. Event Loop Blockage on High-Volume Output (Mitigates DB Locking & Fixes Output Truncation)
    Commands emitting thousands of lines in quick succession (e.g. npm install, a large cat or grep) caused Stream.runForEach to synchronously queue thousands of db.insert().run() calls via ctx.metadata(...) on the main thread, choking the event loop and exacerbating 'database is locked' exceptions. This led to silent promise rejections, dropped events, and 15-second OpenChamber timeouts.

    • I implemented a 250ms throttle (now - lastUpdate > 250) on metadata updates to reduce SQLite write contention.
    • Added a deferred setTimeout(flush, 250) to ensure the final output chunks are reliably emitted when the process exits.
    • Fixed a race condition where runPromiseExit resolved on process close and instantly killed the background stream fiber handle.all (truncating output). Fiber.join(streamFiber) is now used to ensure the OS pipes and async stream buffer completely drain before returning.
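The spawn fix can be sketched in plain Node terms. This is a minimal illustration under assumed names (runOnce is a hypothetical helper; the real code lives in CrossSpawnSpawner inside an Effect runtime), not the PR's actual implementation. The key idea is a settle-once guard wired to every lifecycle event, so a process that exits before spawn fires still unblocks the caller:

```typescript
import { spawn } from "node:child_process"

// Hypothetical helper: run a command and always settle, even when the
// child exits so fast that the "spawn" event is never observed.
function runOnce(cmd: string, args: string[]): Promise<number | null> {
  return new Promise((resolve, reject) => {
    let settled = false
    // Wrap a callback so only the first lifecycle event wins.
    const once = <T>(fn: (v: T) => void) => (v: T) => {
      if (settled) return
      settled = true
      fn(v)
    }
    const child = spawn(cmd, args, { stdio: "ignore" })
    // Any of these events is enough to unblock the caller.
    child.on("error", once(reject))
    child.on("exit", once(resolve))
    child.on("close", once(resolve))
  })
}
```

Wiring resolve to both exit and close is intentionally redundant: whichever event the OS delivers first settles the promise, and the guard turns the later ones into no-ops.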

How did you verify your code works?

  • Manually tested by dispatching concurrent jq commands. Processes no longer hang and resolve correctly.
  • Ran high-volume output commands (cat on large files, find /) to verify the DB doesn't lock and metadata updates smoothly in the UI.
  • Verified 100% of trailing output chunks are captured when processes exit quickly.
  • Ran bun run typecheck cleanly across the package.

Screenshots / recordings

No response

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

This commit addresses two major stability issues during tool execution:

1. **Race Condition in Process Spawning (Fixes Hangs):**
   Fast-exiting processes (like 'jq') could terminate so quickly that the 'spawn' event was omitted or misordered. This left the 'resume()' callback in 'CrossSpawnSpawner' blocked forever, causing the session to hang indefinitely. We now defensively trigger 'resume' on 'error', 'exit', or 'close' to guarantee the execution promise resolves.

2. **Event Loop Blockage on High-Volume Output (Mitigates DB Locking & Fixes Output Truncation):**
   Commands emitting thousands of lines quickly (e.g. 'npm install', large 'cat' or 'grep') caused 'Stream.runForEach' to synchronously queue thousands of 'db.insert().run()' calls via 'ctx.metadata(...)' on the main thread, choking the event loop and exacerbating 'database is locked' exceptions.
   - Throttled metadata updates to 250ms ('now - lastUpdate > 250').
   - Added a deferred 'setTimeout(flush, 250)' to ensure trailing output chunks are not missed.
   - Fixed a race condition where 'runPromiseExit' resolved on process 'close' and forcefully interrupted the stream fiber ('handle.all'), truncating output. Now uses 'Fiber.join(streamFiber)' to completely drain pipes before returning.
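The throttle-plus-deferred-flush pattern above can be sketched in plain TypeScript. This is an illustrative sketch under assumed names (makeThrottledFlush is hypothetical, and persist stands in for the real ctx.metadata(...) write path), not the PR's actual code:

```typescript
// Buffer output chunks and persist at most once per interval, with a
// deferred flush so trailing chunks survive a fast process exit.
function makeThrottledFlush(persist: (chunk: string) => void, intervalMs = 250) {
  let buffer = ""
  let lastUpdate = 0
  let timer: ReturnType<typeof setTimeout> | undefined

  const flush = () => {
    if (timer !== undefined) {
      clearTimeout(timer)
      timer = undefined
    }
    if (buffer.length === 0) return
    persist(buffer)
    buffer = ""
    lastUpdate = Date.now()
  }

  return {
    push(chunk: string) {
      buffer += chunk
      if (Date.now() - lastUpdate > intervalMs) {
        // Interval elapsed: write through immediately.
        flush()
      } else if (timer === undefined) {
        // Within the interval: schedule one deferred flush so the
        // final chunks are still emitted even if push() is never
        // called again before the process exits.
        timer = setTimeout(flush, intervalMs)
      }
    },
    flush, // callers invoke this once more on process exit
  }
}
```

Collapsing many writes into one buffered flush is what relieves the SQLite contention: the database sees one insert per interval instead of one per output line.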
@github-actions
Contributor

github-actions bot commented Apr 4, 2026

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

@github-actions
Contributor

github-actions bot commented Apr 4, 2026

The following comment was generated by an LLM and may be inaccurate:

Based on my search results, I found the following potentially related PRs:

Related PRs Found:

  1. feat(session): watchdog for stuck tool/session recovery (#20104)

    • Related to the hang issue; addresses stuck tools/sessions with a watchdog mechanism
  2. fix(opencode): write tool causes client to hang indefinitely when creating new files (#15684)

    • Also addresses indefinite hangs in tool execution
  3. fix: switch SQLite from WAL to DELETE journal mode to prevent corruption (#14326)

    • Related to SQLite database issues, though focused on corruption rather than locking

These PRs are related to the issues being fixed but don't appear to be exact duplicates of PR #20999. They address complementary aspects like watchdog recovery, client hangs, and SQLite configuration.

This commit restates the stability fixes described above and additionally includes test fixes for:
- tool.bash stream metadata throttling flakiness
- tool.write file permission flakiness depending on OS umask

Development

Successfully merging this pull request may close these issues.

[Bug]: Bash tool hangs on fast-exiting processes and locks database on massive output

1 participant