fix(tool): fix bash tool execution hangs and mitigate database locking#20999
fix(tool): fix bash tool execution hangs and mitigate database locking#20999vhqtvn wants to merge 2 commits intoanomalyco:devfrom
Conversation
This commit addresses two major stability issues during tool execution:
1. **Race Condition in Process Spawning (Fixes Hangs):**
Fast-exiting processes (like 'jq') could terminate so quickly that the 'spawn' event was omitted or misordered. This left the 'resume()' callback in 'CrossSpawnSpawner' blocked forever, causing the session to hang indefinitely. We now defensively trigger 'resume' on 'error', 'exit', or 'close' to guarantee the execution promise resolves.
2. **Event Loop Blockage on High-Volume Output (Mitigates DB Locking & Fixes Output Truncation):**
Commands emitting thousands of lines quickly (e.g. 'npm install', large 'cat' or 'grep') caused 'Stream.runForEach' to synchronously queue thousands of 'db.insert().run()' calls via 'ctx.metadata(...)' on the main thread, choking the event loop and exacerbating 'database is locked' exceptions.
- Throttled metadata updates to 250ms ('now - lastUpdate > 250').
- Added a deferred 'setTimeout(flush, 250)' to ensure trailing output chunks are not missed.
- Fixed a race condition where 'runPromiseExit' resolved on process 'close' and forcefully interrupted the stream fiber ('handle.all'), truncating output. Now uses 'Fiber.join(streamFiber)' to completely drain pipes before returning.
|
Thanks for your contribution! This PR doesn't have a linked issue. All PRs must reference an existing issue. Please:
See CONTRIBUTING.md for details. |
|
The following comment was made by an LLM, it may be inaccurate: Based on my search results, I found the following potentially related PRs: Related PRs Found:
These PRs are related to the issues being fixed but don't appear to be exact duplicates of PR #20999. They address complementary aspects like watchdog recovery, client hangs, and SQLite configuration. |
This commit addresses two major stability issues during tool execution:
1. **Race Condition in Process Spawning (Fixes Hangs):**
Fast-exiting processes (like 'jq') could terminate so quickly that the 'spawn' event was omitted or misordered. This left the 'resume()' callback in 'CrossSpawnSpawner' blocked forever, causing the session to hang indefinitely. We now defensively trigger 'resume' on 'error', 'exit', or 'close' to guarantee the execution promise resolves.
2. **Event Loop Blockage on High-Volume Output (Mitigates DB Locking & Fixes Output Truncation):**
Commands emitting thousands of lines quickly (e.g. 'npm install', large 'cat' or 'grep') caused 'Stream.runForEach' to synchronously queue thousands of 'db.insert().run()' calls via 'ctx.metadata(...)' on the main thread, choking the event loop and exacerbating 'database is locked' exceptions.
- Throttled metadata updates to 250ms ('now - lastUpdate > 250').
- Added a deferred 'setTimeout(flush, 250)' to ensure trailing output chunks are not missed.
- Fixed a race condition where 'runPromiseExit' resolved on process 'close' and forcefully interrupted the stream fiber ('handle.all'), truncating output. Now uses 'Fiber.join(streamFiber)' to completely drain pipes before returning.
Includes test fixes for:
- tool.bash stream metadata throttling flakiness
- tool.write file permission flakiness depending on OS umask
Issue for this PR
Closes #21000
Relates to #20096 (Fixes one cause of indefinite bash tool hangs, but does not implement the requested
experimental.tool_timeoutfeature)Relates to #19521 (Mitigates
database is lockederrors by heavily reducing SQLite write contention during stream processing, though multi-process concurrency may still lock the DB)Type of change
What does this PR do?
This PR fixes severe backend hangs during tool execution (often seen as "ghost processes" in the UI) and mitigates
SQLiteError: database is lockedexceptions that crash the execution context without bringing down the Node process.Specifically, it addresses two root causes:
Race Condition in Process Spawning (Fixes Hangs)
Underlying fast-exiting processes (like
jq) could terminate so quickly that thespawnevent was omitted or misordered by the OS/Node process bindings. This left theresume()callback inCrossSpawnSpawnerblocked forever, hangingrunPromiseExit. I added resilient event tracking (spawn,error,exit,close) to guarantee the execution promise always resolves regardless of event ordering.Event Loop Blockage on High-Volume Output (Mitigates DB Locking & Fixes Output Truncation)
Commands emitting thousands of lines quickly (e.g.
npm install, largecatorgrep) causedStream.runForEachto synchronously queue thousands ofdb.insert().run()calls viactx.metadata(...)on the main thread, choking the event loop and exacerbatingdatabase is lockedexceptions. This led to silent promise rejections, dropped events, and 15-second OpenChamber timeouts.now - lastUpdate > 250) on metadata updates to reduce SQLite write contention.setTimeout(flush, 250)to ensure the final output chunks are reliably emitted when the process exits.runPromiseExitresolved on processcloseand instantly killed the background stream fiberhandle.all(truncating output).Fiber.join(streamFiber)is now used to ensure the OS pipes and async stream buffer completely drain before returning.How did you verify your code works?
jqcommands. Processes no longer hang and correctly resolve.caton large files,find /) to verify the DB doesn't lock and metadata updates smoothly in the UI.bun run typecheckcleanly across the package.Screenshots / recordings
No response
Checklist