Bug Description
We had a recent incident wherein a specific workload was able to cause Node.js to crash due to an uncaught exception.
We saw two distinct cases of uncaught exceptions:
Error: read ECONNRESET
at Pipe.onStreamRead (node:internal/stream_base_commons:217:20)
SocketError: other side closed
at Socket.<anonymous> (/data/node_modules/undici/lib/dispatcher/client-h1.js:701:24)
at Socket.emit (node:events:529:35)
at Socket.emit (node:domain:489:12)
at endReadableNT (node:internal/streams/readable:1400:12)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
Both of these were caught via process.on('uncaughtException') and had the origin of uncaughtException (these were not unhandled rejections AFAICT).
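For reference, a minimal sketch of how we observe these at the process level (our real handler lives in graceful_termination.js; the body here is a simplified stand-in):
process.on('uncaughtException', (err, origin) => {
  // `origin` is 'uncaughtException' for both errors above, i.e. these are not
  // unhandled Promise rejections.
  console.error(origin, err);
  // ... begin graceful termination ...
});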
To the best of our knowledge, all opportunities for exhaustive error handling have been exercised, though we've been unable to produce a minimal reproduction outside our codebase. Within our codebase, we've been able to get deeper stack traces through Chrome Dev Tools:
onUncaughtException (graceful_termination.js:53)
emit (node:events:529)
emit (node:domain:489)
(anonymous) (node:internal/process/execution:158)
TickObject
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
nextTick (node:internal/process/task_queues:132)
onDestroy (node:internal/streams/destroy:103)
(anonymous) (readable.js:68)
processImmediate (node:internal/timers:476)
topLevelDomainCallback (node:domain:161)
callbackTrampoline (node:internal/async_hooks:126)
Immediate
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
initAsyncResource (node:internal/timers:164)
Immediate (node:internal/timers:620)
setImmediate (node:timers:307)
_destroy (readable.js:67)
_destroy (node:internal/streams/destroy:109)
destroy (node:internal/streams/destroy:71)
destroy (readable.js:58)
destroy (util.js:290)
(anonymous) (api-request.js:176)
(anonymous) (node:internal/process/task_queues:140)
runInAsyncScope (node:async_hooks:203)
runMicrotask (node:internal/process/task_queues:137)
Microtask
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
AsyncResource (node:async_hooks:186)
queueMicrotask (node:internal/process/task_queues:152)
onError (api-request.js:175)
onError (request.js:299)
errorRequest (util.js:638)
(anonymous) (client-h1.js:740)
emit (node:events:529)
emit (node:domain:489)
(anonymous) (node:net:350)
callbackTrampoline (node:internal/async_hooks:128)
PIPEWRAP
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
Socket.connect (node:net:1218)
connect (node:net:249)
connect (connect.js:126)
socket (client.js:428)
connect (client.js:427)
_resume (client.js:600)
resume (client.js:534)
Client.<computed> (client.js:259)
[dispatch] (client.js:314)
Intercept (redirect-interceptor.js:11)
dispatch (dispatcher-base.js:177)
[dispatch] (pool-base.js:143)
dispatch (dispatcher-base.js:177)
request (api-request.js:203)
(anonymous) (api-request.js:196)
request (api-request.js:195)
executeValidatedActionsBatch (execute_actions_batch.js:493) // Where we invoke `.request()` of a `Pool` instance.
// ... snip ...
In that longer stack trace, you can see that we call request on a Pool instance in the executeValidatedActionsBatch function, which is declared as an async function. That usually lets us capture errors via Promise rejection, but during the incident a customer's workload was reliably causing these exceptions to bubble up to the process level.
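For context, a simplified sketch of how we expect errors from that call to surface (pool, signal, and the request options here are illustrative stand-ins for our real code):
async function executeValidatedActionsBatch(actions) {
  try {
    // This is the `.request()` call at the bottom of the stack trace above.
    const res = await pool.request({ path: '/batch', method: 'POST', signal });
    // ... consume res.body ...
  } catch (err) {
    // A connection error is expected to land here as a Promise rejection, but
    // during the incident it surfaced as a process-level uncaughtException instead.
  }
}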
We have some weak hypotheses:
- Timing issue between a socket end event and gaining access to the request.body Readable. We don't have a reference to the Readable by the time the error happens (see the sketch after this list).
- Timing issue between the AbortSignal and other clean up.
- 😕 ❓
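To illustrate the first hypothesis: if a Readable is destroyed with an error before any 'error' listener is attached, the error is emitted asynchronously with no handler and surfaces as an uncaughtException rather than a Promise rejection. A contrived stand-alone example (not our code, and not a repro of the undici path):
const { Readable } = require('node:stream');

const body = new Readable({ read() {} });
// No 'error' listener has been attached yet...
body.destroy(new Error('other side closed'));
// ...so the 'error' emitted by destroy() has no handler and is raised as an
// uncaughtException on a later tick, bypassing any surrounding try/catch.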
Reproducible By
This is reproduced (sometimes) when the process tree within which the target server is running is suddenly OOM-killed. The undici Pool is connected via a unix domain socket, which might present unique ways of blowing up vs. the usual TCP sockets.
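For reference, the Pool is constructed roughly like this (the URL and socket path are illustrative):
const { Pool } = require('undici');

// The target server is reached over a unix domain socket rather than TCP.
const pool = new Pool('http://localhost', {
  socketPath: '/var/run/target-server.sock',
});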
Expected Behavior
We expect that no process-level uncaught exceptions or unhandled rejections are possible in pseudocode like this:
// `pipeline` is from node:stream/promises; `pool`, `signal`, and
// `consumeStreamUpToDelimiter` are defined elsewhere in our codebase.
// Wrap undici in an async function so that we can handle all rejections in the
// async continuation.
async function doRequest(options, sink) {
  const res = await pool.request({ ...options, signal });
  // Avoid uncaughts via EventEmitter legacy footguns.
  res.body.on('error', () => {});
  const firstChunk = await consumeStreamUpToDelimiter(res.body, '\n', { signal });
  // Do stuff with the first chunk. Conditionally do another client request whose
  // body uses `res.body`.
  // After we've done all the client requests, we want to pipe the tail of the
  // last request back into a supplied sink (Writable stream).
  await pipeline([res.body, sink], { signal });
}
Logs & Screenshots
Added in description
Environment
Docker image node:18.19.1-bullseye via docker-for-mac on macOS 14.17.1.
Additional context
Sort of terrible drawing of what our stuff is doing:
Server request --> Repeat N times --> await pipeline(lastClientRes.body, server response)
|--> lastClientRes = await Pool.request({ socketPath })
|--> for (i in [0...M]) { await consumeStreamUpToDelimiter(lastClientRes) }
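In code, that drawing corresponds roughly to the following simplified sketch (handleServerRequest, N, M, and the options are illustrative; consumeStreamUpToDelimiter is our own helper, and pipeline is from node:stream/promises):
async function handleServerRequest(serverResponse, options) {
  let lastClientRes;
  for (let n = 0; n < N; n++) {
    lastClientRes = await pool.request({ ...options, signal });
    lastClientRes.body.on('error', () => {});
    for (let i = 0; i < M; i++) {
      await consumeStreamUpToDelimiter(lastClientRes.body, '\n', { signal });
    }
  }
  // Pipe the tail of the last client response back into the server response.
  await pipeline([lastClientRes.body, serverResponse], { signal });
}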