
Uncaught exception thrown in way that can't be intercepted in userland #3848

@ggoodman

Bug Description

We had a recent incident wherein a specific workload was able to cause Node.js to crash due to an uncaught exception.

We saw two distinct cases of uncaught exceptions:

Error: read ECONNRESET
    at Pipe.onStreamRead (node:internal/stream_base_commons:217:20)
SocketError: other side closed
    at Socket.<anonymous> (/data/node_modules/undici/lib/dispatcher/client-h1.js:701:24)
    at Socket.emit (node:events:529:35)
    at Socket.emit (node:domain:489:12)
    at endReadableNT (node:internal/streams/readable:1400:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

Both of these were caught via process.on('uncaughtException') and had an origin of uncaughtException (they were not unhandled rejections, AFAICT).
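
For reference, the shape of our process-level handler (the real one lives in graceful_termination.js, referenced in the trace below; the body here is a simplified, illustrative sketch):

// Illustrative sketch only; not the exact contents of graceful_termination.js.
process.on('uncaughtException', (err, origin) => {
  // `origin` is 'uncaughtException' or 'unhandledRejection'; in both of the
  // cases above it was 'uncaughtException'.
  console.error(`Caught at process level (origin=${origin}):`, err);

  // Kick off graceful termination rather than dying mid-request.
  // (Actual shutdown logic omitted.)
  process.exitCode = 1;
});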

To the best of our knowledge, all opportunities for exhaustive error handling have been exercised, though we've been unable to produce a minimal repro outside our codebase. Within our codebase, we've been able to get deeper stack traces through Chrome DevTools:

onUncaughtException (graceful_termination.js:53)
emit (node:events:529)
emit (node:domain:489)
(anonymous) (node:internal/process/execution:158)
TickObject
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
nextTick (node:internal/process/task_queues:132)
onDestroy (node:internal/streams/destroy:103)
(anonymous) (readable.js:68)
processImmediate (node:internal/timers:476)
topLevelDomainCallback (node:domain:161)
callbackTrampoline (node:internal/async_hooks:126)
Immediate
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
initAsyncResource (node:internal/timers:164)
Immediate (node:internal/timers:620)
setImmediate (node:timers:307)
_destroy (readable.js:67)
_destroy (node:internal/streams/destroy:109)
destroy (node:internal/streams/destroy:71)
destroy (readable.js:58)
destroy (util.js:290)
(anonymous) (api-request.js:176)
(anonymous) (node:internal/process/task_queues:140)
runInAsyncScope (node:async_hooks:203)
runMicrotask (node:internal/process/task_queues:137)
Microtask
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
emitInitScript (node:internal/async_hooks:503)
AsyncResource (node:async_hooks:186)
queueMicrotask (node:internal/process/task_queues:152)
onError (api-request.js:175)
onError (request.js:299)
errorRequest (util.js:638)
(anonymous) (client-h1.js:740)
emit (node:events:529)
emit (node:domain:489)
(anonymous) (node:net:350)
callbackTrampoline (node:internal/async_hooks:128)
PIPEWRAP
init (node:internal/inspector_async_hook:25)
emitInitNative (node:internal/async_hooks:200)
Socket.connect (node:net:1218)
connect (node:net:249)
connect (connect.js:126)
socket (client.js:428)
connect (client.js:427)
_resume (client.js:600)
resume (client.js:534)
Client.<computed> (client.js:259)
[dispatch] (client.js:314)
Intercept (redirect-interceptor.js:11)
dispatch (dispatcher-base.js:177)
[dispatch] (pool-base.js:143)
dispatch (dispatcher-base.js:177)
request (api-request.js:203)
(anonymous) (api-request.js:196)
request (api-request.js:195)
executeValidatedActionsBatch (execute_actions_batch.js:493) // Where we invoke `.request()` of a `Pool` instance.
// ... snip ...

In that longer stack trace, you can see that we call request on a Pool instance in the executeValidatedActionsBatch function, which is declared as an async function. While that usually allows us to capture errors via Promise rejection, during the incident a customer's workload was reliably causing these exceptions to bubble up to the process level.

We have some weak hypotheses:

  1. Timing issue between a socket end event and gaining access to the request's .body Readable; by the time the error happens, we don't yet have a reference to the Readable (see the sketch after this list).
  2. Timing issue between the AbortSignal firing and other cleanup.
  3. 😕 ❓
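
To make hypothesis 1 concrete, this is the window we're imagining; purely illustrative, not a confirmed mechanism (`pool`, `options`, and `signal` as in the pseudo-code further down):

// Illustration of hypothesis 1 only.
const resPromise = pool.request({ ...options, signal });

// If the socket is torn down in this window, we have no reference to
// `res.body` yet, so we cannot attach an 'error' listener to the stream;
// our only userland handle on the failure is the promise above.

const res = await resPromise;
res.body.on('error', () => {}); // earliest point we can attach a listener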

Reproducible By

This is reproduced (sometimes) when the process tree in which the target server runs is suddenly OOM-killed. The undici Pool is connected via a unix domain socket, which might present unique failure modes compared to the usual TCP sockets.
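
A rough sketch of the kind of standalone harness one could use to approximate this (a server on a unix domain socket that abruptly destroys its side of the connection mid-response, loosely simulating the OOM kill). Note that this has not been confirmed to reproduce the process-level crash, and it assumes socketPath is passed as a Pool option rather than per request:

// Unconfirmed repro sketch: abrupt server-side death over a unix domain socket.
const fs = require('node:fs');
const http = require('node:http');
const { Pool } = require('undici');

const socketPath = '/tmp/undici-3848-repro.sock';
fs.rmSync(socketPath, { force: true }); // remove a stale socket file, if any

const server = http.createServer((req, res) => {
  res.writeHead(200);
  res.write('partial body that is never finished\n');
  // Destroy the underlying socket instead of ending the response cleanly.
  setTimeout(() => res.socket.destroy(), 50);
});

server.listen(socketPath, async () => {
  const pool = new Pool('http://localhost', { socketPath });

  try {
    const res = await pool.request({ path: '/', method: 'GET' });
    res.body.on('error', () => {}); // mirrors our real code
    for await (const chunk of res.body) {
      // drain until the connection dies
    }
  } catch (err) {
    // In the well-behaved case, the failure surfaces here as a rejection.
    console.error('rejected as expected:', err);
  } finally {
    await pool.close();
    server.close();
  }
});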

Expected Behavior

We expect that no process-level uncaught exceptions or unhandled rejections are possible in pseudo-code like this:

// Wrap undici in an async function so that we can handle all rejections in the async
// continuation. (`pool`, `signal`, and `consumeStreamUpToDelimiter` are defined in the
// surrounding code; `pipeline` is the promise variant from `node:stream/promises`.)
async function doRequest(options, sink) {
  const res = await pool.request({ ...options, signal });
  
  // Avoid uncaughts via EventEmitter legacy footguns.
  res.body.on('error', () => {});

  const firstChunk = await consumeStreamUpToDelimiter(res.body, '\n', { signal });

  // Do stuff with first chunk. Conditionally do another client request whose body
  // uses `res.body`.

  // After we've done all the client requests, we want to pipe the tail of the last request
  // back into a supplied sink (Writable stream).
  await pipeline([res.body, sink], { signal });
}
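
And the call site we expect to be the only place these failures surface (`sink` and `logRequestFailure` are placeholders, not real helpers from our codebase):

// Every failure path, including socket resets, aborts, and "other side
// closed", should land in this catch instead of crashing the process.
try {
  await doRequest({ path: '/actions', method: 'POST' }, sink);
} catch (err) {
  logRequestFailure(err);
}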

Logs & Screenshots

Added in the description above.

Environment

Docker image node:18.19.1-bullseye via Docker for Mac on macOS 14.17.1.

Additional context

Sort of terrible drawing of what our stuff is doing:

Server request --> Repeat N times --> await pipeline(lastClientRes.body, server response)
                          |--> lastClientRes = await Pool.request({ socketPath })
                          |--> for (i in [0...M]) { await consumeStreamUpToDelimiter(lastClientRes) }
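
Or, as illustrative code (not our actual implementation; `pool`, `consumeStreamUpToDelimiter`, `N`, and `M` are as in the pseudo-code and drawing above):

// Illustrative translation of the drawing above.
const { pipeline } = require('node:stream/promises');

async function handleServerRequest(serverRes) {
  let lastClientRes;

  // Repeat N times: each iteration issues a request over the unix-socket pool.
  for (let n = 0; n < N; n++) {
    lastClientRes = await pool.request({ path: '/step', method: 'POST' });

    // Consume M delimited frames off the front of the response body.
    for (let i = 0; i < M; i++) {
      await consumeStreamUpToDelimiter(lastClientRes.body, '\n');
    }
  }

  // Pipe the tail of the last client response back to the server response.
  await pipeline(lastClientRes.body, serverRes);
}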
