Skip to content

UND_ERR_CONNECT_TIMEOUT errors thrown when there is CPU intensive code on the event loop #3410

@mknichel

Description

@mknichel

Bug Description

Work on the event loop can interrupt the Undici lifecycle for making requests, causing errors to be thrown even when there is no problem with the underlying connection. For example, if a fetch request is started and then work on the event loop takes more than 10 seconds (default connect timeout), Undici will throw a UND_ERR_CONNECT_TIMEOUT error even if the connection could be established very quickly.

I believe what is happening is:

  1. When the fetch request is started, Undici starts the work to make a connection. Undici calls setTimeout with the value of the connectTimeoutMs to throw an error and cancel the connection if it takes too long (https://github.com/nodejs/undici/blob/main/lib/core/connect.js). It makes a call to GetAddrInfoReqWrap (https://github.com/nodejs/node/blob/main/lib/dns.js#L221), but this is asynchronous and processing of the callback will be delayed until the next event loop.
  2. User tasks block the event loop for a long period of time.
  3. The onConnectTimeout timer is run because the previous task took longer than the timeout. onConnectTimeout calls setImmediate with a function to destroy the socket and throw the error. https://github.com/nodejs/undici/blob/main/lib/core/connect.js
  4. The GetAddrInfoReq lookup callback (emitLookup in node:net) is run. This code begins the TCP connection (internalConnect is called in https://github.com/nodejs/node/blob/main/lib/net.js#L1032) but that is also asynchronous, so it won't finish in this round of the event loop.
  5. The setImmediate function is run in the next phase which destroys the socket and throws the UND_ERR_CONNECT_TIMEOUT error.
  6. Undici never gets a chance to handle the TCP connection response.

Internally at Vercel, we have been seeing a high number of these UND_ERR_CONNECT_TIMEOUT issues while pre-rendering pages in our Next.js application. I can't run this task on my local machine so it's harder to debug, but it's a CPU intensive task and moving fetch requests to a worker thread eliminated the Undici errors. We tried other suggestions (like --dns-result-order=ipv4first and verified that we were not seeing any packet loss) that did not resolve the issue. Increasing the connect timeout resolves the issue in the reproduction but not the issue in our Next.js build (which I can't explain).

Reproducible By

A minimal reproduction is available at https://github.com/mknichel/undici-connect-timeout-errors.

We can reproduce the behavior on Node 18.x and 20.x and with the 5.24.0 and the latest version of Undici (6.19.2)

Expected Behavior

The Undici request lifecycle could operate on a separate thread that does not get blocked by user code. By separating it out from the user code, this would remove impact of any user code on requests.

To test this theory, we created a dispatcher that proxied the fetch request to a dedicated worker thread (new Worker from worker_threads). This eliminated all the Undici errors that we were seeing in our Next.js build.

Logs & Screenshots

In the minimal reproduction, the error is:

TypeError: fetch failed
    at fetch (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/index.js:112:13)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at fetchExample (/Users/mknichel/code/tmp/undici-connect-timeout-errors/index.ts:21:20)
    at main (/Users/mknichel/code/tmp/undici-connect-timeout-errors/index.ts:66:3) {
  [cause]: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/lib/core/connect.js:190:24)
      at /Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/lib/core/connect.js:133:46
      at Immediate._onImmediate (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/lib/core/connect.js:174:9)
      at process.processImmediate (node:internal/timers:478:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'

In our Next.js builds, the error is:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12618:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async s (elided path)
    at async elided path {
  cause: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (node:internal/deps/undici/undici:7760:28)
      at node:internal/deps/undici/undici:7716:50
      at Immediate._onImmediate (node:internal/deps/undici/undici:7748:13)
      at process.processImmediate (node:internal/timers:478:21)
      at process.callbackTrampoline (node:internal/async_hooks:130:17) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
    }
  }

Environment

The reproduction repo was erroring for me on Mac OS 14.4, while internally we are seeing issues on AWS EC2 Intel machines.

Additional context

Vercel/Next.js users have reported UND_ERR_CONNECT_TIMEOUT issues to us:

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions