Skip to content

ci: reduce yarn install flakiness with concurrency cap and retries#3519

Merged
QuantumExplorer merged 4 commits into
v3.1-devfrom
claude/affectionate-nobel-d944df
Apr 23, 2026
Merged

ci: reduce yarn install flakiness with concurrency cap and retries#3519
QuantumExplorer merged 4 commits into
v3.1-devfrom
claude/affectionate-nobel-d944df

Conversation

@QuantumExplorer
Copy link
Copy Markdown
Member

@QuantumExplorer QuantumExplorer commented Apr 23, 2026

Issue being fixed or feature implemented

The "JS packages / Tests" jobs have been intermittently failing before any test runs, during the yarn install --inline-builds step.

Example failure: https://github.com/dashpay/platform/actions/runs/24816031011/job/72633394180?pr=3516

Failures manifest as:

Error: EBADF: bad file descriptor, fstat

on native-module packages (unrs-resolver, @pshenmic/zeromq, bufferutil, utf-8-validate). This is a yarn-berry race when many postinstall build scripts execute in parallel on Azure-hosted runners — which package gets hit varies run-to-run, making it look like a random flake.

What was done?

Two mitigations in the shared Setup Node.JS composite action (.github/actions/nodejs/action.yaml):

  1. YARN_CHILD_CONCURRENCY=1 — serializes native module builds so they don't race on file descriptors. Eliminates the root cause of EBADF at the cost of a small install-time regression (~15-30s on cold cache).
  2. Retry wrapper (3 attempts with linear backoff) — safety net for any remaining transient issue: npm registry blips, network timeouts, GitHub Actions runner hiccups.

A couple of things intentionally not included in this PR, since the scope is "stop the bleeding":

  • ssh2's NAN-based optional crypto binding fails to build against Node 24's V8 headers. That's noise in the logs but not what's failing the step — ssh2 handles it gracefully. Worth a follow-up with a yarn resolution bumping nan.
  • Caching could be expanded beyond .yarn/unplugged to also cover .yarn/cache and install state, reducing how often native builds run at all.

How Has This Been Tested?

Change is CI-only. Verified that the retry loop correctly exits 0 on success, retries on failure, and exits 1 after three failures. Real-world validation will be watching the "JS packages / Tests" jobs on this PR and subsequent PRs.

Breaking Changes

None.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

Summary by CodeRabbit

  • Chores
    • Updated CI/CD configuration to implement automatic retry logic with backoff for dependency installation, improving build resilience through enhanced error handling.

CI has been hitting intermittent failures in the "JS packages / Tests"
jobs during `yarn install --inline-builds`. Failures show as:

  Error: EBADF: bad file descriptor, fstat

on native-module packages (unrs-resolver, @pshenmic/zeromq,
bufferutil, utf-8-validate). This is a yarn-berry race when many
postinstall build scripts run in parallel on Azure runners.

Two mitigations applied to the shared Setup Node.JS composite action:

- Set YARN_CHILD_CONCURRENCY=1 to serialize native builds and
  eliminate the race at the cost of a small install-time regression.
- Retry the install up to 3 times with backoff, as a safety net for
  any remaining transient issue (registry blips, network timeouts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added this to the v3.1.0 milestone Apr 23, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@QuantumExplorer has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 19 minutes and 35 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 19 minutes and 35 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2ddf5183-d30f-4b90-aa9f-4c3d611db52e

📥 Commits

Reviewing files that changed from the base of the PR and between 06c7f1f and 7032b84.

📒 Files selected for processing (1)
  • .github/actions/nodejs/action.yaml
📝 Walkthrough

Walkthrough

A GitHub Actions workflow file is updated to enhance the Yarn installation step with retry logic and concurrency controls. The change adds environment variable configuration to serialize child processes and implements a three-attempt retry loop with exponential backoff to improve reliability.

Changes

Cohort / File(s) Summary
GitHub Actions Configuration
.github/actions/nodejs/action.yaml
Added YARN_CHILD_CONCURRENCY: '1' environment variable and replaced single install command with a bash retry loop (max 3 attempts) that logs warnings on failures, applies exponential backoff (sleep based on attempt count), and outputs a GitHub Actions error if all retries fail.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐰 A yarn of resilience, we weave with care,
Three attempts to build without despair,
With backoff sleeping, child processes aligned,
Race conditions soothed, reliability refined,
Hop forward with confidence, the build won't break!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main changes: reducing yarn install flakiness through concurrency capping and retries, which directly matches the PR's primary objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/affectionate-nobel-d944df

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/actions/nodejs/action.yaml (1)

42-50: Avoid sleeping after the final failed attempt.

The loop unconditionally sleeps after every failed attempt, including the 3rd (final) one, adding ~15s of dead time before the step ultimately fails. Consider short-circuiting when attempt equals the max so CI fails faster on genuine outages.

♻️ Proposed refactor
-        for attempt in 1 2 3; do
-          if yarn install --inline-builds; then
-            exit 0
-          fi
-          echo "::warning::yarn install failed on attempt ${attempt}, retrying..."
-          sleep $((attempt * 5))
-        done
-        echo "::error::yarn install failed after 3 attempts"
-        exit 1
+        max_attempts=3
+        for attempt in $(seq 1 "$max_attempts"); do
+          if yarn install --inline-builds; then
+            exit 0
+          fi
+          if [ "$attempt" -lt "$max_attempts" ]; then
+            echo "::warning::yarn install failed on attempt ${attempt}, retrying in $((attempt * 5))s..."
+            sleep $((attempt * 5))
+          fi
+        done
+        echo "::error::yarn install failed after ${max_attempts} attempts"
+        exit 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/actions/nodejs/action.yaml around lines 42 - 50, The retry loop
sleeps after every failed attempt including the final one, causing unnecessary
delay; modify the loop around the `for attempt in 1 2 3; do` block so that after
a failed `yarn install --inline-builds` it only calls `sleep $((attempt * 5))`
when `attempt` is less than the max (3), and short-circuits to the final error
message and `exit 1` immediately when the last attempt fails, referencing the
`attempt` variable and the `yarn install --inline-builds` invocation to locate
where to add the conditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In @.github/actions/nodejs/action.yaml:
- Around line 42-50: The retry loop sleeps after every failed attempt including
the final one, causing unnecessary delay; modify the loop around the `for
attempt in 1 2 3; do` block so that after a failed `yarn install
--inline-builds` it only calls `sleep $((attempt * 5))` when `attempt` is less
than the max (3), and short-circuits to the final error message and `exit 1`
immediately when the last attempt fails, referencing the `attempt` variable and
the `yarn install --inline-builds` invocation to locate where to add the
conditional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a9689e1f-f795-4d0f-a736-2b974c9194d8

📥 Commits

Reviewing files that changed from the base of the PR and between 4aa73ca and 06c7f1f.

📒 Files selected for processing (1)
  • .github/actions/nodejs/action.yaml

@thepastaclaw
Copy link
Copy Markdown
Collaborator

thepastaclaw commented Apr 23, 2026

Review Gate

Commit: 7032b846

  • Debounce: 10m ago (need 30m)

  • CI checks: checks still running (3 pending)

  • CodeRabbit review: comment found

  • Off-peak hours: off-peak (10:39 PM PT Wednesday)

  • Run review now (check to override)

YARN_CHILD_CONCURRENCY is a yarn classic setting and is rejected by
yarn 4 as an "Unrecognized or legacy configuration setting", causing
every yarn install to exit 1 immediately.

The correct setting in yarn berry is taskPoolConcurrency (env var
YARN_TASK_POOL_CONCURRENCY). Default is 10; use 2 to cap the race
window on native builds while keeping some parallelism for speed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge investigation showed the job failure is not a race but a
deterministic regression introduced by the ubuntu-24.04 runner image
2026-04-20.95. The same PR with identical yarn state passes on the
earlier image 2026-04-13.86 and fails on 2026-04-20.95 every run.

Failures manifest as `EBADF: bad file descriptor, fstat` in yarn
postinstall scripts for native packages (unrs-resolver, bufferutil,
utf-8-validate, @pshenmic/zeromq) — all failing within <200ms of
starting. This matches a known Node-level io_uring fs path
interaction with piped stdio under high postinstall concurrency.

Two targeted changes:

- UV_USE_IO_URING=0 tells libuv to skip io_uring and use the
  threadpool fs backend, which doesn't exhibit the EBADF regression.
- Drop `--inline-builds`; the default mode writes per-package build
  logs instead of piping every postinstall's stdio through yarn,
  removing the stdio-pipe pressure that compounds the EBADF issue.
  On failure yarn still reports the log paths.

Replaces the earlier YARN_TASK_POOL_CONCURRENCY setting, which CI
logs confirmed does not gate postinstall script concurrency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@QuantumExplorer
Copy link
Copy Markdown
Member Author

Root cause update — not a flake, a runner image regression. The failing test jobs ran on ubuntu24/20260420.95 (Apr 20), the passing Build JS job in the same PR ran on ubuntu24/20260413.86 (Apr 13). The newer image surfaces deterministic EBADF: bad file descriptor, fstat errors in yarn postinstall scripts within <200ms of starting.

Two changes replace the earlier YARN_TASK_POOL_CONCURRENCY approach (which CI logs confirmed doesn't gate postinstall concurrency):

  1. UV_USE_IO_URING=0 — tells libuv to use the threadpool fs backend instead of io_uring. The EBADF pattern matches known Node.js io_uring interactions with piped stdio.
  2. Drop --inline-builds — default mode writes per-package build logs instead of piping every postinstall's stdio through yarn. yarn still reports log paths on failure.

The retry loop remains as a safety net.

Previous commit blamed the ubuntu-24.04 runner image bump and tried
UV_USE_IO_URING=0 + dropping --inline-builds. That was the wrong
diagnosis.

Full stack trace from a failing tests job shows the EBADF comes from
Node's own ESM loader, not libuv fs/io_uring or yarn stdio piping:

  const stats = binding.fstat(fd, false, undefined, true);
    at tryStatSync (node:fs:391:25)
    at readFileSync (node:fs:447:17)
    at getSourceSync (node:internal/modules/esm/load:37:14)
    at createCJSModuleWrap (node:internal/modules/esm/translators:210:32)
    at ModuleLoader.commonjsStrategy (...:349:10)
  Error: EBADF: bad file descriptor, fstat
  Node.js v24.15.0

Build JS (passing) ran on runner image 20260413.86 with Node 24.14.1.
Tests (failing) ran on runner image 20260420.95 with Node 24.15.0.
Node 24.15.0 shipped a regression in the ESM loader's CJS translation
path that raises EBADF in getSourceSync -> readFileSync during
postinstall scripts that load CJS via ESM.

Pin node-version to 24.14.1. Keep the retry wrapper — cheap safety
net for unrelated transient install issues.

Revert UV_USE_IO_URING=0 and restore --inline-builds; neither was
contributing to the fix, and --inline-builds is useful for seeing
postinstall output in CI logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@QuantumExplorer
Copy link
Copy Markdown
Member Author

Fresh diagnosis from the latest failing log — full stack trace pins this on Node 24.15.0, not runner image or yarn stdio:

const stats = binding.fstat(fd, false, undefined, true);
  at tryStatSync (node:fs:391:25)
  at readFileSync (node:fs:447:17)
  at getSourceSync (node:internal/modules/esm/load:37:14)
  at createCJSModuleWrap (node:internal/modules/esm/translators:210:32)
Error: EBADF: bad file descriptor, fstat
Node.js v24.15.0

Build JS (passing) → runner image 2026-04-13 → Node 24.14.1.
Tests (failing) → runner image 2026-04-20 → Node 24.15.0.

Node 24.15.0 has a regression in its ESM → CJS translator that triggers EBADF on fstat in getSourceSync, which breaks any postinstall script that loads CJS via ESM.

Pinned node-version to 24.14.1 and reverted the UV_USE_IO_URING / --inline-builds changes from the earlier commit — neither was the actual fix.

@QuantumExplorer QuantumExplorer merged commit b771a17 into v3.1-dev Apr 23, 2026
34 checks passed
@QuantumExplorer QuantumExplorer deleted the claude/affectionate-nobel-d944df branch April 23, 2026 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants