ci: reduce yarn install flakiness with concurrency cap and retries#3519
Conversation
CI has been hitting intermittent failures in the "JS packages / Tests" jobs during `yarn install --inline-builds`. Failures show as: Error: EBADF: bad file descriptor, fstat on native-module packages (unrs-resolver, @pshenmic/zeromq, bufferutil, utf-8-validate). This is a yarn-berry race when many postinstall build scripts run in parallel on Azure runners. Two mitigations applied to the shared Setup Node.JS composite action: - Set YARN_CHILD_CONCURRENCY=1 to serialize native builds and eliminate the race at the cost of a small install-time regression. - Retry the install up to 3 times with backoff, as a safety net for any remaining transient issue (registry blips, network timeouts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 19 minutes and 35 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA GitHub Actions workflow file is updated to enhance the Yarn installation step with retry logic and concurrency controls. The change adds environment variable configuration to serialize child processes and implements a three-attempt retry loop with exponential backoff to improve reliability. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
.github/actions/nodejs/action.yaml (1)
42-50: Avoid sleeping after the final failed attempt.The loop unconditionally sleeps after every failed attempt, including the 3rd (final) one, adding ~15s of dead time before the step ultimately fails. Consider short-circuiting when
attemptequals the max so CI fails faster on genuine outages.♻️ Proposed refactor
- for attempt in 1 2 3; do - if yarn install --inline-builds; then - exit 0 - fi - echo "::warning::yarn install failed on attempt ${attempt}, retrying..." - sleep $((attempt * 5)) - done - echo "::error::yarn install failed after 3 attempts" - exit 1 + max_attempts=3 + for attempt in $(seq 1 "$max_attempts"); do + if yarn install --inline-builds; then + exit 0 + fi + if [ "$attempt" -lt "$max_attempts" ]; then + echo "::warning::yarn install failed on attempt ${attempt}, retrying in $((attempt * 5))s..." + sleep $((attempt * 5)) + fi + done + echo "::error::yarn install failed after ${max_attempts} attempts" + exit 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/actions/nodejs/action.yaml around lines 42 - 50, The retry loop sleeps after every failed attempt including the final one, causing unnecessary delay; modify the loop around the `for attempt in 1 2 3; do` block so that after a failed `yarn install --inline-builds` it only calls `sleep $((attempt * 5))` when `attempt` is less than the max (3), and short-circuits to the final error message and `exit 1` immediately when the last attempt fails, referencing the `attempt` variable and the `yarn install --inline-builds` invocation to locate where to add the conditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/actions/nodejs/action.yaml:
- Around line 42-50: The retry loop sleeps after every failed attempt including
the final one, causing unnecessary delay; modify the loop around the `for
attempt in 1 2 3; do` block so that after a failed `yarn install
--inline-builds` it only calls `sleep $((attempt * 5))` when `attempt` is less
than the max (3), and short-circuits to the final error message and `exit 1`
immediately when the last attempt fails, referencing the `attempt` variable and
the `yarn install --inline-builds` invocation to locate where to add the
conditional.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a9689e1f-f795-4d0f-a736-2b974c9194d8
📒 Files selected for processing (1)
.github/actions/nodejs/action.yaml
Review GateCommit:
|
YARN_CHILD_CONCURRENCY is a yarn classic setting and is rejected by yarn 4 as an "Unrecognized or legacy configuration setting", causing every yarn install to exit 1 immediately. The correct setting in yarn berry is taskPoolConcurrency (env var YARN_TASK_POOL_CONCURRENCY). Default is 10; use 2 to cap the race window on native builds while keeping some parallelism for speed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge investigation showed the job failure is not a race but a deterministic regression introduced by the ubuntu-24.04 runner image 2026-04-20.95. The same PR with identical yarn state passes on the earlier image 2026-04-13.86 and fails on 2026-04-20.95 every run. Failures manifest as `EBADF: bad file descriptor, fstat` in yarn postinstall scripts for native packages (unrs-resolver, bufferutil, utf-8-validate, @pshenmic/zeromq) — all failing within <200ms of starting. This matches a known Node-level io_uring fs path interaction with piped stdio under high postinstall concurrency. Two targeted changes: - UV_USE_IO_URING=0 tells libuv to skip io_uring and use the threadpool fs backend, which doesn't exhibit the EBADF regression. - Drop `--inline-builds`; the default mode writes per-package build logs instead of piping every postinstall's stdio through yarn, removing the stdio-pipe pressure that compounds the EBADF issue. On failure yarn still reports the log paths. Replaces the earlier YARN_TASK_POOL_CONCURRENCY setting, which CI logs confirmed does not gate postinstall script concurrency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Root cause update — not a flake, a runner image regression. The failing test jobs ran on Two changes replace the earlier
The retry loop remains as a safety net. |
Previous commit blamed the ubuntu-24.04 runner image bump and tried
UV_USE_IO_URING=0 + dropping --inline-builds. That was the wrong
diagnosis.
Full stack trace from a failing tests job shows the EBADF comes from
Node's own ESM loader, not libuv fs/io_uring or yarn stdio piping:
const stats = binding.fstat(fd, false, undefined, true);
at tryStatSync (node:fs:391:25)
at readFileSync (node:fs:447:17)
at getSourceSync (node:internal/modules/esm/load:37:14)
at createCJSModuleWrap (node:internal/modules/esm/translators:210:32)
at ModuleLoader.commonjsStrategy (...:349:10)
Error: EBADF: bad file descriptor, fstat
Node.js v24.15.0
Build JS (passing) ran on runner image 20260413.86 with Node 24.14.1.
Tests (failing) ran on runner image 20260420.95 with Node 24.15.0.
Node 24.15.0 shipped a regression in the ESM loader's CJS translation
path that raises EBADF in getSourceSync -> readFileSync during
postinstall scripts that load CJS via ESM.
Pin node-version to 24.14.1. Keep the retry wrapper — cheap safety
net for unrelated transient install issues.
Revert UV_USE_IO_URING=0 and restore --inline-builds; neither was
contributing to the fix, and --inline-builds is useful for seeing
postinstall output in CI logs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Fresh diagnosis from the latest failing log — full stack trace pins this on Node 24.15.0, not runner image or yarn stdio: Build JS (passing) → runner image 2026-04-13 → Node 24.14.1. Node 24.15.0 has a regression in its ESM → CJS translator that triggers EBADF on Pinned |
Issue being fixed or feature implemented
The "JS packages / Tests" jobs have been intermittently failing before any test runs, during the
yarn install --inline-buildsstep.Example failure: https://github.com/dashpay/platform/actions/runs/24816031011/job/72633394180?pr=3516
Failures manifest as:
on native-module packages (
unrs-resolver,@pshenmic/zeromq,bufferutil,utf-8-validate). This is a yarn-berry race when many postinstall build scripts execute in parallel on Azure-hosted runners — which package gets hit varies run-to-run, making it look like a random flake.What was done?
Two mitigations in the shared
Setup Node.JScomposite action (.github/actions/nodejs/action.yaml):YARN_CHILD_CONCURRENCY=1— serializes native module builds so they don't race on file descriptors. Eliminates the root cause of EBADF at the cost of a small install-time regression (~15-30s on cold cache).A couple of things intentionally not included in this PR, since the scope is "stop the bleeding":
ssh2's NAN-based optional crypto binding fails to build against Node 24's V8 headers. That's noise in the logs but not what's failing the step — ssh2 handles it gracefully. Worth a follow-up with a yarn resolution bumpingnan..yarn/unpluggedto also cover.yarn/cacheand install state, reducing how often native builds run at all.How Has This Been Tested?
Change is CI-only. Verified that the retry loop correctly exits 0 on success, retries on failure, and exits 1 after three failures. Real-world validation will be watching the "JS packages / Tests" jobs on this PR and subsequent PRs.
Breaking Changes
None.
Checklist:
Summary by CodeRabbit