Improve npm install reliability: add retry config, node_modules caching, and better CI diagnostics #66430

Merged
javiercn merged 3 commits into main from copilot/update-ci-pipeline-reliability on Apr 23, 2026

Contributor

Copilot AI commented Apr 23, 2026

npm's default fetch retry/timeout settings are too aggressive for CI environments, causing flaky package installs. node_modules is also restored from scratch on every CI run with no caching. Additionally, npm output was not visible in CI logs, making failures difficult to diagnose.

Description

  • .npmrc — Added npm fetch retry/timeout knobs applied globally to all npm ci/npm install invocations:

    • fetch-retries=5 (↑ from 2)
    • fetch-retry-factor=2 (↓ from 10 — less aggressive backoff)
    • fetch-retry-maxtimeout=120000 (↑ from 60000)
    • fetch-retry-mintimeout=10000 (unchanged from default, explicit)
    • fetch-timeout=600000 (↑ from 300000)
  • default-build.yml — Added Cache@2 task after UseNode@1 in both public and official build template blocks; caches node_modules keyed by "node_modules" | "node24" | OS | hash(package-lock.json). The Node major version is included in the key to prevent cross-version cache restores. MSBuild's incremental build check (node_modules\.package-lock.json marker) automatically skips npm ci on cache hit.

  • components-e2e-tests.yml — Added Cache@2 before the explicit npm ci step; npm ci is conditioned on NODE_MODULES_CACHE_HIT != 'true'. Cache key includes "node24" to scope by Node major version.

  • validate-npm-package-lock-json.yml — Removed --fetch-retries 5 CLI flag (covered by .npmrc). cache: 'npm' is intentionally omitted to preserve the workflow's purpose of validating that packages are actually retrievable from the feed.

  • update-jquery-validate.yml — Added cache: 'npm' with cache-dependency-path: src/Mvc/build/package-lock.json to actions/setup-node, scoping the cache key to the lockfile for the packages actually being installed.

  • Npm.Workspace.nodeproj — Added ConsoleToMsBuild="true" to all npm Exec tasks (npm ci, npm run build, npm run test) so that npm stdout/stderr is routed through MSBuild's logging pipeline and visible in CI logs.

  • Npm.Workspace.FunctionalTests.nodeproj — Added ConsoleToMsBuild="true" to the npm run integration-test Exec task for the same reason.

  • You've read the Contributor Guide and Code of Conduct.

  • You've included unit or integration tests for your change, where applicable.

  • You've included inline docs for your change, where applicable.

  • There's an open issue for the PR that you are making. If you'd like to propose a new feature or change, please open an issue to discuss the change or find an existing issue.
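
To make the effect of the new .npmrc values concrete, here is a minimal sketch of the retry schedule they imply. It assumes the usual capped exponential backoff formula, min(mintimeout * factor^attempt, maxtimeout), with no jitter; npm's actual scheduling may differ in detail, and the function name `retry_delays_ms` is purely illustrative.

```python
def retry_delays_ms(retries, factor, min_timeout_ms, max_timeout_ms):
    """Return the wait before each retry under capped exponential backoff.

    Assumption: delay = min(min_timeout * factor**attempt, max_timeout).
    """
    return [
        min(min_timeout_ms * factor ** attempt, max_timeout_ms)
        for attempt in range(retries)
    ]


# Values set in .npmrc by this PR.
new_delays = retry_delays_ms(retries=5, factor=2, min_timeout_ms=10_000, max_timeout_ms=120_000)

# npm defaults as quoted in the description.
default_delays = retry_delays_ms(retries=2, factor=10, min_timeout_ms=10_000, max_timeout_ms=60_000)

print(new_delays)      # [10000, 20000, 40000, 80000, 120000]
print(default_delays)  # [10000, 60000]
```

Under that assumption, the new settings allow up to 270 seconds of cumulative backoff across five retries, compared with 70 seconds across two retries with the defaults.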

@javiercn javiercn marked this pull request as ready for review April 23, 2026 07:01
@javiercn javiercn requested a review from wtgodbe as a code owner April 23, 2026 07:01
Copilot AI review requested due to automatic review settings April 23, 2026 07:01
@javiercn javiercn requested a review from a team as a code owner April 23, 2026 07:01
@github-actions github-actions Bot added the area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework label Apr 23, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to reduce CI flakiness and speed up installs by tuning npm network retry/timeout behavior and adding dependency caching in CI workflows/pipelines.

Changes:

  • Added global npm fetch retry/timeout settings via .npmrc.
  • Added Azure Pipelines Cache@2 steps to cache node_modules keyed by OS + package-lock.json.
  • Enabled npm cache support in GitHub Actions workflows via actions/setup-node caching, and removed a redundant CLI retry flag.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| .npmrc | Increases npm fetch retry/timeout settings to improve CI install reliability. |
| .github/workflows/validate-npm-package-lock-json.yml | Enables actions/setup-node npm caching and removes --fetch-retries from npm ci. |
| .github/workflows/update-jquery-validate.yml | Enables actions/setup-node npm caching for the scheduled dependency update workflow. |
| .azure/pipelines/jobs/default-build.yml | Adds Cache@2 to cache root node_modules after Node installation in both template blocks. |
| .azure/pipelines/components-e2e-tests.yml | Adds Cache@2 for node_modules and skips npm ci when the cache is hit. |

Comment thread .azure/pipelines/jobs/default-build.yml Outdated
Comment on lines +429 to +431
key: '"node_modules" | "$(Agent.OS)" | $(Build.SourcesDirectory)/package-lock.json'
restoreKeys: |
  "node_modules" | "$(Agent.OS)"
Copilot AI Apr 23, 2026

The cache key doesn’t include the Node.js major version. If the UseNode@1 version is updated (or varies between jobs), restoring node_modules from an older Node version can cause subtle failures for packages with native components/optional binaries. Consider including the Node version (e.g., a node24 segment or a variable) in the key/restoreKeys.

Suggested change

Original:
key: '"node_modules" | "$(Agent.OS)" | $(Build.SourcesDirectory)/package-lock.json'
restoreKeys: |
  "node_modules" | "$(Agent.OS)"

Suggested:
key: '"node_modules" | "node24" | "$(Agent.OS)" | $(Build.SourcesDirectory)/package-lock.json'
restoreKeys: |
  "node_modules" | "node24" | "$(Agent.OS)"

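
The reasoning behind adding a Node segment to the key can be illustrated with a small sketch. This is not how Cache@2 computes keys internally; the `cache_key` helper, the SHA-256 digest standing in for the lockfile hash, and the "node26" segment are all assumptions made for illustration.

```python
import hashlib


def cache_key(segments, lockfile_bytes):
    """Join literal segments with a digest of the lockfile, pipe-separated.

    Hypothetical helper: a stand-in for the key/fingerprint a CI cache task builds.
    """
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return " | ".join([*segments, digest])


lock = b'{"lockfileVersion": 3}'  # placeholder lockfile contents

# Key without a Node segment: it does not change when the Node version changes,
# so node_modules built under an older Node would still be restored.
unscoped = cache_key(["node_modules", "Windows_NT"], lock)

# Keys with a Node major segment: a different major yields a different key,
# which results in a cache miss and a fresh install.
node24 = cache_key(["node_modules", "node24", "Windows_NT"], lock)
node26 = cache_key(["node_modules", "node26", "Windows_NT"], lock)

print(node24 != node26)  # True
```

The same property holds for the restoreKeys prefix: without the Node segment, a partial match could still fall back to a cache produced under a different major version.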
Comment on lines +66 to +71
- task: Cache@2
  displayName: Cache node_modules
  inputs:
    key: '"node_modules" | "$(Agent.OS)" | $(Build.SourcesDirectory)/package-lock.json'
    restoreKeys: |
      "node_modules" | "$(Agent.OS)"
Copilot AI Apr 23, 2026

The node_modules cache key is only scoped by OS + lockfile. If the Node version used by the pipeline changes, cached node_modules may become incompatible for packages with native/optional binaries. Consider adding the Node major version (e.g., node24) to the key/restoreKeys to avoid cross-version restores.

Suggested change

Original:
- task: Cache@2
  displayName: Cache node_modules
  inputs:
    key: '"node_modules" | "$(Agent.OS)" | $(Build.SourcesDirectory)/package-lock.json'
    restoreKeys: |
      "node_modules" | "$(Agent.OS)"

Suggested:
- script: |
    NODE_MAJOR_VERSION=$(node -p "process.versions.node.split('.')[0]")
    echo "##vso[task.setvariable variable=NodeMajorVersion]node${NODE_MAJOR_VERSION}"
  displayName: Detect Node major version for cache key
- task: Cache@2
  displayName: Cache node_modules
  inputs:
    key: '"node_modules" | "$(Agent.OS)" | "$(NodeMajorVersion)" | $(Build.SourcesDirectory)/package-lock.json'
    restoreKeys: |
      "node_modules" | "$(Agent.OS)" | "$(NodeMajorVersion)"

uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
with:
  node-version: 20.x
  cache: 'npm'
Copilot AI Apr 23, 2026

actions/setup-node caching the npm cache can allow npm ci to succeed using previously cached tarballs, which may reduce the effectiveness of this workflow as a validation that packages are actually retrievable from the feed. Consider removing cache: 'npm' here, or force a cold cache for this job so missing/unmirrored packages are still detected.

Suggested change

Original:
cache: 'npm'

Suggested:
(line removed)

uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
with:
  node-version: '20.x'
  cache: 'npm'
Copilot AI Apr 23, 2026

actions/setup-node defaults to using the repo-root lockfile to key cache: 'npm'. Since this workflow runs npm install under src/Mvc/build, the cache key/invalidation likely won’t match the dependencies being installed. Consider setting cache-dependency-path: src/Mvc/build/package-lock.json (or dropping caching if you intentionally want a root-level key).

Suggested change

Original:
cache: 'npm'

Suggested:
cache: 'npm'
cache-dependency-path: src/Mvc/build/package-lock.json

Comment thread .azure/pipelines/jobs/default-build.yml Outdated
Comment on lines +188 to +190
key: '"node_modules" | "$(Agent.OS)" | $(Build.SourcesDirectory)/package-lock.json'
restoreKeys: |
  "node_modules" | "$(Agent.OS)"
Copilot AI Apr 23, 2026

The cache key doesn’t include the Node.js major version. If the UseNode@1 version is updated (or varies between jobs), restoring node_modules from an older Node version can cause subtle failures for packages with native components/optional binaries. Consider including the Node version (e.g., a node24 segment or a variable) in the key/restoreKeys.

Suggested change

Original:
key: '"node_modules" | "$(Agent.OS)" | $(Build.SourcesDirectory)/package-lock.json'
restoreKeys: |
  "node_modules" | "$(Agent.OS)"

Suggested:
key: '"node_modules" | "$(Agent.OS)" | "node24" | $(Build.SourcesDirectory)/package-lock.json'
restoreKeys: |
  "node_modules" | "$(Agent.OS)" | "node24"

@javiercn
Member

@copilot fix the issues

… cache-dependency-path

Agent-Logs-Url: https://github.com/dotnet/aspnetcore/sessions/5888b402-1b87-4de3-867f-e5f8933eaf2a

Co-authored-by: javiercn <6995051+javiercn@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 23, 2026

@copilot fix the issues

Fixed all review issues in d384d17:

  • default-build.yml (both public and official blocks): Added "node24" to cache key and restoreKeys to prevent cross-version cache restores.
  • components-e2e-tests.yml: Same — added "node24" to the cache key/restoreKeys.
  • validate-npm-package-lock-json.yml: Removed cache: 'npm' — caching would allow stale tarballs to satisfy installs, undermining the purpose of validating that packages are actually retrievable from the feed.
  • update-jquery-validate.yml: Added cache-dependency-path: src/Mvc/build/package-lock.json so the cache key is keyed to the lockfile for the packages actually being installed.

@copilot fix the issues

Fixed all the issues from the review in commit d384d17:

  1. default-build.yml (both public and official blocks): Added "node24" segment to key/restoreKeys to prevent cross-version cache restores.
  2. components-e2e-tests.yml: Added "node24" segment to key/restoreKeys for the same reason.
  3. validate-npm-package-lock-json.yml: Removed cache: 'npm' so the workflow still validates that packages are actually retrievable from the feed.
  4. update-jquery-validate.yml: Added cache-dependency-path: src/Mvc/build/package-lock.json to correctly scope the npm cache key to the relevant lockfile.

Copilot AI requested a review from javiercn April 23, 2026 07:45
@javiercn
Member

@copilot we are getting ##[error]eng\Npm.Workspace.nodeproj(140,5): error MSB3073: (NETCORE_ENGINEERING_TELEMETRY=Restore) The command "npm ci" exited with code 57005.

from one of the builds. Investigate if there is a way we can make the npm output more visible

Contributor Author

Copilot AI commented Apr 23, 2026

@copilot we are getting ##[error]eng\Npm.Workspace.nodeproj(140,5): error MSB3073: (NETCORE_ENGINEERING_TELEMETRY=Restore) The command "npm ci" exited with code 57005.

from one of the builds. Inves...

Added ConsoleToMsBuild="true" to all npm Exec tasks in Npm.Workspace.nodeproj and Npm.Workspace.FunctionalTests.nodeproj (commit 9a908df). This routes the child process's stdout/stderr through MSBuild's logging infrastructure so npm output (including error messages and install progress) will be visible in CI logs, making it easier to diagnose failures like the exit code 57005.

Copilot AI changed the title from "Improve npm install reliability: add retry config and node_modules caching to CI" to "Improve npm install reliability: add retry config, node_modules caching, and better CI diagnostics" Apr 23, 2026
  displayName: Install Node 24.x
  inputs:
    version: 24.x
- task: Cache@2
Member

@Youssef1313 Youssef1313 Apr 23, 2026

The previous step picks 24.x, which means whatever the latest 24.x release is. Do we risk a new 24.x being released while we keep using an old cache? Do we even care (e.g., could we miss a potential security fix from Node)?

Would it be better to specify the complete version in both the install step and the cache step (as the cache key)? That way:

  1. We get consistent behavior (i.e., a new Node version cannot suddenly break us).
  2. If Node releases any security fixes, we update the version in both the cache key and the install step, which ensures we don't use an outdated cache.
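
The trade-off in this question can be sketched as follows. The helper `node_segment` and the two version strings are hypothetical, chosen only to show how a major-only segment behaves compared with a full-version segment when a new 24.x release appears.

```python
def node_segment(version, full=False):
    """Build a cache-key segment from a Node version string such as 'v24.1.0'.

    Hypothetical helper: full=False keeps only the major version.
    """
    numbers = version.lstrip("v")
    return f"node{numbers}" if full else f"node{numbers.split('.')[0]}"


# Two hypothetical 24.x releases.
older, newer = "v24.1.0", "v24.2.0"

# Major-only segment: both releases share a key, so the cache survives a 24.x update.
print(node_segment(older), node_segment(newer))  # node24 node24

# Full-version segment: any new release changes the key and forces a fresh install.
print(node_segment(older, full=True), node_segment(newer, full=True))  # node24.1.0 node24.2.0
```

With a floating 24.x install step, only the major-only form is available as a static key segment; a full-version segment would require either pinning the version or detecting it at run time.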

Member

I'm curious about this as well

Member

I don't think we have to worry about that. It would be extremely rare and severe if picking up a new node/npm broke the package-lock.json format.

npm ci essentially reads the package-lock.json format and replicates the structure on disk, so I don't think this is something we need to be concerned about.

Member

@wtgodbe wtgodbe left a comment

LGTM other than Youssef's question. FWIW, I'm looking into the flaky "Could not copy" error in CI now

@javiercn javiercn enabled auto-merge (squash) April 23, 2026 17:06
@javiercn javiercn merged commit 4854d07 into main Apr 23, 2026
26 checks passed
@javiercn javiercn deleted the copilot/update-ci-pipeline-reliability branch April 23, 2026 17:26
@dotnet-policy-service dotnet-policy-service Bot added this to the 11.0-preview4 milestone Apr 23, 2026
wtgodbe added a commit to wtgodbe/aspnetcore that referenced this pull request Apr 23, 2026
The npm ci command intermittently fails with exit code 57005 (0xDEAD)
in CI, producing zero diagnostic output. The .npmrc retry settings and
ConsoleToMsBuild added in dotnet#66430 did not help because npm crashes
before writing any error output.

Add pre-flight diagnostics (node/npm version, npm config, disk space)
so the environment is captured even on silent crashes. Switch to
--loglevel verbose so npm logs each network request before the crash
point. Capture the npm debug log file after failure instead of
immediately aborting, since npm writes detailed logs to .npm/_logs/
even on crash.

Relates to dotnet#62807

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
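
The approach described in this commit message (capture the environment up front, then pick up npm's debug log after a failure) could look roughly like the sketch below. It is not the code from the referenced commit: the function names are hypothetical, and the log directory is passed in as a parameter because its location depends on the platform and npm configuration.

```python
import shutil
import subprocess
from pathlib import Path


def tool_version(name):
    """Return `<name> --version` output, or a note if the tool is not on PATH."""
    exe = shutil.which(name)
    if exe is None:
        return "not found"
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or result.stderr.strip()


def preflight(workdir="."):
    """Collect environment facts before running npm, so they survive a silent crash."""
    usage = shutil.disk_usage(workdir)
    return {
        "node": tool_version("node"),
        "npm": tool_version("npm"),
        "free_disk_gb": round(usage.free / 1024**3, 1),
    }


def newest_npm_log(log_dir):
    """Return the most recently modified *.log file in log_dir, or None if there is none."""
    logs = sorted(Path(log_dir).glob("*.log"), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None


def describe_exit_code(code):
    """Render an exit code in decimal and hex, e.g. 57005 -> '57005 (0xDEAD)'."""
    return f"{code} (0x{code:X})"


if __name__ == "__main__":
    print(preflight())
    print(describe_exit_code(57005))  # 57005 (0xDEAD)
```

Printing the pre-flight dictionary before invoking npm means the environment is recorded even if the process exits without writing anything to stdout or stderr.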
wtgodbe added a commit that referenced this pull request Apr 24, 2026
The npm ci command intermittently fails with exit code 57005 (0xDEAD)
in CI, producing zero diagnostic output. The .npmrc retry settings and
ConsoleToMsBuild added in #66430 did not help because npm crashes
before writing any error output.

Add pre-flight diagnostics (node/npm version, npm config, disk space)
so the environment is captured even on silent crashes. Switch to
--loglevel verbose so npm logs each network request before the crash
point. Capture the npm debug log file after failure instead of
immediately aborting, since npm writes detailed logs to .npm/_logs/
even on crash.

Relates to #62807

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@wtgodbe wtgodbe modified the milestones: 11.0-preview4, 11.0-preview5 Apr 24, 2026
Labels

area-infrastructure Includes: MSBuild projects/targets, build scripts, CI, Installers and shared framework


5 participants