
fix(kube-client): propagate abort signal through list requests #2954

Open
petersutter wants to merge 6 commits into master from fix/propagate-abort-signal-through-list-requests

Conversation

@petersutter
Member

@petersutter petersutter commented May 11, 2026

How to categorize this PR?
/area robustness
/kind bug

What this PR does / why we need it:

Symptom observed: one of two dashboard pods failed to populate the Shoot backend cache on startup. The list request for Shoots never received a response and the reflector stalled silently — no error, no retry, no timeout.

Root causes addressed:

  1. ListWatcher.list() did not forward the abort signal to the list function, so a hung stream could not be cancelled — unlike watch(), which already passed it correctly.
  2. The get mixins (ClusterScoped.Readable.get / NamespaceScoped.Readable.get) destructured signal but never forwarded it — same class of bug as 1.
  3. Client.fetch() had no enforced request timeout. The existing responseTimeout option was defined but never wired up anywhere, so an HTTP/2 stream that delivered :status and then stalled mid-body would hang forever.

The fix:

  • Forward the abort signal through ListWatcher.list() and the cluster-/namespace-scoped get mixins.
  • Replace responseTimeout with a total requestTimeout covering session establishment, headers, and body. Implementation uses AbortSignal.timeout(ms) combined with the caller's signal via AbortSignal.any([...]).
  • Map Node HTTP/2's generic AbortError to TimeoutError (code: 'ETIMEDOUT') at every delayed-error surface — getHeaders(), body(), and async iterator — by identity-matching the package-created timeout signal's reason; the raw abort is preserved as cause.
  • stream() opts out via requestTimeout: 0 so watches stay long-lived. Per-call options still override.
  • Default 60 s.
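The combination of signals described in the list above can be sketched as follows. This is a minimal illustration, not the code from `packages/request/lib/Client.js`: the helper names `createRequestSignal`, `mapTimeoutError`, and `fetchWithTimeout`, the `doRequest` callback, and the local `TimeoutError` class are assumptions made for the example. Only `AbortSignal.timeout()`, `AbortSignal.any()`, the `ETIMEDOUT` code, and the 60 s default come from the description. The sketch covers the direct-identity match only.

```javascript
// Hypothetical stand-in for the package's TimeoutError (the real class may differ).
class TimeoutError extends Error {
  constructor (message, options) {
    super(message, options)
    this.name = 'TimeoutError'
    this.code = 'ETIMEDOUT'
  }
}

// Build the signal handed to the transport. A requestTimeout of 0 disables the timer,
// which is how a long-lived watch stream would opt out.
function createRequestSignal (callerSignal, requestTimeout = 60000) {
  const timeoutSignal = requestTimeout > 0 ? AbortSignal.timeout(requestTimeout) : undefined
  const signals = [callerSignal, timeoutSignal].filter(Boolean)
  const signal = signals.length > 1 ? AbortSignal.any(signals) : signals[0]
  return { signal, timeoutSignal }
}

// Map an abort to TimeoutError only if the error IS the reason of our own timeout signal.
// Aborts triggered by the caller pass through unchanged.
function mapTimeoutError (err, timeoutSignal, requestTimeout) {
  if (timeoutSignal?.aborted && err === timeoutSignal.reason) {
    return new TimeoutError(`Request timed out after ${requestTimeout} ms`, { cause: err })
  }
  return err
}

// doRequest is a hypothetical transport function that rejects with signal.reason on abort.
async function fetchWithTimeout (doRequest, { signal: callerSignal, requestTimeout = 60000 } = {}) {
  const { signal, timeoutSignal } = createRequestSignal(callerSignal, requestTimeout)
  try {
    return await doRequest({ signal })
  } catch (err) {
    throw mapTimeoutError(err, timeoutSignal, requestTimeout)
  }
}
```

Matching on the identity of the timeout signal's reason, rather than on the error name, keeps a caller-initiated abort distinguishable from a timeout even though both surface as aborts.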

This PR also introduces the KUBE_CLIENT_REQUEST_TIMEOUT environment variable that sets the default requestTimeout for all kube-client instances (dashboard client, per-user clients, and derived kubeconfig clients). Per-client options can still override the default. The Helm chart renders .Values.global.dashboard.kubeClient.requestTimeout into the container environment; 0 is rendered (disable) rather than skipped as falsy. Malformed env-var values throw at module load so misconfiguration surfaces immediately.
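A fail-fast parser for the environment variable could look roughly like this. The function name `parseRequestTimeout` and the uint32 upper bound are taken from the review walkthrough further down; the exact error types, messages, and the decision to reject an empty string are assumptions made for this sketch.

```javascript
// Upper bound mentioned in the walkthrough ("non-negative uint32").
const MAX_UINT32 = 2 ** 32 - 1

// Returns undefined when the variable is unset, a non-negative integer otherwise.
// Throws on anything malformed so that misconfiguration surfaces at module load.
function parseRequestTimeout (value) {
  if (value === undefined) {
    return undefined // unset: the built-in default applies
  }
  // Assumption: only plain decimal digits are accepted ('', '-1', '1.5', '1e3' are rejected).
  if (!/^\d+$/.test(value)) {
    throw new TypeError(`KUBE_CLIENT_REQUEST_TIMEOUT must be a non-negative integer, got "${value}"`)
  }
  const ms = Number(value)
  if (ms > MAX_UINT32) {
    throw new RangeError(`KUBE_CLIENT_REQUEST_TIMEOUT must not exceed ${MAX_UINT32}, got "${value}"`)
  }
  return ms // 0 is a valid value and means "disabled"
}
```

Calling it once at module load, for example as `parseRequestTimeout(process.env.KUBE_CLIENT_REQUEST_TIMEOUT)`, gives every client the same default and makes a bad value fail the process start rather than a later request.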

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
As a possible future follow-up, we could evaluate whether the package-level config-dependent singletons should be replaced by an explicit createClientSet() factory that receives requestTimeout and other transport options as constructor arguments. In that model, the backend would instantiate the client set at startup from its loaded config and inject it into route handlers, services, and hooks. This is intentionally not part of this PR: it would touch every backend module that imports from @gardener-dashboard/kube-client and should only be considered as a separate refactoring effort if we decide the reduced coupling is worth the larger change surface.

Release note:

Fix an issue where the dashboard backend cache could stop updating if Kubernetes API list requests became unresponsive.
Add `KUBE_CLIENT_REQUEST_TIMEOUT` environment variable to configure the default total request timeout (in milliseconds) for Kubernetes API requests made by the dashboard backend. Defaults to 60000; set to 0 to disable. When using the Helm chart, this can be configured via `.Values.global.dashboard.kubeClient.requestTimeout`.

Summary by CodeRabbit

  • New Features

    • Introduce configurable request timeout (env/config) with default 60000ms; can be set to 0 to disable.
    • Dashboard deployment now exposes KUBE_CLIENT_REQUEST_TIMEOUT to the container.
  • Bug Fixes

    • List operations now receive abort signals like watch operations.
    • More consistent timeout handling and error mapping for timed-out requests.
    • Ensure request option objects are always provided to session requests.
  • Tests

    • Added/updated tests for timeout behavior, abort-signal forwarding, and env parsing.

Review Change Stack

@gardener-prow gardener-prow Bot added the area/robustness Robustness, reliability, resilience related label May 11, 2026
@gardener-prow

gardener-prow Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign petersutter for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow Bot added kind/bug Bug cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 11, 2026
@coderabbitai

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Parses KUBE_CLIENT_REQUEST_TIMEOUT into the client defaults; validates and forwards AbortSignals in the kube-client read paths; ListWatcher.list forwards its signal when one is set; Client.fetch enforces an abort driven by requestTimeout and maps it to TimeoutError; SessionPool.request defaults its options argument; tests and Helm chart rendering are updated accordingly.

Changes

Kube-client requestTimeout and signal propagation

| Layer | File(s) | Summary |
|---|---|---|
| parseRequestTimeout and Client defaultOptions | `packages/kube-client/lib/index.js` | Parses `KUBE_CLIENT_REQUEST_TIMEOUT`, validates it as a non-negative uint32, merges into module `defaultOptions`, and applies per-client overrides; `createClient`/`createDashboardClient` now default `options = {}` and `dashboardClient` is created via `createDashboardClient()`. |
| Signal Validation in Readable mixins | `packages/kube-client/lib/mixins.js` | `ClusterScoped.Readable.get/list`, `NamespaceScoped.Readable.get/list`, and `listAllNamespaces` call `assertSignal(signal)` and forward the validated signal into internal requests. |
| ListWatcher Signal Forwarding | `packages/kube-client/lib/cache/ListWatcher.js`, `packages/kube-client/__tests__/cache.list-watcher.spec.js` | `ListWatcher.list(query)` conditionally attaches `this.signal` to the options passed to `listFunc` when set; unit test verifies the forwarded signal and merged `searchParams`. |
| kube-client env propagation tests | `packages/kube-client/__tests__/index.spec.js` | Tests assert `KUBE_CLIENT_REQUEST_TIMEOUT` propagation into `request.extend` for package/dashboard/derived clients, per-client overrides (including `0`), and fail-fast on invalid env values. |
| Helm values and chart test | `charts/gardener-dashboard/values.yaml`, `charts/gardener-dashboard/charts/runtime/templates/dashboard/deployment.yaml`, `charts/__tests__/gardener-dashboard/runtime/dashboard/deployment.spec.js` | Adds the `global.dashboard.kubeClient.requestTimeout` value, conditionally injects `KUBE_CLIENT_REQUEST_TIMEOUT` into the container env when set, and adds chart tests verifying rendered env values (e.g., `'30000'` and `'0'`). |

Request client requestTimeout implementation

| Layer | File(s) | Summary |
|---|---|---|
| SessionPool.request default options | `packages/request/lib/SessionPool.js` | `SessionPool.request(headers, options = {})` defaults `options` to `{}` so `session.request` never receives `undefined`. |
| TimeoutError changes & Client import | `packages/request/lib/errors.js`, `packages/request/lib/Client.js` | `TimeoutError` constructor now accepts `(message, options)`; `Client` imports `TimeoutError` for mapping timeout-origin aborts. |
| Client.fetch requestTimeout implementation | `packages/request/lib/Client.js` | `Client.fetch()` accepts `requestTimeout` (default `this.#options.requestTimeout ?? 60000`), creates a timeout `AbortSignal`, combines it with the caller signal, applies it to `agent.request`, maps timeout aborts to `TimeoutError`, and removes the old `responseTimeout` getter; `stream()` forces `requestTimeout: 0`. |
| Unit and acceptance tests for requestTimeout | `packages/request/__tests__/client.spec.js`, `packages/request/__tests__/acceptance.spec.js` | Adds an acceptance `/delay` route that never responds and tests asserting `TimeoutError` for stalled requests; updates client unit tests to assert timeout mapping, the default timeout value, and `requestTimeout: 0` behavior for streams. |

🎯 4 (Complex) | ⏱️ ~45 minutes

"I set a signal and watch the clock,
If headers lag, I give a knock.
Streams fold neat, timeouts named with care,
Env and charts pass settings everywhere. 🐰"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix(kube-client): propagate abort signal through list requests' is a core change, but the PR also fixes broader issues including implementing a request timeout mechanism and supporting a new environment variable. |
| Description check | ✅ Passed | The PR description is well-structured with clear sections for categorization, root causes, fixes, special notes, and two detailed release notes covering both bugfix and feature aspects. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/request/__tests__/client.spec.js`:
- Around line 247-282: The test is using Jest APIs that don't exist under
Vitest; replace jest.fn() with vi.fn() for mocking (e.g., the mock for
stream.getHeaders and stream.destroy) and replace jest.advanceTimersByTime(...)
with vi.advanceTimersByTime(...) so the timer fast-forward works under Vitest;
update any other jest.* usages in this test (references around client.fetch,
stream.getHeaders, stream.destroy) to their vi equivalents so the test runs with
Vitest.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d7ca7124-6130-4b6d-a38e-dca1cc9a598f

📥 Commits

Reviewing files that changed from the base of the PR and between d2a7ab9 and 150fcc9.

📒 Files selected for processing (6)
  • packages/kube-client/__tests__/cache.list-watcher.spec.js
  • packages/kube-client/lib/cache/ListWatcher.js
  • packages/kube-client/lib/mixins.js
  • packages/request/__tests__/client.spec.js
  • packages/request/lib/Client.js
  • packages/request/lib/SessionPool.js

Comment thread packages/request/__tests__/client.spec.js Outdated
@petersutter petersutter force-pushed the fix/propagate-abort-signal-through-list-requests branch 2 times, most recently from f1f6bbb to 373a442 Compare May 12, 2026 08:30
@gardener-prow gardener-prow Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 14, 2026
@petersutter petersutter force-pushed the fix/propagate-abort-signal-through-list-requests branch from 88b3c9c to 71a4d63 Compare May 19, 2026 14:35
@gardener-prow gardener-prow Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 19, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/request/lib/Client.js`:
- Around line 44-67: The createTimeoutSignal currently allows values up to
MAX_UINT32 which exceeds Node.js timer clamp and causes very short timeouts;
update createTimeoutSignal to validate as before but clamp requestTimeout to the
Node.js max timer value (const MAX_TIMER = 0x7FFFFFFF) before calling
AbortSignal.timeout (i.e., if requestTimeout > MAX_TIMER set requestTimeout =
MAX_TIMER), keep the existing non-negative integer check and use
AbortSignal.timeout(clampedTimeout); reference the createTimeoutSignal function
and MAX_UINT32 constant when making the change.
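The clamping suggested in this comment might be sketched as below. Node's `setTimeout` documents that a delay larger than 2147483647 ms is set to 1 ms; whether `AbortSignal.timeout()` inherits that behaviour is the review comment's claim and is not verified here. `createTimeoutSignal` and `MAX_TIMER` are the names used in the comment; `clampTimeout` and the error message are assumptions for illustration.

```javascript
// Largest delay Node's timers accept before falling back to 1 ms (2^31 - 1).
const MAX_TIMER = 0x7FFFFFFF

// Hypothetical helper: keep the value inside the range timers can represent.
function clampTimeout (requestTimeout) {
  return Math.min(requestTimeout, MAX_TIMER)
}

// Returns undefined for 0 (timeout disabled), otherwise a timeout signal
// whose delay never exceeds MAX_TIMER.
function createTimeoutSignal (requestTimeout) {
  if (!Number.isInteger(requestTimeout) || requestTimeout < 0) {
    throw new TypeError('requestTimeout must be a non-negative integer')
  }
  if (requestTimeout === 0) {
    return undefined
  }
  return AbortSignal.timeout(clampTimeout(requestTimeout))
}
```

With the clamp in place, a configured value between 2^31 and 2^32 - 1 yields a timeout of roughly 24.8 days instead of an almost immediate abort.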

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a43aaf53-b84f-4142-9541-9f3852e90cd6

📥 Commits

Reviewing files that changed from the base of the PR and between 88b3c9c and 71a4d63.

📒 Files selected for processing (11)
  • charts/__tests__/gardener-dashboard/runtime/dashboard/deployment.spec.js
  • charts/gardener-dashboard/charts/runtime/templates/dashboard/deployment.yaml
  • charts/gardener-dashboard/values.yaml
  • packages/kube-client/__tests__/index.spec.js
  • packages/kube-client/__tests__/mixins.spec.js
  • packages/kube-client/lib/index.js
  • packages/kube-client/lib/mixins.js
  • packages/request/__tests__/acceptance.spec.js
  • packages/request/__tests__/client.spec.js
  • packages/request/lib/Client.js
  • packages/request/lib/errors.js

Comment thread packages/request/lib/Client.js Outdated
ListWatcher.list() did not forward this.signal, so list requests
could not be aborted by the reflector, unlike watch(), which already
passes it. Mirror watch()'s behaviour and add assertSignal() guards
to the list mixins so a missing signal is caught at the boundary.

Callers passed `signal` into the cluster- and namespace-scoped `get`
mixins, but the mixins destructured it without forwarding it to the
underlying request. Add `assertSignal(signal)` and propagate it so a hung
get can be cancelled via `AbortController`.
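The two forwarding fixes described in these commit messages can be illustrated with a simplified model. The class and function shapes below are assumptions for the example and do not reproduce `ListWatcher.js` or `mixins.js`; in particular, this `assertSignal` only rejects values that are not an `AbortSignal`, which may differ from the real guard.

```javascript
// Hypothetical guard: accept undefined or a real AbortSignal, reject anything else.
function assertSignal (signal) {
  if (signal !== undefined && !(signal instanceof AbortSignal)) {
    throw new TypeError('The "signal" option must be an instance of AbortSignal')
  }
}

// Simplified ListWatcher: list() now mirrors watch() and attaches the signal when set.
class ListWatcher {
  constructor (listFunc, searchParams = {}) {
    this.listFunc = listFunc
    this.searchParams = searchParams
    this.signal = undefined
  }

  setAbortSignal (signal) {
    this.signal = signal
  }

  list (query = {}) {
    const options = { searchParams: { ...this.searchParams, ...query } }
    if (this.signal) {
      options.signal = this.signal // previously dropped, so a hung list could not be cancelled
    }
    return this.listFunc(options)
  }
}

// Simplified readable mixin: the destructured signal is validated AND forwarded.
function createReadable (request) {
  return {
    get ({ name, signal } = {}) {
      assertSignal(signal)
      return request(name, { method: 'get', signal })
    }
  }
}
```

The bug class in both places is the same: the signal was available in scope but never reached the underlying request, so aborting the controller had no effect on the stalled call.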
@petersutter petersutter force-pushed the fix/propagate-abort-signal-through-list-requests branch from 71a4d63 to e71fc89 Compare May 19, 2026 15:14
Replace the unenforced `responseTimeout` (header-phase only) with a
total `requestTimeout` covering session establishment, headers, and
body. Body and async-iterator paths previously had no timeout: an
HTTP/2 stream that delivered `:status` then stalled mid-body would
hang forever.

Use `AbortSignal.timeout(ms)` combined with the caller's signal via
`AbortSignal.any([...])`. Map the resulting `AbortError` to
`TimeoutError` (`code: 'ETIMEDOUT'`) at every delayed-error surface
by identity-matching the timeout signal's reason; preserve the raw
abort as `cause`.

`stream()` opts out via `requestTimeout: 0` so watches stay
long-lived. Per-call options still override. Default 60 s. Drop the
dead `responseTimeout` getter on `Client`.

Read `KUBE_CLIENT_REQUEST_TIMEOUT` at module load and pass it as the
default `requestTimeout` to all clients (dashboard, user, derived
kubeconfig). Per-client options override, including
`requestTimeout: 0` to disable. Explicit `undefined` falls back to the
env default rather than masking it.

Validate as a non-negative integer in the `AbortSignal.timeout()`
range. Throw at module load on a malformed value so misconfiguration
surfaces immediately.
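The override rule in this commit message (per-client value wins, `0` is kept, explicit `undefined` falls back to the env default) can be sketched with a destructuring default. `resolveClientOptions` is a hypothetical name and the option shape is an assumption; the real merge in `packages/kube-client/lib/index.js` may be structured differently.

```javascript
// envDefault is the parsed KUBE_CLIENT_REQUEST_TIMEOUT value (or undefined when unset).
function resolveClientOptions (envDefault, options = {}) {
  // A destructuring default applies to both a missing key and an explicit undefined,
  // but not to 0, so "disable" survives while undefined does not mask the env default.
  const { requestTimeout = envDefault, ...rest } = options
  return requestTimeout === undefined ? rest : { ...rest, requestTimeout }
}

// For contrast: a plain spread lets an explicit undefined overwrite the default.
function naiveMerge (envDefault, options = {}) {
  return { requestTimeout: envDefault, ...options }
}
```

For example, `resolveClientOptions(30000, { requestTimeout: undefined })` keeps the env default, whereas `naiveMerge` with the same arguments ends up with `requestTimeout: undefined`.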
Render `KUBE_CLIENT_REQUEST_TIMEOUT` from
`.Values.global.dashboard.kubeClient.requestTimeout`. Use `ne ... nil`
so that `0` (disable) is rendered, not skipped as falsy.
@petersutter petersutter force-pushed the fix/propagate-abort-signal-through-list-requests branch from e71fc89 to b012be0 Compare May 19, 2026 15:33
@gardener-prow gardener-prow Bot added cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. and removed cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. labels May 19, 2026
The two-shape check (direct reason vs ABORT_ERR wrapping) is
non-obvious — node delivers timeout-triggered aborts differently
depending on which await throws. JSDoc and inline comments
clarify intent. Helpers are also exposed so the branches can be
covered by direct unit tests.
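A helper for the two-shape check mentioned in this commit message might look like the sketch below. The name `isTimeoutAbort` is hypothetical, and the assumption that the wrapping error carries `code: 'ABORT_ERR'` with the signal's reason as `cause` is an interpretation of "ABORT_ERR wrapping", not something confirmed by the PR text.

```javascript
// Returns true only when err originates from the given timeout signal.
function isTimeoutAbort (err, timeoutSignal) {
  const reason = timeoutSignal?.reason
  if (reason === undefined) {
    return false // the timeout signal has not fired (or timeouts are disabled)
  }
  // Shape 1: the awaited operation rejects with the signal's reason itself.
  if (err === reason) {
    return true
  }
  // Shape 2 (assumed): an AbortError with code ABORT_ERR that wraps the reason as its cause.
  return err?.code === 'ABORT_ERR' && err.cause === reason
}
```

Keeping the check in a small exported function makes both branches reachable from a unit test without having to provoke each delivery path through a real HTTP/2 stream.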

Labels

area/robustness Robustness, reliability, resilience related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/bug Bug size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.


1 participant