fix: prefer Stalled=True as failure witness, fall back to newest condition #45

Merged
LittleChimera merged 1 commit into main from hc-witness-stalled-fallback
May 2, 2026

@LittleChimera
Contributor

getFailureConditionTime previously scanned for any "failure-indicating" condition (Stalled=True, Ready=False, Progressing=False, etc.) and returned the latest LastTransitionTime among them. That worked for resources with a stable failure signal but mis-attributed the witness on resources whose "failure" conditions cycle (e.g. Kruise Rollout's Stalled flips False→True during retry-clear-set, then back). The cycling timestamp leaked into HealthCheck.LastErrorTime and made post-retry failures look fresh to the rollout-controller's bake check.

Switch to a layered witness:

  1. Prefer Stalled=True.LastTransitionTime when present — the kstatus-standard authoritative failure signal.
  2. Otherwise fall back to the newest LastTransitionTime across any condition, which gives the caller a stable witness instead of nil for resources that don't expose Stalled.

Stops the "every-condition-is-failure" scan and removes the misleading isFailureCondition helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LittleChimera LittleChimera merged commit 62e36d4 into main May 2, 2026
3 of 7 checks passed