
feat: add BakeFailureDisabled and DeploymentBlocked rollout conditions#44

Merged
LittleChimera merged 3 commits into main from recovery-mode-conditions
Apr 30, 2026

Conversation

@LittleChimera
Contributor

Summary

  • Adds BakeFailureDisabled rollout condition: True when the active deploy cannot be failed by health-check errors. Two reasons: PreviousBakeFailed (rollback recovery) or DeployedWithUnhealthyHealthChecks (force-deploy during incident, until bake starts).
  • Adds DeploymentBlocked rollout condition: True when automatic deploys are blocked by unhealthy health checks. Set independently of gate blocking so both blockers can surface concurrently — fixes the regression where the gate early-return left DeploymentBlocked stale.
  • Adds DeploymentHistoryEntry.deployedWithUnhealthyHealthChecks field, set when a manual deploy lands while any health check is Unhealthy. handleBakeTime honors the flag in both shouldFail paths (deploy timeout + health-check error) until BakeStartTime is set.
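The reason-selection logic described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual helper: the type, constant, and function names (`Condition`, `bakeFailureDisabledCondition`, the `BakeFailureEnabled` false-reason) are hypothetical, and the real controller uses `metav1.Condition` from k8s.io/apimachinery.

```go
package main

import "fmt"

// Condition is a stand-in for metav1.Condition, reduced to the fields
// this sketch needs.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

const (
	condBakeFailureDisabled  = "BakeFailureDisabled"
	reasonPreviousBakeFailed = "PreviousBakeFailed"
	reasonDeployedUnhealthy  = "DeployedWithUnhealthyHealthChecks"
)

// bakeFailureDisabledCondition picks the condition for a new deploy:
// rollback recovery after a non-succeeded predecessor bake, or a manual
// force-deploy while at least one health check is Unhealthy.
func bakeFailureDisabledCondition(previousBakeSucceeded, manualDeploy, anyHealthCheckUnhealthy bool) Condition {
	switch {
	case !previousBakeSucceeded:
		return Condition{condBakeFailureDisabled, "True", reasonPreviousBakeFailed}
	case manualDeploy && anyHealthCheckUnhealthy:
		return Condition{condBakeFailureDisabled, "True", reasonDeployedUnhealthy}
	default:
		// "BakeFailureEnabled" is an invented placeholder reason.
		return Condition{condBakeFailureDisabled, "False", "BakeFailureEnabled"}
	}
}

func main() {
	c := bakeFailureDisabledCondition(true, true, true)
	fmt.Println(c.Status, c.Reason) // True DeployedWithUnhealthyHealthChecks
}
```

Note that `PreviousBakeFailed` wins over the force-deploy reason: rollback recovery is checked first, matching the ordering implied by the summary.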

Why

Users had no observable signal that a rollout was in recovery mode (won't fail) or paused (waiting on health checks). When force-deploying into an active incident, transient health-check errors that re-fire after the new deploy were flipping the rollout to Failed — these incidents were known to the operator and shouldn't fail recovery.

Test plan

  • Unit + integration tests in internal/controller/recovery_mode_test.go (22 specs):
    • setBakeFailureDisabledCondition: empty/single/normal/cancelled-predecessor/recovery-flag/terminal states.
    • setDeploymentBlockedCondition: manual / unhealthy / healthy.
    • End-to-end: flag set on force-deploy with unhealthy HC; not set when HC healthy or for automatic deploys; rollout doesn't fail via deploy timeout or HC error while flag set + bake not started; rollout DOES fail once bake has started; DeploymentBlocked surfaces through gate blocking.
  • make test (236/236 specs pass).
  • Verified live on dev cluster (hello-multi-dev):
    • Force-deploy during incident: new entry got deployedWithUnhealthyHealthChecks=true; controller logged Current entry was deployed with unhealthy health checks and bake has not started; not failing despite health check error; BakeFailureDisabled=True (DeployedWithUnhealthyHealthChecks) with the correct message.
    • DeploymentBlocked correctly transitioned through ManualDeployment → UnhealthyHealthChecks → HealthChecksHealthy.
    • BakeFailureDisabled=True (PreviousBakeFailed) set correctly when a new deploy was started while predecessor was Cancelled, and cleared back to False after the bake completed.

🤖 Generated with Claude Code

Surface two recovery-mode signals that previously had no observable
state:

- BakeFailureDisabled: True when an active deploy cannot be failed by
  health check errors. Either because the previous bake was not
  Succeeded (rollback recovery), or because a manual deploy proceeded
  over unhealthy health checks (force-deploy during incident) and bake
  has not yet started.

- DeploymentBlocked: True when automatic deployment of new versions is
  blocked by unhealthy health checks. Set independently of gate
  blocking so both blockers surface concurrently in the UI.

To support the force-deploy-during-incident case, history entries
created via manual deploy while at least one health check is Unhealthy
are tagged with DeployedWithUnhealthyHealthChecks=true. handleBakeTime
treats these entries as recovery mode (shouldFail=false) for both the
deploy-timeout and health-check-error paths until BakeStartTime is set,
so transient errors during the incident do not fail a deployment that
was knowingly issued into a broken state.
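The exemption this commit describes reduces to a small guard. A minimal sketch, assuming a boolean entry flag and a bake-started check as stated above (the function name is illustrative; note a later commit in this PR replaces the entry-level flag with the BakeFailureDisabled condition as the single source of truth):

```go
package main

import "fmt"

// shouldFailDeploy gates both failure paths (deploy timeout and
// health-check error): an entry tagged DeployedWithUnhealthyHealthChecks
// is exempt until its bake starts, because the operator knowingly
// deployed into a broken state.
func shouldFailDeploy(deployedWithUnhealthyHC, bakeStarted, failureSignal bool) bool {
	if deployedWithUnhealthyHC && !bakeStarted {
		return false // recovery mode: transient incident errors must not fail the deploy
	}
	return failureSignal
}

func main() {
	fmt.Println(shouldFailDeploy(true, false, true)) // false: exempt before bake start
	fmt.Println(shouldFailDeploy(true, true, true))  // true: exemption lifts once bake starts
}
```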

Adds 22 specs in recovery_mode_test.go covering the condition helpers,
the recovery flag's effect on shouldFail in both paths, and the
DeploymentBlocked + gate concurrent-block regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LittleChimera and others added 2 commits April 29, 2026 16:16
Drop the DeployedWithUnhealthyHealthChecks field on DeploymentHistoryEntry
and use the BakeFailureDisabled condition as the single source of truth
for whether the active deploy is in recovery mode.

The condition is set once in deployRelease when the new history entry is
created, with the reason determined at creation time (PreviousBakeFailed
or DeployedWithUnhealthyHealthChecks). It persists unchanged for the
entry's lifetime and is overwritten only when the next deploy starts.

handleBakeTime's shouldFail logic now reads the condition directly
(via meta.IsStatusConditionTrue) instead of recomputing from history
state and an entry-level flag. Both deploy-timeout and HC-error paths
become a one-liner.
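The resulting one-liner looks roughly like this. The `isStatusConditionTrue` helper below mimics `meta.IsStatusConditionTrue` from k8s.io/apimachinery/pkg/api/meta (which the refactor calls directly) so the sketch stays self-contained; the `shouldFail` expression is an assumed shape, not the exact source line.

```go
package main

import "fmt"

// Condition is a stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string
}

// isStatusConditionTrue mimics meta.IsStatusConditionTrue: it reports
// whether the condition of the given type exists with Status "True".
func isStatusConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []Condition{{Type: "BakeFailureDisabled", Status: "True"}}
	healthCheckError := true

	// Each shouldFail path collapses to one expression over the condition:
	shouldFail := healthCheckError && !isStatusConditionTrue(conds, "BakeFailureDisabled")
	fmt.Println(shouldFail) // false: the condition suppresses failure
}
```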

Behavioral note: the previous design lifted the force-deploy-during-
incident exemption once BakeStartTime was set. With this refactor the
exemption persists for the entire deploy lifetime, matching the
"don't change until next rollout starts" semantics.

Tests rewritten to reflect the new model (14 specs in
recovery_mode_test.go); full suite 228/228 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ments

deployRelease no longer inlines the recovery-mode determination — moves
~50 lines of bookkeeping into a focused helper that mirrors
setDeploymentBlockedCondition. Trimmed verbose explanatory comments in
the main reconcile loop and handleBakeTime; the why is captured in the
helpers' godoc.

Behavior unchanged; 228/228 tests still passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LittleChimera LittleChimera merged commit 75123c1 into main Apr 30, 2026
3 of 7 checks passed
