
feat: add BakeFailureDisabled and DeploymentBlocked rollout conditions#44

Merged
LittleChimera merged 3 commits into main from recovery-mode-conditions
Apr 30, 2026

Conversation

@LittleChimera
Contributor

Summary

  • Adds BakeFailureDisabled rollout condition: True when the active deploy cannot be failed by health-check errors. Two reasons: PreviousBakeFailed (rollback recovery) or DeployedWithUnhealthyHealthChecks (force-deploy during incident, until bake starts).
  • Adds DeploymentBlocked rollout condition: True when automatic deploys are blocked by unhealthy health checks. Set independently of gate blocking so both blockers can surface concurrently — fixes the regression where the gate early-return left DeploymentBlocked stale.
  • Adds DeploymentHistoryEntry.deployedWithUnhealthyHealthChecks field, set when a manual deploy lands while any health check is Unhealthy. handleBakeTime honors the flag in both shouldFail paths (deploy timeout + health-check error) until BakeStartTime is set.
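The reason-selection logic described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual helper: the type, constant, and function names (`Condition`, `bakeFailureDisabledCondition`, the `BakeFailureEnabled` false-reason) are hypothetical, and the real controller uses `metav1.Condition` from k8s.io/apimachinery.

```go
package main

import "fmt"

// Condition is a stand-in for metav1.Condition, reduced to the fields
// this sketch needs.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

const (
	condBakeFailureDisabled  = "BakeFailureDisabled"
	reasonPreviousBakeFailed = "PreviousBakeFailed"
	reasonDeployedUnhealthy  = "DeployedWithUnhealthyHealthChecks"
)

// bakeFailureDisabledCondition picks the condition for a new deploy:
// rollback recovery after a non-succeeded predecessor bake, or a manual
// force-deploy while at least one health check is Unhealthy.
func bakeFailureDisabledCondition(previousBakeSucceeded, manualDeploy, anyHealthCheckUnhealthy bool) Condition {
	switch {
	case !previousBakeSucceeded:
		return Condition{condBakeFailureDisabled, "True", reasonPreviousBakeFailed}
	case manualDeploy && anyHealthCheckUnhealthy:
		return Condition{condBakeFailureDisabled, "True", reasonDeployedUnhealthy}
	default:
		// "BakeFailureEnabled" is an invented placeholder reason.
		return Condition{condBakeFailureDisabled, "False", "BakeFailureEnabled"}
	}
}

func main() {
	c := bakeFailureDisabledCondition(true, true, true)
	fmt.Println(c.Status, c.Reason) // True DeployedWithUnhealthyHealthChecks
}
```

Note that `PreviousBakeFailed` wins over the force-deploy reason: rollback recovery is checked first, matching the ordering implied by the summary.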

Why

Users had no observable signal that a rollout was in recovery mode (won't fail) or paused (waiting on health checks). When force-deploying into an active incident, transient health-check errors that re-fire after the new deploy were flipping the rollout to Failed — these incidents were known to the operator and shouldn't fail recovery.

Test plan

  • Unit + integration tests in internal/controller/recovery_mode_test.go (22 specs):
    • setBakeFailureDisabledCondition: empty/single/normal/cancelled-predecessor/recovery-flag/terminal states.
    • setDeploymentBlockedCondition: manual / unhealthy / healthy.
    • End-to-end: flag set on force-deploy with unhealthy HC; not set when HC healthy or for automatic deploys; rollout doesn't fail via deploy timeout or HC error while flag set + bake not started; rollout DOES fail once bake has started; DeploymentBlocked surfaces through gate blocking.
  • make test (236/236 specs pass).
  • Verified live on dev cluster (hello-multi-dev):
    • Force-deploy during incident: new entry got deployedWithUnhealthyHealthChecks=true; controller logged Current entry was deployed with unhealthy health checks and bake has not started; not failing despite health check error; BakeFailureDisabled=True (DeployedWithUnhealthyHealthChecks) with the correct message.
    • DeploymentBlocked correctly transitioned through ManualDeployment → UnhealthyHealthChecks → HealthChecksHealthy.
    • BakeFailureDisabled=True (PreviousBakeFailed) set correctly when a new deploy was started while predecessor was Cancelled, and cleared back to False after the bake completed.

🤖 Generated with Claude Code

Surface two recovery-mode signals that previously had no observable
state:

- BakeFailureDisabled: True when an active deploy cannot be failed by
  health check errors. Either because the previous bake was not
  Succeeded (rollback recovery), or because a manual deploy proceeded
  over unhealthy health checks (force-deploy during incident) and bake
  has not yet started.

- DeploymentBlocked: True when automatic deployment of new versions is
  blocked by unhealthy health checks. Set independently of gate
  blocking so both blockers surface concurrently in the UI.

To support the force-deploy-during-incident case, history entries
created via manual deploy while at least one health check is Unhealthy
are tagged with DeployedWithUnhealthyHealthChecks=true. handleBakeTime
treats these entries as recovery mode (shouldFail=false) for both the
deploy-timeout and health-check-error paths until BakeStartTime is set,
so transient errors during the incident do not fail a deployment that
was knowingly issued into a broken state.
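The exemption this commit describes reduces to a small guard. A minimal sketch, assuming a boolean entry flag and a bake-started check as stated above (the function name is illustrative; note a later commit in this PR replaces the entry-level flag with the BakeFailureDisabled condition as the single source of truth):

```go
package main

import "fmt"

// shouldFailDeploy gates both failure paths (deploy timeout and
// health-check error): an entry tagged DeployedWithUnhealthyHealthChecks
// is exempt until its bake starts, because the operator knowingly
// deployed into a broken state.
func shouldFailDeploy(deployedWithUnhealthyHC, bakeStarted, failureSignal bool) bool {
	if deployedWithUnhealthyHC && !bakeStarted {
		return false // recovery mode: transient incident errors must not fail the deploy
	}
	return failureSignal
}

func main() {
	fmt.Println(shouldFailDeploy(true, false, true)) // false: exempt before bake start
	fmt.Println(shouldFailDeploy(true, true, true))  // true: exemption lifts once bake starts
}
```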

Adds 22 specs in recovery_mode_test.go covering the condition helpers,
the recovery flag's effect on shouldFail in both paths, and the
DeploymentBlocked + gate concurrent-block regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LittleChimera and others added 2 commits April 29, 2026 16:16
Drop the DeployedWithUnhealthyHealthChecks field on DeploymentHistoryEntry
and use the BakeFailureDisabled condition as the single source of truth
for whether the active deploy is in recovery mode.

The condition is set once in deployRelease when the new history entry is
created, with the reason determined at creation time (PreviousBakeFailed
or DeployedWithUnhealthyHealthChecks). It persists unchanged for the
entry's lifetime and is overwritten only when the next deploy starts.

handleBakeTime's shouldFail logic now reads the condition directly
(via meta.IsStatusConditionTrue) instead of recomputing from history
state and an entry-level flag. Both deploy-timeout and HC-error paths
become a one-liner.
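The resulting one-liner looks roughly like this. The `isStatusConditionTrue` helper below mimics `meta.IsStatusConditionTrue` from k8s.io/apimachinery/pkg/api/meta (which the refactor calls directly) so the sketch stays self-contained; the `shouldFail` expression is an assumed shape, not the exact source line.

```go
package main

import "fmt"

// Condition is a stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string
}

// isStatusConditionTrue mimics meta.IsStatusConditionTrue: it reports
// whether the condition of the given type exists with Status "True".
func isStatusConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []Condition{{Type: "BakeFailureDisabled", Status: "True"}}
	healthCheckError := true

	// Each shouldFail path collapses to one expression over the condition:
	shouldFail := healthCheckError && !isStatusConditionTrue(conds, "BakeFailureDisabled")
	fmt.Println(shouldFail) // false: the condition suppresses failure
}
```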

Behavioral note: the previous design lifted the force-deploy-during-
incident exemption once BakeStartTime was set. With this refactor the
exemption persists for the entire deploy lifetime, matching the
"don't change until next rollout starts" semantics.

Tests rewritten to reflect the new model (14 specs in
recovery_mode_test.go); full suite 228/228 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ments

deployRelease no longer inlines the recovery-mode determination — moves
~50 lines of bookkeeping into a focused helper that mirrors
setDeploymentBlockedCondition. Trimmed verbose explanatory comments in
the main reconcile loop and handleBakeTime; the why is captured in the
helpers' godoc.

Behavior unchanged; 228/228 tests still passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LittleChimera LittleChimera merged commit 75123c1 into main Apr 30, 2026
3 of 7 checks passed
