feat: add BakeFailureDisabled and DeploymentBlocked rollout conditions #44
Merged
LittleChimera merged 3 commits into main on Apr 30, 2026
Surface two recovery-mode signals that previously had no observable state:

- BakeFailureDisabled: True when an active deploy cannot be failed by health check errors. This happens either because the previous bake was not Succeeded (rollback recovery), or because a manual deploy proceeded over unhealthy health checks (force-deploy during incident) and bake has not yet started.
- DeploymentBlocked: True when automatic deployment of new versions is blocked by unhealthy health checks. Set independently of gate blocking so both blockers surface concurrently in the UI.

To support the force-deploy-during-incident case, history entries created via manual deploy while at least one health check is Unhealthy are tagged with DeployedWithUnhealthyHealthChecks=true. handleBakeTime treats these entries as recovery mode (shouldFail=false) for both the deploy-timeout and health-check-error paths until BakeStartTime is set, so transient errors during the incident do not fail a deployment that was knowingly issued into a broken state.

Adds 22 specs in recovery_mode_test.go covering the condition helpers, the recovery flag's effect on shouldFail in both paths, and the DeploymentBlocked + gate concurrent-block regression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
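The "set independently of gate blocking" point can be illustrated with a minimal sketch. It is not the controller's actual code: the `Rollout` and `Condition` types, the `GatesBlocked` condition name, and its reasons are hypothetical stand-ins. The `DeploymentBlocked` reason strings are taken from this PR's test plan, but the status assigned to each reason is an assumption.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Status bool
	Reason string
}

// Rollout holds only the fields this sketch needs; all names are hypothetical.
type Rollout struct {
	ManualDeploy   bool
	UnhealthyCount int
	GatesPassing   bool
	Conditions     map[string]Condition
}

// setDeploymentBlockedCondition records whether unhealthy health checks block
// automatic deploys. The status chosen for each reason is an assumption.
func setDeploymentBlockedCondition(r *Rollout) {
	c := Condition{Status: false, Reason: "HealthChecksHealthy"}
	switch {
	case r.ManualDeploy:
		c.Reason = "ManualDeployment"
	case r.UnhealthyCount > 0:
		c = Condition{Status: true, Reason: "UnhealthyHealthChecks"}
	}
	r.Conditions["DeploymentBlocked"] = c
}

// reconcile reports whether an automatic deploy may proceed.
func reconcile(r *Rollout) bool {
	if r.Conditions == nil {
		r.Conditions = map[string]Condition{}
	}
	// Evaluate health checks before the gate check, so the early return
	// below cannot leave DeploymentBlocked stale.
	setDeploymentBlockedCondition(r)

	// "GatesBlocked" and its reasons are hypothetical names for this sketch.
	gate := Condition{Status: false, Reason: "GatesPassing"}
	if !r.GatesPassing {
		gate = Condition{Status: true, Reason: "GatesFailing"}
	}
	r.Conditions["GatesBlocked"] = gate
	if !r.GatesPassing {
		return false // gate early return; both blockers are already recorded
	}
	return !r.Conditions["DeploymentBlocked"].Status
}

func main() {
	r := &Rollout{UnhealthyCount: 1, GatesPassing: false}
	proceed := reconcile(r)
	fmt.Println(proceed, r.Conditions["DeploymentBlocked"], r.Conditions["GatesBlocked"])
}
```

Because the health-check condition is written before the gate check, a rollout that is blocked by both gates and unhealthy checks reports both conditions as True in the same reconcile pass.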
Drop the DeployedWithUnhealthyHealthChecks field on DeploymentHistoryEntry and use the BakeFailureDisabled condition as the single source of truth for whether the active deploy is in recovery mode.

The condition is set once in deployRelease when the new history entry is created, with the reason determined at creation time (PreviousBakeFailed or DeployedWithUnhealthyHealthChecks). It persists unchanged for the entry's lifetime and is overwritten only when the next deploy starts.

handleBakeTime's shouldFail logic now reads the condition directly (via meta.IsStatusConditionTrue) instead of recomputing from history state and an entry-level flag. Both deploy-timeout and HC-error paths become a one-liner.

Behavioral note: the previous design lifted the force-deploy-during-incident exemption once BakeStartTime was set. With this refactor the exemption persists for the entire deploy lifetime, matching the "don't change until next rollout starts" semantics.

Tests rewritten to reflect the new model (14 specs in recovery_mode_test.go); full suite 228/228 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
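A rough sketch of the "condition as single source of truth" model follows. It uses a local `Condition` type and helper in place of `metav1.Condition` and `meta.IsStatusConditionTrue` so it runs with the standard library only. The function names, the precedence between the two reasons, and the `BakeFailureEnabled` reason for the False case are assumptions, not taken from the PR.

```go
package main

import "fmt"

// Condition is a local, simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

const bakeFailureDisabled = "BakeFailureDisabled"

// isConditionTrue mimics meta.IsStatusConditionTrue for the local type:
// it returns true only if the condition exists and its status is "True".
func isConditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// setCondition replaces the condition of the same type, or appends it.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// bakeFailureDisabledFor picks the reason once, when the history entry is
// created. Checking PreviousBakeFailed first is an assumption, and the
// "BakeFailureEnabled" reason is a hypothetical placeholder.
func bakeFailureDisabledFor(previousBakeNotSucceeded, unhealthyAtManualDeploy bool) Condition {
	switch {
	case previousBakeNotSucceeded:
		return Condition{Type: bakeFailureDisabled, Status: "True", Reason: "PreviousBakeFailed"}
	case unhealthyAtManualDeploy:
		return Condition{Type: bakeFailureDisabled, Status: "True", Reason: "DeployedWithUnhealthyHealthChecks"}
	default:
		return Condition{Type: bakeFailureDisabled, Status: "False", Reason: "BakeFailureEnabled"}
	}
}

// shouldFail is the one-liner shared by the deploy-timeout and
// health-check-error paths: fail unless recovery mode is active.
func shouldFail(conds []Condition) bool {
	return !isConditionTrue(conds, bakeFailureDisabled)
}

func main() {
	// Force-deploy during an incident: the condition is set at creation time.
	conds := setCondition(nil, bakeFailureDisabledFor(false, true))
	fmt.Println("shouldFail during incident deploy:", shouldFail(conds))

	// The next deploy overwrites the condition.
	conds = setCondition(conds, bakeFailureDisabledFor(false, false))
	fmt.Println("shouldFail after next normal deploy:", shouldFail(conds))
}
```

Since nothing between two deploys rewrites the condition in this sketch, the exemption lasts for the whole deploy lifetime, which is the behavioral change the commit message calls out.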
…ments

deployRelease no longer inlines the recovery-mode determination; ~50 lines of bookkeeping move into a focused helper that mirrors setDeploymentBlockedCondition. Trimmed verbose explanatory comments in the main reconcile loop and handleBakeTime; the rationale is captured in the helpers' godoc.

Behavior unchanged; 228/228 tests still passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- `BakeFailureDisabled` rollout condition: True when the active deploy cannot be failed by health-check errors. Two reasons: `PreviousBakeFailed` (rollback recovery) or `DeployedWithUnhealthyHealthChecks` (force-deploy during incident, until bake starts).
- `DeploymentBlocked` rollout condition: True when automatic deploys are blocked by unhealthy health checks. Set independently of gate blocking so both blockers can surface concurrently. This fixes the regression where the gate early-return left `DeploymentBlocked` stale.
- `DeploymentHistoryEntry.deployedWithUnhealthyHealthChecks` field, set when a manual deploy lands while any health check is `Unhealthy`. `handleBakeTime` honors the flag in both `shouldFail` paths (deploy timeout + health-check error) until `BakeStartTime` is set.

Why
Users had no observable signal that a rollout was in recovery mode (won't fail) or paused (waiting on health checks). When force-deploying into an active incident, transient health-check errors that re-fire after the new deploy were flipping the rollout to `Failed`; these incidents were known to the operator and shouldn't fail recovery.

Test plan
- `internal/controller/recovery_mode_test.go` (22 specs):
  - `setBakeFailureDisabledCondition`: empty/single/normal/cancelled-predecessor/recovery-flag/terminal states.
  - `setDeploymentBlockedCondition`: manual / unhealthy / healthy.
  - `DeploymentBlocked` surfaces through gate blocking.
- `make test` (236/236 specs pass).
- `hello-multi-dev`:
  - `deployedWithUnhealthyHealthChecks=true`; controller logged `Current entry was deployed with unhealthy health checks and bake has not started; not failing despite health check error`; `BakeFailureDisabled=True (DeployedWithUnhealthyHealthChecks)` with the correct message.
  - `DeploymentBlocked` correctly transitioned through `ManualDeployment → UnhealthyHealthChecks → HealthChecksHealthy`.
  - `BakeFailureDisabled=True (PreviousBakeFailed)` set correctly when a new deploy was started while predecessor was Cancelled, and cleared back to False after the bake completed.

🤖 Generated with Claude Code