fix(backlog-manager): fail-closed when pipelineSnapshot is missing #1233
Merged
zbigniewsobiecki merged 2 commits into dev, Apr 29, 2026
Prod incident 2026-04-29 (ucho): backlog-manager moved MNG-422 from SPLITTING to TODO, kicking off implementation on a non-split card.

Root cause: a manual `cascade runs trigger --agent-type backlog-manager` runs with `triggerEvent: undefined` → `resolveContextPipeline` returns `[]` → `pipelineSnapshot` never executes → agent improvises by listing all PM containers (BACKLOG + TODO + IN_PROGRESS + IN_REVIEW + ...) and picks "good-looking" cards from any of them. The `MoveWorkItem` gadget then moves them blindly because it does no source-state validation. The prompt strongly implies "from BACKLOG only" but never says "REFUSE otherwise", so the agent freelanced.

Three coordinated fixes (defense in depth):

A) Agent-level `requiredContext: ContextStepName[]` schema field. Steps listed here ALWAYS run, regardless of trigger source (manual, webhook, or internal). Backlog-manager declares `requiredContext: [pipelineSnapshot]` so the snapshot pre-load can no longer be skipped.

C) Fail-closed: when a required step returns 0 injections OR throws, the agent run aborts with a structured error + Sentry capture under tag `context_pipeline_required_step_failed`. Today the snapshot step warn-and-returns `[]` when no PM provider is in scope, so even when it's wired, missing scope was silent. With this fix it's loud.

D) Hard rule in `backlog-manager.eta`: explicit "NEVER move a card not in BACKLOG. NEVER call ListWorkItems against non-BACKLOG containers to discover candidates. If the snapshot is missing, ABORT - do not improvise."

Required steps run BEFORE the per-trigger `contextPipeline` and are deduped (a step listed in both runs once).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
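The ordering and dedupe rule for required steps can be illustrated with a minimal sketch. This is not the code from the PR: the `AgentProfile` shape, the `contextPipelines` map, the `resolveSteps` helper, the `card-moved` event, and the `workItem` step name are all hypothetical; only `requiredContext`, `pipelineSnapshot`, and the "required first, each step once" behaviour come from the commit message.

```typescript
// Hypothetical step names; only `pipelineSnapshot` is named in the PR.
type ContextStepName = "pipelineSnapshot" | "workItem";

// Assumed profile shape, for illustration only.
interface AgentProfile {
  // Steps that must run for every trigger source (manual, webhook, internal).
  requiredContext: ContextStepName[];
  // Per-trigger pipelines keyed by trigger event name.
  contextPipelines: Record<string, ContextStepName[]>;
}

// Returns the steps to execute: required steps first, then the per-trigger
// steps, keeping only the first occurrence of each step.
function resolveSteps(
  profile: AgentProfile,
  triggerEvent: string | undefined,
): ContextStepName[] {
  // A manual run has no trigger event, so its per-trigger pipeline is empty.
  const perTrigger: ContextStepName[] =
    triggerEvent === undefined ? [] : profile.contextPipelines[triggerEvent] ?? [];

  const ordered: ContextStepName[] = [];
  for (const step of profile.requiredContext.concat(perTrigger)) {
    if (ordered.indexOf(step) === -1) {
      ordered.push(step);
    }
  }
  return ordered;
}

const backlogManager: AgentProfile = {
  requiredContext: ["pipelineSnapshot"],
  contextPipelines: { "card-moved": ["workItem", "pipelineSnapshot"] },
};

// Manual trigger: no triggerEvent, yet the snapshot step is still scheduled.
const manualSteps = resolveSteps(backlogManager, undefined);

// Event trigger: the snapshot runs once, ahead of the per-trigger steps.
const eventSteps = resolveSteps(backlogManager, "card-moved");
```

In this sketch `manualSteps` is `["pipelineSnapshot"]` and `eventSteps` is `["pipelineSnapshot", "workItem"]`, which is the property that closes the manual-trigger gap described above.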
Web `tsc -b` consumes the `AgentDefinition` output type, which now requires `requiredContext` (zod `default([])` makes the field always defined on the parsed shape). Add the explicit empty array to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
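Why a schema default forces the explicit empty array can be shown without zod itself. The sketch below is a hand-rolled stand-in, not the project's schema: `AgentDefinitionInput`, `parseAgentDefinition`, and the fixture names are assumptions. It mirrors the one relevant property of a `default([])` field: optional on the input shape, always present on the parsed output shape.

```typescript
// Input shape: what a definition file may contain (field optional).
interface AgentDefinitionInput {
  name: string;
  requiredContext?: string[];
}

// Output shape: what consumers see after parsing (field always present).
interface AgentDefinition {
  name: string;
  requiredContext: string[];
}

// Hand-rolled stand-in for a schema field with a default of [].
function parseAgentDefinition(input: AgentDefinitionInput): AgentDefinition {
  return { name: input.name, requiredContext: input.requiredContext ?? [] };
}

// Parsing fills the default, so callers never see `undefined`.
const parsed = parseAgentDefinition({ name: "parsed-fixture" });

// A literal typed as the OUTPUT shape bypasses parsing, so it must spell the
// field out; omitting `requiredContext` here would be a compile error.
const webFixture: AgentDefinition = { name: "web-fixture", requiredContext: [] };
```

Code that builds `AgentDefinition` values by hand, rather than through the parser, is the kind of call site that needs the explicit `requiredContext: []`.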
Codecov Report: ✅ All modified and coverable lines are covered by tests.
zbigniewsobiecki added a commit that referenced this pull request, Apr 29, 2026
Merge dev → main: backlog-manager scope safety (#1233)
Summary
Closes the 2026-04-29 prod incident where backlog-manager moved MNG-422 from SPLITTING to TODO, kicking off implementation on a card that hadn't been split.
Root cause. A manual `cascade runs trigger --agent-type backlog-manager` runs with `triggerEvent: undefined` → `resolveContextPipeline` returns `[]` → `pipelineSnapshot` never pre-loads → agent improvises by listing every PM container (BACKLOG + TODO + IN_PROGRESS + IN_REVIEW + SPLITTING + …) and picks "good-looking" cards from any of them. The `MoveWorkItem` gadget then moves them blindly, with zero source-state validation. The prompt strongly implies "BACKLOG only" but never says "REFUSE otherwise."

Fix (defense-in-depth)
Three coordinated changes ship together:
A: Agent-level `requiredContext: ContextStepName[]` schema field
Steps listed here ALWAYS run, regardless of trigger source. Backlog-manager declares `requiredContext: [pipelineSnapshot]` so the snapshot pre-load can no longer be skipped: manual triggers, internal chains, and webhook triggers all get it.

Required steps run BEFORE the per-trigger `contextPipeline` and are deduped (a step listed in both runs once).

C: Fail-closed when the required step is empty or throws
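Today `fetchPipelineSnapshotStep` warns and returns `[]` when no PM provider is in scope (the silent failure mode that hid this bug). The new contract: a required step that returns 0 injections OR throws → agent run aborts with a structured error + Sentry capture under tag `context_pipeline_required_step_failed`.

D: Hard prompt rule in `backlog-manager.eta`
Explicit: "NEVER move a card not in BACKLOG. NEVER call ListWorkItems against non-BACKLOG containers to discover candidates. If the snapshot is missing, ABORT - do not improvise."

What's NOT in this PR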
Per user direction, dropped the originally-planned per-agent gadget restriction (replacing `MoveWorkItem` with a constrained `PullBacklogItemToTodo`). The defense-in-depth from A + C + D is sufficient without expanding the gadget surface.

Test plan
- `tests/unit/agents/definitions/profiles.test.ts` covering: required-step runs without `triggerEvent`, fails when result empty, fails when step throws, dedupes when same step in both required + trigger pipeline, runs required first, no-op for agents without `requiredContext`.
- `npm run lint` clean (13 pre-existing warnings).
- `npm run typecheck` clean.

🤖 Generated with Claude Code
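The two failure cases in the test plan (empty result, step throws) follow from the fail-closed contract, which might be sketched as below. Everything except the tag string is an assumption made for illustration: the `ContextInjection` shape, the synchronous `ContextStep` signature (real steps are presumably async), the `RequiredContextStepError` class, and the `report` callback that stands in for the Sentry capture.

```typescript
// Assumed shape of one injected context item.
interface ContextInjection {
  name: string;
  content: string;
}

// Assumed step signature; synchronous here to keep the sketch short.
type ContextStep = () => ContextInjection[];

// Tag taken from the PR description.
const FAILED_TAG = "context_pipeline_required_step_failed";

// Hypothetical structured error carrying the step name and failure reason.
class RequiredContextStepError extends Error {
  readonly tag = FAILED_TAG;
  constructor(
    readonly step: string,
    readonly reason: "empty" | "threw",
    readonly original?: unknown,
  ) {
    super("required context step '" + step + "' failed: " + reason);
  }
}

// Runs every required step; aborts on the first one that throws or returns
// zero injections. `report` stands in for the error-tracker capture.
function runRequiredSteps(
  required: string[],
  steps: Record<string, ContextStep>,
  report: (tag: string, step: string) => void,
): ContextInjection[] {
  const injections: ContextInjection[] = [];
  for (const name of required) {
    let result: ContextInjection[];
    try {
      const step = steps[name];
      if (step === undefined) {
        throw new Error("unknown step: " + name);
      }
      result = step();
    } catch (err) {
      report(FAILED_TAG, name);
      throw new RequiredContextStepError(name, "threw", err);
    }
    if (result.length === 0) {
      // The old behaviour would have continued silently at this point.
      report(FAILED_TAG, name);
      throw new RequiredContextStepError(name, "empty");
    }
    injections.push(...result);
  }
  return injections;
}

// Usage: a snapshot step that yields nothing, as when no PM provider is in scope.
const reported: string[] = [];
const report = (tag: string, step: string): void => {
  reported.push(tag + ":" + step);
};

let abortReason = "none";
try {
  runRequiredSteps(["pipelineSnapshot"], { pipelineSnapshot: () => [] }, report);
} catch (err) {
  abortReason = (err as RequiredContextStepError).reason;
}
```

Keeping the emptiness check outside the `try` block means an empty result is reported once as `empty` and is not re-wrapped as `threw`.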