fix(backlog-manager): fail-closed when pipelineSnapshot is missing #1233

Merged
zbigniewsobiecki merged 2 commits into dev from fix/backlog-manager-scope-safety
Apr 29, 2026

@zbigniewsobiecki

Summary

Closes the 2026-04-29 prod incident where backlog-manager moved MNG-422 from SPLITTING to TODO, kicking off implementation on a card that hadn't been split.

Root cause. A manual `cascade runs trigger --agent-type backlog-manager` runs with `triggerEvent: undefined` → `resolveContextPipeline` returns `[]` → `pipelineSnapshot` never pre-loads → agent improvises by listing every PM container (BACKLOG + TODO + IN_PROGRESS + IN_REVIEW + SPLITTING + …) and picks "good-looking" cards from any of them. The MoveWorkItem gadget then moves them blindly, with zero source-state validation. The prompt strongly implies "BACKLOG only" but never says "REFUSE otherwise."

Fix (defense-in-depth)

Three coordinated changes ship together:

A — Agent-level requiredContext: ContextStepName[] schema field

Steps listed here ALWAYS run, regardless of trigger source. Backlog-manager declares requiredContext: [pipelineSnapshot] so the snapshot pre-load can no longer be skipped — manual triggers, internal chains, and webhook triggers all get it.

Required steps run BEFORE the per-trigger contextPipeline and are deduped (a step listed in both runs once).
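
The ordering and dedupe rule above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `AgentDefinitionSketch`, `contextPipelines`, `resolveSteps`, and every step name other than `pipelineSnapshot` are hypothetical, and the real `resolveContextPipeline` may be shaped quite differently.

```typescript
// Hypothetical step names; only pipelineSnapshot is named in the PR.
type ContextStepName = "pipelineSnapshot" | "workItemDetails" | "repoSummary";

// Assumed, simplified shape of an agent definition.
interface AgentDefinitionSketch {
  // Steps that always run, whatever triggered the agent.
  requiredContext: ContextStepName[];
  // Per-trigger steps, keyed by trigger event name (assumption).
  contextPipelines: Partial<Record<string, ContextStepName[]>>;
}

// Returns the ordered steps to execute: required steps first, then the
// per-trigger steps, keeping only the first occurrence of each name.
function resolveSteps(
  agent: AgentDefinitionSketch,
  triggerEvent: string | undefined,
): ContextStepName[] {
  // A manual trigger has no triggerEvent, so it contributes no per-trigger steps.
  const perTrigger =
    triggerEvent === undefined ? [] : agent.contextPipelines[triggerEvent] ?? [];
  const combined = [...agent.requiredContext, ...perTrigger];
  return combined.filter((step, index) => combined.indexOf(step) === index);
}
```

Under these assumptions, a manual trigger with no `triggerEvent` still resolves to the required steps, which is the case the incident hit.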

C — Fail-closed when the required step is empty or throws

fetchPipelineSnapshotStep warn-and-returns [] when no PM provider is in scope (the silent failure mode that hid this bug). The new contract: a required step that returns 0 injections OR throws → agent run aborts with a structured error + Sentry capture under tag context_pipeline_required_step_failed.
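
A minimal sketch of that fail-closed contract, under stated assumptions: `ContextInjection`, `ContextStep`, `Reporter`, `RequiredContextStepError`, and `runRequiredStep` are hypothetical names, and `Reporter` merely stands in for the Sentry capture. Only the tag string is taken from this PR.

```typescript
// Assumed shapes; the real injection and step types are not shown in the PR.
type ContextInjection = { name: string; content: string };
type ContextStep = () => Promise<ContextInjection[]>;

// Hypothetical stand-in for the Sentry capture call.
type Reporter = (tag: string, error: unknown) => void;

const FAILURE_TAG = "context_pipeline_required_step_failed";

// Structured error that aborts the agent run.
class RequiredContextStepError extends Error {
  stepName: string;
  reason: "empty" | "threw";

  constructor(stepName: string, reason: "empty" | "threw") {
    super(`required context step "${stepName}" failed: ${reason}`);
    this.name = "RequiredContextStepError";
    this.stepName = stepName;
    this.reason = reason;
  }
}

// Runs one required step. Both failure modes (the step throws, or it
// returns zero injections) are reported and turned into an abort.
async function runRequiredStep(
  name: string,
  step: ContextStep,
  report: Reporter,
): Promise<ContextInjection[]> {
  let injections: ContextInjection[];
  try {
    injections = await step();
  } catch (err) {
    report(FAILURE_TAG, err);
    throw new RequiredContextStepError(name, "threw");
  }
  if (injections.length === 0) {
    const error = new RequiredContextStepError(name, "empty");
    report(FAILURE_TAG, error);
    throw error;
  }
  return injections;
}
```

In this sketch an empty result is treated exactly like a thrown error, so a step that only warns and returns `[]` can no longer pass silently.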

D — Hard prompt rule in backlog-manager.eta

HARD CONSTRAINT — NEVER MOVE A CARD NOT IN BACKLOG. [...] NEVER call ListWorkItems against non-BACKLOG containers to discover candidates [...] If the snapshot is missing, ABORT — do not improvise.

What's NOT in this PR

Per user direction, dropped the originally-planned per-agent gadget restriction (replacing MoveWorkItem with a constrained PullBacklogItemToTodo). The defense-in-depth from A + C + D is sufficient without expanding the gadget surface.

Test plan

  • 6 new unit tests in tests/unit/agents/definitions/profiles.test.ts covering: required-step runs without triggerEvent, fails when result empty, fails when step throws, dedupes when same step in both required + trigger pipeline, runs required first, no-op for agents without requiredContext.
  • All 8727 unit tests pass.
  • All 550 integration tests pass.
  • npm run lint clean (13 pre-existing warnings).
  • npm run typecheck clean.

🤖 Generated with Claude Code

zbigniewsobiecki and others added 2 commits April 29, 2026 17:48
Prod incident 2026-04-29 (ucho): backlog-manager moved MNG-422 from
SPLITTING to TODO, kicking off implementation on a non-split card.

Root cause: a manual `cascade runs trigger --agent-type backlog-manager`
runs with `triggerEvent: undefined` → `resolveContextPipeline` returns
`[]` → `pipelineSnapshot` never executes → agent improvises by listing
all PM containers (BACKLOG + TODO + IN_PROGRESS + IN_REVIEW + ...) and
picks "good-looking" cards from any of them. The MoveWorkItem gadget
then moves them blindly because it does no source-state validation. The
prompt strongly implies "from BACKLOG only" but never says "REFUSE
otherwise" — the agent freelanced.

Three coordinated fixes (defense in depth):

A) Agent-level `requiredContext: ContextStepName[]` schema field. Steps
   listed here ALWAYS run, regardless of trigger source — manual,
   webhook, or internal. Backlog-manager declares
   `requiredContext: [pipelineSnapshot]` so the snapshot pre-load can no
   longer be skipped.

C) Fail-closed: when a required step returns 0 injections OR throws,
   the agent run aborts with a structured error + Sentry capture under
   tag `context_pipeline_required_step_failed`. Today the snapshot step
   warn-and-returns `[]` when no PM provider is in scope — so even when
   it's wired, missing scope was silent. With this fix it's loud.

D) Hard rule in `backlog-manager.eta`: explicit "NEVER move a card not
   in BACKLOG. NEVER call ListWorkItems against non-BACKLOG containers
   to discover candidates. If the snapshot is missing, ABORT — do not
   improvise."

Required steps run BEFORE the per-trigger contextPipeline and are
deduped (a step listed in both runs once).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Web tsc -b consumes the AgentDefinition output type which now requires
requiredContext (zod default([]) makes the field always defined on the
parsed shape). Add the explicit empty array to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit cbc9d96 into dev Apr 29, 2026
8 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/backlog-manager-scope-safety branch April 29, 2026 17:58
codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


zbigniewsobiecki added a commit that referenced this pull request Apr 29, 2026
Merge dev → main: backlog-manager scope safety (#1233)