fix(triggers): enforce maxInFlightItems on PM status-changed (not just backlog-manager) #1181
Conversation
`maxInFlightItems` was only consulted at the two backlog-manager chain sites (pr-merged, splitting auto-chain). PM `status-changed` triggers fired `implementation` for every card moved into TODO with no capacity check, so a human pushing N cards burst N parallel implementations straight past the cap. Observed on prod ua-store with limit=1 → 3 implementations running concurrently.

This adds a shared gate (`shouldBlockForPipelineCapacity`) called from the JIRA / Linear / Trello status-changed handlers. Only `implementation` is gated — it's the only PM-status-reachable agent that consumes a TODO/IN_PROGRESS/IN_REVIEW slot per `STATUS_TO_AGENT`. When the active pipeline (excluding the just-moved card) is at the cap, the trigger returns `null` and the card sits in its column until a slot frees, at which point the existing `pr-merged → backlog-manager` chain picks it up. Mirrors the behavior the backlog-manager gate already has.

Adds `isActivePipelineOverCapacity` next to the existing `isPipelineAtCapacity` — same misconfigured/error fallbacks, but no backlog-empty short-circuit (irrelevant here) and an `excludeWorkItemId` arg so the card whose move just fired the webhook isn't double-counted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
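Sketched out, the gate described above might look like this. The function names `shouldBlockForPipelineCapacity` and `isActivePipelineOverCapacity` come from this PR, but the signatures and the `PMProvider`, `Project`, and `listCards` shapes below are assumptions for illustration, not the repo's actual code:

```typescript
// Assumed board statuses; only three of them consume a pipeline slot.
type Status = "BACKLOG" | "TODO" | "IN_PROGRESS" | "IN_REVIEW" | "DONE";

// Hypothetical provider/project shapes, stand-ins for the real interfaces.
interface PMProvider {
  listCards(): { id: string; status: Status }[];
}

interface Project {
  maxInFlightItems?: number;
}

const SLOT_STATUSES: Status[] = ["TODO", "IN_PROGRESS", "IN_REVIEW"];

function isActivePipelineOverCapacity(
  project: Project,
  provider: PMProvider,
  opts: { excludeWorkItemId?: string } = {},
): boolean {
  // Misconfigured or missing limit falls back to 1, as the PR describes.
  const limit = project.maxInFlightItems ?? 1;
  const inFlightCount = provider
    .listCards()
    .filter(
      (c) =>
        SLOT_STATUSES.includes(c.status) && c.id !== opts.excludeWorkItemId,
    ).length;
  return inFlightCount >= limit;
}

function shouldBlockForPipelineCapacity(
  agentType: string,
  project: Project,
  provider: PMProvider,
  movedCardId: string,
): boolean {
  // Only `implementation` consumes a slot; other agents pass through.
  if (agentType !== "implementation") return false;
  // Exclude the card whose move just fired the webhook so it isn't
  // double-counted against the cap.
  return isActivePipelineOverCapacity(project, provider, {
    excludeWorkItemId: movedCardId,
  });
}
```

A status-changed handler would call this before returning its `TriggerResult`, and return `null` when it reports `true`.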
nhopeatall
left a comment
Summary
[BLOCKING] This PR introduces a permanent pipeline deadlock when users move multiple cards to TODO manually.
Architecture & Design
- [BLOCKING] Pipeline Deadlock (cards left in TODO permanently inflate capacity): By returning `null` from the `status-changed` trigger when over capacity, the card is left in the PM's TODO column without an active agent running. Because `isActivePipelineOverCapacity` counts all cards in TODO + IN_PROGRESS + IN_REVIEW to determine `inFlightCount`, these "dead" cards permanently inflate the capacity count.
  - Scenario: `maxInFlightItems=1`. A human moves Card 2 to TODO while Card 1 is IN_PROGRESS.
  - Result: The webhook for Card 2 fires, sees Card 1 in IN_PROGRESS, and returns `null`. Card 2 stays in TODO.
  - Deadlock: When Card 1 finishes, `inFlightCount` becomes 1 (because Card 2 is still in TODO). The `pr-merged -> backlog-manager` chain runs, sees `inFlightCount >= limit`, and exits. Card 2 never starts, and `backlog-manager` will never pull new cards from BACKLOG. The pipeline is permanently bricked until a human moves Card 2 back to BACKLOG.
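The scenario above can be reduced to a tiny simulation (assumed shapes, not repo code): a card parked in TODO keeps the count at the limit even after the in-progress card finishes.

```typescript
type Status = "TODO" | "IN_PROGRESS" | "IN_REVIEW" | "DONE";

// Mirrors the review's claim: TODO + IN_PROGRESS + IN_REVIEW all count.
function inFlightCount(board: Status[]): number {
  return board.filter(
    (s) => s === "TODO" || s === "IN_PROGRESS" || s === "IN_REVIEW",
  ).length;
}

const limit = 1;

// Card 1 is IN_PROGRESS when a human moves Card 2 to TODO. The webhook for
// Card 2 excludes Card 2 itself, sees Card 1, and returns null (blocked).
const webhookBlocked = inFlightCount(["IN_PROGRESS"]) >= limit;

// Card 1 finishes. Card 2 still sits in TODO with no agent running, so the
// pr-merged -> backlog-manager chain also sees a full pipeline and exits.
const chainBlocked = inFlightCount(["DONE", "TODO"]) >= limit;
// Both checks block: nothing ever starts Card 2, as the review describes.
```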
- [BLOCKING] Race Condition Deadlock when moving N > limit cards: If a human moves 3 cards to TODO simultaneously (e.g. limit=1):
  - The webhook for Card 1 excludes Card 1, sees Cards 2 and 3 in TODO. `inFlightCount = 2 >= limit`. Returns `null`.
  - The webhook for Card 2 excludes Card 2, sees Cards 1 and 3. `inFlightCount = 2 >= limit`. Returns `null`.
  - The webhook for Card 3 excludes Card 3, sees Cards 1 and 2. Returns `null`.
  - Result: 0 agents start, 3 dead cards are left in TODO, and pipeline capacity is permanently consumed.
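The N-way race reduces to a few lines (assumed shapes): each webhook excludes only its own card, so every one of the three sees the other two and blocks.

```typescript
const limit = 1;
const justMoved = ["card-1", "card-2", "card-3"]; // all moved to TODO at once

const decisions = justMoved.map((self) => {
  // Each webhook excludes its own card and counts the other two.
  const inFlightCount = justMoved.filter((id) => id !== self).length;
  return inFlightCount >= limit ? "blocked" : "start";
});
// Every webhook blocks: zero agents start, three dead cards remain in TODO.
```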
The PR description claims: "The card sits in its column until a slot frees, at which point the existing `pr-merged -> backlog-manager` chain picks it up." This is factually incorrect based on how `backlog-manager` works: `backlog-manager` only pulls new cards from BACKLOG; it does not start agents for cards already sitting in TODO.
Recommended Fix:
To safely enforce capacity at the `status-changed` gate, you must reject the PM move so that `inFlightCount` remains accurate. Instead of just returning `null`, use the provider to move the card back to the BACKLOG column and optionally post a comment to the PM card explaining it was rejected due to pipeline capacity.
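One possible shape of this fix, sketched under assumptions: `PMProvider`, `moveCard`, and `addComment` below are hypothetical names, not the repo's actual provider interface.

```typescript
// Hypothetical provider interface, a stand-in for the real PM provider API.
interface PMProvider {
  moveCard(cardId: string, column: "BACKLOG" | "TODO"): Promise<void>;
  addComment(cardId: string, body: string): Promise<void>;
}

// Instead of silently returning null, undo the human's move so that
// inFlightCount stays accurate, and tell them why the card was rejected.
async function rejectForCapacity(
  provider: PMProvider,
  cardId: string,
  limit: number,
): Promise<null> {
  await provider.moveCard(cardId, "BACKLOG");
  await provider.addComment(
    cardId,
    `Moved back to BACKLOG: the pipeline is at its cap of ${limit} ` +
      `in-flight item(s). It will be picked up when a slot frees.`,
  );
  return null; // still no agent started for this card
}
```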
🕵️ opencode · google/gemini-3.1-pro-preview · run details
Thanks for the thorough analysis @nhopeatall. You're correct — the original approach introduced exactly the deadlocks you described. Fix applied: when the capacity gate rejects a card, it now calls …

The PR description's claim about …

🕵️ claude-code · claude-sonnet-4-6 · run details
Problem
`maxInFlightItems` was only enforced at the two `backlog-manager` chain sites (`src/triggers/github/pr-merged.ts:89`, `src/triggers/shared/agent-execution.ts:728`). PM `status-changed` triggers (Trello / JIRA / Linear) returned a `TriggerResult` for `implementation` for every card moved into TODO with no capacity check.

Result on prod `ua-store` (org `under-armour`): `maxInFlightItems: 1` configured, but 3 implementations running concurrently after a human pushed several JIRA issues into "To Do" in quick succession. Router logs confirmed zero `isPipelineAtCapacity` entries on the path that actually enqueued the jobs.

The router-side `isWorkItemLocked` only blocks duplicates of the same `(project, workItem, agentType)` tuple — it does not enforce a pipeline-wide cap.

Fix
Add a shared gate (`src/triggers/shared/pipeline-capacity-gate.ts`) called from each PM `status-changed` handler. The gate:

- Only gates `implementation` — per `STATUS_TO_AGENT`, it's the only PM-status-reachable agent that consumes a TODO/IN_PROGRESS/IN_REVIEW slot. `splitting` and `planning` use their own dedicated columns; `backlog-manager` already has its dedicated gates.
- Resolves the PM provider from the `getPMProvider()` AsyncLocalStorage scope. If no scope is available (defensive), allows.
- Calls the new `isActivePipelineOverCapacity(project, provider, { excludeWorkItemId })` in `backlog-check.ts`. Differs from the existing `isPipelineAtCapacity` in two ways: no backlog-empty short-circuit, and an `excludeWorkItemId` arg so the card whose move just fired the webhook isn't double-counted — without it, `limit=1` would self-block.
- When over capacity, returns `null` from the trigger handler. The card sits in its column until a slot frees, at which point the existing `pr-merged → backlog-manager` chain picks it up. Mirrors how the backlog-manager gate behaves today.
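The `excludeWorkItemId` point can be illustrated with a minimal sketch (names and shapes assumed; the real function also handles the misconfigured/error fallbacks mentioned above):

```typescript
// Minimal sketch, a stand-in for the real backlog-check.ts code.
type Card = { id: string; status: "TODO" | "IN_PROGRESS" | "IN_REVIEW" };

function isActivePipelineOverCapacity(
  cards: Card[],
  limit: number,
  excludeWorkItemId?: string,
): boolean {
  // Every TODO/IN_PROGRESS/IN_REVIEW card counts except the one whose move
  // just fired the webhook; without the exclusion it would count itself.
  const inFlight = cards.filter((c) => c.id !== excludeWorkItemId).length;
  return inFlight >= limit;
}

// Otherwise-empty pipeline; "card-1" was just moved into TODO.
const board: Card[] = [{ id: "card-1", status: "TODO" }];

const selfBlocked = isActivePipelineOverCapacity(board, 1); // counts itself
const allowed = !isActivePipelineOverCapacity(board, 1, "card-1"); // excluded
```

With `limit=1` and no exclusion, the just-moved card alone fills the pipeline and every move blocks; with the exclusion, the first card through is allowed.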
pr-merged.tsandagent-execution.tsare intentionally left in place as defense-in-depth.Test plan
- `isActivePipelineOverCapacity` (over/below capacity, empty-backlog non-short-circuit, `excludeWorkItemId` across columns, error fallback, misconfigured fallback, default limit=1)
- `shouldBlockForPipelineCapacity` (non-slot agents pass through, blocks over capacity, allows below capacity, allows when no PM scope)
- `--max-concurrency 1` mitigation applied to `ua-store` `implementation` agent

Immediate prod mitigation already applied
`cascade agents update 69 --org under-armour --max-concurrency 1` caps parallel `implementation` runs on `ua-store` via the existing per-agent-type column. Stops the bleeding while this lands.

🤖 Generated with Claude Code