
fix(triggers): enforce maxInFlightItems on PM status-changed (not just backlog-manager) #1181

Merged
zbigniewsobiecki merged 1 commit into dev from fix/pipeline-capacity-on-pm-status-changed on Apr 24, 2026

Conversation

@zbigniewsobiecki (Member)

Problem

maxInFlightItems was only enforced at the two backlog-manager chain sites (src/triggers/github/pr-merged.ts:89, src/triggers/shared/agent-execution.ts:728). PM status-changed triggers (Trello / JIRA / Linear) returned a TriggerResult for implementation for every card moved into TODO with no capacity check.

Result on prod ua-store (org under-armour): with maxInFlightItems: 1 configured, 3 implementations ran concurrently after a human pushed several JIRA issues into "To Do" in quick succession. Router logs confirmed zero isPipelineAtCapacity entries on the path that actually enqueued the jobs.

The router-side isWorkItemLocked only blocks duplicates of the same (project, workItem, agentType) tuple — it does not enforce a pipeline-wide cap.
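To make the gap concrete, here is an illustration-only sketch of the duplicate lock's scope; the key shape is inferred from the tuple wording above, not taken from the actual router code:

```typescript
// Illustration only. Each card yields its own key, so the lock can dedupe
// re-deliveries for one card but never caps overall parallelism.
const lockKey = (projectId: string, workItemId: string, agentType: string) =>
  `${projectId}:${workItemId}:${agentType}`;

lockKey("ua-store", "UA-101", "implementation"); // "ua-store:UA-101:implementation"
lockKey("ua-store", "UA-102", "implementation"); // distinct key, so a second run proceeds
```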

Fix

Add a shared gate (src/triggers/shared/pipeline-capacity-gate.ts) called from each PM status-changed handler. The gate:

  • Only fires for implementation — per STATUS_TO_AGENT, it's the only PM-status-reachable agent that consumes a TODO/IN_PROGRESS/IN_REVIEW slot. The splitting and planning agents use their own dedicated columns, and backlog-manager already has its dedicated gates.
  • Resolves the PM provider via the existing getPMProvider() AsyncLocalStorage scope. If no scope is available (defensive), allows.
  • Calls a new isActivePipelineOverCapacity(project, provider, { excludeWorkItemId }) in backlog-check.ts (sketched just after this list). Differs from the existing isPipelineAtCapacity in two ways:
    1. No backlog-empty short-circuit — irrelevant when the question is "is the active pipeline already too full?".
    2. Excludes the just-moved card from the count — without this, the first card moved to TODO with limit=1 would self-block.
  • On over-capacity, returns null from the trigger handler. The card sits in its column until a slot frees, at which point the existing pr-merged → backlog-manager chain picks it up. Mirrors how the backlog-manager gate behaves today.
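A minimal sketch of the new check, assuming hypothetical provider and project shapes (the real types in backlog-check.ts differ); only the signature and the counting rules above come from this PR:

```typescript
// Minimal shapes assumed for illustration; the repo's real types differ.
interface WorkItem { id: string }
interface Project { maxInFlightItems?: number }
interface PMProvider {
  listWorkItems(project: Project, columns: readonly string[]): Promise<WorkItem[]>; // hypothetical method
}

const ACTIVE_COLUMNS = ["TODO", "IN_PROGRESS", "IN_REVIEW"] as const;

export async function isActivePipelineOverCapacity(
  project: Project,
  provider: PMProvider,
  opts: { excludeWorkItemId?: string } = {},
): Promise<boolean> {
  const limit = project.maxInFlightItems ?? 1; // default limit=1, per the test plan
  if (!Number.isInteger(limit) || limit < 1) return false; // misconfigured: assumed fail-open

  try {
    // Deliberately no backlog-empty short-circuit: the only question here is
    // whether the active pipeline is already full.
    const items = await provider.listWorkItems(project, ACTIVE_COLUMNS);
    // Excluding the just-moved card keeps the first card moved to TODO with
    // limit=1 from blocking itself.
    const inFlight = items.filter((i) => i.id !== opts.excludeWorkItemId).length;
    return inFlight >= limit;
  } catch {
    return false; // provider error: assumed fail-open, mirroring isPipelineAtCapacity's fallbacks
  }
}
```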

The two pre-existing duplicate gates in pr-merged.ts and agent-execution.ts are intentionally left in place as defense-in-depth.

Test plan

  • Unit: 8 new tests for isActivePipelineOverCapacity (over/below capacity, empty-backlog non-short-circuit, excludeWorkItemId across columns, error fallback, misconfigured fallback, default limit=1); a representative sketch follows this list
  • Unit: 4 new tests for shouldBlockForPipelineCapacity (non-slot agents pass through, blocks over capacity, allows below capacity, allows when no PM scope)
  • All 901 trigger-suite tests still pass
  • All 8363 unit tests + lint + typecheck pass
  • CI green
  • After merge: drop the temporary --max-concurrency 1 mitigation applied to the ua-store implementation agent
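For illustration, a representative vitest-style sketch of the excludeWorkItemId tests; the framework, the stub provider shape, and the import path are assumptions, not the suite's actual code:

```typescript
import { describe, expect, it } from "vitest"; // assumed test framework
import { isActivePipelineOverCapacity } from "../shared/backlog-check";

// Hypothetical stub: one active-column work item per id, ignoring the column filter.
const stubProvider = (ids: string[]) =>
  ({ listWorkItems: async () => ids.map((id) => ({ id })) }) as any;

describe("isActivePipelineOverCapacity", () => {
  it("does not count the just-moved card against the limit", async () => {
    const project = { maxInFlightItems: 1 } as any;
    // card-1 is the card whose move fired the webhook: with limit=1 it
    // must not self-block.
    const over = await isActivePipelineOverCapacity(project, stubProvider(["card-1"]), {
      excludeWorkItemId: "card-1",
    });
    expect(over).toBe(false);
  });

  it("blocks when another card already holds the only slot", async () => {
    const project = { maxInFlightItems: 1 } as any;
    const over = await isActivePipelineOverCapacity(
      project,
      stubProvider(["card-1", "card-2"]),
      { excludeWorkItemId: "card-2" },
    );
    expect(over).toBe(true);
  });
});
```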

Immediate prod mitigation already applied

`cascade agents update 69 --org under-armour --max-concurrency 1` caps parallel implementation runs on ua-store via the existing per-agent-type column. Stops the bleeding while this lands.

🤖 Generated with Claude Code

fix(triggers): enforce maxInFlightItems on PM status-changed (not just backlog-manager)

`maxInFlightItems` was only consulted at the two backlog-manager chain
sites (pr-merged, splitting auto-chain). PM `status-changed` triggers
fired `implementation` for every card moved into TODO with no capacity
check, so a human pushing N cards burst N parallel implementations
straight past the cap. Observed on prod ua-store with limit=1 → 3
implementations running concurrently.

This adds a shared gate (`shouldBlockForPipelineCapacity`) called from
the JIRA / Linear / Trello status-changed handlers. Only `implementation`
is gated — it's the only PM-status-reachable agent that consumes a TODO/
IN_PROGRESS/IN_REVIEW slot per `STATUS_TO_AGENT`. When the active
pipeline (excluding the just-moved card) is at the cap, the trigger
returns null and the card sits in its column until a slot frees, at
which point the existing `pr-merged → backlog-manager` chain picks it
up. Mirrors the behavior the backlog-manager gate already has.

Adds `isActivePipelineOverCapacity` next to the existing
`isPipelineAtCapacity` — same misconfigured/error fallbacks, but no
backlog-empty short-circuit (irrelevant here) and an `excludeWorkItemId`
arg so the card whose move just fired the webhook isn't double-counted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki merged commit 6b9ac53 into dev on Apr 24, 2026
8 checks passed
@zbigniewsobiecki deleted the fix/pipeline-capacity-on-pm-status-changed branch on April 24, 2026 at 11:24
codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 96.26168% with 4 lines in your changes missing coverage. Please review.

Files with missing lines               Patch %   Lines
src/triggers/jira/status-changed.ts    90.90%    1 Missing ⚠️
src/triggers/linear/status-changed.ts  90.90%    1 Missing ⚠️
src/triggers/shared/backlog-check.ts   97.43%    0 Missing and 1 partial ⚠️
src/triggers/trello/status-changed.ts  90.90%    1 Missing ⚠️


@nhopeatall (Collaborator) left a comment


Summary

[BLOCKING] This PR introduces a permanent pipeline deadlock when users move multiple cards to TODO manually.

Architecture & Design

  • [BLOCKING] Pipeline Deadlock (Cards left in TODO permanently inflate capacity): By returning null from the status-changed trigger when over capacity, the card is left in the PM's TODO column without an active agent running. Because isActivePipelineOverCapacity counts all cards in TODO + IN_PROGRESS + IN_REVIEW to determine inFlightCount, these "dead" cards permanently inflate the capacity count.

    • Scenario: maxInFlightItems=1. A human moves Card 2 to TODO while Card 1 is IN_PROGRESS.
    • Result: The webhook for Card 2 fires, sees Card 1 in IN_PROGRESS, and returns null. Card 2 stays in TODO.
    • Deadlock: When Card 1 finishes, inFlightCount becomes 1 (because Card 2 is still in TODO). The pr-merged -> backlog-manager chain runs, sees inFlightCount >= limit, and exits. Card 2 never starts, and backlog-manager will never pull new cards from BACKLOG. The pipeline is permanently bricked until a human moves Card 2 back to BACKLOG.
  • [BLOCKING] Race Condition Deadlock when moving N > limit cards: Suppose a human moves 3 cards to TODO simultaneously with limit=1.

    • The webhook for Card 1 excludes Card 1, sees Card 2 and 3 in TODO. inFlightCount = 2 >= limit. Returns null.
    • The webhook for Card 2 excludes Card 2, sees Card 1 and 3. inFlightCount = 2 >= limit. Returns null.
    • The webhook for Card 3 excludes Card 3, sees Card 1 and 2. Returns null.
    • Result: 0 agents start, 3 dead cards are left in TODO, and inFlightCount is permanently pinned at 3.

The PR description claims: "The card sits in its column until a slot frees, at which point the existing pr-merged -> backlog-manager chain picks it up." This is factually incorrect based on how backlog-manager works. backlog-manager only pulls new cards from BACKLOG; it does not start agents for cards already sitting in TODO.

Recommended Fix:
To safely enforce capacity at the status-changed gate, you must reject the PM move so that inFlightCount remains accurate. Instead of just returning null, use the provider to move the card back to the BACKLOG column and optionally post a comment to the PM card explaining it was rejected due to pipeline capacity.

🕵️ opencode · google/gemini-3.1-pro-preview · run details

@aaight (Collaborator) commented Apr 24, 2026

Thanks for the thorough analysis @nhopeatall. You're correct — the original approach introduced exactly the deadlocks you described.

Fix applied: When the capacity gate rejects a card, it now calls provider.moveWorkItem(workItemId, backlogDestination) to push it back to BACKLOG, then posts an explanatory comment. This resolves both scenarios:

  1. Single-card deadlock: The rejected card returns to BACKLOG, so inFlightCount stays accurate. When the in-flight item finishes and backlog-manager runs, it sees the card in BACKLOG (not TODO) and picks it up normally.

  2. N > limit race condition: Each webhook independently detects over-capacity and moves its card back to BACKLOG. The first webhook that arrives when a slot is available gets through; the rest are queued in BACKLOG for backlog-manager to sequence.

The PR description's claim about backlog-manager picking up TODO cards was incorrect — corrected in the commit.

The rejectToBacklog helper is wrapped in try/catch so API failures are non-fatal (logged as warn); the gate always returns true on over-capacity regardless.
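A sketch of the revised gate under those assumptions: shouldBlockForPipelineCapacity, getPMProvider(), provider.moveWorkItem(workItemId, backlogDestination), and the reject-to-BACKLOG behavior come from this thread, while the type shapes, the column lookup, and the addComment call are hypothetical stand-ins for "posts an explanatory comment".

```typescript
// Hypothetical shapes; the repo's real types and signatures differ.
interface Project { columns: Record<string, string> }
interface PMProvider {
  moveWorkItem(workItemId: string, destination: string): Promise<void>;
  addComment(workItemId: string, body: string): Promise<void>; // hypothetical
}
// Assumed signatures for the dependencies described in this PR:
declare function getPMProvider(): PMProvider | undefined; // AsyncLocalStorage scope
declare function isActivePipelineOverCapacity(
  project: Project,
  provider: PMProvider,
  opts: { excludeWorkItemId?: string },
): Promise<boolean>;

export async function shouldBlockForPipelineCapacity(
  project: Project,
  workItemId: string,
  agentType: string,
): Promise<boolean> {
  // Only implementation consumes an active-pipeline slot (per STATUS_TO_AGENT).
  if (agentType !== "implementation") return false;

  const provider = getPMProvider();
  if (!provider) return false; // defensive: no PM scope, allow

  const over = await isActivePipelineOverCapacity(project, provider, {
    excludeWorkItemId: workItemId,
  });
  if (!over) return false;

  // Reject the move so inFlightCount stays accurate: push the card back to
  // BACKLOG and explain why. Failures are non-fatal; the gate still blocks.
  try {
    await provider.moveWorkItem(workItemId, project.columns.BACKLOG); // hypothetical column lookup
    await provider.addComment(
      workItemId,
      "Moved back to BACKLOG: pipeline is at maxInFlightItems capacity.",
    );
  } catch (err) {
    console.warn("rejectToBacklog failed (non-fatal)", err);
  }
  return true; // the trigger handler returns null when this is true
}
```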

🕵️ claude-code · claude-sonnet-4-6 · run details

