fix(router): replace in-memory PM coalesce window with BullMQ delayed-job coalescing#1226

Merged
zbigniewsobiecki merged 7 commits into dev from fix/pm-coalesce-bullmq-delayed-jobs
Apr 29, 2026

Conversation


@aaight aaight commented Apr 29, 2026

Summary

Replaces the in-memory create-coalesce-window.ts coalescing mechanism with BullMQ delayed-job coalescing for PM status-change webhooks.

Card: https://trello.com/c/69f1e7b68a89d30c2af829e3

Problem

The old mechanism used an in-memory timer per (projectId, workItemId) to de-duplicate JIRA's double-fire pattern (issue_created at initial status + issue_updated transitioning to target). This had two major gaps:

  1. Only coalesced create→update sequences. A rapid update→update sequence (user quickly moves a card twice) was not coalesced.
  2. Ack comment on superseded jobs. The router posted the ack comment before the coalesce window expired, so superseded jobs left orphaned ack comments.

Solution

All PM status-change dispatches with a coalesceKey now go through BullMQ delayed jobs keyed on coalesce:<projectId>:<workItemId>. If a new event arrives for the same work item before the window expires, the prior delayed job is removed and replaced — the last writer wins, across all agent types.

Ack comments are deferred to job fire time via pendingAck=true on the job data. The worker calls dispatchPMAck() at the start of processing instead of the router posting at schedule time, so superseded jobs never leave orphaned ack comments.
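The supersede flow described above can be sketched as follows. This is a minimal illustrative reimplementation against a simplified stand-in for the subset of BullMQ's Queue API this PR uses — not the repo's actual code; the job name `'pm-status-change'` is assumed:

```typescript
// Simplified stand-in for the subset of BullMQ's Queue API used here.
interface DelayedJob {
  data: unknown;
  getState(): Promise<'delayed' | 'waiting' | 'active' | 'completed'>;
  remove(): Promise<void>;
}

interface CoalesceQueue {
  getJob(jobId: string): Promise<DelayedJob | undefined>;
  add(name: string, data: unknown, opts: { jobId: string; delay: number }): Promise<void>;
}

// Last-writer-wins: remove any pending job with the same coalesce key,
// then schedule the new one with a fresh delay window.
async function scheduleCoalescedJob(
  queue: CoalesceQueue,
  data: unknown,
  coalesceKey: string, // e.g. `coalesce:${projectId}:${workItemId}`
  delayMs: number,
): Promise<{ superseded: boolean }> {
  const jobId = coalesceKey;
  const existing = await queue.getJob(jobId);
  let superseded = false;
  if (existing) {
    const state = await existing.getState();
    // Only pending jobs are replaced; active jobs are already running.
    if (state === 'delayed' || state === 'waiting') {
      await existing.remove();
      superseded = true;
    }
  }
  await queue.add('pm-status-change', data, { jobId, delay: delayMs }); // assumed job name
  return { superseded };
}
```

Note that keying the jobId on the coalesce key is what makes the last writer win across agent types: any second event within the window finds and replaces the first.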

Changes

  • src/pm/coalesce-config.ts (new): getCoalesceWindowMs() reads PM_COALESCE_WINDOW_MS → PM_CREATE_COALESCE_WINDOW_MS (backward-compat fallback) → default 10 000 ms (was 2 000 ms)
  • src/router/queue.ts: Added scheduleCoalescedJob(job, coalesceKey, delayMs) — looks up existing delayed/waiting job with same ID, removes it (supersede), adds new job with delay
  • src/router/webhook-processor.ts: Replaced in-memory coalesce block with BullMQ delayed-job block; sets pendingAck=true + ackMessage on job before scheduling
  • src/worker-entry.ts: Added pendingAck/ackMessage fields to TrelloJobData, JiraJobData, LinearJobData; added deferred ack handling in dispatchJob for all three PM types
  • src/types/index.ts: Removed coalesceRole?: 'create' | 'update' from TriggerResult (no longer needed — both create and update use the same coalesce path)
  • src/triggers/jira/status-changed.ts, src/triggers/linear/status-changed.ts: Removed coalesceRole from returned TriggerResult
  • src/pm/create-coalesce-window.ts (deleted): In-memory window logic removed
  • CLAUDE.md: Updated env var doc to reflect new name and BullMQ mechanism
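The env-var precedence chain for the window can be sketched like this (an illustrative version assuming the branch behavior the tests later describe — 0 as a valid disable, malformed input falling through; the real module reads process.env directly):

```typescript
const DEFAULT_COALESCE_WINDOW_MS = 10_000;

// Parse one env var: a non-negative number string is accepted (0 disables
// coalescing); anything malformed yields undefined so the next source is tried.
function parseWindow(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === '') return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 ? n : undefined;
}

// Precedence: new name, then legacy name, then the 10 s default.
function getCoalesceWindowMs(env: Record<string, string | undefined>): number {
  return (
    parseWindow(env.PM_COALESCE_WINDOW_MS) ??
    parseWindow(env.PM_CREATE_COALESCE_WINDOW_MS) ??
    DEFAULT_COALESCE_WINDOW_MS
  );
}
```

The `??` chain keeps 0 distinct from "unset", which is why 0 works as an explicit disable rather than falling through to the default.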

Tests

  • tests/unit/router/queue.test.ts (new): 4 unit tests for scheduleCoalescedJob using vi.hoisted() pattern for BullMQ mocks
  • tests/unit/router/webhook-processor.test.ts: Replaced in-memory coalesce tests with 7 BullMQ delayed-job coalescing tests
  • tests/unit/triggers/jira-status-changed.test.ts, tests/unit/triggers/linear-status-changed.test.ts: Updated coalesce metadata tests (assert coalesceKey present, coalesceRole absent)
  • tests/integration/coalesce-bullmq.test.ts (new): 4 real-Redis integration tests on dedicated cascade-test-coalesce queue
  • tests/unit/pm/create-coalesce-window.test.ts (deleted): Test for removed module

Test plan

  • All unit tests pass: npm test
  • All integration tests pass: npm run test:integration
  • Lint clean: npm run lint
  • Typecheck clean: npm run typecheck

🕵️ claude-code · claude-sonnet-4-6 · run details


aaight commented Apr 29, 2026

⚠️ respond-to-ci agent failed

OpenCode transport failed after retries: {"code":504,"message":"The operation was aborted","metadata":{"error_type":"timeout"}}

Manual intervention may be required.

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 96.44670% with 7 lines in your changes missing coverage. Please review.

Files with missing lines          | Patch % | Lines
src/router/webhook-processor.ts   | 95.18%  | 4 Missing ⚠️
src/worker-entry.ts               | 96.25%  | 3 Missing ⚠️


@zbigniewsobiecki
Member

@aaight solve the CI issues

The new tests/integration/coalesce-bullmq.test.ts requires a live Redis
instance to verify BullMQ delayed-job supersede semantics, but the CI
integration-tests job only provisioned Postgres. Add a redis:7-alpine
service on the standard 6379 port and pass REDIS_URL through to the
test step so the BullMQ Queue can connect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aaight
Copy link
Copy Markdown
Collaborator Author

aaight commented Apr 29, 2026

@zbigniewsobiecki — fixed.

The integration-tests CI job only provisioned Postgres, but the new tests/integration/coalesce-bullmq.test.ts opens a real BullMQ Queue against Redis (redis://localhost:6379 by default) to verify delayed-job supersede semantics. Without a Redis service the BullMQ client could not connect and the suite failed.

Added a redis:7-alpine service on port 6379 (with redis-cli ping healthcheck) to the integration-tests job and pass REDIS_URL=redis://localhost:6379 through to the test step. Verified the full integration suite passes locally (550/550 tests, ~7 min) before pushing.

Pushed as 73e7e91.

🕵️ claude-code · claude-opus-4-7[1m] · run details


@nhopeatall nhopeatall left a comment


Summary

Solid architectural improvement moving PM coalesce from in-memory timers to durable BullMQ delayed jobs. This fixes two real gaps (update→update not coalesced, orphaned ack comments on superseded jobs). However, the deferred ack message is set to the work item title instead of a proper ack message — this will post the issue's own title as a comment on itself for every coalesced JIRA/Linear job.

Code Issues

Should Fix

  • src/router/webhook-processor.ts:159 — ackMessage is set to result.workItemTitle (the issue summary/title), but the worker uses it as the literal text of the ack comment via dispatchPMAck({..., message: jobData.ackMessage}). For a JIRA issue titled "Fix authentication bug", the ack comment would be "Fix authentication bug" — the card's own title posted as a comment on itself. The normal (non-coalesced) path goes through adapter.postAck() → generateAckMessage(agentType, context, projectId), which produces proper role-aware ack messages like "🔨 Getting to work on the auth bug fix" with optional run links. workItemTitle is commonly set on both JIRA (payload.issue?.fields?.summary) and Linear (issueTitle) triggers, so this hits virtually all coalesced jobs. Either generate the proper ack message at schedule time and store it, or call generateAckMessage() in the worker's deferred ack path.

  • src/router/queue.ts:152-161 — The getJob → getState → remove → add sequence in scheduleCoalescedJob is not atomic. Two concurrent webhooks for the same coalesceKey can both call getJob(), both see the same delayed job, then one remove() succeeds while the other silently no-ops, and the second add() is a BullMQ no-op (duplicate jobId for a non-completed job). The second event's data is silently lost. While near-simultaneous webhooks are uncommon, this IS the whole point of a coalesce window — multiple events within a time range. Consider using a BullMQ Lua script or at minimum document this race in a code comment so future maintainers are aware.

Questions

  • The coalesce path (step 7b) skips the agent-type concurrency check (step 8b). The old code also skipped it for "create" role events, but now ALL pm:status-changed events skip it. Is this intentional? The delayed job eventually fires through the BullMQ worker, which doesn't re-check concurrency at dispatch time.

🕵️ claude-code · claude-opus-4-6 · run details

Comment thread src/router/webhook-processor.ts Outdated
if (job.type === 'trello' || job.type === 'jira' || job.type === 'linear') {
job.pendingAck = true;
// Use workItemTitle as context if available; generateAckMessage
// already has its own LLM-backed fallback path via the adapter.
Collaborator


[SHOULD_FIX] ackMessage is assigned result.workItemTitle — the issue summary/title (e.g., "Fix authentication bug"). The worker then passes this as the literal message parameter to dispatchPMAck(), which calls client.postComment(workItemId, message). This will post the card's own title as a comment on itself.

The normal (non-coalesced) path produces a proper ack via adapter.postAck() → generateAckMessage(agentType, context, projectId), which uses an LLM to generate role-aware messages like "🔨 Getting to work on the auth bug fix" and appends run links.

workItemTitle is set on virtually all JIRA and Linear triggers (the only ones with coalesceKey), so this affects all coalesced jobs.

Suggested fix: either (a) call generateAckMessage() at schedule time and store the result, or (b) have the worker call generateAckMessage() at fire time using the stored context.

Collaborator Author


Fixed. Instead of using the stored title directly as the literal comment text, the worker now calls generateAckMessage() at fire time — the same LLM-backed path as the non-coalesced adapter.postAck(). Context is extracted from the stored webhook payload via the platform-specific extractors, with the stored ackMessage (workItemTitle) as a fallback hint. A JIRA issue titled "Fix authentication bug" now produces a proper role-aware message like "🔨 Getting to work on the auth bug fix" rather than the raw title. The docstrings in the job types are updated to clarify the field is a context hint, not a pre-generated message.

Comment thread src/router/queue.ts

// Remove any existing delayed/waiting job with the same key so the new
// job supersedes it. Active jobs are left alone — they are already running.
const existing = await jobQueue.getJob(jobId);
Collaborator


[SHOULD_FIX] TOCTOU race: getJob → getState → remove → add spans multiple Redis round trips. Two concurrent webhook handlers for the same coalesceKey can both see the existing delayed job, both attempt remove (second no-ops), then the second add is a BullMQ silent no-op (duplicate jobId). The second event's data is lost.

Practical impact is low for truly simultaneous events, but the coalesce window exists specifically to handle events that arrive close together. If two Node.js async handlers are interleaved during the Redis round-trips, the race is real.

Consider adding a code comment documenting this race, or using BullMQ's Job.fromId + removal in a single pipeline, or a Redis Lua script to make the check-and-replace atomic.

Collaborator Author


Documented. Added a code comment in scheduleCoalescedJob explaining the TOCTOU race: two concurrent handlers can both see the existing delayed job, both attempt remove() (the second no-ops), then the second add() is a BullMQ no-op for a duplicate jobId — losing the second event's data. The comment notes that the practical impact is low (the window exists for events milliseconds-to-seconds apart, not truly simultaneous arrivals), and that a Lua-script atomic compare-and-replace would close it, but best-effort is acceptable here.

Comment thread src/worker-entry.ts
case 'trello': {
logger.info('[Worker] Processing Trello job', {
jobId,
workItemId: jobData.workItemId,
Collaborator


[NITPICK] The deferred ack blocks for trello/jira/linear are nearly identical (same structure, same .catch, same String(ackResult.commentId) conversion). Consider extracting a shared helper like async function postDeferredAck(jobData, pmType) to reduce the 3× copy-paste.

Collaborator Author


Extracted into a shared postDeferredAck(projectId, workItemId, pmType, payload, agentType, contextHint) helper. The three copy-paste blocks are now single-line calls to this helper. As a bonus, the complexity suppression comment on dispatchJob is no longer needed (extracted in a follow-up cleanup commit).
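The helper's shape can be sketched as follows. Dependencies are injected here purely for illustration; generateAckMessage and dispatchPMAck follow the signatures quoted earlier in this thread, while DeferredAckDeps, extractContext, and the `Issue: <hint>` fallback form are assumptions of this sketch:

```typescript
type PMType = 'trello' | 'jira' | 'linear';

interface DeferredAckDeps {
  // Platform-specific context extraction from the stored webhook payload (assumed shape).
  extractContext(pmType: PMType, payload: unknown): string;
  // LLM-backed, role-aware ack text — same path as the non-coalesced postAck.
  generateAckMessage(agentType: string, context: string, projectId: string): Promise<string>;
  dispatchPMAck(args: { projectId: string; workItemId: string; pmType: PMType; message: string }):
    Promise<{ commentId: string | number } | undefined>;
}

async function postDeferredAck(
  deps: DeferredAckDeps,
  projectId: string,
  workItemId: string,
  pmType: PMType,
  payload: unknown,
  agentType: string,
  contextHint: string | undefined, // the stored workItemTitle, used only as a fallback
): Promise<string | undefined> {
  const extracted = deps.extractContext(pmType, payload);
  const context = extracted !== '' ? extracted : contextHint ? `Issue: ${contextHint}` : '';
  const message = await deps.generateAckMessage(agentType, context, projectId);
  // A failed ack must not fail the job: swallow rejections, keep processing.
  const result = await deps
    .dispatchPMAck({ projectId, workItemId, pmType, message })
    .catch(() => undefined);
  return result ? String(result.commentId) : undefined;
}
```

Injecting the dependencies keeps the three PM types on a single code path and makes the non-fatal-rejection behaviour trivially unit-testable.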

Cascade Bot and others added 2 commits April 29, 2026 12:30
- worker-entry: call generateAckMessage() at job fire time instead of
  using workItemTitle as the literal comment text. Adds postDeferredAck()
  shared helper that extracts context from the stored payload (same path
  as the non-coalesced adapter.postAck), generates a proper role-aware
  ack message via LLM/fallback, then calls dispatchPMAck. Eliminates
  the 3x copy-paste deferred ack blocks (nitpick fix).

- queue.ts: document the TOCTOU race in scheduleCoalescedJob
  (getJob → getState → remove → add is not atomic; two concurrent
  handlers can lose the second event's data). Practical impact is low
  but documented for future maintainers.

- webhook-processor.ts + types: update ackMessage field docstrings to
  clarify it stores the workItemTitle as a context hint for
  generateAckMessage at fire time, not a pre-generated comment text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The noExcessiveCognitiveComplexity suppression on dispatchJob is no
longer needed — extracting postDeferredAck() reduced the function's
complexity below the threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@nhopeatall nhopeatall left a comment


Summary

Solid architectural improvement — replacing in-memory coalescing with BullMQ delayed-job coalescing is the right move. It closes both gaps described in the PR (update→update not coalesced, orphaned ack comments on supersede). The deferred-ack pattern is clean, CI is properly updated with a Redis service for integration tests, and test coverage is comprehensive. Two lock-lifecycle issues worth addressing before merge.

Code Issues

Should Fix

  • src/router/webhook-processor.ts:198-203 — Lock leak on supersede. When scheduleCoalescedJob supersedes a prior delayed job (returns superseded: true), the superseded job's in-memory locks (markWorkItemEnqueued, markRecentlyDispatched, markAgentTypeEnqueued) are never released. Each supersede increments the work-item lock count, but only the final job runs and clears 1 count via cleanupWorker → clearWorkItemEnqueued. Phantom entries persist until the 30-minute TTL. While this doesn't block subsequent PM status-change events (they bypass the lock check via the coalesced path), it would block a non-PM-status-change trigger firing the same (project, workItem, agentType) through the normal step-8 lock check. Fix: when superseded === true, read the old job's data from the return value (requires scheduleCoalescedJob to return it) and call clearWorkItemEnqueued/clearAgentTypeEnqueued/clearRecentlyDispatched for the superseded job. The same class of leak occurs when an active job exists with the same jobId — add() is silently ignored by BullMQ, locks are marked, but no new job is created.

  • src/router/queue.ts:159-168 — Misleading status when active job exists. When getState() returns 'active', the function skips removal and calls add() — which BullMQ silently ignores for an active duplicate jobId. The function returns { superseded: false } and the caller logs "coalesced dispatch scheduled," but no new job was actually created. Consider returning a third status (e.g., { superseded: false, activeExists: true }) so the caller can log accurately and skip lock marking.
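The lock-balance invariant behind both items can be sketched with a hypothetical caller-side helper (applyLockBookkeeping, JobLockInfo, and ScheduleResult are names invented for this sketch; the repo uses separate mark*/clear* functions per lock kind):

```typescript
interface JobLockInfo { projectId: string; workItemId: string; agentType: string }
interface ScheduleResult {
  superseded: boolean;
  supersededJobData?: JobLockInfo; // old job's data, returned on supersede
  activeExists?: boolean;          // same jobId already running: nothing scheduled
}

// In-memory lock table: count of enqueued jobs per (project, workItem, agentType).
const lockCounts = new Map<string, number>();
const key = (j: JobLockInfo) => `${j.projectId}:${j.workItemId}:${j.agentType}`;

function mark(j: JobLockInfo): void {
  lockCounts.set(key(j), (lockCounts.get(key(j)) ?? 0) + 1);
}

function clear(j: JobLockInfo): void {
  const n = (lockCounts.get(key(j)) ?? 0) - 1;
  if (n > 0) lockCounts.set(key(j), n);
  else lockCounts.delete(key(j));
}

// Caller-side bookkeeping after scheduling a coalesced job: keep exactly
// one live lock mark per live job, regardless of how many supersedes occur.
function applyLockBookkeeping(result: ScheduleResult, newJob: JobLockInfo): void {
  if (result.activeExists) return; // nothing was scheduled: mark no locks
  if (result.superseded && result.supersededJobData) {
    clear(result.supersededJobData); // undo the replaced job's marks
  }
  mark(newJob);
}
```

Without the clear-on-supersede step, N rapid supersedes leave a count of N+1 that a single cleanupWorker decrement cannot balance.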

Questions

  • The default coalesce window changed from 2s to 10s (5×). Was this tuned against production JIRA/Linear webhook timing, or is there a specific rationale? A 10s window means every PM status-change event is delayed by 10s before agent work begins, which is a noticeable UX change.

🕵️ claude-code · claude-opus-4-6 · run details

Comment thread src/router/webhook-processor.ts
markWorkItemEnqueued(project.id, result.workItemId, result.agentType);
}
markRecentlyDispatched(project.id, result.agentType, result.workItemId);
markAgentTypeEnqueued(project.id, result.agentType);
Collaborator


[SHOULD_FIX] Lock leak on supersede. When superseded === true, the previously-scheduled job's locks (markWorkItemEnqueued, markRecentlyDispatched, markAgentTypeEnqueued from its original webhook-processor invocation) are never released — existing.remove() in scheduleCoalescedJob doesn't fire BullMQ's worker.on("failed"), so releaseLocksForFailedJob is never called.

Each supersede increments the work-item lock count by 1. After the final job completes, cleanupWorker clears 1, leaving N phantom entries (N = number of supersedes). These persist until the 30-min TTL.

Suggested fix: have scheduleCoalescedJob return the old job's data when superseding, then call clearWorkItemEnqueued / clearAgentTypeEnqueued / clearRecentlyDispatched here before marking the new locks.

Collaborator Author


Fixed. scheduleCoalescedJob now captures existing.data before calling remove() and returns it as supersededJobData in the result. In webhook-processor.ts, when superseded === true and supersededJobData is present, the caller now calls clearWorkItemEnqueued / clearAgentTypeEnqueued / clearRecentlyDispatched for the old job's (projectId, workItemId, agentType) triple, exactly undoing the lock marks from the previous webhook invocation.

The activeExists case (the same class of leak) is also addressed — when getState() === 'active', scheduleCoalescedJob now returns early with activeExists: true without calling add(), and the caller skips lock marking entirely (see the second comment fix below).

Comment thread src/router/queue.ts
// low enough that a documented best-effort approach is acceptable here.
const existing = await jobQueue.getJob(jobId);
if (existing) {
const state = await existing.getState();
Collaborator


[SHOULD_FIX] Silent no-op when active job has same ID. When the existing job is active, the code skips removal but still calls add() on line 168. BullMQ silently ignores add() for a duplicate jobId that's already active — so no new job is created, but the caller doesn't know.

The return value { superseded: false } is misleading (caller logs "coalesced dispatch scheduled" and marks locks). Consider checking the add() result or returning activeExists: true so the caller can handle this distinctly (skip lock marking, log accurately).

Collaborator Author


Fixed. When getState() returns 'active', scheduleCoalescedJob now:

  1. Skips add() entirely (BullMQ would have silently ignored it anyway).
  2. Returns { superseded: false, activeExists: true }.

In webhook-processor.ts, when activeExists === true, the caller returns immediately with a distinct decision reason (Coalesced dispatch skipped: active job already running for work item ...) without calling markWorkItemEnqueued / markAgentTypeEnqueued / markRecentlyDispatched — so no phantom lock entries accumulate for a job that was never actually created.

Tests added for both the scheduleCoalescedJob return value (queue.test.ts) and the webhook-processor behaviour (webhook-processor.test.ts: "skips lock marking when activeExists=true").

…ced dispatch

Two lock-lifecycle bugs in BullMQ delayed-job coalescing:

1. Lock leak on supersede: when scheduleCoalescedJob removes a delayed
   job (returns superseded=true), the superseded job's in-memory locks
   are never released because BullMQ's remove() does not fire
   worker.on('failed'). Fix: return supersededJobData from
   scheduleCoalescedJob so the caller can call clearWorkItemEnqueued /
   clearAgentTypeEnqueued / clearRecentlyDispatched for the old job.

2. Misleading status when active job exists: when getState() returns
   'active', the previous code called add() (BullMQ silently ignores it
   for duplicate active IDs) and returned superseded=false, causing the
   caller to mark locks for a non-existent job. Fix: return
   activeExists=true and skip add() entirely; caller skips lock marking
   and logs accurately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zbigniewsobiecki
Member

@aaight failing checks
codecov/patch — 58.28% of diff hit (target 80.00%) — fix the coverage

Adds 8 unit tests for `getCoalesceWindowMs()` (every branch — default,
PM_COALESCE_WINDOW_MS, legacy fallback, precedence, 0/disable,
non-numeric, negative, empty) and 9 unit tests for the deferred-ack
flow in `dispatchJob` / `postDeferredAck` (Trello/JIRA/Linear pendingAck
paths, payload-extractor → contextHint fallback, missing agentType,
non-fatal `dispatchPMAck` rejection, undefined-result fallback to the
pre-existing ackCommentId, Linear-without-workItemId skip, plus the
non-pendingAck Linear routing branch that was wholly uncovered).

Closes the codecov/patch gap on PR #1226 — the new BullMQ deferred-ack
codepath added in this PR was at 15% (worker-entry.ts) / 0%
(coalesce-config.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aaight commented Apr 29, 2026

@zbigniewsobiecki — coverage fix pushed as 182b472.

The codecov/patch failure was driven almost entirely by the new BullMQ deferred-ack pipeline: src/worker-entry.ts was at 15% (68 missing lines) and src/pm/coalesce-config.ts was at 0% (whole file untested). Added two test files:

  • tests/unit/pm/coalesce-config.test.ts — 8 tests covering every branch of getCoalesceWindowMs(): default, PM_COALESCE_WINDOW_MS, legacy PM_CREATE_COALESCE_WINDOW_MS fallback, precedence when both are set, 0 as a valid disable, plus three malformed-input fallbacks (non-numeric, negative, empty string).
  • tests/unit/worker-entry.test.ts — 9 new tests for the deferred-ack flow:
    • The non-pendingAck Linear routing branch (the entire case 'linear' in dispatchJob was previously uncovered — its happy path is now pinned).
    • pendingAck=true for Trello, JIRA, and Linear: asserts the platform-specific extractor is called with the payload, generateAckMessage(agentType, contextSnippet, projectId) is invoked, dispatchPMAck is called with the right shape, and the new commentId is forwarded to the platform's processXxxWebhook.
    • The contextHint fallback when the payload extractor returns '' (uses Issue: <hint> form).
    • Empty-string agentType when triggerResult.agentType is missing.
    • Non-fatal dispatchPMAck rejection: the worker preserves the original ackCommentId from the job data and continues processing.
    • dispatchPMAck returning undefined: same fallback to original ackCommentId.
    • Linear pendingAck=true without a workItemId: deferred-ack is skipped entirely.

Verified locally: npx vitest run --project unit-core tests/unit/pm/coalesce-config.test.ts tests/unit/worker-entry.test.ts → 38/38 pass; pre-push hook (full unit suite, 552 tests + 23 skipped) green; lint + typecheck clean.

🕵️ claude-code · claude-opus-4-7[1m] · run details


nhopeatall commented Apr 29, 2026

⚠️ review agent failed

Error: Claude Code returned an error result: Reached maximum budget ($1.2326)

Manual intervention may be required.

…kMessage hint field

Addresses both items from the (failed-to-submit) review run c63c7835:

1. **Should-fix: Sentry capture gap in coalesce-schedule catch block.**
   The `scheduleCoalescedJob` failure path in `webhook-processor.ts` logged
   via `logger.error` but did not call `captureException`. Other dispatch-
   failure paths flow through BullMQ retry → `worker.on('failed')` →
   `releaseLocksForFailedJob` → Sentry (per spec 015 plan 1), but this catch
   handles a Redis-side failure BEFORE the job is enqueued, so it bypasses
   that pipeline entirely. Redis hiccups during coalesced scheduling were
   silently escaping observability. Now captures under stable tag
   `coalesce_schedule_failure` with `{projectId, workItemId, agentType,
   coalesceKey, adapterType}` extras.

2. **Nit: rename `ackMessage` → `ackContextHint` on the new job interfaces.**
   The PR added an `ackMessage?: string` field on TrelloJob/JiraJob/LinearJob
   (and corresponding `*JobData` interfaces in worker-entry.ts) to store the
   `workItemTitle` as a context hint for `generateAckMessage()` at deferred-
   ack fire time — explicitly NOT the literal comment text, per the JSDoc.
   The name read like the literal text and required reading the JSDoc to
   avoid misuse. Renamed to `ackContextHint` for self-documentation. The
   pre-existing `GitHubJob.ackMessage` field is left unchanged because that
   one IS the literal comment text.

Renamed 3 type fields (Trello/JIRA/Linear job interfaces in queue.ts +
matching shapes in worker-entry.ts), 1 set site in webhook-processor.ts,
3 read sites in worker-entry.ts's deferred-ack helper invocations, plus
the JSDoc reference and 4 test mocks (test names included).

Full unit suite green (473 files / 8700 tests). Typecheck + lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 8f9da02 into dev Apr 29, 2026
9 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/pm-coalesce-bullmq-delayed-jobs branch April 29, 2026 15:28
zbigniewsobiecki added a commit that referenced this pull request Apr 29, 2026
…1227)

* feat(alerting): add configurable PM results list for alerting agent + enable UX

* fix(alerting): map PM provider to correct discovery capability in container picker

Replace the non-existent `containers` capability (which caused NOT_IMPLEMENTED
errors for every provider) with provider-specific mappings: Trello→`boards`,
JIRA→`projects`, Linear→`teams`. Disable the Fetch button with an explanatory
tooltip for unknown providers.

Also simplify the `alertingResultsContainerId ?? undefined` ternary to a plain
`??` expression in promptContext.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(triggers): persist workItemId on respond-to-review and respond-to-pr-comment runs

Both GitHub triggers set `workItemId` at the top level of their TriggerResult
but omitted it from `agentInput`. tryCreateRun reads workItemId from
agentInput, so 0/103 respond-to-review and 0/9 respond-to-pr-comment runs
had a non-null `agent_runs.work_item_id` — hiding every one of them from
the dashboard's work-item page (which filters by that column). Confirmed
on 2026-04-29 via `cascade runs list --project ucho`: four respond-to-review
runs for ucho/MNG-400 (PR #136) on 2026-04-28 were invisible despite firing
successfully.

The runtime safety net in `runAgentExecutionPipeline` was supposed to catch
this but only patched when re-resolution returned a value DIFFERENT from
`result.workItemId`. Since both triggers populated `result.workItemId`
correctly, the safety net no-op'd.

Three changes:

1. Add `workItemId` (shorthand) to the `agentInput` literal in
   `pr-review-submitted.ts:82` and `pr-comment-mention.ts:143`.

2. Widen the runtime patch in `agent-execution.ts:498` so it triggers on
   `agentInput.workItemId !== resolvedWorkItemId` rather than on
   `result.workItemId !== resolvedWorkItemId`. Now any future trigger that
   forgets the field still gets repaired before it reaches the DB.

3. Add `tests/unit/triggers/trigger-work-item-id-consistency.test.ts` —
   a static-analysis guard modeled on `trigger-event-consistency.test.ts`.
   Walks every TriggerHandler file via the TypeScript AST and asserts that
   any return literal with a top-level `workItemId` and a real `agentType`
   also includes `workItemId` inside `agentInput`. Fires loudly with
   `file:line` when violated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(alerting): fix PM container picker to fetch correct sub-container level

- JIRA: switch from `projects` to `states` capability (fetches workflow
  status names, which is what backlogListId expects); pass configured
  projectKey as args.containerId; save item.name not numeric state ID
- Trello: disable the dropdown picker entirely — no `lists` discovery
  capability exists in the manifest; users must enter the list ID manually
- Linear: keep `teams` capability (team ID is already the correct value)

Extract ContainerInput helper component to keep AlertingTab complexity
below the project's cognitive-complexity limit. Pass pmConfig from
integration-form.tsx so JIRA states can be scoped to the right project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(spec): 017 router-silent-failure-hardening + plans

Three production hardening fixes derived from the 2026-04-29 24h log/webhook
audit. Discovered alongside PR #1220 (workItemId-on-respond-to-* fix).

Failure modes:
  A) Linear PM-ack silently skipped on PM-focused agents (24/day)
  B) Pipeline-capacity gate fails open on every PM status-changed (32/day)
  C) Progress-comment double-delete race produces 404 log spam (72/day)

Decomposed per spec strategic decision #8 into three independent plans
(none blocks another):
  1-pm-ack-coverage  (consolidate dispatch via manifest registry)
  2-capacity-gate-pm-scope  (shared adapter helper, fail-closed semantics)
  3-progress-comment-double-delete  (consumed-flag, gated fallback, 404-DEBUG)

This commit ships scaffolding only. Plan 3's implementation lands in this
branch; plans 1 and 2 follow as their own PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/3 lock

* test: add coverage for buildExecutionPlan in secretOrchestrator

* fix(triggers): suppress redundant progress-comment DELETE after gadget mid-run cleanup

Plan 017/3 (progress-comment-double-delete). Closes failure mode C from spec
017's 24h log audit on 2026-04-29.

Root cause: deleteProgressCommentOnSuccess (post-agent hook) read
sessionState.initialCommentId, fell back to result.agentInput.ackCommentId
when session state was empty, and issued a DELETE. But "session state cleared
by an in-run gadget" was indistinguishable from "session state never
populated", so the fallback fired and re-deleted comments that gadgets had
already disposed of mid-run. GitHub returned 404 and produced 72 WARN entries
per day on cascade-router (no functional impact — the comment WAS deleted —
but the noise dominated WARN volume and obscured real failures).

Three changes:

1. New `initialCommentIdConsumed: boolean` flag on SessionStateData (default
   false). Both `deleteInitialComment` (gadget-driven) and
   `clearInitialComment` (sidecar-driven) set it to true after disposing of
   the comment. Distinguishes "had a comment, now gone" from "never had one".

2. `deleteProgressCommentOnSuccess` reads the flag first; if true, skips the
   entire deletion path including the legacy `agentInput.ackCommentId`
   fallback. The legacy fallback continues to work for paths that never
   populated session state (consumed stays false).

3. `githubClient.deletePRComment` treats HTTP 404 as success-equivalent under
   RFC-7231 idempotency: returns without throwing, logs at DEBUG with the
   comment id. Other HTTP errors (5xx, 401, network) continue to throw. Defense
   in depth: even if a future regression of similar shape recurs, the WARN
   volume stays clean while a DEBUG breadcrumb persists for audit.
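The decision order in changes 1 and 2 can be sketched as a pure function. The field names (`initialCommentId`, `initialCommentIdConsumed`) follow the commit message; the surrounding shapes and the `resolveProgressCommentDelete` helper are illustrative, not the real hook:

```typescript
// Hypothetical minimal shapes; only the two field names come from the
// commit message, everything else is illustrative.
interface SessionStateData {
  initialCommentId: string | null;
  initialCommentIdConsumed: boolean; // "had a comment, now gone" vs "never had one"
}

interface DeleteDecision {
  deleted: boolean;
  reason: "consumed" | "session-state" | "legacy-fallback" | "no-comment";
}

// Sketch of the post-agent hook's decision order: consumed flag first,
// then session state, then the legacy agentInput fallback.
function resolveProgressCommentDelete(
  state: SessionStateData,
  legacyAckCommentId: string | null,
): DeleteDecision {
  if (state.initialCommentIdConsumed) {
    // A gadget already disposed of the comment mid-run: skip everything,
    // including the legacy fallback that used to re-fire and 404.
    return { deleted: false, reason: "consumed" };
  }
  if (state.initialCommentId !== null) {
    return { deleted: true, reason: "session-state" };
  }
  if (legacyAckCommentId !== null) {
    // Paths that never populated session state keep working.
    return { deleted: true, reason: "legacy-fallback" };
  }
  return { deleted: false, reason: "no-comment" };
}
```

The 404-as-success handling in change 3 then sits underneath this guard as defense in depth rather than the primary fix.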

Test coverage: +7 tests in sessionState (new `initialCommentIdConsumed flag`
describe block), +3 in ack-comments (consumed-bypass + headline regression
pin + legacy-fallback preservation), +4 in client (404 → DEBUG + 5xx/401/network
rejection regression pins). Full suite: 470 files / 8649 tests passing
(+14 from baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/1 lock

* fix(router): consolidate PM-ack dispatch via manifest registry, restore Linear coverage

Plan 017/1 (pm-ack-coverage). Closes failure mode A from spec 017's 24h log
audit on 2026-04-29: PM-focused agents (e.g. backlog-manager) triggered from
a GitHub webhook against Linear-based projects silently skipped their PM-side
ack. The router-adapter's local `postPMAck` helper had `if (pmType === 'trello')`
/ `if (pmType === 'jira')` literal branches but no Linear case, so control fell
through to `WARN: Unknown PM type for PM-focused agent ack, skipping` — 24×
per day on prod cascade-router, all from `ucho`. A near-identical helper at
`src/triggers/shared/pm-ack.ts:postPMAckComment` had the Linear branch, so
this was pure parallel-path drift between two helpers that should have been
one (same shape as PR #1220).

Three changes:

1. New consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts`.
   Indexes the manifest registry directly via
   `getPMProvider(pmType).platformClientFactory(projectId).postComment(...)`.
   Zero per-PM-type literal branching anywhere on the dispatch surface.
   Adding a future PM provider lands the dispatch path for free.

2. Both legacy call sites delegate:
   - `src/router/adapters/github.ts:postPMAck` (the buggy one)
   - `src/triggers/shared/pm-ack.ts:postPMAckComment` (had Linear; preserves
     the existing `string | null` return contract via String() normalization)

3. The "Unknown PM type" branch converts from silent WARN+skip to ERROR-level
   log + Sentry capture under stable tag `pm_ack_unknown_pm_type`. Mirrors
   the spec-015 `wedged_lock_canary` precedent. Once the consolidation ships,
   hitting that branch represents a real configuration error (project pinned
   to a deleted provider), not steady-state noise.
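The registry-indexed dispatch shape in changes 1 and 3 can be sketched as follows. The names `getPMProvider`, `platformClientFactory`, `postComment`, and the `pm_ack_unknown_pm_type` tag come from the commit message; the registry map, stub Sentry capture, and exact signatures are assumptions:

```typescript
// Illustrative registry shapes; the real manifest registry, Sentry client,
// and logger are stubbed out here.
interface PMProvider {
  platformClientFactory(projectId: string): {
    postComment(workItemId: string, body: string): Promise<string | null>;
  };
}

const registry = new Map<string, PMProvider>(); // stand-in for the manifest registry
const capturedTags: string[] = []; // stand-in for Sentry captureException

function getPMProvider(pmType: string): PMProvider | undefined {
  return registry.get(pmType);
}

// Sketch of the consolidated dispatch: one registry lookup, zero
// per-PM-type literal branches. Unknown types escalate instead of
// silently skipping.
async function dispatchPMAck(
  pmType: string | undefined,
  projectId: string,
  workItemId: string,
  body: string,
): Promise<string | null> {
  const provider = pmType ? getPMProvider(pmType) : undefined;
  if (!provider) {
    capturedTags.push("pm_ack_unknown_pm_type"); // ERROR log + Sentry in the real code
    return null;
  }
  return provider.platformClientFactory(projectId).postComment(workItemId, body);
}
```

A provider registered in the manifest gets the dispatch path with no further wiring, which is the "lands for free" property change 1 claims.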

Regression nets:

- Per-provider conformance assertion: the existing PM manifest harness gains
  one new `it()` inside its `describe.each` block —
  `dispatchPMAck reaches this provider without throwing`. Adding a future
  provider whose `platformClientFactory` is misconfigured fails CI loudly.

- Static guard against future literal-branching drift:
  `tests/unit/router/pm-ack-dispatch.test.ts` reads each of the three call
  sites' source and asserts no `pmType === 'trello' | 'jira' | 'linear'`
  patterns appear within their bodies. Modeled on
  `trigger-event-consistency.test.ts`. A future maintainer who hand-codes a
  branch fails this guard with a precise file:line citation.
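The static guard's core check can be sketched without the file I/O. The real test reads the three call sites from disk; here the source is passed in as a string, and the regex shape is an assumption:

```typescript
// Illustrative version of the static guard's matcher. The real test walks
// the actual call-site files; only the no-literal-branching rule itself
// comes from the commit message.
const LITERAL_BRANCH = /pmType\s*===\s*['"](trello|jira|linear)['"]/;

// Returns the 1-based line number of the first literal branch, for the
// precise file:line citation in the failure message, or null if clean.
function findLiteralBranch(source: string): number | null {
  const lines = source.split("\n");
  for (let i = 0; i < lines.length; i++) {
    if (LITERAL_BRANCH.test(lines[i])) return i + 1;
  }
  return null;
}
```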

Test coverage: +6 tests on `dispatchPMAck` (Trello/JIRA/Linear happy paths,
null-from-postComment, unknown-pmType Sentry path, undefined-pmType same),
+4 on `postPMAckComment` delegation contract, +1 per-provider assertion in
the conformance harness (4 currently-registered providers including the test
fixture), +1 Linear regression pin on the github router adapter's `postAck`
public surface, plus 2 existing github adapter tests migrated from mocking
`postTrelloAck` to mocking `dispatchPMAck` (the dependency direction
changed). Full suite: 472 files / 8668 tests (+19 from baseline).

Doc updates:
- `src/integrations/README.md` gains a "PM-ack dispatch coverage invariant"
  subsection in the conformance harness section, documenting the no-literal-
  branching rule, the Sentry-tag escalation, and pointing at both regression
  nets.
- `CHANGELOG.md` entry under spec 017.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/2 lock

* fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems

Plan 017/2 (capacity-gate-pm-scope). Closes failure mode B from spec 017's
24h log audit on 2026-04-29.

Root cause: the pipeline-capacity gate at
`src/triggers/shared/pipeline-capacity-gate.ts:33-45` calls `getPMProvider()`
to count in-flight items. For every PM `status-changed` trigger this threw
`No PMProvider in scope` because the three PM router adapters
(`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in
their per-PM-type credential AsyncLocalStorage scope (`withLinearCredentials`
/ `withTrelloCredentials` / `withJiraCredentials`) but NOT in PM-provider
scope. The GitHub adapter at `src/router/adapters/github.ts:280` already had
both wrappings — the PM router adapters were the outliers.

The gate caught the throw and fell through to its conservative branch:
`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` +
`return false` (allow). Net effect: `maxInFlightItems` was a silent no-op for
the entire PM-source path. 32 occurrences/day on prod cascade-router. The
gate's whole purpose is preventing the multi-card-into-TODO incident class
documented in the file header — and the protection had been disabled since
the gate was first introduced.

Three changes:

1. New shared helper `withPMScopeForDispatch(project, dispatch)` at
   `src/router/adapters/_shared.ts`. Resolves the PM provider via
   `createPMProvider(project)` and wraps `dispatch` in `withPMProvider`.
   Mirrors the GitHub adapter's correct shape.

2. The three PM router adapters consume the helper inside
   `dispatchWithCredentials`, layering PM-provider scope on top of their
   existing per-PM-type credential scope.

3. The capacity gate's "PM provider unavailable" branch flips from
   `WARN + return false` (allow) to ERROR + Sentry capture under stable tag
   `pipeline_capacity_gate_no_pm_provider` + `return true` (block). Once the
   routine path establishes scope (changes 1+2 above), hitting that branch
   represents a real AsyncLocalStorage scope leak that operators need to
   investigate. Failing closed is preferable to silently failing open and
   re-introducing the original incident class.
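The scope wrapping plus the fail-closed gate can be sketched with Node's real `AsyncLocalStorage`. The helper and gate names come from the commit message; the `PMProvider` shape, `createPMProvider` stub, and `countInFlight` method are assumptions:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Stand-ins for the real PMProvider type and createPMProvider factory.
interface PMProvider {
  countInFlight(): number;
}

const pmScope = new AsyncLocalStorage<PMProvider>();

function withPMProvider<T>(provider: PMProvider, fn: () => T): T {
  return pmScope.run(provider, fn);
}

function createPMProvider(_project: { pmType: string }): PMProvider {
  return { countInFlight: () => 0 }; // real provider counts live PM items
}

// Sketch of the shared helper: resolve the provider once, then layer
// PM-provider scope around the dispatch callback. (The per-PM-type
// credential scope wraps outside this helper in the real adapters.)
function withPMScopeForDispatch<T>(
  project: { pmType: string },
  dispatch: () => T,
): T {
  return withPMProvider(createPMProvider(project), dispatch);
}

// Sketch of the gate's fail-closed branch from change 3.
function shouldBlockForPipelineCapacity(maxInFlightItems: number): boolean {
  const provider = pmScope.getStore();
  if (!provider) return true; // fail closed: missing scope is a real leak now
  return provider.countInFlight() >= maxInFlightItems;
}
```

Outside the scope the gate blocks; inside it, capacity is actually counted, which is the behavior the inverted test pins.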

Regression nets:

- `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts`: static guard
  reading each PM router adapter source and asserting it contains either
  `withPMScopeForDispatch` or `withPMProvider`. A future PM router adapter
  added without the wrapping fails this guard with a precise file path.
  Modeled on `trigger-event-consistency.test.ts`.

- `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`: 4 tests
  covering happy path (PM provider resolves inside the wrapped callback),
  return-value passthrough, null dispatch result, and error propagation.

- `pipeline-capacity-gate.test.ts`: existing "allows when no PM provider"
  test inverted to "FAILS CLOSED (blocks) when no PM provider scope" (the
  headline behavioral change), plus 2 new positive-path regression pins.

Side effects on existing trigger tests: three trigger-handler test files
(`linear-status-changed.test.ts`, `jira-status-changed.test.ts`,
`status-changed.test.ts`) call `shouldBlockForPipelineCapacity` indirectly
via the trigger handlers. After the fail-closed flip, those handlers were
blocked because the unit tests don't establish PM-provider scope. Each test
file gains a `vi.mock('.../pipeline-capacity-gate.js')` passthrough so
trigger-logic assertions still run — the gate's behavior is tested in its
own dedicated test file. Three PM router adapter test files
(`linear.test.ts`, `trello.test.ts`, `jira.test.ts`) gain a similar
passthrough mock for `_shared.js:withPMScopeForDispatch` so they don't pull
the real PM manifest registry.

Doc updates:
- `CLAUDE.md` Architecture section gains a "Capacity-gate invariant"
  paragraph documenting the wrapping requirement, the fail-closed policy,
  the Sentry tag, and the static guard.
- `CHANGELOG.md` entry under spec 017.

Full suite: 474 files / 8677 tests passing (+9 from Plan 1's baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 017 done — all three plans complete (router-side silent-failure hardening)

* fix(integration-tests): wrap implementation-trigger handle() in withPMProvider scope + plan 2 polish

CI integration-tests caught a regression I missed locally because lefthook's
pre-push only runs `npm run test:fast` (unit-only), not the full integration
suite. Four integration tests asserted `result?.agentType === 'implementation'`
but received `undefined` after plan 2's fail-closed flip, because they call
real trigger `.handle(ctx)` directly without establishing PM-provider scope —
the same class of breakage I fixed in unit tests with vi.mock passthroughs.

Failing tests:
  - tests/integration/trigger-registry.test.ts:323 (TrelloStatusChangedTodoTrigger.handle)
  - tests/integration/trigger-registry.test.ts:591 (registry.dispatch with implementation trigger)
  - tests/integration/pm-provider-switching.test.ts:258 (Trello implementation dispatch)
  - tests/integration/pm-provider-switching.test.ts:293 (JIRA implementation dispatch)

Integration tests don't mock the gate (by design), so the right fix is to
mirror production wrapping: each call site now wraps in
`withPMProvider(createPMProvider(project), () => trigger.handle(ctx))` —
which is exactly what `withPMScopeForDispatch` does in the PM router adapters.

The synthetic in-test trigger at trigger-registry.test.ts:117 doesn't call
the gate (it returns a hard-coded TriggerResult), so no scope wrap needed
there. Same for any non-implementation agent (planning / splitting / etc.)
since `SLOT_CONSUMING_AGENTS` only contains 'implementation'.

Also applies polish from local-review on the spec 017 work:

1. Reworded stale comment in `src/gadgets/sessionState.ts:186` — the comment
   claimed clearing `initialCommentId` made the post-agent callback "see
   null and short-circuit", but post-plan-3 the callback's actual gate is
   the `initialCommentIdConsumed` flag set on the line below. Comment now
   describes both the eager-clear AND the consumed-flag gate together.

2. Trimmed redundant 3-line wrapping comments to zero lines across the
   three PM router adapters (`linear.ts`, `trello.ts`, `jira.ts`). The
   shared helper's name (`withPMScopeForDispatch`) is self-documenting and
   CLAUDE.md's "Capacity-gate invariant" paragraph carries the rationale.
   Per CLAUDE.md "Default to writing no comments" guidance.

3. Tightened conformance harness assertion at `pm-conformance.test.ts`'s
   `dispatchPMAck reaches this provider without throwing` test. Previously
   the only `expect(...)` was conditional (`if (result !== undefined)`), so
   when result was undefined the test passed with zero assertions executed
   — a "tests that would still pass if the code under test were deleted"
   anti-pattern. Now uses `expect.assertions(1)` and asserts the result is
   either undefined OR has the expected AckResult shape, with a clear
   failure message naming the provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): replace in-memory PM coalesce window with BullMQ delayed-job coalescing

* ci: add Redis service to integration-tests job

The new tests/integration/coalesce-bullmq.test.ts requires a live Redis
instance to verify BullMQ delayed-job supersede semantics, but the CI
integration-tests job only provisioned Postgres. Add a redis:7-alpine
service on the standard 6379 port and pass REDIS_URL through to the
test step so the BullMQ Queue can connect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): address review feedback on PM coalesce deferred ack

- worker-entry: call generateAckMessage() at job fire time instead of
  using workItemTitle as the literal comment text. Adds postDeferredAck()
  shared helper that extracts context from the stored payload (same path
  as the non-coalesced adapter.postAck), generates a proper role-aware
  ack message via LLM/fallback, then calls dispatchPMAck. Eliminates
  the 3x copy-paste deferred ack blocks (nitpick fix).

- queue.ts: document the TOCTOU race in scheduleCoalescedJob
  (getJob → getState → remove → add is not atomic; two concurrent
  handlers can lose the second event's data). Practical impact is low
  but documented for future maintainers.

- webhook-processor.ts + types: update ackMessage field docstrings to
  clarify it stores the workItemTitle as a context hint for
  generateAckMessage at fire time, not a pre-generated comment text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(worker): remove now-unused biome complexity suppression

The noExcessiveCognitiveComplexity suppression on dispatchJob is no
longer needed — extracting postDeferredAck() reduced the function's
complexity below the threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(router): address lock-leak and active-job-status issues in coalesced dispatch

Two lock-lifecycle bugs in BullMQ delayed-job coalescing:

1. Lock leak on supersede: when scheduleCoalescedJob removes a delayed
   job (returns superseded=true), the superseded job's in-memory locks
   are never released because BullMQ's remove() does not fire
   worker.on('failed'). Fix: return supersededJobData from
   scheduleCoalescedJob so the caller can call clearWorkItemEnqueued /
   clearAgentTypeEnqueued / clearRecentlyDispatched for the old job.

2. Misleading status when active job exists: when getState() returns
   'active', the previous code called add() (BullMQ silently ignores it
   for duplicate active IDs) and returned superseded=false, causing the
   caller to mark locks for a non-existent job. Fix: return
   activeExists=true and skip add() entirely; caller skips lock marking
   and logs accurately.
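Both fixes can be sketched against a minimal queue interface. BullMQ's `Queue`/`Job` satisfy a similar surface, but the exact return shape (`supersededJobData`, `activeExists`), the job name, and the `coalesce:` ID prefix here are a sketch of this PR's design, not BullMQ API:

```typescript
// Minimal queue surface; any in-memory stub satisfies it for testing.
interface CoalesceJob<T> {
  data: T;
  getState(): Promise<string>;
  remove(): Promise<void>;
}
interface CoalesceQueue<T> {
  getJob(id: string): Promise<CoalesceJob<T> | undefined>;
  add(name: string, data: T, opts: { jobId: string; delay: number }): Promise<void>;
}

interface ScheduleResult<T> {
  superseded: boolean;
  supersededJobData?: T; // caller releases the old job's in-memory locks
  activeExists: boolean; // caller must NOT mark locks: add() was skipped
}

// Sketch of the supersede flow. NOTE the documented TOCTOU race:
// getJob -> getState -> remove -> add is not atomic, so two concurrent
// handlers can lose the second event's data (low practical impact).
async function scheduleCoalescedJob<T>(
  queue: CoalesceQueue<T>,
  data: T,
  coalesceKey: string,
  delayMs: number,
): Promise<ScheduleResult<T>> {
  const jobId = `coalesce:${coalesceKey}`;
  const existing = await queue.getJob(jobId);
  if (existing) {
    const state = await existing.getState();
    if (state === "active") {
      // BullMQ silently ignores add() for a duplicate active ID:
      // skip it entirely and tell the caller not to mark locks.
      return { superseded: false, activeExists: true };
    }
    if (state === "delayed" || state === "waiting") {
      await existing.remove(); // last writer wins
      await queue.add("pm-status-changed", data, { jobId, delay: delayMs });
      return { superseded: true, supersededJobData: existing.data, activeExists: false };
    }
  }
  await queue.add("pm-status-changed", data, { jobId, delay: delayMs });
  return { superseded: false, activeExists: false };
}
```

Returning the superseded job's data (rather than just a boolean) is what lets the caller run `clearWorkItemEnqueued` and friends for the removed job.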

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(coverage): cover deferred-ack worker path + coalesce-config

Adds 8 unit tests for `getCoalesceWindowMs()` (every branch — default,
PM_COALESCE_WINDOW_MS, legacy fallback, precedence, 0/disable,
non-numeric, negative, empty) and 9 unit tests for the deferred-ack
flow in `dispatchJob` / `postDeferredAck` (Trello/JIRA/Linear pendingAck
paths, payload-extractor → contextHint fallback, missing agentType,
non-fatal `dispatchPMAck` rejection, undefined-result fallback to the
pre-existing ackCommentId, Linear-without-workItemId skip, plus the
non-pendingAck Linear routing branch that was wholly uncovered).
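The env-var precedence under test (new var, legacy fallback, 10 000 ms default, 0 as disable) can be sketched as below. The variable names and default come from this PR; the treatment of invalid values (ignore and fall through to the next source) is an assumption matching the listed test cases:

```typescript
const DEFAULT_COALESCE_WINDOW_MS = 10_000;

// Non-numeric, negative, and empty values are ignored so the next source
// in the precedence chain is consulted (an assumption, per the lead-in).
function parseWindow(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  if (!Number.isFinite(n) || n < 0) return undefined;
  return n; // 0 is meaningful: it disables coalescing
}

// PM_COALESCE_WINDOW_MS -> PM_CREATE_COALESCE_WINDOW_MS (legacy) -> default.
function getCoalesceWindowMs(env: Record<string, string | undefined>): number {
  return (
    parseWindow(env.PM_COALESCE_WINDOW_MS) ??
    parseWindow(env.PM_CREATE_COALESCE_WINDOW_MS) ??
    DEFAULT_COALESCE_WINDOW_MS
  );
}
```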

Closes the codecov/patch gap on PR #1226 — the new BullMQ deferred-ack
codepath added in this PR was at 15% (worker-entry.ts) / 0%
(coalesce-config.ts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): capture coalesce-schedule failures to Sentry + rename ackMessage hint field

Addresses both items from the (failed-to-submit) review run c63c7835:

1. **Should-fix: Sentry capture gap in coalesce-schedule catch block.**
   The `scheduleCoalescedJob` failure path in `webhook-processor.ts` logged
   via `logger.error` but did not call `captureException`. Other dispatch-
   failure paths flow through BullMQ retry → `worker.on('failed')` →
   `releaseLocksForFailedJob` → Sentry (per spec 015 plan 1), but this catch
   handles a Redis-side failure BEFORE the job is enqueued, so it bypasses
   that pipeline entirely. Redis hiccups during coalesced scheduling were
   silently escaping observability. Now captures under stable tag
   `coalesce_schedule_failure` with `{projectId, workItemId, agentType,
   coalesceKey, adapterType}` extras.

2. **Nit: rename `ackMessage` → `ackContextHint` on the new job interfaces.**
   The PR added an `ackMessage?: string` field on TrelloJob/JiraJob/LinearJob
   (and corresponding `*JobData` interfaces in worker-entry.ts) to store the
   `workItemTitle` as a context hint for `generateAckMessage()` at deferred-
   ack fire time — explicitly NOT the literal comment text, per the JSDoc.
   The name read like the literal text and required reading the JSDoc to
   avoid misuse. Renamed to `ackContextHint` for self-documentation. The
   pre-existing `GitHubJob.ackMessage` field is left unchanged because that
   one IS the literal comment text.

Renamed 3 type fields (Trello/JIRA/Linear job interfaces in queue.ts +
matching shapes in worker-entry.ts), 1 set site in webhook-processor.ts,
3 read sites in worker-entry.ts's deferred-ack helper invocations, plus
the JSDoc reference and 4 test mocks (test names included).

Full unit suite green (473 files / 8700 tests). Typecheck + lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
zbigniewsobiecki added a commit that referenced this pull request Apr 29, 2026
…ontainer names (#1229)

* feat(alerting): add configurable PM results list for alerting agent + enable UX

* fix(alerting): map PM provider to correct discovery capability in container picker

Replace the non-existent `containers` capability (which caused NOT_IMPLEMENTED
errors for every provider) with provider-specific mappings: Trello→`boards`,
JIRA→`projects`, Linear→`teams`. Disable the Fetch button with an explanatory
tooltip for unknown providers.

Also simplify the `alertingResultsContainerId ?? undefined` ternary to a plain
`??` expression in promptContext.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(triggers): persist workItemId on respond-to-review and respond-to-pr-comment runs

Both GitHub triggers set `workItemId` at the top level of their TriggerResult
but omitted it from `agentInput`. tryCreateRun reads workItemId from
agentInput, so 0/103 respond-to-review and 0/9 respond-to-pr-comment runs
had a non-null `agent_runs.work_item_id` — hiding every one of them from
the dashboard's work-item page (which filters by that column). Confirmed
on 2026-04-29 via `cascade runs list --project ucho`: four respond-to-review
runs for ucho/MNG-400 (PR #136) on 2026-04-28 were invisible despite firing
successfully.

The runtime safety net in `runAgentExecutionPipeline` was supposed to catch
this but only patched when re-resolution returned a value DIFFERENT from
`result.workItemId`. Since both triggers populated `result.workItemId`
correctly, the safety net no-op'd.

Three changes:

1. Add `workItemId` (shorthand) to the `agentInput` literal in
   `pr-review-submitted.ts:82` and `pr-comment-mention.ts:143`.

2. Widen the runtime patch in `agent-execution.ts:498` so it triggers on
   `agentInput.workItemId !== resolvedWorkItemId` rather than on
   `result.workItemId !== resolvedWorkItemId`. Now any future trigger that
   forgets the field still gets repaired before it reaches the DB.

3. Add `tests/unit/triggers/trigger-work-item-id-consistency.test.ts` —
   a static-analysis guard modeled on `trigger-event-consistency.test.ts`.
   Walks every TriggerHandler file via the TypeScript AST and asserts that
   any return literal with a top-level `workItemId` and a real `agentType`
   also includes `workItemId` inside `agentInput`. Fires loudly with
   `file:line` when violated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(alerting): fix PM container picker to fetch correct sub-container level

- JIRA: switch from `projects` to `states` capability (fetches workflow
  status names, which is what backlogListId expects); pass configured
  projectKey as args.containerId; save item.name not numeric state ID
- Trello: disable the dropdown picker entirely — no `lists` discovery
  capability exists in the manifest; users must enter the list ID manually
- Linear: keep `teams` capability (team ID is already the correct value)

Extract ContainerInput helper component to keep AlertingTab complexity
below the project's cognitive-complexity limit. Pass pmConfig from
integration-form.tsx so JIRA states can be scoped to the right project.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(spec): 017 router-silent-failure-hardening + plans

Three production hardening fixes derived from the 2026-04-29 24h log/webhook
audit. Discovered alongside PR #1220 (workItemId-on-respond-to-* fix).

Failure modes:
  A) Linear PM-ack silently skipped on PM-focused agents (24/day)
  B) Pipeline-capacity gate fails open on every PM status-changed (32/day)
  C) Progress-comment double-delete race produces 404 log spam (72/day)

Decomposed per spec strategic decision #8 into three independent plans
(none blocks another):
  1-pm-ack-coverage  (consolidate dispatch via manifest registry)
  2-capacity-gate-pm-scope  (shared adapter helper, fail-closed semantics)
  3-progress-comment-double-delete  (consumed-flag, gated fallback, 404-DEBUG)

This commit ships scaffolding only. Plan 3's implementation lands in this
branch; plans 1 and 2 follow as their own PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/3 lock

* test: add coverage for buildExecutionPlan in secretOrchestrator

* fix(triggers): suppress redundant progress-comment DELETE after gadget mid-run cleanup

Plan 017/3 (progress-comment-double-delete). Closes failure mode C from spec
017's 24h log audit on 2026-04-29.

Root cause: deleteProgressCommentOnSuccess (post-agent hook) read
sessionState.initialCommentId, fell back to result.agentInput.ackCommentId
when session state was empty, and issued a DELETE. But "session state cleared
by an in-run gadget" was indistinguishable from "session state never
populated", so the fallback fired and re-deleted comments that gadgets had
already disposed of mid-run. GitHub returned 404 and produced 72 WARN entries
per day on cascade-router (no functional impact — the comment WAS deleted —
but the noise dominated WARN volume and obscured real failures).

Three changes:

1. New `initialCommentIdConsumed: boolean` flag on SessionStateData (default
   false). Both `deleteInitialComment` (gadget-driven) and
   `clearInitialComment` (sidecar-driven) set it to true after disposing of
   the comment. Distinguishes "had a comment, now gone" from "never had one".

2. `deleteProgressCommentOnSuccess` reads the flag first; if true, skips the
   entire deletion path including the legacy `agentInput.ackCommentId`
   fallback. The legacy fallback continues to work for paths that never
   populated session state (consumed stays false).

3. `githubClient.deletePRComment` treats HTTP 404 as success-equivalent under
   RFC-7231 idempotency: returns without throwing, logs at DEBUG with the
   comment id. Other HTTP errors (5xx, 401, network) continue to throw. Defense
   in depth: even if a future regression of similar shape recurs, the WARN
   volume stays clean while a DEBUG breadcrumb persists for audit.

Test coverage: +7 tests in sessionState (new `initialCommentIdConsumed flag`
describe block), +3 in ack-comments (consumed-bypass + headline regression
pin + legacy-fallback preservation), +4 in client (404 → DEBUG + 5xx/401/network
rejection regression pins). Full suite: 470 files / 8649 tests passing
(+14 from baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/1 lock

* fix(router): consolidate PM-ack dispatch via manifest registry, restore Linear coverage

Plan 017/1 (pm-ack-coverage). Closes failure mode A from spec 017's 24h log
audit on 2026-04-29: PM-focused agents (e.g. backlog-manager) triggered from
a GitHub webhook against Linear-based projects silently skipped their PM-side
ack. The router-adapter's local `postPMAck` helper had `if (pmType === 'trello')`
/ `if (pmType === 'jira')` literal branches but no Linear case, so control fell
through to `WARN: Unknown PM type for PM-focused agent ack, skipping` — 24×
per day on prod cascade-router, all from `ucho`. A near-identical helper at
`src/triggers/shared/pm-ack.ts:postPMAckComment` had the Linear branch, so
this was pure parallel-path drift between two helpers that should have been
one (same shape as PR #1220).

Three changes:

1. New consolidated helper `dispatchPMAck` at `src/router/pm-ack-dispatch.ts`.
   Indexes the manifest registry directly via
   `getPMProvider(pmType).platformClientFactory(projectId).postComment(...)`.
   Zero per-PM-type literal branching anywhere on the dispatch surface.
   Adding a future PM provider lands the dispatch path for free.

2. Both legacy call sites delegate:
   - `src/router/adapters/github.ts:postPMAck` (the buggy one)
   - `src/triggers/shared/pm-ack.ts:postPMAckComment` (had Linear; preserves
     the existing `string | null` return contract via String() normalization)

3. The "Unknown PM type" branch converts from silent WARN+skip to ERROR-level
   log + Sentry capture under stable tag `pm_ack_unknown_pm_type`. Mirrors
   the spec-015 `wedged_lock_canary` precedent. Once the consolidation ships,
   hitting that branch represents a real configuration error (project pinned
   to a deleted provider), not steady-state noise.

Regression nets:

- Per-provider conformance assertion: the existing PM manifest harness gains
  one new `it()` inside its `describe.each` block —
  `dispatchPMAck reaches this provider without throwing`. Adding a future
  provider whose `platformClientFactory` is misconfigured fails CI loudly.

- Static guard against future literal-branching drift:
  `tests/unit/router/pm-ack-dispatch.test.ts` reads each of the three call
  sites' source and asserts no `pmType === 'trello' | 'jira' | 'linear'`
  patterns appear within their bodies. Modeled on
  `trigger-event-consistency.test.ts`. A future maintainer who hand-codes a
  branch fails this guard with a precise file:line citation.

Test coverage: +6 tests on `dispatchPMAck` (Trello/JIRA/Linear happy paths,
null-from-postComment, unknown-pmType Sentry path, undefined-pmType same),
+4 on `postPMAckComment` delegation contract, +1 per-provider assertion in
the conformance harness (4 currently-registered providers including the test
fixture), +1 Linear regression pin on the github router adapter's `postAck`
public surface, plus 2 existing github adapter tests migrated from mocking
`postTrelloAck` to mocking `dispatchPMAck` (the dependency direction
changed). Full suite: 472 files / 8668 tests (+19 from baseline).

Doc updates:
- `src/integrations/README.md` gains a "PM-ack dispatch coverage invariant"
  subsection in the conformance harness section, documenting the no-literal-
  branching rule, the Sentry-tag escalation, and pointing at both regression
  nets.
- `CHANGELOG.md` entry under spec 017.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(plan): 017/2 lock

* fix(router): wrap PM-source dispatch in PM-provider scope so capacity gate enforces maxInFlightItems

Plan 017/2 (capacity-gate-pm-scope). Closes failure mode B from spec 017's
24h log audit on 2026-04-29.

Root cause: the pipeline-capacity gate at
`src/triggers/shared/pipeline-capacity-gate.ts:33-45` calls `getPMProvider()`
to count in-flight items. For every PM `status-changed` trigger this threw
`No PMProvider in scope` because the three PM router adapters
(`src/router/adapters/{linear,trello,jira}.ts`) wrapped trigger dispatch in
their per-PM-type credential AsyncLocalStorage scope (`withLinearCredentials`
/ `withTrelloCredentials` / `withJiraCredentials`) but NOT in PM-provider
scope. The GitHub adapter at `src/router/adapters/github.ts:280` already had
both wrappings — the PM router adapters were the outliers.

The gate caught the throw and fell through to its conservative branch:
`WARN: pipeline-capacity-gate: PM provider unavailable, allowing run` +
`return false` (allow). Net effect: `maxInFlightItems` was silently no-op for
the entire PM-source path. 32 occurrences/day on prod cascade-router. The
gate's whole purpose is preventing the multi-card-into-TODO incident class
documented in the file header — and the protection had been disabled since
the gate was first introduced.

Three changes:

1. New shared helper `withPMScopeForDispatch(project, dispatch)` at
   `src/router/adapters/_shared.ts`. Resolves the PM provider via
   `createPMProvider(project)` and wraps `dispatch` in `withPMProvider`.
   Mirrors the GitHub adapter's correct shape.

2. The three PM router adapters consume the helper inside
   `dispatchWithCredentials`, layering PM-provider scope on top of their
   existing per-PM-type credential scope.

3. The capacity gate's "PM provider unavailable" branch flips from
   `WARN + return false` (allow) to ERROR + Sentry capture under stable tag
   `pipeline_capacity_gate_no_pm_provider` + `return true` (block). Once the
   routine path establishes scope (changes 1+2 above), hitting that branch
   represents a real AsyncLocalStorage scope leak that operators need to
   investigate. Failing closed is preferable to silently failing open and
   re-introducing the original incident class.

Regression nets:

- `tests/unit/integrations/pm-router-adapter-pm-scope.test.ts`: static guard
  reading each PM router adapter source and asserting it contains either
  `withPMScopeForDispatch` or `withPMProvider`. A future PM router adapter
  added without the wrapping fails this guard with a precise file path.
  Modeled on `trigger-event-consistency.test.ts`.

- `tests/unit/router/adapters/with-pm-scope-for-dispatch.test.ts`: 4 tests
  covering happy path (PM provider resolves inside the wrapped callback),
  return-value passthrough, null dispatch result, and error propagation.

- `pipeline-capacity-gate.test.ts`: existing "allows when no PM provider"
  test inverted to "FAILS CLOSED (blocks) when no PM provider scope" (the
  headline behavioral change), plus 2 new positive-path regression pins.

Side effects on existing trigger tests: three trigger-handler test files
(`linear-status-changed.test.ts`, `jira-status-changed.test.ts`,
`status-changed.test.ts`) call `shouldBlockForPipelineCapacity` indirectly
via the trigger handlers. After the fail-closed flip, those handlers were
blocked because the unit tests don't establish PM-provider scope. Each test
file gains a `vi.mock('.../pipeline-capacity-gate.js')` passthrough so
trigger-logic assertions still run — the gate's behavior is tested in its
own dedicated test file. Three PM router adapter test files
(`linear.test.ts`, `trello.test.ts`, `jira.test.ts`) gain a similar
passthrough mock for `_shared.js:withPMScopeForDispatch` so they don't pull
the real PM manifest registry.

Doc updates:
- `CLAUDE.md` Architecture section gains a "Capacity-gate invariant"
  paragraph documenting the wrapping requirement, the fail-closed policy,
  the Sentry tag, and the static guard.
- `CHANGELOG.md` entry under spec 017.

Full suite: 474 files / 8677 tests passing (+9 from Plan 1's baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): 017 done — all three plans complete (router-side silent-failure hardening)

* fix(integration-tests): wrap implementation-trigger handle() in withPMProvider scope + plan 2 polish

CI integration-tests caught a regression I missed locally because lefthook's
pre-push only runs `npm run test:fast` (unit-only), not the full integration
suite. Four integration tests asserted `result?.agentType === 'implementation'`
but received `undefined` after plan 2's fail-closed flip, because they call
real trigger `.handle(ctx)` directly without establishing PM-provider scope —
the same class of breakage I fixed in unit tests with vi.mock passthroughs.

Failing tests:
  - tests/integration/trigger-registry.test.ts:323 (TrelloStatusChangedTodoTrigger.handle)
  - tests/integration/trigger-registry.test.ts:591 (registry.dispatch with implementation trigger)
  - tests/integration/pm-provider-switching.test.ts:258 (Trello implementation dispatch)
  - tests/integration/pm-provider-switching.test.ts:293 (JIRA implementation dispatch)

Integration tests don't mock the gate (by design), so the right fix is to
mirror production wrapping: each call site now wraps in
`withPMProvider(createPMProvider(project), () => trigger.handle(ctx))` —
which is exactly what `withPMScopeForDispatch` does in the PM router adapters.

The synthetic in-test trigger at trigger-registry.test.ts:117 doesn't call
the gate (it returns a hard-coded TriggerResult), so no scope wrap needed
there. Same for any non-implementation agent (planning / splitting / etc.)
since `SLOT_CONSUMING_AGENTS` only contains 'implementation'.
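The scope-then-gate pattern the integration tests now mirror can be sketched with a minimal AsyncLocalStorage-based scope. This is an assumption about the mechanism: `withPMProvider`, the gate internals, and the trigger body here are illustrative stand-ins, not the project's real implementations.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type PMProvider = { name: string };
const pmScope = new AsyncLocalStorage<PMProvider>();

// Stand-in for the real withPMProvider helper (assumed shape).
function withPMProvider<T>(provider: PMProvider, fn: () => Promise<T>): Promise<T> {
  return pmScope.run(provider, fn);
}

// The fail-closed capacity gate: blocks whenever no PM provider is in scope.
function shouldBlockForPipelineCapacity(): boolean {
  return pmScope.getStore() === undefined;
}

// Mirrors the test-side fix: wrap handle() so the gate sees a provider,
// instead of calling the trigger directly and getting blocked.
async function handleWithScope(provider: PMProvider): Promise<string | undefined> {
  return withPMProvider(provider, async () =>
    shouldBlockForPipelineCapacity() ? undefined : "implementation",
  );
}
```

Calling the trigger body without the wrapper reproduces the CI failure mode (`undefined` result); wrapping it restores `agentType === 'implementation'`.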

Also applies polish from local-review on the spec 017 work:

1. Reworded stale comment in `src/gadgets/sessionState.ts:186` — the comment
   claimed clearing `initialCommentId` made the post-agent callback "see
   null and short-circuit", but post-plan-3 the callback's actual gate is
   the `initialCommentIdConsumed` flag set on the line below. Comment now
   describes both the eager-clear AND the consumed-flag gate together.

2. Trimmed redundant 3-line wrapping comments to zero lines across the
   three PM router adapters (`linear.ts`, `trello.ts`, `jira.ts`). The
   shared helper's name (`withPMScopeForDispatch`) is self-documenting and
   CLAUDE.md's "Capacity-gate invariant" paragraph carries the rationale.
   Per CLAUDE.md "Default to writing no comments" guidance.

3. Tightened conformance harness assertion at `pm-conformance.test.ts`'s
   `dispatchPMAck reaches this provider without throwing` test. Previously
   the only `expect(...)` was conditional (`if (result !== undefined)`), so
   when result was undefined the test passed with zero assertions executed
   — a "tests that would still pass if the code under test were deleted"
   anti-pattern. Now uses `expect.assertions(1)` and asserts the result is
   either undefined OR has the expected AckResult shape, with a clear
   failure message naming the provider.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): replace in-memory PM coalesce window with BullMQ delayed-job coalescing

* ci: add Redis service to integration-tests job

The new tests/integration/coalesce-bullmq.test.ts requires a live Redis
instance to verify BullMQ delayed-job supersede semantics, but the CI
integration-tests job only provisioned Postgres. Add a redis:7-alpine
service on the standard 6379 port and pass REDIS_URL through to the
test step so the BullMQ Queue can connect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): address review feedback on PM coalesce deferred ack

- worker-entry: call generateAckMessage() at job fire time instead of
  using workItemTitle as the literal comment text. Adds postDeferredAck()
  shared helper that extracts context from the stored payload (same path
  as the non-coalesced adapter.postAck), generates a proper role-aware
  ack message via LLM/fallback, then calls dispatchPMAck. Eliminates
  the 3x copy-paste deferred ack blocks (nitpick fix).

- queue.ts: document the TOCTOU race in scheduleCoalescedJob
  (getJob → getState → remove → add is not atomic; two concurrent
  handlers can lose the second event's data). Practical impact is low
  but documented for future maintainers.

- webhook-processor.ts + types: update ackMessage field docstrings to
  clarify it stores the workItemTitle as a context hint for
  generateAckMessage at fire time, not a pre-generated comment text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
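The fire-time flow described above (stored payload, context hint, generated ack message, then `dispatchPMAck`) can be sketched as follows. This is a hypothetical reduction of the real `postDeferredAck()` in worker-entry.ts: the `JobData` shape, the generator stub, and the injected `dispatchPMAck` signature are assumptions, and the `ackContextHint` field name reflects the rename made later in this PR.

```typescript
type JobData = { pendingAck?: boolean; ackContextHint?: string; agentType?: string };

// Stand-in for the role-aware LLM/fallback generator; the real one is
// invoked at job fire time, never at schedule time.
async function generateAckMessage(hint: string, agentType: string): Promise<string> {
  return `Starting ${agentType} work on "${hint}"`;
}

async function postDeferredAck(
  data: JobData,
  dispatchPMAck: (message: string) => Promise<{ commentId: string } | undefined>,
): Promise<string | undefined> {
  if (!data.pendingAck || !data.agentType) return undefined;
  const message = await generateAckMessage(data.ackContextHint ?? "work item", data.agentType);
  try {
    const result = await dispatchPMAck(message);
    return result?.commentId;
  } catch {
    // Ack failure is non-fatal: the job still runs, just without a comment.
    return undefined;
  }
}
```

Because the router no longer posts at schedule time, a superseded job simply never reaches this helper, which is what eliminates the orphaned ack comments.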

* chore(worker): remove now-unused biome complexity suppression

The noExcessiveCognitiveComplexity suppression on dispatchJob is no
longer needed — extracting postDeferredAck() reduced the function's
complexity below the threshold.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(router): address lock-leak and active-job-status issues in coalesced dispatch

Two lock-lifecycle bugs in BullMQ delayed-job coalescing:

1. Lock leak on supersede: when scheduleCoalescedJob removes a delayed
   job (returns superseded=true), the superseded job's in-memory locks
   are never released because BullMQ's remove() does not fire
   worker.on('failed'). Fix: return supersededJobData from
   scheduleCoalescedJob so the caller can call clearWorkItemEnqueued /
   clearAgentTypeEnqueued / clearRecentlyDispatched for the old job.

2. Misleading status when active job exists: when getState() returns
   'active', the previous code called add() (BullMQ silently ignores it
   for duplicate active IDs) and returned superseded=false, causing the
   caller to mark locks for a non-existent job. Fix: return
   activeExists=true and skip add() entirely; caller skips lock marking
   and logs accurately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
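Both fixes can be sketched together as a minimal `scheduleCoalescedJob`. This is not the real queue.ts; `QueueLike` narrows the BullMQ `Queue` surface to the calls the PR describes (`getJob`, `getState`, `remove`, `add` with `jobId`/`delay`), and the result shape follows the commit text.

```typescript
interface JobLike<T> {
  data: T;
  getState(): Promise<string>;
  remove(): Promise<void>;
}
interface QueueLike<T> {
  getJob(id: string): Promise<JobLike<T> | undefined>;
  add(name: string, data: T, opts: { jobId: string; delay: number }): Promise<void>;
}
interface CoalesceResult<T> {
  superseded: boolean;
  activeExists: boolean;
  supersededJobData?: T; // lets the caller release the old job's in-memory locks
}

// NOTE: getJob -> getState -> remove -> add is not atomic (the TOCTOU race
// documented earlier in this PR); two concurrent handlers can lose the
// second event's data, which is the same outcome as last-writer-wins.
async function scheduleCoalescedJob<T>(
  queue: QueueLike<T>,
  coalesceKey: string,
  data: T,
  delayMs: number,
): Promise<CoalesceResult<T>> {
  const existing = await queue.getJob(coalesceKey);
  if (existing) {
    const state = await existing.getState();
    if (state === "active") {
      // BullMQ silently ignores add() for a duplicate active jobId; skip it
      // so the caller does not mark locks for a job that was never enqueued.
      return { superseded: false, activeExists: true };
    }
    if (state === "delayed" || state === "waiting") {
      const supersededJobData = existing.data;
      await existing.remove(); // remove() does NOT fire worker.on('failed')
      await queue.add("dispatch", data, { jobId: coalesceKey, delay: delayMs });
      return { superseded: true, activeExists: false, supersededJobData };
    }
  }
  await queue.add("dispatch", data, { jobId: coalesceKey, delay: delayMs });
  return { superseded: false, activeExists: false };
}
```

The caller uses `supersededJobData` to run the clearWorkItemEnqueued / clearAgentTypeEnqueued / clearRecentlyDispatched cleanup for the removed job, and `activeExists` to skip lock marking entirely.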

* test(coverage): cover deferred-ack worker path + coalesce-config

Adds 8 unit tests for `getCoalesceWindowMs()` (every branch — default,
PM_COALESCE_WINDOW_MS, legacy fallback, precedence, 0/disable,
non-numeric, negative, empty) and 9 unit tests for the deferred-ack
flow in `dispatchJob` / `postDeferredAck` (Trello/JIRA/Linear pendingAck
paths, payload-extractor → contextHint fallback, missing agentType,
non-fatal `dispatchPMAck` rejection, undefined-result fallback to the
pre-existing ackCommentId, Linear-without-workItemId skip, plus the
non-pendingAck Linear routing branch that was wholly uncovered).

Closes the codecov/patch gap on PR #1226 — the new BullMQ deferred-ack
codepath added in this PR was at 15% (worker-entry.ts) / 0%
(coalesce-config.ts).
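The branches those 8 tests pin down can be re-sketched from the PR description alone. This is a hedged reconstruction, not the real `src/pm/coalesce-config.ts`; the parsing details (how non-numeric, negative, and empty values fall back) are assumptions consistent with the test list above.

```typescript
// Precedence: PM_COALESCE_WINDOW_MS, then the legacy
// PM_CREATE_COALESCE_WINDOW_MS fallback, then the 10 000 ms default
// (the old in-memory window was 2 000 ms).
function getCoalesceWindowMs(env: Record<string, string | undefined>): number {
  const raw = env.PM_COALESCE_WINDOW_MS ?? env.PM_CREATE_COALESCE_WINDOW_MS;
  if (raw !== undefined && raw !== "") {
    const parsed = Number(raw);
    // 0 is a valid "disable coalescing" value; reject NaN and negatives.
    if (Number.isFinite(parsed) && parsed >= 0) return parsed;
  }
  return 10_000;
}
```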

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): capture coalesce-schedule failures to Sentry + rename ackMessage hint field

Addresses both items from the (failed-to-submit) review run c63c7835:

1. **Should-fix: Sentry capture gap in coalesce-schedule catch block.**
   The `scheduleCoalescedJob` failure path in `webhook-processor.ts` logged
   via `logger.error` but did not call `captureException`. Other dispatch-
   failure paths flow through BullMQ retry → `worker.on('failed')` →
   `releaseLocksForFailedJob` → Sentry (per spec 015 plan 1), but this catch
   handles a Redis-side failure BEFORE the job is enqueued, so it bypasses
   that pipeline entirely. Redis hiccups during coalesced scheduling were
   silently escaping observability. Now captures under stable tag
   `coalesce_schedule_failure` with `{projectId, workItemId, agentType,
   coalesceKey, adapterType}` extras.

2. **Nit: rename `ackMessage` → `ackContextHint` on the new job interfaces.**
   The PR added an `ackMessage?: string` field on TrelloJob/JiraJob/LinearJob
   (and corresponding `*JobData` interfaces in worker-entry.ts) to store the
   `workItemTitle` as a context hint for `generateAckMessage()` at deferred-
   ack fire time — explicitly NOT the literal comment text, per the JSDoc.
   The name read like the literal text and required reading the JSDoc to
   avoid misuse. Renamed to `ackContextHint` for self-documentation. The
   pre-existing `GitHubJob.ackMessage` field is left unchanged because that
   one IS the literal comment text.
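The catch-block capture from item 1 can be sketched as below. `captureException(err, context)` with `tags`/`extra` matches @sentry/node's real API, but Sentry is stubbed here so the sketch is self-contained, and the exact tag key/value shape is an assumption (the PR only names the stable tag `coalesce_schedule_failure`).

```typescript
interface CaptureContext {
  tags?: Record<string, string>;
  extra?: Record<string, unknown>;
}
// Stub standing in for @sentry/node so the sketch runs without a DSN.
const captured: Array<{ err: unknown; ctx?: CaptureContext }> = [];
const Sentry = {
  captureException(err: unknown, ctx?: CaptureContext): void {
    captured.push({ err, ctx });
  },
};

interface ScheduleMeta {
  projectId: string;
  workItemId: string;
  agentType: string;
  coalesceKey: string;
  adapterType: string;
}

async function scheduleWithCapture(schedule: () => Promise<void>, meta: ScheduleMeta): Promise<void> {
  try {
    await schedule();
  } catch (err) {
    // A Redis-side failure here happens BEFORE the job is enqueued, so it
    // bypasses the worker.on('failed') -> Sentry pipeline; capture directly.
    Sentry.captureException(err, {
      tags: { failure_kind: "coalesce_schedule_failure" }, // tag key assumed
      extra: { ...meta },
    });
  }
}
```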

Renamed 3 type fields (Trello/JIRA/Linear job interfaces in queue.ts +
matching shapes in worker-entry.ts), 1 set site in webhook-processor.ts,
3 read sites in worker-entry.ts's deferred-ack helper invocations, plus
the JSDoc reference and 4 test mocks (test names included).

Full unit suite green (473 files / 8700 tests). Typecheck + lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(router): sanitize Docker-invalid chars in jobId when building worker container name

Live regression on prod: PR #1226 introduced coalesced-job IDs shaped
`coalesce:${projectId}:${workItemId}` (e.g. `coalesce:ucho:MNG-413`).
`spawnWorker()` blindly concatenated this into the Docker container name
(`cascade-worker-coalesce:ucho:MNG-413`), which Docker rejected with HTTP 400
"bad parameter — Invalid container name (...), only [a-zA-Z0-9][a-zA-Z0-9_.-]
are allowed". Every coalesced job that fired post-deploy failed at
`docker createContainer`, the worker never started, the run was lost.

Two fixes considered:
- Change the BullMQ jobId format upstream in `queue.ts` to use a Docker-safe
  separator. Risk: jobIds appear in logs / dedup keys / metrics; switching
  the canonical form widens the blast radius.
- Sanitize at the spawn site only. Lower risk: the original jobId stays
  intact for logs and dedup; only the Docker-name derivation gets normalized.

Chose the second. Replace any character not in `[a-zA-Z0-9_.-]` with `_`.
The leading char is `cascade-worker-...` so the Docker-required leading
`[a-zA-Z0-9]` is already satisfied.
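The spawn-site normalization amounts to a one-line regex replace; the helper name and the exact placement inside `spawnWorker()` are assumptions, but the character class and prefix come straight from the commit text.

```typescript
// Docker container names must match [a-zA-Z0-9][a-zA-Z0-9_.-]*. The
// original jobId is left untouched for logs/dedup/metrics; only the
// container-name derivation is sanitized.
function workerContainerName(jobId: string): string {
  return `cascade-worker-${jobId.replace(/[^a-zA-Z0-9_.-]/g, "_")}`;
}
```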

Tests:
- New regression pin: jobId `coalesce:ucho:MNG-413` → container name
  `cascade-worker-coalesce_ucho_MNG-413` (the live failing case).
- New positive pin: ordinary jobIds (alphanumeric + dashes/dots/underscores)
  pass through unchanged.

Hotfix scope. Going straight to dev with a tight diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Cascade Bot <bot@cascade.dev>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>