feat(router): implement cancel listener and API integration for run cancellation#777
feat(router): implement cancel listener and API integration for run cancellation#777
Conversation
…ancellation - New src/router/cancel-listener.ts: subscribes to Redis cancel channel and kills workers - Fallback Docker label scanning when jobId not found in database (race condition handling) - Updated router startup in src/router/index.ts to start/stop cancel listener with lifecycle - Updated runs.ts cancel mutation to publish cancel command after cancelRunById succeeds - Fire-and-forget publish with error logging to prevent API failures - Comprehensive unit tests for cancel-listener with Docker and DB mocking - Updated runs.test.ts to verify publishCancelCommand is called with correct parameters All tests pass, lint and typecheck clean.
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
The Docker fallback in cancel-listener.ts will kill the wrong container in multi-run environments. It picks the first cascade.managed=true container without matching by runId, making it a correctness risk in production.
Architecture & Design
The overall design — Redis pub/sub for cancel command distribution, fire-and-forget pattern in the API, and the cancel listener in the router — fits well with the existing architecture. The integration into the router lifecycle (startRouter/shutdown) follows established patterns.
Code Issues
Blocking
- src/router/cancel-listener.ts:108 —
fallbackKillByDockerLabel(runId)acceptsrunIdbut never uses it to match a container. It lists all containers withcascade.managed=truelabel and killscascadeContainers[0]— an arbitrary container. In a multi-run environment, this kills a random healthy worker. The containers carrycascade.job.idandcascade.job.typelabels (seecontainer-manager.ts:194-196) but nocascade.run.idlabel, so the fallback cannot match by run even in principle. This needs either: (a) adding acascade.run.idlabel when spawning containers, or (b) removing the fallback entirely and logging a warning instead, which is safer than killing the wrong container.
Should Fix
-
src/router/cancel-listener.ts:62-63 —
stopCancelListener()setscancelSubscriber = nullbut never actually unsubscribes from Redis or closes the subscriber connection. Thequeue/cancel.tsmodule exposes no cleanup function. During graceful shutdown, this leaves an open Redis connection, which may delay process exit or leak resources. Consider adding anunsubscribeFromCancelCommands()function toqueue/cancel.ts. -
src/router/cancel-listener.ts:63 —
cancelSubscriber = true as unknown as ReturnType<typeof subscribeToCancelCommands>is a misleading type hack.subscribeToCancelCommandsreturnsPromise<void>, so this variable holds a boolean cast toPromise<void>. It's only used as a truthy/falsy flag — declare it aslet cancelSubscriberActive = falsefor clarity. -
src/router/cancel-listener.ts:101-108 — Redundant condition. After the
cascadeContainers.length === 0early return on line 101, theif (cascadeContainers.length > 0)check on line 108 is always true. Remove the redundant check.
- Remove unsafe Docker fallback that could kill wrong container in multi-run environments; containers have no run ID label so matching is impossible — replaced with a warning log - Add unsubscribeFromCancelCommands() to queue/cancel.ts so the Redis subscriber connection is properly closed on graceful shutdown - Replace misleading type hack (cancelSubscriber = true as unknown as ...) with a clear boolean flag (cancelSubscriberActive) - Remove redundant cascadeContainers.length > 0 check that was always true after the early return guard - Update tests to cover new warning-log behavior and verify shutdown properly calls unsubscribeFromCancelCommands Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ted field The sidecar now includes ackCommentDeleted in its output; update the assertion to use toMatchObject to allow for extra fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
nhopeatall
left a comment
There was a problem hiding this comment.
Summary
LGTM — Clean, well-structured implementation of the cancel listener and API integration. The code follows existing patterns, has comprehensive test coverage, and handles error/edge cases correctly.
Minor Notes
-
PR description mismatch: The PR description's "What was implemented" section mentions a "Docker fallback: When jobId not found in DB (race condition), scans containers with
cascade.managed=truelabel" — but the actual implementation deliberately skips the Docker fallback and logs a warning instead. The code comments correctly explain why (container labels carry no run ID, so a fallback would risk killing the wrong container). The PR description should be updated to match the actual behavior, as someone reading only the description would have incorrect expectations. -
console.errorinunsubscribeFromCancelCommands(line 108 insrc/queue/cancel.ts): Usesconsole.errorinstead of the structuredloggerused everywhere else in the codebase. This is consistent with the pre-existingconsole.erroron line 87 of the same file, but both are inconsistent with the project convention. Not blocking — thecancel-listener.tswrapper already usesloggerfor its own error paths. -
Missing test for no-REDIS_URL path: The cancel-listener tests don't cover the early return when
REDIS_URLis not set. Minor gap since the path is trivial (just a log + return), but noting for completeness.
Neither of these rises to the level of blocking. The implementation is sound, the fire-and-forget pattern is correctly applied with .catch() error logging, shutdown lifecycle is properly ordered, tests cover the important paths, and all CI checks pass.
Summary
Implements Story #2 of the cancel-run epic: Router cancel listener and Dashboard API integration.
When a user requests to cancel a running agent job via the Dashboard API, the router now listens for cancel commands via Redis pub/sub, looks up the job's Docker container, and kills it. This prevents wasted compute and enables immediate cancellation of long-running jobs.
PR Link: https://trello.com/c/jyj0g3yF/303-as-a-developer-i-want-the-router-to-listen-for-cancel-commands
What was implemented
Core functionality
src/router/cancel-listener.ts: Subscribes to Redis cancel channel and handles terminationgetRunJobId(runId)killWorker(jobId)from container-manager to stop the Docker containercascade.managed=truelabelRouter integration
src/router/index.tslifecycle:startCancelListener()instartRouter()stopCancelListener()inshutdown()Dashboard API integration
src/api/routers/runs.tscancel mutation:publishCancelCommand(runId, reason)after DB operation succeedsTesting
Key decisions
Fire-and-forget pattern for publish: The API response succeeds after the DB cancel completes, regardless of whether Redis publish succeeds.
Docker fallback strategy: Handles the race condition where a container is running but the jobId hasn't been written to the database yet.
Non-Redis support: When no
REDIS_URLis configured, the cancel listener gracefully skips startup.Testing
Files changed
src/router/cancel-listener.ts(new, 130 lines)src/router/index.ts(modified: 4 lines)src/api/routers/runs.ts(modified: 9 lines)tests/unit/router/cancel-listener.test.ts(new, 221 lines)tests/unit/api/routers/runs.test.ts(modified: 31 lines)