feat(router): implement cancel listener and API integration for run cancellation #777

Merged
aaight merged 3 commits into dev from feature/cancel-listener
Mar 13, 2026

@aaight commented Mar 13, 2026

Summary

Implements Story #2 of the cancel-run epic: Router cancel listener and Dashboard API integration.

When a user cancels a running agent job via the Dashboard API, the API publishes a cancel command over Redis pub/sub. The router listens for these commands, looks up the job's Docker container, and kills it. This prevents wasted compute and enables immediate cancellation of long-running jobs.

Trello card: https://trello.com/c/jyj0g3yF/303-as-a-developer-i-want-the-router-to-listen-for-cancel-commands

What was implemented

Core functionality

  • New src/router/cancel-listener.ts: Subscribes to Redis cancel channel and handles termination
    • Looks up jobId from database via getRunJobId(runId)
    • Calls killWorker(jobId) from container-manager to stop the Docker container
    • Docker fallback: When jobId not found in DB (race condition), scans containers with cascade.managed=true label
    • Comprehensive error handling for both DB and Docker failures
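
A minimal sketch of the lookup-then-kill flow described above. The PR does not show the real signatures of getRunJobId or killWorker, so the dependency interface, the outcome type, and the handleCancelCommand name are assumptions made for illustration. The Docker fallback is left out; the sketch only reports a missing jobId.

```typescript
// Hypothetical dependency shapes: the real getRunJobId and killWorker
// signatures are not shown in this PR.
interface CancelDeps {
  getRunJobId(runId: string): Promise<string | null>;
  killWorker(jobId: string): Promise<void>;
  log(level: "info" | "warn" | "error", message: string): void;
}

type CancelOutcome = "killed" | "job-not-found" | "lookup-failed" | "kill-failed";

// Handles one cancel command: resolve the jobId, then stop its container.
// DB and Docker failures are logged and reported, never thrown, so a single
// bad command cannot take down the listener.
async function handleCancelCommand(
  runId: string,
  deps: CancelDeps,
): Promise<CancelOutcome> {
  let jobId: string | null;
  try {
    jobId = await deps.getRunJobId(runId);
  } catch (err) {
    deps.log("error", `cancel: DB lookup failed for run ${runId}: ${String(err)}`);
    return "lookup-failed";
  }

  if (jobId === null) {
    deps.log("warn", `cancel: no jobId recorded for run ${runId}`);
    return "job-not-found";
  }

  try {
    await deps.killWorker(jobId);
  } catch (err) {
    deps.log("error", `cancel: failed to kill job ${jobId}: ${String(err)}`);
    return "kill-failed";
  }

  deps.log("info", `cancel: killed job ${jobId} for run ${runId}`);
  return "killed";
}
```

Returning an outcome instead of throwing is one possible design; it makes each branch easy to assert on in unit tests with mocked DB and Docker dependencies.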

Router integration

  • src/router/index.ts lifecycle:
    • Calls startCancelListener() in startRouter()
    • Calls stopCancelListener() in shutdown()

Dashboard API integration

  • src/api/routers/runs.ts cancel mutation:
    • Publishes cancel command via publishCancelCommand(runId, reason) after DB operation succeeds
    • Fire-and-forget pattern with error logging
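
The fire-and-forget publish could look roughly like the sketch below. Only publishCancelCommand(runId, reason) is named in this PR; the cancelRun wrapper, its injected cancelInDb and logError parameters, and the return shape are hypothetical stand-ins for the real mutation.

```typescript
type PublishCancel = (runId: string, reason: string) => Promise<void>;

// Hypothetical wrapper around the cancel mutation. The DB update is awaited,
// the Redis publish is not: its failure is logged and never reaches the caller.
async function cancelRun(
  runId: string,
  reason: string,
  cancelInDb: (runId: string) => Promise<void>,
  publishCancelCommand: PublishCancel,
  logError: (message: string) => void,
): Promise<{ cancelled: true }> {
  // A failure here rejects the whole call, so nothing is published.
  await cancelInDb(runId);

  // Fire-and-forget: no await, but always attach .catch() so a rejected
  // publish does not surface as an unhandled promise rejection.
  publishCancelCommand(runId, reason).catch((err: unknown) => {
    logError(`failed to publish cancel for run ${runId}: ${String(err)}`);
  });

  return { cancelled: true };
}
```

The trade-off is that a lost publish leaves the container running even though the run is marked cancelled in the database, which is why the error log matters.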

Testing

  • 9 unit tests for cancel-listener covering database lookup, Docker fallback, and error handling
  • 2 new tests in runs.test.ts verifying publishCancelCommand integration

Key decisions

  1. Fire-and-forget pattern for publish: The API response succeeds after the DB cancel completes, regardless of whether Redis publish succeeds.

  2. Docker fallback strategy: Handles the race condition where a container is running but the jobId hasn't been written to the database yet.

  3. Non-Redis support: When no REDIS_URL is configured, the cancel listener gracefully skips startup.
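
To illustrate decision 3, here is a sketch of a startup guard that skips the listener when Redis is not configured. startCancelListener is named in this PR, but its signature is not shown; the injected dependency object and the boolean return value are assumptions.

```typescript
// Hypothetical dependencies, injected so the sketch stays self-contained.
interface CancelListenerDeps {
  redisUrl: string | undefined; // e.g. the value of process.env.REDIS_URL
  subscribe: (onCancel: (runId: string) => void) => Promise<void>;
  onCancel: (runId: string) => void;
  log: (message: string) => void;
}

// Returns true when the listener subscribed, false when startup was skipped.
async function startCancelListener(deps: CancelListenerDeps): Promise<boolean> {
  if (!deps.redisUrl) {
    // No Redis configured: log once and return without touching the network.
    deps.log("REDIS_URL not configured; cancel listener not started");
    return false;
  }
  await deps.subscribe(deps.onCancel);
  return true;
}
```

Passing the URL in rather than reading the environment inside the function keeps the skip path easy to cover in a unit test.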

Testing

  • Unit tests: All 4,577 tests pass (including 9 new cancel-listener tests)
  • Linting: Biome lint clean
  • Type checking: TypeScript compilation clean
  • Pre-commit hooks: All pass

Files changed

  • src/router/cancel-listener.ts (new, 130 lines)
  • src/router/index.ts (modified: 4 lines)
  • src/api/routers/runs.ts (modified: 9 lines)
  • tests/unit/router/cancel-listener.test.ts (new, 221 lines)
  • tests/unit/api/routers/runs.test.ts (modified: 31 lines)

…ancellation

- New src/router/cancel-listener.ts: subscribes to Redis cancel channel and kills workers
- Fallback Docker label scanning when jobId not found in database (race condition handling)
- Updated router startup in src/router/index.ts to start/stop cancel listener with lifecycle
- Updated runs.ts cancel mutation to publish cancel command after cancelRunById succeeds
- Fire-and-forget publish with error logging to prevent API failures
- Comprehensive unit tests for cancel-listener with Docker and DB mocking
- Updated runs.test.ts to verify publishCancelCommand is called with correct parameters

All tests pass, lint and typecheck clean.

@nhopeatall left a comment

Summary

The Docker fallback in cancel-listener.ts will kill the wrong container in multi-run environments. It picks the first cascade.managed=true container without matching by runId, making it a correctness risk in production.

Architecture & Design

The overall design — Redis pub/sub for cancel command distribution, fire-and-forget pattern in the API, and the cancel listener in the router — fits well with the existing architecture. The integration into the router lifecycle (startRouter/shutdown) follows established patterns.

Code Issues

Blocking

  • src/router/cancel-listener.ts:108: fallbackKillByDockerLabel(runId) accepts runId but never uses it to match a container. It lists all containers with the cascade.managed=true label and kills cascadeContainers[0], an arbitrary container. In a multi-run environment, this kills a random healthy worker. The containers carry cascade.job.id and cascade.job.type labels (see container-manager.ts:194-196) but no cascade.run.id label, so the fallback cannot match by run even in principle. This needs either: (a) adding a cascade.run.id label when spawning containers, or (b) removing the fallback entirely and logging a warning instead, which is safer than killing the wrong container.
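
For option (a), a label-matching fallback might look like the sketch below. It assumes containers were spawned with a cascade.run.id label, which does not exist in the codebase at this point, and it uses a simplified ContainerSummary shape rather than any real Docker client type.

```typescript
// Simplified, hypothetical view of a container listing entry.
interface ContainerSummary {
  id: string;
  labels: Record<string, string>;
}

// Returns the managed container for this run, or undefined when none matches.
// The caller must treat undefined as "do nothing": falling back to the first
// managed container is exactly the bug described above.
function findContainerForRun(
  containers: ContainerSummary[],
  runId: string,
): ContainerSummary | undefined {
  return containers.find(
    (c) =>
      c.labels["cascade.managed"] === "true" &&
      c.labels["cascade.run.id"] === runId, // assumed label, see lead-in
  );
}
```

With several workers running, only the container whose label equals the requested runId is selected, and an unknown runId selects nothing.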

Should Fix

  • src/router/cancel-listener.ts:62-63: stopCancelListener() sets cancelSubscriber = null but never actually unsubscribes from Redis or closes the subscriber connection. The queue/cancel.ts module exposes no cleanup function. During graceful shutdown, this leaves an open Redis connection, which may delay process exit or leak resources. Consider adding an unsubscribeFromCancelCommands() function to queue/cancel.ts.

  • src/router/cancel-listener.ts:63: cancelSubscriber = true as unknown as ReturnType<typeof subscribeToCancelCommands> is a misleading type hack. subscribeToCancelCommands returns Promise<void>, so this variable holds a boolean cast to Promise<void>. It's only used as a truthy/falsy flag; declare it as let cancelSubscriberActive = false for clarity.

  • src/router/cancel-listener.ts:101-108 — Redundant condition. After the cascadeContainers.length === 0 early return on line 101, the if (cascadeContainers.length > 0) check on line 108 is always true. Remove the redundant check.
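
A sketch of how the first two suggestions could fit together: a plain boolean flag plus an explicit unsubscribe on shutdown. The CancelQueue interface is an assumed shape for queue/cancel.ts, and createCancelListener is a hypothetical factory used only to keep the example self-contained.

```typescript
// Assumed shape of the queue/cancel.ts exports; real signatures may differ.
interface CancelQueue {
  subscribeToCancelCommands(handler: (runId: string) => void): Promise<void>;
  unsubscribeFromCancelCommands(): Promise<void>;
}

function createCancelListener(
  queue: CancelQueue,
  onCancel: (runId: string) => void,
) {
  // A plain boolean says what it means; no cast to a Promise type needed.
  let cancelSubscriberActive = false;

  return {
    async start(): Promise<void> {
      if (cancelSubscriberActive) return; // idempotent start
      await queue.subscribeToCancelCommands(onCancel);
      cancelSubscriberActive = true;
    },
    async stop(): Promise<void> {
      if (!cancelSubscriberActive) return; // nothing to clean up
      cancelSubscriberActive = false;
      // Release the subscriber connection so shutdown is not held open.
      await queue.unsubscribeFromCancelCommands();
    },
    isActive: (): boolean => cancelSubscriberActive,
  };
}
```

Calling stop() when the listener never started is a no-op, so shutdown code can call it unconditionally.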

Cascade Bot and others added 2 commits March 13, 2026 17:11
- Remove unsafe Docker fallback that could kill wrong container in
  multi-run environments; containers have no run ID label so matching
  is impossible — replaced with a warning log
- Add unsubscribeFromCancelCommands() to queue/cancel.ts so the Redis
  subscriber connection is properly closed on graceful shutdown
- Replace misleading type hack (cancelSubscriber = true as unknown as ...)
  with a clear boolean flag (cancelSubscriberActive)
- Remove redundant cascadeContainers.length > 0 check that was always
  true after the early return guard
- Update tests to cover new warning-log behavior and verify shutdown
  properly calls unsubscribeFromCancelCommands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ted field

The sidecar now includes ackCommentDeleted in its output; update the
assertion to use toMatchObject to allow for extra fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@nhopeatall left a comment

Summary

LGTM — Clean, well-structured implementation of the cancel listener and API integration. The code follows existing patterns, has comprehensive test coverage, and handles error/edge cases correctly.

Minor Notes

  • PR description mismatch: The PR description's "What was implemented" section mentions a "Docker fallback: When jobId not found in DB (race condition), scans containers with cascade.managed=true label" — but the actual implementation deliberately skips the Docker fallback and logs a warning instead. The code comments correctly explain why (container labels carry no run ID, so a fallback would risk killing the wrong container). The PR description should be updated to match the actual behavior, as someone reading only the description would have incorrect expectations.

  • console.error in unsubscribeFromCancelCommands (line 108 in src/queue/cancel.ts): Uses console.error instead of the structured logger used everywhere else in the codebase. This is consistent with the pre-existing console.error on line 87 of the same file, but both are inconsistent with the project convention. Not blocking — the cancel-listener.ts wrapper already uses logger for its own error paths.

  • Missing test for no-REDIS_URL path: The cancel-listener tests don't cover the early return when REDIS_URL is not set. Minor gap since the path is trivial (just a log + return), but noting for completeness.

None of these rises to the level of blocking. The implementation is sound, the fire-and-forget pattern is correctly applied with .catch() error logging, shutdown lifecycle is properly ordered, tests cover the important paths, and all CI checks pass.

@aaight aaight merged commit 447fb86 into dev Mar 13, 2026
6 checks passed