
fix(router): post ack comment before enqueuing job to eliminate race condition#647

Merged
zbigniewsobiecki merged 1 commit into dev from
fix/ack-before-enqueue-race-condition
Mar 7, 2026

@aaight aaight commented Mar 7, 2026

Summary

Fixes a race condition where the BullMQ Worker could pick up a job and serialize job.data into JOB_DATA before the router patched ackCommentId via updateData(), causing the worker to start without a pre-seeded ack comment.

  • Root cause: addJob → (race window) → postAck → updateData(); the worker could dequeue the job during the race window
  • Fix: Restructure processRouterWebhook() to call postAck before buildJob and addJob, so ackCommentId is embedded directly in the job payload at enqueue time — eliminating the race entirely
  • Adapters updated: Trello, GitHub, and JIRA buildJob methods now accept an optional AckResult parameter and embed ackCommentId/ackMessage directly in the job
  • Removed: The updateData() patch-after-enqueue block and the jobQueue import from webhook-processor.ts
  • Added: Info/debug logging for ack success and ack failure cases
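The reordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Adapter`, `Queue`, and `Job` shapes and the `processWebhookSketch` name are hypothetical, and the real `buildJob` takes more arguments (the ack result is its 5th).

```typescript
// Hypothetical shapes for illustration; the project's real types may differ.
interface AckResult {
  commentId: number | string;
  message?: string;
}

interface Job {
  workItemId: string;
  ackCommentId?: number | string;
}

interface Adapter {
  // Resolves to undefined when posting the ack comment fails.
  postAck(workItemId: string): Promise<AckResult | undefined>;
  buildJob(workItemId: string, ackResult?: AckResult): Job;
}

interface Queue {
  addJob(job: Job): Promise<void>;
}

// Post the ack first, then build and enqueue a job that already carries
// ackCommentId, so there is nothing left to patch after enqueue.
async function processWebhookSketch(
  adapter: Adapter,
  queue: Queue,
  workItemId: string,
): Promise<Job> {
  const ackResult = await adapter.postAck(workItemId);
  const job = adapter.buildJob(workItemId, ackResult);
  await queue.addJob(job); // the payload is complete when a worker first sees it
  return job;
}
```

Because the job is built after the ack resolves, a worker that dequeues immediately still reads a payload that includes the ack comment id whenever the ack succeeded.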

Test plan

  • Updated "enqueues job before posting ack" test → now asserts ['postAck', 'addJob'] order
  • Replaced "patches ack info via updateData" test → now asserts buildJob is called with ackResult as 5th arg
  • Added test: job still enqueues when postAck returns undefined (ack failure is non-fatal)
  • All 3922 unit tests pass
  • Lint and typecheck pass

Card: https://trello.com/c/VH29Y9mx/190-lets-make-sure-initial-comment-id-is-successfully-passed-from-router-to-worker-on-order-to-work-on-the-same-comment-from-acknowl

🤖 Generated with Claude Code

@nhopeatall nhopeatall left a comment

Summary

LGTM — clean fix for a real race condition between job enqueue and ack comment patching.

The restructuring is sound: moving postAck before buildJob/addJob eliminates the race window where a BullMQ worker could dequeue a job before ackCommentId was patched via updateData(). The fix is minimal, consistent across all three adapters (GitHub, Trello, JIRA), and removes the fragile updateData() patch-after-enqueue pattern along with the now-unnecessary jobQueue import.

Key things verified:

  • Error resilience preserved: All three adapter postAck implementations wrap their logic in try/catch and return undefined on failure, so a failed ack never blocks job enqueue.
  • Job types consistent: ackCommentId is embedded in all three job types; ackMessage is correctly included only in GitHubJob (the only type that has that field).
  • No stale references: jobQueue and updateData are fully removed from webhook-processor.ts.
  • Interface + all implementations updated: buildJob signature extended with optional ackResult in the interface and all three adapters.
  • Tests updated and comprehensive: Call order test flipped to ['postAck', 'addJob'], new test for buildJob receiving ackResult, new test confirming enqueue proceeds when postAck returns undefined.
  • CI passing: All checks green.

The tradeoff (ack latency now on the enqueue critical path) is well justified — a single comment-posting API call is cheap compared to the cost of workers starting without a pre-seeded ack comment.
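The "return undefined on failure" behaviour that keeps the ack non-fatal might look like the sketch below. `postAckSafely`, `postComment`, and `onError` are hypothetical names used only for illustration; the actual adapters implement their own try/catch around platform-specific API calls.

```typescript
// Hypothetical shape; the project's AckResult may carry more fields.
interface AckResult {
  commentId: number | string;
}

// Wraps a comment-posting call so that any failure resolves to undefined
// instead of rejecting, leaving the caller free to enqueue the job anyway.
async function postAckSafely(
  postComment: () => Promise<AckResult>,
  onError: (err: unknown) => void = () => {},
): Promise<AckResult | undefined> {
  try {
    return await postComment();
  } catch (err) {
    onError(err); // e.g. log the failure; never rethrow
    return undefined;
  }
}
```

With this shape, the ack latency sits on the enqueue path, but an ack error can only delay the job, not prevent it.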

@zbigniewsobiecki

@aaight resolve conflicts

@aaight aaight force-pushed the fix/ack-before-enqueue-race-condition branch from e10bf95 to d153774 Compare March 7, 2026 12:35
aaight commented Mar 7, 2026

Conflicts resolved! Rebased onto dev and resolved the two conflicts:

src/router/webhook-processor.ts: Kept our ack-before-enqueue pipeline while incorporating the decisionReason field from dev's recent logging changes (PR #645). The final return now includes decisionReason: 'Job queued: {agentType} agent for work item {workItemId}'.

tests/unit/router/webhook-processor.test.ts: Kept our test assertions (ack-before-enqueue ordering, buildJob with 5th ackResult arg) and added the decisionReason assertion from dev.

All 19 webhook-processor tests pass, typecheck clean, lint clean.

nhopeatall commented Mar 7, 2026

Review complete — APPROVED

Clean fix for a real race condition between job enqueue and ack comment posting. The reordering (postAck → buildJob → addJob) eliminates the window where BullMQ could dequeue a job before ackCommentId was patched. All failure semantics are preserved, interface changes are consistent across all three adapters, and tests cover the key scenarios.

@nhopeatall nhopeatall left a comment

Summary

LGTM — clean fix for a real race condition between job enqueue and ack comment posting. The reordering (postAck → buildJob → addJob) eliminates the window where BullMQ could dequeue a job before ackCommentId was patched via updateData().

What I verified:

  • Race condition eliminated: ackResult is now threaded directly into buildJob as the 5th parameter, so the job payload is complete at enqueue time. No more post-hoc patching via jobQueue.getJob() + updateData().

  • Failure semantics preserved: All three postAck implementations (GitHub, Trello, JIRA) have try/catch guards and return undefined on failure — they never throw. So moving postAck before addJob doesn't risk blocking job enqueuing on ack failure.

  • Minor behavioral change is acceptable: In the new ordering, if addJob fails (e.g., Redis down), the ack comment is already posted but the job never enqueues. This is a strict improvement over the old race condition and only manifests in rare Redis failure scenarios.

  • Interface consistency: The AckResult parameter is optional in both the interface and all three adapter implementations. The as number | undefined / as string | undefined casts in each adapter correctly narrow the union type from AckResult.commentId to match the platform-specific job types.

  • ackMessage omission in Trello/JIRA is intentional: Only GitHubJob has an ackMessage field (confirmed in queue.ts and worker-entry.ts), so Trello and JIRA adapters correctly omit it.

  • Cleanup is complete: The jobQueue import is removed from webhook-processor.ts, and all related updateData patching code is gone. No dead references remain.

  • Tests cover the key scenarios: ordering assertion (['postAck', 'addJob']), ackResult threading to buildJob, and graceful handling of undefined ackResult.
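The optional `AckResult` parameter and the per-platform casts noted above could be sketched as follows. The job shapes and function names here are hypothetical, and the pairing of numeric ids with GitHub and string ids with Trello is an assumption for illustration; the real `buildJob` methods take additional arguments.

```typescript
// Hypothetical shapes; the real job types are defined in the project.
interface AckResult {
  commentId: number | string;
  message?: string;
}

interface GitHubJobSketch {
  type: 'github';
  ackCommentId?: number;
  ackMessage?: string;
}

interface TrelloJobSketch {
  type: 'trello';
  ackCommentId?: string; // no ackMessage field on this job type
}

// Assumes GitHub comment ids are numeric; the cast narrows the union type
// at compile time only and performs no runtime check.
function buildGitHubJob(ackResult?: AckResult): GitHubJobSketch {
  return {
    type: 'github',
    ackCommentId: ackResult?.commentId as number | undefined,
    ackMessage: ackResult?.message,
  };
}

// Assumes Trello comment ids are strings; ackMessage is intentionally omitted.
function buildTrelloJob(ackResult?: AckResult): TrelloJobSketch {
  return {
    type: 'trello',
    ackCommentId: ackResult?.commentId as string | undefined,
  };
}
```

Calling either builder without an `ackResult` yields a job whose `ackCommentId` is `undefined`, which matches the case where the ack failed and the job is enqueued anyway.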

@zbigniewsobiecki zbigniewsobiecki merged commit a96c660 into dev Mar 7, 2026
6 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/ack-before-enqueue-race-condition branch March 16, 2026 16:43