
fix: stuck refineries and polecats — MR bead failure paths don't update convoy progress #1243

@jrf0110

Problem

Refineries and polecats are getting stuck. Two related root causes were found:

1. MR bead failures bypass updateBeadStatus()

In processReviewQueue() (Town.do.ts), three early-exit failure paths call reviewQueue.completeReview(sql, entry.id, 'failed') directly (raw SQL UPDATE). This bypasses beadOps.updateBeadStatus(), which means:

  • No status_changed bead_event is logged
  • Convoy progress is never updated — the convoy counter never increments, so the convoy never lands even when all source beads are closed
  • The MR bead goes to failed silently and the convoy stalls

The three affected paths are:

  • No rig_id on the MR bead
  • No rig config found for the rig
  • Refinery container fails to start (calls completeReview then returns)

These should all call beadOps.updateBeadStatus() instead. That is the same change the convoy-bead-failure-reasons convoy is making for bead events, but here the convoy-progress side effect is the critical fix.
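A small in-memory model shows the difference between the two write paths. This is a hypothetical sketch, not the real Town.do.ts code. The Store class, the closedBeads/totalBeads fields, and the assumption that updateConvoyProgress() counts both closed and failed as terminal are all invented for illustration. The model shows only why a raw status UPDATE leaves the convoy counter, and therefore auto-land, untouched.

```typescript
// Hypothetical model: NOT the real Town.do.ts / review-queue.ts types or schema.
type BeadStatus = "open" | "in_progress" | "closed" | "failed";

interface Bead {
  id: string;
  status: BeadStatus;
  convoyId?: string;
}

interface Convoy {
  id: string;
  totalBeads: number;
  closedBeads: number; // assumed counter, incremented only by updateConvoyProgress()
  landed: boolean;
}

interface BeadEvent {
  beadId: string;
  type: "status_changed";
  from: BeadStatus;
  to: BeadStatus;
}

// Assumption: a bead counts toward convoy progress once it reaches a terminal status.
const TERMINAL: ReadonlySet<BeadStatus> = new Set<BeadStatus>(["closed", "failed"]);

class Store {
  beads = new Map<string, Bead>();
  convoys = new Map<string, Convoy>();
  events: BeadEvent[] = [];

  // Stands in for the raw SQL UPDATE in completeReview(): the status changes, nothing else.
  rawSetStatus(beadId: string, to: BeadStatus): void {
    const bead = this.beads.get(beadId);
    if (bead) bead.status = to;
  }

  // Stands in for updateBeadStatus(): logs a status_changed event, then updates convoy progress.
  updateBeadStatus(beadId: string, to: BeadStatus): void {
    const bead = this.beads.get(beadId);
    if (!bead || bead.status === to) return;
    const from = bead.status;
    bead.status = to;
    this.events.push({ beadId, type: "status_changed", from, to });
    if (!TERMINAL.has(from) && TERMINAL.has(to)) this.updateConvoyProgress(bead);
  }

  private updateConvoyProgress(bead: Bead): void {
    if (!bead.convoyId) return;
    const convoy = this.convoys.get(bead.convoyId);
    if (!convoy || convoy.landed) return;
    convoy.closedBeads += 1;
    // Auto-land once every tracked bead has reached a terminal status.
    if (convoy.closedBeads >= convoy.totalBeads) convoy.landed = true;
  }
}
```

Consider a two-bead convoy. Closing the source bead through updateBeadStatus() and then failing the MR bead through rawSetStatus() leaves closedBeads at 1, landed false, and only one event logged. Routing the MR failure through updateBeadStatus() instead brings the counter to 2 and lands the convoy.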

2. Rework re-dispatch path may also be involved

When a refinery or reviewer signals rework (agentCompleted in review-queue.ts and completeReviewWithResult in Town.do.ts), a new polecat is hooked and dispatched fire-and-forget. If anything in that path fails silently, the source bead can be left in_progress with no agent dispatched to it. Worth auditing this path for missing error handling or guard conditions.
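One way to close a silent-failure gap in the rework path is to keep the call site fire-and-forget but handle errors inside the dispatched task. The sketch below uses hypothetical hookPolecat, dispatchAgent, updateBeadStatus, and log dependencies. It assumes that returning the source bead to open is an acceptable recovery so it can be picked up again. The real recovery policy in review-queue.ts and Town.do.ts may be different.

```typescript
// Hypothetical dependency shapes; the real hook/dispatch helpers may differ.
type BeadStatus = "open" | "in_progress" | "closed" | "failed";

interface ReworkDeps {
  hookPolecat(beadId: string): Promise<string>; // assumed to return the new agent's id
  dispatchAgent(agentId: string, beadId: string): Promise<void>;
  updateBeadStatus(beadId: string, status: BeadStatus): Promise<void>;
  log(message: string, err: unknown): void;
}

// Handles its own dispatch errors, so callers can discard the promise safely.
// Resolves true if a polecat was hooked and dispatched, false otherwise.
async function dispatchReworkSafely(deps: ReworkDeps, beadId: string): Promise<boolean> {
  try {
    const agentId = await deps.hookPolecat(beadId);
    await deps.dispatchAgent(agentId, beadId);
    return true;
  } catch (err) {
    deps.log(`rework dispatch failed for bead ${beadId}`, err);
    try {
      // Assumption: reopening the bead lets it be retried instead of
      // leaving it in_progress with no agent attached.
      await deps.updateBeadStatus(beadId, "open");
    } catch (revertErr) {
      deps.log(`could not reopen bead ${beadId}`, revertErr);
    }
    return false;
  }
}
```

The function does not reject when hooking or dispatch fails, so existing fire-and-forget call sites keep their shape. A failure now produces a log line and a bead that is no longer stranded in in_progress.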

Investigation notes

  • completeReview() in review-queue.ts is a raw SQL UPDATE — it does NOT call updateBeadStatus() or trigger updateConvoyProgress()
  • updateBeadStatus() → updateConvoyProgress() is the only path that increments convoy closed_bead counters and triggers auto-land
  • The processReviewQueue() refinery dispatch also passes kilocodeToken: rigConfig.kilocodeToken directly (line ~3629) rather than the resolved kilocodeToken used in all other dispatch sites. Worth checking whether this could be undefined and cause silent failures
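If the refinery dispatch should match the other call sites, one option is to resolve the token once and fail explicitly when it is missing. The issue does not say how the resolved kilocodeToken is computed. The fallback order below (rig-level token, then a town-level token), the TownConfig type, and the resolveKilocodeToken() name are all assumptions made for illustration.

```typescript
// Hypothetical config shapes; only the kilocodeToken field name comes from the issue.
interface RigConfig {
  rigId: string;
  kilocodeToken?: string;
}

interface TownConfig {
  kilocodeToken?: string; // assumed town-level fallback
}

type TokenResult = { ok: true; token: string } | { ok: false; reason: string };

// Assumed fallback order: rig-level token first, then the town-level token.
// Missing or blank tokens produce an explicit failure instead of `undefined`.
function resolveKilocodeToken(rig: RigConfig, town: TownConfig): TokenResult {
  const token = rig.kilocodeToken ?? town.kilocodeToken;
  if (token === undefined || token.trim() === "") {
    return { ok: false, reason: `no kilocodeToken for rig ${rig.rigId}` };
  }
  return { ok: true, token };
}
```

On a failed result, processReviewQueue() could fail the MR bead through beadOps.updateBeadStatus() with the reason, rather than starting a refinery with an undefined token.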

Fix

Replace the reviewQueue.completeReview(sql, entry.id, 'failed') calls in processReviewQueue() with beadOps.updateBeadStatus() calls so convoy progress is properly updated on MR bead failure.
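Below is a sketch of the proposed fix, with the three early exits routed through one helper that calls beadOps.updateBeadStatus(). The ReviewDeps interface, getRigConfig(), startRefinery(), failMrBead(), and the optional reason argument to updateBeadStatus() are hypothetical stand-ins for the real Town.do.ts code. Like the original completeReview(sql, entry.id, 'failed') call, the sketch assumes entry.id identifies the MR bead.

```typescript
// Hypothetical shapes standing in for the real Town.do.ts types.
type BeadStatus = "open" | "in_progress" | "closed" | "failed";

interface ReviewEntry {
  id: string; // assumed to identify the MR bead, as in completeReview(sql, entry.id, 'failed')
  rigId?: string;
}

interface RigConfig {
  kilocodeToken?: string;
}

interface BeadOps {
  // The reason argument is an assumption; the real signature may not take one.
  updateBeadStatus(beadId: string, status: BeadStatus, reason?: string): void;
}

interface ReviewDeps {
  beadOps: BeadOps;
  getRigConfig(rigId: string): RigConfig | undefined;
  startRefinery(entry: ReviewEntry, rig: RigConfig): Promise<boolean>;
}

// Every early exit goes through updateBeadStatus(), so each failure logs a
// status_changed event and updates convoy progress.
function failMrBead(deps: ReviewDeps, entry: ReviewEntry, reason: string): void {
  deps.beadOps.updateBeadStatus(entry.id, "failed", reason);
}

async function processReviewEntry(deps: ReviewDeps, entry: ReviewEntry): Promise<void> {
  // Path 1: no rig_id on the MR bead.
  if (!entry.rigId) {
    failMrBead(deps, entry, "MR bead has no rig_id");
    return;
  }
  // Path 2: no rig config found for the rig.
  const rig = deps.getRigConfig(entry.rigId);
  if (!rig) {
    failMrBead(deps, entry, `no rig config for ${entry.rigId}`);
    return;
  }
  // Path 3: refinery container fails to start (either resolves false or rejects).
  const started = await deps.startRefinery(entry, rig).catch(() => false);
  if (!started) {
    failMrBead(deps, entry, "refinery container failed to start");
  }
}
```

Routing every exit through a single helper also makes it harder for a future early-exit path to reintroduce the raw completeReview() write by accident.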

Metadata

Assignees: No one assigned

Labels:

  • P0: Blocks soft launch
  • bug: Something isn't working
  • gt:core: Reconciler, state machine, bead lifecycle, convoy flow
  • gt:refinery: Review queue, merge strategies, rework flow
  • kilo-auto-fix: Auto-generated label by Kilo
  • kilo-triaged: Auto-generated label by Kilo

Development: No branches or pull requests