perf(gastown): Reduce alarm loop latency for active towns #1428

@jrf0110

Parent

Part of #204 (Phase 4: Hardening)

Problem

The Town DO alarm loop introduces noticeable latency in the UI. Mayor-slung beads take seconds to appear, agent hookings take a while to transition beads to in_progress, and PR merges take multiple alarm cycles to propagate. The root causes:

  1. armAlarmIfNeeded() doesn't bump a pending idle alarm. When a town is idle (alarm set 60s in the future) and new work arrives, the method checks if (!current || current < Date.now()) — if an alarm is pending, it does nothing. Work waits up to 60s for the existing idle alarm to fire. Affects slingConvoy(), agentDone(), agentCompleted(), submitToReviewQueue(), and all other callers.

  2. Active interval (5s) runs every phase on every tick. Bead assignment, event drain, PR polling, patrol, GC, mail delivery, and status broadcast all run every 5s. Most of this work is unnecessary most ticks, but the bundling means the fast stuff (event drain + reconcile + dispatch) can't run faster without also running the slow stuff more often.

  3. PR polling has no rate limiting. Every in-progress MR bead with a pr_url hits the GitHub API every 5s. A town with 5 open PRs makes ~60 API calls/minute (5 PRs × 12 polls/min) against GitHub's 5000/hour (~83/min) authenticated rate limit, about 72% of the budget. That leaves little headroom for other consumers sharing the token.

  4. Event-driven work routes through the poll loop. agentDone inserts a town_event → waits for alarm → Phase 0 drains → Phase 1 reconciles → Phase 2 side effects. That's 5-10s of latency for an operation that already has full context at call time. Similarly, slingConvoy creates beads but relies on the reconciler to assign agents on the next tick.
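The polling arithmetic in item 3 can be checked with a small helper. This is only a sketch: `pollLoad` and `GITHUB_HOURLY_LIMIT` are hypothetical names, and it assumes one API call per PR per poll, which the issue implies but does not state.

```typescript
// Hypothetical helper (not from the codebase): estimate PR-poll load as a
// share of the authenticated GitHub budget cited in the issue.
const GITHUB_HOURLY_LIMIT = 5000;

function pollLoad(openPrs: number, pollIntervalMs: number, callsPerPoll = 1) {
  // Polls per minute per PR, times PRs, times calls per poll (assumed 1).
  const callsPerMinute = openPrs * callsPerPoll * (60_000 / pollIntervalMs);
  const budgetPerMinute = GITHUB_HOURLY_LIMIT / 60; // ~83.3 calls/min
  return { callsPerMinute, budgetShare: callsPerMinute / budgetPerMinute };
}

const everyTick = pollLoad(5, 5_000); // today: poll on every 5s tick
const throttled = pollLoad(5, 30_000); // with a 30s per-bead minimum
console.log(everyTick.callsPerMinute, throttled.callsPerMinute);
```

Under these assumptions, moving from a 5s to a 30s poll interval cuts the same town from 60 to 10 calls per minute.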

Proposed Changes

A. Fix armAlarmIfNeeded() — bump idle alarms to active interval

Effort: Low | Impact: High | Risk: Very low

When new work arrives during an idle period, reschedule the alarm to fire within the active interval instead of waiting for the idle alarm:

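private async armAlarmIfNeeded(): Promise<void> {
  // No stored town id: nothing to schedule.
  const storedId = await this.ctx.storage.get<string>('town:id');
  if (!storedId) return;
  const current = await this.ctx.storage.getAlarm();
  const activeDeadline = Date.now() + ACTIVE_ALARM_INTERVAL_MS;
  // Arm when no alarm is set, or pull a far-off idle alarm in to the active
  // deadline. An alarm already due sooner is left untouched.
  if (!current || current > activeDeadline) {
    await this.ctx.storage.setAlarm(activeDeadline);
  }
}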

Eliminates the 60s idle-to-active transition penalty.
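To make the decision rule easy to unit-test without a Durable Object, the comparison can be pulled into a pure function. This is a sketch under assumptions: `nextAlarmTime` is a hypothetical name, and the 5s value for `ACTIVE_ALARM_INTERVAL_MS` is taken from the active interval described in the Problem section.

```typescript
const ACTIVE_ALARM_INTERVAL_MS = 5_000; // assumed from the 5s active interval

// Hypothetical pure version of the check in armAlarmIfNeeded().
// Returns the timestamp to pass to setAlarm(), or null to leave the alarm as is.
function nextAlarmTime(
  current: number | null,
  now: number,
  activeIntervalMs: number = ACTIVE_ALARM_INTERVAL_MS,
): number | null {
  const activeDeadline = now + activeIntervalMs;
  // No alarm, or an alarm later than the active deadline: (re)arm.
  if (current === null || current > activeDeadline) return activeDeadline;
  // An alarm at or before the active deadline already fires soon enough.
  return null;
}

const now = 1_000_000;
const noAlarm = nextAlarmTime(null, now); // arms at now + 5s
const idleAlarm = nextAlarmTime(now + 60_000, now); // bumps the 60s idle alarm to now + 5s
const soonAlarm = nextAlarmTime(now + 2_000, now); // null: already due within 5s
console.log(noAlarm, idleAlarm, soonAlarm);
```

One design note: an alarm whose time is already in the past is also left alone here, on the assumption that the runtime will fire it imminently.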

B. Split the alarm into fast and slow phases

Effort: Medium | Impact: High | Risk: Medium

Not all work needs to run every tick. Separate into:

  • Fast path (1-2s): Event drain → reconcile beads/agents → dispatch → status broadcast
  • Slow path (30-60s): PR polling, patrol/GUPP, GC, mail delivery, escalation checks, container health

Use a timestamp to throttle slow phases:

const FAST_ALARM_INTERVAL_MS = 2_000;
const SLOW_PHASE_INTERVAL_MS = 30_000;

// In alarm():
const now = Date.now();
const runSlowPhase = !this.lastSlowPhaseAt || (now - this.lastSlowPhaseAt) >= SLOW_PHASE_INTERVAL_MS;
if (runSlowPhase) this.lastSlowPhaseAt = now;

This lets bead assignment and agent hooking happen in ~2s while keeping expensive operations throttled.
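A minimal sketch of the throttle in isolation, simulated over one minute of 2s ticks. `PhaseThrottle` and `shouldRunSlowPhase` are hypothetical names used for illustration; only the two interval constants come from the snippet above.

```typescript
const FAST_ALARM_INTERVAL_MS = 2_000;
const SLOW_PHASE_INTERVAL_MS = 30_000;

// Hypothetical wrapper around the lastSlowPhaseAt timestamp check.
class PhaseThrottle {
  private lastSlowPhaseAt: number | null = null;

  // Returns true when the slow phase is due, and records the run time.
  shouldRunSlowPhase(now: number): boolean {
    const due =
      this.lastSlowPhaseAt === null ||
      now - this.lastSlowPhaseAt >= SLOW_PHASE_INTERVAL_MS;
    if (due) this.lastSlowPhaseAt = now;
    return due;
  }
}

// Simulate alarm ticks at 0s, 2s, ..., 60s and count slow-phase runs.
const throttle = new PhaseThrottle();
let fastRuns = 0;
let slowRuns = 0;
for (let t = 0; t <= 60_000; t += FAST_ALARM_INTERVAL_MS) {
  fastRuns += 1; // fast path runs on every tick
  if (throttle.shouldRunSlowPhase(t)) slowRuns += 1;
}
console.log({ fastRuns, slowRuns });
```

Because the timestamp is held in memory, it resets whenever the object is evicted and restarted, so the slow phase would run on the first tick afterwards. If that matters, the timestamp could be persisted in storage instead.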

C. Add PR polling rate limiting

Effort: Low | Impact: Medium | Risk: Low

  • Track last_polled_at on MR beads, skip if polled within the last 30s
  • Consider GitHub webhooks for immediate PR status updates (eliminates polling for GitHub repos entirely)
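The per-bead throttle in the first bullet could look roughly like this. The `MrBead` shape, `selectBeadsToPoll`, and `PR_POLL_MIN_INTERVAL_MS` are assumptions for illustration; only `pr_url`, `last_polled_at`, and the 30s window come from the issue.

```typescript
const PR_POLL_MIN_INTERVAL_MS = 30_000; // "polled within the last 30s"

// Hypothetical, simplified view of an MR bead.
interface MrBead {
  id: string;
  pr_url: string | null;
  last_polled_at: number | null; // epoch ms, null if never polled
}

// Pick the beads that are due for a poll and stamp them as polled.
function selectBeadsToPoll(beads: MrBead[], now: number): MrBead[] {
  const due = beads.filter(
    (b) =>
      b.pr_url !== null &&
      (b.last_polled_at === null ||
        now - b.last_polled_at >= PR_POLL_MIN_INTERVAL_MS),
  );
  for (const b of due) b.last_polled_at = now;
  return due;
}

const beads: MrBead[] = [
  { id: "never-polled", pr_url: "https://example.invalid/pr/1", last_polled_at: null },
  { id: "recent", pr_url: "https://example.invalid/pr/2", last_polled_at: 95_000 },
  { id: "stale", pr_url: "https://example.invalid/pr/3", last_polled_at: 60_000 },
  { id: "no-pr", pr_url: null, last_polled_at: null },
];
const dueIds = selectBeadsToPoll(beads, 100_000).map((b) => b.id);
console.log(dueIds);
```

Stamping `last_polled_at` before the request completes keeps a slow or failing GitHub call from being retried on every tick; stamping after success instead is a reasonable alternative if retries are preferred.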

D. Process events inline where possible

Effort: Medium | Impact: Medium | Risk: Medium

For operations that already have full context (like agentDone), apply the state transition immediately in the RPC handler instead of inserting an event and waiting for the alarm to drain it. The event can still be inserted for audit purposes. This eliminates the 5-10s round-trip through the alarm loop for common operations.
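One way to apply the transition inline while keeping the audit trail is to record the event as already processed, so the alarm's drain phase skips it. The sketch below is an in-memory model, not the Town DO: `TownModel`, the `processed` flag, and the `in_review` target status are all hypothetical.

```typescript
// Hypothetical status set; only "in_progress" is named in the issue.
type BeadStatus = "open" | "in_progress" | "in_review";

interface TownEvent {
  type: "agent_done";
  beadId: string;
  at: number;
  processed: boolean; // true when the handler already applied the transition
}

class TownModel {
  beads = new Map<string, BeadStatus>();
  events: TownEvent[] = [];

  // RPC handler: apply the transition now, keep the event for audit only.
  agentDone(beadId: string, now: number): void {
    if (this.beads.get(beadId) === "in_progress") {
      this.beads.set(beadId, "in_review");
    }
    this.events.push({ type: "agent_done", beadId, at: now, processed: true });
  }

  // Alarm drain phase: return only events nobody has applied yet.
  drainPending(): TownEvent[] {
    const pending = this.events.filter((e) => !e.processed);
    for (const e of pending) e.processed = true;
    return pending;
  }
}

const town = new TownModel();
town.beads.set("bead-1", "in_progress");
town.agentDone("bead-1", 1_000);
console.log(town.beads.get("bead-1"), town.drainPending().length);
```

The guard on the current status keeps the handler idempotent if the same RPC is delivered twice.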

E. Immediate dispatch for slingConvoy

Effort: Low | Impact: Medium | Risk: Low

After creating convoy beads, run a targeted mini-reconcile that assigns agents to the initially-unblocked beads, the same way slingBead() already does fire-and-forget dispatch. Eliminates the wait of up to 5s (one active tick) for the first batch of convoy beads.
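A rough sketch of what the targeted mini-reconcile might do: assign idle agents to convoy beads that have no blockers, and leave everything else for the regular reconciler. The `ConvoyBead` shape and `dispatchUnblocked` are hypothetical and do not reflect the real scheduling.ts API.

```typescript
// Hypothetical, simplified view of a freshly created convoy bead.
interface ConvoyBead {
  id: string;
  blockedBy: string[]; // ids of beads that must finish first
  assignee: string | null;
}

// Assign idle agents to unblocked, unassigned beads in creation order.
// Returns the beads that were assigned in this pass.
function dispatchUnblocked(beads: ConvoyBead[], idleAgents: string[]): ConvoyBead[] {
  const available = [...idleAgents]; // do not mutate the caller's list
  const assigned: ConvoyBead[] = [];
  for (const bead of beads) {
    if (available.length === 0) break;
    if (bead.assignee !== null || bead.blockedBy.length > 0) continue;
    bead.assignee = available.shift()!;
    assigned.push(bead);
  }
  return assigned;
}

const convoy: ConvoyBead[] = [
  { id: "a", blockedBy: [], assignee: null },
  { id: "b", blockedBy: ["a"], assignee: null },
  { id: "c", blockedBy: [], assignee: null },
  { id: "d", blockedBy: [], assignee: null },
];
const started = dispatchUnblocked(convoy, ["agent-1", "agent-2"]);
console.log(started.map((b) => `${b.id}:${b.assignee}`));
```

Beads left unassigned here (blocked, or no idle agent available) would still be picked up by the normal alarm-driven reconcile, so this pass only needs to be best-effort.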

Priority Order

| Change                       | Effort | Impact                                      | Risk     |
|------------------------------|--------|---------------------------------------------|----------|
| A. Fix armAlarmIfNeeded      | Low    | High (eliminates 60s idle penalty)          | Very low |
| B. Split fast/slow phases    | Medium | High (2s bead assignment)                   | Medium   |
| C. PR polling rate limit     | Low    | Medium (prevents rate limit issues)         | Low      |
| D. Inline event processing   | Medium | Medium (eliminates 5-10s for agent done)    | Medium   |
| E. Immediate convoy dispatch | Low    | Medium (faster convoy starts)               | Low      |

Key Files

  • cloudflare-gastown/src/dos/Town.do.ts — alarm handler (line 2830), armAlarmIfNeeded() (line 3442), interval constants (line 121)
  • cloudflare-gastown/src/dos/town/scheduling.ts — dispatch logic, hasActiveWork()
  • cloudflare-gastown/src/dos/town/reconciler.ts — all reconciliation rules, PR polling actions
  • cloudflare-gastown/src/dos/town/actions.ts — action application and deferred side effects
  • cloudflare-gastown/src/dos/town/patrol.ts — GUPP thresholds and patrol checks

Notes

  • Change A can be shipped independently and immediately: it's a small change confined to one method, with very low risk
  • Change B is the big architectural improvement but needs careful testing to ensure slow-phase operations still run reliably
  • Change C is important for correctness regardless of latency goals — without it, towns with many open PRs can hit GitHub rate limits
  • Changes D and E are incremental optimizations that can be done independently

Metadata

Labels

  • P2 (Post-launch)
  • enhancement (New feature or request)
  • gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
  • kilo-auto-fix (Auto-generated label by Kilo)
  • kilo-triaged (Auto-generated label by Kilo)