feat(gastown): add POST /debug/reconcile-dry-run endpoint #1367

Merged
jrf0110 merged 1 commit into convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head from
convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/gt/birch/3e3850a1
Mar 21, 2026
@jrf0110 jrf0110 commented Mar 21, 2026

Summary

Add a debug endpoint (POST /debug/towns/:townId/reconcile-dry-run) that runs the reconciler against current live state and returns the actions it would emit without applying them. The debugDryRun() method on TownDO calls reconciler.reconcile(this.sql) — which is side-effect-free (only SELECTs) — then returns the actions array along with metrics (actionsEmitted, actionsByType, pendingEventCount). The route follows the same unauthenticated debug pattern as the existing GET /debug/towns/:townId/status endpoint.

Verification

  • No quality gates configured for this convoy branch
  • Code review: verified reconciler.reconcile() is side-effect-free (only SELECT queries, returns Action[])
  • Verified events.pendingEventCount() is a read-only COUNT query
  • Verified Action type import is correct (from ./town/actions)
  • Verified route follows the existing debug endpoint pattern exactly
  • Polecat reports typecheck, oxlint, and format all pass

Visual Changes

N/A

Reviewer Notes

  • The /debug/ endpoints are unauthenticated by design (temporary debug tooling, marked for removal). This matches the existing pattern.
  • POST is used instead of GET since the endpoint runs the full reconciler and could be expensive — callers should be intentional about invoking it.
  • The Pick<ReconcilerMetrics, ...> type on the return value keeps the contract explicit without duplicating the metric type definitions.
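
A minimal sketch of how such a `Pick<ReconcilerMetrics, ...>` return contract could look. The `Action` and `ReconcilerMetrics` shapes, the `wallClockMs` field, the action type names, and the `buildDryRunResult` helper are all assumptions made for illustration; the real definitions live in the PR's codebase and may differ.

```typescript
// Hypothetical shapes: stand-ins for the real Action and ReconcilerMetrics types.
type Action = { type: string };

interface ReconcilerMetrics {
  eventsDrained: number;
  actionsEmitted: number;
  actionsByType: Record<string, number>;
  pendingEventCount: number;
  wallClockMs: number; // assumed field, not confirmed by the PR
}

// The dry-run response exposes only the three metrics named in the PR.
type DryRunResult = {
  actions: Action[];
  metrics: Pick<
    ReconcilerMetrics,
    'actionsEmitted' | 'actionsByType' | 'pendingEventCount'
  >;
};

// Hypothetical helper: summarises already-computed actions without applying them.
function buildDryRunResult(
  actions: Action[],
  pendingEventCount: number,
): DryRunResult {
  const actionsByType: Record<string, number> = {};
  for (const action of actions) {
    actionsByType[action.type] = (actionsByType[action.type] ?? 0) + 1;
  }
  return {
    actions,
    metrics: { actionsEmitted: actions.length, actionsByType, pendingEventCount },
  };
}
```

Because the metrics type is derived with `Pick`, renaming or retyping a field on `ReconcilerMetrics` would surface as a compile error here rather than silently drifting.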

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount
'actionsEmitted' | 'actionsByType' | 'pendingEventCount'
>;
}> {
const actions = reconciler.reconcile(this.sql);

WARNING: Dry-run skips pending events and can report stale actions

reconcile() is only Phase 1 of the alarm loop. Facts still sitting in town_events (for example agent_done, agent_completed, bead_created, or container_status) are normally applied in Phase 0 via drainEvents() + applyEvent() before reconciliation runs, so this endpoint can disagree with the very next real alarm tick whenever pendingEventCount > 0.
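
The mismatch described above can be illustrated with a toy model. Everything in this sketch (the `State` shape, the single `agent_done` rule, the `check_agent` action) is invented for illustration and is not the real reconciler logic; it only shows why reconciling before pending events are applied can produce a different answer.

```typescript
// Toy model only: names and rules are illustrative, not the real reconciler.
type TownEvent = { kind: 'agent_done'; agentId: string };
type State = { workingAgents: Set<string> };
type Action = { type: 'check_agent'; agentId: string };

// Phase 0 (assumed behaviour): fold one pending event into state.
function applyEvent(state: State, event: TownEvent): void {
  state.workingAgents.delete(event.agentId);
}

// Phase 1 (assumed behaviour): a pure function of the current state.
function reconcile(state: State): Action[] {
  return Array.from(state.workingAgents)
    .sort()
    .map((agentId): Action => ({ type: 'check_agent', agentId }));
}

// Dry run as first merged: reconcile only, pending events are ignored.
function dryRunWithoutDrain(state: State): Action[] {
  return reconcile(state);
}

// Dry run that mirrors the alarm loop: apply pending events to a copy first.
function dryRunWithDrain(state: State, pending: TownEvent[]): Action[] {
  const copy: State = { workingAgents: new Set(state.workingAgents) };
  for (const event of pending) {
    applyEvent(copy, event);
  }
  return reconcile(copy);
}
```

With two working agents and one pending `agent_done` event, the first function still emits an action for the finished agent, while the second does not.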

kilo-code-bot Bot commented Mar 21, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity | Count
CRITICAL | 0
WARNING | 1
SUGGESTION | 0

Issue Details

WARNING

File | Line | Issue
cloudflare-gastown/src/dos/Town.do.ts | 3697 | debugDryRun() calls reconcile() without first applying pending town_events, so the endpoint can return stale actions that do not match the next real alarm tick.
Other Observations (not in diff)

N/A

Files Reviewed (2 files)
  • cloudflare-gastown/src/dos/Town.do.ts - 1 issue
  • cloudflare-gastown/src/gastown.worker.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 549,377 tokens

@jrf0110 jrf0110 merged commit f63be07 into convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head Mar 21, 2026
2 checks passed
@jrf0110 jrf0110 deleted the convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/gt/birch/3e3850a1 branch March 21, 2026 17:52
jrf0110 added a commit that referenced this pull request Mar 24, 2026
#1373)

* fix: skip container_status events for running containers (#1368)

Filter out 'running' status in the alarm pre-phase before calling
upsertContainerStatus(). Running is the steady-state for healthy agents
and a no-op in applyEvent(), so recording it just bloats the event table
(~720 events/hour/agent). Non-running statuses (stopped, error, unknown)
still get inserted for reconciler detection.
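
A rough sketch of the filter this commit describes. The status names come from the commit message; the `StatusObservation` shape and both function names are hypothetical. For scale, ~720 events/hour/agent would correspond to about one status event every 5 seconds if they were evenly spaced.

```typescript
// Status values as listed in the commit message.
type ContainerStatus = 'running' | 'stopped' | 'error' | 'unknown';

// Hypothetical shape for one status observation in the alarm pre-phase.
type StatusObservation = { agentId: string; status: ContainerStatus };

// 'running' is the steady state, so it is not worth recording as an event.
function shouldRecordContainerStatus(status: ContainerStatus): boolean {
  return status !== 'running';
}

// Keeps only the observations that would still be written to the event table.
function observationsToRecord(
  observations: StatusObservation[],
): StatusObservation[] {
  return observations.filter((o) => shouldRecordContainerStatus(o.status));
}
```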

* feat(gastown): add POST /debug/reconcile-dry-run endpoint (#1367)

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

* feat(gastown): add debug dry-run endpoint with event draining (#1370)

* feat(claw): evaluate button-vs-card feature flag for PostHog experiment tracking

* fix(claw): move button-vs-card flag eval to CreateInstanceCard

Moves useFeatureFlagVariantKey('button-vs-card') from ClawDashboard
(which renders for all users including those with existing instances)
to CreateInstanceCard (which only renders for users who haven't
provisioned yet). This scopes the experiment exposure to users who
can actually see the create CTA, avoiding population dilution.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

* fix(gastown): drain pending events in debugDryRun() before reconciling

Wrap debugDryRun() in a SQLite savepoint so it can drain and apply
pending town_events (Phase 0) before running reconcile (Phase 1),
matching the real alarm loop behavior. The savepoint is rolled back
in a finally block so the endpoint remains fully side-effect-free.

Adds eventsDrained to the returned metrics.
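
The savepoint-and-rollback pattern can be sketched as follows. The `SqlRunner` interface and the `withRolledBackSavepoint` helper are assumptions for illustration; the storage API used in the actual change may expose savepoints or transactions differently. Only the three SQLite statements (`SAVEPOINT`, `ROLLBACK TO SAVEPOINT`, `RELEASE SAVEPOINT`) are standard.

```typescript
// Hypothetical minimal SQL handle; the real storage API may differ.
interface SqlRunner {
  exec(statement: string): void;
}

// Runs `body` inside a savepoint and always undoes its writes afterwards.
// `name` is interpolated into SQL, so it must be a trusted identifier.
function withRolledBackSavepoint<T>(
  sql: SqlRunner,
  name: string,
  body: () => T,
): T {
  sql.exec(`SAVEPOINT ${name}`);
  try {
    return body();
  } finally {
    // ROLLBACK TO undoes changes made since the savepoint but leaves the
    // savepoint on the stack, so it is released explicitly afterwards.
    sql.exec(`ROLLBACK TO SAVEPOINT ${name}`);
    sql.exec(`RELEASE SAVEPOINT ${name}`);
  }
}
```

Putting the rollback in `finally` means the writes are discarded whether `body` returns normally or throws, which is what keeps a drain-then-reconcile dry run free of lasting side effects.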

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>

* feat(gastown): add POST /debug/replay-events endpoint for event replay debugging

Adds debugReplayEvents(from, to) method to Town.do.ts that queries all
town_events in a time range (regardless of processed_at), applies them
to reconstruct state transitions, runs the reconciler, and returns the
computed actions and a state snapshot. Uses a SQLite SAVEPOINT that is
rolled back so the endpoint remains fully side-effect-free.

Route: POST /debug/towns/:townId/replay-events
Body: { from: ISO, to: ISO }
Response: { eventsReplayed, actions, stateSnapshot }
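
As an illustration of the `{ from: ISO, to: ISO }` body, here is one way a handler might validate it before querying. The `parseReplayRequest` name, the error messages, and the rule that `from` must not be after `to` are assumptions; the commit message does not say what validation the endpoint performs.

```typescript
// Parsed form of the request body: both bounds as Date objects.
type ReplayWindow = { from: Date; to: Date };

// Hypothetical validator for the POST /debug/towns/:townId/replay-events body.
function parseReplayRequest(body: unknown): ReplayWindow {
  if (typeof body !== 'object' || body === null) {
    throw new Error('body must be a JSON object');
  }
  const record = body as Record<string, unknown>;
  const from = record['from'];
  const to = record['to'];
  if (typeof from !== 'string' || typeof to !== 'string') {
    throw new Error('from and to must be ISO timestamp strings');
  }
  const fromDate = new Date(from);
  const toDate = new Date(to);
  if (Number.isNaN(fromDate.getTime()) || Number.isNaN(toDate.getTime())) {
    throw new Error('from and to must be parseable timestamps');
  }
  // Assumed rule: an inverted window is rejected rather than silently swapped.
  if (fromDate.getTime() > toDate.getTime()) {
    throw new Error('from must not be after to');
  }
  return { from: fromDate, to: toDate };
}
```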

* feat(gastown): emit reconciler metrics to Analytics Engine and add Grafana dashboard panels (#1372)

- Extend writeEvent() to support double3-double10 fields for reconciler metrics
- Emit reconciler_tick event after each alarm tick with all 9 metrics
- Add Reconciler row to Grafana dashboard with 6 panels:
  1. Events drained per tick (timeseries)
  2. Actions emitted per tick by type (stacked bar)
  3. Side effects attempted/succeeded/failed (timeseries)
  4. Invariant violations (stat with >0 alert threshold)
  5. Reconciler wall clock time (timeseries with >500ms threshold)
  6. Pending event queue depth (gauge with >50 threshold)

* fix(gastown): add replay caveat and fix Grafana pending-events gauge query

Add a caveat comment and response field to debugReplayEvents explaining
that events are re-applied on top of live state, not from a pre-window
snapshot — results are approximate, useful for debugging event flow but
not faithful historical reconstruction.

Fix the Grafana 'Pending Event Queue Depth' gauge to show the latest
row's double8 value instead of averaging across the time window.
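
A small worked example of why the gauge fix matters. The `Sample` shape and the numbers are made up; the point is only that averaging over a window can hide a queue that is currently deep, while the latest value does not.

```typescript
// Hypothetical sample of the pending-event queue depth at one tick.
type Sample = { timestampMs: number; pendingEvents: number };

// Old behaviour (as described): mean depth across the whole time window.
function averageDepth(samples: Sample[]): number {
  if (samples.length === 0) return 0;
  const total = samples.reduce((sum, s) => sum + s.pendingEvents, 0);
  return total / samples.length;
}

// Fixed behaviour (as described): depth from the most recent row only.
function latestDepth(samples: Sample[]): number {
  if (samples.length === 0) return 0;
  const latest = samples.reduce((best, s) =>
    s.timestampMs > best.timestampMs ? s : best,
  );
  return latest.pendingEvents;
}
```

For three ticks with depths 0, 0, and 90, the average is 30 and stays under the >50 gauge threshold, while the latest value is 90 and crosses it.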

* feat(gastown): add Sentry capture for reconciler invariant violations

Each invariant violation now triggers Sentry.captureMessage with structured
context (invariant number, message, townId) as both extra data and tags.
Existing analytics event emission is preserved. Added TODO for future
auto-recovery of invariant #7 (working agent with no hook).

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>