feat(gastown): add debug dry-run endpoint with event draining #1370

Merged
jrf0110 merged 6 commits into convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head from gt/toast/c127ebe8
Mar 21, 2026

Conversation

@jrf0110 jrf0110 commented Mar 21, 2026

Summary

Adds a POST /debug/towns/:townId/reconcile-dry-run endpoint that executes the reconciler's Phase 0 (drain events → apply → mark processed) and Phase 1 (reconcile) against current state, returning the actions it would emit without applying them. The entire operation runs inside a SQLite SAVEPOINT that is rolled back in a finally block, keeping the endpoint fully side-effect-free.

This gives operators a way to preview what the next alarm tick would do — including the effect of pending unprocessed events — without triggering any side effects.
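
The savepoint-and-rollback pattern described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `SqlExecutor` interface, the `withRolledBackSavepoint` helper, the savepoint name, and the recording fake are all hypothetical, and the real TownDO storage API may look different.

```typescript
// Hypothetical minimal SQL interface; the real TownDO storage API may differ.
interface SqlExecutor {
  exec(statement: string): void;
}

// Runs `body` inside a savepoint and always rolls it back, so any writes
// performed by `body` are discarded whether it returns or throws.
function withRolledBackSavepoint<T>(sql: SqlExecutor, name: string, body: () => T): T {
  sql.exec(`SAVEPOINT ${name}`);
  try {
    return body();
  } finally {
    // ROLLBACK TO undoes the writes but leaves the savepoint on the stack;
    // RELEASE then removes it.
    sql.exec(`ROLLBACK TO ${name}`);
    sql.exec(`RELEASE ${name}`);
  }
}

// Fake executor that only records statements, used to show the ordering.
class RecordingSql implements SqlExecutor {
  readonly log: string[] = [];
  exec(statement: string): void {
    this.log.push(statement);
  }
}

function demoDryRun(): { result: string[]; log: string[] } {
  const sql = new RecordingSql();
  const result = withRolledBackSavepoint(sql, 'dry_run', () => {
    // Stand-in for Phase 0 writes; a real database would discard this on rollback.
    sql.exec("UPDATE town_events SET processed_at = 'now'");
    return ['hypothetical_action'];
  });
  return { result, log: sql.log };
}
```

In `demoDryRun()` the recorded statements are the `SAVEPOINT`, the stand-in write, then `ROLLBACK TO` and `RELEASE`, while the computed result is still returned to the caller. Putting both cleanup statements in `finally` means a throwing body leaves no open savepoint behind.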

Verification

  • Polecat reports typecheck and oxlint pass
  • Code review confirmed the drain/apply/reconcile sequence matches the real alarm loop (Town.do.ts L2925-2955)
  • SAVEPOINT/ROLLBACK pattern is correct SQLite (ROLLBACK TO + RELEASE in finally)
  • Pick<ReconcilerMetrics, ...> fields all exist on the type
  • No build artifacts or secrets in the diff

Visual Changes

N/A

Reviewer Notes

  • The endpoint is unauthenticated, following the same pattern as the existing GET /debug/towns/:townId/status endpoint
  • Unlike the real alarm loop, there is no per-event try/catch — a single applyEvent failure will propagate up and trigger the savepoint rollback. This is intentional for a debug tool where you want errors to be visible
  • The Action type import was added alongside the existing ApplyActionContext import

kilo-code-bot Bot and others added 5 commits March 20, 2026 17:58
Moves useFeatureFlagVariantKey('button-vs-card') from ClawDashboard
(which renders for all users including those with existing instances)
to CreateInstanceCard (which only renders for users who haven't
provisioned yet). This scopes the experiment exposure to users who
can actually see the create CTA, avoiding population dilution.

…nt (#1338)

## Summary

Evaluates the `button-vs-card` PostHog feature flag in
`CreateInstanceCard` so the SDK attaches `$feature/button-vs-card` to
subsequent events (including `claw_create_instance_clicked`). Without
this, the cloud app's PostHog SDK never evaluates the flag, so the
experiment gets 0 conversions even though users are clicking.

The flag is evaluated in `CreateInstanceCard` (not `ClawDashboard`) so
only users who can actually see the create CTA are marked as exposed.
`ClawDashboard` also renders for users with existing instances,
mid-onboarding, or viewing settings — evaluating there would dilute the
experiment population.

## Verification

- [x] Verified formatting with `oxfmt` on changed files
- [x] Typecheck passes (no new errors from this change)
- [x] Confirmed `useFeatureFlagVariantKey` is exported by
`posthog-js/react` (v1.360.2)

## Visual Changes

N/A

## Reviewer Notes

- No UI or behavior changes. The hook return value is intentionally
unused — the sole purpose is flag evaluation so PostHog auto-attaches
`$feature/button-vs-card` to tracked events.
- Users who reach `CreateInstanceCard` without coming through the
landing page will get a variant assigned by the cloud app SDK. This is
expected — PostHog uses the same hash for the same distinct_id, so the
variant will be consistent.

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

Wrap debugDryRun() in a SQLite savepoint so it can drain and apply
pending town_events (Phase 0) before running reconcile (Phase 1),
matching the real alarm loop behavior. The savepoint is rolled back
in a finally block so the endpoint remains fully side-effect-free.

Adds eventsDrained to the returned metrics.
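
The response fields named in the commit messages above (actions, actionsEmitted, actionsByType, pendingEventCount, eventsDrained) could be assembled roughly as below. This is a sketch under assumptions: the `Action` and `ReconcilerMetrics` shapes are simplified stand-ins, `wallClockMs` is an invented field that only shows what `Pick` leaves out, and the exact set of picked fields in the PR is not shown in this conversation.

```typescript
// Hypothetical action shape; the real Action type in the reconciler is richer.
type Action = { type: string };

// Hypothetical subset of the metrics type, limited to fields named in this PR.
interface ReconcilerMetrics {
  eventsDrained: number;
  actionsEmitted: number;
  actionsByType: Record<string, number>;
  pendingEventCount: number;
  wallClockMs: number; // assumed extra field, present only to show what Pick omits
}

// The dry run reports a subset of the metrics, selected with Pick.
type DryRunMetrics = Pick<
  ReconcilerMetrics,
  'eventsDrained' | 'actionsEmitted' | 'actionsByType' | 'pendingEventCount'
>;

interface DryRunResponse {
  actions: Action[];
  metrics: DryRunMetrics;
}

// Builds the response from the actions the reconciler would have emitted.
function summarizeDryRun(
  actions: Action[],
  eventsDrained: number,
  pendingEventCount: number,
): DryRunResponse {
  const actionsByType: Record<string, number> = {};
  for (const action of actions) {
    actionsByType[action.type] = (actionsByType[action.type] ?? 0) + 1;
  }
  return {
    actions,
    metrics: {
      eventsDrained,
      actionsEmitted: actions.length,
      actionsByType,
      pendingEventCount,
    },
  };
}
```

Because `DryRunMetrics` is derived with `Pick`, renaming or removing one of those fields on the full metrics type becomes a compile error here rather than a silent mismatch.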
return c.json({ alarmStatus, agentMeta, beadSummary });
});

app.post('/debug/towns/:townId/reconcile-dry-run', async c => {

WARNING: This debug route is publicly reachable and returns raw reconcile actions

Because this route is registered before any auth middleware, anyone who knows a townId can trigger a full dry-run reconcile and read the returned Action[]. That response can include internal bead/agent IDs, PR URLs, nudge or mayor messages, and it also lets unauthenticated callers force the worker through the full drain/apply/reconcile path. Please gate this endpoint behind the same auth as other town operator routes, or keep it behind perimeter-only protection before merging.
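
One way to gate the route, as the comment above requests, is to wrap the handler in a check that runs first. The sketch below is framework-agnostic and entirely hypothetical: `DebugRequest`, `DebugResponse`, `requireOperatorToken`, and the bearer-token scheme are assumptions for illustration and are not the auth mechanism used by the other town operator routes.

```typescript
// Hypothetical request/response shapes standing in for the worker framework's context.
interface DebugRequest {
  headers: Record<string, string | undefined>;
  townId: string;
}

interface DebugResponse {
  status: number;
  body: unknown;
}

type DebugHandler = (req: DebugRequest) => DebugResponse;

// Wraps a handler so it only runs when the caller presents the expected token.
function requireOperatorToken(expectedToken: string, handler: DebugHandler): DebugHandler {
  return req => {
    const header = req.headers['authorization'] ?? '';
    const presented = header.startsWith('Bearer ') ? header.slice('Bearer '.length) : '';
    // An empty expected token means auth is misconfigured, so fail closed.
    // Note: !== is not a constant-time comparison; a real deployment would
    // likely want a timing-safe check here.
    if (expectedToken === '' || presented !== expectedToken) {
      return { status: 401, body: { error: 'unauthorized' } };
    }
    return handler(req);
  };
}

// Stand-in for the dry-run handler; the real one would call debugDryRun().
const dryRunHandler: DebugHandler = req => ({
  status: 200,
  body: { townId: req.townId, actions: [] },
});

const guardedDryRun = requireOperatorToken('example-secret', dryRunHandler);
```

With this wrapper the dry-run work never starts for unauthenticated callers, which addresses both concerns raised above: the raw actions are not returned, and the drain/apply/reconcile path is not executed.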

kilo-code-bot Bot commented Mar 21, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 2 |
| SUGGESTION | 0 |

WARNING

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/dos/Town.do.ts | 3711 | Dry-run aborts on the first applyEvent failure, so it no longer matches the real alarm loop and can hide reconcile actions behind a 500. |

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/gastown.worker.ts | 209 | Unauthenticated debug route exposes raw reconcile actions and allows expensive dry-run execution for any known town ID. |

Files Reviewed (1 files)
  • cloudflare-gastown/src/dos/Town.do.ts - 1 issue

Reviewed by gpt-5.4-20260305 · 264,919 tokens

@jrf0110 jrf0110 changed the base branch from main to convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head March 21, 2026 17:51
// Phase 0: Drain and apply pending events (same as real alarm loop)
const pending = events.drainEvents(this.sql);
for (const event of pending) {
  reconciler.applyEvent(this.sql, event);

WARNING: Dry run no longer matches the real alarm loop when an event is bad

debugDryRun() now applies each drained event without the per-event try/catch that the real alarm loop uses in cloudflare-gastown/src/dos/Town.do.ts:2938. If one pending event throws here, the whole endpoint returns an error and you never see the reconcile actions for the remaining queue, even though the actual alarm tick would log the failure, skip that event, and continue. That makes this preview unreliable exactly when the queue contains the malformed event you're trying to inspect.
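
A tolerant variant of the loop, matching the log-skip-continue behavior described above, might look like the following. It is a sketch only: `TownEvent`, `DrainResult`, and `applyEventsTolerantly` are hypothetical names, and instead of logging it collects failures so a dry run could return them alongside the reconcile actions.

```typescript
// Hypothetical event shape; the real town_events rows carry more fields.
type TownEvent = { id: string };

interface DrainResult {
  applied: string[];
  failed: { eventId: string; error: string }[];
}

// Applies each event independently: a failure is recorded and skipped,
// and the remaining events are still applied.
function applyEventsTolerantly(
  pending: TownEvent[],
  applyEvent: (event: TownEvent) => void,
): DrainResult {
  const result: DrainResult = { applied: [], failed: [] };
  for (const event of pending) {
    try {
      applyEvent(event);
      result.applied.push(event.id);
    } catch (err) {
      result.failed.push({
        eventId: event.id,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return result;
}
```

Returning the failures keeps errors visible, which is the stated goal of the fail-fast version, without hiding the actions the reconciler would emit for the rest of the queue.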

@jrf0110 jrf0110 merged commit c8a756f into convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head Mar 21, 2026
2 checks passed
@jrf0110 jrf0110 deleted the gt/toast/c127ebe8 branch March 21, 2026 17:59
jrf0110 added a commit that referenced this pull request Mar 24, 2026
#1373)

* fix: skip container_status events for running containers (#1368)

Filter out 'running' status in the alarm pre-phase before calling
upsertContainerStatus(). Running is the steady-state for healthy agents
and a no-op in applyEvent(), so recording it just bloats the event table
(~720 events/hour/agent). Non-running statuses (stopped, error, unknown)
still get inserted for reconciler detection.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint (#1367)

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

* feat(gastown): add debug dry-run endpoint with event draining (#1370)

* feat(claw): evaluate button-vs-card feature flag for PostHog experiment tracking

* fix(claw): move button-vs-card flag eval to CreateInstanceCard

Moves useFeatureFlagVariantKey('button-vs-card') from ClawDashboard
(which renders for all users including those with existing instances)
to CreateInstanceCard (which only renders for users who haven't
provisioned yet). This scopes the experiment exposure to users who
can actually see the create CTA, avoiding population dilution.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

* fix(gastown): drain pending events in debugDryRun() before reconciling

Wrap debugDryRun() in a SQLite savepoint so it can drain and apply
pending town_events (Phase 0) before running reconcile (Phase 1),
matching the real alarm loop behavior. The savepoint is rolled back
in a finally block so the endpoint remains fully side-effect-free.

Adds eventsDrained to the returned metrics.

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>

* feat(gastown): add POST /debug/replay-events endpoint for event replay debugging

Adds debugReplayEvents(from, to) method to Town.do.ts that queries all
town_events in a time range (regardless of processed_at), applies them
to reconstruct state transitions, runs the reconciler, and returns the
computed actions and a state snapshot. Uses a SQLite SAVEPOINT that is
rolled back so the endpoint remains fully side-effect-free.

Route: POST /debug/towns/:townId/replay-events
Body: { from: ISO, to: ISO }
Response: { eventsReplayed, actions, stateSnapshot }

* feat(gastown): emit reconciler metrics to Analytics Engine and add Grafana dashboard panels (#1372)

- Extend writeEvent() to support double3-double10 fields for reconciler metrics
- Emit reconciler_tick event after each alarm tick with all 9 metrics
- Add Reconciler row to Grafana dashboard with 6 panels:
  1. Events drained per tick (timeseries)
  2. Actions emitted per tick by type (stacked bar)
  3. Side effects attempted/succeeded/failed (timeseries)
  4. Invariant violations (stat with >0 alert threshold)
  5. Reconciler wall clock time (timeseries with >500ms threshold)
  6. Pending event queue depth (gauge with >50 threshold)

* fix(gastown): add replay caveat and fix Grafana pending-events gauge query

Add a caveat comment and response field to debugReplayEvents explaining
that events are re-applied on top of live state, not from a pre-window
snapshot — results are approximate, useful for debugging event flow but
not faithful historical reconstruction.

Fix the Grafana 'Pending Event Queue Depth' gauge to show the latest
row's double8 value instead of averaging across the time window.

* feat(gastown): add Sentry capture for reconciler invariant violations

Each invariant violation now triggers Sentry.captureMessage with structured
context (invariant number, message, townId) as both extra data and tags.
Existing analytics event emission is preserved. Added TODO for future
auto-recovery of invariant #7 (working agent with no hook).

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>

jrf0110 commented Apr 6, 2026

Refinery code review passed. All quality gates pass.
