## Summary
Complete the remaining Phase 5 (Hardening + Observability) deliverables from the reconciliation implementation plan. The reconciler is live in production with invariant checking and per-tick metrics, but is missing debug tooling, dashboard integration, and alerting.
Parent issue: #204 (Phase 4: Hardening)
## Deliverables
### 1. Event replay debug endpoint

`POST /api/towns/:townId/debug/replay-events`
- Accepts a time range (`from`/`to` ISO timestamps)
- Reads all `town_events` for that range
- Applies them to a fresh in-memory state snapshot
- Runs the reconciler against the resulting state
- Returns the computed actions without applying them
This is the highest-value debug tool: it allows replaying the exact sequence of events that led to a stuck state, and would have saved significant time during the review-round debugging sessions that motivated the reconciler rewrite.
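The steps above can be sketched as a pure replay core. The event, state, and action shapes below are illustrative assumptions, and `applyEvent`/`reconcile` are simplified stand-ins for the production implementations, kept just detailed enough to show the flow:

```typescript
// Assumed shapes; the real types live in the reconciler module.
interface TownEvent { type: string; agentId: string; at: string } // at = ISO timestamp
interface TownState { agents: Map<string, string> }               // agentId -> status
interface Action { kind: "transition"; agentId: string; to: string }

// Stand-in for the production applyEvent: fold one event into the snapshot.
function applyEvent(state: TownState, ev: TownEvent): void {
  if (ev.type === "container_status") state.agents.set(ev.agentId, "running");
  if (ev.type === "container_exit") state.agents.set(ev.agentId, "exited");
}

// Stand-in reconciler: propose transitioning every exited agent to idle.
function reconcile(state: TownState): Action[] {
  const actions: Action[] = [];
  for (const [agentId, status] of state.agents) {
    if (status === "exited") actions.push({ kind: "transition", agentId, to: "idle" });
  }
  return actions;
}

// Endpoint body: select events in [from, to], fold them into a fresh
// snapshot, run the reconciler, and return actions without applying them.
// ISO-8601 strings of the same format compare correctly lexicographically.
function replayEvents(events: TownEvent[], from: string, to: string): Action[] {
  const snapshot: TownState = { agents: new Map() };
  for (const ev of events) {
    if (ev.at >= from && ev.at <= to) applyEvent(snapshot, ev);
  }
  return reconcile(snapshot);
}
```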
### 2. Reconciler dry-run debug endpoint

`POST /api/towns/:townId/debug/reconcile-dry-run`
- Runs `reconciler.reconcile(sql)` against current live state
- Returns the actions it would emit
- Does NOT apply them
Useful for inspecting what the reconciler thinks should happen right now without affecting state. Simpler than event replay — no time range, no state reconstruction.
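A minimal sketch of the dry-run endpoint body, assuming the reconciler exposes a `reconcile()` that computes actions separately from the apply phase (the interface names here are illustrative, not the real API):

```typescript
// Illustrative interfaces for the sketch.
interface Action { kind: string; agentId: string }
interface Reconciler { reconcile(): Action[] }

// Compute what the reconciler would do right now, but skip the apply phase.
function reconcileDryRun(reconciler: Reconciler): { actions: Action[]; applied: false } {
  return { actions: reconciler.reconcile(), applied: false };
}
```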
### 3. Grafana dashboard integration

Update the existing Grafana dashboard (`gastown-grafana-dash-1.json`) with reconciler-specific panels:
- Events drained per tick (timeseries)
- Actions emitted per tick by type (stacked bar)
- Side effects attempted / succeeded / failed (timeseries)
- Invariant violations (counter, alert threshold > 0)
- Reconciler wall clock time (timeseries, alert threshold > 500ms)
- Pending event queue depth (gauge, alert threshold > 50)
Data source: the `_lastReconcilerMetrics` field exposed via `getAlarmStatus()`.
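One panel could look roughly like the fragment below in the dashboard JSON. The metric name `reconciler_invariant_violations_total` is an assumption; substitute whatever name the metrics exporter actually emits:

```json
{
  "title": "Invariant violations",
  "type": "stat",
  "targets": [{ "expr": "reconciler_invariant_violations_total" }],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          { "color": "green", "value": null },
          { "color": "red", "value": 1 }
        ]
      }
    }
  }
}
```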
### 4. Alerting on invariant violations

Currently, invariant violations are only logged via `console.error`. Add:
- Sentry error capture for each violation (with structured context: invariant number, message, townId)
- A counter metric for violations that Grafana can alert on
- Consider: auto-recovery actions for specific invariants (e.g., invariant 7 "working agent with no hook" → transition to idle)
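The capture path could be shaped like the sketch below. The `InvariantViolation` type and counter name are assumptions based on the issue text; in production the returned payload would feed `Sentry.captureException(error, { tags })`:

```typescript
// Assumed violation shape: invariant number, message, and owning town.
interface InvariantViolation { invariant: number; message: string; townId: string }

interface CapturePayload { error: Error; tags: Record<string, string> }

// Counter metric Grafana can alert on (threshold > 0).
let invariantViolationsTotal = 0;

function buildViolationCapture(v: InvariantViolation): CapturePayload {
  invariantViolationsTotal += 1;
  // Structured context travels as tags so Sentry can group and filter.
  return {
    error: new Error(`invariant ${v.invariant} violated: ${v.message}`),
    tags: { townId: v.townId, invariant: String(v.invariant) },
  };
}
```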
### 5. Filter container_status events to reduce noise

The alarm pre-phase currently inserts a `container_status` event for every working/stalled agent on every tick (every 5s). Most return `running`, which is a no-op in `applyEvent`. This creates ~720 events/hour per working agent.

Fix: only insert `container_status` events when the status is NOT `running`, or when it differs from the last-observed status. Track the last-observed status on `agent_metadata` or in DO transient storage.
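The filter condition is small enough to sketch directly. The `Map` here stands in for `agent_metadata` or DO transient storage (an assumption), and the status union is illustrative:

```typescript
type ContainerStatus = "running" | "stalled" | "exited";

// Decide whether the alarm pre-phase should insert a container_status event,
// updating the last-observed record as a side effect.
function shouldEmitStatusEvent(
  lastObserved: Map<string, ContainerStatus>,
  agentId: string,
  status: ContainerStatus,
): boolean {
  const previous = lastObserved.get(agentId);
  lastObserved.set(agentId, status); // record for the next tick
  // Emit when abnormal, or when the status changed since last observation.
  return status !== "running" || previous !== status;
}
```

With this check, a steadily `running` agent produces one event on first observation and then goes quiet, eliminating the ~720 no-op events/hour.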
## Verification
## References