feat(gastown): add observability infrastructure #1075

Merged
jrf0110 merged 13 commits into main from 228-observability
Mar 14, 2026

jrf0110 commented Mar 12, 2026

Summary

  • Adds Sentry integration to the Gastown worker for structured error tracking and tracing
  • Implements structured container process logging (logger.ts) with key lifecycle events (agent start/stop/crash, container init/eviction)
  • Broadcasts bead events and convoy progress updates over the status WebSocket for real-time dashboard consumption
  • Adds alarm-based alerting checks for review queue depth, escalation rate spikes, and agent restart loops
  • Instruments agent lifecycle events via Cloudflare Analytics Engine binding

Closes #228

Verification

  • Code review of diff against fork branch — all patches applied cleanly to origin/main
  • No automated tests were run (observability instrumentation is infrastructure wiring)

Visual Changes

N/A

Reviewer Notes

  • The Analytics Engine binding (GASTOWN_AE) must be configured in the Cloudflare dashboard/wrangler config for metrics to flow
  • Alerting thresholds (review queue depth, escalation rate, restart loop count) are hardcoded in the alarm handler — may want to make these configurable via town config in a follow-up
  • The Sentry DSN is pulled from environment; ensure it's set in production secrets
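
As a rough illustration of the first note, the sketch below shows how a lifecycle event might be written through the `GASTOWN_AE` binding and skipped when the binding is absent. `writeDataPoint` with `blobs`, `doubles`, and `indexes` is the Analytics Engine binding call; the `LifecycleEvent` shape, the blob order, and the choice of `townId` as the index are assumptions for illustration, not the PR's actual layout.

```typescript
// Structural subset of the Analytics Engine binding. In a real Worker the type
// is AnalyticsEngineDataset from @cloudflare/workers-types.
interface AnalyticsDataset {
  writeDataPoint(point: { blobs?: string[]; doubles?: number[]; indexes?: string[] }): void;
}

// Hypothetical event shape; the PR's real field list and blob order may differ.
interface LifecycleEvent {
  event: string;
  userId: string;
  townId: string;
  durationMs: number;
}

// Returns true when a data point was written, false when it was skipped.
function writeEvent(dataset: AnalyticsDataset | undefined, e: LifecycleEvent): boolean {
  // Without a configured binding, skip quietly so metrics never break a request.
  if (!dataset) return false;
  try {
    dataset.writeDataPoint({
      blobs: [e.event, e.userId, e.townId],
      doubles: [e.durationMs],
      indexes: [e.townId], // Analytics Engine accepts a single index value
    });
    return true;
  } catch {
    // Instrumentation failures are swallowed on purpose in this sketch.
    return false;
  }
}
```

In a Worker this would be called as `writeEvent(env.GASTOWN_AE, {...})`, so an unconfigured binding degrades to a no-op rather than an exception.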

Comment thread cloudflare-gastown/worker-configuration.d.ts Outdated
Comment thread cloudflare-gastown/src/ui/dashboard.ui.ts
Comment thread cloudflare-gastown/src/dos/Town.do.ts Outdated

kilo-code-bot Bot commented Mar 13, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (2 files)
  • cloudflare-gastown/.gitignore
  • cloudflare-gastown/container/src/logger.ts

Reviewed by gpt-5.4-20260305 · 365,039 tokens

jrf0110 added 2 commits March 13, 2026 13:25
…g, alerting, and usage metrics

Adds Sentry integration to the Gastown worker, structured container process
logging, bead/convoy event broadcasting over WebSocket, alarm-based alerting
for review queue depth, escalation rate, and agent restart loops, and
Analytics Engine instrumentation for lifecycle events.

Closes #228
…eference directive

- Regenerate worker-configuration.d.ts with --include-runtime=false
- Install @cloudflare/workers-types and add to tsconfig types array
- Update types script to use --include-runtime=false flag
- Preserve GitTokenService RPC types as manual override
jrf0110 force-pushed the 228-observability branch from fb72f5c to 08e09d0 on March 13, 2026 18:25
Comment thread cloudflare-gastown/src/util/analytics.util.ts
Comment thread cloudflare-gastown/src/ui/dashboard.ui.ts
Comment thread cloudflare-gastown/src/dos/Town.do.ts Outdated
Comment thread cloudflare-gastown/src/util/analytics.util.ts Outdated
Comment thread cloudflare-gastown/src/util/analytics.util.ts Outdated
- Add userId to all analytics events via cached owner_user_id
- Remove meta/alerting events (queue_depth_alert, rate_spike, restart_loop)
  and their check methods — alerting belongs upstream
- Remove container.cold_start/oom from event union (needs TownContainerDO
  refactoring, deferred to follow-up)
- Implement all previously-declared-but-unimplemented event emissions:
  bead.status_changed, escalation.acknowledged, nudge.queued, nudge.delivered
- Fix dashboard status WebSocket to reconnect when town ID changes
- Remove eager connectStatusWs() on page load (was connecting to
  random placeholder town ID)
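
The two WebSocket fixes in this commit (reconnect when the town ID changes, never connect to a placeholder ID) can be sketched as a small connection manager. This is a minimal illustration, not the dashboard's code: `StatusSocketManager`, `SocketLike`, and the injected `open` factory are hypothetical names.

```typescript
// Structural subset of the browser WebSocket that this sketch needs.
interface SocketLike {
  close(): void;
}

class StatusSocketManager {
  private townId: string | null = null;
  private socket: SocketLike | null = null;
  private readonly open: (townId: string) => SocketLike;

  // `open` is injected so the sketch does not depend on a browser global.
  constructor(open: (townId: string) => SocketLike) {
    this.open = open;
  }

  // Connect to `townId`, replacing any socket that was opened for another town.
  connect(townId: string): void {
    if (!townId) return; // never connect with an empty or placeholder ID
    if (townId === this.townId && this.socket) return; // already on this town
    if (this.socket) this.socket.close();
    this.socket = this.open(townId);
    this.townId = townId;
  }
}
```

In the browser the factory would wrap `new WebSocket(url)` for whatever status endpoint the dashboard uses; calling `connect()` from the town-ID change handler then covers both the first real ID and later switches.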
Comment thread cloudflare-gastown/src/dos/Town.do.ts
Comment thread cloudflare-gastown/src/ui/dashboard.ui.ts
- Enable upload_source_maps and version_metadata in wrangler.jsonc
- Add sentry-cli sourcemap upload to deploy:prod via postdeploy hook
- Pass CF_VERSION_METADATA.id as Sentry release for stack trace linking
- Remove empty SENTRY_DSN var (now a worker secret set via dashboard)
- Install @sentry/cli as devDependency
Comment thread cloudflare-gastown/src/dos/Town.do.ts
Comment thread cloudflare-gastown/src/ui/dashboard.ui.ts
…alytics

- Add delivery (http/trpc/internal), route, and error fields to events
- Add timing middleware to capture high-res request start timestamp
- Add instrumented() wrapper applied to all 81 HTTP route handlers
- Add tRPC analytics middleware on base procedure (wraps all 36 procedures)
- Capture all errors to Sentry in both HTTP and tRPC layers
- Tag DO-internal events with delivery: 'internal'
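
A wrapper like `instrumented()` might look roughly like the sketch below: it times the handler, emits one event with `delivery`, `route`, and `error` fields, and rethrows so that error reporting stays in one top-level place. The signature, the `RequestEvent` shape, and the injected `emit` and `now` parameters are assumptions; the PR's real wrapper is not shown in this conversation.

```typescript
type Delivery = 'http' | 'trpc' | 'internal';

// Hypothetical event record; field names follow the commit message above.
interface RequestEvent {
  event: string;
  delivery: Delivery;
  route: string;
  durationMs: number;
  error?: string;
}

type Handler<Req, Res> = (req: Req) => Promise<Res>;

// Wraps a handler so that every call emits exactly one timing event.
function instrumented<Req, Res>(
  event: string,
  route: string,
  handler: Handler<Req, Res>,
  emit: (e: RequestEvent) => void,
  now: () => number = () => Date.now(),
): Handler<Req, Res> {
  return async (req: Req): Promise<Res> => {
    const start = now();
    try {
      const res = await handler(req);
      emit({ event, delivery: 'http', route, durationMs: now() - start });
      return res;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      emit({ event, delivery: 'http', route, durationMs: now() - start, error: message });
      throw err; // rethrow: the caller's error handler decides how to report it
    }
  };
}
```

Keeping the wrapper free of any direct error-tracker call means an exception is reported once by the outer handler instead of once per layer.
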
Comment thread cloudflare-gastown/container/src/process-manager.ts
Comment thread cloudflare-gastown/src/middleware/analytics.middleware.ts Outdated
Comment thread cloudflare-gastown/src/trpc/init.ts Outdated
The beads table on pre-existing DOs still had the old CHECK constraint
`status in ('open', 'in_progress', 'closed', 'failed')` which rejects
the newer 'in_review' status, causing SQLITE_CONSTRAINT errors in
handleAgentDone.

- Remove all CHECK constraints from all table definitions (Zod validates
  at the application layer)
- Add dropCheckConstraints() migration that detects tables with CHECK
  constraints via sqlite_master and recreates them without constraints
- Migration is idempotent and includes rollback on failure
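
The core of such a migration is rewriting the `CREATE TABLE` text read from `sqlite_master`. The sketch below strips column-level `CHECK (...)` clauses by counting parentheses rather than using a single regex, so a nested list such as `check(status in (...))` is removed whole. `stripCheckConstraints` is a hypothetical helper, not the PR's `dropCheckConstraints()`; it assumes column-level constraints only (a table-level `, CHECK (...)` would leave a dangling comma) and does not cover the table recreation or rollback steps.

```typescript
// Removes every column-level CHECK (...) clause from a CREATE TABLE statement.
function stripCheckConstraints(createSql: string): string {
  const keyword = /\bcheck\s*\(/gi;
  let out = '';
  let pos = 0;
  let match: RegExpExecArray | null;
  while ((match = keyword.exec(createSql)) !== null) {
    let i = match.index + match[0].length; // one past the opening paren
    let depth = 1;
    let inString = false;
    // Walk forward until the paren that opened the CHECK expression is closed.
    while (i < createSql.length && depth > 0) {
      const ch = createSql[i];
      if (ch === "'") inString = !inString; // an escaped '' toggles twice
      else if (!inString && ch === '(') depth += 1;
      else if (!inString && ch === ')') depth -= 1;
      i += 1;
    }
    if (depth !== 0) break; // unbalanced input: leave the remainder untouched
    // Keep the text before the clause, minus the whitespace that preceded it.
    out += createSql.slice(pos, match.index).replace(/\s+$/, '');
    pos = i;
    keyword.lastIndex = i;
  }
  return out + createSql.slice(pos);
}
```

Running the function on its own output returns the text unchanged, which is the property that lets a migration built on it be re-run safely.
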
Comment thread cloudflare-gastown/package.json
Comment thread cloudflare-gastown/src/dos/town/beads.ts Outdated
Comment thread cloudflare-gastown/src/middleware/analytics.middleware.ts Outdated
jrf0110 added 2 commits March 13, 2026 18:50
- Add API route that proxies SQL queries to CF Analytics Engine
  (overview, events timeseries, error rates, top users, latency, delivery)
- Add React hooks for each query type with 1-minute auto-refresh
- Build dashboard page with:
  - Overview KPI cards (total events, unique users, avg latency, error rate)
  - Stacked area chart: events over time (top 15 by volume)
  - Stacked bar chart: delivery breakdown (HTTP/tRPC/internal) over time
  - Horizontal bar chart: success vs error rates by event with error % line
  - Latency table: avg response time by event and delivery type
  - Top users table: most active users with links to admin panel
- Configurable time window (1h to 30d) via dropdown
- Requires CF_ANALYTICS_ENGINE_TOKEN env var for API access
23 panels across 7 sections:
- Overview: total events, unique users, avg latency, error rate stats
- Throughput: RPS by delivery, event volume stacked bars, top events
- Errors: error count over time, error rate by delivery, error
  counts table, top error messages table
- Latency: avg latency by delivery, avg latency by top events,
  slowest endpoints table with route-level detail
- Users & Accounts: active users/towns over time, top users by
  event count, top users by error count
- Domain Breakdown: delivery type pie, top events pie, all events
  summary table with success/error/latency
- Internal DO Events: bead lifecycle, agent/review/convoy events

Uses $timeSeries, $timeFilter, $interval_s Grafana macros for the
cloudflare-analytics-engine datasource plugin.
Comment thread cloudflare-gastown/src/dos/town/beads.ts
…source plugin

All panels now have the required target properties:
- dateTimeType: DATETIME
- dateTimeColDataType: timestamp
- editorMode: sql
- table: gastown_events
- query field set (not just rawSql)
- datasource type: vertamedia-clickhouse-datasource
- $interval_s replaced with $interval
Comment thread src/app/admin/api/gastown-analytics/route.ts
Comment thread src/app/admin/gastown/page.tsx Outdated
CF Analytics Engine doesn't support IN (SELECT ...) subqueries.
Removed the top-N subquery filter from:
- Grafana panels 7 (Event Volume top events) and 13 (Avg Latency top events)
- Admin API events-timeseries query

The queries now return all events grouped by time — users can
toggle individual series via the Grafana legend.
Comment thread src/app/admin/api/gastown-analytics/route.ts
Comment thread cloudflare-gastown/src/gastown.worker.ts
Comment thread cloudflare-gastown/gastown-grafana-dash-1.json Outdated
Comment thread cloudflare-gastown/gastown-grafana-dash-1.json Outdated
Comment thread cloudflare-gastown/container/src/process-manager.ts
Comment thread cloudflare-gastown/src/util/analytics.util.ts
- Fix Sentry double-capture: remove captureException from instrumented()
  and tRPC analytics middleware; keep single capture in app.onError() and
  trpcServer onError (guarded to skip TRPCErrors)
- Fix CHECK constraint regex: handle nested parens in check(col in (...))
- Fix sourcemap release mismatch: inject SENTRY_RELEASE via --var at
  deploy time using sentry-cli propose-version (git SHA)
- Fix agent-auth userId: fall back to agentJWT.userId when kiloUserId
  is unset (agent-authenticated routes)
- Fix status WebSocket: connect on initial page load, not just on change
- Fix SDK session leak: decrement sessionCount when agent completes
  normally via session.idle
- Fix index loss: reorder initBeadTables to run dropCheckConstraints
  before index creation
- Fix lint: use type narrowing instead of String() for sqlite_master rows
- Fix format: prettier on container/src/logger.ts
- Fix Grafana: convert Total Events and Unique Users to time series
  stat panels with correct field selectors
jrf0110 force-pushed the 228-observability branch from e401786 to b53a6c7 on March 14, 2026 02:35
Comment thread cloudflare-gastown/src/middleware/analytics.middleware.ts
Comment thread src/app/admin/api/gastown-analytics/route.ts Outdated
Comment thread src/app/admin/api/gastown-analytics/route.ts Outdated
- Fix SDK session leak on stream errors (process-manager catch path)
- Add convoyId/role/beadType to Analytics Engine blobs (blob11-13)
- Fix error-rate line plotted on count axis — add secondary X axis
- Add client-side top-15 filtering to EventsTimeseriesChart
- Add LIMIT 500 to unbounded Grafana time series panels (7, 13)
- Update Grafana panel titles to reflect actual behavior
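
Since the query now returns every event, the top-15 cut has to happen after the data arrives. A minimal sketch of that client-side filter follows; `SeriesPoint` and `topEvents` are hypothetical names, and the real `EventsTimeseriesChart` may shape its rows differently.

```typescript
// One row of the time series: an event's count within one time bucket.
interface SeriesPoint {
  bucket: string;
  event: string;
  count: number;
}

// Keeps only the rows whose event is among the `limit` largest by total count.
function topEvents(points: SeriesPoint[], limit = 15): SeriesPoint[] {
  const totals = new Map<string, number>();
  for (const p of points) {
    totals.set(p.event, (totals.get(p.event) ?? 0) + p.count);
  }
  const keep = new Set(
    Array.from(totals.entries())
      // Sort by total descending; break ties by name so the result is stable.
      .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
      .slice(0, limit)
      .map((entry) => entry[0]),
  );
  return points.filter((p) => keep.has(p.event));
}
```

Ranking by the total over the whole window, rather than per bucket, keeps the same set of series across the chart.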

process.on('uncaughtException', err => {
  log.error('container.uncaught_exception', { error: err.message, stack: err.stack });
  process.exit(1);

WARNING: Immediate exit can drop the crash log

process.exit(1) terminates the process synchronously, so the container.uncaught_exception entry emitted just above is not guaranteed to flush to stderr/Workers Logs first. That makes the only structured signal for this failure mode easy to lose during postmortems.
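
One common way to address this is to exit from the write callback, with a timeout as a safety net. The sketch below is an assumption about how that could look, not the fix that was merged: `logAndExit` and `LineSink` are hypothetical, and the sink and exit function are injected so the logic can be exercised without ending the process.

```typescript
// Same shape as Node's stream.Writable#write(chunk, callback).
interface LineSink {
  write(chunk: string, callback: () => void): unknown;
}

// Writes one JSON log line, then exits once the write has been handed off.
function logAndExit(
  sink: LineSink,
  exit: (code: number) => void,
  entry: Record<string, unknown>,
  timeoutMs = 1000,
): void {
  let exited = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const finish = (): void => {
    if (exited) return; // guard: the callback and the timer may both fire
    exited = true;
    if (timer !== undefined) clearTimeout(timer);
    exit(1);
  };
  // Safety net: never hang forever if the write callback is not invoked.
  timer = setTimeout(finish, timeoutMs);
  sink.write(JSON.stringify(entry) + '\n', finish);
}

// Possible wiring in the container (assumes Node's process global):
// process.on('uncaughtException', (err) =>
//   logAndExit(process.stderr, process.exit, { msg: 'container.uncaught_exception', error: err.message }));
```

Whether the existing `log.error` call can expose a completion callback depends on the logger, so the wiring above bypasses it and writes directly to stderr.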

…ting

- Fix deriveHttpEventName: distinguish list vs get by checking if route
  ends with a param segment; keep 'mayor' as a meaningful segment
- Fix overview avg_latency_ms: only average http/trpc events (skip
  zero-duration internal events)
- Fix top-users avg_latency_ms: same conditional filtering
- Format gastown-grafana-dash-1.json with prettier
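
The list-versus-get rule in the first bullet can be illustrated as follows. This is a guess at the idea only: the `resource.action` naming, the `:param` route syntax, and the verb mapping are assumptions, and the special handling of the `mayor` segment is not modelled.

```typescript
// Derives an event name such as "beads.list" from an HTTP method and route pattern.
function deriveHttpEventName(method: string, route: string): string {
  const segments = route.split('/').filter((s) => s.length > 0);
  const isParam = (s: string): boolean => s.charAt(0) === ':';
  const last = segments[segments.length - 1] ?? '';
  const endsWithParam = isParam(last);
  // Resource is the last non-parameter segment, for example "beads".
  const statics = segments.filter((s) => !isParam(s));
  const resource = statics[statics.length - 1] ?? 'root';
  const verbs: Record<string, string> = {
    POST: 'create',
    PUT: 'update',
    PATCH: 'update',
    DELETE: 'delete',
  };
  const m = method.toUpperCase();
  // A GET on a collection is a list; a GET ending in a param targets one item.
  const action = m === 'GET' ? (endsWithParam ? 'get' : 'list') : (verbs[m] ?? m.toLowerCase());
  return `${resource}.${action}`;
}
```

For example, `GET /towns/:townId/beads` maps to `beads.list`, while `GET /towns/:townId/beads/:beadId` maps to `beads.get`.
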
jrf0110 merged commit 8f22340 into main on Mar 14, 2026
18 checks passed
jrf0110 deleted the 228-observability branch on March 14, 2026 19:32

Development

Successfully merging this pull request may close these issues.

[Gastown] PR 22: Observability

2 participants