feat(gastown): add observability infrastructure #1075
Merged
jrf0110
commented
Mar 13, 2026
jeanduplessis
approved these changes
Mar 13, 2026
Contributor
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Files Reviewed (2 files)
Reviewed by gpt-5.4-20260305 · 365,039 tokens
…g, alerting, and usage metrics
Adds Sentry integration to the Gastown worker, structured container process logging, bead/convoy event broadcasting over WebSocket, alarm-based alerting for review queue depth, escalation rate, and agent restart loops, and Analytics Engine instrumentation for lifecycle events.
Closes #228
…eference directive
- Regenerate worker-configuration.d.ts with --include-runtime=false
- Install @cloudflare/workers-types and add to tsconfig types array
- Update types script to use --include-runtime=false flag
- Preserve GitTokenService RPC types as manual override
fb72f5c to
08e09d0
- Add userId to all analytics events via cached owner_user_id
- Remove meta/alerting events (queue_depth_alert, rate_spike, restart_loop) and their check methods; alerting belongs upstream
- Remove container.cold_start/oom from event union (needs TownContainerDO refactoring, deferred to follow-up)
- Implement all previously-declared-but-unimplemented event emissions: bead.status_changed, escalation.acknowledged, nudge.queued, nudge.delivered
- Fix dashboard status WebSocket to reconnect when town ID changes
- Remove eager connectStatusWs() on page load (was connecting to random placeholder town ID)
- Enable upload_source_maps and version_metadata in wrangler.jsonc
- Add sentry-cli sourcemap upload to deploy:prod via postdeploy hook
- Pass CF_VERSION_METADATA.id as Sentry release for stack trace linking
- Remove empty SENTRY_DSN var (now a worker secret set via dashboard)
- Install @sentry/cli as devDependency
…alytics
- Add delivery (http/trpc/internal), route, and error fields to events
- Add timing middleware to capture high-res request start timestamp
- Add instrumented() wrapper applied to all 81 HTTP route handlers
- Add tRPC analytics middleware on base procedure (wraps all 36 procedures)
- Capture all errors to Sentry in both HTTP and tRPC layers
- Tag DO-internal events with delivery: 'internal'
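For readers who have not opened the diff, the wrapper idea in this commit can be sketched roughly as follows. This is not the PR's actual `instrumented()` implementation: the `AnalyticsEvent` shape, the `emit` callback, and the injectable `now` clock are all assumptions made for illustration, and `Date.now` stands in for the high-resolution timestamp the commit mentions.

```typescript
// Hypothetical event shape; the real PR writes to Analytics Engine blobs/doubles.
type Delivery = "http" | "trpc" | "internal";

interface AnalyticsEvent {
  event: string;
  delivery: Delivery;
  route: string;
  durationMs: number;
  error?: string;
}

type Emit = (e: AnalyticsEvent) => void;

// Wraps an async handler so every call emits one event with its duration,
// plus the error message when the handler throws.
function instrumented<A extends unknown[], R>(
  meta: { event: string; delivery: Delivery; route: string },
  emit: Emit,
  handler: (...args: A) => Promise<R>,
  now: () => number = () => Date.now(), // assumed clock; injectable for tests
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const start = now();
    try {
      const result = await handler(...args);
      emit({ ...meta, durationMs: now() - start });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      emit({ ...meta, durationMs: now() - start, error: message });
      // Rethrow so a single top-level error handler can report the failure.
      throw err;
    }
  };
}
```

Rethrowing instead of reporting inside the wrapper keeps error capture in one place, which is the direction a later commit in this PR takes when it removes the duplicate Sentry capture.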
The beads table on pre-existing DOs still had the old CHECK constraint
`status in ('open', 'in_progress', 'closed', 'failed')` which rejects
the newer 'in_review' status, causing SQLITE_CONSTRAINT errors in
handleAgentDone.
- Remove all CHECK constraints from all table definitions (Zod validates
at the application layer)
- Add dropCheckConstraints() migration that detects tables with CHECK
constraints via sqlite_master and recreates them without constraints
- Migration is idempotent and includes rollback on failure
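The constraint-stripping step of this migration can be illustrated with a small string transform. This is a sketch only, not the PR's `dropCheckConstraints()`: the function name `stripCheckConstraints` is hypothetical, it operates on the `CREATE TABLE` text such as the `sql` column that `sqlite_master` stores, and it only targets column-level `CHECK (...)` clauses. It matches parentheses by depth rather than with a single regex so that nested forms like `check(status in (...))` are removed whole.

```typescript
// Removes column-level CHECK (...) clauses from a CREATE TABLE statement.
// Assumption: table-level constraints (", CHECK (...)") and identifiers
// literally named "check" are out of scope for this sketch.
function stripCheckConstraints(createSql: string): string {
  const keyword = /\bcheck\s*\(/gi;
  let out = "";
  let cursor = 0;
  let match: RegExpExecArray | null;

  while ((match = keyword.exec(createSql)) !== null) {
    const open = match.index + match[0].length - 1; // index of the "("
    let depth = 0;
    let inString = false;
    let end = -1;

    for (let i = open; i < createSql.length; i++) {
      const ch = createSql[i];
      if (ch === "'") {
        // A doubled quote ('') toggles twice, so the state is unchanged.
        inString = !inString;
      } else if (!inString && ch === "(") {
        depth++;
      } else if (!inString && ch === ")" && --depth === 0) {
        end = i;
        break;
      }
    }

    if (end === -1) break; // unbalanced input: leave the remainder untouched

    // Copy everything before the clause, dropping the whitespace that preceded it.
    out += createSql.slice(cursor, match.index).replace(/\s+$/, "");
    cursor = end + 1;
    keyword.lastIndex = cursor;
  }

  return out + createSql.slice(cursor);
}
```

A real migration would still need to create the replacement table from the rewritten statement, copy the rows, swap the tables, and recreate indexes afterwards; none of that is shown here.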
- Add API route that proxies SQL queries to CF Analytics Engine (overview, events timeseries, error rates, top users, latency, delivery)
- Add React hooks for each query type with 1-minute auto-refresh
- Build dashboard page with:
  - Overview KPI cards (total events, unique users, avg latency, error rate)
  - Stacked area chart: events over time (top 15 by volume)
  - Stacked bar chart: delivery breakdown (HTTP/tRPC/internal) over time
  - Horizontal bar chart: success vs error rates by event with error % line
  - Latency table: avg response time by event and delivery type
  - Top users table: most active users with links to admin panel
- Configurable time window (1h to 30d) via dropdown
- Requires CF_ANALYTICS_ENGINE_TOKEN env var for API access
23 panels across 6 sections:
- Overview: total events, unique users, avg latency, error rate stats
- Throughput: RPS by delivery, event volume stacked bars, top events
- Errors: error count over time, error rate by delivery, error counts table, top error messages table
- Latency: avg latency by delivery, avg latency by top events, slowest endpoints table with route-level detail
- Users & Accounts: active users/towns over time, top users by event count, top users by error count
- Domain Breakdown: delivery type pie, top events pie, all events summary table with success/error/latency
- Internal DO Events: bead lifecycle, agent/review/convoy events

Uses $timeSeries, $timeFilter, $interval_s Grafana macros for the cloudflare-analytics-engine datasource plugin.
…source plugin
All panels now have the required target properties:
- dateTimeType: DATETIME
- dateTimeColDataType: timestamp
- editorMode: sql
- table: gastown_events
- query field set (not just rawSql)
- datasource type: vertamedia-clickhouse-datasource
- $interval_s replaced with $interval
CF Analytics Engine doesn't support IN (SELECT ...) subqueries. Removed the top-N subquery filter from:
- Grafana panels 7 (Event Volume top events) and 13 (Avg Latency top events)
- Admin API events-timeseries query

The queries now return all events grouped by time; users can toggle individual series via the Grafana legend.
- Fix Sentry double-capture: remove captureException from instrumented() and tRPC analytics middleware; keep single capture in app.onError() and trpcServer onError (guarded to skip TRPCErrors)
- Fix CHECK constraint regex: handle nested parens in check(col in (...))
- Fix sourcemap release mismatch: inject SENTRY_RELEASE via --var at deploy time using sentry-cli propose-version (git SHA)
- Fix agent-auth userId: fall back to agentJWT.userId when kiloUserId is unset (agent-authenticated routes)
- Fix status WebSocket: connect on initial page load, not just on change
- Fix SDK session leak: decrement sessionCount when agent completes normally via session.idle
- Fix index loss: reorder initBeadTables to run dropCheckConstraints before index creation
- Fix lint: use type narrowing instead of String() for sqlite_master rows
- Fix format: prettier on container/src/logger.ts
- Fix Grafana: convert Total Events and Unique Users to time series stat panels with correct field selectors
e401786 to
b53a6c7
- Fix SDK session leak on stream errors (process-manager catch path)
- Add convoyId/role/beadType to Analytics Engine blobs (blob11-13)
- Fix error-rate line plotted on count axis: add secondary X axis
- Add client-side top-15 filtering to EventsTimeseriesChart
- Add LIMIT 500 to unbounded Grafana time series panels (7, 13)
- Update Grafana panel titles to reflect actual behavior
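Because the Analytics Engine query can no longer restrict itself to the top events, the filtering moves to the client. A minimal sketch of that kind of top-N filter is shown below; the `TimeseriesRow` shape and the `keepTopEvents` name are assumptions for illustration and may not match what EventsTimeseriesChart actually does.

```typescript
// Hypothetical row shape returned by an events-timeseries query.
interface TimeseriesRow {
  bucket: string; // time bucket label
  event: string;
  count: number;
}

// Keeps only rows whose event is among the n events with the highest total
// count across all buckets; relative row order is preserved.
function keepTopEvents(rows: TimeseriesRow[], n = 15): TimeseriesRow[] {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.event, (totals.get(row.event) ?? 0) + row.count);
  }

  const top = new Set(
    Array.from(totals.entries())
      .sort((a, b) => b[1] - a[1]) // highest total first
      .slice(0, n)
      .map(([event]) => event),
  );

  return rows.filter((row) => top.has(row.event));
}
```

Ranking by total volume over the whole window, rather than per bucket, keeps the set of plotted series stable across the chart.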
process.on('uncaughtException', err => {
  log.error('container.uncaught_exception', { error: err.message, stack: err.stack });
  process.exit(1);
Contributor
WARNING: Immediate exit can drop the crash log
process.exit(1) terminates the process synchronously, so the container.uncaught_exception entry emitted just above is not guaranteed to flush to stderr/Workers Logs first. That makes the only structured signal for this failure mode easy to lose during postmortems.
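One possible way to address this warning is to exit from the write callback instead of immediately after logging. The sketch below assumes the log entry can be written directly to a stream with a Node-style `write(chunk, callback)` method, which `process.stderr` provides; the `LineSink` interface, the `logThenExit` name, and the timeout fallback are hypothetical and not part of the PR's `log` helper.

```typescript
// Minimal writable shape; Node's process.stderr satisfies it.
interface LineSink {
  write(chunk: string, callback?: (err?: Error | null) => void): boolean;
}

// Writes one JSON log line and exits only after the write has been handed
// off, with a timeout so a stalled stream cannot keep the process alive.
function logThenExit(
  sink: LineSink,
  exit: (code: number) => void,
  entry: Record<string, unknown>,
  code = 1,
  timeoutMs = 1000, // assumed fallback budget
): void {
  let exited = false;
  const exitOnce = (): void => {
    if (!exited) {
      exited = true;
      exit(code);
    }
  };

  // Fallback in case the write callback never fires.
  const timer = setTimeout(exitOnce, timeoutMs);

  sink.write(JSON.stringify(entry) + "\n", () => {
    clearTimeout(timer);
    exitOnce();
  });
}

// Possible wiring inside the handler (not executed here):
//   process.on('uncaughtException', err =>
//     logThenExit(process.stderr, c => process.exit(c),
//       { msg: 'container.uncaught_exception', error: err.message, stack: err.stack }));
```

Whether this is needed depends on how the container's stderr is attached, since Node documents some stdio destinations as synchronous; the callback approach is simply the conservative option.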
…ting
- Fix deriveHttpEventName: distinguish list vs get by checking if route ends with a param segment; keep 'mayor' as a meaningful segment
- Fix overview avg_latency_ms: only average http/trpc events (skip zero-duration internal events)
- Fix top-users avg_latency_ms: same conditional filtering
- Format gastown-grafana-dash-1.json with prettier
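The list-versus-get rule in the first bullet can be illustrated with a short sketch. This is not the PR's deriveHttpEventName: the `resource.action` naming scheme, the `:param` route syntax, and the example routes are assumptions, and the special handling of the 'mayor' segment is not modeled.

```typescript
// Derives a hypothetical "<resource>.<action>" event name from a route pattern.
// A GET whose last segment is a parameter (":id") is treated as "get";
// a GET ending in a literal segment is treated as "list".
function deriveEventNameSketch(method: string, route: string): string {
  const segments = route.split("/").filter((s) => s.length > 0);
  const isParam = (s: string): boolean => s.startsWith(":");

  const last = segments[segments.length - 1];
  const endsWithParam = last !== undefined && isParam(last);

  // The resource is the last literal (non-parameter) segment.
  const literals = segments.filter((s) => !isParam(s));
  const resource = literals[literals.length - 1] ?? "root";

  let action: string;
  switch (method.toUpperCase()) {
    case "GET":
      action = endsWithParam ? "get" : "list";
      break;
    case "POST":
      action = "create";
      break;
    case "PUT":
    case "PATCH":
      action = "update";
      break;
    case "DELETE":
      action = "delete";
      break;
    default:
      action = method.toLowerCase();
  }

  return `${resource}.${action}`;
}
```

Under these assumptions, `GET /api/towns/:townId/beads` maps to a list event and `GET /api/towns/:townId/beads/:beadId` maps to a get event on the same resource.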
Summary
logger.ts) with key lifecycle events (agent start/stop/crash, container init/eviction)
Closes #228
Verification
origin/main
Visual Changes
N/A
Reviewer Notes
GASTOWN_AE) must be configured in the Cloudflare dashboard/wrangler config for metrics to flow