docs/ARCHITECTURE.md
# CASCADE Architecture

CASCADE is a PM-to-Code automation platform that connects project management tools (Trello, JIRA), source control (GitHub), and monitoring (Sentry) to AI-powered agents that autonomously implement features, review PRs, debug failures, and manage backlogs. Webhooks from external providers flow through a router, get queued in Redis, and are processed by ephemeral worker containers that run agents against cloned repositories.

> **Relationship to CLAUDE.md**: `CLAUDE.md` is the operational reference (commands, env vars, how-to). This document and its deep-dives cover the *system design* — how components fit together and why.

## System Overview

```mermaid
graph TB
subgraph External["External Providers"]
Trello
JIRA
GitHub
Sentry
end

subgraph CASCADE["CASCADE Platform"]
Router["Router :3000<br/>Webhook receiver"]
Redis[(Redis / BullMQ)]
Worker["Worker containers<br/>One job per container"]
Dashboard["Dashboard :3001<br/>API + tRPC"]
DB[(PostgreSQL)]
end

subgraph Clients
WebUI["Dashboard UI"]
CLI["cascade CLI"]
end

Trello -->|webhook| Router
JIRA -->|webhook| Router
GitHub -->|webhook| Router
Sentry -->|webhook| Router

Router -->|enqueue job| Redis
Redis -->|dequeue job| Worker

Worker -->|PRs, comments| GitHub
Worker -->|status updates| Trello
Worker -->|status updates| JIRA

Router <--> DB
Worker <--> DB
Dashboard <--> DB
Dashboard <--> Redis

WebUI <--> Dashboard
CLI <--> Dashboard
```

See also: [`docs/architecture.d2`](architecture.d2) for the D2 source diagram.

## Service Topology

| Service | Entry Point | Default Port | Responsibility |
|---------|-------------|-------------|----------------|
| **Router** | `src/router/index.ts` | 3000 | Receive webhooks, verify signatures, run trigger dispatch, enqueue jobs to Redis, manage worker containers |
| **Worker** | `src/worker-entry.ts` | N/A (ephemeral) | Process one job per container — run trigger handlers, execute agents, exit on completion |
| **Dashboard** | `src/dashboard.ts` | 3001 | tRPC API for web UI and CLI, session auth, serve frontend static files in self-hosted mode |

## End-to-End Request Flow

The canonical path from webhook to pull request:

```mermaid
sequenceDiagram
participant P as Provider<br/>(Trello/GitHub/JIRA/Sentry)
participant R as Router
participant Q as Redis/BullMQ
participant W as Worker
participant A as Agent Engine

P->>R: POST /provider/webhook
R->>R: Parse, verify signature, dedup
R->>R: Lookup project, dispatch triggers
R->>R: Check concurrency, post ack comment
R->>Q: Enqueue job
Q->>W: Spawn container with job env vars
W->>W: Bootstrap integrations, dispatch by job type
W->>W: Match trigger, resolve agent definition
W->>A: Execute agent (clone repo, run engine)
A->>A: LLM loop: read, edit, test, commit
A-->>P: Create PR / post comments / update status
W->>W: Finalize run record, cleanup, exit
```

## Architectural Patterns

**Registry pattern** — Integrations, triggers, engines, PM providers, and capabilities all use registries (singleton maps populated at bootstrap). Infrastructure code looks up by key with no provider-specific branching.
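
A minimal sketch of the registry shape described above. The `TriggerHandler` type and the `github:pr-opened` key are illustrative assumptions, not CASCADE's actual definitions:

```typescript
// Hypothetical registry sketch; entry types and keys are illustrative.
type TriggerHandler = { key: string; handle: (event: unknown) => void };

class Registry<T> {
  private entries = new Map<string, T>();

  register(key: string, entry: T): void {
    if (this.entries.has(key)) throw new Error(`duplicate registry key: ${key}`);
    this.entries.set(key, entry);
  }

  get(key: string): T {
    const entry = this.entries.get(key);
    if (!entry) throw new Error(`unknown registry key: ${key}`);
    return entry;
  }

  keys(): string[] {
    return [...this.entries.keys()];
  }
}

// Populated once at bootstrap; infrastructure code only ever calls get(key),
// with no provider-specific branching.
const triggers = new Registry<TriggerHandler>();
triggers.register("github:pr-opened", { key: "github:pr-opened", handle: () => {} });
```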

**Capability-driven tool resolution** — Agent YAML definitions declare required capabilities (`fs:read`, `pm:write`, `scm:pr`). At runtime, capabilities are resolved against available integrations to determine which gadgets (tools) the agent receives.
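
The resolution step can be sketched roughly as follows. The capability keys match the examples above, but the gadget names and the mapping table are assumptions for illustration:

```typescript
// Hypothetical capability-to-gadget mapping; gadget names are illustrative.
type Gadget = { name: string; capability: string };

const gadgetsByCapability: Record<string, Gadget[]> = {
  "fs:read": [{ name: "read_file", capability: "fs:read" }],
  "pm:write": [{ name: "update_card", capability: "pm:write" }],
  "scm:pr": [{ name: "open_pr", capability: "scm:pr" }],
};

// Given the capabilities an agent's YAML declares and the capabilities the
// project's integrations actually provide, return the gadgets the agent gets.
function resolveGadgets(declared: string[], available: Set<string>): Gadget[] {
  return declared
    .filter((cap) => available.has(cap))
    .flatMap((cap) => gadgetsByCapability[cap] ?? []);
}
```

Declared-but-unavailable capabilities simply yield no gadgets, so the same agent definition degrades gracefully across projects with different integrations.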

**Two-tier credential resolution** — In the router and dashboard, credentials are read from the `project_credentials` database table. In workers, the router pre-loads credentials as environment variables to avoid giving workers direct DB access to secrets.

**Dual-persona GitHub model** — Each project uses two GitHub bot accounts (implementer and reviewer) to prevent feedback loops. Agent type determines which persona token is used.

**YAML-based agent definitions** — Agents are defined declaratively in YAML files specifying identity, capabilities, triggers, prompts, and lifecycle hooks. Definitions resolve via three tiers: in-memory cache, database, then YAML files on disk.

**AsyncLocalStorage credential scoping** — Provider clients (GitHub, Trello, JIRA) use Node.js `AsyncLocalStorage` to scope credentials per-request, preventing cross-request credential leakage.
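
`AsyncLocalStorage` is a real Node.js API (`node:async_hooks`); the store shape and helper names below are a minimal sketch, not CASCADE's actual client code:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Illustrative credential store; the Credentials shape is an assumption.
interface Credentials {
  githubToken: string;
}

const credentialStore = new AsyncLocalStorage<Credentials>();

function currentToken(): string {
  const creds = credentialStore.getStore();
  if (!creds) throw new Error("no credentials in scope");
  return creds.githubToken;
}

// Each request runs inside its own scope, so concurrent requests for
// different projects can never observe each other's tokens.
async function handleRequest(creds: Credentials, work: () => Promise<string>): Promise<string> {
  return credentialStore.run(creds, work);
}
```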

## Directory Map

| Directory | Purpose |
|-----------|---------|
| `src/router/` | Webhook receiver, BullMQ producer, worker container management |
| `src/webhook/` | Shared webhook handler factory, parsers, signature verification, logging |
| `src/triggers/` | Event-to-agent routing: TriggerRegistry, TriggerHandler implementations |
| `src/agents/` | Agent definitions (YAML), profiles, capabilities, prompt templates |
| `src/backends/` | LLM execution engines: Claude Code, LLMist, Codex, OpenCode |
| `src/gadgets/` | Tool implementations agents use (file ops, PM, SCM, alerting, shell) |
| `src/integrations/` | Unified integration interfaces, registry, bootstrap |
| `src/pm/` | PM abstraction layer: provider interface, Trello/JIRA adapters, lifecycle |
| `src/github/` | GitHub API client, dual-persona model, PR operations |
| `src/trello/` | Trello API client |
| `src/jira/` | JIRA API client (jira.js wrapper) |
| `src/sentry/` | Sentry API client, alerting integration |
| `src/config/` | Configuration provider, caching, credential resolution, integration roles |
| `src/db/` | Drizzle ORM schema, repositories, migrations |
| `src/api/` | tRPC routers for dashboard API |
| `src/cli/` | Two CLIs: `cascade` (dashboard) and `cascade-tools` (agent tools) |
| `src/utils/` | Logging, repo cloning, lifecycle/watchdog, env scrubbing |
| `src/types/` | Shared TypeScript types |
| `src/queue/` | BullMQ queue helpers |

## Deep-Dive Documents

1. [Services and Deployment](./architecture/01-services.md) — Three-service architecture, startup sequences, container model
2. [Webhook Pipeline](./architecture/02-webhook-pipeline.md) — Handler factory, platform adapters, processing pipeline
3. [Trigger System](./architecture/03-trigger-system.md) — TriggerRegistry, handlers, config resolution, context pipeline
4. [Agent System](./architecture/04-agent-system.md) — YAML definitions, profiles, capabilities, prompts, hooks
5. [Engine Backends](./architecture/05-engine-backends.md) — AgentEngine interface, archetypes, execution adapter
6. [Integration Layer](./architecture/06-integration-layer.md) — IntegrationModule, registry, categories, provider implementations
7. [Gadgets](./architecture/07-gadgets.md) — Capability-to-gadget mapping, built-in tools, cascade-tools CLI
8. [Configuration and Credentials](./architecture/08-config-credentials.md) — Config provider, credential resolution, encryption
9. [Database](./architecture/09-database.md) — Schema, ER diagram, repositories, migrations
10. [Resilience](./architecture/10-resilience.md) — Watchdog, concurrency controls, rate limiting, retry, loop prevention
docs/architecture/01-services.md
# Services and Deployment

CASCADE runs as three independent services. There is no monolithic server mode — each service has a distinct entry point, lifecycle, and scaling model.

```mermaid
graph LR
subgraph Router["Router Container"]
R_Hono["Hono :3000"]
R_BullMQ["BullMQ Producer"]
R_WM["Worker Manager"]
end

subgraph Workers["Worker Containers (ephemeral)"]
W1["Worker 1"]
W2["Worker 2"]
WN["Worker N"]
end

subgraph Dashboard["Dashboard Container"]
D_Hono["Hono :3001"]
D_tRPC["tRPC Router"]
end

Redis[(Redis)]
DB[(PostgreSQL)]

R_Hono --> R_BullMQ --> Redis
R_WM --> Workers
Redis --> R_WM

D_Hono --> D_tRPC
Dashboard <--> DB
Router <--> DB
Workers <--> DB
```

## Router

**Entry point**: `src/router/index.ts`
**Default port**: 3000

The router is the webhook ingestion point. It receives HTTP POST requests from external providers, processes them through a multi-step pipeline, and enqueues jobs to Redis for worker containers.

### Webhook endpoints

| Route | Provider | Notes |
|-------|----------|-------|
| `POST /trello/webhook` | Trello | HEAD/GET returns 200 for Trello's verification |
| `POST /github/webhook` | GitHub | Injects `X-GitHub-Event` header into payload |
| `POST /jira/webhook` | JIRA | HEAD/GET returns 200 for JIRA verification |
| `POST /sentry/webhook/:projectId` | Sentry | Project ID in URL for unambiguous routing |
| `GET /health` | Internal | Queue stats, active worker count |

### Startup sequence

Module-load phase (runs at import time, before `startRouter()`):
1. `registerBuiltInEngines()` — register engine settings schemas (required before any `loadConfig()`)
2. `createTriggerRegistry()` + `registerBuiltInTriggers()` — populate trigger handlers

`startRouter()` async phase:
3. `seedAgentDefinitions()` — sync built-in YAML definitions to database
4. `initAgentMessages()` — load ack message templates
5. `initPrompts()` — load prompt templates
6. `startCancelListener()` — listen for run cancellation requests
7. `startWorkerProcessor()` — begin polling BullMQ for jobs and spawning containers
8. `serve()` — start Hono HTTP server

### Key modules

| File | Purpose |
|------|---------|
| `webhook-processor.ts` | Generic 12-step pipeline (see [02-webhook-pipeline](./02-webhook-pipeline.md)) |
| `platform-adapter.ts` | `RouterPlatformAdapter` interface |
| `adapters/` | Per-provider adapter implementations |
| `worker-manager.ts` | Spawns/monitors Docker worker containers |
| `queue.ts` | BullMQ `addJob()`, queue stats |
| `action-dedup.ts` | In-memory deduplication of webhook deliveries |
| `work-item-lock.ts` | Prevents concurrent agents on the same work item |
| `agent-type-lock.ts` | Agent-type concurrency limits |
| `cancel-listener.ts` | Listens for run cancellation via BullMQ events |
| `webhookVerification.ts` | HMAC signature verification per provider |

## Worker

**Entry point**: `src/worker-entry.ts`
**Port**: None (ephemeral container, no HTTP server)

Workers are stateless, one-job-per-container processes spawned by the router's worker manager. Each worker reads its job from environment variables, processes it, and exits.

### Environment variables

The router passes job data to workers via Docker container env vars:

| Variable | Purpose |
|----------|---------|
| `JOB_ID` | Unique job identifier |
| `JOB_TYPE` | `trello`, `github`, `jira`, `sentry`, `manual-run`, `retry-run`, `debug-analysis` |
| `JOB_DATA` | JSON-encoded job payload |
| `CASCADE_CREDENTIAL_KEYS` | Comma-separated list of credential env var names |
| Individual credential vars | Pre-loaded project credentials (e.g., `GITHUB_TOKEN_IMPLEMENTER`) |
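
A worker's view of the table above can be sketched as a small parser over `process.env`. The `WorkerJob` type and the parsing details are illustrative assumptions:

```typescript
// Hypothetical sketch of reading a job from worker env vars; not actual code.
interface WorkerJob {
  id: string;
  type: string;
  data: unknown;
  credentialKeys: string[];
}

function readJobFromEnv(env: NodeJS.ProcessEnv): WorkerJob {
  const { JOB_ID, JOB_TYPE, JOB_DATA, CASCADE_CREDENTIAL_KEYS } = env;
  if (!JOB_ID || !JOB_TYPE || !JOB_DATA) throw new Error("missing job env vars");
  return {
    id: JOB_ID,
    type: JOB_TYPE,
    data: JSON.parse(JOB_DATA), // JOB_DATA is a JSON-encoded payload
    credentialKeys: CASCADE_CREDENTIAL_KEYS?.split(",").filter(Boolean) ?? [],
  };
}
```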

### Job types

```typescript
type JobData =
| TrelloJobData // Trello webhook payload
| GitHubJobData // GitHub webhook payload
| JiraJobData // JIRA webhook payload
| SentryJobData // Sentry webhook payload
| ManualRunJobData // Dashboard-initiated run
| RetryRunJobData // Retry a failed run
| DebugAnalysisJobData; // Post-mortem debug analysis
```

### Startup sequence

1. `loadEnvConfigSafe()` — load `.cascade/env` if present
2. `getDb()` — eagerly initialize DB connection (caches pool before env scrub)
3. `registerBuiltInEngines()` — register engine settings schemas (before `loadConfig()`)
4. `loadConfig()` — cache project config from database
5. `seedAgentDefinitions()` — sync built-in YAML definitions to database
6. `initAgentMessages()` — load ack message templates
7. `initPrompts()` — load prompt templates
8. `scrubSensitiveEnv()` — remove `DATABASE_URL` and other secrets from `process.env`
9. `createTriggerRegistry()` + `registerBuiltInTriggers()` — populate trigger handlers
10. `dispatchJob()` — route to the appropriate handler based on `JOB_TYPE`

The security scrub in step 8 prevents agent engines (which execute arbitrary LLM-generated commands) from accessing database credentials. Note that trigger registration (step 9) happens after the scrub — it only needs the in-memory config, not the database.
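
The scrub in step 8 amounts to deleting known-sensitive keys from `process.env` before any agent code runs. A minimal sketch, assuming a key list (the actual set of scrubbed variables beyond `DATABASE_URL` is not specified here):

```typescript
// Hypothetical env scrub; the exact key list is an assumption.
const SENSITIVE_KEYS = ["DATABASE_URL", "REDIS_URL"];

// Remove sensitive vars in place; returns the keys actually removed.
function scrubSensitiveEnv(env: NodeJS.ProcessEnv): string[] {
  const removed: string[] = [];
  for (const key of SENSITIVE_KEYS) {
    if (key in env) {
      delete env[key];
      removed.push(key);
    }
  }
  return removed;
}
```

This only works because step 2 caches the DB pool eagerly: the live connection survives the scrub while the credential string does not.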

### Dispatch flow

`dispatchJob()` switches on the job type:
- **Webhook jobs** (`trello`, `github`, `jira`, `sentry`) — call the provider-specific webhook processor, which re-runs trigger dispatch and executes the matched agent
- **Dashboard jobs** (`manual-run`, `retry-run`, `debug-analysis`) — call `processDashboardJob()`, which loads project config and invokes the appropriate runner
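
The split above reduces to a two-way switch on job type. A sketch, with handler signatures as illustrative assumptions:

```typescript
// Hypothetical dispatch sketch; handler shapes are illustrative.
type JobType =
  | "trello" | "github" | "jira" | "sentry"        // webhook jobs
  | "manual-run" | "retry-run" | "debug-analysis"; // dashboard jobs

const WEBHOOK_TYPES: readonly JobType[] = ["trello", "github", "jira", "sentry"];

function dispatchJob(
  type: JobType,
  handlers: { webhook: (t: JobType) => string; dashboard: (t: JobType) => string },
): string {
  // Webhook jobs re-run trigger dispatch; dashboard jobs go to processDashboardJob().
  return WEBHOOK_TYPES.includes(type) ? handlers.webhook(type) : handlers.dashboard(type);
}
```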

## Dashboard

**Entry point**: `src/dashboard.ts`
**Default port**: 3001

The dashboard serves the tRPC API consumed by both the web frontend and the `cascade` CLI. In self-hosted mode, it also serves the built frontend as static files.

### Routes

| Route | Purpose |
|-------|---------|
| `POST /api/auth/login` | Email/password authentication |
| `POST /api/auth/logout` | Session invalidation |
| `/trpc/*` | tRPC API endpoints |
| `GET /health` | Service health check |
| `/*` (static) | Frontend files from `dist/web/` (self-hosted mode only) |

### Startup sequence

Module-load phase (runs at import time, before `startDashboard()`):
1. `registerBuiltInEngines()` — register engine settings schemas
2. CORS middleware, logging middleware registered on Hono app
3. Auth routes mounted (`/api/auth/login`, `/api/auth/logout`)
4. tRPC router mounted with session-based context resolution
5. Static file serving (if `dist/web/` exists)

`startDashboard()` async phase:
6. `initPrompts()` — load prompt templates
7. `serve()` — start Hono HTTP server

### tRPC context

Every tRPC request builds a context containing:
- `user` — resolved from session cookie via `resolveUserFromSession()`
- `effectiveOrgId` — computed from user's org membership or `x-org-context` header

Procedure types enforce auth levels: `publicProcedure`, `protectedProcedure`, `adminProcedure`, `superAdminProcedure`.
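
The tiered procedures can be pictured as role guards over the context. The sketch below is plain TypeScript rather than actual tRPC middleware, and the role names are assumptions:

```typescript
// Illustrative auth-tier guard; role names and context shape are assumptions.
interface User {
  id: string;
  role: "member" | "admin" | "superadmin";
}

interface Ctx {
  user: User | null;          // resolved from the session cookie
  effectiveOrgId: string | null; // from org membership or x-org-context
}

const ROLE_ORDER = ["member", "admin", "superadmin"] as const;

// protectedProcedure ~ requireRole(ctx, "member"),
// adminProcedure ~ requireRole(ctx, "admin"), and so on.
function requireRole(ctx: Ctx, atLeast: User["role"]): User {
  if (!ctx.user) throw new Error("UNAUTHORIZED");
  if (ROLE_ORDER.indexOf(ctx.user.role) < ROLE_ORDER.indexOf(atLeast)) {
    throw new Error("FORBIDDEN");
  }
  return ctx.user;
}
```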