docs/ARCHITECTURE.md
# CASCADE Architecture

CASCADE is a PM-to-Code automation platform that connects project management tools (Trello, JIRA), source control (GitHub), and monitoring (Sentry) to AI-powered agents that autonomously implement features, review PRs, debug failures, and manage backlogs. Webhooks from external providers flow through a router, get queued in Redis, and are processed by ephemeral worker containers that run agents against cloned repositories.

> **Relationship to CLAUDE.md**: `CLAUDE.md` is the operational reference (commands, env vars, how-to). This document and its deep-dives cover the *system design* — how components fit together and why.

## System Overview

```mermaid
graph TB
subgraph External["External Providers"]
Trello
JIRA
GitHub
Sentry
end

subgraph CASCADE["CASCADE Platform"]
Router["Router :3000<br/>Webhook receiver"]
Redis[(Redis / BullMQ)]
Worker["Worker containers<br/>One job per container"]
Dashboard["Dashboard :3001<br/>API + tRPC"]
DB[(PostgreSQL)]
end

subgraph Clients
WebUI["Dashboard UI"]
CLI["cascade CLI"]
end

Trello -->|webhook| Router
JIRA -->|webhook| Router
GitHub -->|webhook| Router
Sentry -->|webhook| Router

Router -->|enqueue job| Redis
Redis -->|dequeue job| Worker

Worker -->|PRs, comments| GitHub
Worker -->|status updates| Trello
Worker -->|status updates| JIRA

Router <--> DB
Worker <--> DB
Dashboard <--> DB
Dashboard <--> Redis

WebUI <--> Dashboard
CLI <--> Dashboard
```

See also: [`docs/architecture.d2`](architecture.d2) for the D2 source diagram.

## Service Topology

| Service | Entry Point | Default Port | Responsibility |
|---------|-------------|-------------|----------------|
| **Router** | `src/router/index.ts` | 3000 | Receive webhooks, verify signatures, run trigger dispatch, enqueue jobs to Redis, manage worker containers |
| **Worker** | `src/worker-entry.ts` | N/A (ephemeral) | Process one job per container — run trigger handlers, execute agents, exit on completion |
| **Dashboard** | `src/dashboard.ts` | 3001 | tRPC API for web UI and CLI, session auth, serve frontend static files in self-hosted mode |

## End-to-End Request Flow

The canonical path from webhook to pull request:

```mermaid
sequenceDiagram
participant P as Provider<br/>(Trello/GitHub/JIRA/Sentry)
participant R as Router
participant Q as Redis/BullMQ
participant W as Worker
participant A as Agent Engine

P->>R: POST /provider/webhook
R->>R: Parse, verify signature, dedup
R->>R: Lookup project, dispatch triggers
R->>R: Check concurrency, post ack comment
R->>Q: Enqueue job
Q->>W: Spawn container with job env vars
W->>W: Bootstrap integrations, dispatch by job type
W->>W: Match trigger, resolve agent definition
W->>A: Execute agent (clone repo, run engine)
A->>A: LLM loop: read, edit, test, commit
A-->>P: Create PR / post comments / update status
W->>W: Finalize run record, cleanup, exit
```

## Architectural Patterns

**Registry pattern** — Integrations, triggers, engines, PM providers, and capabilities all use registries (singleton maps populated at bootstrap). Infrastructure code looks up by key with no provider-specific branching.
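
A minimal sketch of the registry shape described above. The `TriggerHandler` type and the `github:pr-opened` key are illustrative assumptions, not CASCADE's actual definitions:

```typescript
// Hypothetical registry sketch; entry types and keys are illustrative.
type TriggerHandler = { key: string; handle: (event: unknown) => void };

class Registry<T> {
  private entries = new Map<string, T>();

  register(key: string, entry: T): void {
    if (this.entries.has(key)) throw new Error(`duplicate registry key: ${key}`);
    this.entries.set(key, entry);
  }

  get(key: string): T {
    const entry = this.entries.get(key);
    if (!entry) throw new Error(`unknown registry key: ${key}`);
    return entry;
  }

  keys(): string[] {
    return [...this.entries.keys()];
  }
}

// Populated once at bootstrap; infrastructure code only ever calls get(key),
// with no provider-specific branching.
const triggers = new Registry<TriggerHandler>();
triggers.register("github:pr-opened", { key: "github:pr-opened", handle: () => {} });
```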

**Capability-driven tool resolution** — Agent YAML definitions declare required capabilities (`fs:read`, `pm:write`, `scm:pr`). At runtime, capabilities are resolved against available integrations to determine which gadgets (tools) the agent receives.
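
The resolution step can be sketched roughly as follows. The capability keys match the examples above, but the gadget names and the mapping table are assumptions for illustration:

```typescript
// Hypothetical capability-to-gadget mapping; gadget names are illustrative.
type Gadget = { name: string; capability: string };

const gadgetsByCapability: Record<string, Gadget[]> = {
  "fs:read": [{ name: "read_file", capability: "fs:read" }],
  "pm:write": [{ name: "update_card", capability: "pm:write" }],
  "scm:pr": [{ name: "open_pr", capability: "scm:pr" }],
};

// Given the capabilities an agent's YAML declares and the capabilities the
// project's integrations actually provide, return the gadgets the agent gets.
function resolveGadgets(declared: string[], available: Set<string>): Gadget[] {
  return declared
    .filter((cap) => available.has(cap))
    .flatMap((cap) => gadgetsByCapability[cap] ?? []);
}
```

Declared-but-unavailable capabilities simply yield no gadgets, so the same agent definition degrades gracefully across projects with different integrations.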

**Two-tier credential resolution** — In the router and dashboard, credentials are read from the `project_credentials` database table. In workers, the router pre-loads credentials as environment variables to avoid giving workers direct DB access to secrets.

**Dual-persona GitHub model** — Each project uses two GitHub bot accounts (implementer and reviewer) to prevent feedback loops. Agent type determines which persona token is used.

**YAML-based agent definitions** — Agents are defined declaratively in YAML files specifying identity, capabilities, triggers, prompts, and lifecycle hooks. Definitions resolve via three tiers: in-memory cache, database, then YAML files on disk.

**AsyncLocalStorage credential scoping** — Provider clients (GitHub, Trello, JIRA) use Node.js `AsyncLocalStorage` to scope credentials per-request, preventing cross-request credential leakage.
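
`AsyncLocalStorage` is a real Node.js API (`node:async_hooks`); the store shape and helper names below are a minimal sketch, not CASCADE's actual client code:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Illustrative credential store; the Credentials shape is an assumption.
interface Credentials {
  githubToken: string;
}

const credentialStore = new AsyncLocalStorage<Credentials>();

function currentToken(): string {
  const creds = credentialStore.getStore();
  if (!creds) throw new Error("no credentials in scope");
  return creds.githubToken;
}

// Each request runs inside its own scope, so concurrent requests for
// different projects can never observe each other's tokens.
async function handleRequest(creds: Credentials, work: () => Promise<string>): Promise<string> {
  return credentialStore.run(creds, work);
}
```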

## Directory Map

| Directory | Purpose |
|-----------|---------|
| `src/router/` | Webhook receiver, BullMQ producer, worker container management |
| `src/webhook/` | Shared webhook handler factory, parsers, signature verification, logging |
| `src/triggers/` | Event-to-agent routing: TriggerRegistry, TriggerHandler implementations |
| `src/agents/` | Agent definitions (YAML), profiles, capabilities, prompt templates |
| `src/backends/` | LLM execution engines: Claude Code, LLMist, Codex, OpenCode |
| `src/gadgets/` | Tool implementations agents use (file ops, PM, SCM, alerting, shell) |
| `src/integrations/` | Unified integration interfaces, registry, bootstrap |
| `src/pm/` | PM abstraction layer: provider interface, Trello/JIRA adapters, lifecycle |
| `src/github/` | GitHub API client, dual-persona model, PR operations |
| `src/trello/` | Trello API client |
| `src/jira/` | JIRA API client (jira.js wrapper) |
| `src/sentry/` | Sentry API client, alerting integration |
| `src/config/` | Configuration provider, caching, credential resolution, integration roles |
| `src/db/` | Drizzle ORM schema, repositories, migrations |
| `src/api/` | tRPC routers for dashboard API |
| `src/cli/` | Two CLIs: `cascade` (dashboard) and `cascade-tools` (agent tools) |
| `src/utils/` | Logging, repo cloning, lifecycle/watchdog, env scrubbing |
| `src/types/` | Shared TypeScript types |
| `src/queue/` | BullMQ queue helpers |

## Deep-Dive Documents

1. [Services and Deployment](./architecture/01-services.md) — Three-service architecture, startup sequences, container model
2. [Webhook Pipeline](./architecture/02-webhook-pipeline.md) — Handler factory, platform adapters, processing pipeline
3. [Trigger System](./architecture/03-trigger-system.md) — TriggerRegistry, handlers, config resolution, context pipeline
4. [Agent System](./architecture/04-agent-system.md) — YAML definitions, profiles, capabilities, prompts, hooks
5. [Engine Backends](./architecture/05-engine-backends.md) — AgentEngine interface, archetypes, execution adapter
6. [Integration Layer](./architecture/06-integration-layer.md) — IntegrationModule, registry, categories, provider implementations
7. [Gadgets](./architecture/07-gadgets.md) — Capability-to-gadget mapping, built-in tools, cascade-tools CLI
8. [Configuration and Credentials](./architecture/08-config-credentials.md) — Config provider, credential resolution, encryption
9. [Database](./architecture/09-database.md) — Schema, ER diagram, repositories, migrations
10. [Resilience](./architecture/10-resilience.md) — Watchdog, concurrency controls, rate limiting, retry, loop prevention
docs/architecture/01-services.md
# Services and Deployment

CASCADE runs as three independent services. There is no monolithic server mode — each service has a distinct entry point, lifecycle, and scaling model.

```mermaid
graph LR
subgraph Router["Router Container"]
R_Hono["Hono :3000"]
R_BullMQ["BullMQ Producer"]
R_WM["Worker Manager"]
end

subgraph Workers["Worker Containers (ephemeral)"]
W1["Worker 1"]
W2["Worker 2"]
WN["Worker N"]
end

subgraph Dashboard["Dashboard Container"]
D_Hono["Hono :3001"]
D_tRPC["tRPC Router"]
end

Redis[(Redis)]
DB[(PostgreSQL)]

R_Hono --> R_BullMQ --> Redis
R_WM --> Workers
Redis --> R_WM

D_Hono --> D_tRPC
Dashboard <--> DB
Router <--> DB
Workers <--> DB
```

## Router

**Entry point**: `src/router/index.ts`
**Default port**: 3000

The router is the webhook ingestion point. It receives HTTP POST requests from external providers, processes them through a multi-step pipeline, and enqueues jobs to Redis for worker containers.

### Webhook endpoints

| Route | Provider | Notes |
|-------|----------|-------|
| `POST /trello/webhook` | Trello | HEAD/GET returns 200 for Trello's verification |
| `POST /github/webhook` | GitHub | Injects `X-GitHub-Event` header into payload |
| `POST /jira/webhook` | JIRA | HEAD/GET returns 200 for JIRA verification |
| `POST /sentry/webhook/:projectId` | Sentry | Project ID in URL for unambiguous routing |
| `GET /health` | Internal | Queue stats, active worker count |

### Startup sequence

Module-load phase (runs at import time, before `startRouter()`):
1. `registerBuiltInEngines()` — register engine settings schemas (required before any `loadConfig()`)
2. `createTriggerRegistry()` + `registerBuiltInTriggers()` — populate trigger handlers

`startRouter()` async phase:
3. `seedAgentDefinitions()` — sync built-in YAML definitions to database
4. `initAgentMessages()` — load ack message templates
5. `initPrompts()` — load prompt templates
6. `startCancelListener()` — listen for run cancellation requests
7. `startWorkerProcessor()` — begin polling BullMQ for jobs and spawning containers
8. `serve()` — start Hono HTTP server

### Key modules

| File | Purpose |
|------|---------|
| `webhook-processor.ts` | Generic 12-step pipeline (see [02-webhook-pipeline](./02-webhook-pipeline.md)) |
| `platform-adapter.ts` | `RouterPlatformAdapter` interface |
| `adapters/` | Per-provider adapter implementations |
| `worker-manager.ts` | Spawns/monitors Docker worker containers |
| `queue.ts` | BullMQ `addJob()`, queue stats |
| `action-dedup.ts` | In-memory deduplication of webhook deliveries |
| `work-item-lock.ts` | Prevents concurrent agents on the same work item |
| `agent-type-lock.ts` | Agent-type concurrency limits |
| `cancel-listener.ts` | Listens for run cancellation via BullMQ events |
| `webhookVerification.ts` | HMAC signature verification per provider |

## Worker

**Entry point**: `src/worker-entry.ts`
**Port**: None (ephemeral container, no HTTP server)

Workers are stateless, one-job-per-container processes spawned by the router's worker manager. Each worker reads its job from environment variables, processes it, and exits.

### Environment variables

The router passes job data to workers via Docker container env vars:

| Variable | Purpose |
|----------|---------|
| `JOB_ID` | Unique job identifier |
| `JOB_TYPE` | `trello`, `github`, `jira`, `sentry`, `manual-run`, `retry-run`, `debug-analysis` |
| `JOB_DATA` | JSON-encoded job payload |
| `CASCADE_CREDENTIAL_KEYS` | Comma-separated list of credential env var names |
| Individual credential vars | Pre-loaded project credentials (e.g., `GITHUB_TOKEN_IMPLEMENTER`) |
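
A worker's view of the table above can be sketched as a small parser over `process.env`. The `WorkerJob` type and the parsing details are illustrative assumptions:

```typescript
// Hypothetical sketch of reading a job from worker env vars; not actual code.
interface WorkerJob {
  id: string;
  type: string;
  data: unknown;
  credentialKeys: string[];
}

function readJobFromEnv(env: NodeJS.ProcessEnv): WorkerJob {
  const { JOB_ID, JOB_TYPE, JOB_DATA, CASCADE_CREDENTIAL_KEYS } = env;
  if (!JOB_ID || !JOB_TYPE || !JOB_DATA) throw new Error("missing job env vars");
  return {
    id: JOB_ID,
    type: JOB_TYPE,
    data: JSON.parse(JOB_DATA), // JOB_DATA is a JSON-encoded payload
    credentialKeys: CASCADE_CREDENTIAL_KEYS?.split(",").filter(Boolean) ?? [],
  };
}
```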

### Job types

```typescript
type JobData =
| TrelloJobData // Trello webhook payload
| GitHubJobData // GitHub webhook payload
| JiraJobData // JIRA webhook payload
| SentryJobData // Sentry webhook payload
| ManualRunJobData // Dashboard-initiated run
| RetryRunJobData // Retry a failed run
| DebugAnalysisJobData; // Post-mortem debug analysis
```

### Startup sequence

1. `loadEnvConfigSafe()` — load `.cascade/env` if present
2. `getDb()` — eagerly initialize DB connection (caches pool before env scrub)
3. `registerBuiltInEngines()` — register engine settings schemas (before `loadConfig()`)
4. `loadConfig()` — cache project config from database
5. `seedAgentDefinitions()` — sync built-in YAML definitions to database
6. `initAgentMessages()` — load ack message templates
7. `initPrompts()` — load prompt templates
8. `scrubSensitiveEnv()` — remove `DATABASE_URL` and other secrets from `process.env`
9. `createTriggerRegistry()` + `registerBuiltInTriggers()` — populate trigger handlers
10. `dispatchJob()` — route to the appropriate handler based on `JOB_TYPE`

The security scrub in step 8 prevents agent engines (which execute arbitrary LLM-generated commands) from accessing database credentials. Note that trigger registration (step 9) happens after the scrub — it only needs the in-memory config, not the database.
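
The scrub in step 8 amounts to deleting known-sensitive keys from `process.env` before any agent code runs. A minimal sketch, assuming a key list (the actual set of scrubbed variables beyond `DATABASE_URL` is not specified here):

```typescript
// Hypothetical env scrub; the exact key list is an assumption.
const SENSITIVE_KEYS = ["DATABASE_URL", "REDIS_URL"];

// Remove sensitive vars in place; returns the keys actually removed.
function scrubSensitiveEnv(env: NodeJS.ProcessEnv): string[] {
  const removed: string[] = [];
  for (const key of SENSITIVE_KEYS) {
    if (key in env) {
      delete env[key];
      removed.push(key);
    }
  }
  return removed;
}
```

This only works because step 2 caches the DB pool eagerly: the live connection survives the scrub while the credential string does not.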

### Dispatch flow

`dispatchJob()` switches on the job type:
- **Webhook jobs** (`trello`, `github`, `jira`, `sentry`) — call the provider-specific webhook processor, which re-runs trigger dispatch and executes the matched agent
- **Dashboard jobs** (`manual-run`, `retry-run`, `debug-analysis`) — call `processDashboardJob()`, which loads project config and invokes the appropriate runner
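
The split above reduces to a two-way switch on job type. A sketch, with handler signatures as illustrative assumptions:

```typescript
// Hypothetical dispatch sketch; handler shapes are illustrative.
type JobType =
  | "trello" | "github" | "jira" | "sentry"        // webhook jobs
  | "manual-run" | "retry-run" | "debug-analysis"; // dashboard jobs

const WEBHOOK_TYPES: readonly JobType[] = ["trello", "github", "jira", "sentry"];

function dispatchJob(
  type: JobType,
  handlers: { webhook: (t: JobType) => string; dashboard: (t: JobType) => string },
): string {
  // Webhook jobs re-run trigger dispatch; dashboard jobs go to processDashboardJob().
  return WEBHOOK_TYPES.includes(type) ? handlers.webhook(type) : handlers.dashboard(type);
}
```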

## Dashboard

**Entry point**: `src/dashboard.ts`
**Default port**: 3001

The dashboard serves the tRPC API consumed by both the web frontend and the `cascade` CLI. In self-hosted mode, it also serves the built frontend as static files.

### Routes

| Route | Purpose |
|-------|---------|
| `POST /api/auth/login` | Email/password authentication |
| `POST /api/auth/logout` | Session invalidation |
| `/trpc/*` | tRPC API endpoints |
| `GET /health` | Service health check |
| `/*` (static) | Frontend files from `dist/web/` (self-hosted mode only) |

### Startup sequence

Module-load phase (runs at import time, before `startDashboard()`):
1. `registerBuiltInEngines()` — register engine settings schemas
2. CORS middleware, logging middleware registered on Hono app
3. Auth routes mounted (`/api/auth/login`, `/api/auth/logout`)
4. tRPC router mounted with session-based context resolution
5. Static file serving (if `dist/web/` exists)

`startDashboard()` async phase:
6. `initPrompts()` — load prompt templates
7. `serve()` — start Hono HTTP server

### tRPC context

Every tRPC request builds a context containing:
- `user` — resolved from session cookie via `resolveUserFromSession()`
- `effectiveOrgId` — computed from user's org membership or `x-org-context` header

Procedure types enforce auth levels: `publicProcedure`, `protectedProcedure`, `adminProcedure`, `superAdminProcedure`.
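
The tiered procedures can be pictured as role guards over the context. The sketch below is plain TypeScript rather than actual tRPC middleware, and the role names are assumptions:

```typescript
// Illustrative auth-tier guard; role names and context shape are assumptions.
interface User {
  id: string;
  role: "member" | "admin" | "superadmin";
}

interface Ctx {
  user: User | null;          // resolved from the session cookie
  effectiveOrgId: string | null; // from org membership or x-org-context
}

const ROLE_ORDER = ["member", "admin", "superadmin"] as const;

// protectedProcedure ~ requireRole(ctx, "member"),
// adminProcedure ~ requireRole(ctx, "admin"), and so on.
function requireRole(ctx: Ctx, atLeast: User["role"]): User {
  if (!ctx.user) throw new Error("UNAUTHORIZED");
  if (ROLE_ORDER.indexOf(ctx.user.role) < ROLE_ORDER.indexOf(atLeast)) {
    throw new Error("FORBIDDEN");
  }
  return ctx.user;
}
```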