diff --git a/docs/advisor_framework_strategy.md b/docs/advisor_framework_strategy.md
new file mode 100644
index 0000000..58b66ed
--- /dev/null
+++ b/docs/advisor_framework_strategy.md
@@ -0,0 +1,197 @@
+# Advisor framework strategy (multi-source; push-first as a working hypothesis)
+
+This doc is **product/architecture oriented** and intended as a jumping-off point for engineering investigation. It’s written to be OSS-friendly (conceptual, not a binding implementation spec).
+
+It does *not* propose adding non-SQL “advisor engines” into Splinter. Instead, Splinter remains a **SQL-only** producer of Postgres findings, while other sources (logs/traces/metrics/edge/humans) can create Advisor issues via an ingestion API.
+
+References:
+- Splinter docs: `https://supabase.github.io/splinter/`
+- Supabase Database Advisors docs: `https://supabase.com/docs/guides/database/database-advisors?queryGroups=lint&lint=0012_auth_allow_anonymous_sign_ins`
+
+## Key design choice: push vs pull
+
+### Push (often a good fit)
+Producers evaluate signals “close to the data” and **push** Advisor issues:
+- Logs/traces: Logflare/ClickHouse queries + thresholding in a job or alerting layer
+- Metrics: **VictoriaMetrics + vmalert** (or Alertmanager) → webhook to the ingestion API
+- Product services: plan/usage thresholds, resource exhaustion, quota events
+- Edge Functions: detect misconfigurations and push issues
+- Humans / Assistant: manual issues (“broadcasting important updates”)
+
+Pros:
+- Uses each system’s native evaluation semantics (PromQL + alert windows, etc.)
+- Avoids Supabase embedding every query engine + auth model
+- Reduces load on Postgres/telemetry backends from “always-on polling”
+
+### Pull (fallback / managed checks)
+Supabase periodically queries sources (ClickHouse, PromQL, etc.) and synthesizes issues.
+
+Pros:
+- Great for onboarding (“turnkey checks”)
+- Works even if users haven’t wired alerting/webhooks
+
+Cons:
+- Forces Supabase to own connectors, auth, safe query execution, and alert semantics
+
+**Working hypothesis:** provide a **push-first ingestion API** and optionally layer **managed pull-based checks** behind the same ingestion + issue model.
+
+## Advisor API surface (conceptual capabilities)
+
+Minimum viable operations (regardless of URL shape):
+- **Upsert** an issue (create or update, typically setting state to `open`)
+- **Resolve** an issue (set state to `resolved`)
+- **Dismiss** an issue (set state to `dismissed`)
+- **Snooze** an issue (set state to `snoozed` until a timestamp)
+
+You can also unify create/resolve as a single “upsert current state” operation if the payload includes `state`.
+
+## Webhooks as first-class plumbing
+
+If Supabase is investing in “more webhooks” functionality, it fits the Advisor strategy in **two** complementary ways:
+
+### 1) Inbound webhooks (create/update Advisor issues)
+Treat a webhook delivery as a **producer** of Advisor issues. Examples:
+- VictoriaMetrics/vmalert/Alertmanager firing events → create/update/resolve Advisor issues
+- log-based alerting (Logflare/ClickHouse) → create/update/resolve Advisor issues
+- internal platform events (resource exhaustion, quota events, deploy events) → create/update/resolve Advisor issues
+- schema change events (e.g. “policy changed”, “RLS disabled”) → create Advisor issues or prompt re-run of Splinter
+
+This aligns with your goal: “Advisor as a flexible issue creation and triggers framework”.
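+
+Tying the API surface and these inbound producers together: below is a minimal producer-side sketch. The endpoint URL, auth handling, and helper names are assumptions for illustration (this is not a shipped Supabase API), and the payload fields anticipate the ingestion payload suggested later in this doc.
+
+```python
+# Hypothetical producer-side client for the Advisor ingestion API.
+# INGEST_URL and all field names are assumptions, not a real endpoint.
+import json
+import urllib.request
+
+INGEST_URL = "https://api.example.com/v1/advisor/issues"  # hypothetical
+
+def _post(payload: dict) -> None:
+    req = urllib.request.Request(
+        INGEST_URL,
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+    urllib.request.urlopen(req)
+
+def upsert_issue(project_ref: str, type_: str, fingerprint: str, **fields) -> None:
+    # Create-or-update; if `state` is omitted, the backend defaults it to "open".
+    _post({"project_ref": project_ref, "type": type_,
+           "fingerprint": fingerprint, **fields})
+
+def set_state(project_ref: str, type_: str, fingerprint: str, state: str,
+              snooze_until: str | None = None) -> None:
+    # Resolve/dismiss/snooze expressed as a single "upsert current state" call.
+    payload = {"project_ref": project_ref, "type": type_,
+               "fingerprint": fingerprint, "state": state}
+    if snooze_until is not None:
+        payload["snooze_until"] = snooze_until  # hypothetical field
+    _post(payload)
+```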
+
+### 2) Outbound webhooks (broadcast Advisor issues)
+When an Advisor issue is created/updated/resolved/dismissed, Supabase can emit **outbound** events to user-configured destinations:
+- Slack, email, PagerDuty, generic webhooks, etc.
+
+This is separate from issue *creation* but part of the end-to-end “alerting/notifications + advisor” experience:
+- Advisor issue = the durable, discussable, dismissible record
+- Outbound webhooks = delivery/routing based on user preferences
+
+Practical implementation details:
+- model outbound delivery rules separately from issue types
+- support per-project routing and per-type routing (e.g. only `ERROR` to PagerDuty)
+
+## Issue identity: generalize Splinter’s `cache_key`
+
+Splinter already encodes the correct idea:
+- each finding has a stable `cache_key` that can be used for suppression/dedupe
+
+Generalize this for *all* sources:
+- **type** (aka check id): e.g. `splinter/unindexed_foreign_keys`, `logs/http_5xx_spike`, `metrics/cpu_hot`
+- **fingerprint**: stable instance key within a type
+
+Store uniqueness as:
+- `(project_ref, type, fingerprint)`
+
+### Lifecycle state
+- **open**: currently firing / present
+- **resolved**: no longer firing / no longer present
+- **dismissed**: user doesn’t want to see it (until explicitly reopened)
+- **snoozed**: suppressed until a timestamp
+
+## Suggested ingestion payload (conceptual)
+
+An ingestion payload should be able to carry:
+
+- `type` (string)
+- `fingerprint` (string)
+- `severity` (`ERROR|WARN|INFO`)
+- `title` (string)
+- `detail` (string)
+- `categories` (string[])
+- `remediation` (url/string, optional)
+- `metadata` (object, optional)
+- `source` (object, optional): `{ kind: "splinter|clickhouse|vmalert|edge|manual", ref?: string }`
+- `state` (optional): if absent, defaults to `open`
+- `observed_at` (timestamp, optional)
+
+On the backend, this becomes a durable issue record with timestamps (`first_seen`, `last_seen`, etc.).
+
+## How sources map to the ingestion model
+
+### Splinter (Postgres schema advisors)
+Splinter stays SQL-only:
+- lints in `lints/` emit rows with `cache_key`
+- `splinter.sql` unions all lints
+
+Producer job options:
+- UI job (on-demand “rerun advisors”)
+- backend cron (e.g. daily) to keep issues warm
+
+Flow:
+1) run `splinter.sql`
+2) convert each row to `type = "splinter/{name}"` and `fingerprint = cache_key`
+3) upsert issues
+4) resolve issues for that `type` that were not returned (or use an explicit `resolve` call)
+
+### Logs/traces (Logflare → ClickHouse)
+Push-first options:
+- a scheduled job queries ClickHouse for the last N minutes and pushes issues
+- a logs alerting layer (if present) calls the ingestion endpoint directly
+
+Example types:
+- `logs/http_5xx_spike` (fingerprint: `service=X,route=Y`)
+- `traces/p95_latency_regression` (fingerprint: `service=X,operation=Y`)
+
+### Metrics (VictoriaMetrics)
+Preferred path: **vmalert → webhook**.
+
+Why:
+- vmalert already encodes windowing, alert state, de-dupe, and silence/inhibition concepts
+- Supabase only needs to map alert events into Advisor issues (sketched below)
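+
+A sketch of that mapping, assuming an Alertmanager-compatible webhook payload (`{"alerts": [{"status", "labels", "annotations", ...}]}`) and reusing the hypothetical ingestion helpers sketched earlier; the `metrics/` namespacing and severity choice are illustrative:
+
+```python
+# Sketch: map Alertmanager-style alert events to Advisor issue state.
+def handle_alert_webhook(project_ref: str, payload: dict) -> None:
+    for alert in payload.get("alerts", []):
+        labels = alert.get("labels", {})
+        annotations = alert.get("annotations", {})
+        # type: stable, namespaced alert name
+        type_ = f"metrics/{labels.get('alertname', 'unknown')}"
+        # fingerprint: stable label-set identity (alertname excluded)
+        fingerprint = ",".join(
+            f"{k}={v}" for k, v in sorted(labels.items()) if k != "alertname"
+        )
+        if alert.get("status") == "resolved":
+            set_state(project_ref, type_, fingerprint, "resolved")
+        else:  # "firing"
+            upsert_issue(
+                project_ref, type_, fingerprint,
+                severity="WARN",  # could instead be derived from a severity label
+                title=annotations.get("summary", type_),
+                detail=annotations.get("description", ""),
+                observed_at=alert.get("startsAt"),
+            )
+```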
+
+Fallback path: managed pull checks.
+- Supabase runs curated PromQL on a schedule
+- series returned above the threshold → upsert issues
+- series no longer returned → resolve issues
+
+### VictoriaMetrics SaaS alerting (managed alert rules)
+If you’re using VictoriaMetrics’ SaaS/managed offerings for alerting (rule management UI, managed alert evaluation, etc.), treat it the same as vmalert conceptually:
+- alert evaluation happens in VictoriaMetrics’ alerting system
+- alerts deliver to a webhook receiver
+- the Advisor ingestion API is one such receiver
+
+The key mapping is still:
+- `type`: stable alert name (namespaced, e.g. `metrics/cpu_hot`)
+- `fingerprint`: stable label-set identity (e.g. `instance=db-01` or `cluster=X:instance=Y`)
+- `state`: open when firing, resolved when cleared
+
+What we likely need to confirm (since SaaS feature sets vary):
+- what webhook payload shape is emitted (Alertmanager-compatible vs custom)
+- whether “resolved” notifications are delivered
+- how de-dupe/inhibition/silences are represented
+
+## Standardized Advisor types (registry) vs custom
+
+You likely want both:
+- **Curated registry** of known types for UI, docs, and consistent severity/category semantics
+- **Custom types** for user-defined rules and internal experimentation
+
+Practical compromise:
+- accept any `type` string
+- Studio treats known prefixes (e.g. `splinter/`, `logs/`, `traces/`, `metrics/`, `billing/`) as first-class
+- optionally expose `GET /advisor/types` for metadata (title, docs link, default severity)
+
+## What belongs in Splinter (and what doesn’t)
+
+Splinter should remain focused on what it’s good at:
+- deterministic Postgres catalog/state checks
+- SQL queries that return structured findings
+- stable fingerprints (`cache_key`)
+
+It should *not* embed:
+- alerting engines
+- time-windowed evaluation for logs/traces/metrics
+- notification routing logic
+
+## Open questions (for an engineering spike)
+
+These are intentionally left open to avoid over-prescribing implementation:
+
+- **Inbound webhook shape**: will inbound alerts be Alertmanager-compatible, vendor-specific, or both?
+- **Resolved semantics**: do upstream systems send “resolved” events, or do we infer resolution?
+- **Auth**: how do producers authenticate to the ingestion API (per-project secrets, signed webhooks, service tokens)?
+- **Multi-tenancy**: how does a producer target the correct `project_ref` (explicit routing vs token-bound project)?
+- **Dedupe**: what’s the recommended fingerprint scheme per source (labelset hashing, stable IDs, etc.)? See the sketch after this list.
+- **Type registry**: do we want a curated set of “known types” (for UI/UX), plus custom types (for extensibility)?
+- **Outbound routing**: how should outbound webhooks/notifications relate to issues (per type, per severity, per project)?
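+
+On the **Dedupe** question, one candidate scheme is canonicalized labelset hashing. A purely illustrative sketch (not a recommendation):
+
+```python
+# Sketch: derive a stable fingerprint from a label set by canonicalizing
+# (sorted keys) and hashing, i.e. the "labelset hashing" option above.
+import hashlib
+
+def labelset_fingerprint(labels: dict[str, str]) -> str:
+    canonical = ",".join(f"{k}={labels[k]}" for k in sorted(labels))
+    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
+
+a = labelset_fingerprint({"instance": "db-01", "cluster": "x"})
+b = labelset_fingerprint({"cluster": "x", "instance": "db-01"})
+assert a == b  # label order does not affect identity
+```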
diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..9887aa1
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,114 @@
+# Architecture: Splinter and Supabase Database Advisors
+
+This document explains how this repository works today (Splinter), how it maps to Supabase’s “Database Advisors” UI, and why the current approach is inherently **SQL/Postgres-only**.
+
+For end-user-facing documentation of individual lints, see the lint pages on the Splinter docs site (`https://supabase.github.io/splinter/`) and the Supabase docs ([Database Advisors](https://supabase.com/docs/guides/database/database-advisors?queryGroups=lint&lint=0012_auth_allow_anonymous_sign_ins)).
+
+## What Splinter is (and is not)
+
+Splinter is a **collection of SQL lints** intended to be run against a Supabase Postgres database. Each lint is implemented as a view that returns a standardized “issue row” shape.
+
+Splinter is **not** a generic alerting/advisory system. It does not:
+- schedule periodic evaluations
+- store issue lifecycle state (open/resolved/dismissed)
+- evaluate non-Postgres sources (logs, traces, metrics, client errors, billing/usage)
+
+## The lint interface (the contract)
+
+Each lint view returns the same columns (see `README.md`):
+- `name` (text) — stable check id (e.g. `unindexed_foreign_keys`)
+- `title` (text) — human-readable title
+- `level` (text) — `ERROR` / `WARN` / `INFO`
+- `facing` (text) — `INTERNAL` / `EXTERNAL`
+- `categories` (text[]) — e.g. `{SECURITY}` or `{PERFORMANCE}`
+- `description` (text) — why this matters
+- `detail` (text) — specific instance detail (table/view/function etc.)
+- `remediation` (text, optional) — documentation URL or guidance
+- `metadata` (jsonb, optional) — structured data for UI rendering or deep links
+- `cache_key` (text) — **stable fingerprint** used for de-dupe/suppression
+
+**Key point:** `cache_key` is the only stable identifier for “the same finding” across runs. It’s the hook needed for an issue lifecycle system.
+
+## How lints are authored
+
+Lints live in `lints/` as one `.sql` file per lint, each creating a view under the `lint` schema:
+
+- Example: `lints/0001_unindexed_foreign_keys.sql` creates `lint."0001_unindexed_foreign_keys"`.
+- Tests read from these views directly (see `test/sql/`).
+
+## How Splinter is “compiled” into one query
+
+Some consumers (like a UI) want to run a single query and receive a single result set containing all findings.
+
+Splinter ships that as `splinter.sql`, produced by `bin/compile.py`:
+- reads `lints/*.sql`
+- strips the `create view` line and semicolons
+- wraps each lint query in parentheses
+- concatenates them with `UNION ALL`
+- prefixes `set local search_path = '';` to enforce schema qualification
+
+Net: **`splinter.sql` is “all lints as one unionable query”.**
+
+## How Splinter is tested
+
+Tests are Postgres regression tests that validate each lint has at least one true-positive case:
+- `test/fixtures.sql` provisions the minimal schemas/roles/functions needed for lint execution.
+- `test/sql/*.sql` sets up example schema states and selects from `lint."00xx_*"`.
+- `test/expected/*.out` captures the expected output.
+- `bin/installcheck` runs pg_regress and also validates that `splinter.sql` runs.
+
+## How this maps to Supabase “Database Advisors”
+
+Supabase’s “Performance and Security Advisors” are a set of checks surfaced in the dashboard UI ([Database Advisors docs](https://supabase.com/docs/guides/database/database-advisors?queryGroups=lint&lint=0012_auth_allow_anonymous_sign_ins)).
+
+The “DB schema / security / performance” subset of those checks can be implemented as Splinter lints because they are answerable by querying Postgres system catalogs and Supabase schemas.
+
+Operationally, a consumer (such as Supabase Studio) can:
+- run `splinter.sql` (or an equivalent union query)
+- render each row as an “advisor finding”
+- allow users to dismiss/suppress findings via `cache_key` (sketched below)
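+
+A minimal sketch of that consumer-side flow, assuming the rows of `splinter.sql` have already been fetched as dicts (field names follow the lint contract above; how dismissed `cache_key`s are stored is left open):
+
+```python
+# Sketch: turn splinter.sql rows into renderable findings, suppressing
+# any whose cache_key the user has dismissed.
+def visible_findings(rows: list[dict], dismissed_cache_keys: set[str]) -> list[dict]:
+    findings = []
+    for row in rows:
+        if row["cache_key"] in dismissed_cache_keys:
+            continue  # user has dismissed this exact finding instance
+        findings.append({
+            "type": f"splinter/{row['name']}",  # namespaced, per the strategy doc
+            "severity": row["level"],           # ERROR / WARN / INFO
+            "title": row["title"],
+            "detail": row["detail"],
+            "remediation": row.get("remediation"),
+            "cache_key": row["cache_key"],
+        })
+    return findings
+```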
+
+## When are lints run (triggering) and where do results live?
+
+Splinter itself does **not** define scheduling, triggers, or persistence. It only defines the SQL that produces findings.
+
+In Supabase’s product, there are typically two different “execution contexts” to be aware of:
+
+### 1) Interactive / on-demand (e.g. Supabase Studio)
+In Supabase Studio (in the `supabase/supabase` repo), the current “lints/advisors” flow is implemented as an **on-demand SQL query** executed against the project’s Postgres database when the UI needs it (for example, when you open the Advisors page or click a re-run action).
+
+Concretely, Studio sends the lint SQL to a platform endpoint that proxies to `pg-meta`, executing the query against the project database (see `executeSql` in Studio, which posts to `/platform/pg-meta/{ref}/query` in `supabase/supabase`).
+
+This is why Splinter focuses on:
+- deterministic SQL
+- a predictable output shape
+- stable `cache_key` values (so consumers can de-dupe and suppress repeated findings)
+
+### 2) Scheduled / aggregated (analytics, fleet-level views)
+Separately from the UI, Supabase may run similar checks on a schedule and/or export results for analytics. For example, you mentioned querying a BigQuery table named `supabase-etl-prod-eu.dbt.project_lints`.
+
+That table name is a good example of historical terminology: the underlying concept is “advisor findings/issues”, but earlier naming used “lints”. This is worth calling out explicitly because Supabase’s long-term direction expands Advisors beyond SQL lints.
+
+Because this repo is OSS and Splinter-only, the precise scheduling/ETL details live outside this repo (Studio/backend/data pipelines). A reasonable mental model is:
+- Splinter provides the *finding generator* (SQL)
+- Supabase systems decide *when to run it* and *whether/how to store it* (UI state, issue lifecycle, analytics exports)
+
+## A concrete limitation example (docs vs SQL)
+
+This repo contains documentation for check `0012_auth_allow_anonymous_sign_ins` (`docs/0012_auth_allow_anonymous_sign_ins.md`), but there is no corresponding SQL lint in `lints/`.
+
+That’s a useful reminder: some “advisor” checks depend on **non-Postgres state** (e.g. auth configuration) or **time-windowed telemetry** (logs/traces/metrics), which can’t be expressed as a static Postgres catalog query.
+
+## Why this matters for “multi-source Advisors”
+
+If you want Advisors that can come from logs/traces/metrics/UI errors/etc., you need something beyond “run a SQL union query against Postgres”.
+
+Splinter’s lint rows are a good *finding format* for DB checks, but a full Advisor system also needs:
+- source-specific evaluation (ClickHouse, PromQL, etc.)
+- scheduling and/or event triggers
+- a stable issue model + dedupe
+- lifecycle storage (open/resolved/dismissed/snoozed)
+- routing/notification integrations
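+
+To make that last list concrete, a minimal sketch of such an issue model (field names follow the strategy doc; the in-memory dict is a stand-in for real persistence):
+
+```python
+# Sketch: a durable issue record keyed by (project_ref, type, fingerprint),
+# with lifecycle state and first_seen/last_seen timestamps.
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+
+def _now() -> datetime:
+    return datetime.now(timezone.utc)
+
+@dataclass
+class AdvisorIssue:
+    project_ref: str
+    type: str          # e.g. "splinter/unindexed_foreign_keys"
+    fingerprint: str   # stable instance key within the type
+    severity: str      # ERROR | WARN | INFO
+    title: str
+    state: str = "open"  # open | resolved | dismissed | snoozed
+    first_seen: datetime = field(default_factory=_now)
+    last_seen: datetime = field(default_factory=_now)
+
+_store: dict[tuple[str, str, str], AdvisorIssue] = {}
+
+def upsert(issue: AdvisorIssue) -> AdvisorIssue:
+    key = (issue.project_ref, issue.type, issue.fingerprint)
+    existing = _store.get(key)
+    if existing is None:
+        _store[key] = issue
+        return issue
+    # Same identity seen again: refresh last_seen; reopen only if it had
+    # resolved (dismissed/snoozed suppression is respected).
+    existing.last_seen = _now()
+    if existing.state == "resolved":
+        existing.state = "open"
+    return existing
+```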