197 changes: 197 additions & 0 deletions docs/advisor_framework_strategy.md
@@ -0,0 +1,197 @@
# Advisor framework strategy (multi-source; push-first as a working hypothesis)

This doc is **product/architecture oriented** and intended as a jumping-off point for engineering investigation. It’s written to be OSS-friendly (conceptual, not a binding implementation spec).

It does *not* propose adding non-SQL “advisor engines” into Splinter. Instead, Splinter remains a **SQL-only** producer of Postgres findings, while other sources (logs/traces/metrics/edge/humans) can create Advisor issues via an ingestion API.

References:
- Splinter docs: `https://supabase.github.io/splinter/`
- Supabase Database Advisors docs: `https://supabase.com/docs/guides/database/database-advisors?queryGroups=lint&lint=0012_auth_allow_anonymous_sign_ins`

## Key design choice: push vs pull

### Push (often a good fit)
Producers evaluate signals “close to the data” and **push** Advisor issues:
- Logs/traces: Logflare/ClickHouse queries + thresholding in a job or alerting layer
- Metrics: **VictoriaMetrics + vmalert** (or Alertmanager) → webhook to the ingestion API
- Product services: plan/usage thresholds, resource exhaustion, quota events
- Edge Functions: detect misconfigurations and push issues
- Humans / Assistant: manual issues (“broadcasting important updates”)

Pros:
- Uses each system’s native evaluation semantics (PromQL + alert windows, etc.)
- Avoids Supabase embedding every query engine + auth model
- Reduces load on Postgres/telemetry backends from “always-on polling”

### Pull (fallback / managed checks)
Supabase periodically queries sources (ClickHouse, PromQL, etc.) and synthesizes issues.

Pros:
- Great for onboarding (“turnkey checks”)
- Works even if users haven’t wired alerting/webhooks

Cons:
- Forces Supabase to own connectors, auth, safe query execution, and alert semantics

**Working hypothesis:** provide a **push-first ingestion API** and optionally layer **managed pull-based checks** behind the same ingestion + issue model.

## Advisor API surface (conceptual capabilities)

Minimum viable operations (regardless of URL shape):
- **Upsert** an issue (create or update, typically setting state to `open`)
- **Resolve** an issue (set state to `resolved`)
- **Dismiss** an issue (set state to `dismissed`)
- **Snooze** an issue (set state to `snoozed` until a timestamp)

You can also unify create/resolve as a single “upsert current state” operation if the payload includes `state`.
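
As a conceptual sketch, these capabilities could be expressed as the following client interface. The names, shapes, and defaults here are assumptions, not a committed API surface:

```ts
// Conceptual sketch only — names and shapes are assumptions, not a committed API surface.
// The full payload shape is sketched under "Suggested ingestion payload" below.
interface IssuePayload {
  type: string;        // e.g. "splinter/unindexed_foreign_keys"
  fingerprint: string; // stable instance key within the type
  state?: "open" | "resolved" | "dismissed" | "snoozed"; // defaults to "open"
  [field: string]: unknown; // severity, title, detail, ... (see below)
}

interface AdvisorIngestion {
  // Create the issue if it does not exist, otherwise update it; absent `state` defaults to "open".
  upsertIssue(projectRef: string, payload: IssuePayload): Promise<void>;
  // Set state to "resolved".
  resolveIssue(projectRef: string, type: string, fingerprint: string): Promise<void>;
  // Set state to "dismissed" until a user explicitly reopens it.
  dismissIssue(projectRef: string, type: string, fingerprint: string): Promise<void>;
  // Set state to "snoozed" until the given timestamp.
  snoozeIssue(projectRef: string, type: string, fingerprint: string, until: string): Promise<void>;
}
```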

## Webhooks as first-class plumbing

If Supabase is investing in “more webhooks” functionality, it fits the Advisor strategy in **two** complementary ways:

### 1) Inbound webhooks (create/update Advisor issues)
Treat a webhook delivery as a **producer** of Advisor issues. Examples:
- VictoriaMetrics/vmalert/Alertmanager firing events → create/update/resolve Advisor issues
- log-based alerting (Logflare/ClickHouse) → create/update/resolve Advisor issues
- internal platform events (resource exhaustion, quota events, deploy events) → create/update/resolve Advisor issues
- schema change events (e.g. “policy changed”, “RLS disabled”) → create Advisor issues or prompt re-run of Splinter

This aligns with your goal: “Advisor as a flexible issue creation and triggers framework”.

### 2) Outbound webhooks (broadcast Advisor issues)
When an Advisor issue is created/updated/resolved/dismissed, Supabase can emit **outbound** events to user-configured destinations:
- Slack, email, PagerDuty, generic webhooks, etc.

This is separate from issue *creation* but part of the end-to-end “alerting/notifications + advisor” experience:
- Advisor issue = the durable, discussable, dismissable record
- Outbound webhooks = delivery/routing based on user preferences

Practical implementation detail:
- model outbound delivery rules separately from issue types
- support per-project routing and per-type routing (e.g. only `ERROR` to PagerDuty)

## Issue identity: generalize Splinter’s `cache_key`

Splinter already encodes the correct idea:
- each finding has a stable `cache_key` that can be used for suppression/dedupe

Generalize this for *all* sources:
- **type** (aka check id): e.g. `splinter/unindexed_foreign_keys`, `logs/http_5xx_spike`, `metrics/cpu_hot`
- **fingerprint**: stable instance key within a type

Store uniqueness as:
- `(project_ref, type, fingerprint)`

### Lifecycle state
- **open**: currently firing / present
- **resolved**: no longer firing / no longer present
- **dismissed**: user doesn’t want to see it (until explicitly reopened)
- **snoozed**: suppressed until a timestamp
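
A minimal TypeScript sketch of the identity and lifecycle model above (field and helper names are illustrative):

```ts
// Sketch only — field names are illustrative, not a committed schema.
type IssueState = "open" | "resolved" | "dismissed" | "snoozed";

interface IssueIdentity {
  projectRef: string;  // tenant the issue belongs to
  type: string;        // check id, e.g. "logs/http_5xx_spike"
  fingerprint: string; // stable instance key within the type, e.g. "service=api,route=/v1/users"
}

// Uniqueness: one issue per (project_ref, type, fingerprint).
function issueKey(id: IssueIdentity): string {
  return `${id.projectRef}:${id.type}:${id.fingerprint}`;
}

// Snoozed issues additionally carry the timestamp they are suppressed until.
interface IssueLifecycle {
  state: IssueState;
  snoozedUntil?: string; // ISO timestamp, only meaningful when state === "snoozed"
}
```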

## Suggested ingestion payload (conceptual)

An ingestion payload should be able to carry:

- `type` (string)
- `fingerprint` (string)
- `severity` (`ERROR|WARN|INFO`)
- `title` (string)
- `detail` (string)
- `categories` (string[])
- `remediation` (url/string, optional)
- `metadata` (object, optional)
- `source` (object, optional): `{ kind: "splinter|clickhouse|vmalert|edge|manual", ref?: string }`
- `state` (optional): if absent, default to `open`
- `observed_at` (timestamp, optional)

On the backend, this becomes a durable issue record with timestamps (`first_seen`, `last_seen`, etc.).
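
As a TypeScript sketch of the payload above, plus an assumed shape for the durable record it becomes:

```ts
// Sketch of the conceptual payload above; not a committed wire format.
interface IngestionPayload {
  type: string;                        // e.g. "metrics/cpu_hot"
  fingerprint: string;                 // stable instance key within the type
  severity: "ERROR" | "WARN" | "INFO";
  title: string;
  detail: string;
  categories: string[];
  remediation?: string;                // URL or guidance
  metadata?: Record<string, unknown>;
  source?: { kind: "splinter" | "clickhouse" | "vmalert" | "edge" | "manual"; ref?: string };
  state?: "open" | "resolved" | "dismissed" | "snoozed"; // defaults to "open"
  observed_at?: string;                // ISO timestamp
}

// What the backend might persist — the timestamp fields are assumptions about the durable record.
interface IssueRecord extends IngestionPayload {
  project_ref: string;
  state: "open" | "resolved" | "dismissed" | "snoozed";
  first_seen: string;
  last_seen: string;
}
```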

## How sources map to the ingestion model

### Splinter (Postgres schema advisors)
Splinter stays SQL-only:
- lints in `lints/` emit rows with `cache_key`
- `splinter.sql` unions all lints

Producer job options:
- UI job (on-demand “rerun advisors”)
- backend cron (e.g. daily) to keep issues warm

Flow:
1) run `splinter.sql`
2) convert each row to `type = "splinter/{name}"` and `fingerprint = cache_key`
3) upsert issues
4) resolve open issues of each `splinter/*` type whose fingerprints were not returned by this run (or use an explicit `resolve` call); see the sketch below
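
A conceptual sketch of that flow — the `runSql` executor and `AdvisorClient` are hypothetical stand-ins for whatever execution and ingestion clients the producer uses:

```ts
// Sketch only — `runSql` and `AdvisorClient` are hypothetical; the row shape comes from the lint contract.
interface SplinterRow {
  name: string;                             // e.g. "unindexed_foreign_keys"
  title: string;
  level: "ERROR" | "WARN" | "INFO";
  categories: string[];
  description: string;
  detail: string;
  remediation: string | null;
  metadata: Record<string, unknown> | null;
  cache_key: string;                        // becomes the fingerprint
}

interface AdvisorClient {
  upsert(projectRef: string, payload: Record<string, unknown>): Promise<void>;
  // Resolve open issues under a type prefix that were not reported in this run.
  resolveMissing(
    projectRef: string,
    typePrefix: string,
    seen: Array<{ type: string; fingerprint: string }>,
  ): Promise<void>;
}

async function runSplinterProducer(
  projectRef: string,
  splinterSql: string,                              // contents of splinter.sql
  runSql: (sql: string) => Promise<SplinterRow[]>,  // hypothetical SQL executor
  advisor: AdvisorClient,
) {
  const rows = await runSql(splinterSql);
  const seen: Array<{ type: string; fingerprint: string }> = [];

  for (const row of rows) {
    const type = `splinter/${row.name}`;
    seen.push({ type, fingerprint: row.cache_key });
    await advisor.upsert(projectRef, {
      type,
      fingerprint: row.cache_key,
      severity: row.level,
      title: row.title,
      detail: row.detail,
      categories: row.categories,
      remediation: row.remediation ?? undefined,
      metadata: row.metadata ?? undefined,
      source: { kind: "splinter" },
    });
  }

  // Step 4: anything previously open under "splinter/" that did not come back is resolved.
  await advisor.resolveMissing(projectRef, "splinter/", seen);
}
```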

### Logs/traces (Logflare → ClickHouse)
Push-first options:
- a scheduled job queries ClickHouse for the last N minutes and pushes issues
- a logs alerting layer (if present) calls the ingestion endpoint directly

Example types:
- `logs/http_5xx_spike` (fingerprint: `service=X,route=Y`)
- `traces/p95_latency_regression` (fingerprint: `service=X,operation=Y`)
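
For illustration, a scheduled producer could turn a windowed ClickHouse aggregation into ingestion payloads like this (the row shape, threshold, and severity are hypothetical):

```ts
// Hypothetical producer for "logs/http_5xx_spike"; the aggregation result shape and threshold are illustrative only.
interface FiveXxRow {
  service: string;
  route: string;
  error_count: number; // 5xx responses counted over the evaluation window
}

function toSpikeIssues(rows: FiveXxRow[], threshold = 50) {
  return rows
    .filter((r) => r.error_count >= threshold)
    .map((r) => ({
      type: "logs/http_5xx_spike",
      fingerprint: `service=${r.service},route=${r.route}`, // stable within the type
      severity: "WARN" as const,
      title: `5xx spike on ${r.service} ${r.route}`,
      detail: `${r.error_count} 5xx responses in the evaluation window`,
      categories: ["PERFORMANCE"],
      source: { kind: "clickhouse" as const },
    }));
}
```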

### Metrics (VictoriaMetrics)
Preferred path: **vmalert → webhook**.

Why:
- vmalert already encodes windowing, alert state, de-dupe, and silence/inhibition concepts
- Supabase only needs to map alert events into Advisor issues
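
For illustration, mapping an Alertmanager-compatible webhook delivery into Advisor issues might look like the sketch below; the label-to-fingerprint scheme and the severity default are assumptions:

```ts
// Sketch only — assumes an Alertmanager-compatible webhook body; the fingerprint scheme is an assumption.
interface AlertmanagerAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;      // includes "alertname"
  annotations: Record<string, string>; // often "summary" / "description"
  startsAt: string;
  endsAt: string;
}

interface AlertmanagerWebhook {
  status: "firing" | "resolved";
  alerts: AlertmanagerAlert[];
}

function alertsToIssues(body: AlertmanagerWebhook) {
  return body.alerts.map((alert) => {
    const { alertname = "unknown_alert", ...rest } = alert.labels;
    // Stable label-set identity, e.g. "cluster=X:instance=Y".
    const fingerprint = Object.entries(rest)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${k}=${v}`)
      .join(":");

    return {
      type: `metrics/${alertname}`,   // e.g. "metrics/cpu_hot"
      fingerprint,
      state: alert.status === "firing" ? "open" : "resolved",
      severity: "WARN" as const,      // could instead be derived from a severity label
      title: alert.annotations.summary ?? alertname,
      detail: alert.annotations.description ?? "",
      categories: ["PERFORMANCE"],
      source: { kind: "vmalert" as const },
      observed_at: alert.startsAt,
    };
  });
}
```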

Fallback path: managed pull checks.
- Supabase runs curated PromQL on a schedule
- returns series above a threshold → upsert issues
- no longer returned → resolve issues

### VictoriaMetrics SaaS alerting (managed alert rules)
If you’re using VictoriaMetrics’ SaaS/managed offerings for alerting (rule management UI, managed alert evaluation, etc.), treat it the same as vmalert conceptually:
- alert evaluation happens in VictoriaMetrics’ alerting system
- alerts deliver to a webhook receiver
- the Advisor ingestion API is one such receiver

The key mapping is still:
- `type`: stable alert name (namespaced, e.g. `metrics/cpu_hot`)
- `fingerprint`: stable label-set identity (e.g. `instance=db-01` or `cluster=X:instance=Y`)
- `state`: open when firing, resolved when cleared

What we likely need to confirm (since SaaS feature sets vary):
- what webhook payload shape is emitted (Alertmanager-compatible vs custom)
- whether “resolved” notifications are delivered
- how de-dupe/inhibition/silences are represented

## Standardized Advisor types (registry) vs custom

You likely want both:
- **Curated registry** of known types for UI, docs, and consistent severity/category semantics
- **Custom types** for user-defined rules and internal experimentation

Practical compromise:
- accept any `type` string
- Studio treats known prefixes (e.g. `splinter/`, `logs/`, `traces/`, `metrics/`, `billing/`) as first-class
- optionally expose `GET /advisor/types` for metadata (title, docs link, default severity)
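
A sketch of what the registry metadata and prefix handling could look like (field names and the prefix check are assumptions):

```ts
// Sketch only — registry fields and the known-prefix handling are assumptions.
interface AdvisorTypeMetadata {
  type: string;                         // e.g. "splinter/unindexed_foreign_keys"
  title: string;
  docsUrl?: string;
  defaultSeverity: "ERROR" | "WARN" | "INFO";
  categories: string[];
}

const KNOWN_PREFIXES = ["splinter/", "logs/", "traces/", "metrics/", "billing/"];

// Studio could treat registered prefixes as first-class while still accepting any type string.
function isFirstClassType(type: string): boolean {
  return KNOWN_PREFIXES.some((prefix) => type.startsWith(prefix));
}
```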

## What belongs in Splinter (and what doesn’t)

Splinter should remain focused on what it’s good at:
- deterministic Postgres catalog/state checks
- SQL queries that return structured findings
- stable fingerprints (`cache_key`)

It should *not* embed:
- alerting engines
- time-windowed evaluation for logs/traces/metrics
- notification routing logic

## Open questions (for an engineering spike)

These are intentionally left open to avoid over-prescribing implementation:

- **Inbound webhook shape**: will inbound alerts be Alertmanager-compatible, vendor-specific, or both?
- **Resolved semantics**: do upstream systems send “resolved” events (or do we infer resolution)?
- **Auth**: how do producers authenticate to the ingestion API (per-project secrets, signed webhooks, service tokens)?
- **Multi-tenancy**: how does a producer target the correct `project_ref` (explicit routing vs token-bound project)?
- **Dedupe**: what’s the recommended fingerprint scheme per source (labelset hashing, stable IDs, etc.)?
- **Type registry**: do we want a curated set of “known types” (for UI/UX), plus custom types (for extensibility)?
- **Outbound routing**: how should outbound webhooks/notifications relate to issues (per type, per severity, per project)?

114 changes: 114 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,114 @@
# Architecture: Splinter and Supabase Database Advisors

This document explains how this repository works today (Splinter), how it maps to Supabase’s “Database Advisors” UI, and why the current approach is inherently **SQL/Postgres-only**.

For end-user facing documentation of individual lints, see the lint pages on the Splinter docs site (`https://supabase.github.io/splinter/`) and Supabase docs ([Database Advisors](https://supabase.com/docs/guides/database/database-advisors?queryGroups=lint&lint=0012_auth_allow_anonymous_sign_ins)).

## What Splinter is (and is not)

Splinter is a **collection of SQL lints** intended to be run against a Supabase Postgres database. Each lint is implemented as a view that returns a standardized “issue row” shape.

Splinter is **not** a generic alerting/advisory system. It does not:
- schedule periodic evaluations
- store issue lifecycle state (open/resolved/dismissed)
- evaluate non-Postgres sources (logs, traces, metrics, client errors, billing/usage)

## The lint interface (the contract)

Each lint view returns the same columns (see `README.md`):
- `name` (text) — stable check id (e.g. `unindexed_foreign_keys`)
- `title` (text) — human title
- `level` (text) — `ERROR` / `WARN` / `INFO`
- `facing` (text) — `INTERNAL` / `EXTERNAL`
- `categories` (text[]) — e.g. `{SECURITY}` or `{PERFORMANCE}`
- `description` (text) — why this matters
- `detail` (text) — specific instance detail (table/view/function etc.)
- `remediation` (text, optional) — documentation URL or guidance
- `metadata` (jsonb, optional) — structured data for UI rendering or deep links
- `cache_key` (text) — **stable fingerprint** used for de-dupe/suppression

**Key point:** `cache_key` is the only stable identifier for “the same finding” across runs. It’s the hook needed for an issue lifecycle system.
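
For consumers reading the result set, the contract maps naturally onto a typed row, sketched here in TypeScript (a conceptual representation; Splinter itself ships only SQL):

```ts
// Conceptual representation of a Splinter lint row; Splinter ships SQL, not this type.
interface LintRow {
  name: string;                              // stable check id, e.g. "unindexed_foreign_keys"
  title: string;
  level: "ERROR" | "WARN" | "INFO";
  facing: "INTERNAL" | "EXTERNAL";
  categories: string[];                      // e.g. ["SECURITY"]
  description: string;                       // why this matters
  detail: string;                            // specific instance (table/view/function, ...)
  remediation: string | null;                // documentation URL or guidance
  metadata: Record<string, unknown> | null;  // structured data for UI rendering / deep links
  cache_key: string;                         // stable fingerprint for de-dupe/suppression
}
```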

## How lints are authored

Lints live in `lints/` as one `.sql` file per lint, each creating a view under the `lint` schema:

- Example: `lints/0001_unindexed_foreign_keys.sql` creates `lint."0001_unindexed_foreign_keys"`.
- Tests read from these views directly (see `test/sql/`).

## How Splinter is “compiled” into one query

Some consumers (like a UI) want to run a single query and receive a single result set containing all findings.

Splinter ships that as `splinter.sql`, produced by `bin/compile.py`:
- reads `lints/*.sql`
- strips the `create view` line and semicolons
- wraps each lint query in parentheses
- concatenates them with `UNION ALL`
- prefixes `set local search_path = '';` to enforce schema qualification

Net: **`splinter.sql` is “all lints as one unionable query”.**
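
A simplified sketch of that transformation (the actual implementation is `bin/compile.py`; this only illustrates the shape of the logic):

```ts
// Simplified sketch of what bin/compile.py does; not the actual implementation.
function compileSplinter(lintSources: string[]): string {
  const wrapped = lintSources.map((sql) => {
    const body = sql
      .replace(/^\s*create\s+view\s+.*$/im, "") // drop the "create view ... as" line
      .trim()
      .replace(/;$/, "");                       // drop the trailing semicolon
    return `(\n${body}\n)`;                     // wrap each lint query in parentheses
  });

  // Empty search_path forces every lint to use fully schema-qualified references.
  return `set local search_path = '';\n\n${wrapped.join("\nunion all\n")};`;
}
```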

## How Splinter is tested

Tests are Postgres regression tests that validate each lint has at least a true-positive case:
- `test/fixtures.sql` provisions minimal schemas/roles/functions needed for lint execution.
- `test/sql/*.sql` sets up example schema states and selects from `lint."00xx_*"`.
- `test/expected/*.out` captures expected output.
- `bin/installcheck` runs pg_regress and also validates that `splinter.sql` runs.

## How this maps to Supabase “Database Advisors”

Supabase’s “Performance and Security Advisors” are a set of checks surfaced in the dashboard UI ([Database Advisors docs](https://supabase.com/docs/guides/database/database-advisors?queryGroups=lint&lint=0012_auth_allow_anonymous_sign_ins)).

The “DB schema / security / performance” subset of those checks can be implemented as Splinter lints because they are answerable by querying Postgres system catalogs and Supabase schemas.

Operationally, a consumer (such as Supabase Studio) can:
- run `splinter.sql` (or an equivalent union query)
- render each row as an “advisor finding”
- allow users to dismiss/suppress findings via `cache_key`

## When are lints run (triggering) and where do results live?

Splinter itself does **not** define scheduling, triggers, or persistence. It only defines the SQL that produces findings.

In Supabase’s product, there are typically two different “execution contexts” to be aware of:

### 1) Interactive / on-demand (e.g. Supabase Studio)
In Supabase Studio (in the `supabase/supabase` repo), the current “lints/advisors” flow is implemented as an **on-demand SQL query** executed against the project’s Postgres database when the UI needs it (for example, when you open the Advisors page or click a re-run action).

Concretely, Studio sends the lint SQL to a platform endpoint that proxies to `pg-meta`, which executes the query against the project database (see `executeSql` in Studio, which posts to `/platform/pg-meta/{ref}/query` in `supabase/supabase`).
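
Conceptually, the call looks something like the sketch below; only the endpoint path is taken from the description above, and the request/response shapes and auth header are assumptions:

```ts
// Sketch only — the endpoint path comes from the description above; body/response shapes and auth are assumptions.
async function runLintsOnDemand(projectRef: string, splinterSql: string, accessToken: string) {
  const response = await fetch(`/platform/pg-meta/${projectRef}/query`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${accessToken}`, // assumption: platform auth token
    },
    body: JSON.stringify({ query: splinterSql }),
  });

  if (!response.ok) {
    throw new Error(`lint query failed: ${response.status}`);
  }
  // Each returned row is one advisor finding (see the lint contract above).
  return (await response.json()) as Array<Record<string, unknown>>;
}
```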

This is why Splinter focuses on:
- deterministic SQL
- predictable output shape
- stable `cache_key` values (so consumers can de-dupe and suppress repeated findings)

### 2) Scheduled / aggregated (analytics, fleet-level views)
Separately from the UI, Supabase may run similar checks on a schedule and/or export results for analytics. For example, you mentioned querying a BigQuery table named `supabase-etl-prod-eu.dbt.project_lints`.

That table name is a good example of historical terminology: the underlying concept is “advisor findings/issues”, but earlier naming often used “lints”. This distinction is worth calling out explicitly because Supabase’s long-term direction expands Advisors beyond SQL lints.

Because this repo is OSS and Splinter-only, the precise scheduling/ETL details live outside this repo (Studio/backend/data pipelines). A reasonable mental model is:
- Splinter provides the *finding generator* (SQL)
- Supabase systems decide *when to run it* and *whether/how to store it* (UI state, issue lifecycle, analytics exports)

## A concrete limitation example (docs vs SQL)

This repo contains documentation for check `0012_auth_allow_anonymous_sign_ins` (`docs/0012_auth_allow_anonymous_sign_ins.md`), but there is no corresponding SQL lint in `lints/`.

That’s a useful reminder: some “advisor” checks depend on **non-Postgres state** (e.g., auth configuration) or **time-windowed telemetry** (logs/traces/metrics), which can’t be expressed as a static Postgres catalog query.

## Why this matters for “multi-source Advisors”

If you want Advisors that can come from logs/traces/metrics/UI errors/etc., you need something beyond “run a SQL union query against Postgres”.

Splinter’s lint rows are a good *finding format* for DB checks, but a full Advisor system also needs:
- source-specific evaluation (ClickHouse, PromQL, etc.)
- scheduling and/or event triggers
- a stable issue model + dedupe
- lifecycle storage (open/resolved/dismissed/snoozed)
- routing/notification integrations


