Merged
47 changes: 45 additions & 2 deletions agents/cross-model-reviewer.md
The `flowctl_review` MCP tool exposes cross-model review:
## Review Types

### ReviewFinding
Individual issue with severity, category, dimension, description, and optional file/line.

**Severity classification** (aligned with quality-auditor dimensions):

| Severity | Meaning | Ship impact |
|----------|---------|-------------|
| `Critical` | Correctness failure, security vulnerability, data loss risk | Blocks ship |
| `Important` | Architecture violation, readability problem, missing test coverage | Must fix before/shortly after ship |
| `Suggestion` | Performance improvement, naming nit, minor simplification | Optional |

**Dimension tags** — each finding maps to one of five review dimensions:
- **Correctness** — edge cases, race conditions, state inconsistencies, off-by-one
- **Readability** — naming consistency, control flow clarity, module organization
- **Architecture** — pattern adherence, module boundaries, abstraction levels, dependency direction
- **Security** — injection, auth, data exposure, dependencies
- **Performance** — N+1, unbounded loops, blocking operations

### Structured Output Format

Each model's review MUST produce findings in this structure:

```json
{
  "verdict": "SHIP | NEEDS_WORK | ABSTAIN",
  "confidence": 0.0-1.0,
  "findings": [
    {
      "severity": "critical | warning | info",
      "category": "Correctness | Readability | Architecture | Security | Performance",
      "file": "path/to/file.rs",
      "line": 42,
      "description": "What is wrong",
      "suggestion": "How to fix it",
      "confidence": 0.0-1.0,
      "evidence": ["grep output", "test failure"]
    }
  ],
  "positives": ["At least one positive observation"]
}
```

**Parser compatibility:** severity uses `critical/warning/info` (matching flowctl's ReviewFinding parser). The dimension maps to the `category` field. Findings with `confidence < 0.6` are suppressed unless severity is `critical`.
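The suppression rule can be sketched as a simple predicate; the types below are illustrative only, not flowctl's actual parser definitions:

```rust
// Illustrative only: flowctl's real ReviewFinding type may differ.
struct Finding {
    severity: String, // "critical" | "warning" | "info"
    confidence: f64,
}

// A finding below 0.6 confidence is suppressed unless it is critical.
fn is_reported(f: &Finding) -> bool {
    f.severity == "critical" || f.confidence >= 0.6
}
```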

The consensus algorithm uses severity to weight disagreements: a `Critical` finding from any model blocks SHIP regardless of the other model's verdict. `Suggestion`-only findings do not block.
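The blocking rule above can be sketched as follows; the enum and function are a hedged illustration, not the actual consensus implementation:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Ship, NeedsWork, Abstain }

// Any critical finding blocks SHIP outright; otherwise any model voting
// NEEDS_WORK blocks. Abstaining models and suggestion-only reviews do not.
fn consensus(verdicts: &[Verdict], any_critical: bool) -> Verdict {
    if any_critical || verdicts.iter().any(|v| *v == Verdict::NeedsWork) {
        Verdict::NeedsWork
    } else {
        Verdict::Ship
    }
}
```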

### ReviewVerdict
- **SHIP**: Code is ready
- **NEEDS_WORK**: Code needs fixes (at least one Critical or multiple Important findings)
- **ABSTAIN**: Model cannot determine (excluded from consensus)

### ConsensusResult
89 changes: 54 additions & 35 deletions agents/quality-auditor.md
You're invoked after implementation, before shipping. Review the changes and flag issues.

## Audit Strategy

### Step 0. Get the Diff + Quick Scan
```bash
# What changed?
git diff main --stat
git diff main --name-only
git diff main
```

Scan the diff for obvious issues before deep review:
- **Secrets**: API keys, passwords, tokens in code
- **Debug code**: console.log, debugger, TODO/FIXME
- **Commented code**: Dead code that should be deleted
- **Large files**: Accidentally committed binaries, logs

### Five-Dimension Review

Evaluate every change across all five dimensions. Each dimension produces findings tagged with severity (see Output Format).

### 1. Correctness
- Does the code match the stated intent?
- Edge cases: off-by-one errors, wrong operators, inverted conditions
- Race conditions and state inconsistencies in concurrent code
- Error paths: do they actually handle errors or silently swallow?
- Async correctness: are promises/async properly awaited?
- Are new code paths tested? Do tests assert behavior (not just run)?
- Are error paths and edge cases from gap analysis covered?
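As a concrete illustration of the error-path check, compare a function that silently swallows a parse failure with one that surfaces it (a hypothetical example, not project code):

```rust
// Swallows failure: bad input silently becomes 0 — the kind of code
// this review dimension should flag.
fn parse_port_bad(s: &str) -> u16 {
    s.parse().unwrap_or(0)
}

// Surfaces failure: the caller must decide what a bad port means.
fn parse_port_good(s: &str) -> Result<u16, std::num::ParseIntError> {
    s.parse()
}
```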

### 2. Readability
- Naming consistency: do new names follow existing conventions?
- Control flow clarity: can you follow the logic without mental gymnastics?
- Module organization: are things in the right files/directories?
- Could this be simpler? Is there duplicated code that should be extracted?
- Are there unnecessary abstractions or over-engineering for hypothetical futures?

### 3. Architecture
- Pattern adherence: does the change follow established project patterns?
- Module boundaries: are concerns properly separated?
- Abstraction levels: is the right level of abstraction used?
- Dependency direction: do dependencies flow in the correct direction?

### 4. Security
- **Injection**: SQL, XSS, command injection vectors
- **Auth/AuthZ**: Are permissions checked? Can they be bypassed?
- **Data exposure**: Is sensitive data logged, leaked, or over-exposed?
- **Dependencies**: Any known vulnerable packages added?

### 5. Performance
- N+1 queries or O(n²) loops
- Unbounded data fetching, missing pagination/limits
- Blocking operations on hot paths
- Resource leaks (unclosed handles, missing cleanup)

## Output Format

Every finding MUST carry a severity prefix. Use exactly these four levels:

- **`Critical:`** — Blocks ship. Could cause outage, data loss, security breach, or correctness failure.
- **`Important:`** — Must fix before or shortly after ship. Significant quality, readability, or architecture issue.
- **`Nit:`** — Optional improvement. Style, naming, minor simplification.
- **`FYI`** — Informational. Context for the author, no action required.

Every review MUST include a "What's Good" section with at least one positive observation. Acknowledge patterns followed, good design decisions, thorough error handling, or clean naming.

```markdown
## Quality Audit: [Branch/Feature]

### Summary
- Files changed: N
- Dimensions reviewed: Correctness, Readability, Architecture, Security, Performance
- Risk level: Low / Medium / High
- Ship recommendation: ✅ Ship / ⚠️ Fix first / ❌ Major rework

### Critical (blocks ship)
- **Critical:** [File:line] — [Issue] (Dimension: Correctness/Security/etc.)
- Risk: [What could go wrong]
- Fix: [Specific suggestion]

### Important (must fix)
- **Important:** [File:line] — [Issue] (Dimension: ...)
- Fix: [Brief fix suggestion]

### Nit (optional)
- **Nit:** [File:line] — [Improvement suggestion]

### FYI
- **FYI** [Informational observation]

### Test Gaps
- [ ] [Untested scenario]

### Security Notes
- [Any security observations]

### What's Good
- [At least one positive observation — patterns followed, good decisions, clean code]
- [Additional positives as warranted]
```

## Rules
21 changes: 21 additions & 0 deletions agents/worker.md
The main conversation will resolve the conflict and re-dispatch you (or update the task).
- Required API endpoint already exists with different signature
<!-- /section:core -->

<!-- section:core -->
## Phase 2.3: Plan Alignment Check

Quick sanity check — did implementation stay within plan scope?

1. Re-read epic spec: `<FLOWCTL> cat <EPIC_ID>`
2. Compare implementation scope against epic's scope section:
- Files changed match expected files in task spec?
- No features added beyond what spec described?
- No architectural changes not covered in the plan?
3. If drift detected:
- **Minor** (extra helper, renamed file): note in evidence as `"plan_drift": "<description>"`
- **Major** (new feature, changed architecture, different approach): send protocol message:
```
Spec conflict: <TASK_ID> — implementation diverged from plan.
Drift: <description of what changed and why>
```

**This is a 30-second check, not a full re-review.** Read the spec, glance at your diff, note any drift. Then proceed to Phase 2.5.
<!-- /section:core -->

<!-- section:core -->
## Phase 2.5: Verify & Fix

152 changes: 152 additions & 0 deletions docs/references/performance-checklist.md
# Performance Checklist

Quick reference for performance in flow-code. Rust backend with libSQL, async runtime.

## Measurement Commands

```bash
# Rust benchmarks (add benches/ directory with criterion)
cargo bench

# Criterion (statistical benchmarking)
cargo bench --bench my_benchmark

# Flamegraph (CPU profiling)
cargo install flamegraph
cargo flamegraph --bin flowctl -- status --json

# Memory profiling (Linux)
valgrind --tool=massif target/release/flowctl status --json

# Binary size
ls -lh target/release/flowctl
cargo bloat --release --crates

# Compile time
cargo build --release --timings
```

| Tool | What It Measures | When to Use |
|---|---|---|
| `cargo bench` (criterion) | Function-level latency | Before/after optimization |
| `cargo flamegraph` | CPU hotspots | Investigating slow commands |
| `cargo bloat` | Binary size by crate | When binary grows unexpectedly |
| `valgrind --tool=massif` | Heap allocations | Investigating memory usage |
| `hyperfine` | CLI command latency | Comparing command performance |

## Database Anti-Patterns (libSQL)

| Anti-Pattern | Impact | Fix |
|---|---|---|
| N+1 queries | Linear DB round-trips | Use JOINs or batch queries |
| Unbounded queries | Memory exhaustion, timeouts | Always use LIMIT, paginate |
| Missing indexes | Slow reads as data grows | Add indexes on filtered/sorted columns |
| No connection pooling | Connection overhead per query | Reuse connections |
| SELECT * | Fetches unused columns | Select only needed columns |
| String concatenation in SQL | Injection risk + no query cache | Use parameterized queries |
| No transactions for batch ops | Partial failures, inconsistency | Wrap in BEGIN/COMMIT |
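For example, the N+1 fix often reduces to building one parameterized `IN` query instead of querying inside a loop. The SQL, table name, and `?` placeholder style below are illustrative and should be adapted to your driver:

```rust
// Sketch: collapse an N+1 loop into a single batched, parameterized query.
// Bind `ids` as parameters when executing; never interpolate values into SQL.
fn batch_query(ids: &[i64]) -> String {
    let placeholders = vec!["?"; ids.len()].join(", ");
    format!("SELECT id, title FROM tasks WHERE id IN ({placeholders})")
}
```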

## Backend CLI Checklist

| Area | Target | How to Verify |
|---|---|---|
| Command startup | < 50ms | `hyperfine 'flowctl status --json'` |
| Task list (100 tasks) | < 200ms | Benchmark with test data |
| Database open | < 20ms | Profile with `tracing` spans |
| JSON serialization | < 5ms for typical output | `cargo bench` |
| Binary size | < 20MB release | `ls -lh target/release/flowctl` |
| Memory usage (idle) | < 10MB | `valgrind` or `/usr/bin/time -v` |

## Rust-Specific Performance

### Async/Await Patterns

| Do | Don't |
|---|---|
| Use `tokio::spawn` for concurrent I/O | Block the async runtime with sync I/O |
| Use `tokio::task::spawn_blocking` for CPU work | Run CPU-heavy code in async context |
| Use `tokio::join!` for parallel awaits | Await sequentially when independent |
| Use bounded channels for backpressure | Use unbounded channels (memory leak risk) |

```rust
// Good: parallel async operations
let (tasks, epics) = tokio::join!(
    db.list_tasks(),
    db.list_epics()
);

// Bad: sequential when independent
let tasks = db.list_tasks().await?;
let epics = db.list_epics().await?; // Waits unnecessarily
```

### Memory Patterns

| Do | Don't |
|---|---|
| Use `&str` over `String` where possible | Clone strings unnecessarily |
| Use `Cow<str>` for maybe-owned data | Allocate when borrowing suffices |
| Pre-allocate with `Vec::with_capacity` | Push to vec without size hint |
| Use iterators over collecting into vecs | `collect()` when you can chain |
| Use `Arc` for shared read-only data | Clone large structs across threads |

```rust
// Good: pre-allocate
let mut results = Vec::with_capacity(tasks.len());

// Bad: repeated reallocation
let mut results = Vec::new(); // Grows 1, 2, 4, 8...
```
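The `Cow<str>` row can be sketched like this: the function allocates only when it actually has to change the input (a hypothetical helper, not project code):

```rust
use std::borrow::Cow;

// Returns a borrow when the input needs no change, an owned String otherwise.
fn normalize(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_")) // allocates only when needed
    } else {
        Cow::Borrowed(input)                // zero-allocation fast path
    }
}
```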

### String Performance

| Do | Don't |
|---|---|
| Use `write!` or `format!` once | Repeated `push_str` / `+` concatenation |
| Use `String::with_capacity` for building | Build strings without size hint |
| Return `impl Display` for lazy formatting | Format eagerly when output may not be used |
| Use `serde_json::to_writer` for streaming | Serialize to String then write |
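A sketch of the build-once pattern from the table above: pre-size the buffer and `write!` into it rather than repeatedly concatenating (names are illustrative):

```rust
use std::fmt::Write;

// Builds one String with a rough size hint instead of repeated `+`.
fn render(ids: &[u32]) -> String {
    let mut out = String::with_capacity(ids.len() * 8); // rough per-item estimate
    for id in ids {
        // `write!` into a String cannot fail, so unwrap is safe here.
        write!(out, "task-{id};").unwrap();
    }
    out
}
```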

## Tokio Runtime

| Setting | Recommendation |
|---|---|
| Runtime flavor | `current_thread` for CLI (lower overhead) |
| Worker threads | Default for server, 1 for CLI tools |
| Blocking threads | Default (512), reduce for CLI |
| Task budget | Let tokio manage; avoid manual yielding |

```rust
// CLI tool: single-threaded runtime (faster startup)
#[tokio::main(flavor = "current_thread")]
async fn main() { ... }

// Server: multi-threaded (default)
#[tokio::main]
async fn main() { ... }
```

## Performance Anti-Patterns

| Anti-Pattern | Impact | Fix |
|---|---|---|
| Blocking async runtime | Starves other tasks | Use `spawn_blocking` |
| N+1 database queries | Linear latency growth | Batch or JOIN queries |
| Cloning large structs | Memory + CPU waste | Use references or `Arc` |
| Unbounded collections | Memory exhaustion | Add limits, use streaming |
| Sync file I/O in async | Blocks executor thread | Use `tokio::fs` |
| Excessive logging in hot path | I/O overhead | Use log levels, sample |
| Compiling in debug mode for perf tests | 10-100x slower | Always `--release` |
| Not reusing DB connections | Connection overhead | Use connection pool |
| Serializing then parsing JSON | Double work | Pass typed structs |
| Formatting strings nobody reads | CPU waste | Lazy formatting with `Display` |

## Profiling Workflow

| Step | Command | Purpose |
|---|---|---|
| 1. Baseline | `hyperfine 'flowctl status --json'` | Measure current perf |
| 2. Profile | `cargo flamegraph --bin flowctl -- status` | Find hotspots |
| 3. Optimize | Edit code targeting hotspot | Reduce cost |
| 4. Verify | Re-run baseline command | Confirm improvement |
| 5. Benchmark | `cargo bench` | Prevent regression |