Merged
47 changes: 45 additions & 2 deletions agents/cross-model-reviewer.md
The `flowctl_review` MCP tool exposes cross-model review:
## Review Types

### ReviewFinding
Individual issue with severity, category, dimension, description, and optional file/line.

**Severity classification** (aligned with quality-auditor dimensions):

| Severity | Meaning | Ship impact |
|----------|---------|-------------|
| `Critical` | Correctness failure, security vulnerability, data loss risk | Blocks ship |
| `Important` | Architecture violation, readability problem, missing test coverage | Must fix before/shortly after ship |
| `Suggestion` | Performance improvement, naming nit, minor simplification | Optional |

**Dimension tags** — each finding maps to one of five review dimensions:
- **Correctness** — edge cases, race conditions, state inconsistencies, off-by-one
- **Readability** — naming consistency, control flow clarity, module organization
- **Architecture** — pattern adherence, module boundaries, abstraction levels, dependency direction
- **Security** — injection, auth, data exposure, dependencies
- **Performance** — N+1, unbounded loops, blocking operations

### Structured Output Format

Each model's review MUST produce findings in this structure:

```json
{
  "verdict": "SHIP | NEEDS_WORK | ABSTAIN",
  "confidence": 0.0-1.0,
  "findings": [
    {
      "severity": "critical | warning | info",
      "category": "Correctness | Readability | Architecture | Security | Performance",
      "file": "path/to/file.rs",
      "line": 42,
      "description": "What is wrong",
      "suggestion": "How to fix it",
      "confidence": 0.0-1.0,
      "evidence": ["grep output", "test failure"]
    }
  ],
  "positives": ["At least one positive observation"]
}
```

**Parser compatibility:** severity uses `critical/warning/info` (matching flowctl's ReviewFinding parser). The dimension maps to the `category` field. Findings with `confidence < 0.6` are suppressed unless severity is `critical`.
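The suppression rule can be sketched as a simple predicate; the types below are illustrative only, not flowctl's actual parser definitions:

```rust
// Illustrative only: flowctl's real ReviewFinding type may differ.
struct Finding {
    severity: String, // "critical" | "warning" | "info"
    confidence: f64,
}

// A finding below 0.6 confidence is suppressed unless it is critical.
fn is_reported(f: &Finding) -> bool {
    f.severity == "critical" || f.confidence >= 0.6
}
```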

The consensus algorithm uses severity to weight disagreements: a `Critical` finding from any model blocks SHIP regardless of the other model's verdict. `Suggestion`-only findings do not block.
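The blocking rule above can be sketched as follows; the enum and function are a hedged illustration, not the actual consensus implementation:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Ship, NeedsWork, Abstain }

// Any critical finding blocks SHIP outright; otherwise any model voting
// NEEDS_WORK blocks. Abstaining models and suggestion-only reviews do not.
fn consensus(verdicts: &[Verdict], any_critical: bool) -> Verdict {
    if any_critical || verdicts.iter().any(|v| *v == Verdict::NeedsWork) {
        Verdict::NeedsWork
    } else {
        Verdict::Ship
    }
}
```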

### ReviewVerdict
- **SHIP**: Code is ready
- **NEEDS_WORK**: Code needs fixes (at least one Critical or multiple Important findings)
- **ABSTAIN**: Model cannot determine (excluded from consensus)

### ConsensusResult
89 changes: 54 additions & 35 deletions agents/quality-auditor.md
You're invoked after implementation, before shipping. Review the changes and flag issues.

## Audit Strategy

### Step 0. Get the Diff + Quick Scan
```bash
# What changed?
git diff main --stat
git diff main --name-only
git diff main
```

Scan the diff for obvious issues before deep review:
- **Secrets**: API keys, passwords, tokens in code
- **Debug code**: console.log, debugger, TODO/FIXME
- **Commented code**: Dead code that should be deleted
- **Large files**: Accidentally committed binaries, logs

### Five-Dimension Review

Evaluate every change across all five dimensions. Each dimension produces findings tagged with severity (see Output Format).

### 1. Correctness
- Does the code match the stated intent?
- Edge cases: off-by-one errors, wrong operators, inverted conditions
- Race conditions and state inconsistencies in concurrent code
- Error paths: do they actually handle errors or silently swallow?
- Async correctness: are promises/async properly awaited?
- Are new code paths tested? Do tests assert behavior (not just run)?
- Are error paths and edge cases from gap analysis covered?
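As a concrete illustration of the error-path check, compare a function that silently swallows a parse failure with one that surfaces it (a hypothetical example, not project code):

```rust
// Swallows failure: bad input silently becomes 0 — the kind of code
// this review dimension should flag.
fn parse_port_bad(s: &str) -> u16 {
    s.parse().unwrap_or(0)
}

// Surfaces failure: the caller must decide what a bad port means.
fn parse_port_good(s: &str) -> Result<u16, std::num::ParseIntError> {
    s.parse()
}
```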

### 2. Readability
- Naming consistency: do new names follow existing conventions?
- Control flow clarity: can you follow the logic without mental gymnastics?
- Module organization: are things in the right files/directories?
- Could this be simpler? Is there duplicated code that should be extracted?
- Are there unnecessary abstractions or over-engineering for hypothetical futures?

### 3. Architecture
- Pattern adherence: does the change follow established project patterns?
- Module boundaries: are concerns properly separated?
- Abstraction levels: is the right level of abstraction used?
- Dependency direction: do dependencies flow in the correct direction?

### 4. Security
- **Injection**: SQL, XSS, command injection vectors
- **Auth/AuthZ**: Are permissions checked? Can they be bypassed?
- **Data exposure**: Is sensitive data logged, leaked, or over-exposed?
- **Dependencies**: Any known vulnerable packages added?

### 5. Performance
- N+1 queries or O(n²) loops
- Unbounded data fetching, missing pagination/limits
- Blocking operations on hot paths
- Resource leaks (unclosed handles, missing cleanup)

## Output Format

Every finding MUST carry a severity prefix. Use exactly these four levels:

- **`Critical:`** — Blocks ship. Could cause outage, data loss, security breach, or correctness failure.
- **`Important:`** — Must fix before or shortly after ship. Significant quality, readability, or architecture issue.
- **`Nit:`** — Optional improvement. Style, naming, minor simplification.
- **`FYI`** — Informational. Context for the author, no action required.

Every review MUST include a "What's Good" section with at least one positive observation. Acknowledge patterns followed, good design decisions, thorough error handling, or clean naming.

```markdown
## Quality Audit: [Branch/Feature]

### Summary
- Files changed: N
- Dimensions reviewed: Correctness, Readability, Architecture, Security, Performance
- Risk level: Low / Medium / High
- Ship recommendation: ✅ Ship / ⚠️ Fix first / ❌ Major rework

### Critical (blocks ship)
- **Critical:** [File:line] — [Issue] (Dimension: Correctness/Security/etc.)
- Risk: [What could go wrong]
- Fix: [Specific suggestion]

### Important (must fix)
- **Important:** [File:line] — [Issue] (Dimension: ...)
- Fix: [Brief fix suggestion]

### Nit (optional)
- **Nit:** [File:line] — [Improvement suggestion]

### FYI
- **FYI** [Informational observation]

### Test Gaps
- [ ] [Untested scenario]

### Security Notes
- [Any security observations]

### What's Good
- [At least one positive observation — patterns followed, good decisions, clean code]
- [Additional positives as warranted]
```

## Rules
21 changes: 21 additions & 0 deletions agents/worker.md
The main conversation will resolve the conflict and re-dispatch you (or update the task).
- Required API endpoint already exists with different signature
<!-- /section:core -->

<!-- section:core -->
## Phase 2.3: Plan Alignment Check

Quick sanity check — did implementation stay within plan scope?

1. Re-read epic spec: `<FLOWCTL> cat <EPIC_ID>`
2. Compare implementation scope against epic's scope section:
- Files changed match expected files in task spec?
- No features added beyond what spec described?
- No architectural changes not covered in the plan?
3. If drift detected:
- **Minor** (extra helper, renamed file): note in evidence as `"plan_drift": "<description>"`
- **Major** (new feature, changed architecture, different approach): send protocol message:
```
Spec conflict: <TASK_ID> — implementation diverged from plan.
Drift: <description of what changed and why>
```

**This is a 30-second check, not a full re-review.** Read the spec, glance at your diff, note any drift. Then proceed to Phase 2.5.
<!-- /section:core -->

<!-- section:core -->
## Phase 2.5: Verify & Fix

152 changes: 152 additions & 0 deletions docs/references/performance-checklist.md
# Performance Checklist

Quick reference for performance in flow-code. Rust backend with libSQL, async runtime.

## Measurement Commands

```bash
# Rust benchmarks (add benches/ directory with criterion)
cargo bench

# Criterion (statistical benchmarking)
cargo bench --bench my_benchmark

# Flamegraph (CPU profiling)
cargo install flamegraph
cargo flamegraph --bin flowctl -- status --json

# Memory profiling (Linux)
valgrind --tool=massif target/release/flowctl status --json

# Binary size
ls -lh target/release/flowctl
cargo bloat --release --crates

# Compile time
cargo build --release --timings
```

| Tool | What It Measures | When to Use |
|---|---|---|
| `cargo bench` (criterion) | Function-level latency | Before/after optimization |
| `cargo flamegraph` | CPU hotspots | Investigating slow commands |
| `cargo bloat` | Binary size by crate | When binary grows unexpectedly |
| `valgrind --tool=massif` | Heap allocations | Investigating memory usage |
| `hyperfine` | CLI command latency | Comparing command performance |

## Database Anti-Patterns (libSQL)

| Anti-Pattern | Impact | Fix |
|---|---|---|
| N+1 queries | Linear DB round-trips | Use JOINs or batch queries |
| Unbounded queries | Memory exhaustion, timeouts | Always use LIMIT, paginate |
| Missing indexes | Slow reads as data grows | Add indexes on filtered/sorted columns |
| No connection pooling | Connection overhead per query | Reuse connections |
| SELECT * | Fetches unused columns | Select only needed columns |
| String concatenation in SQL | Injection risk + no query cache | Use parameterized queries |
| No transactions for batch ops | Partial failures, inconsistency | Wrap in BEGIN/COMMIT |
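For example, the N+1 fix often reduces to building one parameterized `IN` query instead of querying inside a loop. The SQL, table name, and `?` placeholder style below are illustrative and should be adapted to your driver:

```rust
// Sketch: collapse an N+1 loop into a single batched, parameterized query.
// Bind `ids` as parameters when executing; never interpolate values into SQL.
fn batch_query(ids: &[i64]) -> String {
    let placeholders = vec!["?"; ids.len()].join(", ");
    format!("SELECT id, title FROM tasks WHERE id IN ({placeholders})")
}
```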

## Backend CLI Checklist

| Area | Target | How to Verify |
|---|---|---|
| Command startup | < 50ms | `hyperfine 'flowctl status --json'` |
| Task list (100 tasks) | < 200ms | Benchmark with test data |
| Database open | < 20ms | Profile with `tracing` spans |
| JSON serialization | < 5ms for typical output | `cargo bench` |
| Binary size | < 20MB release | `ls -lh target/release/flowctl` |
| Memory usage (idle) | < 10MB | `valgrind` or `/usr/bin/time -v` |

## Rust-Specific Performance

### Async/Await Patterns

| Do | Don't |
|---|---|
| Use `tokio::spawn` for concurrent I/O | Block the async runtime with sync I/O |
| Use `tokio::task::spawn_blocking` for CPU work | Run CPU-heavy code in async context |
| Use `tokio::join!` for parallel awaits | Await sequentially when independent |
| Use bounded channels for backpressure | Use unbounded channels (memory leak risk) |

```rust
// Good: parallel async operations
let (tasks, epics) = tokio::join!(
    db.list_tasks(),
    db.list_epics()
);

// Bad: sequential when independent
let tasks = db.list_tasks().await?;
let epics = db.list_epics().await?; // Waits unnecessarily
```

### Memory Patterns

| Do | Don't |
|---|---|
| Use `&str` over `String` where possible | Clone strings unnecessarily |
| Use `Cow<str>` for maybe-owned data | Allocate when borrowing suffices |
| Pre-allocate with `Vec::with_capacity` | Push to vec without size hint |
| Use iterators over collecting into vecs | `collect()` when you can chain |
| Use `Arc` for shared read-only data | Clone large structs across threads |

```rust
// Good: pre-allocate
let mut results = Vec::with_capacity(tasks.len());

// Bad: repeated reallocation
let mut results = Vec::new(); // Grows 1, 2, 4, 8...
```
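The `Cow<str>` row can be sketched like this: the function allocates only when it actually has to change the input (a hypothetical helper, not project code):

```rust
use std::borrow::Cow;

// Returns a borrow when the input needs no change, an owned String otherwise.
fn normalize(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_")) // allocates only when needed
    } else {
        Cow::Borrowed(input)                // zero-allocation fast path
    }
}
```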

### String Performance

| Do | Don't |
|---|---|
| Use `write!` or `format!` once | Repeated `push_str` / `+` concatenation |
| Use `String::with_capacity` for building | Build strings without size hint |
| Return `impl Display` for lazy formatting | Format eagerly when output may not be used |
| Use `serde_json::to_writer` for streaming | Serialize to String then write |
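A sketch of the build-once pattern from the table above: pre-size the buffer and `write!` into it rather than repeatedly concatenating (names are illustrative):

```rust
use std::fmt::Write;

// Builds one String with a rough size hint instead of repeated `+`.
fn render(ids: &[u32]) -> String {
    let mut out = String::with_capacity(ids.len() * 8); // rough per-item estimate
    for id in ids {
        // `write!` into a String cannot fail, so unwrap is safe here.
        write!(out, "task-{id};").unwrap();
    }
    out
}
```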

## Tokio Runtime

| Setting | Recommendation |
|---|---|
| Runtime flavor | `current_thread` for CLI (lower overhead) |
| Worker threads | Default for server, 1 for CLI tools |
| Blocking threads | Default (512), reduce for CLI |
| Task budget | Let tokio manage; avoid manual yielding |

```rust
// CLI tool: single-threaded runtime (faster startup)
#[tokio::main(flavor = "current_thread")]
async fn main() { ... }

// Server: multi-threaded (default)
#[tokio::main]
async fn main() { ... }
```

## Performance Anti-Patterns

| Anti-Pattern | Impact | Fix |
|---|---|---|
| Blocking async runtime | Starves other tasks | Use `spawn_blocking` |
| N+1 database queries | Linear latency growth | Batch or JOIN queries |
| Cloning large structs | Memory + CPU waste | Use references or `Arc` |
| Unbounded collections | Memory exhaustion | Add limits, use streaming |
| Sync file I/O in async | Blocks executor thread | Use `tokio::fs` |
| Excessive logging in hot path | I/O overhead | Use log levels, sample |
| Compiling in debug mode for perf tests | 10-100x slower | Always `--release` |
| Not reusing DB connections | Connection overhead | Use connection pool |
| Serializing then parsing JSON | Double work | Pass typed structs |
| Formatting strings nobody reads | CPU waste | Lazy formatting with `Display` |

## Profiling Workflow

| Step | Command | Purpose |
|---|---|---|
| 1. Baseline | `hyperfine 'flowctl status --json'` | Measure current perf |
| 2. Profile | `cargo flamegraph --bin flowctl -- status` | Find hotspots |
| 3. Optimize | Edit code targeting hotspot | Reduce cost |
| 4. Verify | Re-run baseline command | Confirm improvement |
| 5. Benchmark | `cargo bench` | Prevent regression |