diff --git a/agents/cross-model-reviewer.md b/agents/cross-model-reviewer.md index f58e4cdb..197b81e4 100644 --- a/agents/cross-model-reviewer.md +++ b/agents/cross-model-reviewer.md @@ -63,11 +63,54 @@ The `flowctl_review` MCP tool exposes cross-model review: ## Review Types ### ReviewFinding -Individual issue with severity (critical/warning/info), category, description, and optional file/line. +Individual issue with severity, category, dimension, description, and optional file/line. + +**Severity classification** (aligned with quality-auditor dimensions): + +| Severity | Meaning | Ship impact | +|----------|---------|-------------| +| `Critical` | Correctness failure, security vulnerability, data loss risk | Blocks ship | +| `Important` | Architecture violation, readability problem, missing test coverage | Must fix before/shortly after ship | +| `Suggestion` | Performance improvement, naming nit, minor simplification | Optional | + +**Dimension tags** — each finding maps to one of five review dimensions: +- **Correctness** — edge cases, race conditions, state inconsistencies, off-by-one +- **Readability** — naming consistency, control flow clarity, module organization +- **Architecture** — pattern adherence, module boundaries, abstraction levels, dependency direction +- **Security** — injection, auth, data exposure, dependencies +- **Performance** — N+1, unbounded loops, blocking operations + +### Structured Output Format + +Each model's review MUST produce findings in this structure: + +```json +{ + "verdict": "SHIP | NEEDS_WORK | ABSTAIN", + "confidence": 0.0-1.0, + "findings": [ + { + "severity": "critical | warning | info", + "category": "Correctness | Readability | Architecture | Security | Performance", + "file": "path/to/file.rs", + "line": 42, + "description": "What is wrong", + "suggestion": "How to fix it", + "confidence": 0.0-1.0, + "evidence": ["grep output", "test failure"] + } + ], + "positives": ["At least one positive observation"] +} +``` + 
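For concreteness, a single-finding review serialized in this structure might look as follows. The file path, line number, and wording are illustrative only, not output from a real run:

```json
{
  "verdict": "NEEDS_WORK",
  "confidence": 0.85,
  "findings": [
    {
      "severity": "warning",
      "category": "Readability",
      "file": "crates/flowctl-core/src/task.rs",
      "line": 118,
      "description": "Status transition logic duplicated across create and update paths",
      "suggestion": "Extract a shared transition helper",
      "confidence": 0.7,
      "evidence": ["grep output showing both copies"]
    }
  ],
  "positives": ["Error paths return structured errors consistently"]
}
```

Here `warning` plays the role the severity table above calls `Important`: it must be fixed but does not by itself block SHIP.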
+**Parser compatibility:** severity uses `critical/warning/info` (matching flowctl's ReviewFinding parser). The dimension maps to the `category` field. Findings with `confidence < 0.6` are suppressed unless severity is `critical`. + +The consensus algorithm uses severity to weight disagreements: a `Critical` finding from any model blocks SHIP regardless of the other model's verdict. `Suggestion`-only findings do not block. ### ReviewVerdict - **SHIP**: Code is ready -- **NEEDS_WORK**: Code needs fixes +- **NEEDS_WORK**: Code needs fixes (at least one Critical or multiple Important findings) - **ABSTAIN**: Model cannot determine (excluded from consensus) ### ConsensusResult diff --git a/agents/quality-auditor.md b/agents/quality-auditor.md index ebd3eaba..3e21b01c 100644 --- a/agents/quality-auditor.md +++ b/agents/quality-auditor.md @@ -17,7 +17,7 @@ You're invoked after implementation, before shipping. Review the changes and fla ## Audit Strategy -### 1. Get the Diff +### Step 0. Get the Diff + Quick Scan ```bash # What changed? git diff main --stat @@ -27,72 +27,91 @@ git diff main --name-only git diff main ``` -### 2. Quick Scan (find obvious issues fast) +Scan the diff for obvious issues before deep review: - **Secrets**: API keys, passwords, tokens in code - **Debug code**: console.log, debugger, TODO/FIXME - **Commented code**: Dead code that should be deleted - **Large files**: Accidentally committed binaries, logs -### 3. Correctness Review -- Does the code match the stated intent? -- Are there off-by-one errors, wrong operators, inverted conditions? -- Do error paths actually handle errors? -- Are promises/async properly awaited? +### Five-Dimension Review + +Evaluate every change across all five dimensions. Each dimension produces findings tagged with severity (see Output Format). -### 4. Security Scan +### 1. Correctness +- Does the code match the stated intent? 
+- Edge cases: off-by-one errors, wrong operators, inverted conditions +- Race conditions and state inconsistencies in concurrent code +- Error paths: do they actually handle errors or silently swallow? +- Async correctness: are promises/async properly awaited? +- Are new code paths tested? Do tests assert behavior (not just run)? +- Are error paths and edge cases from gap analysis covered? + +### 2. Readability +- Naming consistency: do new names follow existing conventions? +- Control flow clarity: can you follow the logic without mental gymnastics? +- Module organization: are things in the right files/directories? +- Could this be simpler? Is there duplicated code that should be extracted? +- Are there unnecessary abstractions or over-engineering for hypothetical futures? + +### 3. Architecture +- Pattern adherence: does the change follow established project patterns? +- Module boundaries: are concerns properly separated? +- Abstraction levels: is the right level of abstraction used? +- Dependency direction: do dependencies flow in the correct direction? + +### 4. Security - **Injection**: SQL, XSS, command injection vectors - **Auth/AuthZ**: Are permissions checked? Can they be bypassed? - **Data exposure**: Is sensitive data logged, leaked, or over-exposed? - **Dependencies**: Any known vulnerable packages added? -### 5. Simplicity Check -- Could this be simpler? -- Is there duplicated code that should be extracted? -- Are there unnecessary abstractions? -- Over-engineering for hypothetical future needs? - -### 6. Test Coverage -- Are new code paths tested? -- Do tests actually assert behavior (not just run)? -- Are edge cases from gap analysis covered? -- Are error paths tested? - -### 7. Performance Red Flags -- N+1 queries or O(n²) loops -- Unbounded data fetching -- Missing pagination/limits +### 5. 
Performance +- N+1 queries or O(n^2) loops +- Unbounded data fetching, missing pagination/limits - Blocking operations on hot paths +- Resource leaks (unclosed handles, missing cleanup) ## Output Format +Every finding MUST carry a severity prefix. Use exactly these four levels: + +- **`Critical:`** — Blocks ship. Could cause outage, data loss, security breach, or correctness failure. +- **`Important:`** — Must fix before or shortly after ship. Significant quality, readability, or architecture issue. +- **`Nit:`** — Optional improvement. Style, naming, minor simplification. +- **`FYI`** — Informational. Context for the author, no action required. + +Every review MUST include a "What's Good" section with at least one positive observation. Acknowledge patterns followed, good design decisions, thorough error handling, or clean naming. + ```markdown ## Quality Audit: [Branch/Feature] ### Summary - Files changed: N +- Dimensions reviewed: Correctness, Readability, Architecture, Security, Performance - Risk level: Low / Medium / High - Ship recommendation: ✅ Ship / ⚠️ Fix first / ❌ Major rework -### Critical (MUST fix before shipping) -- **[File:line]**: [Issue] +### Critical (blocks ship) +- **Critical:** [File:line] — [Issue] (Dimension: Correctness/Security/etc.) - Risk: [What could go wrong] - Fix: [Specific suggestion] -### Should Fix (High priority) -- **[File:line]**: [Issue] - - [Brief fix suggestion] +### Important (must fix) +- **Important:** [File:line] — [Issue] (Dimension: ...) 
+ - Fix: [Brief fix suggestion] -### Consider (Nice to have) -- [Minor improvement suggestion] +### Nit (optional) +- **Nit:** [File:line] — [Improvement suggestion] + +### FYI +- **FYI** [Informational observation] ### Test Gaps - [ ] [Untested scenario] -### Security Notes -- [Any security observations] - ### What's Good -- [Positive observations - patterns followed, good decisions] +- [At least one positive observation — patterns followed, good decisions, clean code] +- [Additional positives as warranted] ``` ## Rules diff --git a/agents/worker.md b/agents/worker.md index d4006cda..2fccb934 100644 --- a/agents/worker.md +++ b/agents/worker.md @@ -413,6 +413,27 @@ The main conversation will resolve the conflict and re-dispatch you (or update t - Required API endpoint already exists with different signature + +## Phase 2.3: Plan Alignment Check + +Quick sanity check — did implementation stay within plan scope? + +1. Re-read epic spec: `cat <epic-spec-file>` +2. Compare implementation scope against epic's scope section: + - Files changed match expected files in task spec? + - No features added beyond what spec described? + - No architectural changes not covered in the plan? +3. If drift detected: + - **Minor** (extra helper, renamed file): note in evidence as `"plan_drift": "<one-line description>"` + - **Major** (new feature, changed architecture, different approach): send protocol message: + ``` + Spec conflict: <task-id> — implementation diverged from plan. + Drift: <what diverged> + ``` + +**This is a 30-second check, not a full re-review.** Read the spec, glance at your diff, note any drift. Then proceed to Phase 2.5. + + ## Phase 2.5: Verify & Fix diff --git a/docs/references/performance-checklist.md b/docs/references/performance-checklist.md new file mode 100644 index 00000000..936141a5 --- /dev/null +++ b/docs/references/performance-checklist.md @@ -0,0 +1,152 @@ +# Performance Checklist + +Quick reference for performance in flow-code. Rust backend with libSQL, async runtime. 
+ +## Measurement Commands + +```bash +# Rust benchmarks (add benches/ directory with criterion) +cargo bench + +# Criterion (statistical benchmarking) +cargo bench --bench my_benchmark + +# Flamegraph (CPU profiling) +cargo install flamegraph +cargo flamegraph --bin flowctl -- status --json + +# Memory profiling (Linux) +valgrind --tool=massif target/release/flowctl status --json + +# Binary size +ls -lh target/release/flowctl +cargo bloat --release --crates + +# Compile time +cargo build --release --timings +``` + +| Tool | What It Measures | When to Use | +|---|---|---| +| `cargo bench` (criterion) | Function-level latency | Before/after optimization | +| `cargo flamegraph` | CPU hotspots | Investigating slow commands | +| `cargo bloat` | Binary size by crate | When binary grows unexpectedly | +| `valgrind --tool=massif` | Heap allocations | Investigating memory usage | +| `hyperfine` | CLI command latency | Comparing command performance | + +## Database Anti-Patterns (libSQL) + +| Anti-Pattern | Impact | Fix | +|---|---|---| +| N+1 queries | Linear DB round-trips | Use JOINs or batch queries | +| Unbounded queries | Memory exhaustion, timeouts | Always use LIMIT, paginate | +| Missing indexes | Slow reads as data grows | Add indexes on filtered/sorted columns | +| No connection pooling | Connection overhead per query | Reuse connections | +| SELECT * | Fetches unused columns | Select only needed columns | +| String concatenation in SQL | Injection risk + no query cache | Use parameterized queries | +| No transactions for batch ops | Partial failures, inconsistency | Wrap in BEGIN/COMMIT | + +## Backend CLI Checklist + +| Area | Target | How to Verify | +|---|---|---| +| Command startup | < 50ms | `hyperfine 'flowctl status --json'` | +| Task list (100 tasks) | < 200ms | Benchmark with test data | +| Database open | < 20ms | Profile with `tracing` spans | +| JSON serialization | < 5ms for typical output | `cargo bench` | +| Binary size | < 20MB release | `ls -lh 
target/release/flowctl` | +| Memory usage (idle) | < 10MB | `valgrind` or `/usr/bin/time -v` | + +## Rust-Specific Performance + +### Async/Await Patterns + +| Do | Don't | +|---|---| +| Use `tokio::spawn` for concurrent I/O | Block the async runtime with sync I/O | +| Use `tokio::task::spawn_blocking` for CPU work | Run CPU-heavy code in async context | +| Use `tokio::join!` for parallel awaits | Await sequentially when independent | +| Use bounded channels for backpressure | Use unbounded channels (memory leak risk) | + +```rust +// Good: parallel async operations +let (tasks, epics) = tokio::join!( + db.list_tasks(), + db.list_epics() +); + +// Bad: sequential when independent +let tasks = db.list_tasks().await?; +let epics = db.list_epics().await?; // Waits unnecessarily +``` + +### Memory Patterns + +| Do | Don't | +|---|---| +| Use `&str` over `String` where possible | Clone strings unnecessarily | +| Use `Cow` for maybe-owned data | Allocate when borrowing suffices | +| Pre-allocate with `Vec::with_capacity` | Push to vec without size hint | +| Use iterators over collecting into vecs | `collect()` when you can chain | +| Use `Arc` for shared read-only data | Clone large structs across threads | + +```rust +// Good: pre-allocate +let mut results = Vec::with_capacity(tasks.len()); + +// Bad: repeated reallocation +let mut results = Vec::new(); // Grows 1, 2, 4, 8... 
+``` + +### String Performance + +| Do | Don't | +|---|---| +| Use `write!` or `format!` once | Repeated `push_str` / `+` concatenation | +| Use `String::with_capacity` for building | Build strings without size hint | +| Return `impl Display` for lazy formatting | Format eagerly when output may not be used | +| Use `serde_json::to_writer` for streaming | Serialize to String then write | + +## Tokio Runtime + +| Setting | Recommendation | +|---|---| +| Runtime flavor | `current_thread` for CLI (lower overhead) | +| Worker threads | Default for server, 1 for CLI tools | +| Blocking threads | Default (512), reduce for CLI | +| Task budget | Let tokio manage; avoid manual yielding | + +```rust +// CLI tool: single-threaded runtime (faster startup) +#[tokio::main(flavor = "current_thread")] +async fn main() { ... } + +// Server: multi-threaded (default) +#[tokio::main] +async fn main() { ... } +``` + +## Performance Anti-Patterns + +| Anti-Pattern | Impact | Fix | +|---|---|---| +| Blocking async runtime | Starves other tasks | Use `spawn_blocking` | +| N+1 database queries | Linear latency growth | Batch or JOIN queries | +| Cloning large structs | Memory + CPU waste | Use references or `Arc` | +| Unbounded collections | Memory exhaustion | Add limits, use streaming | +| Sync file I/O in async | Blocks executor thread | Use `tokio::fs` | +| Excessive logging in hot path | I/O overhead | Use log levels, sample | +| Compiling in debug mode for perf tests | 10-100x slower | Always `--release` | +| Not reusing DB connections | Connection overhead | Use connection pool | +| Serializing then parsing JSON | Double work | Pass typed structs | +| Formatting strings nobody reads | CPU waste | Lazy formatting with `Display` | + +## Profiling Workflow + +| Step | Command | Purpose | +|---|---|---| +| 1. Baseline | `hyperfine 'flowctl status --json'` | Measure current perf | +| 2. Profile | `cargo flamegraph --bin flowctl -- status` | Find hotspots | +| 3. 
Optimize | Edit code targeting hotspot | Reduce cost | +| 4. Verify | Re-run baseline command | Confirm improvement | +| 5. Benchmark | `cargo bench` | Prevent regression | diff --git a/docs/references/security-checklist.md b/docs/references/security-checklist.md new file mode 100644 index 00000000..a44bff1a --- /dev/null +++ b/docs/references/security-checklist.md @@ -0,0 +1,141 @@ +# Security Checklist + +Quick reference for security in flow-code. CLI tool handling local state, no web server. + +## Pre-Commit Secret Scan + +```bash +# Check staged files for secrets before committing +git diff --cached | grep -iE \
+  "password\s*=|secret\s*=|api_key\s*=|token\s*=|private_key|BEGIN RSA|BEGIN EC" + +# Patterns to catch (-E needed so {n} counts as a repetition, not literal braces) +grep -rnE "AKIA[0-9A-Z]{16}" # AWS access keys +grep -rnE "sk-[a-zA-Z0-9]{48}" # OpenAI API keys +grep -rnE "ghp_[a-zA-Z0-9]{36}" # GitHub personal tokens +grep -rnE "xoxb-[0-9]{10,}" # Slack bot tokens +``` + +## Files to Never Commit + +| File/Pattern | Reason | Mitigation | +|---|---|---| +| `.env` | Contains secrets | Add to `.gitignore` | +| `*.pem`, `*.key` | Private keys | Add to `.gitignore` | +| `.flow/` | Runtime state (may contain tokens) | Add to `.gitignore` | +| `credentials.json` | Service account keys | Add to `.gitignore` | +| `*.upstream` | Backup files with potential secrets | Add to `.gitignore` | + +Commit `.env.example` with placeholder values instead of `.env`. 
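The mitigation column above collapses into a single `.gitignore` block. A sketch; adjust to the project's actual layout:

```gitignore
# Secrets and keys -- never commit
.env
*.pem
*.key
credentials.json

# Runtime state and backups that may contain tokens
.flow/
*.upstream
```

Note that the `.env` pattern matches only files named exactly `.env`, so the `.env.example` template stays tracked without any extra rule.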
+ +## Secrets Management + +| Do | Don't | +|---|---| +| Use environment variables for secrets | Hardcode secrets in source | +| Commit `.env.example` with placeholders | Commit `.env` with real values | +| Use `--json` flag for machine parsing | Parse human-readable output with secrets | +| Rotate secrets after exposure | Ignore leaked credentials | +| Scope API keys to minimum permissions | Use admin keys for all operations | + +## OWASP Top 10 Quick Reference + +| # | Vulnerability | Prevention | +|---|---|---| +| 1 | Broken Access Control | Auth checks on every endpoint, ownership verification | +| 2 | Cryptographic Failures | HTTPS, strong hashing, no secrets in code | +| 3 | Injection | Parameterized queries, input validation | +| 4 | Insecure Design | Threat modeling, spec-driven development | +| 5 | Security Misconfiguration | Minimal permissions, audit deps, secure defaults | +| 6 | Vulnerable Components | `cargo audit`, `cargo deny`, minimal deps | +| 7 | Auth Failures | Strong passwords, rate limiting, session management | +| 8 | Data Integrity Failures | Verify updates/dependencies, signed artifacts | +| 9 | Logging Failures | Log security events, don't log secrets | +| 10 | SSRF | Validate/allowlist URLs, restrict outbound requests | + +## Dependency Audit (Rust) + +```bash +# Audit for known vulnerabilities +cargo audit + +# Deny specific licenses or advisories +cargo deny check + +# Check for outdated dependencies +cargo outdated + +# Minimal dependency tree (fewer deps = smaller attack surface) +cargo tree | wc -l +``` + +| Tool | Purpose | When to Run | +|---|---|---| +| `cargo audit` | CVE database check | Pre-commit, CI | +| `cargo deny` | License + advisory policy | CI pipeline | +| `cargo outdated` | Stale dependency check | Weekly/monthly | +| `cargo tree` | Dependency graph audit | When adding deps | + +## Input Validation + +| Rule | Example | +|---|---| +| Validate at boundaries only | CLI args, JSON input, file reads | +| Use allowlists, 
not denylists | Accept known-good, reject everything else | +| Constrain string lengths | Task titles: 1-500 chars | +| Validate file paths | No `..` traversal, stay within `.flow/` | +| Sanitize output for shells | Escape special chars in generated commands | +| Parse, don't validate | Deserialize into typed structs, not raw strings | + +```rust +// Good: parse into typed struct at boundary +let task: TaskInput = serde_json::from_str(&input)?; + +// Bad: pass raw string around and validate later +fn process(raw: &str) { /* who validates this? */ } +``` + +## Error Handling + +| Context | Approach | +|---|---| +| CLI output (user-facing) | Clear error message, no stack traces | +| JSON output (`--json`) | Structured error: `{"error": "message"}` | +| Internal logging | Full context, file/line, but no secrets | +| Panics | Propagate expected errors as `Result` with `?`; catch true panics only at boundaries with `catch_unwind` | + +```rust +// Good: generic error to user, detail in logs +eprintln!("Error: failed to open database"); +tracing::error!("DB open failed: {:?} at {:?}", err, path); + +// Bad: expose internals +eprintln!("Error: {}", err); // May leak path, SQL, etc. 
+``` + +## SQL Safety (libSQL) + +| Do | Don't | +|---|---| +| Use parameterized queries | Concatenate user input into SQL | +| Use `?` placeholders | Use `format!()` for query building | +| Validate types before query | Trust raw input as safe | +| Use transactions for multi-step ops | Execute multiple queries without atomicity | + +```rust +// Good: parameterized +conn.execute("INSERT INTO tasks (title) VALUES (?1)", [&title]).await?; + +// Bad: string concatenation +conn.execute(&format!("INSERT INTO tasks (title) VALUES ('{}')", title), ()).await?; +``` + +## File System Security + +| Rule | Rationale | +|---|---| +| Validate paths stay within `.flow/` | Prevent directory traversal | +| Use temp dirs for scratch work | Don't pollute user directories | +| Set restrictive permissions on state files | Protect `.flow/` data | +| Clean up temp files on exit | No stale sensitive data | +| Don't follow symlinks blindly | Potential symlink attacks | diff --git a/docs/references/testing-patterns.md b/docs/references/testing-patterns.md new file mode 100644 index 00000000..8b192a85 --- /dev/null +++ b/docs/references/testing-patterns.md @@ -0,0 +1,167 @@ +# Testing Patterns Reference + +Quick reference for testing patterns in flow-code. Rust backend, Bash scripts, Markdown skills. 
+ +## Test Pyramid + +| Layer | Coverage | What to Test | Tools | +|-------|----------|--------------|-------| +| Unit | ~80% | Pure functions, logic, parsing, serialization | `cargo test`, `#[test]` | +| Integration | ~15% | DB queries, CLI commands, cross-crate calls | `cargo test --test`, temp dirs | +| E2E | ~5% | Full workflows (smoke tests, ralph loops) | `bash scripts/smoke_test.sh` | + +## Test Structure (Arrange-Act-Assert) + +```rust +#[test] +fn creates_task_with_default_status() { + // Arrange: set up test data + let input = TaskInput { title: "Test".into(), ..Default::default() }; + + // Act: perform the action + let result = create_task(input); + + // Assert: verify outcome + assert_eq!(result.status, Status::Todo); + assert!(!result.id.is_empty()); +} +``` + +## Rust Test Patterns + +| Pattern | Usage | Example | +|---------|-------|---------| +| `#[test]` | Basic unit test | `fn test_parse() { assert_eq!(parse("x"), Ok("x")); }` | +| `#[should_panic]` | Expected panic | `#[should_panic(expected = "empty")] fn test_empty()` | +| `#[tokio::test]` | Async test | `async fn test_db_query() { ... 
}` | +| `assert_eq!` | Equality check | `assert_eq!(actual, expected)` | +| `assert!` | Boolean check | `assert!(result.is_ok())` | +| `assert_ne!` | Inequality check | `assert_ne!(id1, id2)` | +| Test modules | Organize tests | `#[cfg(test)] mod tests { use super::*; }` | +| `#[ignore]` | Skip slow tests | Run with `cargo test -- --ignored` | + +## Running Tests + +```bash +# All tests +cd flowctl && cargo test --all + +# Specific crate +cargo test -p flowctl-core +cargo test -p flowctl-db +cargo test -p flowctl-cli + +# Single test by name +cargo test test_create_task + +# With output +cargo test -- --nocapture + +# Bash smoke/e2e tests +bash scripts/smoke_test.sh +bash scripts/ci_test.sh +bash scripts/teams_e2e_test.sh +``` + +## Mock Boundaries + +| Mock These | Don't Mock These | +|------------|-----------------| +| Database calls (libSQL) | Internal utility functions | +| HTTP/network requests | Business logic | +| File system operations | Data transformations | +| External API calls | Validation functions | +| Time/Date (`SystemTime`) | Pure functions | +| Environment variables | Parsing/serialization | + +## Prove-It Pattern (Bug Fixes) + +| Step | Action | Gate | +|------|--------|------| +| 1. Red | Write a test that reproduces the bug | Test MUST fail | +| 2. Confirm | Run test suite, verify only new test fails | No other regressions | +| 3. Fix | Implement the minimal fix | Keep scope tight | +| 4. Green | Run test suite, verify all tests pass | New test now passes | +| 5. Commit | Include test + fix in same commit | Evidence-based fix | + +## Test Naming Conventions + +```rust +#[cfg(test)] +mod tests { + use super::*; + + // Pattern: action_condition_expected_result + #[test] + fn parse_valid_input_returns_ok() { ... } + + #[test] + fn parse_empty_string_returns_error() { ... } + + #[test] + fn create_task_assigns_unique_id() { ... 
} +} +``` + +## Integration Test Structure + +```rust +// tests/integration_test.rs (separate file in tests/ directory) +use tempfile::TempDir; + +#[tokio::test] +async fn full_task_lifecycle() { + // Setup: temp directory for .flow/ state + let tmp = TempDir::new().unwrap(); + let db = Database::open(tmp.path()).await.unwrap(); + + // Create + let task = db.create_task("Test task").await.unwrap(); + assert_eq!(task.status, "todo"); + + // Transition + db.start_task(&task.id).await.unwrap(); + let updated = db.get_task(&task.id).await.unwrap(); + assert_eq!(updated.status, "in_progress"); + + // Complete + db.complete_task(&task.id, "Done").await.unwrap(); + let done = db.get_task(&task.id).await.unwrap(); + assert_eq!(done.status, "done"); +} +``` + +## Bash Test Patterns + +```bash +# Temp directory isolation +TMPDIR=$(mktemp -d) +trap "rm -rf $TMPDIR" EXIT + +# Assert command succeeds +$FLOWCTL create-epic "test" --json || { echo "FAIL: create-epic"; exit 1; } + +# Assert output contains expected string +OUTPUT=$($FLOWCTL show fn-1 --json) +echo "$OUTPUT" | grep -q '"status":"todo"' || { echo "FAIL: status"; exit 1; } + +# Assert command fails (expected) +if $FLOWCTL done nonexistent 2>/dev/null; then + echo "FAIL: should have failed"; exit 1 +fi +``` + +## Test Anti-Patterns + +| Anti-Pattern | Problem | Better Approach | +|---|---|---| +| Testing implementation details | Breaks on refactor | Test inputs/outputs only | +| Shared mutable state between tests | Tests pollute each other | Use temp dirs, fresh DB per test | +| No assertions (test runs = passes) | False confidence | Every test needs `assert!` | +| Hardcoded paths in tests | Breaks on other machines | Use `TempDir`, relative paths | +| Skipping tests to pass CI | Hides real bugs | Fix or delete the test | +| Testing third-party code | Wastes time | Mock the boundary | +| Overly broad assertions (`is_ok()`) | Doesn't catch regressions | Assert specific values | +| Giant test functions (50+ lines) | 
Hard to diagnose failures | One assertion per concept | +| Test depends on execution order | Flaky in parallel | Each test self-contained | +| Sleeping for async (`thread::sleep`) | Slow and flaky | Use proper async awaiting | diff --git a/docs/skill-anatomy.md b/docs/skill-anatomy.md new file mode 100644 index 00000000..0a8e6b1c --- /dev/null +++ b/docs/skill-anatomy.md @@ -0,0 +1,169 @@ +# Skill Anatomy + +Standard structure for flow-code skill files. Use this when creating or reviewing skills. + +## File Location + +Every skill lives in its own directory under `skills/`: + +``` +skills/flow-code-<skill-name>/ + SKILL.md # Required: The skill definition (<=500 lines) + workflow.md # Optional: Extended workflow if SKILL.md overflows + templates/ # Optional: Scripts, config templates +``` + +Prefix all skill directories with `flow-code-`. Main file is always `SKILL.md` (uppercase). + +## YAML Frontmatter (Required) + +```yaml +--- +name: flow-code-<skill-name> +description: Use when [triggering conditions and symptoms only] +--- +``` + +**Rules:** +- `name`: Lowercase, hyphen-separated. Must match directory name. Always starts with `flow-code-`. +- `description`: Starts with "Use when...". Max 500 characters. Third person. + - Include: triggering conditions, symptoms, contexts. + - Exclude: workflow summary, process steps, what the skill does. + +**Why:** Descriptions are injected into system prompts for skill discovery. If the description contains process steps, agents follow the summary and skip the actual skill content. + +## Required Sections + +```markdown +# Skill Title + +## Overview +Core principle in 1-2 sentences. What this skill enforces and why it matters. + +## When to Use +- Triggering conditions (symptoms, task types, failure patterns) +- **When NOT to use:** exclusions to prevent misapplication + +## Core Process +The step-by-step workflow. Numbered phases or steps. +Include inline code examples. Use ASCII flowcharts for decision points. 
+ +## Common Rationalizations + +| Excuse | Reality | +|--------|---------| +| "Too simple to need this" | Simple things break too | +| "I'll do it properly later" | Later never comes | + +## Red Flags +- Observable symptoms indicating the skill is being violated +- Patterns to watch for during self-check and review + +## Verification +After completing the process, confirm: +- [ ] Checklist item with verifiable evidence +- [ ] Another checkpoint (test output, build result, etc.) +``` + +## Section Purposes + +### Overview +The elevator pitch. Answers: what does this skill enforce, and why should an agent follow it? + +### When to Use +Helps agents decide if this skill applies. Include both positive triggers ("Use when X") and negative exclusions ("NOT for Y"). The flow-code-debug exemplar: +> Test failures, bugs, unexpected behavior, performance problems, build failures. +> **Especially when:** under time pressure, "quick fix" seems obvious. + +### Core Process +The heart of the skill. Step-by-step workflow the agent follows. Must be specific and actionable. + +**Good:** "Run `cargo test --all` and verify zero failures" +**Bad:** "Make sure the tests work" + +For flow-code skills, phases follow the convention: Phase 1, Phase 2, Phase 2.5 (verify), Phase 3 (commit), etc. Reference `flowctl` commands where applicable: +```bash +$FLOWCTL guard # Run all guards +$FLOWCTL invariants check # Check architecture invariants +``` + +### Common Rationalizations +The most distinctive section. Excuses agents use to skip important steps, paired with factual rebuttals. Think of every time an agent said "I'll add tests later" or "This is simple enough to skip" -- those go here. + +This is the core anti-rationalization principle. Every skip-worthy step needs a counter-argument. Without this section, agents reliably talk themselves out of following the process. + +### Red Flags +Observable signs the skill is being violated. 
Phrased as quotes or behaviors: +- "Quick fix for now, investigate later" +- "I don't fully understand but this might work" +- Proposing solutions before completing investigation +- Each fix reveals a new problem in a different place + +### Verification +Exit criteria as a checkbox checklist. Every item must be verifiable with evidence (test output, build result, git diff, flowctl output). No subjective items like "code looks good." + +## Supporting Files + +Create supporting files only when: +- SKILL.md exceeds 500 lines (overflow to `workflow.md` or similar) +- Reusable scripts or templates are needed +- Long checklists justify separate files + +Keep patterns and principles inline when under ~50 lines. Most skills need only SKILL.md. + +## Writing Principles + +1. **Process over knowledge.** Skills are workflows, not reference docs. Steps, not facts. +2. **Specific over general.** `$FLOWCTL guard` beats "verify the code works." +3. **Evidence over assumption.** Every verification checkbox requires proof. +4. **Anti-rationalization is core.** Every skip-worthy step needs a counter-argument in the rationalizations table. This is what separates effective skills from advice. +5. **Token-conscious.** Every section must justify its inclusion. If removing it wouldn't change agent behavior, remove it. +6. **Progressive disclosure.** SKILL.md is the entry point. Supporting files load on demand. + +## Flow-Code Specifics + +### flowctl Integration +Skills that enforce process should reference flowctl commands: +- `$FLOWCTL guard` for verification gates +- `$FLOWCTL done --summary-file <file> --evidence-json <json>` for evidence-based completion +- `$FLOWCTL invariants check` for architecture invariant enforcement + +### Evidence Requirements +Flow-code workers produce evidence JSON. 
Skills should specify what evidence their process generates: +```json +{"commits": ["abc123"], "tests": ["cargo test --all"], "prs": []} +``` + +### Phase Naming +Follow the worker agent convention: +- Phase 1: Re-anchor (read spec) +- Phase 2: Implement +- Phase 2.5: Verify & Fix +- Phase 3: Commit +- Phase 4: Review +- Phase 5: Complete + +Skills that define their own phases should use this numbering style (Phase 1, Phase 1.5, Phase 2) for consistency with the worker pipeline. + +### Cross-Skill References +Reference other skills by name, don't duplicate: +```markdown +If the build breaks, use the `flow-code-debug` skill. +For test-first development, see `flow-code-tdd`. +``` + +## Exemplar + +The `flow-code-debug` skill (`skills/flow-code-debug/SKILL.md`) is the reference implementation. It demonstrates all required sections including the Common Rationalizations table (lines 137-151) and Red Flags list (lines 128-135). + +## Checklist for New Skills + +- [ ] Directory created as `skills/flow-code-<skill-name>/` +- [ ] SKILL.md has valid YAML frontmatter with `name` and `description` +- [ ] Description starts with "Use when..." (no workflow summary) +- [ ] All six required sections present (Overview, When to Use, Core Process, Common Rationalizations, Red Flags, Verification) +- [ ] Rationalizations table has 3+ entries with factual rebuttals +- [ ] Red flags list has observable symptoms (not vague advice) +- [ ] Verification checklist items are evidence-backed +- [ ] SKILL.md is under 500 lines +- [ ] Registered in plugin.json and README.md skills table diff --git a/hooks/SIMPLIFY-IGNORE.md b/hooks/SIMPLIFY-IGNORE.md new file mode 100644 index 00000000..f8fd18c6 --- /dev/null +++ b/hooks/SIMPLIFY-IGNORE.md @@ -0,0 +1,92 @@ +# simplify-ignore hook + +Annotation-based code protection for Claude Code. Mark code blocks with `simplify-ignore-start` / `simplify-ignore-end` annotations and the hook replaces them with `BLOCK_<hash>` placeholders before the model sees them. 
On edit or write, placeholders are expanded back to the original code. On session stop, all files are fully restored.
+
+## Setup
+
+Add to your project's `.claude/settings.json` (NOT `hooks/hooks.json` -- this is opt-in per project):
+
+```json
+{
+  "hooks": {
+    "PreToolUse": [
+      {
+        "matcher": "Read",
+        "hooks": [{ "type": "command", "command": "bash ${CLAUDE_PROJECT_DIR}/hooks/simplify-ignore.sh" }]
+      }
+    ],
+    "PostToolUse": [
+      {
+        "matcher": "Edit|Write",
+        "hooks": [{ "type": "command", "command": "bash ${CLAUDE_PROJECT_DIR}/hooks/simplify-ignore.sh" }]
+      }
+    ],
+    "Stop": [
+      {
+        "hooks": [{ "type": "command", "command": "bash ${CLAUDE_PROJECT_DIR}/hooks/simplify-ignore.sh" }]
+      }
+    ]
+  }
+}
+```
+
+Add `.claude/.simplify-ignore-cache/` to your `.gitignore`.
+
+## Annotation syntax
+
+```js
+/* simplify-ignore-start: perf-critical */
+// manually unrolled XOR -- 3x faster than a loop
+result[0] = buf[0] ^ key[0];
+result[1] = buf[1] ^ key[1];
+/* simplify-ignore-end */
+```
+
+The reason after the colon is optional. It appears in the placeholder (`BLOCK_a1b2c3d4: perf-critical`) so the model knows why the block is hidden.
+
+## Supported comment styles
+
+| Style | Example |
+|-------|---------|
+| `/* */` | `/* simplify-ignore-start: reason */` |
+| `//` | `// simplify-ignore-start: reason` |
+| `#` | `# simplify-ignore-start: reason` |
+| `<!-- -->` | `<!-- simplify-ignore-start: reason -->` |
+
+Multiple blocks per file and single-line blocks (start + end on same line) are supported. Placeholders preserve the original comment prefix/suffix.
+
+## How it works
+
+| Hook event | Action |
+|------------|--------|
+| PreToolUse Read | Backs up file, replaces annotated blocks with `BLOCK_<hash>` placeholders |
+| PostToolUse Edit/Write | Expands placeholders back, saves model changes, re-filters |
+| Stop | Restores all files from backup |
+
+Block hashes are the first 8 hex chars of the `shasum`/`sha1sum` digest of the block content, making round-trips unambiguous.
+
+## Crash recovery
+
+If Claude Code exits without triggering the Stop hook, files may still have `BLOCK_<hash>` placeholders. Restore manually:
+
+```bash
+echo '{}' | bash hooks/simplify-ignore.sh
+```
+
+Backups are in `.claude/.simplify-ignore-cache/` within your project directory.
+
+## Known limitations
+
+- **Single-line blocks hide the entire line.** Use dedicated lines for annotations.
+- **Suffix detection covers `*/` and `-->` only.** Non-standard template comment closers (ERB, Blade) may not work; use `#` or `//` instead.
+- **Fuzzy fallback on altered placeholders.** If the model changes a placeholder's reason text, the hook tries a hash-only match which may leave cosmetic debris.
+- **File renaming leaves placeholders.** Renamed files retain placeholders; originals are saved as `.recovered` on stop.
+- **Annotation must be in a comment.** Only lines where the marker sits in a `//`, `/*`, `#`, or `<!--` comment are recognized.
diff --git a/hooks/simplify-ignore.sh b/hooks/simplify-ignore.sh
+      case "$line" in *'*/'*) suffix=" */" ;; *'-->'*) suffix=" -->" ;; esac
+      reason=$(printf '%s' "$line" | sed -n 's/.*simplify-ignore-start:[[:space:]]*//p' \
+        | sed 's/[[:space:]]*\*\/.*$//' | sed 's/[[:space:]]*-->.*$//' | sed 's/[[:space:]]*$//')
+      # Handle single-line block (start + end on same line)
+      case "$line" in *simplify-ignore-end*)
+        in_block=0
+        local h; h=$(block_hash "$buf")
+        count=$((count + 1))
+        printf '%s' "$buf" > "$CACHE/${fid}.block.${h}"
+        [ -n "$reason" ] && printf '%s' "$reason" > "$CACHE/${fid}.reason.${h}"
+        printf '%s' "$prefix" > "$CACHE/${fid}.prefix.${h}"
+        printf '%s' "$suffix" > "$CACHE/${fid}.suffix.${h}"
+        if [ -n "$reason" ]; then
+          printf '%s\n' "${prefix}BLOCK_${h}: ${reason}${suffix}" >> "$dest"
+        else
+          printf '%s\n' "${prefix}BLOCK_${h}${suffix}" >> "$dest"
+        fi
+        buf=""; reason=""; prefix=""; suffix=""
+        continue
+        ;; *)
+        continue
+        ;;
+      esac
+      ;; esac
+    fi
+    # Accumulate block content
+    if [ $in_block -eq 1 ]; then
+      buf="${buf}
+${line}"
+    fi
+    # Detect end marker
+    case "$line" in *simplify-ignore-end*)
+      if [ $in_block -eq 1 ]; then
+        local h; h=$(block_hash "$buf")
+
count=$((count + 1)) + printf '%s' "$buf" > "$CACHE/${fid}.block.${h}" + [ -n "$reason" ] && printf '%s' "$reason" > "$CACHE/${fid}.reason.${h}" + printf '%s' "$prefix" > "$CACHE/${fid}.prefix.${h}" + printf '%s' "$suffix" > "$CACHE/${fid}.suffix.${h}" + if [ -n "$reason" ]; then + printf '%s\n' "${prefix}BLOCK_${h}: ${reason}${suffix}" >> "$dest" + else + printf '%s\n' "${prefix}BLOCK_${h}${suffix}" >> "$dest" + fi + in_block=0; buf=""; reason=""; prefix=""; suffix="" + continue + fi + ;; + esac + [ $in_block -eq 0 ] && printf '%s\n' "$line" >> "$dest" + done < "$src" + + # Unclosed block: flush as-is with warning + if [ $in_block -eq 1 ] && [ -n "$buf" ]; then + printf 'Warning: unclosed simplify-ignore-start in %s\n' "$src" >&2 + printf '%s\n' "$buf" >> "$dest" + fi + + [ $count -gt 0 ] && return 0 || return 1 +} + +# ── Stop: restore all files from backup ────────────────────────────────────── +if [ -z "$TOOL_NAME" ]; then + [ -d "$CACHE" ] || exit 0 + for bak in "$CACHE"/*.bak; do + [ -f "$bak" ] || continue + fid="${bak##*/}"; fid="${fid%.bak}" + pathfile="$CACHE/${fid}.path" + [ -f "$pathfile" ] || { rm -f "$bak"; continue; } + orig=$(cat "$pathfile") + if [ -f "$orig" ]; then + cat "$bak" > "$orig" + else + mkdir -p "$(dirname "${orig}.recovered")" + mv "$bak" "${orig}.recovered" + printf 'Warning: %s moved/deleted. 
Recovered to %s.recovered\n' "$orig" "$orig" >&2 + fi + rm -f "$bak" "$pathfile" "$CACHE/${fid}".block.* "$CACHE/${fid}".reason.* \ + "$CACHE/${fid}".prefix.* "$CACHE/${fid}".suffix.* + done + exit 0 +fi + +[ -z "$FILE_PATH" ] && exit 0 + +# ── PreToolUse Read: filter in-place ───────────────────────────────────────── +if [ "$TOOL_NAME" = "Read" ]; then + [ -f "$FILE_PATH" ] || exit 0 + # Don't filter our own files + case "$(basename "$FILE_PATH")" in simplify-ignore*|SIMPLIFY-IGNORE*) exit 0 ;; esac + + mkdir -p "$CACHE" + ID=$(file_id "$FILE_PATH") + + # Already filtered + [ -f "$CACHE/${ID}.bak" ] && exit 0 + + # No annotations in this file + grep -q 'simplify-ignore-start' -- "$FILE_PATH" || exit 0 + + # Back up original + cp -p "$FILE_PATH" "$CACHE/${ID}.bak" 2>/dev/null || cp "$FILE_PATH" "$CACHE/${ID}.bak" + printf '%s' "$FILE_PATH" > "$CACHE/${ID}.path" + + # Filter in-place + FILTERED="$CACHE/${ID}.$$.tmp" + rm -f "$FILTERED" + if filter_file "$FILE_PATH" "$FILTERED" "$ID"; then + cat "$FILTERED" > "$FILE_PATH" + rm -f "$FILTERED" + else + # No blocks found (race condition) - clean up + rm -f "$FILTERED" "$CACHE/${ID}.bak" "$CACHE/${ID}.path" + fi + exit 0 +fi + +# ── PostToolUse Edit|Write: expand placeholders, then re-filter ────────────── +if [ "$TOOL_NAME" = "Edit" ] || [ "$TOOL_NAME" = "Write" ]; then + ID=$(file_id "$FILE_PATH") + [ -f "$CACHE/${ID}.bak" ] || exit 0 + ls "$CACHE/${ID}".block.* >/dev/null 2>&1 || exit 0 + + # Check for missing placeholders BEFORE expanding — if a placeholder was dropped + # by a whole-file Write, restore from the original backup to prevent data loss. + MISSING_BLOCKS=0 + for bf in "$CACHE/${ID}".block.*; do + [ -f "$bf" ] || continue + h="${bf##*.}" + if ! 
grep -q "BLOCK_${h}" "$FILE_PATH" 2>/dev/null; then + MISSING_BLOCKS=$((MISSING_BLOCKS + 1)) + printf 'Warning: BLOCK_%s placeholder missing — restoring from backup\n' "$h" >&2 + fi + done + if [ "$MISSING_BLOCKS" -gt 0 ]; then + # Re-insert missing blocks into the edited file (not a full backup restore, + # which would discard legitimate edits outside protected regions). + # Strategy: for each missing block, append it at the end with a warning comment. + for bf in "$CACHE/${ID}".block.*; do + [ -f "$bf" ] || continue + h="${bf##*.}" + if ! grep -q "BLOCK_${h}" "$FILE_PATH" 2>/dev/null; then + block_content=$(cat "$bf") + printf '\n%s\n' "$block_content" >> "$FILE_PATH" + printf 'Warning: BLOCK_%s was dropped — re-appended protected block at end of file\n' "$h" >&2 + fi + done + # Clear cache so next Read re-establishes placeholders cleanly + rm -f "$CACHE/${ID}.bak" "$CACHE/${ID}.path" "$CACHE/${ID}".block.* \ + "$CACHE/${ID}".reason.* "$CACHE/${ID}".prefix.* "$CACHE/${ID}".suffix.* + exit 0 + fi + + # Expand placeholders back to original blocks + EXPANDED="$CACHE/${ID}.$$.expanded" + rm -f "$EXPANDED" + while IFS= read -r line || [ -n "$line" ]; do + case "$line" in *BLOCK_*) + for bf in "$CACHE/${ID}".block.*; do + [ -f "$bf" ] || continue + h="${bf##*.}" + case "$line" in *"BLOCK_${h}"*) + bp=""; bs=""; br="" + [ -f "$CACHE/${ID}.prefix.${h}" ] && bp=$(cat "$CACHE/${ID}.prefix.${h}") + [ -f "$CACHE/${ID}.suffix.${h}" ] && bs=$(cat "$CACHE/${ID}.suffix.${h}") + [ -f "$CACHE/${ID}.reason.${h}" ] && br=$(cat "$CACHE/${ID}.reason.${h}") + if [ -n "$br" ]; then + placeholder="${bp}BLOCK_${h}: ${br}${bs}" + else + placeholder="${bp}BLOCK_${h}${bs}" + fi + block_content=$(cat "$bf"; printf x); block_content="${block_content%x}" + esc_placeholder=$(escape_glob "$placeholder") + line="${line//$esc_placeholder/$block_content}" + # Fuzzy fallback if model altered reason text + case "$block_content" in *"BLOCK_${h}"*) ;; *) + case "$line" in *"BLOCK_${h}"*) + printf 
'Warning: BLOCK_%s modified by model, using fuzzy match\n' "$h" >&2
+            line="${line//BLOCK_${h}/$block_content}"
+            ;; esac
+          ;; esac
+        ;; esac
+      done
+    ;; esac
+    printf '%s\n' "$line" >> "$EXPANDED"
+  done < "$FILE_PATH"
+
+  # Preserve inode and permissions
+  cat "$EXPANDED" > "$FILE_PATH"
+  rm -f "$EXPANDED"
+
+  # DO NOT overwrite the original backup — keep it pristine for crash recovery.
+  # The backup always contains the pre-filter original, never the expanded version.
+
+  # Re-filter in-place so next Read sees placeholders
+  FILTERED="$CACHE/${ID}.$$.tmp"
+  rm -f "$FILTERED"
+  if filter_file "$FILE_PATH" "$FILTERED" "$ID"; then
+    cat "$FILTERED" > "$FILE_PATH"
+    rm -f "$FILTERED"
+  fi
+
+  exit 0
+fi
diff --git a/skills/flow-code-debug/SKILL.md b/skills/flow-code-debug/SKILL.md
index 8fb89322..2463b24b 100644
--- a/skills/flow-code-debug/SKILL.md
+++ b/skills/flow-code-debug/SKILL.md
@@ -88,19 +88,20 @@ END
 
 ## Phase 4: Implementation
 
-1. **Write failing test** (if TDD mode or test framework available):
-   ```bash
-   # Test must fail, proving the bug exists
-   guard --layer
-   ```
+### Prove-It Pattern (mandatory for bug fixes)
 
-2. **Implement single fix** — address root cause, ONE change, no "while I'm here" improvements.
-   **No bundling:** Do NOT fix multiple things at once. If you're tempted to "also fix this other thing", STOP — commit the single fix first, verify, then address the next issue separately.
+1. **Write reproduction test** — a test that demonstrates the bug (MUST FAIL)
+2. **Confirm RED** — run the test, verify it actually fails. If it passes, your test doesn't reproduce the bug
+3. **Fix root cause** — implement the fix (not a workaround)
+4. **Confirm GREEN** — run the test, verify it now passes
+5. **Run full suite** — check for regressions: `flowctl guard`
 
-3. **Verify fix:**
-   ```bash
-   guard
-   ```
+**If the test passes on step 2, your test is wrong.** Go back to Phase 1 and refine your understanding of the bug.
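As a minimal, self-contained illustration of the Prove-It loop (the function and the bug below are hypothetical, not from this repo — in real tasks the reproduction test goes into your project's test suite):

```shell
#!/bin/sh
# Hypothetical bug: counting lines with `wc -l` misses a final line that
# has no trailing newline.
count_lines_buggy() { printf '%s' "$1" | wc -l | tr -d ' '; }

# Root-cause fix: normalize the input so the last line is newline-terminated.
count_lines_fixed() { printf '%s\n' "$1" | wc -l | tr -d ' '; }

input='a
b
c'

# Step 2 — confirm RED: the reproduction check fails against the buggy code.
if [ "$(count_lines_buggy "$input")" = "3" ]; then
  echo 'test does not reproduce the bug — refine it' >&2
  exit 1
fi

# Step 4 — confirm GREEN: the same check passes once the root cause is fixed.
[ "$(count_lines_fixed "$input")" = "3" ] && echo 'GREEN: fix verified'
```

The point of the RED step is visible here: had the reproduction check passed against the buggy function, the script would stop instead of proceeding to a fix it cannot validate.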
+ +### Fix Discipline + +- **Implement single fix** — address root cause, ONE change, no "while I'm here" improvements. + **No bundling:** Do NOT fix multiple things at once. If you're tempted to "also fix this other thing", STOP — commit the single fix first, verify, then address the next issue separately. 4. **If fix doesn't work — failure escalation:** @@ -148,6 +149,7 @@ END | "Need more context" | You have tools. Search first, ask only what's truly unavailable | | "Suggest handling manually" | This is your bug. Own it. Exhaust all options first | | Same logic, different parameters | Tweaking parameters is NOT a different approach. Change the method. | +| "Bug is too simple for a test" | Simple bugs regress. The test takes 2 minutes. The re-diagnosis takes 2 hours. | ## Quick Reference @@ -157,7 +159,7 @@ END | 1.5 RP Investigate | context_builder(question) with symptoms + hypotheses | Cross-file context gathered (or skipped if no RP) | | 2. Pattern | Find working examples, compare differences | Identified the delta | | 3. Hypothesis | Form theory, test ONE variable | Confirmed or new hypothesis | -| 4. Implement | Write test, fix root cause, verify | Bug resolved, guards pass | +| 4. Implement | Prove-It: RED test → fix root cause → GREEN test → full suite | Bug resolved, guards pass | ## After Fix diff --git a/skills/flow-code-epic-review/SKILL.md b/skills/flow-code-epic-review/SKILL.md index 4ed8474a..01f7096e 100644 --- a/skills/flow-code-epic-review/SKILL.md +++ b/skills/flow-code-epic-review/SKILL.md @@ -188,3 +188,29 @@ Epic review passed. Next: 2) Ship the code: push + create PR 3) Start next epic: `/flow-code:plan ` ``` + +## Common Rationalizations + +| Excuse | Reality | +|--------|---------| +| "All tasks passed, epic must be fine" | Task-level success ≠ epic-level coherence. Integration gaps hide between tasks | +| "Tests pass, ship it" | Tests verify code behavior, not user value. 
Goal-backward check verifies intent | +| "We already reviewed each task" | Per-task review misses cross-task consistency and architectural drift | +| "The user will catch any issues" | Your job is to catch issues before the user does | +| "Good enough for now" | "Good enough" compounds into technical debt. Ship complete or scope down | + +## Red Flags + +- All acceptance criteria marked "Yes" but no test evidence provided +- Implementation diverged from epic plan (compare git diff vs spec scope) +- Goal-backward verification skipped or done superficially +- Epic has open required gaps but no justification for proceeding +- No cross-task integration verification performed + +## Verification + +- [ ] Every acceptance criterion verified with evidence (test output, screenshot, or manual check) +- [ ] Goal-backward verification completed for each criterion +- [ ] No open required gaps in gap registry +- [ ] Epic spec updated if scope changed during implementation +- [ ] Cross-task consistency verified (no conflicting changes between tasks) diff --git a/skills/flow-code-impl-review/SKILL.md b/skills/flow-code-impl-review/SKILL.md index 149a6d41..0df30abe 100644 --- a/skills/flow-code-impl-review/SKILL.md +++ b/skills/flow-code-impl-review/SKILL.md @@ -155,3 +155,29 @@ If verdict is NEEDS_WORK, loop internally until SHIP: 5. **Repeat** until `SHIP` **CRITICAL**: For RP, re-reviews must stay in the SAME chat so reviewer has context. Only use `--new-chat` on the FIRST review. + +## Common Rationalizations + +| Excuse | Reality | +|--------|---------| +| "Reviewer said fix, so it must be right" | Review feedback needs technical verification too. Reviewers can be wrong | +| "This is just cosmetic" | 3 cosmetic issues often signal a structural problem underneath | +| "3 iterations is overkill for this" | The circuit breaker exists for a reason. 
If you hit it, the plan was wrong | +| "I already fixed the important ones" | "Important" is defined by severity, not your estimate of effort | +| "The code works, reviewer is being pedantic" | Working code is the minimum bar, not the finish line | + +## Red Flags + +- Same issue appears across 3+ review iterations (underlying design problem) +- Fix introduces a new Critical finding +- SHIP verdict issued with zero test evidence +- Review only checks code quality but ignores spec compliance +- ≥3 Critical findings in a single review pass + +## Verification + +- [ ] All Critical findings addressed with specific fixes +- [ ] No fix introduced new Critical or Important issues +- [ ] Tests still pass after all fixes applied +- [ ] Review receipt saved to .flow/reviews/ +- [ ] Acceptance criteria verified (not just code quality) diff --git a/skills/flow-code-plan/SKILL.md b/skills/flow-code-plan/SKILL.md index b01eeff0..cb237b15 100644 --- a/skills/flow-code-plan/SKILL.md +++ b/skills/flow-code-plan/SKILL.md @@ -142,3 +142,31 @@ All plans go into `.flow/`: **After work completes** (if auto-executed): - All tasks done → Layer 3 adversarial review runs automatically (Phase 3j) - Then auto push + draft PR (Phase 5) + +## Common Rationalizations + +| Excuse | Reality | +|--------|---------| +| "This is obvious, skip scouts" | Scouts find patterns and reuse points you don't know about | +| "Only one task, no DAG needed" | Even single tasks need dependency analysis and acceptance criteria | +| "Requirements are clear, skip interview" | Your understanding ≠ user's intent. A 2-minute check prevents 2-hour rework | +| "This is too small to plan" | Small plans still need acceptance criteria. A 3-line spec is fine | +| "I'll refine the plan during implementation" | Implementation drift is the #1 cause of rework. 
Plan now | + +## Red Flags + +- Any task estimated >L size without being split +- Scout returns "Cannot analyze" but planning continues without investigation +- Task title contains "and" (likely two tasks combined) +- No acceptance criteria on any task +- Plan has zero file references from repo-scout +- 7+ tasks in a single epic (possible over-splitting) + +## Verification + +- [ ] Every task has ≥1 testable acceptance criterion +- [ ] Every task is ≤M size (L tasks split) +- [ ] File refs from repo-scout included in task specs +- [ ] DAG validated (`flowctl validate`) +- [ ] Quick commands section exists in epic spec +- [ ] No task title contains "and" without justification diff --git a/skills/flow-code-skill-create/SKILL.md b/skills/flow-code-skill-create/SKILL.md index 9d17c044..56e1a085 100644 --- a/skills/flow-code-skill-create/SKILL.md +++ b/skills/flow-code-skill-create/SKILL.md @@ -5,6 +5,8 @@ description: Use when creating new flow-code skills, editing existing skills, or # Creating Flow-Code Skills +**Template spec:** See [docs/skill-anatomy.md](../../docs/skill-anatomy.md) for the full specification. + Writing skills IS Test-Driven Development applied to documentation. If you didn't watch an agent fail without the skill, you don't know if the skill teaches the right thing. ## When to Create @@ -46,12 +48,27 @@ Core principle in 1-2 sentences. ## When to Use Symptoms, triggers, use cases. +- **When NOT to use:** exclusions to prevent misapplication ## Core Process / Pattern The actual workflow — inline code for simple patterns. -## Common Mistakes -What goes wrong + fixes. 
+## Common Rationalizations + +| Excuse | Reality | +|--------|---------| +| "Too simple to need this" | Simple things break too | +| "I'll do it properly later" | Later never comes | +| "I already know this" | Knowing != doing consistently | + +## Red Flags +- Observable symptoms indicating the skill is being violated +- Patterns to watch for during self-check and review + +## Verification +After completing the process, confirm: +- [ ] Checklist item with verifiable evidence +- [ ] Another checkpoint (test output, build result, etc.) ``` ## The Iron Law @@ -92,9 +109,9 @@ NO SKILL WITHOUT A FAILING TEST FIRST - Exclude: workflow summary, process steps, what the skill does - Keywords in body for discovery: error messages, symptoms, tool names -## Bulletproofing Discipline Skills +## Bulletproofing All Skills -For skills that enforce rules (debugging, TDD, verification): +Every skill — not just discipline skills — needs rationalization counters and red flags. Agents rationalize skipping steps in ALL skill types (technique, pattern, and reference). The sections below apply universally. **Close every loophole explicitly:** ```markdown @@ -133,14 +150,18 @@ Write code before test? Delete it. Start over. - [ ] Failure patterns identified **GREEN:** -- [ ] SKILL.md with frontmatter, overview, process, mistakes +- [ ] SKILL.md with frontmatter, overview, process - [ ] Description starts with "Use when..." 
(no workflow summary) - [ ] Scenario re-run with skill — agent complies +- [ ] "When NOT to use" included in When to Use section +- [ ] Common Rationalizations table present (>=3 entries) +- [ ] Red Flags list present (observable symptoms, not opinions) +- [ ] Verification checklist present (checkbox format) **REFACTOR:** - [ ] New rationalizations countered -- [ ] Rationalization table added (if discipline skill) -- [ ] Red flags list added (if discipline skill) +- [ ] Rationalization table complete (covers all observed agent excuses) +- [ ] Red flags list complete (covers all observed violation patterns) **DEPLOY:** - [ ] Committed to git diff --git a/skills/flow-code-work/SKILL.md b/skills/flow-code-work/SKILL.md index aa6ccf0c..3df07e43 100644 --- a/skills/flow-code-work/SKILL.md +++ b/skills/flow-code-work/SKILL.md @@ -136,3 +136,30 @@ $FLOWCTL restart --force - Don't leave tasks half-done - Never use TodoWrite for task tracking - Never create plan files outside `.flow/` + +## Common Rationalizations + +| Excuse | Reality | +|--------|---------| +| "Spec is good enough, skip re-anchor" | Specs change during plan-review. Always re-read before implementing | +| "Guard passed, definitely no issues" | Guards may not cover your specific change. Manual verification still needed | +| "Small change, no test needed" | Small changes are the #1 source of regressions | +| "I'll test after all tasks are done" | Cross-task integration bugs are 10x harder to diagnose than per-task bugs | +| "One more attempt will fix it" | 3+ failed attempts = architectural problem. 
Escalate, don't retry | + +## Red Flags + +- ≥2 tasks failed in the same wave (systemic issue, stop and investigate) +- Same file needs 3+ fixes across iterations +- Guard passes but manual test reveals broken behavior +- Worker reports "spec unclear" but continues implementing +- >100 lines changed without running any tests +- Wave checkpoint skipped or results not reviewed + +## Verification + +- [ ] All tasks in epic are done or explicitly skipped +- [ ] Guards pass on final state (`flowctl guard`) +- [ ] Evidence files exist for each completed task +- [ ] No unresolved spec conflicts +- [ ] Wave checkpoints completed for each wave
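The evidence item above can be spot-checked mechanically. A hedged sketch — the `.flow/evidence/*.json` layout is an assumption, not documented flow-code behavior; adapt the path to wherever your workers write evidence:

```shell
#!/bin/sh
# Sketch: fail loudly if any task's evidence JSON is missing required keys.
missing=0
for f in .flow/evidence/*.json; do
  [ -f "$f" ] || continue  # glob matched nothing
  # Minimal shape check: evidence must at least record commits and tests.
  if ! grep -q '"commits"' "$f" || ! grep -q '"tests"' "$f"; then
    echo "incomplete evidence: $f" >&2
    missing=$((missing + 1))
  fi
done
[ "$missing" -eq 0 ] && echo 'evidence check passed'
```

A check like this belongs alongside `flowctl guard` in the final verification pass, so "evidence files exist" is proven rather than assumed.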