## Overview
Deep analysis of Anvil — a context engineering and project execution framework for Claude Code (and other runtimes). Anvil focuses on preventing context rot through structured decomposition and atomic execution across multi-phase projects.
Key philosophical difference: DevFlow = "Make every Claude interaction high quality" (skills, review, ambient). Anvil = "Make Claude reliably ship multi-phase projects" (milestones, phases, plans, waves). They're complementary, not competing.
## Priority Findings (Investigate First)

### P1: Context Monitor Hook with Graduated Thresholds

Anvil implements a `PostToolUse` hook that monitors context window consumption and injects `additionalContext` warnings at graduated thresholds:
- WARNING (≤35% remaining / ~65% used): Agent should wrap up current task
- CRITICAL (≤25% remaining / ~75% used): Agent should save state immediately
- Debouncing: 5 tool uses between warnings to prevent spam; severity escalation (WARNING→CRITICAL) bypasses debounce
- Bridge file pattern: Statusline hook writes metrics to `/tmp/claude-ctx-{session_id}.json`; the context monitor reads it — decoupled inter-hook communication without modifying `settings.json`
- GSD-aware: Detects project state and tailors recommendations ("save state using pause command" vs generic advice)
Why this matters: Our agents (Coder, Reviewer, etc.) have zero awareness of context consumption. They keep working until compaction triggers. Anvil's agents know when to wrap up. Our PreCompact hook saves snapshots reactively — this would add proactive awareness before compaction fires.
What we'd build: A PostToolUse hook that monitors context usage and injects graduated warnings. Combined with our existing PreCompact snapshot pattern, this creates a defense-in-depth approach to context management.
### P1: Model Profiles (quality/balanced/budget)

Anvil uses a single `model_profile` config setting to control which model each agent uses:
| Agent Role | quality | balanced | budget |
| --- | --- | --- | --- |
| Planner/Architect | opus | opus | sonnet |
| Executor/Coder | opus | sonnet | sonnet |
| Researcher | opus | sonnet | haiku |
| Verifier | sonnet | sonnet | haiku |
| Mapper/Explorer | sonnet | haiku | haiku |
Key details:
- Per-agent overrides: `"model_overrides": { "executor": "opus" }` for fine-grained control
- Opus → "inherit": Opus agents resolve to session model, respecting org policies that block specific model versions
- Profile switching: Single config change affects all agents simultaneously
- Global user defaults: `~/.gsd/defaults.json` persists preferences across projects
Why this matters: Our /implement spawns ~8 agents with no cost control. Users can't choose between maximum quality vs fast-and-cheap. A model profile system would let users make this tradeoff explicitly.
What we'd build: A --profile quality|balanced|budget flag on orchestration commands (/implement, /code-review, /debug), with a resolution table mapping agent roles to models per profile. Could also be set in project config for persistent preference.
### P1: Goal-Backward Verification for Shepherd
Anvil's verification philosophy: Don't check "did you complete all tasks?" — check "does the codebase actually deliver what was promised?"
Their verifier agent:
- Reads the phase goal and success criteria
- Inspects the actual codebase (not the summary/report)
- Checks observable truths: things that must be TRUE for the goal to be met
- Verifies artifacts at 3 levels: exists → substantive → wired (connected to the system)
- Detects stubs: components returning `<div>Component</div>`, APIs returning "Not implemented", empty handlers
- Re-verification mode: when previous gaps exist, focuses only on failed items
Verification hierarchy:
- Pre-execution (plan-checker): Will these plans achieve the goal?
- During execution (deviation rules): Is the executor staying on track?
- Post-execution (verifier): Did the result actually deliver the goal?
Why this matters: Our Shepherd checks intent alignment ("does implementation match what was asked"), which is close but less structured. Anvil's approach is more systematic — it has explicit observable truths, artifact depth checks (exists vs substantive vs wired), and re-verification mode.
What we'd enhance: Strengthen Shepherd to do explicit goal-backward verification: define observable truths from the original request, verify artifacts are substantive (not stubs), and check wiring (components connected, APIs consumed, state rendered). Add stub detection patterns to Scrutinizer.
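The exists → substantive → wired ladder could look roughly like this. The heuristics here (a size floor, a few stub markers, a naive import scan) are illustrative assumptions, not Anvil's actual rules:

```python
# Illustrative exists -> substantive -> wired artifact check.
# Heuristics are assumptions: real verification would be goal-specific.
import re
from pathlib import Path

STUB_MARKERS = [r"return\s+null\s*;?\s*$", r"Not implemented",
                r"<div>\s*Component\s*</div>"]

def artifact_depth(path: str, codebase_files: list[str]) -> str:
    p = Path(path)
    if not p.exists():
        return "missing"
    text = p.read_text()
    if len(text.strip()) < 40 or any(re.search(m, text, re.M) for m in STUB_MARKERS):
        return "exists"          # present, but likely a stub
    # "wired": some other file in the codebase references this module by name
    stem = p.stem
    for other in codebase_files:
        if other != path and stem in Path(other).read_text():
            return "wired"
    return "substantive"         # real code, but nothing consumes it yet
```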
### P2: Decision Preservation (Locked/Flexible/Deferred)

Anvil has a `CONTEXT.md` pattern created during a "discuss phase" step:
```markdown
## Implementation Decisions

### Authentication
- **LOCKED**: Use JWT with httpOnly cookies (user decision)
- **Claude's discretion**: Token expiry duration, refresh strategy
- **Deferred**: OAuth providers (out of scope for this phase)
```
Key aspects:
- Locked decisions are NON-NEGOTIABLE — all downstream agents (planner, researcher, executor) MUST honor them
- Claude's discretion areas give the agent explicit freedom
- Deferred ideas are scope guardrails — explicitly captured but excluded
- Created once during discussion, consumed everywhere downstream
- Plan-checker verifies plans comply with locked decisions
Why this matters: Our /specify captures requirements but doesn't separate "user already decided this" from "Claude can choose." When Coder implements, it might re-debate something the user already settled. Our Shepherd validates intent alignment, but doesn't enforce decision preservation at the constraint level.
What we'd build: During /specify or /implement planning phase, capture user decisions with locked/flexible/deferred classification. Propagate locked decisions to Coder (must honor) and Shepherd (must verify compliance). This prevents re-debating and gives Claude clear autonomy boundaries.
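For downstream agents to consume these decisions programmatically, the classification could be parsed from the `CONTEXT.md` format shown above. A hedged sketch; the bold-marker parsing is an assumption based on that example:

```python
# Classify decision lines from a CONTEXT.md-style section.
# The "- **LABEL**:" format is taken from the example above; the parsing
# approach is an assumption, not Anvil's implementation.
import re

def classify_decisions(markdown: str) -> dict[str, list[str]]:
    patterns = {
        "locked":   r"-\s+\*\*LOCKED\*\*:\s*(.+)",
        "flexible": r"-\s+\*\*Claude's discretion\*\*:\s*(.+)",
        "deferred": r"-\s+\*\*Deferred\*\*:\s*(.+)",
    }
    return {key: re.findall(pat, markdown) for key, pat in patterns.items()}
```

A planner could then inject `locked` entries into every downstream agent prompt as hard constraints, while `flexible` entries become explicit degrees of freedom.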
## Secondary Findings (Documented for Future Reference)

### Quick Tasks with Composable Flags

Anvil has a `/quick` command for lightweight single-task execution with composable flags:
- Default: Just implement the task with atomic commits
- `--discuss`: Pre-planning discussion to surface gray areas, captures decisions
- `--full`: Plan checking (max 2 iterations) + post-execution verification
- `--discuss --full`: Full workflow with all guardrails
Relevance: We have a gap between ambient/QUICK (zero overhead) and full /implement (full ceremony). A /quick command would fill this middle ground — single task, optional discussion, optional verification.
### Deviation Rules for Autonomous Decision-Making
Anvil codifies explicit rules for when executors should auto-fix vs ask:
- Rule 1 (Auto-fix bugs): Code doesn't work as intended → fix immediately
- Rule 2 (Auto-add critical): Missing error handling, validation, auth → add it
- Rule 3 (Auto-fix blockers): Something prevents completing task → fix it
- Rule 4 (Ask about architecture): Significant structural changes → STOP and ask
Relevance: Our Coder agent has implicit behavior about when to ask vs proceed. Codifying this makes behavior predictable and documented.
### Health Check Command
Anvil runs 8 diagnostic checks on project integrity:
- E001-E005: Critical errors (missing dirs, invalid config JSON)
- W001-W009: Warnings (orphaned files, validation consistency)
- Auto-repair path for fixable issues (createConfig, resetConfig, regenerateState)
- Structured output with code/message/repairable flags
Relevance: A /health command that validates DevFlow installation, hook integrity, settings.json consistency, and plugin state would help troubleshooting.
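The structured output described above suggests a shape like the following. This is a sketch of the result structure only; field names and the repair wiring are assumptions:

```python
# Illustrative shape for structured health-check results with an
# auto-repair path; codes mirror the E/W ranges above, the rest is assumed.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CheckResult:
    code: str            # e.g. "E001" (critical) or "W003" (warning)
    message: str
    repairable: bool
    repair: Optional[Callable[[], None]] = None

def run_checks(checks: list,  # each check returns CheckResult or None
               auto_repair: bool = False) -> list:
    findings = []
    for check in checks:
        result = check()
        if result is None:       # check passed, nothing to report
            continue
        if auto_repair and result.repairable and result.repair:
            result.repair()      # e.g. createConfig / regenerateState
        findings.append(result)
    return findings
```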
### Disk-Status Inference for Progress
Instead of explicit status tracking, Anvil infers phase status from files on disk:
- Has PLAN.md? → "planned"
- Has PLAN.md + SUMMARY.md? → "complete"
- Has RESEARCH.md only? → "researched"
- Self-healing: manually adding artifacts auto-updates status
Relevance: Our .docs/reviews/ directory could provide similar intelligence for review progress without explicit state management.
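The inference rules above amount to a precedence check over artifacts on disk. A minimal sketch; the filenames come from the bullets above, while the precedence order and the "pending" fallback are assumptions:

```python
# Infer phase status from files present on disk; no explicit state tracking.
from pathlib import Path

def infer_phase_status(phase_dir: str) -> str:
    d = Path(phase_dir)
    if (d / "PLAN.md").exists() and (d / "SUMMARY.md").exists():
        return "complete"
    if (d / "PLAN.md").exists():
        return "planned"
    if (d / "RESEARCH.md").exists():
        return "researched"
    return "pending"  # assumed fallback for an untouched phase
```

Because status is derived rather than stored, manually dropping a `SUMMARY.md` into a phase directory "self-heals" the status with no extra bookkeeping.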
### Wave-Based Parallel Execution
Anvil groups plans by dependency and runs independent plans in parallel within waves:
- Wave 1: All independent plans (parallel)
- Wave 2: Plans depending on Wave 1 (wait, then parallel)
- Wave N: Sequential dependency chain
Relevance: Our /implement already parallelizes some agents, but for multi-task implementations, wave-based parallelism could improve throughput.
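Wave grouping is essentially dependency leveling (a layered topological sort). A sketch under assumed inputs — plans identified by string IDs, with a mapping from each plan to the plans it depends on:

```python
# Group plans into waves: each wave contains every plan whose dependencies
# are satisfied by earlier waves; plans within a wave can run in parallel.
def compute_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """deps maps plan -> set of plans it depends on."""
    remaining = {p: set(d) for p, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(p for p, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for p in ready:
            del remaining[p]
        for d in remaining.values():
            d.difference_update(ready)   # mark this wave's plans as satisfied
    return waves
```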
### Verification Pattern Templates (Stub Detection)
Detailed patterns for detecting real vs stub implementations:
- React stubs: `return <div>Component</div>`, `return null`, `onClick={() => {}}`
- API stubs: `return Response.json({ message: "Not implemented" })`
- Hook stubs: `return { user: null, login: () => {}, logout: () => {} }`
- Wiring gaps: `fetch` without `await`, query without `return`, state exists but not rendered
Relevance: Could enhance Scrutinizer's detection capabilities with these patterns.
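A few of the patterns above translate directly into regexes. These are hedged approximations for illustration, not Anvil's actual templates; real detection would need AST-level checks to avoid false positives:

```python
# Rough regex translations of the stub/wiring patterns listed above.
import re

PATTERNS = {
    "react_stub":     re.compile(r"return\s+(<div>\s*Component\s*</div>|null)\s*;?"),
    "api_stub":       re.compile(r'Response\.json\(\{\s*message:\s*"Not implemented"'),
    "empty_handler":  re.compile(r"onClick=\{\(\)\s*=>\s*\{\}\}"),
    "fetch_no_await": re.compile(r"(?<!await\s)\bfetch\("),  # crude wiring-gap check
}

def scan_for_stubs(source: str) -> list[str]:
    return [name for name, pat in PATTERNS.items() if pat.search(source)]
```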
### Per-Task Atomic Commits
Anvil commits after every task (sub-feature level), not just per feature. Enables git bisect at granular level and failure recovery (task 1 committed, task 2 failed → next session knows exactly where to resume).
Relevance: Our Coder does atomic commits per feature. Going more granular might add overhead, but the failure recovery benefit is worth considering.
### Session Handoff Files (`.continue-here.md`)
Explicit handoff files created on pause with YAML frontmatter + structured sections (current_state, completed_work, remaining_work, decisions_made, blockers, next_action). Deleted after resume.
Relevance: Our WORKING-MEMORY.md serves a similar purpose but is automatic (background write). Anvil's approach is more explicit and targeted. Both have merit.
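An illustrative handoff file assembled from the section names above. The field values are invented, and which fields live in the frontmatter versus the body is an assumption:

```markdown
---
current_state: paused mid-phase
blockers: none
next_action: resume remaining tasks in the active plan
---

## Completed Work
...

## Remaining Work
...

## Decisions Made
...
```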
## Architecture Comparison Summary

| Dimension | DevFlow | Anvil |
| --- | --- | --- |
| Core metaphor | Plugin marketplace + skills | Project lifecycle manager |
| Unit of work | Single task/PR | Phase → Plan → Task (3 levels) |
| State | `.memory/WORKING-MEMORY.md` (session) | `.planning/STATE.md` + `ROADMAP.md` + `CONTEXT.md` (persistent) |
| Quality | Skills loaded per-prompt (ambient) | Goal-backward verification at every stage |
| Execution | Agent spawns per command | Wave-based parallel execution of plans |
| Git | Atomic commits per feature | Atomic commits per task (sub-feature) |
| Context mgmt | Pre-compact hook + session start | Context monitor hook + bridge files + statusline |
| Model selection | Hardcoded per agent | Config-driven profiles with per-agent overrides |
| Session continuity | `WORKING-MEMORY.md` (background write) | `.continue-here.md` handoff files + `STATE.md` |
| Multi-runtime | Claude Code only | Claude Code + OpenCode + Gemini CLI + Codex |
## What NOT to Borrow
- Full milestone/phase/roadmap hierarchy — Too opinionated for DevFlow's composable philosophy
- Multi-runtime support — Dilutes focus; Claude Code is our target
- XML task format — Our markdown-based approach with agent prompts is cleaner
- Single-system monolith — Our plugin architecture is more composable
- Interactive project setup wizard — Heavy for DevFlow's "enhance every prompt" approach
## Priority Matrix

| Priority | Feature | Effort | Impact |
| --- | --- | --- | --- |
| P1 | Context monitor hook (graduated thresholds) | Medium | High |
| P1 | Model profiles (quality/balanced/budget) | Medium | High |
| P1 | Goal-backward verification in Shepherd | Low | Medium-High |
| P2 | Decision preservation (locked/flexible/deferred) | Low | Medium |
| P3 | `/quick` command with composable flags | Medium | Medium |
| P3 | Deviation rules for Coder | Low | Medium |
| P4 | Health check command | Low | Low |
| P4 | Stub detection patterns for Scrutinizer | Low | Low |
| P4 | Bridge files for inter-hook communication | Low | Low |