diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index 611d96a..58fcfd1 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -1,7 +1,7 @@ { "$schema": "https://anthropic.com/claude-code/marketplace.schema.json", "name": "context-engineering-kit", - "version": "2.2.2", + "version": "2.2.3", "description": "Hand-crafted collection of advanced context engineering techniques and patterns with minimal token footprint focused on improving agent result quality.", "owner": { "name": "NeoLabHQ", @@ -55,7 +55,7 @@ { "name": "sadd", "description": "Introduces skills for subagent-driven development, dispatches fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates.", - "version": "1.3.2", + "version": "1.3.3", "author": { "name": "Vlad Goncharov", "email": "vlad.goncharov@neolab.finance" diff --git a/.specs/research/research-resources.md b/.specs/research/research-resources.md index 2ef266b..b9ffb64 100644 --- a/.specs/research/research-resources.md +++ b/.specs/research/research-resources.md @@ -54,4 +54,6 @@ claude --agents '{ [] Add git workspaces usage for competitive model writing [] Research how git notes can be used during code writing and review [] Research how to add RAG style pipline with vector search to prepent relevant code to context window before code writing -[] Check "Prompting Science" series. https://arxiv.org/abs/2503.04818, https://arxiv.org/abs/2512.05858, https://chatpaper.com/paper/172346, https://arxiv.org/abs/2508.00614, https://www.researchgate.net/publication/392530384_Prompting_Science_Report_2_The_Decreasing_Value_of_Chain_of_Thought_in_Prompting \ No newline at end of file +[] Check "Prompting Science" series. 
https://arxiv.org/abs/2503.04818, https://arxiv.org/abs/2512.05858, https://chatpaper.com/paper/172346, https://arxiv.org/abs/2508.00614, https://www.researchgate.net/publication/392530384_Prompting_Science_Report_2_The_Decreasing_Value_of_Chain_of_Thought_in_Prompting +[] https://arxiv.org/html/2602.16666v1 - Towards a Science of AI Agent Reliability +[] https://arxiv.org/html/2601.06112v1 - ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions diff --git a/README.md b/README.md index 9efe841..251daa5 100644 --- a/README.md +++ b/README.md @@ -20,18 +20,23 @@ Hand-crafted collection of advanced context engineering techniques and patterns The marketplace is based on prompts used daily by our company developers for a long time, supplemented by plugins from benchmarked papers and high-quality projects. -> [!IMPORTANT] -> **v2 marketplace release:** [Spec-Driven Development plugin](https://cek.neolab.finance/plugins/sdd) was rewritten from scratch. It is now able to produce working code in 100% of cases on real-life production projects! - ## Key Features - **Simple to Use** - Easy to install and use without any dependencies. Contains automatically used skills and self-explanatory commands. - **Token-Efficient** - Carefully crafted prompts and architecture, preferring command-oriented skills with sub-agents over general information skills when possible, to minimize populating context with unnecessary information. - **Quality-Focused** - Each plugin is focused on meaningfully improving agent results in a specific area. -- **Granular** - Install only the plugins you need. Each plugin loads only its specific agents, commands, and skills. Each without overlap and redundant skills. +- **Granular** - Install only the plugins you need. Each plugin loads only its specific agents, commands, and skills. Each without overlap or redundant skills. 
- **Scientifically proven** - Plugins are based on proven techniques and patterns that were tested by well-trusted benchmarks and studies. - **Open-Standards** - Skills are based on [agentskills.io](https://agentskills.io) specification. The [SDD](https://cek.neolab.finance/plugins/sdd) plugin is based on the **Arc42** specification standard for software development documentation. +## News + +Updates from key releases: + +- **v2.0.0:** [Spec-Driven Development plugin](https://cek.neolab.finance/plugins/sdd) was rewritten from scratch. It is now able to produce working code in 99% of cases on real-life production projects! +- **v2.1.0:** [Spec-Driven Development plugin](https://cek.neolab.finance/plugins/sdd) agents include high-level code quality guidelines from [DDD plugin](https://cek.neolab.finance/plugins/ddd). +- **v2.2.0:** [Subagent-Driven Development plugin](https://cek.neolab.finance/plugins/sadd) now works as a distilled version of [SDD plugin](https://cek.neolab.finance/plugins/sdd) using meta-judge and judge sub-agents for specification generation on the fly and in parallel to implementation. [DDD plugin](https://cek.neolab.finance/plugins/ddd) now includes Clean Architecture, DDD, SOLID, Functional Programming, and other pattern examples as rules that are automatically added to the context during code writing. + ## Quick Start ### Step 1: Install Marketplace and Plugins @@ -106,100 +111,104 @@ In order to use this hook, you need to have `bun` installed. However, it is not You can find the complete Context Engineering Kit documentation [here](https://cek.neolab.finance). -However, the main plugin we recommend starting with is [Spec-Driven Development](https://cek.neolab.finance/plugins/sdd). 
- -## [Spec-Driven Development](https://cek.neolab.finance/plugins/sdd) - -Comprehensive specification-driven development workflow plugin that transforms prompts into production-ready implementations through structured planning, architecture design, and quality-gated execution. - -This plugin is designed to consistently produce working code. It was tested on real-life production projects by our team, and in 100% of cases, it generated working code aligned with the initial prompt. If you find a use case it cannot handle, please report it as an issue. - -### Key Features - -- **Development as compilation** — The plugin works like a "compilation" or "nightly build" for your development process: `task specs → run /sdd:implement → working code`. After writing your prompt, you can launch the plugin and expect a working result when you come back. The time it takes depends on task complexity — simple tasks may finish in 30 minutes, while complex ones can take a few days. -- **Benchmark-level quality in real life** — Model benchmarks improve with each release, yet real-world results usually stay the same. That's because benchmarks reflect the best possible output a model can achieve, whereas in practice LLMs tend to drift toward sub-optimal solutions that can be wrong or non-functional. This plugin uses a variety of patterns to keep the model working at its peak performance. -- **Customizable** — Balance result quality and process speed by adjusting command parameters. Learn more in the [Customization](./customization.md) section. -- **Developer time-efficient** — The overall process is designed to minimize developer time and reduce the number of interactions, while still producing results better than what a model can generate from scratch. However, overall quality is highly proportional to the time you invest in iterating and refining the specification. 
-- **Industry-standard** — The plugin's specification template is based on the arc42 standard, adjusted for LLM capabilities. Arc42 is a widely adopted, high-quality standard for software development documentation used by many companies and organizations. -- **Works best in complex or large codebases** — While most other frameworks work best for new projects and greenfield development, this plugin is designed to perform better the more existing code and well-structured architecture you have. At each planning phase it includes a **codebase impact analysis** step that evaluates which files may be affected and which patterns to follow to achieve the desired result. -- **Simple** — This plugin avoids unnecessary complexity and mainly uses just 3 commands, offloading process complexity to the model via multi-agent orchestration. `/sdd:implement` is a single command that produces working code from a task specification. To create that specification, you run `/sdd:add-task` and `/sdd:plan`, which analyze your prompt and iteratively refine the specification until it meets the required quality. - -### Quick Start - -```bash -/plugin install sdd@NeoLabHQ/context-engineering-kit -``` - -Then run the following commands: - -```bash -# create .specs/tasks/draft/design-auth-middleware.feature.md file with initial prompt -/sdd:add-task "Design and implement authentication middleware with JWT support" - -# write detailed specification for the task -/sdd:plan -# will move task to .specs/tasks/todo/ folder -``` - -Restart the Claude Code session to clear context and start fresh. 
Then run the following command: - -```bash -# implement the task -/sdd:implement @.specs/tasks/todo/design-auth-middleware.feature.md -# produces working implementation and moves the task to .specs/tasks/done/ folder -``` - -- [Detailed guide](https://cek.neolab.finance/guides/spec-driven-development) -- [Usage Examples](https://cek.neolab.finance/plugins/sdd/usage-examples) - -**Commands** - -- [/sdd:add-task](https://cek.neolab.finance/plugins/sdd/add-task) - Create task template file with initial prompt -- [/sdd:plan](https://cek.neolab.finance/plugins/sdd/plan) - Analyze prompt, generate required skills and refine task specification -- [/sdd:implement](https://cek.neolab.finance/plugins/sdd/implement) - Produce a working implementation of the task and verify it - -Additional commands useful before creating a task: - -- [/sdd:create-ideas](https://cek.neolab.finance/plugins/sdd/create-ideas) - Generate diverse ideas on a given topic using creative sampling techniques -- [/sdd:brainstorm](https://cek.neolab.finance/plugins/sdd/brainstorm) - Refine vague ideas into fully-formed designs through collaborative dialogue - -**Agents** - -| Agent | Description | Used By | -|-------|-------------|---------| -| `researcher` | Technology research, dependency analysis, best practices | `/sdd:plan` (Phase 2a) | -| `code-explorer` | Codebase analysis, pattern identification, architecture mapping | `/sdd:plan` (Phase 2b) | -| `business-analyst` | Requirements discovery, stakeholder analysis, specification writing | `/sdd:plan` (Phase 2c) | -| `software-architect` | Architecture design, component design, implementation planning | `/sdd:plan` (Phase 3) | -| `tech-lead` | Task decomposition, dependency mapping, risk analysis | `/sdd:plan` (Phase 4) | -| `team-lead` | Step parallelization, agent assignment, execution planning | `/sdd:plan` (Phase 5) | -| `qa-engineer` | Verification rubrics, quality gates, LLM-as-Judge definitions | `/sdd:plan` (Phase 6) | -| `developer` | Code 
implementation, TDD execution, quality review, verification | `/sdd:implement` | -| `tech-writer` | Technical documentation writing, API guides, architecture updates, lessons learned | `/sdd:implement` | - - -### Patterns - -Key patterns implemented in this plugin: - -- **Structured reasoning templates** — includes Zero-shot and Few-shot Chain of Thought, Tree of Thoughts, Problem Decomposition, and Self-Critique. Each is tailored to a specific agent and task, enabling sufficiently detailed decomposition so that isolated sub-agents can implement each step independently. -- **Multi-agent orchestration for context management** — Context isolation of independent agents prevents the context rot problem, essentially keeping LLMs at optimal performance at each step of the process. The main agent acts as an orchestrator that launches sub-agents and controls their work. -- **Quality gates based on LLM-as-Judge** — Evaluate the quality of each planning and implementation step using evidence-based scoring and predefined verification rubrics. This fully eliminates cases where an agent produces non-working or incorrect solutions. -- **Continuous learning** — Builds skills that the agent needs to implement a specific task, which it would otherwise not be able to perform from scratch. -- **Spec-driven development pattern** — Based on the arc42 specification standard, adjusted for LLM capabilities, to eliminate parts of the specification that add no value to implementation quality or that could degrade it. -- **MAKER** — An agent reliability pattern introduced in [Solving a Million-Step LLM Task with Zero Errors](https://arxiv.org/abs/2511.09030). It removes agent mistakes caused by accumulated context and hallucinations by utilizing clean-state agent launches, filesystem-based memory storage, and multi-agent voting during critical decision-making. - -### Vibe Coding vs. 
Specification-Driven Development - -This plugin is not a "vibe coding" solution, but out of the box it works like one. By default it is designed to work from a single prompt through to the end of the task, making reasonable assumptions and evidence-based decisions instead of constantly asking for clarification. This is because developer time is more valuable than model time, allowing the developer to decide how much time the task is worth. The plugin will always produce working results, but quality will be sub-optimal if no human feedback is provided. - -To improve quality, after generating a specification you can correct it or leave comments using `//`, then run the `/plan` command again with the `--refine` flag. You can also verify each planning and implementation phase by adding the `--human-in-the-loop` flag. According to most known research, human feedback is the most effective way to improve results. - -Our tests showed that even when the initially generated specification was incorrect due to lack of information or task complexity, the agent was still able to self-correct until it reached a working solution. However, it usually took much longer, spending time on wrong paths and stopping more frequently. To avoid this, we strongly advise decomposing tasks into smaller separate tasks with dependencies and reviewing the specification for each one. You can add dependencies between tasks as arguments to the `/add-task` command, and the model will link them together by adding a `depends_on` section to the task file frontmatter. - -Even if you don't want to spend much time on this process, you can still use the plugin for complex tasks without decomposition or human verification — but you will likely need tools like ralph-loop to keep the agent running for longer. - -Learn more about available customization options in [Customization](https://cek.neolab.finance/plugins/sdd/customization). 
+However, the main plugins we recommend starting from are [Subagent-Driven Development](https://cek.neolab.finance/plugins/sadd) and [Spec-Driven Development](https://cek.neolab.finance/plugins/sdd). + +### Agent Reliability Engineering + +The three plugins in this marketplace are designed to improve how accurately and consistently the agent follows provided instructions and reduce the number of hallucinations and bias toward incorrect solutions. They are not competitors but rather complementary to each other, because they allow you to balance reliability vs token cost. Here is a high-level comparison of different agent usage approaches vs probability to receive results that are fully accurate and include zero hallucinations based on task complexity: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Approach | 1-3 files | 4-10 files | 10-20 files | 20+ files | Tokens Overhead | What does this mean in practice |
|----------|-----------|------------|-------------|-----------|-----------------|---------------------------------|
| One-shot prompt | 60%-80% | 30%-50% | 5%-30% | 1%-20% | 0 | Accuracy depends on model, but with context growth LLM quality degrades exponentially |
| `/reflect` | 68%-91% | 49%-71% | 13%-41% | 1%-30% | 1k-3k | Agent finds and fixes missed requirements on its own |
| `/reflect` + `/memorize` | 79%-87% | 60%-79% | 34%-42% | 5%-30% | 2k-5k | Agent extracts repeatable mistakes and avoids them during new tasks |
| `/do-and-judge` | 90% | 83% | 60% | 30% | 1.5x-3x | Mitigates context rot, bias, hallucinations and missed requirements using a Judge sub-agent |
| `/do-in-steps` | 92% | 90% | 71% | 50% | 3x-5x | Resolves all issues similarly to `/do-and-judge`, but separately per file group |
| `/plan` + `/implement` | 94% | 93% | 85% | 70% | 5x-20x | Performs the `/do-in-steps` flow, but the specification mitigates issues caused by inconsistent architecture and codebase size |
| `/brainstorm` + `/plan` + `/implement` | 95% | 95% | 90% | 80% | 5x-20x | Brainstorming decreases the number of incorrect decisions and missed requirements |
| `/plan` + human review + `/implement` | 99% | 99% | 99% | 95% | 5x-35x | Human review mitigates misunderstanding of requirements by LLM |
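A rough way to see why the judge-and-retry approaches in the table above score higher: each retry is a fresh, clean-context attempt, and the judge loop allows up to 3 attempts per target. A minimal illustrative sketch — the independence assumption is ours, and this snippet is not part of any plugin:

```python
# Illustrative sketch only: how up to 3 judged retries lift the chance of a
# fully accurate result, assuming each attempt is independent. "p" is the
# single-attempt accuracy, e.g. a one-shot-prompt figure from the table above.
def success_with_retries(p: float, max_attempts: int = 3) -> float:
    """Probability that at least one of max_attempts independent attempts passes the judge."""
    return 1 - (1 - p) ** max_attempts

# A 60% single-attempt accuracy becomes roughly 94% after up to 3 judged attempts:
print(round(success_with_retries(0.60), 2))  # → 0.94
```

In practice retries are not fully independent, which is one reason the measured numbers in the table sit below this idealized bound.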
+ +> Reliability metrics are based on real development usage on production projects for more than 6 months. ## Plugins List @@ -212,7 +221,7 @@ To view all available plugins: - [Reflexion](https://cek.neolab.finance/plugins/reflexion) - Introduces feedback and refinement loops to improve output quality. - [Spec-Driven Development](https://cek.neolab.finance/plugins/sdd) - Introduces commands for specification-driven development, based on Continuous Learning + LLM-as-Judge + Agent Swarm. Achieves **development as compilation** through reliable code generation. - [Code Review](https://cek.neolab.finance/plugins/code-review) - Introduces codebase and PR review commands and skills using multiple specialized agents. -- [Git](https://cek.neolab.finance/plugins/git) - Introduces commands for commit and PRs creation. +- [Git](https://cek.neolab.finance/plugins/git) - Introduces commands for commit and PR creation. - [Test-Driven Development](https://cek.neolab.finance/plugins/tdd) - Introduces commands for test-driven development, common anti-patterns and skills for testing using subagents. - [Subagent-Driven Development](https://cek.neolab.finance/plugins/sadd) - Introduces skills for subagent-driven development, which dispatches a fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates. - [Domain-Driven Development](https://cek.neolab.finance/plugins/ddd) - Introduces commands to update CLAUDE.md with best practices for domain-driven development, focused on code quality, and includes Clean Architecture, SOLID principles, and other design patterns. @@ -225,7 +234,7 @@ To view all available plugins: ### [Reflexion](https://cek.neolab.finance/plugins/reflexion) -Collection of commands that force the LLM to reflect on previous response and output. Includes **automatic reflection hooks** that trigger when you include "reflect" in your prompt. 
+Collection of commands that force the LLM to reflect on the previous response and output. Includes **automatic reflection hooks** that trigger when you include "reflect" in your prompt. **How to install** @@ -243,6 +252,14 @@ Collection of commands that force the LLM to reflect on previous response and ou - **Automatic Reflection Hook** - Triggers `/reflexion:reflect` automatically when "reflect" appears in your prompt +**Theoretical Foundation** + +The plugin is based on papers like [Self-Refine](https://arxiv.org/abs/2303.17651) and [Reflexion](https://arxiv.org/abs/2303.11366). These techniques improve the output of large language models by introducing feedback and refinement loops. + +They are proven to **increase output quality by 8–21%** based on both automatic metrics and human preferences across seven diverse tasks, including dialogue generation, coding, and mathematical reasoning, when compared to standard one-step model outputs. + +On top of that, the plugin is based on the [Agentic Context Engineering](https://arxiv.org/abs/2510.04618) paper that uses memory updates after reflection, and **consistently outperforms strong baselines by 10.6%** on agents. + ### [Code Review](https://cek.neolab.finance/plugins/code-review) Comprehensive code review commands using multiple specialized agents for thorough code quality evaluation. 
@@ -338,9 +355,103 @@ Execution framework for competitive generation, multi-agent evaluation, and suba **Skills** -- [subagent-driven-development](https://cek.neolab.finance/plugins/sadd/subagent-driven-development) - Dispatches fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates +- [subagent-driven-development](https://cek.neolab.finance/plugins/sadd/subagent-driven-development) - Dispatches a fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates - [multi-agent-patterns](https://cek.neolab.finance/plugins/sadd/multi-agent-patterns) - Design multi-agent architectures (supervisor, peer-to-peer, hierarchical) for complex tasks exceeding single-agent context limits +### [Spec-Driven Development](https://cek.neolab.finance/plugins/sdd) + +Comprehensive specification-driven development workflow plugin that transforms prompts into production-ready implementations through structured planning, architecture design, and quality-gated execution. + +This plugin is designed to consistently produce working code. It was tested on real-life production projects by our team, and in 100% of cases, it generated working code aligned with the initial prompt. If you find a use case it cannot handle, please report it as an issue. + +#### Key Features + +- **Development as compilation** — The plugin works like a "compilation" or "nightly build" for your development process: `task specs → run /sdd:implement → working code`. After writing your prompt, you can launch the plugin and expect a working result when you come back. The time it takes depends on task complexity — simple tasks may finish in 30 minutes, while complex ones can take a few days. +- **Benchmark-level quality in real life** — Model benchmarks improve with each release, yet real-world results usually stay the same. 
That's because benchmarks reflect the best possible output a model can achieve, whereas in practice LLMs tend to drift toward sub-optimal solutions that can be wrong or non-functional. This plugin uses a variety of patterns to keep the model working at its peak performance. +- **Customizable** — Balance result quality and process speed by adjusting command parameters. Learn more in the [Customization](./customization.md) section. +- **Developer time-efficient** — The overall process is designed to minimize developer time and reduce the number of interactions, while still producing results better than what a model can generate from scratch. However, overall quality is highly proportional to the time you invest in iterating and refining the specification. +- **Industry-standard** — The plugin's specification template is based on the arc42 standard, adjusted for LLM capabilities. Arc42 is a widely adopted, high-quality standard for software development documentation used by many companies and organizations. +- **Works best in complex or large codebases** — While most other frameworks work best for new projects and greenfield development, this plugin is designed to perform better the more existing code and well-structured architecture you have. At each planning phase it includes a **codebase impact analysis** step that evaluates which files may be affected and which patterns to follow to achieve the desired result. +- **Simple** — This plugin avoids unnecessary complexity and mainly uses just 3 commands, offloading process complexity to the model via multi-agent orchestration. `/sdd:implement` is a single command that produces working code from a task specification. To create that specification, you run `/sdd:add-task` and `/sdd:plan`, which analyze your prompt and iteratively refine the specification until it meets the required quality. 
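For intuition, a task specification is a plain markdown file with YAML frontmatter. A purely hypothetical sketch of its skeleton — the field names below are illustrative guesses, not the plugin's actual template, except `depends_on`, which the plugin uses to link dependent tasks:

```yaml
# Hypothetical skeleton of a task file such as
# .specs/tasks/todo/design-auth-middleware.feature.md.
# Fields other than depends_on are illustrative, not the real template.
---
depends_on:
  - setup-user-model.feature   # illustrative dependency, linked via /sdd:add-task arguments
---
```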
+ +#### Quick Start + +```bash +/plugin install sdd@NeoLabHQ/context-engineering-kit +``` + +Then run the following commands: + +```bash +# create .specs/tasks/draft/design-auth-middleware.feature.md file with initial prompt +/sdd:add-task "Design and implement authentication middleware with JWT support" + +# write detailed specification for the task +/sdd:plan +# will move task to .specs/tasks/todo/ folder +``` + +Restart the Claude Code session to clear context and start fresh. Then run the following command: + +```bash +# implement the task +/sdd:implement @.specs/tasks/todo/design-auth-middleware.feature.md +# produces working implementation and moves the task to .specs/tasks/done/ folder +``` + +- [Detailed guide](https://cek.neolab.finance/guides/spec-driven-development) +- [Usage Examples](https://cek.neolab.finance/plugins/sdd/usage-examples) + +**Commands** + +- [/sdd:add-task](https://cek.neolab.finance/plugins/sdd/add-task) - Create task template file with initial prompt +- [/sdd:plan](https://cek.neolab.finance/plugins/sdd/plan) - Analyze prompt, generate required skills and refine task specification +- [/sdd:implement](https://cek.neolab.finance/plugins/sdd/implement) - Produce a working implementation of the task and verify it + +Additional commands useful before creating a task: + +- [/sdd:create-ideas](https://cek.neolab.finance/plugins/sdd/create-ideas) - Generate diverse ideas on a given topic using creative sampling techniques +- [/sdd:brainstorm](https://cek.neolab.finance/plugins/sdd/brainstorm) - Refine vague ideas into fully-formed designs through collaborative dialogue + +**Agents** + +| Agent | Description | Used By | +|-------|-------------|---------| +| `researcher` | Technology research, dependency analysis, best practices | `/sdd:plan` (Phase 2a) | +| `code-explorer` | Codebase analysis, pattern identification, architecture mapping | `/sdd:plan` (Phase 2b) | +| `business-analyst` | Requirements discovery, stakeholder analysis, 
specification writing | `/sdd:plan` (Phase 2c) | +| `software-architect` | Architecture design, component design, implementation planning | `/sdd:plan` (Phase 3) | +| `tech-lead` | Task decomposition, dependency mapping, risk analysis | `/sdd:plan` (Phase 4) | +| `team-lead` | Step parallelization, agent assignment, execution planning | `/sdd:plan` (Phase 5) | +| `qa-engineer` | Verification rubrics, quality gates, LLM-as-Judge definitions | `/sdd:plan` (Phase 6) | +| `developer` | Code implementation, TDD execution, quality review, verification | `/sdd:implement` | +| `tech-writer` | Technical documentation writing, API guides, architecture updates, lessons learned | `/sdd:implement` | + + +#### Patterns + +Key patterns implemented in this plugin: + +- **Structured reasoning templates** — includes Zero-shot and Few-shot Chain of Thought, Tree of Thoughts, Problem Decomposition, and Self-Critique. Each is tailored to a specific agent and task, enabling sufficiently detailed decomposition so that isolated sub-agents can implement each step independently. +- **Multi-agent orchestration for context management** — Context isolation of independent agents prevents the context rot problem, essentially keeping LLMs at optimal performance at each step of the process. The main agent acts as an orchestrator that launches sub-agents and controls their work. +- **Quality gates based on LLM-as-Judge** — Evaluate the quality of each planning and implementation step using evidence-based scoring and predefined verification rubrics. This fully eliminates cases where an agent produces non-working or incorrect solutions. +- **Continuous learning** — Builds skills that the agent needs to implement a specific task, which it would otherwise not be able to perform from scratch. 
+- **Spec-driven development pattern** — Based on the arc42 specification standard, adjusted for LLM capabilities, to eliminate parts of the specification that add no value to implementation quality or that could degrade it. +- **MAKER** — An agent reliability pattern introduced in [Solving a Million-Step LLM Task with Zero Errors](https://arxiv.org/abs/2511.09030). It removes agent mistakes caused by accumulated context and hallucinations by utilizing clean-state agent launches, filesystem-based memory storage, and multi-agent voting during critical decision-making. + +#### Vibe Coding vs. Specification-Driven Development + +This plugin is not a "vibe coding" solution, but out of the box it works like one. By default it is designed to work from a single prompt through to the end of the task, making reasonable assumptions and evidence-based decisions instead of constantly asking for clarification. This is because developer time is more valuable than model time. As a result, the plugin is designed to allow the developer to decide how much time the task is worth. The plugin will always produce working results, but quality will be sub-optimal if no human feedback is provided. + +To improve quality, after generating a specification you can correct it or leave comments using `//`, then run the `/plan` command again with the `--refine` flag. You can also verify each planning and implementation phase by adding the `--human-in-the-loop` flag. According to most known research, human feedback is the most effective way to improve results. + +Our tests showed that even when the initially generated specification was incorrect due to lack of information or task complexity, the agent was still able to self-correct until it reached a working solution. However, it usually takes much longer, and results in the agent spending time on wrong paths and stopping more frequently. 
To avoid this, we strongly advise decomposing tasks into smaller separate tasks with dependencies and reviewing the specification for each one independently. You can add dependencies between tasks as arguments to the `/add-task` command, and the agent will link them together by adding a `depends_on` section to the task file frontmatter. + +Even if you don't want to spend much time on this process, you can still use the plugin for complex tasks without decomposition or human verification — but you will likely need tools like ralph-loop to keep the agent running for longer. + +Learn more about available customization options in [Customization](https://cek.neolab.finance/plugins/sdd/customization). + + ### [Domain-Driven Development](https://cek.neolab.finance/plugins/ddd) Commands for setting up domain-driven development best practices focused on code quality. @@ -461,7 +572,7 @@ Commands and skills for creating and refining Claude Code extensions. **Skills** -- [prompt-engineering](https://cek.neolab.finance/plugins/customaize-agent/prompt-engineering) - Well known prompt engineering techniques and patterns, includes Anthropic Best Practices and Agent Persuasion Principles +- [prompt-engineering](https://cek.neolab.finance/plugins/customaize-agent/prompt-engineering) - Well-known prompt engineering techniques and patterns, includes Anthropic Best Practices and Agent Persuasion Principles - [context-engineering](https://cek.neolab.finance/plugins/customaize-agent/context-engineering) - Deep understanding of context mechanics: attention budget, progressive disclosure, lost-in-middle effect, and practical optimization patterns - [agent-evaluation](https://cek.neolab.finance/plugins/customaize-agent/agent-evaluation) - Evaluation frameworks for agent systems: LLM-as-Judge, multi-dimensional rubrics, bias mitigation, and the 95% performance finding diff --git a/docs/plugins/sadd/do-in-parallel.md b/docs/plugins/sadd/do-in-parallel.md index b1d20ec..3b924f1 100644 --- 
a/docs/plugins/sadd/do-in-parallel.md +++ b/docs/plugins/sadd/do-in-parallel.md @@ -1,15 +1,15 @@ # /do-in-parallel -Execute tasks in parallel across multiple targets with intelligent model selection, independence validation, meta-judge evaluation specification, LLM-as-a-judge verification, and quality-focused prompting. +Execute tasks in parallel across multiple targets with intelligent model selection, independence validation, requirement grouping analysis, meta-judge evaluation specification, LLM-as-a-judge verification, and quality-focused prompting. -- Purpose - Execute the same task across multiple independent targets in parallel -- Pattern - Supervisor/Orchestrator with parallel dispatch, context isolation, and meta-judge + judge verification +- Purpose - Execute tasks across multiple independent targets in parallel +- Pattern - Supervisor/Orchestrator with parallel dispatch, requirement grouping, context isolation, and meta-judge + judge verification - Output - Multiple solutions, one per target, with aggregated summary - Efficiency - Dramatic time savings through concurrent execution of independent work ## Quality Assurance -Enhanced verification with Zero-shot CoT, Constitutional AI self-critique, meta-judge evaluation specification, LLM-as-a-judge verification, and intelligent model selection +Enhanced verification with Zero-shot CoT, Constitutional AI self-critique, requirement grouping analysis, meta-judge evaluation specification, LLM-as-a-judge verification, and intelligent model selection ## Pattern: Parallel Orchestration with Judge Verification @@ -25,17 +25,21 @@ Phase 2: Task Analysis with Zero-shot CoT │ (high/medium/low) │ ├─ Independence Validation ──────────────────┤ │ CRITICAL: Must pass before proceeding │ + ├─ Requirement Grouping Analysis ────────────┤ + │ (repeatable / shared / independent) │ └────────────────────────────────────────────┘ │ Phase 3: Model and Agent Selection Is task COMPLEX? → Opus Is task SIMPLE/MECHANICAL? 
→ Haiku + Is output LARGE but task not complex? → Sonnet Otherwise → Opus (default for balanced work) │ -Phase 3.5: Dispatch Meta-Judge (ONCE) - Single sadd:meta-judge agent (Opus) - → Evaluation Specification YAML - (Reused for ALL targets — not re-run per target) +Phase 3.5: Dispatch Meta-Judges (Grouped, All in Parallel) + One per repeatable group (reusable spec) + One per shared group (combined spec) + One per independent task (task-specific spec) + (Specs reused for ALL retries — never re-run) │ Phase 4: Construct Per-Target Prompts [CoT Prefix] + [Task Body] + [Self-Critique Suffix] @@ -47,6 +51,7 @@ Phase 5: Parallel Dispatch and Judge Verification └─ Agent 3 (target C) ─→ Judge 3 (+meta-spec) ─┘ │ Each target: Implement → Judge (with meta-spec) → Retry (max 3) + Shared groups: ONE judge reviews ALL related changes together │ Phase 6: Collect and Summarize Results Aggregate outcomes, report failures, suggest remediation @@ -54,38 +59,71 @@ Phase 6: Collect and Summarize Results ## Execution Flow +### Independent / Repeatable Flow (one judge per task) + ``` ┌─────────────────────────────────────────────────────────────────────────┐ │ │ -│ Phase 3.5: Meta-Judge (ONCE) │ -│ ┌──────────────────────────────────────┐ │ -│ │ Meta-Judge (Opus, sadd:meta-judge) │ │ -│ │ → Evaluation Specification YAML │ │ -│ └──────────────────┬───────────────────┘ │ -│ │ (shared across all targets) │ -│ ▼ │ -│ Parallel Targets │ +│ Phase 3.5: Meta-Judge Dispatch (ALL in parallel) │ +│ │ +│ Independent: Repeatable Group: │ +│ ┌──────────────┐ ┌─────────────────────┐ │ +│ │ Meta-Judge A │ │ Meta-Judge (shared) │ │ +│ │ (Opus) │ │ (Opus) │ │ +│ │ → Spec YAML A │ │ → Reusable Spec YAML │ │ +│ └──────┬───────┘ └──────────┬──────────┘ │ +│ │ ┌─────┴─────┐ │ +│ ▼ ▼ ▼ │ +│ Phase 5: Implementation (ALL in parallel, one per task) │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Implementer A │ │ Implementer B │ │ Implementer C │ │ +│ └──────┬───────┘ └──────┬───────┘ 
└──────┬───────┘ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ Phase 5.2: Judge per task (after ALL implementors complete) │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Judge A │ │ Judge B │ │ Judge C │ │ +│ │ +Spec YAML A │ │ +Reusable Spec│ │ +Reusable Spec│ │ +│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ +│ ▼ ▼ ▼ │ +│ Parse Verdict (per target) │ +│ ├─ PASS (≥4)? → Complete │ +│ ├─ Soft PASS (≥3 + low priority issues)? → Done │ +│ └─ FAIL (<4)? → Retry (max 3 per target) │ +│ │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +### Shared Flow (one judge for the group) + +``` +┌─────────────────────────────────────────────────────────────────────────┐ │ │ -│ Target A Target B Target C │ -│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ -│ │Implementer│ │Implementer│ │Implementer│ │ -│ │(parallel) │ │(parallel) │ │(parallel) │ │ -│ └─────┬────┘ └─────┬────┘ └─────┬────┘ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ -│ │ Judge │ │ Judge │ │ Judge │ │ -│ │(sadd:judge)│ │(sadd:judge)│ │(sadd:judge)│ │ -│ │+meta-spec │ │+meta-spec │ │+meta-spec │ │ -│ └─────┬────┘ └─────┬────┘ └─────┬────┘ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ ┌──────────────────────────────────────────────────┐ │ -│ │ Parse Verdict (per target) │ │ -│ │ ├─ PASS (≥4)? → Complete │ │ -│ │ ├─ Soft PASS (≥3 + low priority issues)? → Done │ │ -│ │ └─ FAIL (<4)? 
→ Retry (max 3 per target) │ │ -│ └──────────────────────────────────────────────────┘ │ +│ Phase 3.5: Meta-Judge for Shared Group │ +│ ┌──────────────────────┐ │ +│ │ Meta-Judge (combined) │ │ +│ │ (Opus) │ │ +│ │ → Combined Spec YAML │ │ +│ └──────────┬───────────┘ │ +│ ┌────┴────┐ │ +│ ▼ ▼ │ +│ Phase 5: Implementation (one per task, in parallel) │ +│ ┌──────────────┐ ┌──────────────┐ │ +│ │ Implementer X │ │ Implementer Y │ │ +│ └──────┬───────┘ └──────┬───────┘ │ +│ │ │ │ +│ └────────┬─────────┘ │ +│ ▼ │ +│ Phase 5.2: ONE Judge for entire group │ +│ ┌────────────────────────────────┐ │ +│ │ Judge (shared) │ │ +│ │ +Combined Spec YAML │ │ +│ │ +ALL implementation outputs │ │ +│ └──────────────┬─────────────────┘ │ +│ ▼ │ +│ Parse per-task verdicts → Retry ONLY failing task(s) if needed │ │ │ └─────────────────────────────────────────────────────────────────────────┘ ``` @@ -134,18 +172,27 @@ Phase 6: Collect and Summarize Results ## Meta-Judge and Judge Verification -A single `sadd:meta-judge` agent generates a tailored evaluation specification (rubrics, checklists, scoring criteria) before any implementation begins. This specification is reused for ALL per-target judge verifications -- it is never re-run per target or on retries. +Meta-judges are dispatched based on a requirement grouping analysis performed before any implementation begins. 
The number and type of meta-judges depend on how tasks are grouped: + +| Grouping Type | When to Apply | Meta-Judges | Judges | +|---------------|---------------|-------------|--------| +| **Repeatable** | Same task applied across multiple targets (e.g., "add tests to all 3 modules") | ONE shared meta-judge producing a reusable spec | One per task, each receiving the SAME shared spec | +| **Shared** | Interdependent tasks reviewed together (e.g., "implement S3 adapter AND integrate it") | ONE combined meta-judge for the group | ONE judge for the entire group, reviewing all changes together | +| **Independent** | Fully independent tasks with no grouping benefit | One per task | One per task | -Each parallel agent is then verified by an independent `sadd:judge` agent that applies the meta-judge specification mechanically. +Each meta-judge generates a tailored evaluation specification (rubrics, checklists, scoring criteria). Specifications are reused across all targets in their group and for all retries -- they are never re-generated per target or per retry. All meta-judges are launched in parallel regardless of grouping type. + +Each implementation agent is then verified by an independent `sadd:judge` agent that applies the appropriate meta-judge specification mechanically.
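The agent-count arithmetic implied by the grouping table above can be sketched as a small helper (a hypothetical illustration with assumed names -- the skill performs this analysis in-prompt, not by running code):

```python
# Hypothetical sketch of the dispatch counts implied by the grouping
# table: repeatable groups share one meta-judge spec across per-task
# judges, shared groups use one judge for the whole group, and
# independent tasks get task-specific meta-judges and judges.

def agent_counts(groups):
    """groups: list of (kind, n_tasks) tuples, where kind is
    'repeatable', 'shared', or 'independent'.
    Returns (meta_judges, implementers, judges)."""
    meta_judges = implementers = judges = 0
    for kind, n in groups:
        implementers += n          # always one implementer per task
        if kind == "repeatable":
            meta_judges += 1       # one reusable spec for the group
            judges += n            # one judge per task, same spec
        elif kind == "shared":
            meta_judges += 1       # one combined spec for the group
            judges += 1            # one judge reviews the whole group
        else:                      # independent
            meta_judges += n       # task-specific specs
            judges += n
    return meta_judges, implementers, judges

# "Same refactoring across 5 files" => 1 + 5 + 5 = 11 agents
assert agent_counts([("repeatable", 5)]) == (1, 5, 5)
# "2 shared tasks + 3 repeatable tasks" => 2 + 5 + 4 = 11 agents
assert agent_counts([("shared", 2), ("repeatable", 3)]) == (2, 5, 4)
```

These counts match the savings examples in the Token Optimization section: grouping reduces only meta-judges and judges, never implementers.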
| Aspect | Details | |--------|---------| -| **Meta-Judge** | Single `sadd:meta-judge` (Opus) dispatched once before implementation | -| **Judge** | Per-target `sadd:judge` (Opus) applying the shared meta-judge spec | +| **Meta-Judge** | `sadd:meta-judge` (Opus) dispatched per group or independent task, all in parallel | +| **Judge** | `sadd:judge` (Opus) per target (independent/repeatable) or per group (shared) | | **Threshold** | Score >=4/5.0 for PASS; soft PASS at >=3 if all issues are low priority | | **Max Retries** | 3 retries per target (same meta-judge spec reused on retries) | | **Isolation** | Each target's failure doesn't affect others | | **Feedback Loop** | Judge ISSUES passed to retry implementation | +| **Shared Retries** | Only failing implementation agent(s) are retried, not the entire group | ### Scoring Scale @@ -163,8 +210,9 @@ Each parallel agent is then verified by an independent `sadd:judge` agent that a |-----------|-------|---------| | Zero-shot Chain-of-Thought | Phase 4 (prompt prefix) | Structured reasoning before implementation | | Constitutional AI Self-Critique | Phase 4 (prompt suffix) | Internal verification before submission | -| Meta-Judge Specification | Phase 3.5 (single dispatch) | Tailored rubrics and checklists generated once, reused for all targets | -| LLM-as-a-Judge | Phase 5 (per-target) | External verification applying meta-judge spec mechanically | +| Requirement Grouping | Phase 2 (analysis) | Reduces meta-judges and judges by identifying repeatable and shared task patterns | +| Meta-Judge Specification | Phase 3.5 (grouped dispatch) | Tailored rubrics and checklists generated per group/task, reused for all retries | +| LLM-as-a-Judge | Phase 5 (per-target or per-group) | External verification applying meta-judge spec mechanically | | Retry with Feedback | Phase 5 (on failure) | Iterative improvement using judge-identified issues | ## Context Isolation Best Practices @@ -173,6 +221,7 @@ Each parallel agent is then 
verified by an independent `sadd:judge` agent that a - **No cross-references**: Don't tell Agent A about Agent B's target - **Let them discover**: Sub-agents read files to understand local patterns - **File system as truth**: Changes are coordinated through the filesystem +- **Track pre-existing changes**: Pass context about prior modifications to each agent's judge to prevent attribution confusion between pre-existing and current changes ## Error Handling @@ -186,8 +235,47 @@ Each parallel agent is then verified by an independent `sadd:judge` agent that a **Critical Rules:** - Each target is isolated - failures don't affect other targets - NEVER continue past max retries without user input +- NEVER try to "fix forward" without addressing judge issues +- NEVER skip judge verification +- STOP and report if context is missing (don't guess) - Continue with successful targets even if some fail - Report all failures clearly in final summary +- For shared groups, only retry the specific failing implementation agent(s), not the entire group + +## Token Optimization via Requirement Grouping + +Requirement grouping analysis reduces the total number of agents dispatched by sharing meta-judges and judges across related tasks. The key insight is that tasks sharing the same pattern (repeatable) or requiring joint review (shared) do not each need their own meta-judge. + +### How It Works + +1. A **single meta-judge** is dispatched per group (not per target) before implementation begins +2. Its evaluation specification YAML is **reused across ALL targets** in that group +3.
The meta-judge spec is also **reused on retries** -- it is never re-generated + +### Agent Count Formula + +| Grouping | Meta-Judges | Implementers | Judges | Total | +|----------|-------------|--------------|--------|-------| +| **Without grouping** (all independent) | N | N | N | 3N | +| **With grouping** (repeatable/shared) | G (groups) | N | N (repeatable) or G (shared) | G + 2N or 2G + N | + +For repeatable groups (the most common case): G meta-judges + N implementers + N judges = **G + 2N** agents. + +### Concrete Savings Examples + +| Scenario | Targets | Without Grouping | With Grouping | Savings | +|----------|---------|-----------------|---------------|---------| +| Same refactoring across 5 files (1 repeatable group) | 5 | 15 agents (5+5+5) | 11 agents (1+5+5) | 4 agents (27%) | +| Same task across 3 files + 1 independent task | 4 | 12 agents (4+4+4) | 10 agents (2+4+4) | 2 agents (17%) | +| 2 shared tasks + 3 repeatable tasks | 5 | 15 agents (5+5+5) | 11 agents (2+5+4) | 4 agents (27%) | +| 3 fully independent tasks | 3 | 9 agents (3+3+3) | 9 agents (3+3+3) | 0 (no reduction possible) | + +### Key Principles + +- Implementation agents are **always isolated** -- one per task, never shared. Only meta-judges and judges can be grouped +- When in doubt, default to **independent** grouping. 
Over-grouping risks incorrect evaluation specs; independent tasks always receive correct, task-specific evaluation +- All meta-judges are launched **in parallel** regardless of grouping type +- Implementers launch immediately after their meta-judge completes, without waiting for all meta-judges ## Theoretical Foundation diff --git a/plugins/sadd/.claude-plugin/plugin.json b/plugins/sadd/.claude-plugin/plugin.json index 0e05a1b..7fab63b 100644 --- a/plugins/sadd/.claude-plugin/plugin.json +++ b/plugins/sadd/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "sadd", - "version": "1.3.2", + "version": "1.3.3", "description": "Introduces skills for subagent-driven development, dispatches fresh subagent for each task with code review between tasks, enabling fast iteration with quality gates.", "author": { "name": "Vlad Goncharov", diff --git a/plugins/sadd/skills/do-and-judge/SKILL.md b/plugins/sadd/skills/do-and-judge/SKILL.md index 4912e01..22915ec 100644 --- a/plugins/sadd/skills/do-and-judge/SKILL.md +++ b/plugins/sadd/skills/do-and-judge/SKILL.md @@ -253,6 +253,32 @@ CRITICAL: Provide to the judge EXACT meta-judge's evaluation specification YAML, - Summary section (files modified, key changes) - Paths to files modified +#### 3.1 Analyze the Pre-existing Changes Section + +Before dispatching the judge, assess whether there are pre-existing changes in the codebase that the judge needs to be aware of. The "Pre-existing Changes" section prevents the judge from confusing prior modifications with the current implementation agent's work. 
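As a rough sketch of this assessment (hypothetical helper and parameter names -- the skill makes this call from conversation context and lightweight git output, not by executing code), the include/omit decision reduces to a simple predicate:

```python
# Hypothetical sketch of the include/omit decision for the
# "Pre-existing Changes" section of the judge prompt.

def include_preexisting_section(prior_task_summaries, user_modified_files):
    """prior_task_summaries: high-level summaries of earlier
    do-and-judge runs in this session.
    user_modified_files: files the user changed before this task.

    Note: an implementation agent's own earlier attempts within the
    CURRENT task are never pre-existing changes and never appear in
    either argument -- retries within a task do not trigger this section.
    """
    return bool(prior_task_summaries or user_modified_files)

assert include_preexisting_section([], []) is False
assert include_preexisting_section(["add auth module"], []) is True
```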
+ +**When to include:** + +- Previous do-and-judge task runs completed earlier in the same session +- User's manual modifications made before invoking the skill (visible from conversation context or in git) +- Changes from other tools or agents that ran before this task + +**When to omit:** + +- This is the first task with no known prior changes — omit the section entirely +- On retries within the SAME task, do NOT include the implementation agent's own previous attempt as "pre-existing changes" — those are part of the current task's iteration cycle + +**Content guidelines:** + +- Use a high-level summary: task description, list of affected files/modules, general nature of changes (created, modified, deleted) +- Do NOT include code blocks, diffs, or line-level details — keep it concise +- Label the source clearly: "Previous Task: {description}", "User modifications (before current task)", etc. +- If multiple sources of pre-existing changes exist, use separate subsections for each + +CRITICAL: Avoid reading the full codebase or git history; just use a high-level git diff/status to determine which files were changed, or use conversation context to determine whether there are any pre-existing changes. + +#### 3.2 Launch Judge with Prompt and Specification YAML + **Judge prompt template:** ```markdown @@ -263,6 +289,17 @@ CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}` ## User Prompt {Original task description from user} +{IF pre-existing changes are known, include the following section — otherwise omit entirely} + +## Pre-existing Changes (Context Only) + +The following changes were made BEFORE the current implementation agent started working. They are NOT part of the current task's output. Focus your evaluation on the current task's changes. Only verify pre-existing changed files/logic if they directly relate to the current task requirements.
+ +### {Source of changes: e.g., "Previous Task: {task description}" or "User modifications (before current task)"} +{High-level summary: what was done, which files/modules were created or modified} + +{END conditional section} + ## Evaluation Specification ```yaml @@ -289,7 +326,7 @@ CRITICAL: NEVER provide score threshold, in any format, including `threshold_pas ``` Use Task tool: - description: "Judge: {brief task summary}" - - prompt: {judge verification prompt with exact meta-judge specification YAML} + - prompt: {judge verification prompt with exact meta-judge specification YAML, and Pre-existing Changes section if applicable} - model: opus - subagent_type: "sadd:judge" ``` @@ -440,38 +477,143 @@ Awaiting your decision... ## Examples -### Example 1: Simple Refactoring (Pass on First Try) +### Example 1: Documentation Update (Pass on First Try) **Input:** ``` -/do-and-judge Extract the validation logic from UserController into a separate UserValidator class +/do-and-judge Rewrite the API authentication section in docs/api-reference.md to cover the new OAuth2 flow ``` **Execution:** ``` Phase 1: Task Analysis - → Model: Opus + - Complexity: Medium (rewriting existing documentation with new technical flow) + - Risk: Low (documentation only, no code changes) + - Scope: Small (single file, focused section) + → Model: opus + → Agent type: general-purpose + Reasoning: This is a documentation task — writing and restructuring + prose, not implementing code. The sdd:developer agent is optimized + for code implementation patterns, not technical writing. A + general-purpose agent handles documentation tasks more effectively + because it applies broader writing and reasoning skills without + code-centric constraints. Phase 2: Parallel Dispatch (single message, 2 tool calls) Tool call 1 — Meta-judge (Opus)... 
+ Meta-judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ ## Task + │ Generate an evaluation specification yaml for the + │ following task. You will produce rubrics, checklists, + │ and scoring criteria that a judge agent will use to + │ evaluate the implementation artifact. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## User Prompt + │ Rewrite the API authentication section in + │ docs/api-reference.md to cover the new OAuth2 flow + │ + │ ## Context + │ Existing docs/api-reference.md contains an outdated + │ "Authentication" section describing API key auth. + │ The codebase recently migrated to OAuth2 with PKCE. + │ Related source: src/auth/oauth2.ts, src/auth/config.ts. + │ + │ ## Artifact Type + │ documentation + │ + │ ## Instructions + │ Return only the final evaluation specification YAML + │ in your response. + └───────────────────────────────────────────────────────── → Generated evaluation specification YAML - → 3 rubric dimensions, 6 checklist items - Tool call 2 — Implementation (sadd:meta-judge + Opus)... - → Created UserValidator.ts - → Updated UserController to use validator - → Summary: 2 files modified, validation extracted + → 3 rubric dimensions (accuracy, completeness, clarity) + → 5 checklist items + + Tool call 2 — Implementation (general-purpose + Opus)... + Implementation prompt sent (abbreviated): + ┌───────────────────────────────────────────────────────── + │ ## Reasoning Approach + │ Before taking any action, think through this task + │ systematically. + │ [... step-by-step reasoning template ...] + │ + │ ## Task + │ Rewrite the API authentication section in + │ docs/api-reference.md to cover the new OAuth2 flow. + │ Replace the outdated API key auth documentation with + │ OAuth2 + PKCE flow documentation including token + │ endpoints, scopes, refresh token handling, and + │ example requests. 
+ │ + │ ## Constraints + │ - Follow existing documentation patterns and conventions + │ - Make minimal changes to achieve the objective + │ - Do not introduce new dependencies without justification + │ - Ensure changes are testable + │ + │ ## Output + │ Provide your implementation along with a "Summary" + │ section containing: + │ - Files modified (full paths) + │ - Key changes (3-5 bullet points) + │ - Any decisions made and rationale + │ - Potential concerns or follow-up needed + │ + │ ## Self-Critique Verification (MANDATORY) + │ [... verification questions and revision process ...] + └───────────────────────────────────────────────────────── + → Rewrote Authentication section in docs/api-reference.md + → Added OAuth2 flow diagram, token endpoints, scopes table + → Added code examples for authorization and token refresh + → Summary: 1 file modified, authentication section rewritten Phase 3: Dispatch Judge (with meta-judge specification) - Judge (sadd:judge)... + NOTE: No pre-existing changes — first task on a clean codebase. + The "Pre-existing Changes" section is OMITTED from the judge prompt. + + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating an implementation artifact against + │ an evaluation specification produced by the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## User Prompt + │ Rewrite the API authentication section in + │ docs/api-reference.md to cover the new OAuth2 flow + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: docs/api-reference.md (modified) + │ Key changes: Replaced API key auth section with OAuth2 + │ + PKCE flow, added token endpoints, scopes table, + │ and code examples for authorization and refresh... + │ + │ ## Instructions + │ Follow your full judge process... + └───────────────────────────────────────────────────────── + + Judge (sadd:judge + Opus)... 
→ VERDICT: PASS, SCORE: 4.2/5.0 → ISSUES: None - → IMPROVEMENTS: Add input validation for edge cases + → IMPROVEMENTS: Add error response examples for expired tokens + +Phase 4: Parse Verdict + → Score 4.2 ≥ 4.0 threshold → PASS + → No retry needed (Phase 5 skipped) Phase 6: Final Report ✅ PASS on attempt 1 - Files: UserValidator.ts (new), UserController.ts (modified) + Files: docs/api-reference.md (modified) ``` ### Example 2: Complex Task (Pass After Retry) @@ -495,7 +637,7 @@ Phase 2: Parallel Dispatch (Attempt 1) Tool call 1 — Meta-judge (Opus)... → Generated evaluation specification YAML → 4 rubric dimensions, 8 checklist items - Tool call 2 — Implementation (sadd:meta-judge + Opus + sdd:developer)... + Tool call 2 — Implementation (sdd:developer + Opus)... → Created RateLimiter middleware → Added configuration schema @@ -560,6 +702,395 @@ User chose: Option 1 - "Delete orphaned records older than 1 year" Attempt 4 (with guidance): PASS (4.1/5.0) ``` +### Example 4: Sequential do-and-judge Runs (Pre-existing Changes from Previous Task) + +**Input (first run):** + +``` +/do-and-judge add basic authentication module +``` + +**Execution (first run):** + +``` +Phase 1: Task Analysis + - Complexity: High (new feature, security-sensitive) + - Risk: High (authentication is critical) + - Scope: Medium (new module) + → Model: opus + - Pre-existing Changes: None + +Phase 2: Parallel Dispatch (Attempt 1) + Tool call 1 — Meta-judge (Opus)... + Meta-judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ ## Task + │ Generate an evaluation specification yaml for the + │ following task. You will produce rubrics, checklists, + │ and scoring criteria that a judge agent will use to + │ evaluate the implementation artifact. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## User Prompt + │ Add basic authentication module + │ + │ ## Context + │ Express.js backend, src/auth/ directory does not exist + │ yet. Existing middleware pattern in src/middleware/. 
+ │ + │ ## Artifact Type + │ code + │ + │ ## Instructions + │ Return only the final evaluation specification YAML + │ in your response. + └───────────────────────────────────────────────────────── + → Generated evaluation specification YAML + → 4 rubric dimensions, 7 checklist items + + Tool call 2 — Implementation (sdd:developer + Opus)... + Implementation prompt sent (abbreviated): + ┌───────────────────────────────────────────────────────── + │ ## Reasoning Approach + │ Before taking any action, think through this task + │ systematically. + │ [... step-by-step reasoning template ...] + │ + │ ## Task + │ Add basic authentication module to the Express.js + │ backend. Create login, logout, and register endpoints + │ with proper middleware for route protection. + │ + │ ## Constraints + │ - Follow existing code patterns and conventions + │ - Make minimal changes to achieve the objective + │ - Do not introduce new dependencies without + │ justification + │ - Ensure changes are testable + │ + │ ## Output + │ Provide your implementation along with a "Summary" + │ section containing: + │ - Files modified (full paths) + │ - Key changes (3-5 bullet points) + │ - Any decisions made and rationale + │ - Potential concerns or follow-up needed + │ + │ ## Self-Critique Verification (MANDATORY) + │ [... verification questions and revision process ...] + └───────────────────────────────────────────────────────── + → Created src/auth/AuthService.ts + → Created src/auth/AuthMiddleware.ts + → Created src/auth/auth.routes.ts + → Modified src/app.ts + → Summary: 4 files changed, auth module added + +Phase 3: Dispatch Judge (with meta-judge specification) + NOTE: No pre-existing changes — this is the first task on a clean codebase. + The "Pre-existing Changes" section is OMITTED from the judge prompt. 
+ + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating an implementation artifact against + │ an evaluation specification produced by the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## User Prompt + │ Add basic authentication module + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: src/auth/AuthService.ts (new), ... + │ Key changes: Added login/logout/register endpoints... + │ + │ ## Instructions + │ Follow your full judge process... + └───────────────────────────────────────────────────────── + + Judge (sadd:judge + Opus)... + → VERDICT: FAIL, SCORE: 3.0/5.0 + → ISSUES: + - Missing password hashing (plain-text storage) + - No unit tests for AuthService + → IMPROVEMENTS: Add rate limiting on login endpoint + +Phase 5: Retry with Feedback (Attempt 2) + Implementation (sdd:developer + Opus)... + → Added bcrypt password hashing + → Created tests/auth/AuthService.test.ts + → Summary: 2 files modified, 1 file created + +Phase 3: Dispatch Judge (Attempt 2, same meta-judge specification) + NOTE: This is a retry within the SAME task — do NOT include the + implementation agent's previous attempt as "pre-existing changes". + The "Pre-existing Changes" section is still OMITTED. + + Judge (sadd:judge + Opus)... 
+ → VERDICT: PASS, SCORE: 4.3/5.0 + → IMPROVEMENTS: Add integration tests + +Phase 6: Final Report + ✅ PASS on attempt 2 + Files: AuthService.ts, AuthMiddleware.ts, auth.routes.ts, + AuthService.test.ts, app.ts +``` + +**Input (second run, same session):** + +``` +/do-and-judge refactor auth module to use dependency injection +``` + +**Execution (second run):** + +``` +Phase 1: Task Analysis + - Complexity: Medium (refactoring existing code) + - Risk: Medium (modifying working auth module) + - Scope: Medium (single module refactor) + → Model: opus + - Pre-existing Changes: Auth module created in previous task + +Phase 2: Parallel Dispatch + Tool call 1 — Meta-judge (Opus)... + Meta-judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ ## Task + │ Generate an evaluation specification yaml for the + │ following task. You will produce rubrics, checklists, + │ and scoring criteria that a judge agent will use to + │ evaluate the implementation artifact. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## User Prompt + │ Refactor auth module to use dependency injection + │ + │ ## Context + │ Existing auth module at src/auth/ with AuthService, + │ AuthMiddleware, auth.routes. Tests in tests/auth/. + │ + │ ## Artifact Type + │ code + │ + │ ## Instructions + │ Return only the final evaluation specification YAML + │ in your response. + └───────────────────────────────────────────────────────── + → Generated evaluation specification YAML + → 3 rubric dimensions, 5 checklist items + + Tool call 2 — Implementation (sdd:developer + Opus)... + Implementation prompt sent (abbreviated): + ┌───────────────────────────────────────────────────────── + │ ## Reasoning Approach + │ Before taking any action, think through this task + │ systematically. + │ [... step-by-step reasoning template ...] + │ + │ ## Task + │ Refactor the auth module to use dependency injection. 
+ │ AuthService should accept its dependencies via + │ constructor instead of importing them directly. + │ + │ ## Constraints + │ - Follow existing code patterns and conventions + │ - Make minimal changes to achieve the objective + │ - Do not introduce new dependencies without + │ justification + │ - Ensure changes are testable + │ + │ ## Output + │ Provide your implementation along with a "Summary" + │ section containing: + │ - Files modified (full paths) + │ - Key changes (3-5 bullet points) + │ - Any decisions made and rationale + │ - Potential concerns or follow-up needed + │ + │ ## Self-Critique Verification (MANDATORY) + │ [... verification questions and revision process ...] + └───────────────────────────────────────────────────────── + → Refactored AuthService to accept dependencies via constructor + → Created src/auth/AuthServiceFactory.ts + → Updated tests to use mocked dependencies + → Summary: 4 files modified, 1 file created + +Phase 3: Dispatch Judge (with meta-judge specification) + NOTE: Pre-existing changes detected — the previous do-and-judge run + created the auth module. Include "Pre-existing Changes" section so the + judge does not confuse prior work with the current refactoring task. + + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating an implementation artifact against + │ an evaluation specification produced by the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## User Prompt + │ Refactor auth module to use dependency injection + │ + │ ## Pre-existing Changes (Context Only) + │ + │ The following changes were made BEFORE the current + │ implementation agent started working. They are NOT part + │ of the current task's output. Focus your evaluation on + │ the current task's changes. Only verify pre-existing + │ changed files/logic if they directly relate to the + │ current task requirements. 
+ │ + │ ### Previous Task: "Add basic authentication module" + │ The following files were created/modified as part of a + │ previous task: + │ - src/auth/AuthService.ts (new) - Authentication service + │ with login/logout/register + │ - src/auth/AuthMiddleware.ts (new) - Express middleware + │ for route protection + │ - src/auth/auth.routes.ts (new) - Auth API routes + │ - tests/auth/AuthService.test.ts (new) - Unit tests for + │ auth service + │ - src/app.ts (modified) - Integrated auth routes and + │ middleware + │ + │ These files exist in the codebase and may be modified by + │ the current task, but you should evaluate only the + │ changes made by the current implementation agent for the + │ current task (refactoring to dependency injection). + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: src/auth/AuthService.ts (modified), ... + │ Key changes: Refactored to constructor injection... + │ + │ ## Instructions + │ Follow your full judge process... + └───────────────────────────────────────────────────────── + + Judge (sadd:judge + Opus)... + → VERDICT: PASS, SCORE: 4.5/5.0 + → ISSUES: None + → IMPROVEMENTS: Add interface documentation + +Phase 6: Final Report + ✅ PASS on attempt 1 + Files: AuthService.ts (modified), AuthServiceFactory.ts (new), + AuthMiddleware.ts (modified), AuthService.test.ts (modified), + app.ts (modified) +``` + +### Example 5: User-Modified Codebase Before do-and-judge + +**Scenario:** + +The user has been working on an e-commerce codebase during the conversation. They modified the shopping cart, product catalog, and checkout flow before invoking do-and-judge. 
+ +**Input:** + +``` +/do-and-judge fix shopping cart module bug when it adds duplicated items +``` + +**Execution:** + +``` +Phase 1: Task Analysis + - Complexity: Medium (bug fix in existing module) + - Risk: Medium (cart logic affects checkout) + - Scope: Small (focused bug fix) + → Model: opus + - Pre-existing Changes: User modified several files before this task + +Phase 2: Parallel Dispatch + Tool call 1 — Meta-judge (Opus)... + → Generated evaluation specification YAML + → 3 rubric dimensions, 5 checklist items + Tool call 2 — Implementation (sdd:developer + Opus)... + → Fixed duplicate detection in CartService.addItem() + → Added deduplication guard in cart.routes.ts + → Added regression test for duplicate item scenario + → Summary: 3 files modified + +Phase 3: Dispatch Judge (with meta-judge specification) + NOTE: The orchestrator is aware from git diff/status that the user + modified several files before this task. Include "Pre-existing Changes" + section so the judge focuses only on the bug fix. + + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating an implementation artifact against + │ an evaluation specification produced by the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## User Prompt + │ Fix shopping cart module bug when it adds duplicated items + │ + │ ## Pre-existing Changes (Context Only) + │ + │ The following changes were made BEFORE the current + │ implementation agent started working. They are NOT part + │ of the current task's output. Focus your evaluation on + │ the current task's changes. Only verify pre-existing + │ changed files/logic if they directly relate to the + │ current task requirements. 
+ │ + │ ### User modifications (before current task) + │ The user made changes to the following files/modules + │ before this task was started: + │ - src/cart/CartService.ts (modified) - Shopping cart + │ business logic updates + │ - src/cart/cart.routes.ts (modified) - Updated cart API + │ endpoints + │ - src/products/ProductCatalog.ts (modified) - Product + │ listing changes + │ - src/checkout/CheckoutFlow.ts (modified) - Checkout + │ process updates + │ - tests/cart/CartService.test.ts (modified) - Updated + │ cart tests + │ + │ The current task focuses specifically on fixing the + │ duplicate items bug in the shopping cart module. + │ Pre-existing changes to cart files may overlap with the + │ current task scope — evaluate whether the implementation + │ agent's changes correctly address the bug without + │ breaking the pre-existing modifications. + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: src/cart/CartService.ts (modified), ... + │ Key changes: Added duplicate item detection... + │ + │ ## Instructions + │ Follow your full judge process... + └───────────────────────────────────────────────────────── + + Judge (sadd:judge + Opus)... + → VERDICT: PASS, SCORE: 4.1/5.0 + → ISSUES: None + → IMPROVEMENTS: Consider extracting deduplication logic + into a shared utility + +Phase 6: Final Report + ✅ PASS on attempt 1 + Files: CartService.ts (modified), cart.routes.ts (modified), + CartService.test.ts (modified) +``` + ## Best Practices ### Model Selection @@ -589,3 +1120,4 @@ Attempt 4 (with guidance): PASS (4.1/5.0) - **Summarize, don't copy** - Pass summaries, not full file contents - **Trust sub-agents** - They can read files themselves - **Meta-judge YAML** - Pass only the meta-judge YAML to the judge, do not add any additional text or comments to it! 
+- **Track pre-existing changes** - Pass context about prior modifications to the judge to prevent attribution confusion between pre-existing and current changes diff --git a/plugins/sadd/skills/do-in-parallel/SKILL.md b/plugins/sadd/skills/do-in-parallel/SKILL.md index 5c6316a..70e0132 100644 --- a/plugins/sadd/skills/do-in-parallel/SKILL.md +++ b/plugins/sadd/skills/do-in-parallel/SKILL.md @@ -7,17 +7,18 @@ argument-hint: Task description [--files "file1.ts,file2.ts,..."] [--targets "ta # do-in-parallel -Launch multiple sub-agents in parallel to execute the same task across different files or targets. Analyze the task to intelligently select the optimal model, generate quality-focused prompts with Zero-shot Chain-of-Thought reasoning and mandatory self-critique, then dispatch one meta-judge per target (all in parallel), followed by implementors for each target in parallel, with LLM-as-a-judge verification using target-specific evaluation specs after each completes. +Launch multiple sub-agents in parallel to execute tasks across different files or targets. Analyze the task to intelligently select the optimal model, perform requirement grouping analysis (repeatable, shared, or independent), generate quality-focused prompts with Zero-shot Chain-of-Thought reasoning and mandatory self-critique, then dispatch meta-judges based on grouping (one per group or per independent task, all in parallel), followed by implementors for each task in parallel, with LLM-as-a-judge verification using grouping-appropriate evaluation specs after each completes. -This command implements the **Supervisor/Orchestrator pattern** with parallel dispatch and **meta-judge → LLM-as-a-judge verification**. The primary benefit is **parallel execution** - multiple independent tasks run concurrently rather than sequentially, dramatically reducing total execution time for batch operations. 
One meta-judge per task generates tailored evaluation criteria specific to that task, then each parallel implementor is verified by an independent judge using its task-specific specification, with automatic retry on failure.
+This command implements the **Supervisor/Orchestrator pattern** with parallel dispatch, **requirement grouping**, and **meta-judge → LLM-as-a-judge verification**. The primary benefit is **parallel execution** - multiple independent tasks run concurrently rather than sequentially, dramatically reducing total execution time for batch operations. Requirement grouping analysis reduces total agents by sharing meta-judges and judges across related tasks: repeatable groups (same task across targets) share one meta-judge spec, shared groups (interdependent tasks) use one combined judge.

Key benefits:

- **Parallel execution** - Multiple tasks run simultaneously
+- **Requirement grouping** - Reduces meta-judges and judges by identifying repeatable and shared task patterns
- **Fresh context** - Each sub-agent works with clean context window
-- **task-specific evaluation** - Each meta-judge produces tailored rubrics and checklists for its specific task
+- **Task-specific evaluation** - Each meta-judge produces tailored rubrics and checklists for its specific task or group
- **External verification** - Judge applies target-specific meta-judge specification mechanically — catches blind spots self-critique misses
- **Feedback loop** - Retry with specific issues identified by judge
- **Quality gate** - Work doesn't ship until it meets threshold
@@ -31,11 +32,11 @@ Key benefits:

**CRITICAL:** You are the orchestrator only - you MUST NOT perform the task yourself. IF you read, write, or run bash tools, you have failed the task immediately. This is the single most critical criterion for you. If you use anything except sub-agents you will be killed immediately!!!!

Your role is to:

-1. Analyze the task and select optimal model
-2. 
Dispatch ALL meta-judges in parallel (one per target) to generate target-specific evaluation specifications -3. After each meta-judge completes, dispatch the implementation sub-agent for that target with structured prompts -4. After each implementor completes, dispatch its independent judge with the target-specific meta-judge specification -5. Parse verdict and iterate if needed (max 3 retries per target) +1. Analyze the task, perform requirement grouping analysis, and select optimal model +2. Dispatch meta-judges in parallel based on grouping +3. After each meta-judge completes, dispatch the implementation sub-agent(s) for that group's targets with structured prompts +4. After implementors complete, dispatch judges based on grouping +5. Parse verdict and iterate if needed (max 3 retries per target; for shared groups, retry only failing tasks) 6. Collect results and report final summary ## RED FLAGS - Never Do These @@ -50,20 +51,25 @@ Key benefits: - Wait for one agent to complete before starting another - Re-run meta-judge on retries - Wait to launch implementors until ALL meta-judges have completed +- Launch separate meta-judges for tasks that belong to the same repeatable or shared group +- Re-launch ALL implementation agents in a shared group when only some failed **ALWAYS:** - Use Task tool to dispatch sub-agents for ALL implementation work -- Dispatch one meta-judge PER task, all in parallel in a SINGLE response +- Perform requirement grouping analysis BEFORE dispatching any meta-judges +- Dispatch meta-judges based on grouping -- all in parallel in a SINGLE response - Do not wait for ALL meta-judges to complete before dispatching implementors, launch them immediately after each meta-judge completes -- Launch each implementor for a task immediately after its meta-judge completes. If all meta-judges are completed, launch all implementaion agents in SINGLE response. 
-- Pass each target's specific meta-judge evaluation specification to its judge agent +- Launch each implementor for a task immediately after its meta-judge completes. If all meta-judges are completed, launch all implementation agents in SINGLE response +- Pass each target's specific meta-judge evaluation specification to its judge agent +- For shared groups, dispatch ONE judge that reviews ALL related changes together - Include `CLAUDE_PLUGIN_ROOT=${CLAUDE_PLUGIN_ROOT}` in prompts to meta-judge and judge agents - Use Task tool to dispatch independent judges for verification - Wait for each implementation to complete before dispatching its judge - Parse only VERDICT/SCORE/ISSUES from judge output - Iterate with feedback if verification fails (max 3 retries per target) -- Reuse same task-specific meta-judge specification for all retries of that task (never re-run meta-judge) +- For shared group retries, only re-launch the specific failing implementation agent(s), not the entire group +- Reuse same meta-judge specification for all retries (never re-run meta-judge) ## Process @@ -138,6 +144,52 @@ Verify tasks are truly independent before proceeding: If ANY check fails: STOP and inform user why parallelization is unsafe. Recommend `/launch-sub-agent` for sequential execution. +#### Requirement Grouping Analysis (REQUIRED before Meta-Judge dispatch) + +After identifying individual tasks and validating independence, analyze whether tasks can share meta-judges and/or judges. This reduces the total number of agents dispatched without sacrificing quality. 
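As a rough illustration of the savings (the types and function below are hypothetical sketches, not part of the skill's specification), the grouping outcome and the agent counts it implies can be modeled as:

```typescript
// Hypothetical model of a grouping analysis result. Implementation
// agents are always one per task; meta-judges are one per group;
// judges are one per task, except a shared group gets a single judge.

type GroupingType = "repeatable" | "shared" | "independent";

interface TaskSpec {
  description: string;
  target: string; // file path, module, or component name
}

interface RequirementGroup {
  type: GroupingType;
  tasks: TaskSpec[]; // an independent group holds exactly one task
}

function agentCounts(groups: RequirementGroup[]) {
  let metaJudges = 0;
  let implementers = 0;
  let judges = 0;
  for (const g of groups) {
    metaJudges += 1; // ONE meta-judge per group
    implementers += g.tasks.length; // always isolated, one per task
    judges += g.type === "shared" ? 1 : g.tasks.length;
  }
  return { metaJudges, implementers, judges };
}
```

For example, "add tests to 3 modules" treated as one repeatable group uses 1 meta-judge for 3 targets instead of 3; the implementation-agent count never changes.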
+
+**Three grouping types** (can be combined within a single user prompt):
+
+| Grouping Type | When to Apply | Meta-Judges | Implementation Agents | Judges |
+|---------------|---------------|-------------|----------------------|--------|
+| **Repeatable** | Same task pattern applied across multiple files/modules (e.g., "add tests to all 3 modules") | ONE shared meta-judge for the group | One per task (always isolated) | One per task, each receiving the SAME shared spec |
+| **Shared** | Tasks that should be reviewed/verified together because they are interdependent (e.g., "implement S3 adapter AND integrate it into analytics") | ONE combined meta-judge for the group | One per task (always isolated) | ONE judge for the entire group, reviewing all changes together |
+| **Independent** | Tasks that are fully independent with no grouping benefit | One per task | One per task (always isolated) | One per task |
+
+**Decision process:**
+
+```
+For each pair of tasks, ask:
+
+1. "Is this the SAME task applied to different targets?"
+   +-- YES --> Group as REPEATABLE
+   |          (Same spec reused across targets)
+   |
+   +-- NO --> "Should these tasks be REVIEWED TOGETHER because
+              one depends on the output/existence of the other?"
+              |
+              +-- YES --> Group as SHARED
+              |          (Combined spec, single judge reviews all)
+              |
+              +-- NO --> Mark as INDEPENDENT
+                         (Separate meta-judge and judge per task)
+```
+
+CRITICAL:
+- **When in doubt, default to INDEPENDENT.** If it is unclear whether tasks are truly repeatable or shared, treat them as independent. Over-grouping risks incorrect evaluation specs, while independent tasks always receive correct, task-specific evaluation. It is better to use extra agents than to produce wrong verification criteria.
+- **Implementation agents are ALWAYS isolated** -- one per task, never shared. Only meta-judges and judges can be shared/grouped. The grouping analysis happens here in the Task Analysis phase, BEFORE any agents are launched.
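A minimal sketch of the pairwise decision above (the two boolean predicates stand in for the orchestrator's own judgment calls and are illustrative assumptions, not part of the skill):

```typescript
type Grouping = "repeatable" | "shared" | "independent";

// samePatternAcrossTargets: "is this the SAME task applied to different targets?"
// reviewedTogether: "does one task depend on the output/existence of the other?"
function classifyPair(
  samePatternAcrossTargets: boolean,
  reviewedTogether: boolean,
): Grouping {
  if (samePatternAcrossTargets) return "repeatable";
  if (reviewedTogether) return "shared";
  // When in doubt, default to independent: over-grouping risks wrong specs.
  return "independent";
}
```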
+
+**Meta-judge instructions:**
+- Repeatable group: When dispatching a meta-judge for a repeatable group, include explicit instructions to produce a reusable verification spec.
+- Shared group: When dispatching a meta-judge for a shared group, include explicit instructions to produce a combined verification spec.
+
+
+**Shared group retry logic:**
+
+If the shared judge finds issues, analyze which specific implementation agent(s) produced the failing changes. Only re-launch the specific implementation agent(s) whose changes failed -- do NOT re-launch all agents in the group unless it is necessary. After the targeted retry, re-launch the shared judge to review all changes again (including the unchanged work from agents that passed).
+
+
+
### Phase 3: Model and Agent Selection

Select the optimal model and specialized agent based on task analysis. **Same configuration for all parallel agents** (ensures consistent quality):

@@ -186,14 +238,15 @@ Skip specialized agent when:
- No clear domain match exists
- General-purpose execution is sufficient

-### Phase 3.5: Dispatch Meta-Judges (One Per Target, All in Parallel)
+### Phase 3.5: Dispatch Meta-Judges (Grouped by Requirement Type, All in Parallel)

-Before dispatching implementation agents, dispatch one meta-judge per target in parallel. Each meta-judge receives task-specific context so it produces a tailored evaluation specification for that specific task. Each meta-judge produces rubrics, checklists, and scoring criteria tailored to the specific task. Each target's specification is reused for all retries of that target ONLY.
+Before dispatching implementation agents, dispatch meta-judges based on the requirement grouping analysis from Phase 2. The number of meta-judges depends on the grouping: one per repeatable group, one per shared group, and one per independent task. All meta-judges are launched in parallel regardless of grouping type. Each meta-judge produces rubrics, checklists, and scoring criteria.
Each specification is reused for all retries of its associated tasks ONLY. +Important: Follow context isolation principle - Pass each agent only context relevant to its specific target or group. -Important: Follow context isolation principle - Pass each agent only context relevant to its specific target. +#### 3.5.1 Meta-Judge Prompt Templates by Grouping Type -**Meta-judge prompt template (per target):** +**Independent meta-judge prompt:** ```markdown ## Task @@ -206,7 +259,7 @@ CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}` {Original user prompt} ## Target -{Specific target for this meta-judge: task description,file path, component name, etc. extracted from User Prompt} +{Specific target for this meta-judge: task description, file path, component name, etc. extracted from User Prompt} ## Context {Any relevant codebase context, file paths, constraints} @@ -219,33 +272,103 @@ User prompt is provided as context, you should use it only as reference of chang Return only the final evaluation specification YAML in your response. ``` -**Dispatch ALL meta-judges in a SINGLE response:** +**Repeatable group meta-judge prompt (ONE per group):** + +```markdown +## Task + +Generate a REUSABLE evaluation specification yaml that can be applied to ANY of the following targets performing the same task. You will produce rubrics, checklists, and scoring criteria that individual judge agents will each use independently to evaluate one target's implementation artifact. 
+ +CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}` + +## User Prompt as Context +{Original user prompt} + +## Task Being Repeated +{The common task description shared by all targets in this group} + +## Targets in This Group +{List of all targets: file paths, component names, etc.} + +## Context +{Any relevant codebase context, file paths, constraints} + +## Artifact Type +{code | documentation | configuration | etc.} + +## Instructions +CRITICAL: You are generating a REUSABLE spec that will be applied to EACH target independently by separate judges. +- Use generic language: "target file should align with criteria" instead of "all files should align" +- Do NOT include file-specific requirements (e.g., NOT "file should have only authentication logic") if the same spec will be applied to another target which logically cannot fulfill this criteria (e.g. "cart.ts" or "payments.ts" cannot have authentication logic) +- The spec must be applicable to ANY target in this group without modification +- Each judge will receive this same spec and evaluate only its own target against it +User prompt is provided as context, you should use it only as reference of changes that can occur in the project by other agents. +Return only the final evaluation specification YAML in your response. +``` + +**Shared group meta-judge prompt (ONE per group):** + +```markdown +## Task + +Generate a COMBINED evaluation specification yaml that covers ALL of the following related tasks. These tasks are interdependent and will be reviewed TOGETHER by a single judge. You will produce rubrics, checklists, and scoring criteria that account for cross-task dependencies and integration points. 
+ +CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}` + +## User Prompt as Context +{Original user prompt} + +## Tasks in This Shared Group +{List of all tasks with their targets: +- Task 1: {description} -> {target} +- Task 2: {description} -> {target} +} + +## Context +{Any relevant codebase context, file paths, constraints, integration points between tasks} + +## Artifact Type +{code | documentation | configuration | etc.} + +## Instructions +CRITICAL: You are generating a COMBINED spec for tasks that will be reviewed TOGETHER by ONE judge. +- Include evaluation criteria for EACH individual task +- Include cross-task verification criteria (e.g., "adapter implementation matches the interface consumed by the integration module") +- Organize the spec so the judge can identify which criteria apply to which task's changes +- The judge will review ALL changes from ALL tasks in this group in a single evaluation +User prompt is provided as context, you should use it only as reference of changes that can occur in the project by other agents. +Return only the final evaluation specification YAML in your response. 
+``` + +#### 3.5.2 Dispatch Pattern + +**Dispatch ALL meta-judges in a SINGLE response (regardless of grouping type):** ``` -Use Task tool (one per target, all in same message): +Use Task tool (one per group/independent task, all in same message): -[Meta-judge for Target A] - - description: "Meta-judge: {brief task summary} for {target_A}" - - prompt: {meta-judge prompt with target_A context} +[Meta-judge for Repeatable Group: "add tests"] + - description: "Meta-judge (repeatable): reusable spec for adding tests across 3 modules" + - prompt: {repeatable group meta-judge prompt} - model: opus - subagent_type: "sadd:meta-judge" -[Meta-judge for Target B] - - description: "Meta-judge: {brief task summary} for {target_B}" - - prompt: {meta-judge prompt with target_B context} +[Meta-judge for Shared Group: "S3 adapter + integration"] + - description: "Meta-judge (shared): combined spec for S3 adapter implementation and integration" + - prompt: {shared group meta-judge prompt} - model: opus - subagent_type: "sadd:meta-judge" -[Meta-judge for Target C] - - description: "Meta-judge: {brief task summary} for {target_C}" - - prompt: {meta-judge prompt with target_C context} +[Meta-judge for Independent Task: "update CI pipeline"] + - description: "Meta-judge: update CI pipeline" + - prompt: {independent meta-judge prompt} - model: opus - subagent_type: "sadd:meta-judge" -[All meta-judges launched simultaneously - wait for ALL to complete before Phase 4] +[All meta-judges launched simultaneously] ``` -**CRITICAL:** Do not wait for ALL meta-judges to complete before proceeding to Phase 4. Launch implementors immediately after each meta-judge completes. If all meta-judges are completed, launch all implementaion agents in SINGLE response. +**CRITICAL:** Do not wait for ALL meta-judges to complete before proceeding to Phase 4. Launch implementors immediately after each meta-judge completes. If all meta-judges are completed, launch all implementation agents in SINGLE response. 
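The dispatch pattern above can be sketched as a pure function from groups to a single batch of Task-tool calls. Field names mirror the parameters shown in the dispatch blocks, but the types are illustrative assumptions, not an actual API:

```typescript
interface Group {
  type: "repeatable" | "shared" | "independent";
  label: string; // short human-readable group summary
}

interface MetaJudgeDispatch {
  description: string;
  model: string;
  subagent_type: string;
}

// Exactly ONE meta-judge per group, all emitted as a single batch so
// the orchestrator can launch them in one response.
function buildMetaJudgeBatch(groups: Group[]): MetaJudgeDispatch[] {
  return groups.map((g) => ({
    description:
      g.type === "independent"
        ? `Meta-judge: ${g.label}`
        : `Meta-judge (${g.type}): ${g.label}`,
    model: "opus",
    subagent_type: "sadd:meta-judge",
  }));
}
```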
### Phase 4: Construct Per-Target Prompts @@ -366,47 +489,71 @@ CRITICAL: Do not submit until ALL verification questions have satisfactory answe ### Phase 5: Parallel Implementation Dispatch and Judge Verification -After ALL meta-judges complete, launch all implementation sub-agents simultaneously, then verify each with an independent judge using the target-specific meta-judge evaluation specification. +After meta-judges complete, launch all implementation sub-agents simultaneously, then verify with judges based on grouping type. #### 5.1 Execution Flow +**Independent / Repeatable flow** (one judge per task): + ``` ┌─────────────────────────────────────────────────────────────────────────┐ │ │ -│ Phase 3.5: Meta-Judge Batch (ALL in parallel) │ +│ Phase 3.5: Meta-Judge Dispatch (ALL in parallel) │ │ │ -│ Target A Target B Target C │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ Meta-Judge A │ │ Meta-Judge B │ │ Meta-Judge C │ │ -│ │ (Opus) │ │ (Opus) │ │ (Opus) │ │ -│ │ → Spec YAML A │ │ → Spec YAML B │ │ → Spec YAML C │ │ -│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ Phase 5: Implementation Batch (after each meta-judge completes) │ -│ │ │ -│ Target A Target B Target C │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ Implementer A │ │ Implementer B │ │ Implementer C │ │ -│ │ (parallel) │ │ (parallel) │ │ (parallel) │ │ -│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ Phase 5.2: Judge per target (after each implementor completes) │ +│ Independent: Repeatable Group: │ +│ ┌──────────────┐ ┌─────────────────────┐ │ +│ │ Meta-Judge A │ │ Meta-Judge (shared) │ │ +│ │ (Opus) │ │ (Opus) │ │ +│ │ → Spec YAML A │ │ → Reusable Spec YAML │ │ +│ └──────┬───────┘ └──────────┬──────────┘ │ +│ │ ┌─────┴─────┐ │ +│ ▼ ▼ ▼ │ +│ Phase 5: Implementation (ALL in parallel, one per task) │ │ │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ Judge A │ │ Judge B │ │ Judge C │ │ -│ │ 
+Spec YAML A │ │ +Spec YAML B │ │ +Spec YAML C │ │ -│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ -│ │ │ │ │ -│ ▼ ▼ ▼ │ -│ ┌──────────────────────────────────────────────────────────┐ │ -│ │ Parse Verdict (per target) │ │ -│ │ ├─ PASS (≥4)? → Complete │ │ -│ │ ├─ Soft PASS (≥3 + low priority issues)? → Complete │ │ -│ │ └─ FAIL (<4)? → Retry (max 3 per target, reuse Spec YAML)│ │ -│ └──────────────────────────────────────────────────────────┘ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Implementer A │ │ Implementer B │ │ Implementer C │ │ +│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ Phase 5.2: Judge per task (after ALL implementors complete) │ │ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Judge A │ │ Judge B │ │ Judge C │ │ +│ │ +Spec YAML A │ │ +Reusable Spec│ │ +Reusable Spec│ │ +│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ +│ ▼ ▼ ▼ │ +│ Parse Verdict (per target) → PASS/FAIL → Retry if needed │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +**Shared flow** (one judge for the group): + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ │ +│ Phase 3.5: Meta-Judge for Shared Group │ +│ ┌──────────────────────┐ │ +│ │ Meta-Judge (combined) │ │ +│ │ (Opus) │ │ +│ │ → Combined Spec YAML │ │ +│ └──────────┬───────────┘ │ +│ ┌────┴────┐ │ +│ ▼ ▼ │ +│ Phase 5: Implementation (one per task, in parallel) │ +│ ┌──────────────┐ ┌──────────────┐ │ +│ │ Implementer X │ │ Implementer Y │ │ +│ └──────┬───────┘ └──────┬───────┘ │ +│ │ │ │ +│ └────────┬─────────┘ │ +│ ▼ │ +│ Phase 5.2: ONE Judge for entire group │ +│ ┌────────────────────────────────┐ │ +│ │ Judge (shared) │ │ +│ │ +Combined Spec YAML │ │ +│ │ +ALL implementation outputs │ │ +│ └──────────────┬─────────────────┘ │ +│ ▼ │ +│ Parse per-task verdicts → Retry ONLY failing task(s) if needed │ 
└─────────────────────────────────────────────────────────────────────────┘ ``` @@ -453,9 +600,42 @@ Use Task tool: #### 5.2 Judge Verification Protocol -After each implementation agent completes, dispatch an **independent judge** for that target using that target's specific meta-judge evaluation specification. +After ALL implementation agents complete, dispatch judges based on the requirement grouping determined in Phase 2. The dispatch pattern varies by grouping type: + +| Grouping Type | Judge Dispatch | Spec Used | +|---------------|---------------|-----------| +| **Independent** | One judge per task | Task-specific meta-judge spec | +| **Repeatable** | One judge per task | SAME shared reusable spec from the group's meta-judge | +| **Shared** | ONE judge for the entire group | Combined spec from the group's meta-judge | + +CRITICAL: Provide to the judge the EXACT meta-judge evaluation specification YAML, do not skip or add anything, do not modify it in any way, do not shorten or summarize any text in it! For repeatable groups, each target's judge receives the SAME reusable spec. For shared groups, the single judge receives the combined spec covering all tasks. -CRITICAL: Provide to the judge the EXACT meta-judge evaluation specification YAML for that specific target, do not skip or add anything, do not modify it in any way, do not shorten or summarize any text in it! Each target's judge receives only that target's meta-judge YAML, not another target's specification or some combination of them. +##### 5.2.1 Analyze the Pre-existing or expected parallel Changes Section + +Before dispatching each target's judge, assess whether there are pre-existing or expected parallel changes in the codebase that the judge needs to be aware of. The "Pre-existing or Expected Parallel Changes" section prevents the judge from confusing prior modifications with the current implementation agent's work. 
+
+**When to include:**
+
+- Previous do-in-parallel runs completed earlier in the same session (all targets from a prior batch)
+- User's manual modifications made before invoking the skill (visible from conversation context or in git)
+- Changes from other tools or agents that ran before this parallel dispatch
+- Expected changes from other parallel agents in the same batch (e.g. if other agents are expected to modify other files in the repository during the parallel development)
+
+**When to omit:**
+
+- This is the first run with no known prior changes — omit the section entirely
+- On retries within the SAME target, do NOT include the implementation agent's own previous attempt as "pre-existing changes" — those are part of the current target's iteration cycle
+
+**Content guidelines:**
+
+- Use a high-level summary: task description, list of affected files/modules, general nature of changes (created, modified, deleted)
+- Do NOT include code blocks, diffs, or line-level details — keep it concise
+- Label the source clearly: "Previous do-in-parallel: {description}", "User modifications (before current task)", etc.
+- If multiple sources of changes exist, use separate subsections for each
+
+CRITICAL: avoid reading the full codebase or git history; just use high-level git diff/status output to determine which files were changed, or use conversation context to determine whether there are any pre-existing changes.
+
+##### 5.2.2 Launch Judge with prompt and target-specific specification YAML

**Judge prompt template:**

```markdown
@@ -470,6 +650,17 @@ CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## Target
{Specific target: file path or component name}

+{IF pre-existing changes are known, include the following section — otherwise omit entirely}
+
+## Pre-existing or Expected Parallel Changes (Context Only)
+
+The following changes were made before the current implementation agent started, or are expected to be made by other parallel agents in the same batch. They are NOT part of the current implementation agent's output.
Focus your evaluation on the current agent's changes to its specific target. Only verify other changed files/logic if they directly relate to the current target's task requirements. + +### {Source of changes: e.g., "Previous do-in-parallel: {task description}" or "User modifications (before current task)"} +{High-level summary: what was done, which files/modules were created or modified} + +{END conditional section} + ## Evaluation Specification ```yaml @@ -491,16 +682,76 @@ CRITICAL: You must reply with this exact structured evaluation report format in CRITICAL: NEVER provide score threshold, in any format, including `threshold_pass` or anything different. Judge MUST not know what threshold for score is, in order to not be biased!!! -**Dispatch judge for each target:** +##### 5.2.3 Shared Group Judge Prompt Template + +For shared groups where ONE judge reviews ALL related changes together: + +```markdown +You are evaluating implementation artifacts for a group of related tasks against a combined evaluation specification produced by the meta judge. These tasks are interdependent and must be reviewed together. + +CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}` + +## User Prompt +{Original task description from user} + +## Tasks in This Shared Group +{List of all tasks with their targets: +- Task 1: {description} -> {target} +- Task 2: {description} -> {target} +} + +{IF pre-existing changes are known, include the "Pre-existing or Expected Parallel Changes (Context Only)" section — otherwise omit entirely} + +## Evaluation Specification + +```yaml +{meta-judge's COMBINED evaluation specification YAML} +``` + +## Implementation Outputs +{For each task in the group:} +### Task: {task description} -> {target} +{Summary section from that task's implementation agent} +{Paths to files modified} + +## Instructions +User prompt is provided as context, you should use it only as reference of changes that can occur in the project by other agents. 
Evaluate ALL tasks in this shared group together. Verify cross-task integration points (e.g., does the adapter match the interface the integration module consumes?). +CRITICAL: For each task, indicate separately whether it PASSED or FAILED so that only failing tasks can be retried. +Follow your full judge process as defined in your agent instructions! + +## Output + +CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response! Include per-task verdicts within the report. +``` + +##### 5.2.4 Dispatch Judges by Grouping Type + +**Independent and Repeatable targets -- one judge per task:** ``` Use Task tool: - description: "Judge: {target name}" - - prompt: {judge verification prompt with exact meta-judge specification YAML} + - prompt: {judge verification prompt with exact meta-judge specification YAML, and Pre-existing or Expected Parallel Changes section if applicable} - model: opus - subagent_type: "sadd:judge" ``` +For repeatable groups, each judge receives the SAME shared reusable spec from the group's single meta-judge. The judge prompt template from 5.2.2 is used as-is; only the target and implementation output differ between judges. + +**Shared group -- ONE judge for the entire group:** + +``` +Use Task tool: + - description: "Judge (shared): {group description}" + - prompt: {shared group judge prompt from 5.2.3 with combined meta-judge specification YAML and ALL implementation outputs} + - model: opus + - subagent_type: "sadd:judge" +``` + +**Launch ALL judges in parallel** (independent, repeatable, and shared judges all dispatched in same response). + +CRITICAL: NEVER provide score threshold, in any format, including `threshold_pass` or anything different. Judge MUST not know what threshold for score is, in order to not be biased!!! 
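The judge fan-out rules above can be sketched with a hypothetical helper (the real dispatch happens via Task-tool prompts; these names are illustrations only):

```typescript
interface GroupTasks {
  type: "repeatable" | "shared" | "independent";
  label: string;
  targets: string[];
}

interface JudgeDispatch {
  description: string;
  specId: string;    // which meta-judge spec this judge receives
  targets: string[]; // what this judge reviews
}

function buildJudges(group: GroupTasks, specId: string): JudgeDispatch[] {
  if (group.type === "shared") {
    // ONE judge reviews every task in the group against the combined spec.
    return [{
      description: `Judge (shared): ${group.label}`,
      specId,
      targets: [...group.targets],
    }];
  }
  // Independent and repeatable: one judge per task. For a repeatable
  // group every judge receives the SAME reusable spec, because the
  // group had a single meta-judge.
  return group.targets.map((t) => ({
    description: `Judge: ${t}`,
    specId,
    targets: [t],
  }));
}
```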
+ #### 5.3 Parse Verdict and Iterate Parse judge output for each target (DO NOT read full report): @@ -544,6 +795,36 @@ If score < 4: - Other parallel tasks continue independently - Only the failed target is retried +**Shared group verdict parsing:** + +For shared groups, the judge produces per-task verdicts within a single report. Parse each task's verdict individually: + +``` +Extract from shared judge reply: +- Per-task verdicts: + - Task 1 ({target}): VERDICT: PASS/FAIL, SCORE: X.X/5.0, ISSUES: [...] + - Task 2 ({target}): VERDICT: PASS/FAIL, SCORE: X.X/5.0, ISSUES: [...] +- OVERALL SCORE: X.X/5.0 +- CROSS-TASK ISSUES: List of integration problems (if any) +``` + +**Shared group retry logic:** + +``` +If shared judge finds failures: + 1. Identify which specific task(s) failed from per-task verdicts + 2. Re-launch ONLY the implementation agent(s) for the failed task(s) + -- Do NOT re-launch agents whose tasks passed + 3. After retry implementation completes, re-launch the shared judge + to review ALL changes again (passed + retried) + -- The shared judge still uses the same combined meta-judge spec + 4. Repeat until all tasks pass or max retries reached for any task + +CRITICAL: Only the specific failing implementation agent(s) are retried. +Passing tasks are NOT re-implemented. The shared judge always reviews +the complete group together on each evaluation round. +``` + #### 5.4 Retry with Feedback (If Needed) **Retry prompt template:** @@ -595,12 +876,12 @@ After all agents complete (with retries as needed), aggregate results: ### Results -| Target | Model | Judge Score | Retries | Status | Summary | -|--------|-------|-------------|---------|--------|---------| -| {target_1} | {model} | {X.X}/5.0 | {0-3} | SUCCESS | {brief outcome} | -| {target_2} | {model} | {X.X}/5.0 | {0-3} | SUCCESS | {brief outcome} | -| {target_3} | {model} | {X.X}/5.0 | {3} | FAILED | {failure reason} | -| ... | ... | ... | ... | ... | ... 
|
+| Target | Grouping | Model | Judge Score | Retries | Status | Summary |
+|--------|----------|-------|-------------|---------|--------|---------|
+| {target_1} | {Repeatable/Shared/Independent} | {model} | {X.X}/5.0 | {0-3} | SUCCESS | {brief outcome} |
+| {target_2} | {Repeatable/Shared/Independent} | {model} | {X.X}/5.0 | {0-3} | SUCCESS | {brief outcome} |
+| {target_3} | {Repeatable/Shared/Independent} | {model} | {X.X}/5.0 | {3} | FAILED | {failure reason} |
+| ... | ... | ... | ... | ... | ... | ... |
 
 ### Overall Assessment
 - **Completed:** {X}/{total}
@@ -633,254 +914,1221 @@ After all agents complete (with retries as needed), aggregate results:
 
 ## Examples
 
-### Example 1: Code Simplification Across Modules
+### Example 1: Requirement Grouping -- Mixed Repeatable + Independent (with Pre-existing Changes from Prior Batch)
+
+**Scenario:**
+
+A team runs two sequential do-in-parallel batches. The first batch updates the API documentation across 3 endpoint files (`src/api/users.ts`, `src/api/orders.ts`, `src/api/products.ts`). The second batch adds tests to all 3 modules in the src folder and adds a test step to GitHub Actions. Each agent's judge in the second batch needs to know about the documentation changes from the first batch AND the expected changes from the other parallel agents in the same batch.
+ +**Input (second batch -- first batch already completed earlier in session):** -**Input:** ``` -/do-in-parallel "Simplify error handling to use early returns instead of nested if-else" \ - --files "src/services/user.ts,src/services/order.ts,src/services/payment.ts" -``` - -**Analysis:** -- Task type: Code transformation / refactoring -- Per-target complexity: Medium (pattern-based transformation) -- Output size: Medium (modified file) -- Independence: Yes (separate files, no cross-dependencies) - -**Model Selection:** Sonnet (pattern-based, medium complexity) - -**Execution:** - -``` -Phase 3.5: Dispatch Meta-Judges (3 in parallel, one per target) - [All 3 meta-judges launched simultaneously] - Meta-judge for user.ts (Opus)... - → Generated target-specific evaluation specification YAML - → 3 rubric dimensions, 5 checklist items tailored to user.ts - Meta-judge for order.ts (Opus)... - → Generated target-specific evaluation specification YAML - → 3 rubric dimensions, 6 checklist items tailored to order.ts - Meta-judge for payment.ts (Opus)... - → Generated target-specific evaluation specification YAML - → 3 rubric dimensions, 5 checklist items tailored to payment.ts - -Phase 5: Parallel Implementation Dispatch (after all meta-judges complete) - [All 3 implementation agents launched simultaneously] - - Target: user.ts - Implementation (Sonnet)... - -> Converted 4 nested if-else blocks to early returns - Judge Verification (Opus, with user.ts meta-judge spec)... - -> VERDICT: PASS, SCORE: 4.2/5.0 - -> IMPROVEMENTS: Consider extracting complex conditions - - Target: order.ts - Implementation (Sonnet)... - -> Converted 6 nested if-else blocks to early returns - Judge Verification (Opus, with order.ts meta-judge spec)... - -> VERDICT: PASS, SCORE: 4.0/5.0 - -> ISSUES: None - - Target: payment.ts - Implementation (Sonnet)... - -> Converted 3 nested if-else blocks - Judge Verification (Opus, with payment.ts meta-judge spec)... 
- -> VERDICT: FAIL, SCORE: 3.2/5.0 - -> ISSUES: Missing edge case for null amount - Retry Implementation (Sonnet)... - -> Added null check for payment amount - Judge Verification (Opus, with same payment.ts meta-judge spec)... - -> VERDICT: PASS, SCORE: 4.1/5.0 +/do-in-parallel add tests to all 3 modules in src folder and add tests step to github actions ``` -**Result:** -```markdown -## Parallel Execution Summary +**Orchestrator Analysis:** -### Configuration -- **Task:** Simplify error handling to use early returns -- **Model:** Sonnet -- **Targets:** 3 files +``` +Phase 2: Task Analysis + Requirement Grouping + +1. Task Identification: + - Task A: "Add tests to src/modules/auth.ts" + - Task B: "Add tests to src/modules/cart.ts" + - Task C: "Add tests to src/modules/payments.ts" + - Task D: "Add tests step to GitHub Actions CI pipeline" + +2. Requirement Grouping: + - Tasks A, B, C: REPEATABLE — same task ("add tests") applied to 3 different modules + → ONE shared meta-judge producing a reusable spec + - Task D: INDEPENDENT — different task type (CI configuration) + → Separate meta-judge + +3. Pre-existing and Expected Parallel Changes Assessment: + - Pre-existing (from prior batch): API documentation updated across + src/api/users.ts, src/api/orders.ts, src/api/products.ts + - Expected parallel: Each agent should be aware that other agents in this + batch are adding tests to other modules and updating GH Actions simultaneously + +4. Agent Count: + - Meta-judges: 2 (1 repeatable for tests + 1 independent for GH Actions) + - Implementation agents: 4 (one per task, always isolated) + - Judges: 4 (3 using shared test spec + 1 for GH Actions) + - Total: 10 agents (vs. 
12 without grouping) +``` -### Results +**Phase 3.5: Meta-Judge Dispatch (2 meta-judges in parallel):** -| Target | Model | Judge Score | Retries | Status | Summary | -|--------|-------|-------------|---------|--------|---------| -| src/services/user.ts | sonnet | 4.2/5.0 | 0 | SUCCESS | Converted 4 nested if-else blocks | -| src/services/order.ts | sonnet | 4.0/5.0 | 0 | SUCCESS | Converted 6 nested if-else blocks | -| src/services/payment.ts | sonnet | 4.1/5.0 | 1 | SUCCESS | Converted 3 blocks, added null check | +``` +[Meta-judge 1: Repeatable group — test generation] +Use Task tool: + - description: "Meta-judge (repeatable): reusable spec for adding tests across 3 modules" + - prompt: + ## Task + + Generate a REUSABLE evaluation specification yaml that can be applied to + ANY of the following targets performing the same task. You will produce + rubrics, checklists, and scoring criteria that individual judge agents + will each use independently to evaluate one target's implementation artifact. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt as Context + add tests to all 3 modules in src folder and add tests step to github actions + + ## Task Being Repeated + Add comprehensive unit tests to a source module + + ## Targets in This Group + - src/modules/auth.ts + - src/modules/cart.ts + - src/modules/payments.ts + + ## Context + Project uses Jest for testing. Test files should be co-located as + *.test.ts files. Existing test patterns available in src/modules/__tests__/. + + ## Artifact Type + code + + ## Instructions + CRITICAL: You are generating a REUSABLE spec that will be applied to + EACH target independently by separate judges. 
+ - Use generic language: "target file should align with criteria" instead + of "all files should align" + - Do NOT include file-specific requirements (e.g., NOT "auth.ts should + test only authentication logic") since this same spec will be applied + to different files + - The spec must be applicable to ANY target in this group without modification + - Each judge will receive this same spec and evaluate only its own target + against it + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. + Return only the final evaluation specification YAML in your response. + - model: opus + - subagent_type: "sadd:meta-judge" -### Overall Assessment -- **Completed:** 3/3 -- **Total Retries:** 1 -- **Total Agents:** 11 (3 meta-judges + 3 implementations + 1 retry + 4 judges) -- **Common patterns:** All files followed consistent early return pattern +[Meta-judge 2: Independent — GitHub Actions] +Use Task tool: + - description: "Meta-judge: add tests step to GitHub Actions" + - prompt: + ## Task + + Generate an evaluation specification yaml for the following task applied + to a specific target. You will produce rubrics, checklists, and scoring + criteria that a judge agent will use to evaluate the implementation + artifact for this specific target. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt as Context + add tests to all 3 modules in src folder and add tests step to github actions + + ## Target + Add a test execution step to the GitHub Actions CI pipeline + (.github/workflows/ci.yml or similar) + + ## Context + Project uses Jest for testing. The CI pipeline should run tests after + build step. Existing workflow file may need a new job or step. + + ## Artifact Type + configuration + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. 
Generate + evaluation specification ONLY for adding the tests step to GitHub Actions. + Your report will be used to verify only this particular task, not the + all tasks in the user prompt. + Return only the final evaluation specification YAML in your response. + - model: opus + - subagent_type: "sadd:meta-judge" + +[Both meta-judges launched simultaneously] ``` ---- +**Phase 5: Implementation Dispatch (4 agents in parallel, after meta-judges complete):** -### Example 2: Documentation Generation +``` +[Implementation 1: auth module tests] +Use Task tool: + - description: "Parallel: add tests to src/modules/auth.ts" + - prompt: + ## Reasoning Approach + Let's think step by step. + Before taking any action, think through the problem systematically: + 1. "Let me first understand what is being asked for this specific target..." + 2. "Let me analyze this specific target..." + 3. "Let me plan my approach..." + Work through each step explicitly before implementing. + + Add comprehensive unit tests + src/modules/auth.ts + + - Work ONLY on the specified target + - Do NOT modify other files unless explicitly required + - Follow existing test patterns in the project + + + Create test file for the auth module. + CRITICAL: At the end of your work, provide a "Summary" section containing: + - Files modified (full paths) + - Key changes (3-5 bullet points) + - Any decisions made and rationale + + + ## Self-Critique Verification (MANDATORY) + [standard self-critique suffix] + - model: sonnet + +[Implementation 2: cart module tests] +Use Task tool: + - description: "Parallel: add tests to src/modules/cart.ts" + - prompt: + ## Reasoning Approach + Let's think step by step. + Before taking any action, think through the problem systematically: + 1. "Let me first understand what is being asked for this specific target..." + 2. "Let me analyze this specific target..." + 3. "Let me plan my approach..." + Work through each step explicitly before implementing. 
+ + Add comprehensive unit tests + src/modules/cart.ts + + - Work ONLY on the specified target + - Do NOT modify other files unless explicitly required + - Follow existing test patterns in the project + + + Create test file for the cart module. + CRITICAL: At the end of your work, provide a "Summary" section containing: + - Files modified (full paths) + - Key changes (3-5 bullet points) + - Any decisions made and rationale + + + ## Self-Critique Verification (MANDATORY) + Before submitting, verify your work: + 1. Re-read the original task and confirm every requirement is addressed + 2. Check that all tests follow existing patterns in the project + 3. Verify no unrelated files were modified + 4. Confirm the Summary section is complete and accurate + - model: sonnet + +[Implementation 3: payments module tests] +Use Task tool: + - description: "Parallel: add tests to src/modules/payments.ts" + - prompt: [Same CoT prefix + task body for payments.ts + critique suffix] + - model: sonnet -**Input:** +[Implementation 4: GitHub Actions test step] +Use Task tool: + - description: "Parallel: add tests step to GitHub Actions CI" + - prompt: + ## Reasoning Approach + Let's think step by step. + Before taking any action, think through the problem systematically: + 1. "Let me first understand what is being asked for this specific target..." + 2. "Let me analyze this specific target..." + 3. "Let me plan my approach..." + Work through each step explicitly before implementing. + + Add a test execution step to the GitHub Actions CI pipeline + .github/workflows/ci.yml + + - Work ONLY on the CI workflow file + - Add a step that runs the test suite after the build step + - Do NOT modify other workflow files or steps beyond what is necessary + - Follow existing workflow patterns and conventions + + + Update the CI workflow with a test execution step. 
+ CRITICAL: At the end of your work, provide a "Summary" section containing: + - Files modified (full paths) + - Key changes (3-5 bullet points) + - Any decisions made and rationale + + + ## Self-Critique Verification (MANDATORY) + Before submitting, verify your work: + 1. Re-read the original task and confirm every requirement is addressed + 2. Check that the workflow YAML is valid and well-structured + 3. Verify no unrelated workflow steps were modified + 4. Confirm the Summary section is complete and accurate + - model: sonnet + +[All 4 launched simultaneously] ``` -/do-in-parallel "Generate JSDoc documentation for all public methods" \ - --files "src/api/users.ts,src/api/products.ts,src/api/orders.ts,src/api/auth.ts" + +**Phase 5.2: Judge Dispatch (4 judges in parallel, after ALL implementors complete):** + ``` +[Judge 1: auth module — uses SHARED reusable spec from repeatable meta-judge] +Use Task tool: + - description: "Judge: src/modules/auth.ts" + - prompt: + You are evaluating an implementation artifact for target + src/modules/auth.ts against an evaluation specification produced + by the meta judge. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt + add tests to all 3 modules in src folder and add tests step to github actions + + ## Target + src/modules/auth.ts + + ## Pre-existing and expected parallel changes (Context Only) + + The following changes were made before or expected to be done by + other parallel agents in the same batch now. They are NOT part of + the current implementation agent's output. Focus your evaluation + on the current agent's changes to its specific target. Only verify + other changed files/logic if they directly relate to the current + target's task requirements. 
+ + ### Previous do-in-parallel: "Update API documentation for all endpoints" + The following files were modified as part of a previous parallel batch: + - src/api/users.ts (modified) - Added JSDoc to public methods, + updated module header + - src/api/orders.ts (modified) - Added JSDoc to public methods, + added @example tags + - src/api/products.ts (modified) - Added JSDoc to public methods, + updated type annotations + + ### Expected parallel changes (current batch) + Other agents in this batch are simultaneously: + - Adding tests to src/modules/cart.ts and src/modules/payments.ts + (repeatable group — same task on other modules) + - Adding a tests step to .github/workflows/ci.yml (independent task) + + ## Evaluation Specification + ```yaml + {EXACT reusable spec YAML from repeatable meta-judge — same for all 3 module judges} + ``` + + ## Implementation Output + {Summary from auth implementation agent} + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Evaluate ONLY + the test generation for auth.ts. + Follow your full judge process as defined in your agent instructions! + + ## Output + CRITICAL: You must reply with this exact structured evaluation report + format in YAML at the START of your response! + - model: opus + - subagent_type: "sadd:judge" -**Analysis:** -- Task type: Documentation generation -- Per-target complexity: Low (mechanical documentation) -- Output size: Medium (inline comments) -- Independence: Yes +[Judge 2: cart module — uses SAME shared reusable spec] +Use Task tool: + - description: "Judge: src/modules/cart.ts" + - prompt: [Same judge template, same reusable spec YAML, cart implementation output. 
+ Pre-existing and expected parallel changes section: same prior batch info, + expected parallel changes list auth.ts, payments.ts, and GH Actions instead] + - model: opus + - subagent_type: "sadd:judge" + +[Judge 3: payments module — uses SAME shared reusable spec] +Use Task tool: + - description: "Judge: src/modules/payments.ts" + - prompt: [Same judge template, same reusable spec YAML, payments implementation output. + Pre-existing and expected parallel changes section: same prior batch info, + expected parallel changes list auth.ts, cart.ts, and GH Actions instead] + - model: opus + - subagent_type: "sadd:judge" -**Model Selection:** Haiku (mechanical, well-defined rules) +[Judge 4: GitHub Actions — uses INDEPENDENT spec from GH Actions meta-judge] +Use Task tool: + - description: "Judge: GitHub Actions CI" + - prompt: + You are evaluating an implementation artifact for target + .github/workflows/ci.yml against an evaluation specification produced + by the meta judge. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt + add tests to all 3 modules in src folder and add tests step to github actions + + ## Target + .github/workflows/ci.yml + + ## Pre-existing and expected parallel changes (Context Only) + + The following changes were made before or expected to be done by + other parallel agents in the same batch now. They are NOT part of + the current implementation agent's output. Focus your evaluation + on the current agent's changes to its specific target. Only verify + other changed files/logic if they directly relate to the current + target's task requirements. 
+ + ### Previous do-in-parallel: "Update API documentation for all endpoints" + The following files were modified as part of a previous parallel batch: + - src/api/users.ts (modified) - Added JSDoc to public methods, + updated module header + - src/api/orders.ts (modified) - Added JSDoc to public methods, + added @example tags + - src/api/products.ts (modified) - Added JSDoc to public methods, + updated type annotations + + ### Expected parallel changes (current batch) + Other agents in this batch are simultaneously: + - Adding tests to src/modules/auth.ts, src/modules/cart.ts, + and src/modules/payments.ts (repeatable group — test generation) + + ## Evaluation Specification + ```yaml + {EXACT spec YAML from independent GH Actions meta-judge} + ``` + + ## Implementation Output + {Summary from GH Actions implementation agent} + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Evaluate ONLY + the GitHub Actions test step. + Follow your full judge process as defined in your agent instructions! + + ## Output + CRITICAL: You must reply with this exact structured evaluation report + format in YAML at the START of your response! 
+ - model: opus + - subagent_type: "sadd:judge" -**Dispatch:** 4 meta-judges (parallel) → 4 implementors (parallel) → 4 judges +[All 4 judges launched simultaneously] +``` -**Execution Summary:** +**Result:** -| Target | Model | Judge Score | Retries | Status | -|--------|-------|-------------|---------|--------| -| src/api/users.ts | haiku | 4.0/5.0 | 0 | SUCCESS | -| src/api/products.ts | haiku | 3.8/5.0 | 0 | SUCCESS | -| src/api/orders.ts | haiku | 4.2/5.0 | 0 | SUCCESS | -| src/api/auth.ts | haiku | 4.1/5.0 | 0 | SUCCESS | +| Target | Grouping | Model | Judge Score | Retries | Status | +|--------|----------|-------|-------------|---------|--------| +| src/modules/auth.ts | Repeatable | sonnet | 4.2/5.0 | 0 | SUCCESS | +| src/modules/cart.ts | Repeatable | sonnet | 4.0/5.0 | 0 | SUCCESS | +| src/modules/payments.ts | Repeatable | sonnet | 4.1/5.0 | 0 | SUCCESS | +| .github/workflows/ci.yml | Independent | sonnet | 4.3/5.0 | 0 | SUCCESS | -Total Agents: 12 (4 meta-judges + 4 implementations + 4 judges) +**Overall:** 4/4 completed. Total Agents: 10 (2 meta-judges + 4 implementations + 4 judges) --- -### Example 3: Security Analysis +### Example 2: Requirement Grouping -- Shared + Repeatable Combined (with Pre-existing User Changes) + +**Scenario:** + +A developer has been working on a Node.js backend during the conversation. They refactored the database connection layer and updated several service modules manually, including adding S3 class interface. Then they invoked do-in-parallel to implement and integrate the S3 interface, and also refactor the cart module. Each agent's judge needs to know about the user's prior modifications AND the expected changes from other parallel agents in the same batch. 
**Input:** + ``` -/do-in-parallel "Analyze for potential SQL injection vulnerabilities and suggest fixes" \ - --files "src/db/queries.ts,src/db/migrations.ts,src/api/search.ts" +/do-in-parallel I wrote class interface for S3 service in s3.adapter.ts, please do 2 tasks: implement s3 adapter with tests and integrate s3 adapter to analytics module. Also refactor and simplify all files in cart module ``` -**Analysis:** -- Task type: Security analysis -- Per-target complexity: High (security requires careful analysis) -- Output size: Medium (analysis report + suggestions) -- Independence: Yes +**Orchestrator Analysis:** -**Model Selection:** Opus (security-critical, requires deep analysis) +``` +Phase 2: Task Analysis + Requirement Grouping + +1. Task Identification: + - Task A: "Implement S3 adapter with tests in src/adapters/s3.adapter.ts" + - Task B: "Integrate S3 adapter into src/modules/analytics.module.ts" + - Task C: "Refactor and simplify src/modules/cart/cart.service.ts" + - Task D: "Refactor and simplify src/modules/cart/cart.repository.ts" + - Task E: "Refactor and simplify src/modules/cart/cart.controller.ts" + +2. Requirement Grouping: + - Tasks A, B: SHARED — interdependent (adapter must match interface consumed + by analytics integration; should be reviewed together) + → ONE combined meta-judge, ONE shared judge + - Tasks C, D, E: REPEATABLE — same task ("refactor and simplify") applied + to 3 different files in cart module + → ONE reusable meta-judge + +3. Pre-existing and Expected Parallel Changes Assessment: + - Pre-existing (user modifications): Refactored database connection layer + (src/db/connection.ts, src/db/queries.ts), updated service modules, + and added S3 class interface in src/adapters/s3.adapter.ts + - Expected parallel: S3 adapter implementation and analytics integration + run in parallel (shared group); cart refactoring agents run in parallel + (repeatable group); both groups run simultaneously + +4. 
Agent Count: + - Meta-judges: 2 (1 shared for S3 work + 1 repeatable for cart refactoring) + - Implementation agents: 5 (one per task, always isolated) + - Judges: 4 (1 shared for S3 group + 3 individual for cart) + - Total: 11 agents (vs. 15 without grouping) +``` -**Dispatch:** 3 meta-judges (parallel) → 3 implementors (parallel) → 3 judges + retries +**Phase 3.5: Meta-Judge Dispatch (2 meta-judges in parallel):** -**Execution Summary:** +``` +[Meta-judge 1: Shared group — S3 adapter + integration] +Use Task tool: + - description: "Meta-judge (shared): combined spec for S3 adapter and analytics integration" + - prompt: + ## Task + + Generate a COMBINED evaluation specification yaml that covers ALL of the + following related tasks. These tasks are interdependent and will be + reviewed TOGETHER by a single judge. You will produce rubrics, checklists, + and scoring criteria that account for cross-task dependencies and + integration points. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt as Context + I wrote class interface for S3 service in s3.adapter.ts, please do 2 tasks: + implement s3 adapter with tests and integrate s3 adapter to analytics module. + Also refactor and simplify all files in cart module + + ## Tasks in This Shared Group + - Task A: Implement S3 adapter with tests -> src/adapters/s3.adapter.ts + - Task B: Integrate S3 adapter into analytics module -> src/modules/analytics.module.ts + + ## Context + The user has already written the class interface in s3.adapter.ts. Task A + implements the interface methods and adds unit tests. Task B integrates the + adapter into the analytics module. The adapter's public API from Task A must + match what Task B consumes. + + ## Artifact Type + code + + ## Instructions + CRITICAL: You are generating a COMBINED spec for tasks that will be + reviewed TOGETHER by ONE judge. 
+ - Include evaluation criteria for EACH individual task + - Include cross-task verification criteria (e.g., "S3 adapter's public + methods match the calls made by the analytics integration") + - Organize the spec so the judge can identify which criteria apply to + which task's changes + - The judge will review ALL changes from ALL tasks in this group in a + single evaluation + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. + Return only the final evaluation specification YAML in your response. + - model: opus + - subagent_type: "sadd:meta-judge" + +[Meta-judge 2: Repeatable group — cart refactoring] +Use Task tool: + - description: "Meta-judge (repeatable): reusable spec for refactoring cart module files" + - prompt: + ## Task + + Generate a REUSABLE evaluation specification yaml that can be applied to + ANY of the following targets performing the same task. You will produce + rubrics, checklists, and scoring criteria that individual judge agents + will each use independently to evaluate one target's implementation artifact. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt as Context + I wrote class interface for S3 service in s3.adapter.ts, please do 2 tasks: + implement s3 adapter with tests and integrate s3 adapter to analytics module. + Also refactor and simplify all files in cart module + + ## Task Being Repeated + Refactor and simplify a source file in the cart module + + ## Targets in This Group + - src/modules/cart/cart.service.ts + - src/modules/cart/cart.repository.ts + - src/modules/cart/cart.controller.ts + + ## Context + All three files are in the cart module. Refactoring should simplify logic, + reduce complexity, improve readability while preserving existing behavior. + + ## Artifact Type + code + + ## Instructions + CRITICAL: You are generating a REUSABLE spec that will be applied to + EACH target independently by separate judges. 
+ - Use generic language: "target file should align with criteria" instead + of "all files should align" + - Do NOT include file-specific requirements since this same spec will be + applied to different files + - The spec must be applicable to ANY target in this group without modification + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. + Return only the final evaluation specification YAML in your response. + - model: opus + - subagent_type: "sadd:meta-judge" -| Target | Model | Judge Score | Retries | Status | -|--------|-------|-------------|---------|--------| -| src/db/queries.ts | opus | 4.5/5.0 | 0 | SUCCESS | -| src/db/migrations.ts | opus | 4.3/5.0 | 0 | SUCCESS | -| src/api/search.ts | opus | 4.0/5.0 | 1 | SUCCESS | +[Both meta-judges launched simultaneously] +``` -Total Agents: 11 (3 meta-judges + 3 implementations + 1 retry + 4 judges) +**Phase 5: Implementation Dispatch (5 agents in parallel, after meta-judges complete):** ---- +``` +[Implementation 1: S3 adapter] +Use Task tool: + - description: "Parallel: implement S3 adapter with tests" + - prompt: + ## Reasoning Approach + Let's think step by step. + Before taking any action, think through the problem systematically: + 1. "Let me first understand what is being asked for this specific target..." + 2. "Let me analyze this specific target..." + 3. "Let me plan my approach..." + Work through each step explicitly before implementing. + + Implement S3 adapter with tests based on the existing class interface + src/adapters/s3.adapter.ts + + - Work ONLY on the specified target + - Implement all methods defined in the existing class interface + - Add comprehensive unit tests + - Do NOT modify the analytics module + + + Implement the S3 adapter and create tests. 
+ CRITICAL: At the end of your work, provide a "Summary" section containing: + - Files modified (full paths) + - Key changes (3-5 bullet points) + - Any decisions made and rationale + + + ## Self-Critique Verification (MANDATORY) + Before submitting, verify your work: + 1. Re-read the original task and confirm every requirement is addressed + 2. Check that the adapter implements all interface methods correctly + 3. Verify no unrelated files were modified + 4. Confirm the Summary section is complete and accurate + - model: opus -### Example 4: Test Generation with Partial Failure +[Implementation 2: Analytics integration] +Use Task tool: + - description: "Parallel: integrate S3 adapter into analytics module" + - prompt: + ## Reasoning Approach + [standard CoT prefix] + + Integrate S3 adapter into the analytics module + src/modules/analytics.module.ts + + - Work ONLY on the specified target + - Import and use the S3 adapter from src/adapters/s3.adapter.ts + - Follow existing dependency injection patterns + - Do NOT modify the S3 adapter itself + + + Integrate S3 adapter into analytics module. + CRITICAL: At the end of your work, provide a "Summary" section. + + + ## Self-Critique Verification (MANDATORY) + [standard self-critique suffix] + - model: opus -**Input:** +[Implementation 3: cart.service.ts refactoring] +Use Task tool: + - description: "Parallel: refactor src/modules/cart/cart.service.ts" + - prompt: + ## Reasoning Approach + Let's think step by step. + Before taking any action, think through the problem systematically: + 1. "Let me first understand what is being asked for this specific target..." + 2. "Let me analyze this specific target..." + 3. "Let me plan my approach..." + Work through each step explicitly before implementing. 
+ + Refactor and simplify the cart service + src/modules/cart/cart.service.ts + + - Work ONLY on the specified target + - Simplify logic, reduce complexity, improve readability + - Preserve existing behavior — no functional changes + - Do NOT modify other cart module files + + + Refactor the cart service file. + CRITICAL: At the end of your work, provide a "Summary" section containing: + - Files modified (full paths) + - Key changes (3-5 bullet points) + - Any decisions made and rationale + + + ## Self-Critique Verification (MANDATORY) + Before submitting, verify your work: + 1. Re-read the original task and confirm every requirement is addressed + 2. Check that existing behavior is preserved after refactoring + 3. Verify no unrelated files were modified + 4. Confirm the Summary section is complete and accurate + - model: sonnet + +[Implementation 4: cart.repository.ts refactoring] +Use Task tool: + - description: "Parallel: refactor src/modules/cart/cart.repository.ts" + - prompt: [Same CoT prefix + refactoring task body for cart.repository.ts + critique suffix] + - model: sonnet + +[Implementation 5: cart.controller.ts refactoring] +Use Task tool: + - description: "Parallel: refactor src/modules/cart/cart.controller.ts" + - prompt: [Same CoT prefix + refactoring task body for cart.controller.ts + critique suffix] + - model: sonnet + +[All 5 launched simultaneously] ``` -/do-in-parallel "Generate unit tests achieving 80% coverage" \ - --targets "UserService,OrderService,PaymentService,NotificationService" + +**Phase 5.2: Judge Dispatch (4 judges in parallel, after ALL implementors complete):** + ``` +[Judge 1: SHARED judge for S3 group — reviews both S3 adapter + analytics integration] +Use Task tool: + - description: "Judge (shared): S3 adapter implementation and analytics integration" + - prompt: + You are evaluating implementation artifacts for a group of related tasks + against a combined evaluation specification produced by the meta judge. 
+ These tasks are interdependent and must be reviewed together. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt + I wrote class interface for S3 service in s3.adapter.ts, please do 2 tasks: + implement s3 adapter with tests and integrate s3 adapter to analytics module. + Also refactor and simplify all files in cart module + + ## Tasks in This Shared Group + - Task A: Implement S3 adapter with tests -> src/adapters/s3.adapter.ts + - Task B: Integrate S3 adapter into analytics module -> src/modules/analytics.module.ts + + ## Pre-existing and expected parallel changes (Context Only) + + The following changes were made before or expected to be done by + other parallel agents in the same batch now. They are NOT part of + the current implementation agents' output for this shared group. + Focus your evaluation on the S3 group's changes. Only verify other + changed files/logic if they directly relate to these tasks. + + ### User modifications (before current task) + The user made changes to the following files/modules before this + task was started: + - src/db/connection.ts (modified) - Refactored database connection + pooling + - src/db/queries.ts (modified) - Updated query builder patterns + - src/adapters/s3.adapter.ts (created) - Added S3 class interface + (the interface that Task A implements) + - Several service modules updated to use new DB connection API + + ### Expected parallel changes (current batch) + Other agents in this batch are simultaneously: + - Refactoring src/modules/cart/cart.service.ts (repeatable group) + - Refactoring src/modules/cart/cart.repository.ts (repeatable group) + - Refactoring src/modules/cart/cart.controller.ts (repeatable group) + + ## Evaluation Specification + ```yaml + {EXACT combined spec YAML from shared S3 meta-judge} + ``` + + ## Implementation Outputs + ### Task: Implement S3 adapter with tests -> src/adapters/s3.adapter.ts + {Summary from S3 adapter implementation agent} + Files: src/adapters/s3.adapter.ts 
(modified), src/adapters/s3.adapter.test.ts (created) + + ### Task: Integrate S3 adapter into analytics -> src/modules/analytics.module.ts + {Summary from analytics integration agent} + Files: src/modules/analytics.module.ts (modified) + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Evaluate ALL + tasks in this shared group together. Verify cross-task integration points + (e.g., does the adapter's public API match what the analytics module consumes?). + CRITICAL: For each task, indicate separately whether it PASSED or FAILED + so that only failing tasks can be retried. + Follow your full judge process as defined in your agent instructions! + + ## Output + CRITICAL: You must reply with this exact structured evaluation report + format in YAML at the START of your response! Include per-task verdicts. + - model: opus + - subagent_type: "sadd:judge" + +[Judge 2: cart.service.ts — uses SHARED reusable spec from repeatable meta-judge] +Use Task tool: + - description: "Judge: src/modules/cart/cart.service.ts" + - prompt: + You are evaluating an implementation artifact for target + src/modules/cart/cart.service.ts against an evaluation specification + produced by the meta judge. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt + [original user prompt] + + ## Target + src/modules/cart/cart.service.ts + + ## Pre-existing and expected parallel changes (Context Only) + + The following changes were made before or expected to be done by + other parallel agents in the same batch now. They are NOT part of + the current implementation agent's output. Focus your evaluation + on the current agent's changes to its specific target. Only verify + other changed files/logic if they directly relate to the current + target's task requirements. 
+ + ### User modifications (before current task) + The user made changes to the following files/modules before this + task was started: + - src/db/connection.ts (modified) - Refactored database connection + pooling + - src/db/queries.ts (modified) - Updated query builder patterns + - src/adapters/s3.adapter.ts (created) - Added S3 class interface + - Several service modules updated to use new DB connection API + + ### Expected parallel changes (current batch) + Other agents in this batch are simultaneously: + - Implementing S3 adapter in src/adapters/s3.adapter.ts (shared group) + - Integrating S3 adapter into src/modules/analytics.module.ts (shared group) + - Refactoring src/modules/cart/cart.repository.ts (repeatable group) + - Refactoring src/modules/cart/cart.controller.ts (repeatable group) + + ## Evaluation Specification + ```yaml + {EXACT reusable spec YAML from repeatable cart meta-judge — same for all 3 cart judges} + ``` + + ## Implementation Output + {Summary from cart.service.ts implementation agent} + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Evaluate ONLY + the refactoring of cart.service.ts. + Follow your full judge process as defined in your agent instructions! + + ## Output + CRITICAL: You must reply with this exact structured evaluation report + format in YAML at the START of your response! + - model: opus + - subagent_type: "sadd:judge" -**Analysis:** -- Task type: Test generation -- Per-target complexity: Medium (follow testing patterns) -- Output size: Large (multiple test files) -- Independence: Yes (separate services) +[Judge 3: cart.repository.ts — uses SAME shared reusable spec] +Use Task tool: + - description: "Judge: src/modules/cart/cart.repository.ts" + - prompt: [Same judge template, same reusable spec YAML, cart.repository implementation output. 
+ Pre-existing and expected parallel changes section: same user modifications, + expected parallel changes list S3 group, cart.service.ts, and cart.controller.ts instead] + - model: opus + - subagent_type: "sadd:judge" -**Model Selection:** Sonnet (pattern-based, extensive output) +[Judge 4: cart.controller.ts — uses SAME shared reusable spec] +Use Task tool: + - description: "Judge: src/modules/cart/cart.controller.ts" + - prompt: [Same judge template, same reusable spec YAML, cart.controller implementation output. + Pre-existing and expected parallel changes section: same user modifications, + expected parallel changes list S3 group, cart.service.ts, and cart.repository.ts instead] + - model: opus + - subagent_type: "sadd:judge" -**Dispatch:** 4 meta-judges (parallel) → 4 implementors (parallel) → judges + retries +[All 4 judges launched simultaneously] +``` -**Execution:** +**Shared judge retry scenario** (if S3 shared judge finds issues): ``` -Phase 3.5: Meta-judge Batch (4 in parallel, one per target) - [All 4 meta-judges launched simultaneously] - Meta-judge for UserService (Opus) → target-specific evaluation spec YAML - Meta-judge for OrderService (Opus) → target-specific evaluation spec YAML - Meta-judge for PaymentService (Opus) → target-specific evaluation spec YAML - Meta-judge for NotificationService (Opus) → target-specific evaluation spec YAML +Shared Judge Verdict: + - Task A (S3 adapter): PASS, SCORE: 4.2/5.0 + - Task B (analytics integration): FAIL, SCORE: 3.0/5.0 + ISSUES: Analytics module imports wrong method name from S3 adapter + - CROSS-TASK ISSUES: Method signature mismatch between adapter and consumer + +Retry Decision: + → Task A PASSED — do NOT re-launch S3 adapter implementation agent + → Task B FAILED — re-launch ONLY the analytics integration agent with feedback + → After retry, re-launch shared judge to review ALL changes again +``` + +**Result:** -Phase 5: Implementation Batch (4 in parallel, after all meta-judges complete) - [All 4 
implementors launched simultaneously] +| Target | Grouping | Model | Judge Score | Retries | Status | +|--------|----------|-------|-------------|---------|--------| +| src/adapters/s3.adapter.ts | Shared | opus | 4.2/5.0 | 0 | SUCCESS | +| src/modules/analytics.module.ts | Shared | opus | 4.1/5.0 | 1 | SUCCESS | +| src/modules/cart/cart.service.ts | Repeatable | sonnet | 4.0/5.0 | 0 | SUCCESS | +| src/modules/cart/cart.repository.ts | Repeatable | sonnet | 4.3/5.0 | 0 | SUCCESS | +| src/modules/cart/cart.controller.ts | Repeatable | sonnet | 4.1/5.0 | 0 | SUCCESS | -Target: UserService - -> Judge (Opus, with UserService meta-judge spec): PASS, 4.3/5.0 +**Overall:** 5/5 completed. Total Agents: 12 (2 meta-judges + 5 implementations + 1 retry + 4 judges [1 shared re-run + 3 cart]) -Target: OrderService - -> Judge (Opus, with OrderService meta-judge spec): FAIL, 3.2/5.0 (missing edge cases) - -> Retry: Judge (Opus, same OrderService spec): PASS, 4.0/5.0 +--- + +### Example 3: Requirement Grouping -- All Independent -Target: PaymentService - -> Judge (Opus, with PaymentService meta-judge spec): FAIL, 2.8/5.0 (wrong mock patterns) - -> Retry 1: Judge (Opus, same PaymentService spec): FAIL, 3.0/5.0 (still missing scenarios) - -> Retry 2: Judge (Opus, same PaymentService spec): FAIL, 3.1/5.0 (coverage only 65%) - -> Retry 3: Judge (Opus, same PaymentService spec): FAIL, 3.2/5.0 (coverage at 72%) - -> MARKED FAILED after max retries +**Input:** -Target: NotificationService - -> Judge (Opus, with NotificationService meta-judge spec): PASS, 4.1/5.0 +``` +/do-in-parallel write tests for loan.service.ts, add password recovery feature to auth module and enable caching during dependency loading in github actions. 
``` -**Result:** +**Orchestrator Analysis:** -| Target | Model | Judge Score | Retries | Status | -|--------|-------|-------------|---------|--------| -| UserService | sonnet | 4.3/5.0 | 0 | SUCCESS | -| OrderService | sonnet | 4.0/5.0 | 1 | SUCCESS | -| PaymentService | sonnet | 3.2/5.0 | 3 | FAILED | -| NotificationService | sonnet | 4.1/5.0 | 0 | SUCCESS | +``` +Phase 2: Task Analysis + Requirement Grouping + +1. Task Identification: + - Task A: "Write tests for src/services/loan.service.ts" + - Task B: "Add password recovery feature to src/modules/auth/" + - Task C: "Enable caching during dependency loading in .github/workflows/ci.yml" + +2. Requirement Grouping: + - Task A: INDEPENDENT — test generation for a specific service + - Task B: INDEPENDENT — new feature in auth module (unrelated to tasks A and C) + - Task C: INDEPENDENT — CI configuration change (unrelated to tasks A and B) + - No grouping possible: all 3 tasks are different task types on different targets + +3. Agent Count: + - Meta-judges: 3 (one per task — standard flow) + - Implementation agents: 3 (one per task) + - Judges: 3 (one per task) + - Total: 9 agents (no reduction possible) +``` -**Overall:** 3/4 completed, 1 failed +**Phase 3.5: Meta-Judge Dispatch (3 meta-judges in parallel):** -**Escalation for PaymentService:** -```markdown -### Failed Target: PaymentService -- **Final Score:** 3.2/5.0 -- **Persistent Issues:** - - Test coverage at 72%, target is 80% - - Complex async scenarios not fully covered -- **Options:** - 1. Provide guidance on specific async patterns to test - 2. Accept 72% coverage as sufficient - 3. Manual test writing for complex scenarios ``` +[Meta-judge 1: Independent — loan service tests] +Use Task tool: + - description: "Meta-judge: write tests for loan.service.ts" + - prompt: + ## Task + + Generate an evaluation specification yaml for the following task applied + to a specific target. 
You will produce rubrics, checklists, and scoring + criteria that a judge agent will use to evaluate the implementation + artifact for this specific target. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt as Context + write tests for loan.service.ts, add password recovery feature to auth + module and enable caching during dependency loading in github actions. + + ## Target + Write comprehensive unit tests for src/services/loan.service.ts + + ## Context + Project uses Jest. Tests should cover all public methods, edge cases, + and error scenarios for the loan service. + + ## Artifact Type + code + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Generate + evaluation specification ONLY for the loan service test generation. + Your report will be used to verify only this particular task, not the + all tasks in the user prompt. + Return only the final evaluation specification YAML in your response. + - model: opus + - subagent_type: "sadd:meta-judge" ---- +[Meta-judge 2: Independent — password recovery feature] +Use Task tool: + - description: "Meta-judge: add password recovery to auth module" + - prompt: + ## Task + + Generate an evaluation specification yaml for the following task applied + to a specific target. You will produce rubrics, checklists, and scoring + criteria that a judge agent will use to evaluate the implementation + artifact for this specific target. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt as Context + write tests for loan.service.ts, add password recovery feature to auth + module and enable caching during dependency loading in github actions. + + ## Target + Add password recovery feature to src/modules/auth/ (password reset flow: + request, token generation, validation, password update) + + ## Context + Auth module handles authentication. 
Password recovery requires new + endpoints, email integration, token management. + + ## Artifact Type + code + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Generate + evaluation specification ONLY for the password recovery feature. + Your report will be used to verify only this particular task, not the + all tasks in the user prompt. + Return only the final evaluation specification YAML in your response. + - model: opus + - subagent_type: "sadd:meta-judge" + +[Meta-judge 3: Independent — GH Actions caching] +Use Task tool: + - description: "Meta-judge: enable dependency caching in GitHub Actions" + - prompt: + ## Task + + Generate an evaluation specification yaml for the following task applied + to a specific target. You will produce rubrics, checklists, and scoring + criteria that a judge agent will use to evaluate the implementation + artifact for this specific target. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt as Context + write tests for loan.service.ts, add password recovery feature to auth + module and enable caching during dependency loading in github actions. + + ## Target + Enable caching during dependency loading in .github/workflows/ci.yml + (e.g., npm/yarn cache, actions/cache) + + ## Context + GitHub Actions CI pipeline. Dependency installation step should use + caching to speed up builds. + + ## Artifact Type + configuration + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Generate + evaluation specification ONLY for enabling dependency caching in GH Actions. + Your report will be used to verify only this particular task, not the + all tasks in the user prompt. + Return only the final evaluation specification YAML in your response. 
+ - model: opus + - subagent_type: "sadd:meta-judge" -### Example 5: Inferred Targets from Task +[All 3 meta-judges launched simultaneously] +``` + +**Phase 5: Implementation Dispatch (3 agents in parallel, after meta-judges complete):** -**Input:** ``` -/do-in-parallel "Apply consistent logging format to src/handlers/user.ts, src/handlers/order.ts, and src/handlers/product.ts" +[Implementation 1: loan service tests] +Use Task tool: + - description: "Parallel: write tests for loan.service.ts" + - prompt: + ## Reasoning Approach + Let's think step by step. + Before taking any action, think through the problem systematically: + 1. "Let me first understand what is being asked for this specific target..." + 2. "Let me analyze this specific target..." + 3. "Let me plan my approach..." + Work through each step explicitly before implementing. + + Write comprehensive unit tests for the loan service + src/services/loan.service.ts + + - Work ONLY on the specified target + - Create test file co-located with the service + - Cover all public methods, edge cases, and error scenarios + - Follow existing test patterns in the project + + + Create test file for the loan service. + CRITICAL: At the end of your work, provide a "Summary" section containing: + - Files modified (full paths) + - Key changes (3-5 bullet points) + - Any decisions made and rationale + + + ## Self-Critique Verification (MANDATORY) + Before submitting, verify your work: + 1. Re-read the original task and confirm every requirement is addressed + 2. Check that all tests follow existing patterns in the project + 3. Verify no unrelated files were modified + 4. 
Confirm the Summary section is complete and accurate + - model: sonnet + +[Implementation 2: password recovery] +Use Task tool: + - description: "Parallel: add password recovery feature to auth module" + - prompt: + ## Reasoning Approach + [standard CoT prefix] + + Add password recovery feature to the auth module + src/modules/auth/ + + - Work ONLY on the auth module + - Implement password reset request, token generation, validation, + and password update + - Follow existing auth module patterns + - Do NOT modify unrelated modules + + + Implement password recovery feature. + CRITICAL: At the end of your work, provide a "Summary" section. + + + ## Self-Critique Verification (MANDATORY) + [standard self-critique suffix] + - model: opus + +[Implementation 3: GH Actions caching] +Use Task tool: + - description: "Parallel: enable dependency caching in GitHub Actions" + - prompt: + ## Reasoning Approach + [standard CoT prefix] + + Enable caching during dependency loading in CI pipeline + .github/workflows/ci.yml + + - Work ONLY on the CI workflow file + - Add dependency caching (npm/yarn cache or actions/cache) + - Do NOT modify other workflow steps beyond what is necessary + + + Update CI workflow with dependency caching. + CRITICAL: At the end of your work, provide a "Summary" section. + + + ## Self-Critique Verification (MANDATORY) + [standard self-critique suffix] + - model: sonnet + +[All 3 launched simultaneously] +``` + +**Phase 5.2: Judge Dispatch (3 judges in parallel, after ALL implementors complete):** + ``` +[Judge 1: loan service tests — independent spec] +Use Task tool: + - description: "Judge: loan.service.ts tests" + - prompt: + You are evaluating an implementation artifact for target + src/services/loan.service.ts against an evaluation specification + produced by the meta judge. 
+ + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt + write tests for loan.service.ts, add password recovery feature to auth + module and enable caching during dependency loading in github actions. + + ## Target + src/services/loan.service.ts + + ## Evaluation Specification + ```yaml + {EXACT spec YAML from loan service meta-judge} + ``` + + ## Implementation Output + {Summary from loan service test implementation agent} + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Evaluate ONLY + the test generation for loan.service.ts. + Follow your full judge process as defined in your agent instructions! + + ## Output + CRITICAL: You must reply with this exact structured evaluation report + format in YAML at the START of your response! + - model: opus + - subagent_type: "sadd:judge" + +[Judge 2: password recovery — independent spec] +Use Task tool: + - description: "Judge: auth password recovery" + - prompt: + You are evaluating an implementation artifact for target + src/modules/auth/ against an evaluation specification produced + by the meta judge. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt + [original user prompt] + + ## Target + src/modules/auth/ (password recovery feature) -**Analysis:** -- Targets inferred: 3 files extracted from task description -- Task type: Code transformation -- Complexity: Low -- Independence: Yes + ## Evaluation Specification + ```yaml + {EXACT spec YAML from password recovery meta-judge} + ``` -**Model Selection:** Haiku (simple, mechanical) + ## Implementation Output + {Summary from password recovery implementation agent} -**Dispatch:** 3 meta-judges (parallel) → 3 implementors (parallel) → 3 judges + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Evaluate ONLY + the password recovery feature. 
+ Follow your full judge process as defined in your agent instructions! -**Execution Summary:** + ## Output + CRITICAL: You must reply with this exact structured evaluation report + format in YAML at the START of your response! + - model: opus + - subagent_type: "sadd:judge" + +[Judge 3: GH Actions caching — independent spec] +Use Task tool: + - description: "Judge: GitHub Actions dependency caching" + - prompt: + You are evaluating an implementation artifact for target + .github/workflows/ci.yml against an evaluation specification produced + by the meta judge. + + CLAUDE_PLUGIN_ROOT={CLAUDE_PLUGIN_ROOT} + + ## User Prompt + [original user prompt] + + ## Target + .github/workflows/ci.yml (dependency caching) + + ## Evaluation Specification + ```yaml + {EXACT spec YAML from GH Actions caching meta-judge} + ``` + + ## Implementation Output + {Summary from GH Actions caching implementation agent} + + ## Instructions + User prompt is provided as context, you should use it only as reference + of changes that can occur in the project by other agents. Evaluate ONLY + the dependency caching in GitHub Actions. + Follow your full judge process as defined in your agent instructions! + + ## Output + CRITICAL: You must reply with this exact structured evaluation report + format in YAML at the START of your response! 
+ - model: opus + - subagent_type: "sadd:judge" + +[All 3 judges launched simultaneously] +``` + +**Result:** -| Target | Model | Judge Score | Retries | Status | -|--------|-------|-------------|---------|--------| -| src/handlers/user.ts | haiku | 4.2/5.0 | 0 | SUCCESS | -| src/handlers/order.ts | haiku | 4.0/5.0 | 0 | SUCCESS | -| src/handlers/product.ts | haiku | 4.1/5.0 | 0 | SUCCESS | +| Target | Grouping | Model | Judge Score | Retries | Status | +|--------|----------|-------|-------------|---------|--------| +| src/services/loan.service.ts | Independent | sonnet | 4.1/5.0 | 0 | SUCCESS | +| src/modules/auth/ | Independent | opus | 4.3/5.0 | 0 | SUCCESS | +| .github/workflows/ci.yml | Independent | sonnet | 4.0/5.0 | 0 | SUCCESS | -Total Agents: 9 (3 meta-judges + 3 implementations + 3 judges) +**Overall:** 3/3 completed. Total Agents: 9 (3 meta-judges + 3 implementations + 3 judges). No grouping reduction possible for fully independent tasks. ## Best Practices @@ -904,12 +2152,14 @@ Total Agents: 9 (3 meta-judges + 3 implementations + 3 judges) ### Meta-Judge + Judge Verification -- **One meta-judge per target** - Each target gets its own tailored evaluation specification, producing more relevant and precise judgments than a shared generic one -- **Batch meta-judges first** - Launch all meta-judges in parallel, then launch implementors -- **Reuse target-specific spec on retries** - Each target's evaluation specification stays constant across retries; only the implementation changes +- **Requirement grouping first** - Before dispatching any meta-judges, analyze tasks for repeatable, shared, or independent grouping to minimize total agents +- **One meta-judge per group or independent task** - Repeatable groups share one reusable spec, shared groups share one combined spec, independent tasks get their own spec +- **Batch meta-judges first** - Launch all meta-judges in parallel (regardless of grouping type), then launch implementors +- **Reuse spec on retries** 
- Each group/target's evaluation specification stays constant across retries; only the implementation changes - **Parse only headers from judge** - Don't read full reports to avoid context pollution - **Include CLAUDE_PLUGIN_ROOT** - Both meta-judge and judge need the resolved plugin root path -- **Target-specific YAML** - Pass only the target's own meta-judge YAML to its judge, do not add any additional text or comments to it! +- **Target-specific YAML** - Pass only the relevant meta-judge YAML to its judge, do not add any additional text or comments to it! +- **Shared group retries** - Only re-launch the specific failing implementation agent(s), not the entire group ### Judge Selection @@ -927,6 +2177,7 @@ Total Agents: 9 (3 meta-judges + 3 implementations + 3 judges) - **No cross-references:** Don't tell Agent A about Agent B's target - **Let them discover:** Sub-agents read files to understand patterns - **File system as truth:** Changes are coordinated through the filesystem +- **Track pre-existing changes** - Pass context about prior modifications to each agent's judge to prevent attribution confusion between pre-existing and current changes ### Quality Assurance diff --git a/plugins/sadd/skills/do-in-steps/SKILL.md b/plugins/sadd/skills/do-in-steps/SKILL.md index a20ce9c..1ea213d 100644 --- a/plugins/sadd/skills/do-in-steps/SKILL.md +++ b/plugins/sadd/skills/do-in-steps/SKILL.md @@ -507,6 +507,32 @@ After BOTH meta-judge and implementation agent complete, dispatch an **independe CRITICAL: Provide to the judge EXACT meta-judge's evaluation specification YAML, do not skip or add anything, do not modify it in any way, do not shorten or summarize any text in it! +##### 3.4.1 Analyze the Pre-existing Changes Section + +Before dispatching the judge for each step, assess whether there are pre-existing changes in the codebase that the judge needs to be aware of. 
The "Pre-existing Changes" section prevents the judge from confusing prior modifications with the current step's implementation agent's work.
+
+**When to include:**
+
+- Previous steps' changes from the SAME do-in-steps run (steps 1..N-1 when judging step N) — this is the most common case in sequential execution. Each completed step's output (files created/modified, key changes) becomes pre-existing context for every subsequent step's judge, so the judge for step N MUST be told about the changes from steps 1..N-1.
+- Previous do-in-steps or do-and-judge task runs completed earlier in the same session
+- User's manual modifications made before invoking the skill (visible from conversation context or in git)
+- Changes from other tools or agents that ran before this task
+
+**When to omit:**
+
+- This is step 1 with no known prior changes (no earlier session tasks, no user modifications) — omit the section entirely
+- On retries within the SAME step, do NOT include the implementation agent's own previous attempt as "pre-existing changes" — those are part of the current step's iteration cycle
+
+**Content guidelines:**
+
+- Use a high-level summary: task description, list of affected files/modules, general nature of changes (created, modified, deleted)
+- Do NOT include code blocks, diffs, or line-level details — keep it concise
+- Label each source clearly: "Step 1: {description}", "Step 2: {description}", "User modifications (before current task)", etc.
+- If multiple sources of pre-existing changes exist, use separate subsections for each (one per completed step, plus any external sources)
+- Leverage the Context Passing Protocol output (section 3.1) — the "Completed Steps Summary" already tracks what each step produced
+
+CRITICAL: avoid reading the full codebase or full git history; use a high-level `git diff`/`git status` to determine which files were changed, or use conversation context and completed step summaries to determine pre-existing changes.
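The high-level `git status` pass described above can be reduced to a small helper. This is a minimal sketch, assuming porcelain-format output; the function name, status-to-label mapping, and sample paths are illustrative, not part of the skill:

```python
# Hypothetical helper: turn `git status --porcelain` output into the kind of
# high-level "Pre-existing Changes" subsection described above — file paths
# and change kinds only, no diffs or line-level detail.

def summarize_porcelain(porcelain: str, source_label: str) -> str:
    # Porcelain v1 entries look like "XY path"; we only need a coarse label.
    labels = {"A": "created", "?": "created", "M": "modified",
              "D": "deleted", "R": "renamed"}
    bullet_lines = [f"### {source_label}"]
    for entry in porcelain.splitlines():
        if len(entry) < 4:  # skip blank/malformed entries
            continue
        xy, path = entry[:2], entry[3:]
        kind = labels.get(xy.strip()[:1], "modified")
        bullet_lines.append(f"- {path} ({kind})")
    return "\n".join(bullet_lines)


summary = summarize_porcelain(
    "A  src/models/User.ts\n"
    "?? migrations/001_create_users.ts\n"
    "M  src/app.ts",
    'Step 1: "Create User model and database schema"',
)
```

The resulting bullet list drops straight into the judge prompt's "Pre-existing Changes (Context Only)" section without pulling full diffs into context.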
+ **Prompt template for step judge:** ```markdown @@ -525,6 +551,20 @@ CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}` ## Previous Steps Context {Summary of what previous steps accomplished} +{IF pre-existing changes are known (previous steps, prior tasks, or user modifications), include the following section — otherwise omit entirely} + +## Pre-existing Changes (Context Only) + +The following changes were made BEFORE the current step's implementation agent started working. They are NOT part of the current step's output. Focus your evaluation on the current step's changes. Only verify pre-existing changed files/logic if they directly relate to the current step's requirements. + +### {Source of changes: e.g., "Step 1: {step description}" or "Previous Task: {task description}" or "User modifications (before current task)"} +{High-level summary: what was done, which files/modules were created or modified} + +### {Additional source if applicable} +{High-level summary} + +{END conditional section} + ## Evaluation Specification ```yaml @@ -551,7 +591,7 @@ CRITICAL: NEVER provide score threshold, in any format, including `threshold_pas ``` Use Task tool: - description: "Judge Step {N}/{total}: {subtask_name}" - - prompt: {judge verification prompt with exact meta-judge specification YAML} + - prompt: {judge verification prompt with exact meta-judge specification YAML, and Pre-existing Changes section if applicable} - model: opus - subagent_type: "sadd:judge" ``` @@ -801,130 +841,378 @@ Awaiting your decision... 
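The conditional assembly described in the step-judge prompt template above can be sketched as code. This is a minimal sketch, assuming the orchestrator tracks completed-step summaries as (label, summary) pairs; all function and variable names here are hypothetical:

```python
# Illustrative sketch only: how an orchestrator might assemble the step-judge
# prompt, including the "Pre-existing Changes (Context Only)" section ONLY
# when prior sources exist, with one "###" subsection per source.

FENCE = "`" * 3  # avoids a literal triple backtick closing this example early

def build_judge_prompt(step_n, total, requirements, spec_yaml,
                       impl_summary, preexisting):
    """preexisting: list of (source_label, high_level_summary) tuples."""
    parts = [
        f"You are evaluating Step {step_n}/{total} against an evaluation "
        "specification produced by the meta judge.",
        f"## Step Requirements\n{requirements}",
    ]
    if preexisting:  # omit the section entirely when no prior changes are known
        section = [
            "## Pre-existing Changes (Context Only)",
            "The following changes were made BEFORE the current step's "
            "implementation agent started working. They are NOT part of "
            "the current step's output.",
        ]
        for label, summary in preexisting:
            section.append(f"### {label}\n{summary}")  # one subsection per source
        parts.append("\n\n".join(section))
    # Pass the meta-judge YAML verbatim, with no additional commentary.
    parts.append(f"## Evaluation Specification\n{FENCE}yaml\n{spec_yaml}\n{FENCE}")
    parts.append(f"## Implementation Output\n{impl_summary}")
    return "\n\n".join(parts)
```

For step 1 the `preexisting` list is empty and the section never appears; for step N it carries one entry per completed step, matching the accumulation shown in Example 1.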
## Examples -### Example 1: Interface Change with Consumer Updates +### Example 1: Sequential Steps Building on Each Other (Pre-existing Changes from Previous Steps) **Input:** ``` -/do-in-steps Change the return type of UserService.getUser() from User to UserDTO and update all consumers +/do-in-steps implement user management feature ``` **Phase 1 - Decomposition:** | Step | Subtask | Depends On | Complexity | Type | Output | |------|---------|------------|------------|------|--------| -| 1 | Create UserDTO class with proper structure | - | Medium | Implementation | New UserDTO.ts file | -| 2 | Update UserService.getUser() to return UserDTO | Step 1 | High | Implementation | Modified UserService | -| 3 | Update UserController to handle UserDTO | Step 2 | Medium | Refactoring | Modified UserController | -| 4 | Update tests for UserService and UserController | Steps 2,3 | Medium | Testing | Updated test files | - -**Phase 2 - Model Selection:** - -| Step | Subtask | Model | Agent | Rationale | -|------|---------|-------|-------|-----------| -| 1 | Create DTO | sonnet | sdd:developer | Medium complexity, standard pattern | -| 2 | Update Service | opus | sdd:developer | High risk, core service change | -| 3 | Update Controller | sonnet | sdd:developer | Medium complexity, follows patterns | -| 4 | Update Tests | sonnet | sdd:tdd-developer | Test expertise | +| 1 | Create User model and database schema | - | Medium | Implementation | User model, migration files | +| 2 | Add CRUD endpoints for users | Step 1 | Medium | Implementation | REST API routes, controller | +| 3 | Add authentication integration | Steps 1,2 | High | Implementation | Auth middleware, JWT handling | -**Phase 3 - Execution with Parallel Meta-Judge and Judge Verification:** +**Phase 3 - Execution with Pre-existing Changes Accumulation:** ``` -Step 1: Create UserDTO - Parallel dispatch (single message, 2 tool calls): - Tool call 1 — Meta-judge (Opus, sadd:meta-judge)... 
- → Generated step-specific evaluation specification YAML - Tool call 2 — Implementation (Sonnet, sdd:developer)... - → Created UserDTO.ts with id, name, email, createdAt fields - Judge Verification (Opus, sadd:judge, with step 1 meta-judge spec)... - → VERDICT: PASS, SCORE: 4.2/5.0 - → IMPROVEMENTS: Consider adding validation methods - → Context passed: UserDTO interface, file path - -Step 2: Update UserService (First Attempt Failed) - Parallel dispatch (single message, 2 tool calls): - Tool call 1 — Meta-judge (Opus, sadd:meta-judge)... - → Generated step-specific evaluation specification YAML - Tool call 2 — Implementation (Opus, sdd:developer)... - → Updated return type but missed mapping logic - Judge Verification (Opus, sadd:judge, with step 2 meta-judge spec)... - → VERDICT: FAIL, SCORE: 2.8/5.0 - → ISSUES: Missing User->UserDTO mapping, return type changed but still returns User - Retry Implementation (Opus) with judge feedback... - → Added static fromUser() factory method - → Updated getUser() to use mapping - Judge Verification (Opus, sadd:judge, same step 2 meta-judge spec)... - → VERDICT: PASS, SCORE: 4.5/5.0 - → Context passed: Method signature changed, mapping pattern used - -Step 3: Update UserController - Parallel dispatch (single message, 2 tool calls): - Tool call 1 — Meta-judge (Opus, sadd:meta-judge)... - → Generated step-specific evaluation specification YAML - Tool call 2 — Implementation (Sonnet, sdd:developer)... - → Updated controller to expect UserDTO - Judge Verification (Opus, sadd:judge, with step 3 meta-judge spec)... - → VERDICT: PASS, SCORE: 4.0/5.0 - → Context passed: Endpoint contracts updated - -Step 4: Update Tests - Parallel dispatch (single message, 2 tool calls): - Tool call 1 — Meta-judge (Opus, sadd:meta-judge)... - → Generated step-specific evaluation specification YAML - Tool call 2 — Implementation (Sonnet, sdd:tdd-developer)... 
- → Updated service and controller tests - Judge Verification (Opus, sadd:judge, with step 4 meta-judge spec)... - → VERDICT: PASS, SCORE: 4.3/5.0 - → All steps complete +Step 1: Create User model and database schema + Parallel dispatch: Meta-judge + Implementation + Judge Verification (with step 1 meta-judge spec): + NOTE: No pre-existing changes — this is step 1 with no prior session tasks. + The "Pre-existing Changes" section is OMITTED from the judge prompt. + + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating Step 1/3: Create User model and + │ database schema against an evaluation specification + │ produced by the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## Original Task + │ Implement user management feature + │ + │ ## Step Requirements + │ Create User model and database schema with proper + │ fields and relationships. + │ + │ ## Previous Steps Context + │ None (first step) + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: src/models/User.ts (new), migrations/001_create_users.ts (new) + │ Key changes: Created User model with id, email, name, passwordHash... + │ + │ ## Instructions + │ Follow your full judge process... + └───────────────────────────────────────────────────────── + + → VERDICT: PASS, SCORE: 4.2/5.0 + → Context passed forward: User model fields, migration file paths + +Step 2: Add CRUD endpoints for users + Parallel dispatch: Meta-judge + Implementation + Judge Verification (with step 2 meta-judge spec): + NOTE: Pre-existing changes detected — Step 1 created the User model. + Include "Pre-existing Changes" section so the judge does not confuse + Step 1's files with Step 2's implementation work. 
+ + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating Step 2/3: Add CRUD endpoints for + │ users against an evaluation specification produced by + │ the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## Original Task + │ Implement user management feature + │ + │ ## Step Requirements + │ Add CRUD endpoints (create, read, update, delete) for + │ user management with proper validation and error handling. + │ + │ ## Previous Steps Context + │ Step 1 created User model with fields: id, email, name, + │ passwordHash, createdAt, updatedAt. + │ + │ ## Pre-existing Changes (Context Only) + │ + │ The following changes were made BEFORE the current + │ step's implementation agent started working. They are + │ NOT part of the current step's output. Focus your + │ evaluation on the current step's changes. Only verify + │ pre-existing changed files/logic if they directly + │ relate to the current step's requirements. + │ + │ ### Step 1: "Create User model and database schema" + │ The following files were created as part of Step 1: + │ - src/models/User.ts (new) - User model with fields: + │ id, email, name, passwordHash, createdAt, updatedAt + │ - migrations/001_create_users.ts (new) - Database + │ migration for users table + │ + │ These files exist in the codebase and may be referenced + │ by the current step, but evaluate only the changes made + │ by Step 2's implementation agent. + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: src/controllers/UserController.ts (new), + │ src/routes/users.ts (new), src/app.ts (modified) + │ Key changes: Added REST endpoints for user CRUD... + │ + │ ## Instructions + │ Follow your full judge process... 
+ └───────────────────────────────────────────────────────── + + → VERDICT: PASS, SCORE: 4.4/5.0 + → Context passed forward: API routes, controller patterns + +Step 3: Add authentication integration + Parallel dispatch: Meta-judge + Implementation + Judge Verification (with step 3 meta-judge spec): + NOTE: Pre-existing changes include BOTH Step 1 AND Step 2. + The judge needs to know about all prior steps' output. + + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating Step 3/3: Add authentication + │ integration against an evaluation specification + │ produced by the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## Original Task + │ Implement user management feature + │ + │ ## Step Requirements + │ Add JWT-based authentication with login/register + │ endpoints and middleware for protecting user routes. + │ + │ ## Previous Steps Context + │ Step 1 created User model. Step 2 added CRUD endpoints + │ at /api/users with UserController. + │ + │ ## Pre-existing Changes (Context Only) + │ + │ The following changes were made BEFORE the current + │ step's implementation agent started working. They are + │ NOT part of the current step's output. Focus your + │ evaluation on the current step's changes. Only verify + │ pre-existing changed files/logic if they directly + │ relate to the current step's requirements. 
+ │
+ │ ### Step 1: "Create User model and database schema"
+ │ - src/models/User.ts (new) - User model with fields:
+ │   id, email, name, passwordHash, createdAt, updatedAt
+ │ - migrations/001_create_users.ts (new) - Database
+ │   migration for users table
+ │
+ │ ### Step 2: "Add CRUD endpoints for users"
+ │ - src/controllers/UserController.ts (new) - REST
+ │   controller with create, read, update, delete handlers
+ │ - src/routes/users.ts (new) - Express router for
+ │   /api/users endpoints
+ │ - src/app.ts (modified) - Registered user routes
+ │
+ │ These files exist in the codebase and may be modified
+ │ by the current step, but evaluate only the changes made
+ │ by Step 3's implementation agent.
+ │
+ │ ## Evaluation Specification
+ │ ```yaml
+ │ {meta-judge's evaluation specification YAML}
+ │ ```
+ │
+ │ ## Implementation Output
+ │ Files: src/auth/AuthMiddleware.ts (new),
+ │   src/routes/auth.ts (new), src/app.ts (modified),
+ │   src/routes/users.ts (modified)
+ │ Key changes: Added JWT auth with login/register...
+ │
+ │ ## Instructions
+ │ Follow your full judge process...
+ └─────────────────────────────────────────────────────────
+
+ → VERDICT: PASS, SCORE: 4.1/5.0
 ```
 
 **Final Summary:**
 
-- Total Agents: 13 (4 meta-judges + 4 implementations + 1 retry + 4 judges)
-- Steps with Retries: Step 2 (1 retry, reused step 2 meta-judge spec)
-- All Judge Scores: 4.2, 4.5, 4.0, 4.3
+- Total Agents: 9 (3 meta-judges + 3 implementations + 0 retries + 3 judges)
+- Pre-existing Changes Progression:
+  - Step 1 judge: None
+  - Step 2 judge: Step 1 output (2 files)
+  - Step 3 judge: Steps 1+2 output (5 files)
+- All Judge Scores: 4.2, 4.4, 4.1
 
 ---
 
-### Example 2: Feature Addition Across Layers
+### Example 2: User-Modified Codebase + Sequential Steps (Mixed Sources of Pre-existing Changes)
+
+**Scenario:**
+
+The user has been working on a payment processing module during the conversation. 
They modified several files (added a new PaymentGateway interface, updated configuration) before invoking do-in-steps. **Input:** ``` -/do-in-steps Add email notification capability to the order processing system +/do-in-steps fix and improve payment processing ``` **Phase 1 - Decomposition:** | Step | Subtask | Depends On | Complexity | Type | Output | |------|---------|------------|------------|------|--------| -| 1 | Create EmailService with send capability | - | Medium | Implementation | New EmailService class | -| 2 | Add notification triggers to OrderService | Step 1 | Medium | Implementation | Modified OrderService | -| 3 | Create email templates for order events | Step 2 | Low | Documentation | Template files | -| 4 | Add configuration and environment variables | Step 1 | Low | Configuration | Updated config files | -| 5 | Add integration tests for email flow | Steps 1-4 | Medium | Testing | Test files | - -**Phase 2 - Model Selection:** +| 1 | Fix payment validation bugs | - | Medium | Bug fix | Corrected validation logic | +| 2 | Add retry logic for failed payments | Step 1 | High | Implementation | Retry mechanism with backoff | -| Step | Subtask | Model | Rationale | -|------|---------|-------|-----------| -| 1 | EmailService | sonnet | Standard implementation | -| 2 | Notification triggers | sonnet | Business logic | -| 3 | Email templates | haiku | Simple content | -| 4 | Configuration | haiku | Mechanical updates | -| 5 | Integration tests | sonnet | Test expertise | +**Phase 3 - Execution with Mixed Pre-existing Changes:** -**Phase 3 - Execution Summary (each step has parallel meta-judge + implementation):** +``` +Step 1: Fix payment validation bugs + Parallel dispatch: Meta-judge + Implementation + Judge Verification (with step 1 meta-judge spec): + NOTE: Pre-existing changes detected from USER modifications. + The user modified payment files before this task — include those + so the judge focuses only on the bug fix, not the user's prior work. 
+ + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating Step 1/2: Fix payment validation + │ bugs against an evaluation specification produced by + │ the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## Original Task + │ Fix and improve payment processing + │ + │ ## Step Requirements + │ Fix validation bugs in payment amount and currency + │ checks that allow invalid transactions to proceed. + │ + │ ## Previous Steps Context + │ None (first step) + │ + │ ## Pre-existing Changes (Context Only) + │ + │ The following changes were made BEFORE the current + │ step's implementation agent started working. They are + │ NOT part of the current step's output. Focus your + │ evaluation on the current step's changes. Only verify + │ pre-existing changed files/logic if they directly + │ relate to the current step's requirements. + │ + │ ### User modifications (before current task) + │ The user made changes to the following files/modules + │ before this task was started: + │ - src/payments/PaymentGateway.ts (new) - Payment + │ gateway interface definition + │ - src/payments/StripeAdapter.ts (modified) - Updated + │ to implement new PaymentGateway interface + │ - src/config/payment.config.ts (modified) - Added + │ gateway configuration settings + │ + │ The current task focuses on fixing validation bugs. + │ Pre-existing changes to payment files may overlap with + │ the current step's scope — evaluate whether the + │ implementation agent's changes correctly fix the bugs + │ without breaking the pre-existing modifications. + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: src/payments/PaymentValidator.ts (modified), + │ tests/payments/PaymentValidator.test.ts (modified) + │ Key changes: Fixed amount validation to reject negative + │ values, added currency code format check... 
+ │ + │ ## Instructions + │ Follow your full judge process... + └───────────────────────────────────────────────────────── + + → VERDICT: PASS, SCORE: 4.3/5.0 + → Context passed forward: Validation fixes, affected files + +Step 2: Add retry logic for failed payments + Parallel dispatch: Meta-judge + Implementation + Judge Verification (with step 2 meta-judge spec): + NOTE: Pre-existing changes now include BOTH the user's modifications + AND Step 1's output. The judge needs both sources to correctly + attribute changes. + + Judge prompt sent: + ┌───────────────────────────────────────────────────────── + │ You are evaluating Step 2/2: Add retry logic for failed + │ payments against an evaluation specification produced by + │ the meta judge. + │ + │ CLAUDE_PLUGIN_ROOT=... + │ + │ ## Original Task + │ Fix and improve payment processing + │ + │ ## Step Requirements + │ Add retry mechanism with exponential backoff for failed + │ payment transactions, with configurable max retries. + │ + │ ## Previous Steps Context + │ Step 1 fixed payment validation bugs in + │ PaymentValidator.ts (amount and currency checks). + │ + │ ## Pre-existing Changes (Context Only) + │ + │ The following changes were made BEFORE the current + │ step's implementation agent started working. They are + │ NOT part of the current step's output. Focus your + │ evaluation on the current step's changes. Only verify + │ pre-existing changed files/logic if they directly + │ relate to the current step's requirements. 
+ │ + │ ### User modifications (before current task) + │ - src/payments/PaymentGateway.ts (new) - Payment + │ gateway interface definition + │ - src/payments/StripeAdapter.ts (modified) - Updated + │ to implement new PaymentGateway interface + │ - src/config/payment.config.ts (modified) - Added + │ gateway configuration settings + │ + │ ### Step 1: "Fix payment validation bugs" + │ - src/payments/PaymentValidator.ts (modified) - Fixed + │ amount validation and currency code format checks + │ - tests/payments/PaymentValidator.test.ts (modified) - + │ Added regression tests for validation fixes + │ + │ These files exist in the codebase and may be modified + │ by the current step, but evaluate only the changes made + │ by Step 2's implementation agent. + │ + │ ## Evaluation Specification + │ ```yaml + │ {meta-judge's evaluation specification YAML} + │ ``` + │ + │ ## Implementation Output + │ Files: src/payments/PaymentRetryService.ts (new), + │ src/payments/StripeAdapter.ts (modified), + │ src/config/payment.config.ts (modified), + │ tests/payments/PaymentRetryService.test.ts (new) + │ Key changes: Added PaymentRetryService with exponential + │ backoff, integrated into StripeAdapter... + │ + │ ## Instructions + │ Follow your full judge process... 
+ └─────────────────────────────────────────────────────────
+
+ → VERDICT: PASS, SCORE: 4.5/5.0
+```
 
-| Step | Subtask | Meta-Judge | Judge Score | Retries | Status |
-|------|---------|------------|-------------|---------|--------|
-| 1 | EmailService | Step-specific spec | 4.1/5.0 | 0 | PASS |
-| 2 | Notification triggers | Step-specific spec | 4.2/5.0 | 1 | PASS |
-| 3 | Email templates | Step-specific spec | 4.5/5.0 | 0 | PASS |
-| 4 | Configuration | Step-specific spec | 4.2/5.0 | 0 | PASS |
-| 5 | Integration tests | Step-specific spec | 4.0/5.0 | 0 | PASS |
+**Final Summary:**
 
-Total Agents: 16 (5 meta-judges + 5 implementations + 1 retry + 5 judges)
+- Total Agents: 6 (2 meta-judges + 2 implementations + 0 retries + 2 judges)
+- Pre-existing Changes Progression:
+  - Step 1 judge: User modifications (3 files)
+  - Step 2 judge: User modifications (3 files) + Step 1 output (2 files)
+- All Judge Scores: 4.3, 4.5
 
 ---
 
@@ -992,6 +1280,8 @@ Step 4-5: Each with parallel meta-judge + implementation, complete without issue
 
 Total Agents: 20 (5 meta-judges + 5 implementations + 5 retries + 5 judges)
 
+---
+
 ## Best Practices
 
 ### Task Decomposition
@@ -1031,6 +1321,7 @@ Total Agents: 20 (5 meta-judges + 5 implementations + 5 retries + 5 judges)
 
 - Omit internal details that don't affect subsequent steps
 - Highlight patterns/conventions to maintain consistency
 - Include judge IMPROVEMENTS as optional enhancements
+- **Track pre-existing changes** - Pass context about prior modifications (including previous steps) to the judge to prevent attribution confusion
 
 ### Meta-Judge + Judge Verification
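The attribution rule the two examples demonstrate can be sketched in code: omit the "Pre-existing Changes (Context Only)" section entirely when nothing precedes the first step; otherwise list user modifications first, then each completed step's output, in order. The sketch below is a minimal illustration only; `StepOutput`, `build_preexisting_section`, and the sample file descriptions are hypothetical names, not the plugin's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class StepOutput:
    """One completed step's contribution (hypothetical structure)."""
    title: str        # e.g. 'Step 1: "Create User model and database schema"'
    files: list[str]  # e.g. 'src/models/User.ts (new) - User model'


def build_preexisting_section(user_changes: list[str],
                              prior_steps: list[StepOutput]) -> str:
    """Render the 'Pre-existing Changes (Context Only)' block for the judge
    prompt, or return '' so the section is omitted (first step, untouched repo)."""
    if not user_changes and not prior_steps:
        return ""  # section omitted entirely, as in Example 1, Step 1
    lines = [
        "## Pre-existing Changes (Context Only)",
        "",
        "The following changes were made BEFORE the current step's",
        "implementation agent started working. They are NOT part of",
        "the current step's output.",
        "",
    ]
    if user_changes:  # user edits always precede any step output
        lines.append("### User modifications (before current task)")
        lines.extend(f"- {change}" for change in user_changes)
        lines.append("")
    for step in prior_steps:  # accumulate every prior step, in order
        lines.append(f"### {step.title}")
        lines.extend(f"- {f}" for f in step.files)
        lines.append("")
    return "\n".join(lines).rstrip()
```

Called with empty inputs it returns an empty string, so the section drops out of the judge prompt; by step 3 of Example 1 it would receive both prior steps' records, reproducing the accumulation shown in the trace.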