Add built-in "Skill Validator" multi-agent preset#302
Conversation
Adds a new built-in GroupPreset called "Skill Validator" (⚖️) that pits two evaluators against each other to assess skills: - Worker 1 (Dotnet Skill Validator): empirical, outcome-focused evaluation using methodology inspired by dotnet/skills skill-validator — measures baseline vs. skill-augmented performance, pairwise comparative judging, statistical confidence assessment. - Worker 2 (Anthropic Skill Evaluator): prompt-design-focused evaluation assessing description quality/trigger accuracy, instruction clarity, scope appropriateness, and test coverage. - Orchestrator: routes work to both evaluators, highlights agreements and disagreements, explains which suggestions are adopted and why, produces a consensus KEEP/IMPROVE/REMOVE verdict. Configuration: - Mode: OrchestratorReflect - OrchestratorModel: claude-opus-4.6 - WorkerModels: 2x claude-sonnet-4.6 - MaxReflectIterations: 6 Also updates Scenario_CreateGroupFromPreset test to expect 3 built-in presets and adds BuiltInPresets_IncludeSkillValidator test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-dispatch clearing Three fixes to dramatically reduce recovery time for stuck sessions: 1. Escalation timeout: After the first 600s timeout with no events (but server alive), switch to 60s timeout for subsequent checks. Reduces max stuck time from ~40 minutes to ~12 minutes. 2. Tool health check: Start a 30s timer when a tool begins executing. If no events arrive within 30s, check if the connection is still alive. After 3 consecutive stale checks (90s total), trigger recovery. This detects dead connections much faster than waiting for the 10-minute watchdog. 3. Re-dispatch clearing: Before re-dispatching workers after app restart, clear any stuck IsProcessing/ActiveToolCallCount/SendingFlag state from the previous incomplete turn. This allows SendPromptAsync to accept the new prompt instead of rejecting with 'already processing'. Also resets HasUsedToolsThisTurn on reconnect so the watchdog uses the appropriate timeout (120s instead of 600s) for the fresh connection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
eeea8ee to
0359d3b
Compare
1. Stagger worker dispatch: Add 1s delay between worker launches to prevent burst connection saturation (previously 3/5 workers crashed with IOException) 2. IOException retry: Wrap SendPromptAndWaitAsync in retry loop (max 2 attempts, 2s delay) using existing IsConnectionError() helper for resilience 3. Smart Case A watchdog: Replace fixed timeout and reset cap with events.jsonl mtime freshness check. Fresh file (<60s) = wait indefinitely (tool actively running), stale = 1 confirmation cycle then terminate. Protects active tools. 4. TurnEnd fallback re-arm: Add FallbackCanceledByTurnStart diagnostic flag so TurnEnd can log when re-arming the fallback timer after TurnStart canceled it Live test results: 5/5 workers connected (was 2/5), 4/5 returned real responses, full orchestration synthesis completed. Remaining empty response from worker-2 is caused by upstream SDK bug 299 (missing SessionIdleEvent). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The smart Case A watchdog (events.jsonl freshness) now handles dead session detection in ~90 seconds. The WorkerExecutionTimeout only needs to be an absolute backstop. At 10 minutes, it was prematurely terminating workers with 200+ tool calls that were actively processing (e.g. long PR reviews). This caused cascading failures: the session stayed in IsProcessing=true from the SDK side, so re-dispatch attempts got 'Session is already processing a request' errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #302 Review — Multi-Model Consensus (5 models, 2+ agreement filter)PR: Add built-in "Skill Validator" multi-agent preset (DRAFT) 🔴 CRITICAL — MonitorAndSynthesizeAsync hangs orchestrator + off-UI-thread mutation (5/5 models)
The new block sets 🔴 CRITICAL — TriggerToolHealthRecovery INV-1 violation (5/5 models)
This 9th path that clears Also: Missing: 🟡 MODERATE — WatchdogCaseAResets double-increment race (4/5 models)
Both the 30s tool health timer AND the main watchdog Case A path call 🟡 MODERATE — CancelToolHealthCheck missing from CompleteResponse / AbortSessionAsync (4/5 models)
Test Coverage
|
|
Round 1 review complete — see findings below |
PR #302 — 5-Model Consensus ReviewTitle: Add built-in "Skill Validator" multi-agent preset + watchdog/tool-health improvements 🔴 CRITICAL —
|
| Issue | Severity | Ship-blocker? |
|---|---|---|
TriggerToolHealthRecovery incomplete cleanup (missing 7+ fields, watchdog, OnSessionComplete) |
🔴 CRITICAL | Yes — stale state corrupts next turn; orchestrator loops may hang 60 min |
MonitorAndSynthesizeAsync IsProcessing mutation off UI thread + missing TrySetResult |
🔴 CRITICAL | Yes — race condition + synthesis deadlock |
ToolHealthCheckTimer not cancelled in abort/dispose/reconnect |
🔴 CRITICAL | Yes — timer fires on orphaned state after reconnect |
WatchdogCaseAResets shared between health check and watchdog |
🟡 MODERATE | Yes — can prematurely kill legitimately running tools |
WorkerExecutionTimeout 60 min for remote/demo |
🟡 MODERATE | Yes — hour-long hang regression |
Timer/cancel race in ScheduleToolHealthCheck |
🟡 MODERATE | No — mitigated by activeTools guard |
| Dead code after retry loop | 🟢 MINOR | No |
TriggerToolHealthRecovery cleanup gap is particularly dangerous as it violates the well-documented IsProcessing invariant. The Skill Validator preset itself is clean and uncontroversial.
5-model review: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex. Consensus filter: issues flagged by 2+ models only.
Round 1 Review |
CRITICAL fixes: - TriggerToolHealthRecovery: complete INV-1 cleanup (was missing 8+ fields, OnSessionComplete, CancelProcessingWatchdog/TurnEndFallback, proper FlushedResponse+CurrentResponse for TCS result) - MonitorAndSynthesizeAsync: wrap IsProcessing mutation in InvokeOnUI with TCS synchronization, add ResponseCompletion.TrySetResult, complete all companion field cleanup - CancelToolHealthCheck added to all 14 cleanup paths (AbortSessionAsync, ReconnectAsync, DisposeAsync, CloseSessionCoreAsync, CompleteResponse, watchdog timeouts) MODERATE fixes: - Separate ToolHealthStaleChecks counter from WatchdogCaseAResets to prevent cross-system interference - WorkerExecutionTimeoutRemote (10 min) for remote/demo mode where smart watchdog is unavailable; keep 60 min for normal mode - Fix ScheduleToolHealthCheck timer/cancel race: create dormant, store via Interlocked.Exchange, then start with timer.Change() MINOR: Replace unreachable code after retry loop with UnreachableException Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #302 — Round 2 Re-Review (post fix commit
|
| # | Round 1 Finding | Status | Evidence |
|---|---|---|---|
| 🔴 C1 | TriggerToolHealthRecovery INV-1 violation (4 of 12+ fields) |
✅ FIXED | Full companion field cleanup added: all 9 fields + CancelProcessingWatchdog + CancelTurnEndFallback + OnSessionComplete + diagnostic log tag |
| 🔴 C2 | MonitorAndSynthesizeAsync background-thread IsProcessing=false |
✅ FIXED | Wrapped in InvokeOnUI() with companion field cleanup |
| 🟡 M1 | WatchdogCaseAResets shared counter double-increment |
✅ FIXED | New separate counter ToolHealthStaleChecks — no longer shared with WatchdogCaseAResets |
| 🟡 M2 | ToolHealthCheckTimer not cancelled on teardown |
✅ FIXED | CancelToolHealthCheck added to all 14 cleanup paths (CompleteResponse, AbortSessionAsync, CloseSessionCoreAsync, DisposeAsync, ReconnectAsync, watchdog timeouts) |
| 🟡 M3 | TCS result drops FlushedResponse content |
✅ FIXED | Now uses FlushedResponse + CurrentResponse like CompleteResponse |
New Issues (Round 2)
None found. All 5 Round 1 findings are resolved.
Test Coverage Gap (non-blocking)
Zero unit tests for the new ToolHealthCheckTimer code paths. Suggested additions:
ToolHealthCheck_CompletedTool_DoesNotTriggerRecoveryToolHealthCheck_StaleEvents_TriggersRecoveryWatchdogMaxToolAliveResets_BoundsMaxStuckTime_WithEscalationTriggerToolHealthRecovery_NotifiesOnSessionComplete
Verdict: ✅ Approve
All critical and moderate findings from Round 1 have been addressed. Tests pass (2422/2422). The missing test coverage for ToolHealthCheckTimer paths is noted but non-blocking — the existing watchdog tests cover the broader processing-state invariants.
…352) ## Problem The **Evaluate-pr-tests-orchestrator** fails to complete with two symptoms: 1. Orchestrator loops indefinitely with "0 raw assignments" 2. Workers return empty responses despite having processed content ### Root Cause 1: Completion Override Infinite Loop When workers fail/return empty (SDK bug #299), `dispatchedWorkers` stays empty → `allWorkersDispatched = false` → `[[GROUP_REFLECT_COMPLETE]]` is overridden → orchestrator re-prompted but says "nothing to delegate" → repeat until MaxIterations. ### Root Cause 2: Empty Worker Responses Workers complete but SDK never sends SessionIdleEvent. Watchdog fires and completes the session, but `FlushedResponse`/`CurrentResponse` are empty because content was in delta events that ended up in chat history but not the response buffers. ## Fixes ### 1. Accept completion when all workers were attempted Changed both evaluator and self-eval paths: if all workers were **attempted** (even if some failed/returned empty), accept `[[GROUP_REFLECT_COMPLETE]]` instead of overriding it. Uses `allWorkersAttempted` (`attemptedWorkers` set) alongside `allWorkersDispatched` (`dispatchedWorkers` set). ### 2. Chat history fallback for empty responses When a worker returns empty after completion + revival, extract the last assistant message from chat history as a fallback. This recovers content that was streamed via delta events but lost when the watchdog completed the session with empty response buffers. ## Testing - All 2422 tests pass - Built and relaunched successfully ## Related - SDK bug #299 (missing SessionIdleEvent) — upstream issue causing workers to appear stuck - PR #302 — smart watchdog, stagger dispatch, IOException retry (merged) --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## What Adds a new built-in `"Skill Validator"` (`⚖️`) preset to `GroupPreset.BuiltIn` — a multi-agent group that evaluates skills from two complementary angles and builds a consensus verdict. ## How it works Uses `OrchestratorReflect` mode with two specialized workers and an orchestrator that synthesizes their findings: - **Worker 1 — Dotnet Skill Validator**: Empirical, outcome-focused evaluation using a 5-step methodology: 1. Inspect the skill definition (SKILL.md + eval.yaml) 2. Baseline comparison — describe agent behavior without vs. with the skill 3. Pairwise comparative judgment with position-swap bias mitigation 4. Statistical confidence assessment across the scenario set 5. Produce a scored verdict (KEEP / IMPROVE / REMOVE) - **Worker 2 — Anthropic Skill Evaluator**: Prompt-design-focused evaluation across 4 dimensions: 1. Description quality & trigger accuracy (precision/recall rating) 2. Instruction clarity — actionable, unambiguous, no over-constraint 3. Scope appropriateness — focused, no overlap with other skills 4. Test coverage — happy path, edge cases, negative cases - **Orchestrator**: Dispatches to both workers, collects independent verdicts, identifies agreements (high confidence) vs. disagreements (requires judgment), and produces a structured consensus report: ``` ## Skill Validator Consensus Report: [Skill Name] Dotnet Verdict / Anthropic Verdict / Consensus Points of Agreement · Points of Disagreement Adopted Suggestions · Declined Suggestions (with rationale) Final Recommendation ``` **Consensus thresholds** (defined in SharedContext): - KEEP = both say KEEP, or one KEEP + one IMPROVE - IMPROVE = evaluators disagree, or both say IMPROVE - REMOVE = either says REMOVE with strong evidence **Config**: `OrchestratorModel: claude-opus-4.6`, `WorkerModels: 2x claude-sonnet-4.6`, `MaxReflectIterations: 6` ## Current State - ✅ Built-in preset added to `PolyPilot/Models/ModelCapabilities.cs` - ✅ New test: `BuiltInPresets_IncludeSkillValidator` in `SessionOrganizationTests.cs` - ✅ Updated `Scenario_CreateGroupFromPreset` preset count expectation (2 → 3) - ✅ Build passes — 0 errors, 7 pre-existing warnings - ✅ 1661 tests pass (1 pre-existing failure: `PlatformHelperTests.AvailableModes_OnNonDesktop_IsRemoteOnly`, unrelated to this change) - ✅ Challenger review passed ## Files Changed | File | Change | |------|--------| | `PolyPilot/Models/ModelCapabilities.cs` | New `"Skill Validator"` entry in `BuiltIn` array (+191 lines) | | `PolyPilot.Tests/SessionOrganizationTests.cs` | New test + updated preset count expectation (+20, -2 lines) | ## Future Work / Potential Improvements - Add more granular evaluation rubrics based on real-world skill validation experience - Integrate with the `skill-creator` skill to provide an end-to-end skill development workflow (create → validate → iterate) - Consider a "Skill Validator Lite" broadcast-mode variant for quick single-pass evaluation without a consensus loop - Could expose consensus thresholds (KEEP/IMPROVE/REMOVE) as configurable preset parameters - Add eval.yaml scenario examples to the SharedContext to guide workers on what good test cases look like --- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ureWeen#352) ## Problem The **Evaluate-pr-tests-orchestrator** fails to complete with two symptoms: 1. Orchestrator loops indefinitely with "0 raw assignments" 2. Workers return empty responses despite having processed content ### Root Cause 1: Completion Override Infinite Loop When workers fail/return empty (SDK bug PureWeen#299), `dispatchedWorkers` stays empty → `allWorkersDispatched = false` → `[[GROUP_REFLECT_COMPLETE]]` is overridden → orchestrator re-prompted but says "nothing to delegate" → repeat until MaxIterations. ### Root Cause 2: Empty Worker Responses Workers complete but SDK never sends SessionIdleEvent. Watchdog fires and completes the session, but `FlushedResponse`/`CurrentResponse` are empty because content was in delta events that ended up in chat history but not the response buffers. ## Fixes ### 1. Accept completion when all workers were attempted Changed both evaluator and self-eval paths: if all workers were **attempted** (even if some failed/returned empty), accept `[[GROUP_REFLECT_COMPLETE]]` instead of overriding it. Uses `allWorkersAttempted` (`attemptedWorkers` set) alongside `allWorkersDispatched` (`dispatchedWorkers` set). ### 2. Chat history fallback for empty responses When a worker returns empty after completion + revival, extract the last assistant message from chat history as a fallback. This recovers content that was streamed via delta events but lost when the watchdog completed the session with empty response buffers. ## Testing - All 2422 tests pass - Built and relaunched successfully ## Related - SDK bug PureWeen#299 (missing SessionIdleEvent) — upstream issue causing workers to appear stuck - PR PureWeen#302 — smart watchdog, stagger dispatch, IOException retry (merged) --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What
Adds a new built-in
"Skill Validator"(⚖️) preset toGroupPreset.BuiltIn— a multi-agent group that evaluates skills from two complementary angles and builds a consensus verdict.How it works
Uses
OrchestratorReflectmode with two specialized workers and an orchestrator that synthesizes their findings:Worker 1 — Dotnet Skill Validator: Empirical, outcome-focused evaluation using a 5-step methodology:
Worker 2 — Anthropic Skill Evaluator: Prompt-design-focused evaluation across 4 dimensions:
Orchestrator: Dispatches to both workers, collects independent verdicts, identifies agreements (high confidence) vs. disagreements (requires judgment), and produces a structured consensus report:
Consensus thresholds (defined in SharedContext):
Config:
OrchestratorModel: claude-opus-4.6,WorkerModels: 2x claude-sonnet-4.6,MaxReflectIterations: 6Current State
PolyPilot/Models/ModelCapabilities.csBuiltInPresets_IncludeSkillValidatorinSessionOrganizationTests.csScenario_CreateGroupFromPresetpreset count expectation (2 → 3)PlatformHelperTests.AvailableModes_OnNonDesktop_IsRemoteOnly, unrelated to this change)Files Changed
PolyPilot/Models/ModelCapabilities.cs"Skill Validator"entry inBuiltInarray (+191 lines)PolyPilot.Tests/SessionOrganizationTests.csFuture Work / Potential Improvements
skill-creatorskill to provide an end-to-end skill development workflow (create → validate → iterate)Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com