
Add built-in "Skill Validator" multi-agent preset #302

Merged
PureWeen merged 5 commits into main from feat/skill-validator-preset
Mar 11, 2026

Conversation


@PureWeen PureWeen commented Mar 7, 2026

What

Adds a new built-in "Skill Validator" (⚖️) preset to GroupPreset.BuiltIn — a multi-agent group that evaluates skills from two complementary angles and builds a consensus verdict.

How it works

Uses OrchestratorReflect mode with two specialized workers and an orchestrator that synthesizes their findings:

  • Worker 1 — Dotnet Skill Validator: Empirical, outcome-focused evaluation using a 5-step methodology:

    1. Inspect the skill definition (SKILL.md + eval.yaml)
    2. Baseline comparison — describe agent behavior without vs. with the skill
    3. Pairwise comparative judgment with position-swap bias mitigation
    4. Statistical confidence assessment across the scenario set
    5. Produce a scored verdict (KEEP / IMPROVE / REMOVE)
  • Worker 2 — Anthropic Skill Evaluator: Prompt-design-focused evaluation across 4 dimensions:

    1. Description quality & trigger accuracy (precision/recall rating)
    2. Instruction clarity — actionable, unambiguous, no over-constraint
    3. Scope appropriateness — focused, no overlap with other skills
    4. Test coverage — happy path, edge cases, negative cases
  • Orchestrator: Dispatches to both workers, collects independent verdicts, identifies agreements (high confidence) vs. disagreements (requires judgment), and produces a structured consensus report:

    ## Skill Validator Consensus Report: [Skill Name]
    Dotnet Verdict / Anthropic Verdict / Consensus
    Points of Agreement · Points of Disagreement
    Adopted Suggestions · Declined Suggestions (with rationale)
    Final Recommendation
    

Consensus thresholds (defined in SharedContext):

  • KEEP = both say KEEP, or one KEEP + one IMPROVE
  • IMPROVE = evaluators disagree, or both say IMPROVE
  • REMOVE = either says REMOVE with strong evidence
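
The threshold table above is a small decision function. A language-neutral Python sketch (the function name is hypothetical — the actual preset encodes these rules as SharedContext instructions, and the "strong evidence" judgment on REMOVE is left to the orchestrator):

```python
def consensus(dotnet_verdict, anthropic_verdict):
    """Combine the two evaluator verdicts per the thresholds above.

    Each verdict is "KEEP", "IMPROVE", or "REMOVE". REMOVE is checked
    first because either evaluator saying REMOVE takes priority.
    """
    verdicts = {dotnet_verdict, anthropic_verdict}
    if "REMOVE" in verdicts:
        return "REMOVE"
    if verdicts <= {"KEEP"} or verdicts == {"KEEP", "IMPROVE"}:
        return "KEEP"  # both KEEP, or one KEEP + one IMPROVE
    return "IMPROVE"   # both IMPROVE (other disagreements handled above)
```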

Config: OrchestratorModel: claude-opus-4.6, WorkerModels: 2x claude-sonnet-4.6, MaxReflectIterations: 6

Current State

  • ✅ Built-in preset added to PolyPilot/Models/ModelCapabilities.cs
  • ✅ New test: BuiltInPresets_IncludeSkillValidator in SessionOrganizationTests.cs
  • ✅ Updated Scenario_CreateGroupFromPreset preset count expectation (2 → 3)
  • ✅ Build passes — 0 errors, 7 pre-existing warnings
  • ✅ 1661 tests pass (1 pre-existing failure: PlatformHelperTests.AvailableModes_OnNonDesktop_IsRemoteOnly, unrelated to this change)
  • ✅ Challenger review passed

Files Changed

| File | Change |
|------|--------|
| PolyPilot/Models/ModelCapabilities.cs | New "Skill Validator" entry in BuiltIn array (+191 lines) |
| PolyPilot.Tests/SessionOrganizationTests.cs | New test + updated preset count expectation (+20, -2 lines) |

Future Work / Potential Improvements

  • Add more granular evaluation rubrics based on real-world skill validation experience
  • Integrate with the skill-creator skill to provide an end-to-end skill development workflow (create → validate → iterate)
  • Consider a "Skill Validator Lite" broadcast-mode variant for quick single-pass evaluation without a consensus loop
  • Could expose consensus thresholds (KEEP/IMPROVE/REMOVE) as configurable preset parameters
  • Add eval.yaml scenario examples to the SharedContext to guide workers on what good test cases look like

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@PureWeen PureWeen marked this pull request as draft March 8, 2026 00:43
PureWeen and others added 2 commits March 10, 2026 17:53
Adds a new built-in GroupPreset called "Skill Validator" (⚖️) that
pits two evaluators against each other to assess skills:

- Worker 1 (Dotnet Skill Validator): empirical, outcome-focused evaluation
  using methodology inspired by dotnet/skills skill-validator — measures
  baseline vs. skill-augmented performance, pairwise comparative judging,
  statistical confidence assessment.

- Worker 2 (Anthropic Skill Evaluator): prompt-design-focused evaluation
  assessing description quality/trigger accuracy, instruction clarity,
  scope appropriateness, and test coverage.

- Orchestrator: routes work to both evaluators, highlights agreements and
  disagreements, explains which suggestions are adopted and why, produces
  a consensus KEEP/IMPROVE/REMOVE verdict.

Configuration:
- Mode: OrchestratorReflect
- OrchestratorModel: claude-opus-4.6
- WorkerModels: 2x claude-sonnet-4.6
- MaxReflectIterations: 6

Also updates Scenario_CreateGroupFromPreset test to expect 3 built-in
presets and adds BuiltInPresets_IncludeSkillValidator test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-dispatch clearing

Three fixes to dramatically reduce recovery time for stuck sessions:

1. Escalation timeout: After the first 600s timeout with no events (but server
   alive), switch to 60s timeout for subsequent checks. Reduces max stuck time
   from ~40 minutes to ~12 minutes.

2. Tool health check: Start a 30s timer when a tool begins executing. If no
   events arrive within 30s, check if the connection is still alive. After 3
   consecutive stale checks (90s total), trigger recovery. This detects dead
   connections much faster than waiting for the 10-minute watchdog.

3. Re-dispatch clearing: Before re-dispatching workers after app restart,
   clear any stuck IsProcessing/ActiveToolCallCount/SendingFlag state from
   the previous incomplete turn. This allows SendPromptAsync to accept the
   new prompt instead of rejecting with 'already processing'.

Also resets HasUsedToolsThisTurn on reconnect so the watchdog uses the
appropriate timeout (120s instead of 600s) for the fresh connection.
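
The escalation rule in fixes 1 and 4 reduces to a timeout selection. A Python sketch for illustration (constants from the commit message; the function name and shape are hypothetical — the real logic lives in the C# watchdog):

```python
def watchdog_timeout(case_a_resets, has_used_tools_this_turn):
    """Pick the watchdog timeout (seconds) per the escalation rule above.

    A tool-using turn gets the full 600s on the first check; once a check
    has fired with no events (but the server is alive), subsequent checks
    escalate to 60s. A fresh reconnect resets HasUsedToolsThisTurn, so it
    falls back to the short 120s timeout.
    """
    if not has_used_tools_this_turn:
        return 120
    return 600 if case_a_resets == 0 else 60
```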

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the feat/skill-validator-preset branch from eeea8ee to 0359d3b Compare March 10, 2026 22:57
PureWeen and others added 2 commits March 10, 2026 21:17
1. Stagger worker dispatch: Add 1s delay between worker launches to prevent
   burst connection saturation (previously 3/5 workers crashed with IOException)

2. IOException retry: Wrap SendPromptAndWaitAsync in retry loop (max 2 attempts,
   2s delay) using existing IsConnectionError() helper for resilience

3. Smart Case A watchdog: Replace fixed timeout and reset cap with events.jsonl
   mtime freshness check. Fresh file (<60s) = wait indefinitely (tool actively
   running), stale = 1 confirmation cycle then terminate. Protects active tools.

4. TurnEnd fallback re-arm: Add FallbackCanceledByTurnStart diagnostic flag
   so TurnEnd can log when re-arming the fallback timer after TurnStart canceled it

Live test results: 5/5 workers connected (was 2/5), 4/5 returned real responses,
full orchestration synthesis completed. Remaining empty response from worker-2
is caused by upstream SDK bug #299 (missing SessionIdleEvent).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The smart Case A watchdog (events.jsonl freshness) now handles dead session
detection in ~90 seconds. The WorkerExecutionTimeout only needs to be an
absolute backstop. At 10 minutes, it was prematurely terminating workers
with 200+ tool calls that were actively processing (e.g. long PR reviews).

This caused cascading failures: the session stayed in IsProcessing=true
from the SDK side, so re-dispatch attempts got 'Session is already
processing a request' errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen

PR #302 Review — Multi-Model Consensus (5 models, 2+ agreement filter)

PR: Add built-in "Skill Validator" multi-agent preset (DRAFT)
CI: No checks configured
Prior reviews: None
Note: DRAFT PR. The Skill Validator preset (ModelCapabilities.cs) is clean data — no bugs. The runtime watchdog/tool-health changes have significant issues.


🔴 CRITICAL — MonitorAndSynthesizeAsync hangs orchestrator + off-UI-thread mutation (5/5 models)

CopilotService.Organization.cs — worker re-dispatch loop

The new block sets workerState.Info.IsProcessing = false directly on a background thread (INV-2 violation) and does NOT call state.ResponseCompletion?.TrySetResult(...). Since SendPromptAndWaitAsync blocks on ResponseCompletion.Task, the orchestrator loop awaiting the re-dispatched worker hangs indefinitely. Also missing from INV-1: IsResumed, ProcessingStartedAt, ToolCallCount, ProcessingPhase, ClearPermissionDenials(), FlushCurrentResponse(), OnSessionComplete, diagnostic log tag.


🔴 CRITICAL — TriggerToolHealthRecovery INV-1 violation (5/5 models)

CopilotService.Events.cs — new method

This 9th path that clears IsProcessing skips 6 required cleanup fields. The missing OnSessionComplete means multi-agent orchestrations listening on this signal are never notified, leaving sessions showing "Thinking..." on mobile clients forever.

Also: TrySetResult uses only CurrentResponse.ToString(), but CompleteResponse uses FlushedResponse + CurrentResponse. If a TurnEnd flush already moved content to FlushedResponse (clearing CurrentResponse), the orchestrator gets an empty string as the worker result.

Missing: IsResumed = false, ProcessingStartedAt = null, ToolCallCount = 0, ProcessingPhase = 0, ClearPermissionDenials(), OnSessionComplete?.Invoke().


🟡 MODERATE — WatchdogCaseAResets double-increment race (4/5 models)

CopilotService.Events.cs

Both the 30s tool health timer AND the main watchdog Case A path call Interlocked.Increment(ref state.WatchdogCaseAResets) on the same counter. With WatchdogMaxToolAliveResets = 2 and threshold resets > 1: health check fires at T+30s (→1), watchdog fires at T+60s (→2, triggers termination). One stale observation is enough to terminate rather than the intended 2-cycle confirmation — false positives for legitimate long-running tools.


🟡 MODERATE — CancelToolHealthCheck missing from CompleteResponse / AbortSessionAsync (4/5 models)

CopilotService.Events.cs — CompleteResponse

CompleteResponse calls CancelProcessingWatchdog and CancelTurnEndFallback but not CancelToolHealthCheck. The orphaned timer fires 30s after session completes, finds IsProcessing=false, returns early — no correctness bug, but Timer is never disposed, accumulating leaks at high throughput.


Test Coverage

BuiltInPresets_IncludeSkillValidator is solid. Missing: (1) TriggerToolHealthRecovery verifies all INV-1 fields cleared; (2) re-dispatch path verifies ResponseCompletion is unblocked; (3) counter test verifying no double-increment.


⚠️ Request changes (draft — not merge-ready)

The Skill Validator preset config is clean. Three runtime bugs to fix:

  1. Add ResponseCompletion?.TrySetResult(...) and wrap in InvokeOnUI in the MonitorAndSynthesizeAsync re-dispatch path
  2. Complete INV-1 cleanup in TriggerToolHealthRecovery (6 missing fields + fix empty-content TrySetResult)
  3. Separate the WatchdogCaseAResets counter or coordinate so only one path increments per cycle

@PureWeen

Round 1 review complete — see findings below

@PureWeen

PR #302 — 5-Model Consensus Review

Title: Add built-in "Skill Validator" multi-agent preset + watchdog/tool-health improvements
CI: ⚠️ No checks configured on feat/skill-validator-preset
Prior reviews: None
Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex


🔴 CRITICAL — TriggerToolHealthRecovery violates IsProcessing Cleanup Invariant (5/5 models)

PolyPilot/Services/CopilotService.Events.cs — TriggerToolHealthRecovery

The new recovery method sets IsProcessing = false but skips the cleanup required by every other path that clears processing state. Missing:

  • CancelProcessingWatchdog(state) — watchdog keeps running, fires a second timeout on the already-recovered session (3/5 models)
  • CancelTurnEndFallback(state) — TurnEnd fallback fires ~4s later, calling CompleteResponse on a recovered session; guards prevent double-complete but leave stale companion fields (2/5 models)
  • state.Info.IsResumed = false — next turn watchdog uses 30s resume-quiescence instead of 600s tool timeout → premature kill (4/5 models)
  • state.Info.ProcessingPhase = 0 — UI stuck showing "Working…" forever (4/5 models)
  • state.Info.ProcessingStartedAt = null — elapsed time counter never stops (3/5 models)
  • state.SuccessfulToolCountThisTurn = 0 (3/5 models)
  • state.WatchdogCaseAResets = 0 — stale count affects next turn's watchdog timeout selection (3/5 models)
  • OnSessionComplete?.Invoke(sessionName, ...) — orchestrator worker loops: ResponseCompletion TCS unblocks via TrySetResult but MonitorAndSynthesizeAsync never gets its completion notification, blocking synthesis for up to 60 minutes (4/5 models)

Fix: Replace the manual clear with a call to CompleteResponse(state) (or extract the shared cleanup into a helper), and add CancelProcessingWatchdog(state) + CancelTurnEndFallback(state) before the InvokeOnUI.


🔴 CRITICAL — MonitorAndSynthesizeAsync clears IsProcessing off the UI thread (3/5 models)

PolyPilot/Services/CopilotService.Organization.cs — MonitorAndSynthesizeAsync

workerState.Info.IsProcessing = false;  // runs on orchestration task thread
Interlocked.Exchange(ref workerState.ActiveToolCallCount, 0);

Per the codebase invariant: "All mutations to state.Info.IsProcessing must be marshaled to the UI thread." This runs on the orchestration background task, racing with watchdog/event handlers. Additionally, ResponseCompletion?.TrySetResult(...) is never called — workers blocked in SendPromptAndWaitAsync awaiting the TCS will never complete, so Task.WhenAll(workerTasks) hangs indefinitely with IsProcessing = false in the UI (2/5 models).

Fix: Wrap in InvokeOnUI(() => { ... }) and add workerState.ResponseCompletion?.TrySetResult(...).


🔴 CRITICAL — ToolHealthCheckTimer not cancelled in session cleanup paths (2/5 models)

PolyPilot/Services/CopilotService.Events.cs / CopilotService.cs

CancelToolHealthCheck is never called in:

  • AbortSessionAsync (cancels watchdog + fallback, but not health check) — stale 30s timer fires on an aborted session; if a new prompt arrives before it fires, the generation guard may not protect it if ProcessingGeneration didn't change
  • ReconnectAsync / DisposeAsync session cleanup loops (cancel watchdog + fallback, not health check)
  • CloseSessionCoreAsync

After a reconnect, old health-check timers hold closures over old SessionState objects and can fire on detached state.

Fix: Add CancelToolHealthCheck(state) to AbortSessionAsync alongside the existing CancelProcessingWatchdog call, and to the cleanup loops in ReconnectAsync/DisposeAsync.


🟡 MODERATE — WatchdogCaseAResets shared between two independent systems (4/5 models)

PolyPilot/Services/CopilotService.Events.cs

Both ScheduleToolHealthCheck's timer callback and RunProcessingWatchdogAsync Case A increment WatchdogCaseAResets independently. The watchdog also reads the counter to select between 600s and 60s escalation timeouts (caseAResets > 0). A single stale health check at t=30s sets caseAResets = 1, permanently switching the watchdog to the 10× shorter timeout for the rest of that turn — even if the connection recovers and events resume. With WatchdogMaxToolAliveResets reduced from 3 to 2, a health check fire + watchdog fire in the same cycle could kill a legitimately running tool prematurely.

Fix: Use a separate field for the health check's stale counter (e.g., ToolHealthStaleChecks) instead of sharing WatchdogCaseAResets.


🟡 MODERATE — WorkerExecutionTimeout 10 min → 60 min regresses remote/demo mode (3/5 models)

PolyPilot/Services/CopilotService.Organization.cs line ~33

ScheduleToolHealthCheck returns early for demo and remote mode. The watchdog's Case A also skips for demo/remote. The 60-minute timeout is the only backstop for workers in these modes. A stuck remote worker now blocks orchestration for a full hour vs. the previous 10 minutes.

Fix: Keep WorkerExecutionTimeout = 60 minutes for non-remote/demo but restore a shorter backstop (e.g., 10 min) for remote/demo mode, or add a separate remote-mode health check mechanism.


🟡 MODERATE — ScheduleToolHealthCheck timer/cancel race window (3/5 models)

PolyPilot/Services/CopilotService.Events.cs — ScheduleToolHealthCheck

The timer is constructed (and starts counting) before Interlocked.Exchange(ref state.ToolHealthCheckTimer, timer) stores it. A concurrent CancelToolHealthCheck (e.g., from ToolExecutionCompleteEvent) runs between construction and storage, finds the field null (just cleared), and becomes a no-op. The new timer is then stored and runs untracked. The activeTools <= 0 guard in the callback prevents false recovery, but the callback still runs and may reschedule recursively.

Fix: Store the timer before the delay starts, or use a CancellationTokenSource pattern instead of Interlocked on Timer?.


🟢 MINOR — Dead code after retry loop in ExecuteWorkerAsync (3/5 models)

PolyPilot/Services/CopilotService.Organization.csExecuteWorkerAsync

With maxRetries = 2, attempt = 2 always falls into the unconditional catch (Exception ex) block and returns. The return new WorkerResult(..., "Max retries exceeded", ...) after the loop is unreachable.
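
The unreachable return is easiest to see in a skeleton of the loop (Python sketch, names hypothetical; the 2s backoff between attempts is elided):

```python
def execute_worker(send, prompt, max_retries=2,
                   is_connection_error=lambda ex: isinstance(ex, IOError)):
    """Skeleton of the retry loop described above. Every iteration either
    returns a result or returns a failure from the unconditional catch,
    so the final attempt always returns inside the loop and control
    never reaches the line after it."""
    for attempt in range(1, max_retries + 1):
        try:
            return send(prompt)
        except Exception as ex:
            if attempt < max_retries and is_connection_error(ex):
                continue  # transient connection error: retry
            return f"worker failed: {ex}"  # final attempt lands here
    return "Max retries exceeded"  # unreachable, as the finding notes
```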


Test Coverage Assessment

The PR adds BuiltInPresets_IncludeSkillValidator (validates preset structure) and updates the preset count test. Missing tests for the new runtime paths:

  • TriggerToolHealthRecovery clearing all IsProcessing companion fields (especially IsResumed, ProcessingPhase, ProcessingStartedAt)
  • CancelToolHealthCheck being called in AbortSessionAsync / DisposeAsync / ReconnectAsync
  • ScheduleToolHealthCheck not firing in demo/remote mode
  • WatchdogCaseAResets counter interaction between health check and watchdog
  • MonitorAndSynthesizeAsync re-dispatch unblocking ResponseCompletion TCS

Summary

| Issue | Severity | Ship-blocker? |
|-------|----------|---------------|
| TriggerToolHealthRecovery incomplete cleanup (missing 7+ fields, watchdog, OnSessionComplete) | 🔴 CRITICAL | Yes — stale state corrupts next turn; orchestrator loops may hang 60 min |
| MonitorAndSynthesizeAsync IsProcessing mutation off UI thread + missing TrySetResult | 🔴 CRITICAL | Yes — race condition + synthesis deadlock |
| ToolHealthCheckTimer not cancelled in abort/dispose/reconnect | 🔴 CRITICAL | Yes — timer fires on orphaned state after reconnect |
| WatchdogCaseAResets shared between health check and watchdog | 🟡 MODERATE | Yes — can prematurely kill legitimately running tools |
| WorkerExecutionTimeout 60 min for remote/demo | 🟡 MODERATE | Yes — hour-long hang regression |
| Timer/cancel race in ScheduleToolHealthCheck | 🟡 MODERATE | No — mitigated by activeTools guard |
| Dead code after retry loop | 🟢 MINOR | No |

⚠️ Request changes — three CRITICAL issues must be addressed before merge. The TriggerToolHealthRecovery cleanup gap is particularly dangerous as it violates the well-documented IsProcessing invariant. The Skill Validator preset itself is clean and uncontroversial.


5-model review: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex. Consensus filter: issues flagged by 2+ models only.

@PureWeen

Round 1 Review

@PureWeen PureWeen marked this pull request as ready for review March 11, 2026 04:24
CRITICAL fixes:
- TriggerToolHealthRecovery: complete INV-1 cleanup (was missing 8+ fields,
  OnSessionComplete, CancelProcessingWatchdog/TurnEndFallback, proper
  FlushedResponse+CurrentResponse for TCS result)
- MonitorAndSynthesizeAsync: wrap IsProcessing mutation in InvokeOnUI with
  TCS synchronization, add ResponseCompletion.TrySetResult, complete all
  companion field cleanup
- CancelToolHealthCheck added to all 14 cleanup paths (AbortSessionAsync,
  ReconnectAsync, DisposeAsync, CloseSessionCoreAsync, CompleteResponse,
  watchdog timeouts)

MODERATE fixes:
- Separate ToolHealthStaleChecks counter from WatchdogCaseAResets to prevent
  cross-system interference
- WorkerExecutionTimeoutRemote (10 min) for remote/demo mode where smart
  watchdog is unavailable; keep 60 min for normal mode
- Fix ScheduleToolHealthCheck timer/cancel race: create dormant, store via
  Interlocked.Exchange, then start with timer.Change()

MINOR: Replace unreachable code after retry loop with UnreachableException

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen

PR #302 — Round 2 Re-Review (post fix commit eedf8e59)

CI: ⚠️ No checks configured | Tests: ✅ 2422 passed, 0 failed

Previous Findings Status

| # | Round 1 Finding | Status | Evidence |
|---|-----------------|--------|----------|
| 🔴 C1 | TriggerToolHealthRecovery INV-1 violation (4 of 12+ fields) | FIXED | Full companion field cleanup added: all 9 fields + CancelProcessingWatchdog + CancelTurnEndFallback + OnSessionComplete + diagnostic log tag |
| 🔴 C2 | MonitorAndSynthesizeAsync background-thread IsProcessing=false | FIXED | Wrapped in InvokeOnUI() with companion field cleanup |
| 🟡 M1 | WatchdogCaseAResets shared counter double-increment | FIXED | New separate counter ToolHealthStaleChecks — no longer shared with WatchdogCaseAResets |
| 🟡 M2 | ToolHealthCheckTimer not cancelled on teardown | FIXED | CancelToolHealthCheck added to all 14 cleanup paths (CompleteResponse, AbortSessionAsync, CloseSessionCoreAsync, DisposeAsync, ReconnectAsync, watchdog timeouts) |
| 🟡 M3 | TCS result drops FlushedResponse content | FIXED | Now uses FlushedResponse + CurrentResponse like CompleteResponse |

New Issues (Round 2)

None found. All 5 Round 1 findings are resolved.

Test Coverage Gap (non-blocking)

Zero unit tests for the new ToolHealthCheckTimer code paths. Suggested additions:

  1. ToolHealthCheck_CompletedTool_DoesNotTriggerRecovery
  2. ToolHealthCheck_StaleEvents_TriggersRecovery
  3. WatchdogMaxToolAliveResets_BoundsMaxStuckTime_WithEscalation
  4. TriggerToolHealthRecovery_NotifiesOnSessionComplete

Verdict: ✅ Approve

All critical and moderate findings from Round 1 have been addressed. Tests pass (2422/2422). The missing test coverage for ToolHealthCheckTimer paths is noted but non-blocking — the existing watchdog tests cover the broader processing-state invariants.

@PureWeen PureWeen merged commit eaba0b1 into main Mar 11, 2026
@PureWeen PureWeen deleted the feat/skill-validator-preset branch March 11, 2026 04:54
PureWeen added a commit that referenced this pull request Mar 12, 2026
…352)

## Problem

The **Evaluate-pr-tests-orchestrator** fails to complete with two
symptoms:
1. Orchestrator loops indefinitely with "0 raw assignments" 
2. Workers return empty responses despite having processed content

### Root Cause 1: Completion Override Infinite Loop
When workers fail/return empty (SDK bug #299), `dispatchedWorkers` stays
empty → `allWorkersDispatched = false` → `[[GROUP_REFLECT_COMPLETE]]` is
overridden → orchestrator re-prompted but says "nothing to delegate" →
repeat until MaxIterations.

### Root Cause 2: Empty Worker Responses  
Workers complete but SDK never sends SessionIdleEvent. Watchdog fires
and completes the session, but `FlushedResponse`/`CurrentResponse` are
empty because content was in delta events that ended up in chat history
but not the response buffers.

## Fixes

### 1. Accept completion when all workers were attempted
Changed both evaluator and self-eval paths: if all workers were
**attempted** (even if some failed/returned empty), accept
`[[GROUP_REFLECT_COMPLETE]]` instead of overriding it. Uses
`allWorkersAttempted` (`attemptedWorkers` set) alongside
`allWorkersDispatched` (`dispatchedWorkers` set).
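
The completion-override rule after the fix can be sketched as a single predicate (Python for illustration; set names from the commit, function shape hypothetical):

```python
def accept_completion(all_workers, dispatched, attempted):
    """Honor [[GROUP_REFLECT_COMPLETE]] once every worker was at least
    *attempted*, even if some failed or returned empty (SDK bug #299),
    instead of only when every worker *dispatched* successfully."""
    return (set(attempted) >= set(all_workers)
            or set(dispatched) >= set(all_workers))
```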

### 2. Chat history fallback for empty responses
When a worker returns empty after completion + revival, extract the last
assistant message from chat history as a fallback. This recovers content
that was streamed via delta events but lost when the watchdog completed
the session with empty response buffers.
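
The fallback read can be sketched as (Python for illustration; buffer and field names mirror the description above but the function is hypothetical):

```python
def worker_result(flushed, current, chat_history):
    """Read the response buffers; if both are empty — content was streamed
    via delta events but never flushed — recover the last assistant
    message from chat history instead."""
    text = (flushed + current).strip()
    if text:
        return text
    for msg in reversed(chat_history):
        if msg.get("role") == "assistant" and msg.get("content"):
            return msg["content"]
    return ""
```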

## Testing
- All 2422 tests pass
- Built and relaunched successfully

## Related
- SDK bug #299 (missing SessionIdleEvent) — upstream issue causing
workers to appear stuck
- PR #302 — smart watchdog, stagger dispatch, IOException retry (merged)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…ureWeen#352)
