
Add built-in "Skill Validator" multi-agent preset #302

Merged
PureWeen merged 5 commits into main from feat/skill-validator-preset
Mar 11, 2026

Conversation


@PureWeen PureWeen commented Mar 7, 2026

What

Adds a new built-in "Skill Validator" (⚖️) preset to GroupPreset.BuiltIn — a multi-agent group that evaluates skills from two complementary angles and builds a consensus verdict.

How it works

Uses OrchestratorReflect mode with two specialized workers and an orchestrator that synthesizes their findings:

  • Worker 1 — Dotnet Skill Validator: Empirical, outcome-focused evaluation using a 5-step methodology:

    1. Inspect the skill definition (SKILL.md + eval.yaml)
    2. Baseline comparison — describe agent behavior without vs. with the skill
    3. Pairwise comparative judgment with position-swap bias mitigation
    4. Statistical confidence assessment across the scenario set
    5. Produce a scored verdict (KEEP / IMPROVE / REMOVE)
  • Worker 2 — Anthropic Skill Evaluator: Prompt-design-focused evaluation across 4 dimensions:

    1. Description quality & trigger accuracy (precision/recall rating)
    2. Instruction clarity — actionable, unambiguous, no over-constraint
    3. Scope appropriateness — focused, no overlap with other skills
    4. Test coverage — happy path, edge cases, negative cases
  • Orchestrator: Dispatches to both workers, collects independent verdicts, identifies agreements (high confidence) vs. disagreements (requires judgment), and produces a structured consensus report:

    ## Skill Validator Consensus Report: [Skill Name]
    Dotnet Verdict / Anthropic Verdict / Consensus
    Points of Agreement · Points of Disagreement
    Adopted Suggestions · Declined Suggestions (with rationale)
    Final Recommendation
    

Consensus thresholds (defined in SharedContext):

  • KEEP = both say KEEP, or one KEEP + one IMPROVE
  • IMPROVE = evaluators disagree, or both say IMPROVE
  • REMOVE = either says REMOVE with strong evidence
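
The threshold table above is a small decision function. A language-neutral Python sketch (the function name is hypothetical — the actual preset encodes these rules as SharedContext instructions, and the "strong evidence" judgment on REMOVE is left to the orchestrator):

```python
def consensus(dotnet_verdict, anthropic_verdict):
    """Combine the two evaluator verdicts per the thresholds above.

    Each verdict is "KEEP", "IMPROVE", or "REMOVE". REMOVE is checked
    first because either evaluator saying REMOVE takes priority.
    """
    verdicts = {dotnet_verdict, anthropic_verdict}
    if "REMOVE" in verdicts:
        return "REMOVE"
    if verdicts <= {"KEEP"} or verdicts == {"KEEP", "IMPROVE"}:
        return "KEEP"  # both KEEP, or one KEEP + one IMPROVE
    return "IMPROVE"   # both IMPROVE (other disagreements handled above)
```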

Config: OrchestratorModel: claude-opus-4.6, WorkerModels: 2x claude-sonnet-4.6, MaxReflectIterations: 6

Current State

  • ✅ Built-in preset added to PolyPilot/Models/ModelCapabilities.cs
  • ✅ New test: BuiltInPresets_IncludeSkillValidator in SessionOrganizationTests.cs
  • ✅ Updated Scenario_CreateGroupFromPreset preset count expectation (2 → 3)
  • ✅ Build passes — 0 errors, 7 pre-existing warnings
  • ✅ 1661 tests pass (1 pre-existing failure: PlatformHelperTests.AvailableModes_OnNonDesktop_IsRemoteOnly, unrelated to this change)
  • ✅ Challenger review passed

Files Changed

| File | Change |
|------|--------|
| PolyPilot/Models/ModelCapabilities.cs | New "Skill Validator" entry in BuiltIn array (+191 lines) |
| PolyPilot.Tests/SessionOrganizationTests.cs | New test + updated preset count expectation (+20, -2 lines) |

Future Work / Potential Improvements

  • Add more granular evaluation rubrics based on real-world skill validation experience
  • Integrate with the skill-creator skill to provide an end-to-end skill development workflow (create → validate → iterate)
  • Consider a "Skill Validator Lite" broadcast-mode variant for quick single-pass evaluation without a consensus loop
  • Could expose consensus thresholds (KEEP/IMPROVE/REMOVE) as configurable preset parameters
  • Add eval.yaml scenario examples to the SharedContext to guide workers on what good test cases look like

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@PureWeen PureWeen marked this pull request as draft March 8, 2026 00:43
PureWeen and others added 2 commits March 10, 2026 17:53
Adds a new built-in GroupPreset called "Skill Validator" (⚖️) that
pits two evaluators against each other to assess skills:

- Worker 1 (Dotnet Skill Validator): empirical, outcome-focused evaluation
  using methodology inspired by dotnet/skills skill-validator — measures
  baseline vs. skill-augmented performance, pairwise comparative judging,
  statistical confidence assessment.

- Worker 2 (Anthropic Skill Evaluator): prompt-design-focused evaluation
  assessing description quality/trigger accuracy, instruction clarity,
  scope appropriateness, and test coverage.

- Orchestrator: routes work to both evaluators, highlights agreements and
  disagreements, explains which suggestions are adopted and why, produces
  a consensus KEEP/IMPROVE/REMOVE verdict.

Configuration:
- Mode: OrchestratorReflect
- OrchestratorModel: claude-opus-4.6
- WorkerModels: 2x claude-sonnet-4.6
- MaxReflectIterations: 6

Also updates Scenario_CreateGroupFromPreset test to expect 3 built-in
presets and adds BuiltInPresets_IncludeSkillValidator test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-dispatch clearing

Three fixes to dramatically reduce recovery time for stuck sessions:

1. Escalation timeout: After the first 600s timeout with no events (but server
   alive), switch to 60s timeout for subsequent checks. Reduces max stuck time
   from ~40 minutes to ~12 minutes.

2. Tool health check: Start a 30s timer when a tool begins executing. If no
   events arrive within 30s, check if the connection is still alive. After 3
   consecutive stale checks (90s total), trigger recovery. This detects dead
   connections much faster than waiting for the 10-minute watchdog.

3. Re-dispatch clearing: Before re-dispatching workers after app restart,
   clear any stuck IsProcessing/ActiveToolCallCount/SendingFlag state from
   the previous incomplete turn. This allows SendPromptAsync to accept the
   new prompt instead of rejecting with 'already processing'.

Also resets HasUsedToolsThisTurn on reconnect so the watchdog uses the
appropriate timeout (120s instead of 600s) for the fresh connection.
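
The escalation rule in fixes 1 and 4 reduces to a timeout selection. A Python sketch for illustration (constants from the commit message; the function name and shape are hypothetical — the real logic lives in the C# watchdog):

```python
def watchdog_timeout(case_a_resets, has_used_tools_this_turn):
    """Pick the watchdog timeout (seconds) per the escalation rule above.

    A tool-using turn gets the full 600s on the first check; once a check
    has fired with no events (but the server is alive), subsequent checks
    escalate to 60s. A fresh reconnect resets HasUsedToolsThisTurn, so it
    falls back to the short 120s timeout.
    """
    if not has_used_tools_this_turn:
        return 120
    return 600 if case_a_resets == 0 else 60
```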

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the feat/skill-validator-preset branch from eeea8ee to 0359d3b Compare March 10, 2026 22:57
PureWeen and others added 2 commits March 10, 2026 21:17
1. Stagger worker dispatch: Add 1s delay between worker launches to prevent
   burst connection saturation (previously 3/5 workers crashed with IOException)

2. IOException retry: Wrap SendPromptAndWaitAsync in retry loop (max 2 attempts,
   2s delay) using existing IsConnectionError() helper for resilience

3. Smart Case A watchdog: Replace fixed timeout and reset cap with events.jsonl
   mtime freshness check. Fresh file (<60s) = wait indefinitely (tool actively
   running), stale = 1 confirmation cycle then terminate. Protects active tools.

4. TurnEnd fallback re-arm: Add FallbackCanceledByTurnStart diagnostic flag
   so TurnEnd can log when re-arming the fallback timer after TurnStart canceled it

Live test results: 5/5 workers connected (was 2/5), 4/5 returned real responses,
full orchestration synthesis completed. Remaining empty response from worker-2
is caused by upstream SDK bug #299 (missing SessionIdleEvent).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The smart Case A watchdog (events.jsonl freshness) now handles dead session
detection in ~90 seconds. The WorkerExecutionTimeout only needs to be an
absolute backstop. At 10 minutes, it was prematurely terminating workers
with 200+ tool calls that were actively processing (e.g. long PR reviews).

This caused cascading failures: the session stayed in IsProcessing=true
from the SDK side, so re-dispatch attempts got 'Session is already
processing a request' errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen

PR #302 Review — Multi-Model Consensus (5 models, 2+ agreement filter)

PR: Add built-in "Skill Validator" multi-agent preset (DRAFT)
CI: No checks configured
Prior reviews: None
Note: DRAFT PR. The Skill Validator preset (ModelCapabilities.cs) is clean data — no bugs. The runtime watchdog/tool-health changes have significant issues.


🔴 CRITICAL — MonitorAndSynthesizeAsync hangs orchestrator + off-UI-thread mutation (5/5 models)

CopilotService.Organization.cs — worker re-dispatch loop

The new block sets workerState.Info.IsProcessing = false directly on a background thread (INV-2 violation) and does NOT call state.ResponseCompletion?.TrySetResult(...). Since SendPromptAndWaitAsync blocks on ResponseCompletion.Task, the orchestrator loop awaiting the re-dispatched worker hangs indefinitely. Also missing from INV-1: IsResumed, ProcessingStartedAt, ToolCallCount, ProcessingPhase, ClearPermissionDenials(), FlushCurrentResponse(), OnSessionComplete, diagnostic log tag.


🔴 CRITICAL — TriggerToolHealthRecovery INV-1 violation (5/5 models)

CopilotService.Events.cs — new method

This 9th path that clears IsProcessing skips 6 required cleanup fields. The missing OnSessionComplete means multi-agent orchestrations listening on this signal are never notified, leaving sessions showing "Thinking..." on mobile clients forever.

Also: TrySetResult uses only CurrentResponse.ToString(), but CompleteResponse uses FlushedResponse + CurrentResponse. If a TurnEnd flush already moved content to FlushedResponse (clearing CurrentResponse), the orchestrator gets an empty string as the worker result.

Missing: IsResumed = false, ProcessingStartedAt = null, ToolCallCount = 0, ProcessingPhase = 0, ClearPermissionDenials(), OnSessionComplete?.Invoke().


🟡 MODERATE — WatchdogCaseAResets double-increment race (4/5 models)

CopilotService.Events.cs

Both the 30s tool health timer AND the main watchdog Case A path call Interlocked.Increment(ref state.WatchdogCaseAResets) on the same counter. With WatchdogMaxToolAliveResets = 2 and threshold resets > 1: health check fires at T+30s (→1), watchdog fires at T+60s (→2, triggers termination). One stale observation is enough to terminate rather than the intended 2-cycle confirmation — false positives for legitimate long-running tools.


🟡 MODERATE — CancelToolHealthCheck missing from CompleteResponse / AbortSessionAsync (4/5 models)

CopilotService.Events.cs — CompleteResponse

CompleteResponse calls CancelProcessingWatchdog and CancelTurnEndFallback but not CancelToolHealthCheck. The orphaned timer fires 30s after session completes, finds IsProcessing=false, returns early — no correctness bug, but Timer is never disposed, accumulating leaks at high throughput.


Test Coverage

BuiltInPresets_IncludeSkillValidator is solid. Missing: (1) TriggerToolHealthRecovery verifies all INV-1 fields cleared; (2) re-dispatch path verifies ResponseCompletion is unblocked; (3) counter test verifying no double-increment.


⚠️ Request changes (draft — not merge-ready)

The Skill Validator preset config is clean. Three runtime bugs to fix:

  1. Add ResponseCompletion?.TrySetResult(...) and wrap in InvokeOnUI in the MonitorAndSynthesizeAsync re-dispatch path
  2. Complete INV-1 cleanup in TriggerToolHealthRecovery (6 missing fields + fix empty-content TrySetResult)
  3. Separate the WatchdogCaseAResets counter or coordinate so only one path increments per cycle

@PureWeen

Round 1 review complete — see findings below

@PureWeen

PR #302 — 5-Model Consensus Review

Title: Add built-in "Skill Validator" multi-agent preset + watchdog/tool-health improvements
CI: ⚠️ No checks configured on feat/skill-validator-preset
Prior reviews: None
Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex


🔴 CRITICAL — TriggerToolHealthRecovery violates IsProcessing Cleanup Invariant (5/5 models)

PolyPilot/Services/CopilotService.Events.cs — TriggerToolHealthRecovery

The new recovery method sets IsProcessing = false but skips the cleanup required by every other path that clears processing state. Missing:

  • CancelProcessingWatchdog(state) — watchdog keeps running, fires a second timeout on the already-recovered session (3/5 models)
  • CancelTurnEndFallback(state) — TurnEnd fallback fires ~4s later, calling CompleteResponse on a recovered session; guards prevent double-complete but leave stale companion fields (2/5 models)
  • state.Info.IsResumed = false — next turn watchdog uses 30s resume-quiescence instead of 600s tool timeout → premature kill (4/5 models)
  • state.Info.ProcessingPhase = 0 — UI stuck showing "Working…" forever (4/5 models)
  • state.Info.ProcessingStartedAt = null — elapsed time counter never stops (3/5 models)
  • state.SuccessfulToolCountThisTurn = 0 (3/5 models)
  • state.WatchdogCaseAResets = 0 — stale count affects next turn's watchdog timeout selection (3/5 models)
  • OnSessionComplete?.Invoke(sessionName, ...) — orchestrator worker loops: ResponseCompletion TCS unblocks via TrySetResult but MonitorAndSynthesizeAsync never gets its completion notification, blocking synthesis for up to 60 minutes (4/5 models)

Fix: Replace the manual clear with a call to CompleteResponse(state) (or extract the shared cleanup into a helper), and add CancelProcessingWatchdog(state) + CancelTurnEndFallback(state) before the InvokeOnUI.


🔴 CRITICAL — MonitorAndSynthesizeAsync clears IsProcessing off the UI thread (3/5 models)

PolyPilot/Services/CopilotService.Organization.cs — MonitorAndSynthesizeAsync

workerState.Info.IsProcessing = false;  // runs on orchestration task thread
Interlocked.Exchange(ref workerState.ActiveToolCallCount, 0);

Per the codebase invariant: "All mutations to state.Info.IsProcessing must be marshaled to the UI thread." This runs on the orchestration background task, racing with watchdog/event handlers. Additionally, ResponseCompletion?.TrySetResult(...) is never called — workers blocked in SendPromptAndWaitAsync awaiting the TCS will never complete, so Task.WhenAll(workerTasks) hangs indefinitely with IsProcessing = false in the UI (2/5 models).

Fix: Wrap in InvokeOnUI(() => { ... }) and add workerState.ResponseCompletion?.TrySetResult(...).


🔴 CRITICAL — ToolHealthCheckTimer not cancelled in session cleanup paths (2/5 models)

PolyPilot/Services/CopilotService.Events.cs / CopilotService.cs

CancelToolHealthCheck is never called in:

  • AbortSessionAsync (cancels watchdog + fallback, but not health check) — stale 30s timer fires on an aborted session; if a new prompt arrives before it fires, the generation guard may not protect it if ProcessingGeneration didn't change
  • ReconnectAsync / DisposeAsync session cleanup loops (cancel watchdog + fallback, not health check)
  • CloseSessionCoreAsync

After a reconnect, old health-check timers hold closures over old SessionState objects and can fire on detached state.

Fix: Add CancelToolHealthCheck(state) to AbortSessionAsync alongside the existing CancelProcessingWatchdog call, and to the cleanup loops in ReconnectAsync/DisposeAsync.


🟡 MODERATE — WatchdogCaseAResets shared between two independent systems (4/5 models)

PolyPilot/Services/CopilotService.Events.cs

Both ScheduleToolHealthCheck's timer callback and RunProcessingWatchdogAsync Case A increment WatchdogCaseAResets independently. The watchdog also reads the counter to select between 600s and 60s escalation timeouts (caseAResets > 0). A single stale health check at t=30s sets caseAResets = 1, permanently switching the watchdog to the 10× shorter timeout for the rest of that turn — even if the connection recovers and events resume. With WatchdogMaxToolAliveResets reduced from 3 to 2, a health check fire + watchdog fire in the same cycle could kill a legitimately running tool prematurely.

Fix: Use a separate field for the health check's stale counter (e.g., ToolHealthStaleChecks) instead of sharing WatchdogCaseAResets.


🟡 MODERATE — WorkerExecutionTimeout 10 min → 60 min regresses remote/demo mode (3/5 models)

PolyPilot/Services/CopilotService.Organization.cs line ~33

ScheduleToolHealthCheck returns early for demo and remote mode. The watchdog's Case A also skips for demo/remote. The 60-minute timeout is the only backstop for workers in these modes. A stuck remote worker now blocks orchestration for a full hour vs. the previous 10 minutes.

Fix: Keep WorkerExecutionTimeout = 60 minutes for non-remote/demo but restore a shorter backstop (e.g., 10 min) for remote/demo mode, or add a separate remote-mode health check mechanism.


🟡 MODERATE — ScheduleToolHealthCheck timer/cancel race window (3/5 models)

PolyPilot/Services/CopilotService.Events.cs — ScheduleToolHealthCheck

The timer is constructed (and starts counting) before Interlocked.Exchange(ref state.ToolHealthCheckTimer, timer) stores it. A concurrent CancelToolHealthCheck (e.g., from ToolExecutionCompleteEvent) runs between construction and storage, finds the field null (just cleared), and becomes a no-op. The new timer is then stored and runs untracked. The activeTools <= 0 guard in the callback prevents false recovery, but the callback still runs and may reschedule recursively.

Fix: Store the timer before the delay starts, or use a CancellationTokenSource pattern instead of Interlocked on Timer?.


🟢 MINOR — Dead code after retry loop in ExecuteWorkerAsync (3/5 models)

PolyPilot/Services/CopilotService.Organization.csExecuteWorkerAsync

With maxRetries = 2, attempt = 2 always falls into the unconditional catch (Exception ex) block and returns. The return new WorkerResult(..., "Max retries exceeded", ...) after the loop is unreachable.
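
The unreachable return is easiest to see in a skeleton of the loop (Python sketch, names hypothetical; the 2s backoff between attempts is elided):

```python
def execute_worker(send, prompt, max_retries=2,
                   is_connection_error=lambda ex: isinstance(ex, IOError)):
    """Skeleton of the retry loop described above. Every iteration either
    returns a result or returns a failure from the unconditional catch,
    so the final attempt always returns inside the loop and control
    never reaches the line after it."""
    for attempt in range(1, max_retries + 1):
        try:
            return send(prompt)
        except Exception as ex:
            if attempt < max_retries and is_connection_error(ex):
                continue  # transient connection error: retry
            return f"worker failed: {ex}"  # final attempt lands here
    return "Max retries exceeded"  # unreachable, as the finding notes
```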


Test Coverage Assessment

The PR adds BuiltInPresets_IncludeSkillValidator (validates preset structure) and updates the preset count test. Missing tests for the new runtime paths:

  • TriggerToolHealthRecovery clearing all IsProcessing companion fields (especially IsResumed, ProcessingPhase, ProcessingStartedAt)
  • CancelToolHealthCheck being called in AbortSessionAsync / DisposeAsync / ReconnectAsync
  • ScheduleToolHealthCheck not firing in demo/remote mode
  • WatchdogCaseAResets counter interaction between health check and watchdog
  • MonitorAndSynthesizeAsync re-dispatch unblocking ResponseCompletion TCS

Summary

| Issue | Severity | Ship-blocker? |
|-------|----------|---------------|
| TriggerToolHealthRecovery incomplete cleanup (missing 7+ fields, watchdog, OnSessionComplete) | 🔴 CRITICAL | Yes — stale state corrupts next turn; orchestrator loops may hang 60 min |
| MonitorAndSynthesizeAsync IsProcessing mutation off UI thread + missing TrySetResult | 🔴 CRITICAL | Yes — race condition + synthesis deadlock |
| ToolHealthCheckTimer not cancelled in abort/dispose/reconnect | 🔴 CRITICAL | Yes — timer fires on orphaned state after reconnect |
| WatchdogCaseAResets shared between health check and watchdog | 🟡 MODERATE | Yes — can prematurely kill legitimately running tools |
| WorkerExecutionTimeout 60 min for remote/demo | 🟡 MODERATE | Yes — hour-long hang regression |
| Timer/cancel race in ScheduleToolHealthCheck | 🟡 MODERATE | No — mitigated by activeTools guard |
| Dead code after retry loop | 🟢 MINOR | No |

⚠️ Request changes — three CRITICAL issues must be addressed before merge. The TriggerToolHealthRecovery cleanup gap is particularly dangerous as it violates the well-documented IsProcessing invariant. The Skill Validator preset itself is clean and uncontroversial.


5-model review: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex. Consensus filter: issues flagged by 2+ models only.

@PureWeen

Round 1 Review

@PureWeen PureWeen marked this pull request as ready for review March 11, 2026 04:24
CRITICAL fixes:
- TriggerToolHealthRecovery: complete INV-1 cleanup (was missing 8+ fields,
  OnSessionComplete, CancelProcessingWatchdog/TurnEndFallback, proper
  FlushedResponse+CurrentResponse for TCS result)
- MonitorAndSynthesizeAsync: wrap IsProcessing mutation in InvokeOnUI with
  TCS synchronization, add ResponseCompletion.TrySetResult, complete all
  companion field cleanup
- CancelToolHealthCheck added to all 14 cleanup paths (AbortSessionAsync,
  ReconnectAsync, DisposeAsync, CloseSessionCoreAsync, CompleteResponse,
  watchdog timeouts)

MODERATE fixes:
- Separate ToolHealthStaleChecks counter from WatchdogCaseAResets to prevent
  cross-system interference
- WorkerExecutionTimeoutRemote (10 min) for remote/demo mode where smart
  watchdog is unavailable; keep 60 min for normal mode
- Fix ScheduleToolHealthCheck timer/cancel race: create dormant, store via
  Interlocked.Exchange, then start with timer.Change()

MINOR: Replace unreachable code after retry loop with UnreachableException

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen

PR #302 — Round 2 Re-Review (post fix commit eedf8e59)

CI: ⚠️ No checks configured | Tests: ✅ 2422 passed, 0 failed

Previous Findings Status

| # | Round 1 Finding | Status | Evidence |
|---|-----------------|--------|----------|
| 🔴 C1 | TriggerToolHealthRecovery INV-1 violation (4 of 12+ fields) | FIXED | Full companion field cleanup added: all 9 fields + CancelProcessingWatchdog + CancelTurnEndFallback + OnSessionComplete + diagnostic log tag |
| 🔴 C2 | MonitorAndSynthesizeAsync background-thread IsProcessing=false | FIXED | Wrapped in InvokeOnUI() with companion field cleanup |
| 🟡 M1 | WatchdogCaseAResets shared counter double-increment | FIXED | New separate counter ToolHealthStaleChecks — no longer shared with WatchdogCaseAResets |
| 🟡 M2 | ToolHealthCheckTimer not cancelled on teardown | FIXED | CancelToolHealthCheck added to all 14 cleanup paths (CompleteResponse, AbortSessionAsync, CloseSessionCoreAsync, DisposeAsync, ReconnectAsync, watchdog timeouts) |
| 🟡 M3 | TCS result drops FlushedResponse content | FIXED | Now uses FlushedResponse + CurrentResponse like CompleteResponse |

New Issues (Round 2)

None found. All 5 Round 1 findings are resolved.

Test Coverage Gap (non-blocking)

Zero unit tests for the new ToolHealthCheckTimer code paths. Suggested additions:

  1. ToolHealthCheck_CompletedTool_DoesNotTriggerRecovery
  2. ToolHealthCheck_StaleEvents_TriggersRecovery
  3. WatchdogMaxToolAliveResets_BoundsMaxStuckTime_WithEscalation
  4. TriggerToolHealthRecovery_NotifiesOnSessionComplete

Verdict: ✅ Approve

All critical and moderate findings from Round 1 have been addressed. Tests pass (2422/2422). The missing test coverage for ToolHealthCheckTimer paths is noted but non-blocking — the existing watchdog tests cover the broader processing-state invariants.

@PureWeen PureWeen merged commit eaba0b1 into main Mar 11, 2026
@PureWeen PureWeen deleted the feat/skill-validator-preset branch March 11, 2026 04:54
PureWeen added a commit that referenced this pull request Mar 12, 2026
…352)

## Problem

The **Evaluate-pr-tests-orchestrator** fails to complete with two
symptoms:
1. Orchestrator loops indefinitely with "0 raw assignments" 
2. Workers return empty responses despite having processed content

### Root Cause 1: Completion Override Infinite Loop
When workers fail/return empty (SDK bug #299), `dispatchedWorkers` stays
empty → `allWorkersDispatched = false` → `[[GROUP_REFLECT_COMPLETE]]` is
overridden → orchestrator re-prompted but says "nothing to delegate" →
repeat until MaxIterations.

### Root Cause 2: Empty Worker Responses  
Workers complete but SDK never sends SessionIdleEvent. Watchdog fires
and completes the session, but `FlushedResponse`/`CurrentResponse` are
empty because content was in delta events that ended up in chat history
but not the response buffers.

## Fixes

### 1. Accept completion when all workers were attempted
Changed both evaluator and self-eval paths: if all workers were
**attempted** (even if some failed/returned empty), accept
`[[GROUP_REFLECT_COMPLETE]]` instead of overriding it. Uses
`allWorkersAttempted` (`attemptedWorkers` set) alongside
`allWorkersDispatched` (`dispatchedWorkers` set).
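
The completion-override rule after the fix can be sketched as a single predicate (Python for illustration; set names from the commit, function shape hypothetical):

```python
def accept_completion(all_workers, dispatched, attempted):
    """Honor [[GROUP_REFLECT_COMPLETE]] once every worker was at least
    *attempted*, even if some failed or returned empty (SDK bug #299),
    instead of only when every worker *dispatched* successfully."""
    return (set(attempted) >= set(all_workers)
            or set(dispatched) >= set(all_workers))
```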

### 2. Chat history fallback for empty responses
When a worker returns empty after completion + revival, extract the last
assistant message from chat history as a fallback. This recovers content
that was streamed via delta events but lost when the watchdog completed
the session with empty response buffers.
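
The fallback read can be sketched as (Python for illustration; buffer and field names mirror the description above but the function is hypothetical):

```python
def worker_result(flushed, current, chat_history):
    """Read the response buffers; if both are empty — content was streamed
    via delta events but never flushed — recover the last assistant
    message from chat history instead."""
    text = (flushed + current).strip()
    if text:
        return text
    for msg in reversed(chat_history):
        if msg.get("role") == "assistant" and msg.get("content"):
            return msg["content"]
    return ""
```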

## Testing
- All 2422 tests pass
- Built and relaunched successfully

## Related
- SDK bug #299 (missing SessionIdleEvent) — upstream issue causing
workers to appear stuck
- PR #302 — smart watchdog, stagger dispatch, IOException retry (merged)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
…ureWeen#352)
