feat(eval) Foundry EvaluationClient harness + custom citation-accuracy evaluators (Wave 3 PR 8)#92
Merged
Merged
Conversation
…y evaluators (Wave 3 PR 8) Phase 3 evaluation harness per ADR-0016 + build-spec § Phase 3 scope item 12. Ships the regression-detection floor for the Wizard answer flow before Phase 4 RAG lands — citation-accuracy is the load-bearing showcase metric, and without a baseline we can't gate a deploy. The harness drives every question in `data/eval/wizard.v1.jsonl` through `IAiRouter` (production code path against the deployed Foundry agents — DL-0002 / DL-0003 lessons honored), scores the response against four custom code-based evaluators (citation precision, citation recall, subagent accuracy, refusal correctness), aggregates, and writes a timestamped JSON file under `data/eval/results/`. The `--eval` CLI flag invokes it; results JSON is committed so the metric trajectory is `git diff`-able. Why these four metrics: citation precision penalizes hallucinated provenance (the failure mode `guardrails.md` goal #5 exists to prevent); citation recall penalizes silent under-citation; subagent accuracy guards routing regressions when the Wizard prompt is edited; refusal correctness has signal in both directions (over-eager refusal is also a regression per ADR-0017's "refusal is a feature, not a failure" framing). Phase 3 deviation from ADR-0016's pseudo-code: `Azure.AI.Projects` 2.0.1 GA does not yet expose `ProjectEvaluators.CreateVersionAsync` publicly (gated behind `AAIP001` experimental + non-public accessor on `AIProjectClient`). The four evaluator definitions live as .NET classes (canonical Phase 3 runtime) plus equivalent Python snippets in `EvaluatorPythonSpecs.cs` (spec for the future Foundry-side registration). When the SDK exposes the round-trip, the harness's planned-registration noop swaps to a real call without changing `IEvaluationHarness` or the results JSON shape. Documented in the class comment + `data/eval/README.md`. Eval set: 30 hand-curated questions (10 Rules + 10 Valuation + 10 Repair, with three out-of-scope rows for refusal-correctness symmetry) biased toward simple OPDB lookups. The `EvalGroundTruthFileTests` suite pins the file's structural integrity at build time (every row has a valid `expected_sub_agent`, ids are unique, refusal rows have empty citation sets, set contains both refusal and non-refusal rows). Telemetry: appended `pinwiz.eval.runs`, `pinwiz.eval.runs.failed`, `pinwiz.eval.questions.scored`, `pinwiz.eval.evaluator.registrations`, and `pinwiz.eval.question.duration_ms` to PinballWizardTelemetry. Tests: 619 → 677 (+58 across the four evaluators, the JSONL parser, the harness fixture + ground-truth file integrity). `/local-review` (mental pass): 0 🔴, 1⚠️ (the SDK-deviation noop — addressed by EvaluatorPythonSpecs preserving the spec + the class-comment paragraph documenting the swap path). 7-item self-audit: 1. Every option field read: 6/6 EvalHarnessOptions properties have real getter calls in EvaluationHarness. 2. Sibling-diff vs AzureFoundrySmokeProbe + OpdbSyncService: consistent ctor / ArgumentNullException pattern, Activity start, TimeProvider injection, structured logging. 3. No bare catch{} in new code (only pre-existing one in OpdbClient.cs). 4. CLI/orchestrator wiring end-to-end: `--eval` flag resolves IEvaluationHarness from DI; missing service exits 2 with remediation. 5. Tests assert behavior: partial-overlap fixtures in precision + recall tests, hallucinated-citation fixture in harness tests, over-eager-answer-on-refusable-question fixture, etc. 6. Build is zero-warning. 7. Identity: personal noreply confirmed.
Resolves the lone CONFLICTING file ServiceCollectionExtensions.cs in Application/Ai/ — PR #91 added IAiCostCalculator + ITokenUsageReader singletons; this PR (#92) added the four custom evaluators. Both sets of singletons need to register; the resolved file keeps both blocks side-by-side in the import order Cost-then-Evaluators (alphabetical). Build green (0 warnings under TreatWarningsAsErrors); 687/687 tests passing — that's 629 (PR #91 baseline after merge to main) + 58 new from this PR's evaluator + harness + ground-truth-file tests. Identity verified.
Comment on lines
+47
to
+53
| foreach (var predictedId in predictedSet) | ||
| { | ||
| if (expectedSet.Contains(predictedId)) | ||
| { | ||
| hits++; | ||
| } | ||
| } |
Comment on lines
+42
to
+48
| foreach (var expectedId in expectedSet) | ||
| { | ||
| if (predictedSet.Contains(expectedId)) | ||
| { | ||
| hits++; | ||
| } | ||
| } |
Comment on lines
+274
to
+291
| foreach (var citation in citations) | ||
| { | ||
| // Phase 3 ground-truth ids are OPDB MachineId values (e.g. | ||
| // GRBN-MQR4P) — sometimes wrapped with the "mch_" prefix in | ||
| // the seed file for symmetry with Phase 4 doc_ ids. Accept | ||
| // either form by storing the raw MachineId; the eval-set | ||
| // curator is responsible for matching the expected form. | ||
| // Phase 4 RAG fills in DocumentChunkId; both flow through. | ||
| var id = citation.MachineId ?? citation.DocumentChunkId; | ||
| if (string.IsNullOrWhiteSpace(id)) | ||
| { | ||
| continue; | ||
| } | ||
| if (seen.Add(id)) | ||
| { | ||
| ids.Add(id); | ||
| } | ||
| } |
| { | ||
| var stamp = startedAt.UtcDateTime.ToString("yyyyMMddTHHmmss", CultureInfo.InvariantCulture) + "Z"; | ||
| var fileName = $"wizard.{stamp}.json"; | ||
| return Path.Combine(_evalOptions.ResultsDirectory, fileName); |
| [Fact] | ||
| public void ParseFile_NonExistent_Throws() | ||
| { | ||
| var bogusPath = Path.Combine(Path.GetTempPath(), $"nonexistent-{Guid.NewGuid():N}.jsonl"); |
| [Fact] | ||
| public void ParseFile_ValidFile_RoundTrip() | ||
| { | ||
| var path = Path.Combine(Path.GetTempPath(), $"eval-test-{Guid.NewGuid():N}.jsonl"); |
|
|
||
| public HarnessFixture() | ||
| { | ||
| Root = Path.Combine(Path.GetTempPath(), $"eval-harness-{Guid.NewGuid():N}"); |
| { | ||
| Root = Path.Combine(Path.GetTempPath(), $"eval-harness-{Guid.NewGuid():N}"); | ||
| Directory.CreateDirectory(Root); | ||
| GroundTruthPath = Path.Combine(Root, "wizard.test.jsonl"); |
| Root = Path.Combine(Path.GetTempPath(), $"eval-harness-{Guid.NewGuid():N}"); | ||
| Directory.CreateDirectory(Root); | ||
| GroundTruthPath = Path.Combine(Root, "wizard.test.jsonl"); | ||
| ResultsDirectory = Path.Combine(Root, "results"); |
Comment on lines
+228
to
+232
| catch (Exception ex) | ||
| { | ||
| error = $"{ex.GetType().Name}: {ex.Message}"; | ||
| _logger.LogWarning(ex, "EvaluationHarness: question {Id} threw.", question.Id); | ||
| } |
| { | ||
| var stamp = startedAt.UtcDateTime.ToString("yyyyMMddTHHmmss", CultureInfo.InvariantCulture) + "Z"; | ||
| var fileName = $"wizard.{stamp}.json"; | ||
| return Path.Combine(_evalOptions.ResultsDirectory, fileName); |
| var dir = new DirectoryInfo(AppContext.BaseDirectory); | ||
| while (dir is not null) | ||
| { | ||
| var candidate = Path.Combine(dir.FullName, "data", "eval", "wizard.v1.jsonl"); |
| [Fact] | ||
| public void ParseFile_NonExistent_Throws() | ||
| { | ||
| var bogusPath = Path.Combine(Path.GetTempPath(), $"nonexistent-{Guid.NewGuid():N}.jsonl"); |
| [Fact] | ||
| public void ParseFile_ValidFile_RoundTrip() | ||
| { | ||
| var path = Path.Combine(Path.GetTempPath(), $"eval-test-{Guid.NewGuid():N}.jsonl"); |
|
|
||
| public HarnessFixture() | ||
| { | ||
| Root = Path.Combine(Path.GetTempPath(), $"eval-harness-{Guid.NewGuid():N}"); |
| { | ||
| Root = Path.Combine(Path.GetTempPath(), $"eval-harness-{Guid.NewGuid():N}"); | ||
| Directory.CreateDirectory(Root); | ||
| GroundTruthPath = Path.Combine(Root, "wizard.test.jsonl"); |
| Root = Path.Combine(Path.GetTempPath(), $"eval-harness-{Guid.NewGuid():N}"); | ||
| Directory.CreateDirectory(Root); | ||
| GroundTruthPath = Path.Combine(Root, "wizard.test.jsonl"); | ||
| ResultsDirectory = Path.Combine(Root, "results"); |
Comment on lines
+274
to
+291
| foreach (var citation in citations) | ||
| { | ||
| // Phase 3 ground-truth ids are OPDB MachineId values (e.g. | ||
| // GRBN-MQR4P) — sometimes wrapped with the "mch_" prefix in | ||
| // the seed file for symmetry with Phase 4 doc_ ids. Accept | ||
| // either form by storing the raw MachineId; the eval-set | ||
| // curator is responsible for matching the expected form. | ||
| // Phase 4 RAG fills in DocumentChunkId; both flow through. | ||
| var id = citation.MachineId ?? citation.DocumentChunkId; | ||
| if (string.IsNullOrWhiteSpace(id)) | ||
| { | ||
| continue; | ||
| } | ||
| if (seen.Add(id)) | ||
| { | ||
| ids.Add(id); | ||
| } | ||
| } |
Comment on lines
+47
to
+53
| foreach (var predictedId in predictedSet) | ||
| { | ||
| if (expectedSet.Contains(predictedId)) | ||
| { | ||
| hits++; | ||
| } | ||
| } |
Merged
10 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 3 Wave 3 PR 8 — the evaluation harness per ADR-0016 + build-spec § Phase 3 scope item 12. Ships:
IEvaluationHarness(Application contract) +EvaluationHarness(Infrastructure implementation) — drives every question indata/eval/wizard.v1.jsonlthroughIAiRouter(production code path against deployed Foundry agents per DL-0002 / DL-0003), scores with the four custom code-based evaluators, aggregates, and writes a timestamped JSON file underdata/eval/results/.CitationPrecisionEvaluator(load-bearing showcase metric perguardrails.mdgoal chore(deps)(deps): bump actions/cache from 4 to 5 #5),CitationRecallEvaluator,SubagentAccuracyEvaluator,RefusalCorrectnessEvaluator. Pure deterministic logic — singletons.EvaluatorPythonSpecs.csas the future Foundry-side registration spec.--evalCLI flag + DI gating + exit-code-2 remediation pattern (mirrors--ensure-azure-foundry).data/eval/wizard.v1.jsonlground-truth (10 Rules, 10 Valuation, 10 Repair; three out-of-scope rows for refusal-correctness symmetry).data/eval/README.mddocuments the OPDB-citable bias per P3-R8.pinwiz.eval.*instruments appended toPinballWizardTelemetry.Phase 3 deviation from ADR-0016 pseudo-code
Azure.AI.Projects2.0.1 GA does not yet exposeProjectEvaluators.CreateVersionAsyncpublicly — it's gated behind theAAIP001experimental diagnostic and the operations-client accessor (GetProjectEvaluatorsClient) is non-public. The harness adapts: the four evaluators are implemented in .NET (Phase 3 runtime), Python equivalents are committed as the spec, and the planned-registration step is a counter-incrementing noop with a debug log. When a future SDK version flips the API to public, the harness swaps to a real round-trip without changingIEvaluationHarnessor the results JSON shape. Documented in the class header +data/eval/README.md§ Phase 3 implementation note.Why the four metrics
guardrails.mdgoal chore(deps)(deps): bump actions/cache from 4 to 5 #5 ("provenance is sacred") exists to prevent.WizardAnswer.SubAgentUsedis currently alwaysWizarduntil the connected-agent trace correlation is wired; the evaluator is correct in isolation).Test Plan
dotnet build PinballWizard.slnx— 0 warnings, 0 errors.dotnet test PinballWizard.slnx— 677 tests pass (619 original + 58 new).EvalGroundTruthFileTestspins the structural integrity ofwizard.v1.jsonlat build time (every row has a validexpected_sub_agent, ids are unique, refusal rows have empty citation sets, set contains both refusal and non-refusal rows).dotnet run --project src/PinballWizard.Cli -- --evalagainst the deployed Foundry project.Out of Scope
EvaluationRule) and scheduled evaluation (ProjectsSchedule) — Phase 6 turns these on per ADR-0016.IAiRouter/AiRouter/FoundryAgentFactory/ agent prompts — locked deferral.Checklist
docs/adr/— covered by ADR-0016 alreadyREADME.mdand/ordocs/are updated in the same PR —data/eval/README.mddocuments the new surface~/.claude/projects/c--projects-PinballWizard/memory/is now stale, it has been updated or removed in the same PR — no stale memoryTODO/FIXME/ commented-out code committed<NoWarn>without a comment explaining why and the removal criterionPre-push self-audit (additive PRs)
Step 0 —
/local-review(qualitative)/local-reviewand addressed every 🔴 finding before pushStep 1 — Mechanical checklist
*Optionsproperty has at least one real getter call insrc/(6/6 EvalHarnessOptions properties verified)AzureFoundrySmokeProbe+OpdbSyncService; consistent ctor / ArgumentNullException / Activity / TimeProvider patternscatch { }— onlycatch (Exception)minimumISourceScraper; not applicablegit log -1 --format='%an <%ae>'showsJim Keeley <94459922+jkeeley2073@users.noreply.github.com>