docs(spec) Phase 3 closeout — retrospective + Phase 4 inherited follow-ups + locked decisions ✅#93
Merged
Merged
Conversation
…w-ups + locked decisions ✅
Wave 4 PR 9 of Phase 3 — closes the phase. Per build-spec scope item 14
+ guardrails.md § Per-phase gate.
H2 baseline run completed 2026-05-07T16:25Z; results JSON committed at
data/eval/results/wizard.20260507T162529Z.json. Aggregate metrics:
citation_precision = 0.133
citation_recall = 0.133
subagent_accuracy = 0.033
refusal_correctness = 0.300
ConfidenceThreshold stays at ADR-0017's draft 0.65 (NOT moved).
Calibration deferred to Phase 4 once the upstream gaps that floored
the H2 baseline are fixed. DL 2026-05-07 records the rationale.
Five Phase 3 lessons documented in build-spec § Phase 3 § Retrospective
+ rolled forward as Phase 4 § Inherited Phase 3 follow-ups (priority
order):
1. Connected-agents dispatch is non-functional — Wizard refuses
in-scope questions because FoundryAgentFactory wires only
getMachineByTitle, not the sub-agents. Phase 4 first scope item.
2. Eval ground-truth OPDB IDs need verification against deployed
Cosmos catalog — predicted ≠ expected on most successful lookups.
Phase 4 second scope item.
3. Replace OPDB-URL regex citation extraction with proper tool-call
trace inspection (Phase 4).
4. Replace NullTokenUsageReader with a real impl when Microsoft.Agents.AI
exposes Usage on AgentResponse (microsoft/agent-framework#2688).
5. Read WizardAnswer.SubAgentUsed from connected-agents trace
correlation (lands alongside item 1).
Locked decisions added to guardrails.md § Locked decisions:
- Microsoft Foundry orchestration (ADR-0014)
- Per-AIAgent model selection + LRU cache + cost ceiling (ADR-0015)
- Confidence-threshold refusal mandatory (ADR-0017)
- Code-resource agent definitions (ADR-0018)
CLAUDE.md updates:
- Test count 566 → 687
- 13 ADRs → 18 ADRs in documentation map
- Locked invariants 1-8 → 1-12 (added 4 AI-architecture invariants
pointing at ADRs 0014/0015/0017/0018)
- Freshest handoff pointer → session_handoff_2026_05_07_phase3_close.md
Memory updates:
- New: session_handoff_2026_05_07_phase3_close.md
- MEMORY.md index points at it
Phase 3 status table flipped Phase 3 to ✅ Complete.
H1 (deploy + smoke probe) ✅ done 2026-05-07.
H2 (eval baseline) ✅ done 2026-05-07.
H3 (Pinball Map live-API probe) ⏳ deferred to operator availability;
not on the critical path for Phase 3 close.
Build green, 687 tests passing, identity verified, no code change in
this PR — pure spec/docs/memory closeout.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wave 4 PR 9 — closes Phase 3. Per build-spec scope item 14 + guardrails.md § Per-phase gate. Pure spec/docs/memory closeout — no code changes.
H2 outcome
Baseline run completed 2026-05-07T16:25Z; results JSON committed at
data/eval/results/wizard.20260507T162529Z.json:citation_precision_meancitation_recall_meansubagent_accuracy_meanrefusal_correctness_meanAiFoundryOptions.ConfidenceThresholdstays at ADR-0017's draft 0.65 — NOT moved. Calibration deferred to Phase 4 once the upstream gaps are fixed. Decision-log entry DL 2026-05-07 — Phase 3 H2 baseline + threshold-not-moved records the rationale.Why the metrics are floored — and why that's fine for closing Phase 3
H2 surfaced two upstream gaps that floor every metric:
Wizard.mdinstructs the Wizard to dispatch to Valuation/Rules/Repair, butFoundryAgentFactoryconstructs all four agents as standaloneAIAgentinstances with onlygetMachineByTitleattached — no actual sub-agent dispatch wiring exists. The Wizard either calls the function tool directly (and answers as itself) OR refuses with the agent's own OutOfScope text. Calibrating against this floor would tune for the gap, not the steady-state.getMachineByTitle("Godzilla")it gets the catalog's record (e.g., a Sega 1998 entry instead of the Stern 2021 entry the ground-truth expected); citation_precision / citation_recall score 0 on a structurally-correct lookup.Both are Phase 4 first-scope items — fix them, re-run
--eval, then calibrate. The H2 baseline IS the v1 regression-detection floor as ADR-0016 specified — any Phase 4 number above 0.133/0.133/0.033/0.300 is real improvement.What ships in this PR
docs/build-spec.mddocs/guardrails.md§ Locked decisionsFour new locked decisions:
AIAgentmodel selection + LRU cache + cost ceiling (ADR-0015)docs/decision-log.mdNew entry DL 2026-05-07 — Phase 3 H2 eval baseline + ConfidenceThreshold stays at 0.65.
CLAUDE.mdsession_handoff_2026_05_07_phase3_close.mdMemory
session_handoff_2026_05_07_phase3_close.md(10 PRs summary, hand-off outcomes, 5 lessons, repo state)MEMORY.mdindex points at itEval baseline
data/eval/results/wizard.20260507T162529Z.jsoncommitted as the v1 referencePhase 3 by the numbers
Audit posture
TreatWarningsAsErrors94459922+jkeeley2073@users.noreply.github.com)/local-reviewcarve-out per guardrails.md § Per-PR gateWhat this PR does NOT do (intentional)
Microsoft.Agents.AIversions — 1.4.0 GA is the current pin; future bumps land when the SDK exposes Usage / connected-agents primitives the inherited follow-ups needAfter merge
Phase 3 is done. The next session should start with:
PINBALL_WIZARD_LIVE_CONTRACT_TESTS=1Test plan