docs(spec) Phase 3 closeout — retrospective + Phase 4 inherited follow-ups + locked decisions ✅ by jkeeley2073 · Pull Request #93 · Early-Bird-Solutions-LLC/PinballWizard

jkeeley2073 · 2026-05-07T16:37:51Z

Summary

Wave 4 PR 9 — closes Phase 3. Per build-spec scope item 14 + guardrails.md § Per-phase gate. Pure spec/docs/memory closeout — no code changes.

H2 outcome

Baseline run completed 2026-05-07T16:25Z; results JSON committed at data/eval/results/wizard.20260507T162529Z.json:

Metric	Value
`citation_precision_mean`	0.133
`citation_recall_mean`	0.133
`subagent_accuracy_mean`	0.033
`refusal_correctness_mean`	0.300

AiFoundryOptions.ConfidenceThreshold stays at ADR-0017's draft 0.65 — NOT moved. Calibration deferred to Phase 4 once the upstream gaps are fixed. Decision-log entry DL 2026-05-07 — Phase 3 H2 baseline + threshold-not-moved records the rationale.

Why the metrics are floored — and why that's fine for closing Phase 3

H2 surfaced two upstream gaps that floor every metric:

Connected-agents dispatch is non-functional. Wizard.md instructs the Wizard to dispatch to Valuation/Rules/Repair, but FoundryAgentFactory constructs all four agents as standalone AIAgent instances with only getMachineByTitle attached — no actual sub-agent dispatch wiring exists. The Wizard either calls the function tool directly (and answers as itself) OR refuses with the agent's own OutOfScope text. Calibrating against this floor would tune for the gap, not the steady-state.
Eval ground-truth OPDB IDs aren't verified against deployed Cosmos. PR 8's subagent curated plausible OPDB-format IDs from machine titles, but the deployed catalog has different actual IDs. When the agent successfully calls getMachineByTitle("Godzilla") it gets the catalog's record (e.g., a Sega 1998 entry instead of the Stern 2021 entry the ground-truth expected); citation_precision / citation_recall score 0 on a structurally-correct lookup.

Both are Phase 4 first-scope items — fix them, re-run --eval, then calibrate. The H2 baseline IS the v1 regression-detection floor as ADR-0016 specified — any Phase 4 number above 0.133/0.133/0.033/0.300 is real improvement.

What ships in this PR

`docs/build-spec.md`

Phase 3 status flipped to ✅ Complete (master timeline + Phase 3 header)
Phase 3 § Retrospective populated — 10 PRs across 4 waves, test count 566 → 687, five lessons documented, H1 / H2 / H3 outcomes
Phase 4 § Inherited Phase 3 follow-ups added — five items rolled forward as Phase 4 first scope items

`docs/guardrails.md` § Locked decisions

Four new locked decisions:

Microsoft Foundry orchestration (ADR-0014)
Per-AIAgent model selection + LRU cache + cost ceiling (ADR-0015)
Confidence-threshold refusal mandatory (ADR-0017)
Code-resource agent definitions (ADR-0018)

`docs/decision-log.md`

New entry DL 2026-05-07 — Phase 3 H2 eval baseline + ConfidenceThreshold stays at 0.65.

`CLAUDE.md`

Test count 566 → 687
13 ADRs → 18 ADRs in documentation map
Locked invariants 1-8 → 1-12 (4 new AI-architecture invariants → ADR pointers)
Freshest handoff pointer → session_handoff_2026_05_07_phase3_close.md

Memory

New: session_handoff_2026_05_07_phase3_close.md (10 PRs summary, hand-off outcomes, 5 lessons, repo state)
MEMORY.md index points at it

Eval baseline

data/eval/results/wizard.20260507T162529Z.json committed as the v1 reference

Phase 3 by the numbers

Metric	Phase 2 exit	Phase 3 exit	Δ
Tests	566	687	+121
ADRs	13 (0001–0013)	18 (0001–0018)	+5
PRs in phase	n/a	10	—
Idle Azure cost	~$30/mo	~$120–150/mo	+$90–120/mo (Foundry + ACR + KV + Storage + App Insights; AI Search deferred)

Audit posture

Build green: 0 warnings, 0 errors under TreatWarningsAsErrors
Tests green: 687/687 (no code changes; locked from PR feat(eval) Foundry EvaluationClient harness + custom citation-accuracy evaluators (Wave 3 PR 8) #92 merge)
Identity check ✅ (94459922+jkeeley2073@users.noreply.github.com)
Doc-only PR — eligible for the /local-review carve-out per guardrails.md § Per-PR gate
Phase 3 § Retrospective covers all five lessons + all three hand-off outcomes per the per-phase-gate template
Locked decisions list updated in same PR per guardrails.md § Spec maintenance
Memory handoff written so the next session can resume cleanly

What this PR does NOT do (intentional)

Fix any of the five Phase 3 lessons — those are Phase 4 work, scoped + prioritized in the build-spec
Modify ADR-0017 — threshold stays at 0.65; ADR-0017 is unchanged. The decision-log entry captures the calibration-deferred rationale separately
Bump Microsoft.Agents.AI versions — 1.4.0 GA is the current pin; future bumps land when the SDK exposes Usage / connected-agents primitives the inherited follow-ups need

After merge

Phase 3 is done. The next session should start with:

Phase 4 design conversation (fresh context window) — sequence the five inherited follow-ups, then add Phase 4 RAG-specific items (Cosmos Change Feed Function, PdfPig text extraction, page-aware chunking strategy, AI Search index population, semantic-ranker config)
H3 (Pinball Map live-API probe) at operator convenience — not a Phase 3 blocker; runnable any time via PINBALL_WIZARD_LIVE_CONTRACT_TESTS=1
Cost burn-rate review at end of June 2026 — first full month with Foundry idle cost; calibration moment for the $300/mo anomaly alarm

Test plan

…w-ups + locked decisions ✅ Wave 4 PR 9 of Phase 3 — closes the phase. Per build-spec scope item 14 + guardrails.md § Per-phase gate. H2 baseline run completed 2026-05-07T16:25Z; results JSON committed at data/eval/results/wizard.20260507T162529Z.json. Aggregate metrics: citation_precision = 0.133 citation_recall = 0.133 subagent_accuracy = 0.033 refusal_correctness = 0.300 ConfidenceThreshold stays at ADR-0017's draft 0.65 (NOT moved). Calibration deferred to Phase 4 once the upstream gaps that floored the H2 baseline are fixed. DL 2026-05-07 records the rationale. Five Phase 3 lessons documented in build-spec § Phase 3 § Retrospective + rolled forward as Phase 4 § Inherited Phase 3 follow-ups (priority order): 1. Connected-agents dispatch is non-functional — Wizard refuses in-scope questions because FoundryAgentFactory wires only getMachineByTitle, not the sub-agents. Phase 4 first scope item. 2. Eval ground-truth OPDB IDs need verification against deployed Cosmos catalog — predicted ≠ expected on most successful lookups. Phase 4 second scope item. 3. Replace OPDB-URL regex citation extraction with proper tool-call trace inspection (Phase 4). 4. Replace NullTokenUsageReader with a real impl when Microsoft.Agents.AI exposes Usage on AgentResponse (microsoft/agent-framework#2688). 5. Read WizardAnswer.SubAgentUsed from connected-agents trace correlation (lands alongside item 1). Locked decisions added to guardrails.md § Locked decisions: - Microsoft Foundry orchestration (ADR-0014) - Per-AIAgent model selection + LRU cache + cost ceiling (ADR-0015) - Confidence-threshold refusal mandatory (ADR-0017) - Code-resource agent definitions (ADR-0018) CLAUDE.md updates: - Test count 566 → 687 - 13 ADRs → 18 ADRs in documentation map - Locked invariants 1-8 → 1-12 (added 4 AI-architecture invariants pointing at ADRs 0014/0015/0017/0018) - Freshest handoff pointer → session_handoff_2026_05_07_phase3_close.md Memory updates: - New: session_handoff_2026_05_07_phase3_close.md - MEMORY.md index points at it Phase 3 status table flipped Phase 3 to ✅ Complete. H1 (deploy + smoke probe) ✅ done 2026-05-07. H2 (eval baseline) ✅ done 2026-05-07. H3 (Pinball Map live-API probe) ⏳ deferred to operator availability; not on the critical path for Phase 3 close. Build green, 687 tests passing, identity verified, no code change in this PR — pure spec/docs/memory closeout.

jkeeley2073 added the claude-code Generated with Claude Code label May 7, 2026

jkeeley2073 merged commit 6572bdb into main May 7, 2026
3 of 4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(spec) Phase 3 closeout — retrospective + Phase 4 inherited follow-ups + locked decisions ✅#93

docs(spec) Phase 3 closeout — retrospective + Phase 4 inherited follow-ups + locked decisions ✅#93
jkeeley2073 merged 1 commit into
mainfrom
Dev-Phase3Closeout

jkeeley2073 commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jkeeley2073 commented May 7, 2026

Summary

H2 outcome

Why the metrics are floored — and why that's fine for closing Phase 3

What ships in this PR

docs/build-spec.md

docs/guardrails.md § Locked decisions

docs/decision-log.md

CLAUDE.md

Memory

Eval baseline

Phase 3 by the numbers

Audit posture

What this PR does NOT do (intentional)

After merge

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`docs/build-spec.md`

`docs/guardrails.md` § Locked decisions

`docs/decision-log.md`

`CLAUDE.md`