docs(spec) Phase 3 closeout — retrospective + Phase 4 inherited follow-ups + locked decisions ✅#93

Merged
jkeeley2073 merged 1 commit into main from Dev-Phase3Closeout
May 7, 2026

@jkeeley2073
Contributor

Summary

Wave 4 PR 9 — closes Phase 3. Per build-spec scope item 14 + guardrails.md § Per-phase gate. Pure spec/docs/memory closeout — no code changes.

H2 outcome

Baseline run completed 2026-05-07T16:25Z; results JSON committed at data/eval/results/wizard.20260507T162529Z.json:

| Metric | Value |
| --- | --- |
| citation_precision_mean | 0.133 |
| citation_recall_mean | 0.133 |
| subagent_accuracy_mean | 0.033 |
| refusal_correctness_mean | 0.300 |

AiFoundryOptions.ConfidenceThreshold stays at ADR-0017's draft 0.65 — NOT moved. Calibration deferred to Phase 4 once the upstream gaps are fixed. Decision-log entry DL 2026-05-07 — Phase 3 H2 baseline + threshold-not-moved records the rationale.
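
For readers unfamiliar with the mechanism, a confidence-threshold refusal gate can be sketched roughly as follows. This is an illustration only, not the project's code: the names `gate_answer`, `GatedAnswer` and `OUT_OF_SCOPE` are hypothetical, and the sketch assumes the agent yields a single confidence score in [0, 1] and that a score exactly at the threshold passes. Only the 0.65 draft value comes from the text above.

```python
from dataclasses import dataclass

DRAFT_CONFIDENCE_THRESHOLD = 0.65  # draft value cited from ADR-0017
OUT_OF_SCOPE = "I can't answer that confidently."  # hypothetical refusal text


@dataclass(frozen=True)
class GatedAnswer:
    text: str
    refused: bool


def gate_answer(answer: str, confidence: float,
                threshold: float = DRAFT_CONFIDENCE_THRESHOLD) -> GatedAnswer:
    """Return the answer, or a refusal when confidence is below the threshold.

    Assumptions: confidence is a score in [0, 1], and a score exactly equal
    to the threshold is allowed through.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < threshold:
        return GatedAnswer(text=OUT_OF_SCOPE, refused=True)
    return GatedAnswer(text=answer, refused=False)
```

Moving the threshold changes which answers are refused, which is why calibrating it against a baseline dominated by upstream gaps would mostly measure those gaps.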

Why the metrics are floored — and why that's fine for closing Phase 3

H2 surfaced two upstream gaps that floor every metric:

  1. Connected-agents dispatch is non-functional. Wizard.md instructs the Wizard to dispatch to Valuation/Rules/Repair, but FoundryAgentFactory constructs all four agents as standalone AIAgent instances with only getMachineByTitle attached, so no sub-agent dispatch wiring exists. The Wizard either calls the function tool directly (and answers as itself) or refuses with the agent's own OutOfScope text. Calibrating against this floor would tune the threshold to the gap rather than to steady-state behavior.
  2. Eval ground-truth OPDB IDs aren't verified against deployed Cosmos. PR 8's subagent curated plausible OPDB-format IDs from machine titles, but the deployed catalog uses different IDs. When the agent successfully calls getMachineByTitle("Godzilla"), it gets the catalog's record (e.g., a Sega 1998 entry instead of the Stern 2021 entry the ground truth expected), so citation_precision and citation_recall score 0 on a structurally correct lookup.
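
To make the second gap concrete, here is a minimal sketch of set-based citation scoring. It assumes precision and recall are computed per eval case over sets of cited IDs, which may differ from the project's actual evaluators. The function name and the two IDs are placeholders, not real OPDB IDs.

```python
from __future__ import annotations


def citation_scores(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Set-based (precision, recall) over cited IDs.

    Assumption: an empty denominator scores 0.0 instead of raising.
    """
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Placeholder IDs for illustration only.
expected = {"PLACEHOLDER-STERN-2021"}   # what the ground truth lists
predicted = {"PLACEHOLDER-SEGA-1998"}   # what the deployed catalog returned

mismatch = citation_scores(predicted, expected)  # lookup worked, IDs disagree
match = citation_scores(expected, expected)      # IDs agree
```

Under these assumptions, a lookup that succeeds but returns a differently numbered record scores zero on both metrics, the same as a failed lookup.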

Both are Phase 4 first-scope items — fix them, re-run --eval, then calibrate. The H2 baseline IS the v1 regression-detection floor as ADR-0016 specified — any Phase 4 number above 0.133/0.133/0.033/0.300 is real improvement.
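
Using the baseline as a regression floor could look something like the sketch below. It assumes a later run can be reduced to a flat mapping keyed by the four metric names in the table above; the actual results JSON schema is not shown in this PR, and `regressions` is a hypothetical helper.

```python
from __future__ import annotations

# H2 baseline values as reported in this PR.
H2_FLOOR = {
    "citation_precision_mean": 0.133,
    "citation_recall_mean": 0.133,
    "subagent_accuracy_mean": 0.033,
    "refusal_correctness_mean": 0.300,
}


def regressions(run: dict[str, float],
                floor: dict[str, float] = H2_FLOOR,
                tolerance: float = 1e-9) -> list[str]:
    """Return the sorted names of metrics missing from the run or below the floor."""
    return sorted(
        name for name, base in floor.items()
        if run.get(name, float("-inf")) < base - tolerance
    )
```

An empty list means no metric fell below the H2 floor. A missing metric is treated as a regression so that a silently dropped evaluator does not pass unnoticed.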

What ships in this PR

docs/build-spec.md

  • Phase 3 status flipped to ✅ Complete (master timeline + Phase 3 header)
  • Phase 3 § Retrospective populated — 10 PRs across 4 waves, test count 566 → 687, five lessons documented, H1 / H2 / H3 outcomes
  • Phase 4 § Inherited Phase 3 follow-ups added — five items rolled forward as Phase 4 first scope items

docs/guardrails.md § Locked decisions

Four new locked decisions:

  • Microsoft Foundry orchestration (ADR-0014)
  • Per-AIAgent model selection + LRU cache + cost ceiling (ADR-0015)
  • Confidence-threshold refusal mandatory (ADR-0017)
  • Code-resource agent definitions (ADR-0018)

docs/decision-log.md

New entry DL 2026-05-07 — Phase 3 H2 eval baseline + ConfidenceThreshold stays at 0.65.

CLAUDE.md

  • Test count 566 → 687
  • 13 ADRs → 18 ADRs in documentation map
  • Locked invariants 1-8 → 1-12 (4 new AI-architecture invariants → ADR pointers)
  • Freshest handoff pointer → session_handoff_2026_05_07_phase3_close.md

Memory

  • New: session_handoff_2026_05_07_phase3_close.md
  • MEMORY.md index points at it

Eval baseline

Phase 3 by the numbers

| Metric | Phase 2 exit | Phase 3 exit | Δ |
| --- | --- | --- | --- |
| Tests | 566 | 687 | +121 |
| ADRs | 13 (0001–0013) | 18 (0001–0018) | +5 |
| PRs in phase | n/a | 10 | |
| Idle Azure cost | ~$30/mo | ~$120–150/mo | +$90–120/mo (Foundry + ACR + KV + Storage + App Insights; AI Search deferred) |

Audit posture

What this PR does NOT do (intentional)

  • Fix any of the five Phase 3 lessons — those are Phase 4 work, scoped + prioritized in the build-spec
  • Modify ADR-0017 — threshold stays at 0.65; ADR-0017 is unchanged. The decision-log entry captures the calibration-deferred rationale separately
  • Bump Microsoft.Agents.AI versions — 1.4.0 GA is the current pin; future bumps land when the SDK exposes Usage / connected-agents primitives the inherited follow-ups need

After merge

Phase 3 is done. The next session should start with:

  1. Phase 4 design conversation (fresh context window) — sequence the five inherited follow-ups, then add Phase 4 RAG-specific items (Cosmos Change Feed Function, PdfPig text extraction, page-aware chunking strategy, AI Search index population, semantic-ranker config)
  2. H3 (Pinball Map live-API probe) at operator convenience — not a Phase 3 blocker; runnable any time via PINBALL_WIZARD_LIVE_CONTRACT_TESTS=1
  3. Cost burn-rate review at end of June 2026 — first full month with Foundry idle cost; calibration moment for the $300/mo anomaly alarm
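
The opt-in gate for the live-API probe in item 2 can be pictured with the sketch below. The project's own test harness is not shown in this PR, so this is only an illustration in Python: `live_contract_tests_enabled` is a hypothetical helper, and the rule that only the exact value "1" enables the tests is an assumption.

```python
import os
from typing import Mapping, Optional

LIVE_FLAG = "PINBALL_WIZARD_LIVE_CONTRACT_TESTS"  # variable name from this PR


def live_contract_tests_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when the flag is set to exactly "1" (assumed opt-in semantics)."""
    source = os.environ if env is None else env
    return source.get(LIVE_FLAG, "").strip() == "1"
```

A test runner would call this once and skip the live-contract suite when it returns False, so default CI runs never reach the external API.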

Test plan

  • Build green (no code changes)
  • Tests green (687/687, unchanged from PR #92, feat(eval) Foundry EvaluationClient harness + custom citation-accuracy evaluators (Wave 3 PR 8))
  • Identity check (personal noreply on commit)
  • All five Phase 3 lessons documented in retrospective
  • Phase 4 inherited-follow-ups documented in build-spec § Phase 4
  • Decision-log entry for the H2 calibration result
  • Memory handoff written
  • Locked decisions updated in guardrails.md
  • CLAUDE.md updated with current state
  • Reviewer confirms Phase 3 ✅ Complete is the right call given the H2 baseline + Phase 4 inherited follow-ups

…w-ups + locked decisions ✅

Wave 4 PR 9 of Phase 3 — closes the phase. Per build-spec scope item 14
+ guardrails.md § Per-phase gate.

H2 baseline run completed 2026-05-07T16:25Z; results JSON committed at
data/eval/results/wizard.20260507T162529Z.json. Aggregate metrics:
  citation_precision  = 0.133
  citation_recall     = 0.133
  subagent_accuracy   = 0.033
  refusal_correctness = 0.300

ConfidenceThreshold stays at ADR-0017's draft 0.65 (NOT moved).
Calibration deferred to Phase 4 once the upstream gaps that floored
the H2 baseline are fixed. DL 2026-05-07 records the rationale.

Five Phase 3 lessons documented in build-spec § Phase 3 § Retrospective
+ rolled forward as Phase 4 § Inherited Phase 3 follow-ups (priority
order):
  1. Connected-agents dispatch is non-functional — Wizard refuses
     in-scope questions because FoundryAgentFactory wires only
     getMachineByTitle, not the sub-agents. Phase 4 first scope item.
  2. Eval ground-truth OPDB IDs need verification against deployed
     Cosmos catalog — predicted ≠ expected on most successful lookups.
     Phase 4 second scope item.
  3. Replace OPDB-URL regex citation extraction with proper tool-call
     trace inspection (Phase 4).
  4. Replace NullTokenUsageReader with a real impl when Microsoft.Agents.AI
     exposes Usage on AgentResponse (microsoft/agent-framework#2688).
  5. Read WizardAnswer.SubAgentUsed from connected-agents trace
     correlation (lands alongside item 1).

Locked decisions added to guardrails.md § Locked decisions:
  - Microsoft Foundry orchestration (ADR-0014)
  - Per-AIAgent model selection + LRU cache + cost ceiling (ADR-0015)
  - Confidence-threshold refusal mandatory (ADR-0017)
  - Code-resource agent definitions (ADR-0018)

CLAUDE.md updates:
  - Test count 566 → 687
  - 13 ADRs → 18 ADRs in documentation map
  - Locked invariants 1-8 → 1-12 (added 4 AI-architecture invariants
    pointing at ADRs 0014/0015/0017/0018)
  - Freshest handoff pointer → session_handoff_2026_05_07_phase3_close.md

Memory updates:
  - New: session_handoff_2026_05_07_phase3_close.md
  - MEMORY.md index points at it

Status table: Phase 3 flipped to ✅ Complete.

H1 (deploy + smoke probe) ✅ done 2026-05-07.
H2 (eval baseline) ✅ done 2026-05-07.
H3 (Pinball Map live-API probe) ⏳ deferred to operator availability;
  not on the critical path for Phase 3 close.

Build green, 687 tests passing, identity verified, no code change in
this PR — pure spec/docs/memory closeout.

jkeeley2073 added the claude-code (Generated with Claude Code) label on May 7, 2026
jkeeley2073 merged commit 6572bdb into main on May 7, 2026
3 of 4 checks passed