docs(spec) post-Phase-3-close audit fix-ups — README honesty pass + observability PinballMap inventory#94
Merged
Conversation
…bservability.md PinballMap inventory Closes the three findings from the post-Phase-3-close due-diligence audit (memory: session_handoff_2026_05_07_post_close_audit.md): 1. README stale — Phase 3 status row now ✅ Complete with the actual Phase 3 surface (Foundry orchestration, four-agent surface, confidence-threshold refusal, cost routing + LRU cache, evaluation harness, H2 baseline). Test count 566 → 687. ADR count 0001-0013 → 0001-0018. Project-structure tree updated to match. 2. README + vision.md "always cite" overclaim softened. The H2 baseline surfaced two structural gaps that floor citation_precision at 0.133: (a) citations are extracted by regexing OPDB URLs out of agent prose (a Phase 3 placeholder — tool-call-trace inspection replaces it in Phase 4), and (b) eval ground-truth OPDB IDs were curated from titles, not verified against the deployed catalog. New "Phase 3 — current capability honest read" section in README names these gaps plus the connected-agents scaffolding gap and the RAG-corpus-deferred gap. vision.md line 18 likewise softens "sub-agent routing (Valuation / Rules / Repair)" to acknowledge structural wiring lands in Phase 4. Threshold-driven refusal (ADR-0017's draft 0.65) and the evaluation harness itself ARE shipped and unchanged. 3. observability.md gains the pinwiz.pinballmap.* instrument table that existed in PinballWizardTelemetry.cs since PR #83 but was never tabulated. Activity inventory also gains pinwiz.pinballmap.fetch (parity with the metric inventory; emitted by PinballMapClient). Why the soften, not just the mechanical fix: per guardrails.md goal #1 (10-minute prospect landing), customer-facing surfaces must distinguish shipped vs scaffolded vs deferred. "Always cite" set an expectation the H2 baseline cannot honor today; refuse-rather-than-fabricate is the actual safety invariant per ADR-0017. The Phase 4 trajectory closes the gap between architectural invariant and empirical baseline; this PR makes the docs honest about where we are now. Local review: 0 🔴 / 1⚠️ (Activity inventory parity — fixed in same PR before push). Doc-only — build green, 687/687 tests pass. Per-phase-close lesson recorded for guardrails.md: README + vision.md updates should be a per-phase-close checklist item. The pattern (PR #81 after Phase 2, this PR after Phase 3) repeated itself; the post-close audit caught both. Worth adding to guardrails.md § Per-phase gate as a Phase 4 enhancement.
This was referenced May 7, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the three findings from the post-Phase-3-close due-diligence audit. Doc-only, ~30 lines net change across three files.
README.mdstale — Phase 3 row now ✅ Complete with the actual Phase 3 surface (Foundry orchestration, four-agent surface, confidence-threshold refusal, cost routing + LRU cache, evaluation harness, H2 baseline). Test count566 → 687. ADR count0001–0013 → 0001–0018. Project-structure tree updated to match.README +
docs/vision.md"always cite" overclaim softened. The H2 baseline (data/eval/results/wizard.20260507T162529Z.json) scoredcitation_precision = 0.133— the metric is floored by two structural gaps: (a) citations are extracted by regexing OPDB URLs out of agent prose (a Phase 3 placeholder — tool-call-trace inspection replaces it in Phase 4), and (b) eval ground-truth OPDB IDs were curated from titles, not verified against the deployed catalog. A new "Phase 3 — current capability honest read" section in README names these gaps plus connected-agents-scaffolding and RAG-corpus-deferred.vision.mdline 18 likewise softens "sub-agent routing (Valuation / Rules / Repair)" to acknowledge structural wiring lands in Phase 4. Threshold-driven refusal (ADR-0017's draft 0.65) and the evaluation harness itself ARE shipped and unchanged.docs/observability.mdgains thepinwiz.pinballmap.*instrument table that's existed inPinballWizardTelemetry.cs:62–80since PR feat(pinballmap) Pinball Map client with on-disk cache + per-source politeness #83 but was never tabulated. Activity inventory also gainspinwiz.pinballmap.fetchfor parity with the metric inventory (emitted byPinballMapClient.cs:74).Why the soften, not just the mechanical fix
Per
guardrails.mdgoal #1 (10-minute prospect landing), customer-facing surfaces must distinguish shipped vs scaffolded vs deferred. "Always cite" set an expectation the H2 baseline cannot honor today; refuse-rather-than-fabricate is the actual safety invariant per ADR-0017. The Phase 4 trajectory closes the gap between architectural invariant and empirical baseline; this PR makes the docs honest about where we are now.Local review
0 🔴 / 1 ⚠️ / 4 doc-applicable categories ✅(6 code categories N/A on a doc PR). Thedocs/observability.md— was fixed in the same PR before push.Test plan
guardrails.md§ Per-PR gate;/local-reviewran anyway because Finding 2 is a posture-correction.dotnet build PinballWizard.slnx -c Release— clean, 0 warnings.dotnet test PinballWizard.slnx -c Release— 687/687 passing.git log -1 --format='%an <%ae>'shows personal noreply.[Phase 3 known limitations](#phase-3--current-capability-honest-read)resolves to GitHub-slugified### Phase 3 — current capability honest read(em-dash drops, double-hyphen preserved).pinwiz.pinballmap.*rows inobservability.mdcross-checked againstPinballWizardTelemetry.cs— names, types, units, descriptions all match.Per-phase-close lesson
This is the second time a post-close audit has surfaced a README + vision.md update that the closeout PR didn't include (Phase 2 close → PR #81; Phase 3 close → this). Worth adding to
guardrails.md§ Per-phase gate as a Phase 4 enhancement: "README marketing claims reviewed against the phase's actual demonstrable artifact." Surfaced as a Phase 4 follow-up, not bundled here (one PR, one purpose).Memory references
session_handoff_2026_05_07_post_close_audit.md— the audit findings + the fix-it scope being closed by this PRsession_handoff_2026_05_07_phase3_close.md— Phase 3 closeout contextfeedback_pre_pr_self_audit.md— two-step audit gate that produced this PR🤖 Generated with Claude Code