docs: revise ADR-005 — staged overlay refactoring with prototype findings#439
Conversation
52ad12b to
6cb02dc
Compare
6cb02dc to
0f27c00
Compare
0f27c00 to
e2ef22f
Compare
e2ef22f to
a19016d
Compare
There was a problem hiding this comment.
Good ADR. 3 areas that need clarification before this is ready to merge:
-
P0: Constraint evaluator change: Moving from per-overlay to post-merge evaluation is a significant behavioral change that affects
ExcludedOverlays/ConstraintWarningssemantics. This ADR should address what happens when mixin-contributed constraints fail against a snapshot — does the whole recipe fail, or is there a graceful degradation path? -
Mixin-vs-inherited conflict policy: CI lint covers mixin-vs-mixin conflicts but not mixin-vs-inherited conflicts. A mixin could silently weaken or override a constraint from the inheritance chain. We need explicit policy for that.
-
Decision section: The analysis clearly supports Reorder + Mixins. Either commit to it as "Proposed" or explicitly frame the TBD. ADR is meant to force a team vote between specific options.
mchmarny
left a comment
There was a problem hiding this comment.
Only the constraint evaluator change is a blocker
|
Re: P0 — Constraint evaluator change Good catch. The intent is not to introduce partial mixin degradation. In Phase 2, constraint evaluation would run against the fully composed leaf candidate: inheritance chain + mixins + leaf-local content. If any constraint contributed by that composed candidate fails, that candidate is excluded as a unit. Recipe generation still succeeds by falling back to the remaining matching overlays, consistent with today's To keep this debuggable, I'll add a "Constraint Failure Semantics" section to the ADR documenting this explicitly. |
|
Thanks for the thorough review, Mark. Addressing each point: 1. Constraint evaluator change (P0) Good catch. The intent is not to introduce partial mixin degradation. In Phase 2, constraint evaluation would run against the fully composed leaf candidate: inheritance chain + mixins + leaf-local content. If any constraint contributed by that composed candidate fails, that candidate is excluded as a unit. Recipe generation still succeeds by falling back to the remaining matching overlays, consistent with today's To keep this debuggable, I'll add a "Constraint Failure Semantics" section to the ADR documenting this explicitly. 2. Mixin-vs-inherited conflict policy Agreed — the current "mixin wins" wording is too implicit. I'll revise the ADR to make the policy explicit: mixin-vs-inherited duplicate constraint names are forbidden, and CI lint rejects them. This keeps mixins truly orthogonal and prevents accidental weakening or silent override of inherited constraints. The 3. Decision section Agreed. The ADR currently says I'll also make Phase 2's boundary explicit: validation is not allowed in mixins in this phase because current validation merge semantics are replacement-based, not deep-merge. |
Consider New Phase 1.5: Push validation up + delete redundant constraintsNo code changes. Only overlay YAML restructuring. The problem in one picture Here's the inheritance chain for h100-eks-ubuntu-training today: The leaf file (h100-eks-ubuntu-training.yaml) is 73 lines. Here's what's in it: Two problems:
This same pattern repeats across ~16 leaf overlays. The fix (two moves, zero code changes) Move 1: Push validation up to the intent overlay. The validation block describes what to check for a given {accelerator, service, intent} combination. Move it from the -ubuntu- leaf to its parent. This works because Merge() already handles validation inheritance — if the parent has a validation block, children inherit it. Children only need to re-declare validation if they want to override a phase. Move 2: Delete redundant constraint re-declarations. If h100-eks-training already says K8s >= 1.32.4, every child inherits it automatically. Delete the duplicate line from every leaf that just re-states what the parent already says. After both moves: leaf becomes trivial h100-eks-ubuntu-training.yaml — AFTER (was 73 lines, now ~25 with license header)That's it. Only the things that are genuinely unique to "this is Ubuntu" remain. Why this is safe
What's left after Phase 1 + 1.5Leaf overlays are now ~10-15 lines of actual content. The remaining duplication:
The Ubuntu constraints are only 3 lines per file — arguably not worth any machinery to deduplicate. The kubeflow block could be moved to {accel}-{service}-training-kubeflow parent overlays. The dynamo block is the only one where a sharing Summary
Phase 1.5 captures most of the mixin benefit with zero code changes and zero new concepts. |
|
@lockwobr Good point. I agree this identifies a meaningful no-code cleanup step: push validation up to the I don't think it fully removes the case for mixins, because the platform overlays still duplicate substantial component blocks (dynamo in particular at ~25 lines × 4 files). But it does strengthen the case that the Ubuntu layer itself should become much thinner before introducing any new mechanism. I'll fold this into the ADR framing, either as part of Phase 1 or as an explicit refinement step between Phase 1 and Phase 2. |
PR #439 Review SummaryMark (mchmarny) — 8 inline commentsP0: Constraint evaluator change — Moving to post-merge evaluation changes Mixin-vs-inherited conflict policy — Mixins could silently weaken inherited constraints. Decision section TBD — Analysis clearly supports Reorder + Mixins, commit to it. Mixin field stripping — Use Nits — ADR-004→005 in PR description; rename file to drop Jason (xdu31) — 1 inline commentDoes the OS layer need to exist? — Ubuntu overlays add no components, no validation. Kernel >= 6.8 is a GPU driver requirement, not OS. Removing the OS layer eliminates 12 files and the main mixin justification. Brian (lockwobr) — 1 conversation commentPhase 1.5: Push validation up + delete redundant constraints — No code changes, captures ~70% dedup by moving validation to Alex (ayuskauskas) — 2 commentsAuto-mixin — If mixins only add constraints, the engine could auto-apply them by criteria match. Flat leaf-only structure — Remove inheritance entirely, one file per combo. |
Recent UpdatesReplies posted:
ADR content updates pushed: |
Summary of Review Feedback and ADR Updates
|
|
Here's the plan for Phase 1. I hope it addresses the comments. If we agree, I'll start an implementation. Phase 1 (no code changes):
|
ADR-005 Revision SummaryBased on a full prototype implementation of both the intermediate/reparenting approach and the mixin approach, the ADR needs material revision. Here are the findings and the proposed new phasing. What Changed and WhyPrototype implementation of the original Phase 1 (intermediates + reparenting) revealed three issues:
A subsequent mixin prototype validated that Revised PhasesPhase 1 — Correctness fix + structure cleanup
Phase 2 — Candidate selection refinement
Phase 3 — Mixins (OS + platform)
Deferred A — Intermediates + reparenting
Deferred B — Deep-merge validation mixins
Key Reframing
The full revised ADR text is ready — I'll push it as a commit once we align on the direction. The prototype code (Specificity fix, maximal leaf selection, mixin loader/merger) is available as a working reference. @mchmarny @xdu31 @ayuskauskas — would appreciate another look given the material changes to the phasing and justification. |
…ings Revise ADR-005 based on prototype implementation findings: - Original Phase 1 (intermediates + reparenting) is not viable with the current all-match resolver. Reparented overlays still compete with their former parents at the same specificity level, producing non-deterministic merge results. - Specificity() has a pre-existing bug: YAML-parsed empty strings are not treated as "any", making all overlays report the same specificity. - Intermediates increase net line count (+134 lines) for the current accelerator/service set due to content absorption from former parents. - Mixin prototype validated: spec.mixins composition produces identical hydrated output for all current leaf overlays. Revised sequencing: Phase 1: Specificity fix + validation lift-up Phase 2: Maximal leaf candidate selection (shared helper) Phase 3: Mixins for OS/platform composition Deferred A: Intermediates (after Phase 2, if accelerator growth justifies) Deferred B: Deep-merge validation mixins (after Phase 3 proven stable) Justification reframed from line-count reduction to maintenance cost, drift resistance, correctness, and review burden. Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
6d1faf8 to
8f0cf7b
Compare
|
Seems reasonable |
Summary
Revise ADR-005 based on prototype implementation findings. The original proposal (reorder inheritance tree as a no-code Phase 1, then add mixins) is not viable as written — reparenting conflicts with the current all-match resolver semantics. The revised ADR resequences the work: fix resolver correctness first, then add mixins for maintenance-critical shared fragments.
Motivation / Context
A full prototype of the original Phase 1 (4 intermediate overlays, 8 reparented intent overlays) revealed:
eks-training) still match queries independently and compete with intermediates at the same specificity level, producing non-deterministic merge resultsSpecificity()bug — all overlays report the same specificity because YAML-parsed empty strings aren't treated as "any", making the sort meaninglessA subsequent mixin prototype validated that
spec.mixinscomposition works correctly, producing identical hydrated output for all converted overlays.Fixes: N/A
Related: #305
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)Implementation Notes
Revised phasing (was: Reorder → Mixins → Deep Mixins):
BuildRecipeResultandBuildRecipeResultWithEvaluator(candidate selection refinement)Key changes from original ADR:
Testing
N/A — documentation only.
Risk Assessment
Rollout notes: N/A — ADR document only, no code changes.
Checklist
make testwith-race)make lint)git commit -S) — GPG signing info