Skip to content

fix(operator/lws): preserve legacy worker pod labels#9738

Open
sttts wants to merge 8 commits into
ai-dynamo:mainfrom
sttts:sttts-issue-9700-no-rollout-labels
Open

fix(operator/lws): preserve legacy worker pod labels#9738
sttts wants to merge 8 commits into
ai-dynamo:mainfrom
sttts:sttts-issue-9700-no-rollout-labels

Conversation

@sttts
Copy link
Copy Markdown
Contributor

@sttts sttts commented May 19, 2026

Summary

  • fix LWS pod-template label rendering so leader and worker pods use the same legacy worker identity as Deployment and Service rendering
  • lock the non-EPP compatibility contract: existing decode/prefill workers keep component-type=worker plus sub-component-type=decode|prefill, and component-class=worker is not added
  • add table-driven regression coverage for Deployment, LWS, and Grove no-rollout behavior; EPP rollout policy stays out of this PR while the product decision is pending

Missing logic / bug

The compatibility decision already exists in getDCDWorkloadComponentType, but LWS pod templates were not using that decision. generateLeaderWorkerSet rendered LWS pod labels from GetDCDKubeLabels directly, so an upgraded non-EPP decode/prefill worker could render LWS pods as decode/prefill while the compatibility path still rendered Services for the legacy worker pod identity.

Example of the mismatch this PR prevents:

# Existing non-EPP worker pod labels from the legacy generation.
apiVersion: v1
kind: Pod
metadata:
  labels:
    nvidia.com/dynamo-component-type: worker
    nvidia.com/dynamo-sub-component-type: decode
# Buggy desired LWS pod template before this PR.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
spec:
  leaderWorkerTemplate:
    workerTemplate:
      metadata:
        labels:
          nvidia.com/dynamo-component-type: decode
# Compatibility selector still expects the legacy pod identity.
apiVersion: v1
kind: Service
spec:
  selector:
    nvidia.com/dynamo-component-type: worker
    nvidia.com/dynamo-sub-component-type: decode

That is both unnecessary pod-template churn and a selector-safety risk. The fix keeps the rendered non-EPP LWS pod identity stable before spec hashes are compared, without adding hash-specific special cases:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
spec:
  leaderWorkerTemplate:
    workerTemplate:
      metadata:
        labels:
          nvidia.com/dynamo-component-type: worker
          nvidia.com/dynamo-sub-component-type: decode
          # nvidia.com/dynamo-component-class is intentionally absent in non-EPP

Deployment already used the workload-compatible component type path; Grove has its own legacy selector preservation. This PR adds one table-driven test so Deployment, LWS, and Grove all keep those system labels stable for non-EPP legacy workers.

Tests

  • go test ./internal/controller -run "TestLegacyWorkerIdentityUpgradeDoesNotTriggerRollout|TestGroveNativeWorkerIdentityLabelsStayNative"
  • KUBEBUILDER_ASSETS=/private/tmp/no-rollout-labels-envtest/k8s/1.29.0-darwin-arm64 go test ./internal/controller

Relates to #9700

@sttts sttts requested a review from a team as a code owner May 19, 2026 10:02
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown
Contributor

👋 Hi sttts! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors.Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions Bot added external-contribution Pull request is from an external contributor deployment::k8s Relates to dynamo deployment in kubernetes labels May 19, 2026
if current.empty() {
return true
}
if requiresV2WorkerGeneration(dgd, current, desired) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this here seems strange. workerHashForDCDGeneration should rather change in the right moment and stay equal when on rollout is desired.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Walkthrough

This PR refactors v1/v2 worker-hash compatibility logic in the rolling-update controller by introducing a requiresV2WorkerGeneration predicate that detects when EPP components require v2-rendered worker labels. The DynamoGraphDeployment is threaded through hash-matching and selection functions to force v2 generation when needed, even if v1 hashes align. Rolling-update triggering, annotation migration, and active-worker-hash selection are updated accordingly, with comprehensive test coverage.

Changes

EPP v2 Worker Generation Requirements

Layer / File(s) Summary
v2-generation predicate and hash-matching logic
deploy/operator/internal/controller/dynamographdeployment_rollingupdate.go
Introduces requiresV2WorkerGeneration(dgd, current, desired) to detect when EPP components require v2-rendered worker labels. Updates currentWorkerHashesMatchDesired to thread dgd and force a mismatch when v2 generation is required, and workerHashForDCDGeneration to select desired.v2 when the predicate applies.
Propagating dgd through rolling-update flow
deploy/operator/internal/controller/dynamographdeployment_rollingupdate.go
Updates shouldTriggerRollingUpdate, migrateCurrentWorkerHashIfNeeded, and activeWorkerHashCandidates to call the new dgd-aware hash-matching and selection functions. Adds a guard in migration to skip v2 migration when requiresV2WorkerGeneration applies. Rewraps comment to reflect updated active-worker-hash generation behavior.
Test coverage for EPP v2 generation
deploy/operator/internal/controller/dynamographdeployment_rollingupdate_test.go
Adds test case to TestShouldTriggerRollingUpdate covering legacy-hash EPP rollout triggering. Adds TestInitializeWorkerHashIfNeeded_EPPRequiresWorkerClassRollout to validate v1 annotation preservation, v2 hash generation, rolling-update triggering for EPP, and correct prefill component deployment with expected worker-class label.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The PR title states 'preserve legacy worker pod labels' but the actual changes focus on requiring EPP-backed DGDs to trigger rolling updates for v2 worker generation with worker-class labels, not just preserving legacy labels. Update the title to better reflect the main change: consider 'fix(operator/epp): require worker rollout for v2 label generation' or similar to accurately convey that EPP components require rolling updates to render v2 worker-class labels.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description includes all required sections: Overview/Summary, Details explaining the changes and bug fix, specific files for review focus, and related issue references.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

current, desired workerGenerationHashes,
) string {
if requiresV2WorkerGeneration(dgd, current, desired) {
return desired.v2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning desired.v2 here also affects unsupported Grove/multinode paths, so their handler can create a new v2-suffixed worker DCD without managed rollout cleanup and leave the old v1 DCD running. Fix: gate this branch to supported managed rollouts or make workerHashesForUnsupportedPathway keep using the compatible v1 hash.

@sttts sttts force-pushed the sttts-issue-9700-no-rollout-labels branch from 2c23c19 to e64b8b1 Compare May 19, 2026 13:51
@sttts sttts changed the title operator/epp: roll workers for label migration operator/lws: preserve legacy worker pod labels May 19, 2026
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
@sttts sttts force-pushed the sttts-issue-9700-no-rollout-labels branch from e64b8b1 to e1264e5 Compare May 19, 2026 14:17
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
sttts added 2 commits May 19, 2026 16:38
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Comment thread deploy/operator/internal/controller/upgrade_test.go Outdated
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
@sttts sttts changed the title operator/lws: preserve legacy worker pod labels fix(operator/lws): preserve legacy worker pod labels May 19, 2026
@github-actions github-actions Bot added the fix label May 19, 2026
sttts added 2 commits May 19, 2026 17:56
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

deployment::k8s Relates to dynamo deployment in kubernetes external-contribution Pull request is from an external contributor fix size/XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants