
fix: refactor LWS to use native scaling (#2214) #5468

Merged
tmonty12 merged 8 commits into ai-dynamo:main from kruthiwusirika5:main
Apr 30, 2026

Conversation

@kruthiwusirika5
Contributor

@kruthiwusirika5 kruthiwusirika5 commented Jan 15, 2026

Overview:
Refactors the DynamoComponentDeployment controller to utilize the native scaling capabilities of the LeaderWorkerSet (LWS) API, addressing inefficient resource creation.

Details:
Previously, the controller created one LeaderWorkerSet object per desired replica. This change modifies the logic to:

- Create a single LeaderWorkerSet object with Spec.Replicas set to the desired count
- Remove generateVolcanoPodGroup function, SchedulerNameVolcano constant, and combineLWSReplicaStatuses helper (no longer needed)
- Remove instanceID field from generateResourceOption and all related logic (instance-id labels, scheduling.k8s.io/group-name annotations)
- Add cleanup logic for legacy indexed LWS and PodGroup resources (e.g. name-0, name-1)
- Clean up verbose comments in reconcileLeaderWorkerSetResources
- Update installation guide with LWS >= v0.7.0 version requirement and --set gangScheduling=volcano Helm value for Volcano integration
- Update unit tests to verify single LWS creation with Replicas=N

Where should the reviewer start?

- deploy/operator/internal/controller/dynamocomponentdeployment_controller.go: Review reconcileLeaderWorkerSetResources for the main logic change
- deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go: Review updated test cases verifying single LWS creation
- docs/kubernetes/installation-guide.md: Review updated LWS + Volcano installation instructions

Related Issues: Fixes #2214
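
For readers skimming the thread, the before/after shape of the change described above can be sketched roughly as follows. This is an illustration only, not the operator's code: the `LeaderWorkerSet` struct is a simplified, hypothetical stand-in for the real LWS API type, and `perReplicaNames` and `nativeScaling` are made-up helper names.

```go
package main

import "fmt"

// LeaderWorkerSet is a simplified, hypothetical stand-in for the real LWS
// API type; only the fields needed for this illustration are modeled.
type LeaderWorkerSet struct {
	Name     string
	Replicas int32 // stands in for Spec.Replicas
}

// perReplicaNames mirrors the pre-refactor pattern: one object per desired
// replica, named baseName-0, baseName-1, and so on.
func perReplicaNames(baseName string, desired int32) []string {
	var names []string
	for i := int32(0); i < desired; i++ {
		names = append(names, fmt.Sprintf("%s-%d", baseName, i))
	}
	return names
}

// nativeScaling mirrors the post-refactor pattern: a single object named
// after the component, with the replica count carried in the spec.
func nativeScaling(baseName string, desired int32) LeaderWorkerSet {
	return LeaderWorkerSet{Name: baseName, Replicas: desired}
}

func main() {
	fmt.Println("before:", perReplicaNames("test-component", 3))
	lws := nativeScaling("test-component", 3)
	fmt.Printf("after: %s (replicas=%d)\n", lws.Name, lws.Replicas)
}
```

With three desired replicas, the old pattern yields three named objects to reconcile, while the new one yields a single object whose replica count the LWS controller scales.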

@kruthiwusirika5 kruthiwusirika5 requested a review from a team as a code owner January 15, 2026 23:52
@copy-pr-bot

copy-pr-bot Bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

github-actions Bot commented Jan 15, 2026

@github-actions github-actions Bot added the fix and external-contribution labels Jan 15, 2026
@github-actions github-actions Bot added the deployment::k8s label Jan 31, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jan 31, 2026

Walkthrough

The reconciliation logic for DynamoComponentDeployment is refactored to create a single LeaderWorkerSet resource with natively scaled replicas, replacing the previous pattern of instantiating multiple per-replica LeaderWorkerSets. Legacy per-replica resource cleanup is introduced, and readiness determination is simplified to assess a single consolidated LeaderWorkerSet.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Controller Logic Refactoring: deploy/operator/internal/controller/dynamocomponentdeployment_controller.go | Reconciler now creates a single LeaderWorkerSet with native Replicas instead of multiple per-replica instances. Functions updated to support nil instanceID defaulting to 0. Legacy resource cleanup added to remove indexed LeaderWorkerSets (baseName-0, baseName-1, etc.) and corresponding PodGroups. Readiness checks adjusted to evaluate single LWS status. Per-index resource generation logic simplified. |
| Test Updates: deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go | Test cases renamed to reflect native scaling approach ("nil instanceID - success (native scaling)"). Expected outputs for nil instanceID expanded from nil to fully populated PodGroup/LeaderWorkerSet with instance-id label "0". Test scaffolding enhanced with mock Docker secret retriever and service account setup. Reconciliation tests updated with new readiness semantics and component naming conventions. Deployment strategy tests extended to cover annotation handling. |
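
The legacy cleanup mentioned in the summary targets names of the form baseName-0, baseName-1, and so on. The thread does not show how the controller matches those names, so the following is only a sketch of one way to recognize them. `isLegacyIndexedName` is a hypothetical helper, and the strict "non-negative decimal index with no leading zeros" rule is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isLegacyIndexedName reports whether name has the form baseName-<index>,
// where <index> is a non-negative decimal integer without leading zeros.
// Hypothetical helper; the real controller may match names differently.
func isLegacyIndexedName(baseName, name string) bool {
	prefix := baseName + "-"
	if !strings.HasPrefix(name, prefix) {
		return false
	}
	suffix := strings.TrimPrefix(name, prefix)
	idx, err := strconv.Atoi(suffix)
	if err != nil || idx < 0 {
		return false
	}
	// Reject forms such as "+1" or "01" that Atoi accepts but that the
	// old per-replica naming would not have produced.
	return strconv.Itoa(idx) == suffix
}

func main() {
	candidates := []string{
		"test-component",
		"test-component-0",
		"test-component-1",
		"test-component-worker",
		"other-0",
	}
	for _, name := range candidates {
		fmt.Printf("%-24s legacy=%v\n", name, isLegacyIndexedName("test-component", name))
	}
}
```

Name matching alone is not enough to delete safely. Later in this thread the author notes that the cleanup also keeps an ownership guard, so resources the controller does not own are skipped.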

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 One leader, many workers in harmony now,
No more per-replica chaos, just native scaling's vow,
Legacy ghosts swept clean from the kingdom's dust,
A consolidated dream where all pods trust,
The scheduler's burden lifted light.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Changes successfully address issue #2214 by refactoring to use single LWS with native scaling, fixing the worker blocking issue caused by per-replica resource creation. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to LWS integration refactoring: controller logic, helper functions, cleanup logic, and corresponding test updates. |
| Title check | ✅ Passed | The title 'fix: refactor LWS to use native scaling (#2214)' clearly summarizes the main change: migrating from per-replica LWS instances to native scaling with a single LWS resource. |
| Description check | ✅ Passed | The pull request description is well-structured and covers all required sections with comprehensive details about the changes, objectives, and files for review. |

@kruthiwusirika5 kruthiwusirika5 changed the title from "fix: refactor LWS integration to use native scaling (Fixes #2214)" to "Fix/lws native scaling 2214 v2" Feb 19, 2026
@rmccorm4
Contributor

Hi @kruthiwusirika5, thanks for the contribution!

Can you help fix merge conflicts and update the PR to be based off latest main branch?

After rebase, @tmonty12 @julienmancuso can you help review?

Contributor

@tmonty12 tmonty12 left a comment

Thank you for the contribution! Please address comments

Comment thread deploy/operator/internal/controller/dynamocomponentdeployment_controller.go Outdated
Comment thread deploy/operator/internal/controller/dynamocomponentdeployment_controller.go Outdated
@kruthiwusirika5
Contributor Author

Hi @tmonty12, I've addressed all your review feedback. Could you please take another look when you get a chance? Here's a summary of the changes:

- Removed generateVolcanoPodGroup, SchedulerNameVolcano constant, and combineLWSReplicaStatuses helper
- Updated the installation guide with --set gangScheduling=volcano and a link to LWS gang scheduling docs
- Added LWS >= v0.7.0 version requirement note
- Cleaned up verbose comments in reconcileLeaderWorkerSetResources, which now uses a single LWS with native Spec.Replicas scaling
- Removed instanceID from generateResourceOption and all related labels/annotations

Ready for re-review. Thanks!

@kruthiwusirika5 kruthiwusirika5 changed the title from "Fix/lws native scaling 2214 v2" to "fix: refactor LWS to use native scaling (#2214)" Mar 20, 2026
@kruthiwusirika5
Contributor Author

Hi @tmonty12, I've addressed all your review feedback. Could you take another look when you get a chance?

…i-dynamo#2214)

- Replace per-replica LWS loop with a single LeaderWorkerSet using
  native Spec.Replicas for scaling
- Remove generateVolcanoPodGroup function and its tests (no longer needed)
- Remove instanceID field from generateResourceOption and all related logic
- Remove instance-id labels, scheduling.k8s.io/group-name annotations,
  and volcano scheduler from LWS pod templates
- Remove unused combineLWSReplicaStatuses, lwsInstanceName functions
- Add legacy resource cleanup for indexed LWS/PodGroup resources
- Update installation guide with LWS >= v0.7.0 requirement and
  gangScheduling=volcano helm value for Volcano integration

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
@kruthiwusirika5
Contributor Author

@tmonty12 / @julienmancuso / @rmccorm4 friendly bump — this has been ready for re-review for ~5 weeks

@rmccorm4
Contributor

/ok to test 6d65a6d

Comment thread deploy/operator/internal/controller/dynamocomponentdeployment_controller.go Outdated
Comment thread deploy/operator/internal/controller/dynamocomponentdeployment_controller.go Outdated
Comment thread deploy/operator/internal/controller/dynamocomponentdeployment_controller.go Outdated
Comment thread docs/kubernetes/installation-guide.md
- Drop the now-unused kubeName parameter from generateLeaderPodTemplateSpec
  and generateWorkerPodTemplateSpec. The argument became obsolete after the
  move to native LWS scaling; the only caller was generateLeaderWorkerSet.
- gofmt -s: collapse a stray blank line before FinalizeResource.

Both surfaced once the typecheck error in the previous CI run was fixed.

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
@kruthiwusirika5
Contributor Author

Pushed 6f5db88 with follow-up lint fixes that surfaced once the previous typecheck error cleared:

- gofmt -s on controller.go:620 (stray blank line)
- Removed the dead kubeName parameter from generateLeaderPodTemplateSpec and generateWorkerPodTemplateSpec (only one caller each, both updated)

@tmonty12 / @rmccorm4, could one of you run /ok to test 6f5db88 when convenient? Local build + vet + gofmt are clean.

@tmonty12
Contributor

/ok to test 6f5db88

…aling

The existing fixtures pre-populated multiple per-replica LWS objects
(test-component-0/-1/-2) and asserted ComponentName="test-component-0"
with a multi-element ComponentNames list. That matched the pre-refactor
per-replica architecture; the current code creates a single LWS named
after the DCD with Spec.Replicas=N.

Replace each fixture with a single LWS named "test-component" carrying
the desired Status, and update wantComponentReconcileResult to use the
single-LWS reason/message strings ("LeaderWorkerSetReady" /
"LeaderWorkerSet is not ready") and ComponentNames=["test-component"].

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
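
As a rough illustration of the single-LWS readiness result this commit message describes, the sketch below evaluates one status and reports one component name. Only the "LeaderWorkerSetReady" and "LeaderWorkerSet is not ready" strings and the single-element ComponentNames list come from the commit message. The struct shapes, the `evaluate` helper, the readiness criterion (ready replicas at least equal to desired replicas), and how the reason and message pair with each state are all assumptions.

```go
package main

import "fmt"

// lwsStatus is a hypothetical, trimmed-down view of one LWS object's state.
type lwsStatus struct {
	DesiredReplicas int32
	ReadyReplicas   int32
}

// reconcileResult is a hypothetical stand-in for the controller's result type.
type reconcileResult struct {
	Ready          bool
	Reason         string
	Message        string
	ComponentNames []string
}

// evaluate derives readiness from a single LWS named after the component,
// instead of combining the statuses of several per-replica objects.
func evaluate(name string, st lwsStatus) reconcileResult {
	res := reconcileResult{
		Reason:         "LeaderWorkerSetReady",
		ComponentNames: []string{name},
	}
	// Assumption: ready means every desired replica reports ready.
	if st.ReadyReplicas >= st.DesiredReplicas {
		res.Ready = true
		return res
	}
	res.Message = "LeaderWorkerSet is not ready"
	return res
}

func main() {
	fmt.Printf("%+v\n", evaluate("test-component", lwsStatus{DesiredReplicas: 3, ReadyReplicas: 3}))
	fmt.Printf("%+v\n", evaluate("test-component", lwsStatus{DesiredReplicas: 3, ReadyReplicas: 1}))
}
```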
@kruthiwusirika5
Copy link
Copy Markdown
Contributor Author

Pushed 5baae1c — Test_reconcileLeaderWorkerSetResources was still asserting the pre-refactor per-replica naming (test-component-0/-1/-2). Collapsed each case's fixture to a single LWS reflecting the native-scaling architecture and updated the expected reason/message strings.

Local: full unit-test sweep green, gofmt -s and go vet clean.

@tmonty12 / @rmccorm4 — could one of you run /ok to test 5baae1c when convenient?

@tmonty12
Contributor

/ok to test 5baae1c

@kruthiwusirika5
Contributor Author

@tmonty12 — CI is fully green on 5baae1c. When you have a moment, would appreciate a final pass.

Contributor

@tmonty12 tmonty12 left a comment

Overall looks good. Can you address the error-swallowing comment? Then the PR will be approved/merged.

@tmonty12
Contributor

/ok to test d01d193

…nciler requeues

Address @tmonty12's review feedback on PR ai-dynamo#5468: the legacy resource
cleanup in `reconcileLeaderWorkerSetResources` previously logged and
swallowed errors from `r.List` and `r.Delete`, which meant a transient
API server hiccup during cleanup would silently leave legacy per-replica
LWS / PodGroup resources running while the reconcile completed
"successfully" and was not requeued.

Return wrapped errors instead so the controller-runtime reconcile loop
re-enqueues the request and we get another shot at cleanup. The
ownership-and-not-found guards remain in place, so we still skip
resources we don't own and treat already-deleted objects as a no-op.

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
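
The difference between swallowing and returning cleanup errors can be sketched as below. This is a simplified model under stated assumptions, not the controller's code: `legacyClient`, `fakeClient`, `resource`, and `cleanupLegacy` are hypothetical names, and `errNotFound` stands in for whatever not-found check the real code uses against the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound and errTransient are stand-ins for API server errors.
var (
	errNotFound  = errors.New("not found")
	errTransient = errors.New("api server unavailable")
)

// resource is a hypothetical view of a legacy LWS or PodGroup object.
type resource struct {
	Name  string
	Owned bool // true if this component owns the resource
}

// legacyClient is a hypothetical, minimal client interface.
type legacyClient interface {
	List() ([]resource, error)
	Delete(name string) error
}

// cleanupLegacy returns wrapped errors instead of logging and swallowing
// them, so a caller such as a reconcile loop can requeue and retry.
func cleanupLegacy(c legacyClient) error {
	items, err := c.List()
	if err != nil {
		return fmt.Errorf("failed to list legacy resources: %w", err)
	}
	for _, it := range items {
		if !it.Owned {
			continue // ownership guard: skip resources we do not own
		}
		if err := c.Delete(it.Name); err != nil {
			if errors.Is(err, errNotFound) {
				continue // already deleted: treat as a no-op
			}
			return fmt.Errorf("failed to delete legacy resource %q: %w", it.Name, err)
		}
	}
	return nil
}

// fakeClient is an in-memory client used only to exercise the sketch.
type fakeClient struct {
	items      []resource
	listErr    error
	deleteErrs map[string]error
	deleted    []string
}

func (f *fakeClient) List() ([]resource, error) { return f.items, f.listErr }

func (f *fakeClient) Delete(name string) error {
	if err := f.deleteErrs[name]; err != nil {
		return err
	}
	f.deleted = append(f.deleted, name)
	return nil
}

func main() {
	c := &fakeClient{
		items:      []resource{{Name: "test-component-0", Owned: true}, {Name: "test-component-1", Owned: true}},
		deleteErrs: map[string]error{"test-component-1": errTransient},
	}
	// The transient delete failure is surfaced to the caller, not hidden.
	fmt.Println("cleanup error:", cleanupLegacy(c))
	fmt.Println("deleted so far:", c.deleted)
}
```

Wrapping with `%w` keeps the underlying error inspectable through `errors.Is`, which is what lets the not-found case stay a no-op while other failures propagate.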
@kruthiwusirika5
Contributor Author

Thanks @tmonty12! Just pushed 29194c8 which:

- Returns wrapped errors from the legacy LWS list and the legacy PodGroup list/delete instead of logging and swallowing them, so the reconciler will now requeue on cleanup failures.
- Preserves the ownership and IsNotFound guards, so we still skip resources we don't own and treat already-deleted objects as a no-op.

Ready for another look when you have a moment.

@tmonty12
Contributor

/ok to test 29194c8

Contributor

@tmonty12 tmonty12 left a comment

Thanks for the contribution!

@tmonty12 tmonty12 enabled auto-merge (squash) April 30, 2026 03:48
@tmonty12 tmonty12 merged commit 28a856c into ai-dynamo:main Apr 30, 2026
58 checks passed
furionw pushed a commit that referenced this pull request May 2, 2026
Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

Labels

- deployment::k8s: Relates to dynamo deployment in kubernetes
- documentation: Improvements or additions to documentation
- external-contribution: Pull request is from an external contributor
- fix
- size/XL

3 participants