fix: refactor LWS to use native scaling (#2214) by kruthiwusirika5 · Pull Request #5468 · ai-dynamo/dynamo

kruthiwusirika5 · 2026-01-15T23:52:57Z

Overview:
Refactors the DynamoComponentDeployment controller to utilize the native scaling capabilities of the LeaderWorkerSet (LWS) API, addressing inefficient resource creation.

Details:
Previously, the controller created one LeaderWorkerSet object per desired replica. This change modifies the logic to:

- Create a single LeaderWorkerSet object with Spec.Replicas set to the desired count
- Remove the generateVolcanoPodGroup function, the SchedulerNameVolcano constant, and the combineLWSReplicaStatuses helper (no longer needed)
- Remove the instanceID field from generateResourceOption and all related logic (instance-id labels, scheduling.k8s.io/group-name annotations)
- Add cleanup logic for legacy indexed LWS and PodGroup resources (e.g. name-0, name-1)
- Clean up verbose comments in reconcileLeaderWorkerSetResources
- Update the installation guide with the LWS >= v0.7.0 version requirement and the --set gangScheduling=volcano Helm value for Volcano integration
- Update unit tests to verify single LWS creation with Replicas=N

Where should the reviewer start?

- deploy/operator/internal/controller/dynamocomponentdeployment_controller.go: Review reconcileLeaderWorkerSetResources for the main logic change
- deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go: Review updated test cases verifying single LWS creation
- docs/kubernetes/installation-guide.md: Review updated LWS + Volcano installation instructions

Related Issues: Fixes #2214
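To make the before/after shape concrete, here is a minimal sketch of the two patterns described above. It does not use the real LWS API types: `LeaderWorkerSet`, `legacyPerReplica`, and `nativeScaling` are hypothetical stand-ins, and the assumption that each legacy object carried `Replicas: 1` is mine, not taken from the controller code.

```go
package main

import "fmt"

// LeaderWorkerSet is a hypothetical, pared-down stand-in for the real LWS
// API type; only the fields needed for this illustration are modelled.
type LeaderWorkerSet struct {
	Name     string
	Replicas int32
}

// legacyPerReplica models the old pattern: one object per desired replica,
// named baseName-0, baseName-1, and so on. Replicas: 1 is an assumption.
func legacyPerReplica(baseName string, desired int32) []LeaderWorkerSet {
	var out []LeaderWorkerSet
	for i := int32(0); i < desired; i++ {
		out = append(out, LeaderWorkerSet{
			Name:     fmt.Sprintf("%s-%d", baseName, i),
			Replicas: 1,
		})
	}
	return out
}

// nativeScaling models the new pattern: a single object whose Replicas
// field carries the desired count.
func nativeScaling(baseName string, desired int32) LeaderWorkerSet {
	return LeaderWorkerSet{Name: baseName, Replicas: desired}
}

func main() {
	fmt.Println(legacyPerReplica("test-component", 3))
	fmt.Println(nativeScaling("test-component", 3))
}
```

With native scaling the number of objects the controller owns no longer grows with the replica count, which is why the per-index helpers listed above become unnecessary.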

copy-pr-bot · 2026-01-15T23:53:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-01-15T23:53:04Z

🌿 Fern Docs Preview: https://nvidia-preview-d07e4a67-4ef7-43f7-b8a0-816a97bc2c86.docs.buildwithfern.com/dynamo/dev

coderabbitai · 2026-01-31T00:14:02Z

Walkthrough

The reconciliation logic for DynamoComponentDeployment is refactored to create a single LeaderWorkerSet resource with natively scaled replicas, replacing the previous pattern of instantiating multiple per-replica LeaderWorkerSets. Legacy per-replica resource cleanup is introduced, and readiness determination is simplified to assess a single consolidated LeaderWorkerSet.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Controller Logic Refactoring `deploy/operator/internal/controller/dynamocomponentdeployment_controller.go` | Reconciler now creates a single LeaderWorkerSet with native Replicas instead of multiple per-replica instances. Functions updated to support nil instanceID defaulting to 0. Legacy resource cleanup added to remove indexed LeaderWorkerSets (baseName-0, baseName-1, etc.) and corresponding PodGroups. Readiness checks adjusted to evaluate single LWS status. Per-index resource generation logic simplified. |
| Test Updates `deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go` | Test cases renamed to reflect native scaling approach ("nil instanceID - success (native scaling)"). Expected outputs for nil instanceID expanded from nil to fully populated PodGroup/LeaderWorkerSet with instance-id label "0". Test scaffolding enhanced with mock Docker secret retriever and service account setup. Reconciliation tests updated with new readiness semantics and component naming conventions. Deployment strategy tests extended to cover annotation handling. |
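The legacy cleanup summarized in the table needs some way to recognize indexed names such as baseName-0 and baseName-1. The following is a rough sketch of one way to do that with plain string handling; `isLegacyIndexedName` is a hypothetical helper and is not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// isLegacyIndexedName reports whether name has the form "<baseName>-<index>",
// where index is one or more decimal digits (e.g. "name-0", "name-12").
// This helper is hypothetical and only illustrates the naming convention.
func isLegacyIndexedName(baseName, name string) bool {
	prefix := baseName + "-"
	if !strings.HasPrefix(name, prefix) {
		return false
	}
	suffix := strings.TrimPrefix(name, prefix)
	if suffix == "" {
		return false
	}
	for _, r := range suffix {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

func main() {
	for _, n := range []string{"name", "name-0", "name-1", "name-worker", "other-0"} {
		fmt.Printf("%s legacy=%v\n", n, isLegacyIndexedName("name", n))
	}
}
```

A name match alone cannot distinguish a legacy object from an unrelated resource that happens to be called name-1, so a check like this would presumably be combined with an ownership check before anything is deleted.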

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 One leader, many workers in harmony now,
No more per-replica chaos, just native scaling's vow,
Legacy ghosts swept clean from the kingdom's dust,
A consolidated dream where all pods trust,
The scheduler's burden lifted light.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Changes successfully address issue `#2214` by refactoring to use single LWS with native scaling, fixing the worker blocking issue caused by per-replica resource creation. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to LWS integration refactoring: controller logic, helper functions, cleanup logic, and corresponding test updates. |
| Title check | ✅ Passed | The title 'fix: refactor LWS to use native scaling (`#2214`)' clearly summarizes the main change: migrating from per-replica LWS instances to native scaling with a single LWS resource. |
| Description check | ✅ Passed | The pull request description is well-structured and covers all required sections with comprehensive details about the changes, objectives, and files for review. |


rmccorm4 · 2026-03-18T02:10:57Z

Hi @kruthiwusirika5, thanks for the contribution!

Can you help fix merge conflicts and update the PR to be based off latest main branch?

After rebase, @tmonty12 @julienmancuso can you help review?

tmonty12

Thank you for the contribution! Please address comments

kruthiwusirika5 · 2026-03-20T02:59:07Z

Hi @tmonty12, I've addressed all your review feedback. Could you please take another look when you get a chance? Here's a summary of the changes:

- Removed generateVolcanoPodGroup, the SchedulerNameVolcano constant, and the combineLWSReplicaStatuses helper
- Updated the installation guide with --set gangScheduling=volcano and a link to the LWS gang scheduling docs
- Added a note on the LWS >= v0.7.0 version requirement
- Cleaned up verbose comments in reconcileLeaderWorkerSetResources, which now uses a single LWS with native Spec.Replicas scaling
- Removed instanceID from generateResourceOption and all related labels/annotations

Ready for re-review. Thanks!

kruthiwusirika5 · 2026-03-24T18:43:42Z

Hi @tmonty12, I've addressed all your review feedback. Could you kindly take another look when you get a chance?

…i-dynamo#2214)

- Replace per-replica LWS loop with a single LeaderWorkerSet using native Spec.Replicas for scaling
- Remove generateVolcanoPodGroup function and its tests (no longer needed)
- Remove instanceID field from generateResourceOption and all related logic
- Remove instance-id labels, scheduling.k8s.io/group-name annotations, and volcano scheduler from LWS pod templates
- Remove unused combineLWSReplicaStatuses, lwsInstanceName functions
- Add legacy resource cleanup for indexed LWS/PodGroup resources
- Update installation guide with LWS >= v0.7.0 requirement and gangScheduling=volcano helm value for Volcano integration

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>

kruthiwusirika5 · 2026-04-27T20:03:05Z

@tmonty12 / @julienmancuso / @rmccorm4 friendly bump — this has been ready for re-review for ~5 weeks

rmccorm4 · 2026-04-27T20:25:38Z

/ok to test 6d65a6d

- Drop the now-unused kubeName parameter from generateLeaderPodTemplateSpec and generateWorkerPodTemplateSpec. The argument became obsolete after the move to native LWS scaling; the only caller was generateLeaderWorkerSet.
- gofmt -s: collapse a stray blank line before FinalizeResource.

Both surfaced once the typecheck error in the previous CI run was fixed.

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>

kruthiwusirika5 · 2026-04-28T22:44:50Z

Pushed 6f5db88: follow-up lint fixes that surfaced once the previous typecheck error cleared:

- gofmt -s on controller.go:620 (stray blank line)
- Removed the dead kubeName parameter from generateLeaderPodTemplateSpec and generateWorkerPodTemplateSpec (only one caller each, both updated)

@tmonty12 / @rmccorm4, could one of you run /ok to test 6f5db88 when convenient? Local build + vet + gofmt are clean.

tmonty12 · 2026-04-28T23:54:31Z

/ok to test 6f5db88

…aling

The existing fixtures pre-populated multiple per-replica LWS objects (test-component-0/-1/-2) and asserted ComponentName="test-component-0" with a multi-element ComponentNames list. That matched the pre-refactor per-replica architecture; the current code creates a single LWS named after the DCD with Spec.Replicas=N.

Replace each fixture with a single LWS named "test-component" carrying the desired Status, and update wantComponentReconcileResult to use the single-LWS reason/message strings ("LeaderWorkerSetReady" / "LeaderWorkerSet is not ready") and ComponentNames=["test-component"].

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
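The shift in readiness semantics that these fixtures reflect can be sketched as follows. This is an illustration only: `lwsStatus`, `allLegacyReady`, and `singleReady` are hypothetical names, and the rule "ready when ReadyReplicas reaches the desired count" is an assumption rather than the controller's actual criterion.

```go
package main

import "fmt"

// lwsStatus is a hypothetical, pared-down view of a LeaderWorkerSet status.
type lwsStatus struct {
	ReadyReplicas int32
}

// allLegacyReady models the pre-refactor shape: one status per indexed
// object, all of which had to be combined into a single verdict.
func allLegacyReady(statuses []lwsStatus) bool {
	if len(statuses) == 0 {
		return false
	}
	for _, s := range statuses {
		if s.ReadyReplicas < 1 {
			return false
		}
	}
	return true
}

// singleReady models the post-refactor shape: one status on one object,
// compared against the desired replica count (assumed criterion).
func singleReady(desired int32, s lwsStatus) bool {
	return s.ReadyReplicas >= desired
}

func main() {
	legacy := []lwsStatus{{ReadyReplicas: 1}, {ReadyReplicas: 0}, {ReadyReplicas: 1}}
	fmt.Println("legacy ready:", allLegacyReady(legacy))
	fmt.Println("single ready:", singleReady(3, lwsStatus{ReadyReplicas: 2}))
}
```

Because there is only one object to inspect, a test fixture needs a single LWS with the desired Status instead of one fixture object per replica.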

kruthiwusirika5 · 2026-04-29T02:50:37Z

Pushed 5baae1c — Test_reconcileLeaderWorkerSetResources was still asserting the pre-refactor per-replica naming (test-component-0/-1/-2). Collapsed each case's fixture to a single LWS reflecting the native-scaling architecture and updated the expected reason/message strings.

Local: full unit-test sweep green, gofmt -s and go vet clean.

@tmonty12 / @rmccorm4 — could one of you run /ok to test 5baae1c when convenient?

tmonty12 · 2026-04-29T16:51:27Z

/ok to test 5baae1c

kruthiwusirika5 · 2026-04-29T17:44:57Z

@tmonty12 — CI is fully green on 5baae1c. When you have a moment, would appreciate a final pass.

tmonty12

Overall looks good. Can you address the error-swallowing comment? After that, the PR will be approved/merged.

tmonty12 · 2026-04-30T01:24:19Z

/ok to test d01d193

@tmonty12

…nciler requeues

Address @tmonty12's review feedback on PR ai-dynamo#5468: the legacy resource cleanup in `reconcileLeaderWorkerSetResources` previously logged and swallowed errors from `r.List` and `r.Delete`, which meant a transient API server hiccup during cleanup would silently leave legacy per-replica LWS / PodGroup resources running while the reconcile completed "successfully" and was not requeued.

Return wrapped errors instead so the controller-runtime reconcile loop re-enqueues the request and we get another shot at cleanup. The ownership-and-not-found guards remain in place, so we still skip resources we don't own and treat already-deleted objects as a no-op.

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>

kruthiwusirika5 · 2026-04-30T03:21:34Z

Thanks @tmonty12! Just pushed 29194c8 which:

- Returns wrapped errors from the legacy LWS list and the legacy PodGroup list/delete instead of logging and swallowing them, so the reconciler will now requeue on cleanup failures.
- Preserves the ownership and IsNotFound guards, so we still skip resources we don't own and treat already-deleted objects as a no-op.

Ready for another look when you have a moment.

tmonty12 · 2026-04-30T03:28:12Z

/ok to test 29194c8

tmonty12

Thanks for the contribution!

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

kruthiwusirika5 requested a review from a team as a code owner January 15, 2026 23:52

pull-request-size Bot added the size/XL label Jan 15, 2026

github-actions Bot added fix external-contribution Pull request is from an external contributor labels Jan 15, 2026

github-actions Bot added the deployment::k8s Relates to dynamo deployment in kubernetes label Jan 31, 2026

kruthiwusirika5 changed the title ~~fix: refactor LWS integration to use native scaling (Fixes #2214)~~ Fix/lws native scaling 2214 v2 Feb 19, 2026

tmonty12 reviewed Mar 18, 2026

tmonty12 mentioned this pull request Mar 18, 2026

fix: refactor LeaderWorkerSet to native scaling (fixes #2214) #6417

Closed

github-actions Bot added the documentation Improvements or additions to documentation label Mar 20, 2026

kruthiwusirika5 force-pushed the main branch from 5d4cb4b to 18e1416 Compare March 20, 2026 02:49

pull-request-size Bot added size/L and removed size/XL labels Mar 20, 2026

kruthiwusirika5 force-pushed the main branch from 18e1416 to d15cc74 Compare March 20, 2026 03:02

kruthiwusirika5 changed the title ~~Fix/lws native scaling 2214 v2~~ fix: refactor LWS to use native scaling (#2214) Mar 20, 2026

kruthiwusirika5 requested a review from tmonty12 March 24, 2026 18:43

kruthiwusirika5 force-pushed the main branch from a1d1923 to ce9721e Compare April 22, 2026 17:40

Merge branch 'main' into main

6d65a6d

copy-pr-bot Bot temporarily deployed to GITLAB April 27, 2026 20:25 Inactive

rmccorm4 enabled auto-merge (squash) April 27, 2026 20:27

copy-pr-bot Bot temporarily deployed to GITLAB April 27, 2026 23:19 Inactive

tmonty12 reviewed Apr 28, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 28, 2026 21:59 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 28, 2026 23:54 Inactive

pull-request-size Bot added size/XL and removed size/L labels Apr 29, 2026

copy-pr-bot Bot temporarily deployed to GITLAB April 29, 2026 16:51 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 29, 2026 16:52 Inactive

Merge branch 'ai-dynamo:main' into main

d01d193

tmonty12 reviewed Apr 30, 2026

Comment thread deploy/operator/internal/controller/dynamocomponentdeployment_controller.go

Comment thread deploy/operator/internal/controller/dynamocomponentdeployment_controller.go

copy-pr-bot Bot temporarily deployed to GITLAB April 30, 2026 01:24 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 30, 2026 02:31 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 30, 2026 03:28 Inactive

copy-pr-bot Bot temporarily deployed to GITLAB April 30, 2026 03:33 Inactive

tmonty12 approved these changes Apr 30, 2026

tmonty12 enabled auto-merge (squash) April 30, 2026 03:48

tmonty12 merged commit 28a856c into ai-dynamo:main Apr 30, 2026
58 checks passed

keivenchang mentioned this pull request Apr 30, 2026

test(revalidate): fix: refactor LWS to use native scaling (#2214) #5468 #8891

Closed

furionw pushed a commit that referenced this pull request May 2, 2026

fix: refactor LWS to use native scaling (#2214) (#5468)

9b52b3b

Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

This was referenced May 13, 2026

[BUG]: dynamo operator lws worker not serving while occupy GPUs #2214

Closed

fix(operator): avoid LWS service name collision #9612

Merged
