fix: refactor LWS to use native scaling (#2214)#5468
Conversation
WalkthroughThe reconciliation logic for DynamoComponentDeployment is refactored to create a single LeaderWorkerSet resource with natively scaled replicas, replacing the previous pattern of instantiating multiple per-replica LeaderWorkerSets. Legacy per-replica resource cleanup is introduced, and readiness determination is simplified to assess a single consolidated LeaderWorkerSet. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Hi @kruthiwusirika5, thanks for the contribution! Can you help fix merge conflicts and update the PR to be based off latest After rebase, @tmonty12 @julienmancuso can you help review? |
tmonty12
left a comment
There was a problem hiding this comment.
Thank you for the contribution! Please address comments
|
Hi @tmonty12, I've addressed all your review feedback. Could you please take another look when you get a chance? Here's a summary of the changes: Removed generateVolcanoPodGroup, SchedulerNameVolcano constant, and combineLWSReplicaStatuses helper |
|
Hi @tmonty12, I've addressed all your review feedback. Kindly take another look when you get a chance? |
…i-dynamo#2214) - Replace per-replica LWS loop with a single LeaderWorkerSet using native Spec.Replicas for scaling - Remove generateVolcanoPodGroup function and its tests (no longer needed) - Remove instanceID field from generateResourceOption and all related logic - Remove instance-id labels, scheduling.k8s.io/group-name annotations, and volcano scheduler from LWS pod templates - Remove unused combineLWSReplicaStatuses, lwsInstanceName functions - Add legacy resource cleanup for indexed LWS/PodGroup resources - Update installation guide with LWS >= v0.7.0 requirement and gangScheduling=volcano helm value for Volcano integration Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
|
@tmonty12 / @julienmancuso / @rmccorm4 friendly bump — this has been ready for re-review for ~5 weeks |
|
/ok to test 6d65a6d |
- Drop the now-unused kubeName parameter from generateLeaderPodTemplateSpec and generateWorkerPodTemplateSpec. The argument became obsolete after the move to native LWS scaling; the only caller was generateLeaderWorkerSet. - gofmt -s: collapse a stray blank line before FinalizeResource. Both surfaced once the typecheck error in the previous CI run was fixed. Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
|
Pushed 6f5db88 — three follow-up lint fixes that surfaced once the previous typecheck error cleared: gofmt -s on controller.go:620 (stray blank line) |
|
/ok to test 6f5db88 |
…aling
The existing fixtures pre-populated multiple per-replica LWS objects
(test-component-0/-1/-2) and asserted ComponentName="test-component-0"
with a multi-element ComponentNames list. That matched the pre-refactor
per-replica architecture; the current code creates a single LWS named
after the DCD with Spec.Replicas=N.
Replace each fixture with a single LWS named "test-component" carrying
the desired Status, and update wantComponentReconcileResult to use the
single-LWS reason/message strings ("LeaderWorkerSetReady" /
"LeaderWorkerSet is not ready") and ComponentNames=["test-component"].
Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
|
Pushed 5baae1c — Test_reconcileLeaderWorkerSetResources was still asserting the pre-refactor per-replica naming (test-component-0/-1/-2). Collapsed each case's fixture to a single LWS reflecting the native-scaling architecture and updated the expected reason/message strings. Local: full unit-test sweep green, gofmt -s and go vet clean. @tmonty12 / @rmccorm4 — could one of you run /ok to test 5baae1c when convenient? |
|
/ok to test 5baae1c |
tmonty12
left a comment
There was a problem hiding this comment.
Overall looks good, can you address the error swallowing comment and then the PR will be approved/merged.
|
/ok to test d01d193 |
…nciler requeues Address @tmonty12's review feedback on PR ai-dynamo#5468: the legacy resource cleanup in `reconcileLeaderWorkerSetResources` previously logged and swallowed errors from `r.List` and `r.Delete`, which meant a transient API server hiccup during cleanup would silently leave legacy per-replica LWS / PodGroup resources running while the reconcile completed "successfully" and was not requeued. Return wrapped errors instead so the controller-runtime reconcile loop re-enqueues the request and we get another shot at cleanup. The ownership-and-not-found guards remain in place, so we still skip resources we don't own and treat already-deleted objects as a no-op. Signed-off-by: kruthiwusirika5 <kruthiwusirika@gmail.com>
|
Thanks @tmonty12! Just pushed 29194c8 which: Returns wrapped errors from the legacy LWS list and the legacy PodGroup list/delete instead of logging and swallowing — the reconciler will now requeue on cleanup failures. |
|
/ok to test 29194c8 |
tmonty12
left a comment
There was a problem hiding this comment.
Thanks for the contribution!
Overview:
Refactors the DynamoComponentDeployment controller to utilize the native scaling capabilities of the LeaderWorkerSet (LWS) API, addressing inefficient resource creation.
Details:
Previously, the controller created one LeaderWorkerSet object per desired replica. This change modifies the logic to:
Create a single LeaderWorkerSet object with Spec.Replicas set to the desired count
Remove generateVolcanoPodGroup function, SchedulerNameVolcano constant, and combineLWSReplicaStatuses helper (no longer needed)
Remove instanceID field from generateResourceOption and all related logic (instance-id labels, scheduling.k8s.io/group-name annotations)
Add cleanup logic for legacy indexed LWS and PodGroup resources (e.g. name-0, name-1)
Clean up verbose comments in reconcileLeaderWorkerSetResources
Update installation guide with LWS >= v0.7.0 version requirement and --set gangScheduling=volcano Helm value for Volcano integration
Update unit tests to verify single LWS creation with Replicas=N
Where should the reviewer start?
deploy/operator/internal/controller/dynamocomponentdeployment_controller.go: Review reconcileLeaderWorkerSetResources for the main logic change
deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go: Review updated test cases verifying single LWS creation
docs/kubernetes/installation-guide.md: Review updated LWS + Volcano installation instructions
Related Issues: Fixes #2214