fix: Make the optimizer offloading optional #1404
Conversation
ℹ️ File Consistency Check (based on commit 5416b9a, PR #1404)
✅ DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR. Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
📝 Walkthrough: Three policy worker implementations now gate optimizer state movement operations (both device transfers and CPU offloading) to occur only when generation is colocated.
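As a rough illustration of that gating pattern (a minimal sketch; the class, method, and attribute names below are assumptions, not the PR's actual diff):

```python
# Sketch of the colocated-only gating described in the walkthrough.
# PolicyWorker, colocated_inference, and _move_optimizer_state are
# illustrative names, not identifiers taken from the PR.
class PolicyWorker:
    def __init__(self, colocated_inference: bool):
        # True when generation shares GPUs with training (colocated).
        self.colocated_inference = colocated_inference
        self.optimizer_state_device = "cuda"

    def _move_optimizer_state(self, device: str) -> None:
        # Stand-in for moving the optimizer state tensors to `device`.
        self.optimizer_state_device = device

    def prepare_for_lp_inference(self) -> None:
        if self.colocated_inference:
            # Offload only when colocated generation needs the GPU memory.
            self._move_optimizer_state("cpu")

    def prepare_for_training(self) -> None:
        if self.colocated_inference:
            # Mirror the guard so we only reload what was offloaded.
            self._move_optimizer_state("cuda")
```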
I feel it's unrelated to whether it's colocated, because colocated will free vLLM's memory beforehand. I think the reason we offload / on-load is to increase the logprob batch size, as you said. I feel it's better to make "not offloading the optimizer" an option. wdyt? @parthchadha @terrykong
I see. If that is the case, I think logprob might already be able to use a large enough micro-batch size to saturate GPU MFU, because of the memory saved by not doing the backward pass. Of course, I can make this tunable by
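A minimal sketch of an environment-variable switch like the one floated here (the variable name `NRL_OFFLOAD_OPTIMIZER_STATE_FOR_LOGPROB` comes from the PR description below; the parsing helper itself is an assumption):

```python
import os

def offload_optimizer_for_logprob() -> bool:
    # Env var name is from the PR description; this helper is illustrative.
    value = os.environ.get("NRL_OFFLOAD_OPTIMIZER_STATE_FOR_LOGPROB", "False")
    return value.strip().lower() in ("1", "true", "yes")
```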
Force-pushed from ffd544d to 6c80e8a
@youngeunkwon0405 please fix CI tests |
Hi @terrykong, this PR fails in |
@youngeunkwon0405 Could you try updating with |
I already did this once. It was happening repeatedly. |
Hi @terrykong, now it passes the CI. Could I ask for your final review and possibly a merge as well? |
What does this PR do?
This PR removes optimizer state offloading in `policy.prepare_for_lp_inference()` and the corresponding on-loading in `policy.prepare_for_training()`. You could originally turn optimizer state offloading back on by setting `NRL_OFFLOAD_OPTIMIZER_STATE_FOR_LOGPROB=True` (default value is `False`). Changed this to a yaml config following a comment from @guyueh1: now `policy.offload_optimizer_states_for_logprob=true` turns it on.

This will save 3~15 s per step, depending on the model size.
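A hedged sketch of how such an opt-in flag might be consumed (the key name comes from the description above; the surrounding code is an assumption, not the PR's implementation):

```python
def should_offload_optimizer(policy_cfg: dict) -> bool:
    # Key name from the PR description; default False makes offload opt-in,
    # skipping the 3~15 s/step transfer unless explicitly enabled.
    return bool(policy_cfg.get("offload_optimizer_states_for_logprob", False))

# Corresponds to `policy.offload_optimizer_states_for_logprob=true` in YAML.
assert should_offload_optimizer({"offload_optimizer_states_for_logprob": True})
assert not should_offload_optimizer({})
```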
I am not sure why it is required in both colocated and non-colocated cases. Maybe it could allow a larger logprob batch size and thus higher MFU, but I'm not sure it's worth it.
@yuki-97, do you think this change will cause additional memory overhead?
@parthchadha, this is the topic I mentioned today.
@guyueh1, do you think it should be removed in the colocated path as well?
Performance (logprob_inference_prep + training_prep)
Nsys reports (QWEN3 30B, GBS 64) — profile screenshots attached for:
- Current ToT
- This PR