
feat: Fix CPU offloading + add options for FSDP offload and expandable segments #122

Closed

yfw wants to merge 27 commits into main from yifu/cpu_offload

Conversation

@yfw (Contributor) commented Apr 2, 2025

What does this PR do?

Addresses #33 (FSDP1) and #67 by:

  1. Fixing the CPU offloading implementation so that the HFPolicyWorker's GPU memory usage is close to zero during vLLM generation.
  2. Adding an option to use FSDP's built-in CPU offloading. With FSDP CPU offloading enabled, only the forward and backward passes run on the GPU; everything else, including the optimizer step, runs on the CPU.
  3. Adding the ability to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for the HFPolicy.
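The "manual" offloading fix (item 1) can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual implementation; `offload_model_to_cpu` is a hypothetical helper name.

```python
import torch
import torch.nn as nn


def offload_model_to_cpu(model: nn.Module) -> None:
    """Illustrative sketch of "manual" CPU offloading: move parameters,
    buffers, and gradients to CPU so the caching allocator can release
    GPU memory (e.g. while vLLM owns the GPU for generation)."""
    model.to("cpu")  # moves parameters and buffers
    for p in model.parameters():
        if p.grad is not None:
            # Module.to() does not move existing .grad tensors
            p.grad = p.grad.to("cpu")
    if torch.cuda.is_available():
        # Return freed blocks to the driver so other processes can use them
        torch.cuda.empty_cache()
```

Before the next training step, the model would be moved back with `model.to("cuda")`; the key point is that between steps the worker holds essentially no GPU memory.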

To test the impact of these settings, I did a sweep across different models, context lengths, and offload types. The results are here: https://docs.google.com/spreadsheets/d/1lWtw6-jbq4TAlM5Bu5Z3jNQjwAa_CEUX3bsKuMXHEkM/edit?usp=sharing

Offload Types:

  • Main: the previous implementation on the main branch
  • Manual: "fixed" CPU offloading with the changes in this PR; this will be the default setting after this PR
  • FSDP: FSDP's built-in CPU offloading
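For reference, FSDP's built-in CPU offloading (the "FSDP" offload type) is enabled through PyTorch's `CPUOffload` config. A minimal sketch, not the PR's exact wiring:

```python
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def wrap_with_cpu_offload(model: nn.Module) -> FSDP:
    # With offload_params=True, FSDP keeps parameters (and hence
    # gradients and optimizer state) on CPU; shards are copied to the
    # GPU only for the forward and backward passes.
    return FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```

Note that actually wrapping a model this way requires an initialized process group; the function above is only a sketch of where the option plugs in.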

Key Takeaways:

  • "Manual" offloading generally maintains or improves step time at lower allocated and reserved memory when compared with "Main".
  • FSDP CPU offloading generally incurs some overhead in Step 1 and Step 3 times when compared to "Manual" offloading.
    • An exception is 8B at 7500 sequence length (row 23), which is the longest sequence length I tested without OOMing for "Manual" offloading.
  • Using expandable_segments keeps reserved memory closer to allocated memory, but this is most evident with the "Main" offload type; reserved memory is generally low for the "Manual" and FSDP offload types.
    • One case where this does make a difference for the "Manual" offload type is Llama3.1-8B at 7500 context length, which OOMs without expandable_segments but can run (slowly) with expandable_segments (rows 22 and 23).
  • For the 8k context length and 8B model (#67: FSDP1 memory usage still too high (8B on 8k seqlen not fitting)), FSDP CPU offloading is the only configuration that runs without OOMing (rows 30 and 31).
  • Enabling expandable_segments also incurs some step-time overhead, most evident in the SFT case for 8k context with the 8B model (rows 50-55), where expandable_segments runs ~3x slower than without it.
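For context on the expandable_segments option discussed above: it is controlled through the `PYTORCH_CUDA_ALLOC_CONF` environment variable, which must be set before the process first initializes CUDA. Setting it in a worker's environment might look like:

```python
import os

# Must be set before the first CUDA initialization in the process;
# the PR exposes this as an option for the HFPolicy worker.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```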

Issues

#33 (FSDP1 part)
#67

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
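The usage snippet above was left unfilled in the PR. Purely as an illustration, enabling the new options might look like the following; the option names here are hypothetical and not taken from the repo's actual config schema:

```python
# Hypothetical option names -- the real keys live in the repo's
# config schema and may differ.
policy_overrides = {
    "fsdp_cpu_offload": True,       # use FSDP's built-in CPU offloading
    "expandable_segments": True,    # sets PYTORCH_CUDA_ALLOC_CONF
}
```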

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@yfw yfw requested a review from terrykong April 2, 2025 21:11
@yfw yfw changed the title feat: Fix CPU offloading and add options for FSDP cpu offload and expandable segments feat: Fix CPU offloading + add options for FSDP offload and expandable segments Apr 2, 2025
yfw and others added 27 commits April 2, 2025 14:25
@yfw yfw force-pushed the yifu/cpu_offload branch from e7c6d7f to 97c5e1b Compare April 2, 2025 21:29
@github-actions github-actions Bot added Documentation Improvements or additions to documentation CI Relating to CI labels Apr 2, 2025
@yfw (Contributor, Author) commented Apr 2, 2025

Closing for #123

@yfw yfw closed this Apr 2, 2025
Labels: CI (Relating to CI), Documentation (Improvements or additions to documentation)