
Avoid registering pytree when using FSDP #39325 (Closed)

kaixuanliu wants to merge 4 commits into huggingface:main from kaixuanliu:traceable-cache-fsdp

Conversation

@kaixuanliu (Contributor)

When using FSDP, this register_pytree_node operation costs a lot of extra memory. We found that after this PR: #35873, we can no longer fine-tune a 70B model with FSDP due to an OOM issue.

@kaixuanliu (Contributor, Author)

@SunMarc @ArthurZucker @IlyasMoutawwakil please help review, thanks!

@IlyasMoutawwakil (Member) commented Jul 10, 2025

@kaixuanliu do you mean this PR, #36311, where it was added?
Do you have exact measurements of how much it costs with and without FSDP? The operation itself has nothing to do with FSDP.

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@kaixuanliu (Contributor, Author)

@IlyasMoutawwakil, oh yes, it's #36311, not #35873. As for the extra memory, I ran an experiment: I used 4 processes to do FSDP fine-tuning of the llama2-7b model and compared the maximum memory consumption of the two configurations. The results show that with the register_pytree_node operation, each card uses about 12 GB of extra memory.

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@SunMarc (Member) left a comment


Thanks for discovering that! Left a comment.


```diff
-if is_torch_greater_or_equal("2.3"):
+# Register pytree node for DynamicCache if torch version is >= 2.3 and FSDP is not imported,
+# FSDP will need more extra memory when using pytree node
+if is_torch_greater_or_equal("2.3") and "torch.distributed.fsdp" not in sys.modules:
```
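The proposed guard can be sketched as a small stdlib-only helper; `is_torch_greater_or_equal` is stubbed here (the real helper lives in transformers), so only the `sys.modules` check is exercised:

```python
import sys


def is_torch_greater_or_equal(version: str) -> bool:
    # Stub standing in for transformers' real version helper;
    # assumed True here purely for illustration.
    return True


def should_register_dynamic_cache_pytree() -> bool:
    """Mirror of the proposed guard: only register the pytree node when
    torch >= 2.3 and torch.distributed.fsdp has not been imported."""
    return (
        is_torch_greater_or_equal("2.3")
        and "torch.distributed.fsdp" not in sys.modules
    )
```

Note the fragility discussed below: the check depends on import order, since FSDP imported *after* this module would not be seen.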
Member

Can we register the pytree node for DynamicCache somewhere else, so that we can perform a better check than just testing whether "torch.distributed.fsdp" is in sys.modules? cc @IlyasMoutawwakil @gante
We also have the is_fsdp_enabled function in modeling utils that could be used to perform the check.
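For reference, `is_fsdp_enabled` in transformers' modeling utils consults torch.distributed state together with the `ACCELERATE_USE_FSDP` environment variable set by accelerate. A stdlib-only approximation (the distributed state is passed in as a parameter here, and the exact condition set should be treated as an assumption):

```python
import os


def strtobool(val: str) -> int:
    # Tiny stand-in for distutils.util.strtobool (removed in Python 3.12).
    return 1 if val.strip().lower() in ("y", "yes", "t", "true", "on", "1") else 0


def is_fsdp_enabled_sketch(dist_initialized: bool) -> bool:
    """Approximation of transformers' is_fsdp_enabled: the real helper
    queries torch.distributed directly; here that state is a parameter
    so the sketch stays stdlib-only."""
    return dist_initialized and strtobool(os.environ.get("ACCELERATE_USE_FSDP", "False")) == 1
```

One caveat with this route: the env var is only set once an accelerate FSDP run is configured, so the check would have to run later than import time.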

Contributor Author

I ran more experiments on both A100 and XPU, and found that this is an XPU-specific issue: it occurs consistently on XPU but not on A100 (I will try to figure out what is happening on XPU later). Based on this finding, just checking whether torch.distributed.fsdp is in sys.modules does not seem like a good choice. Can we add an environment variable here to let people disable the register_pytree_node action for DynamicCache? WDYT? @SunMarc @gante @IlyasMoutawwakil
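The environment-variable opt-out being proposed could look like the following; note that `TRANSFORMERS_NO_DYNAMIC_CACHE_PYTREE` is a hypothetical name for illustration, not an existing transformers variable:

```python
import os


def pytree_registration_enabled() -> bool:
    """Registration stays on by default; users opt out explicitly via an
    environment variable (name is hypothetical, not part of transformers)."""
    flag = os.environ.get("TRANSFORMERS_NO_DYNAMIC_CACHE_PYTREE", "0")
    return flag.strip().lower() not in ("1", "true", "yes")
```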

Member

Why not disable it for XPU only, until XPU fixes it?

Contributor Author

I suppose this piece of code is meant to support model export/tracing; XPU also needs this.

@IlyasMoutawwakil (Member) commented Jul 14, 2025

Can we move it to the dynamic cache's init, with some registration check? A user will only export/compile it if it was already instantiated.
And there we skip registration for XPU + FSDP.

@kaixuanliu (Contributor, Author)

After consideration, I think we actually do not need the KV cache at all during the fine-tuning/training stage. So I will close this PR and set use_cache=False explicitly in the application code.
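A minimal sketch of the chosen workaround; `DummyModel` is a stand-in for a transformers causal LM (real code would set `use_cache=False` on the model config or pass it to the forward call):

```python
class DummyModel:
    """Stand-in for a transformers model, used only to show where
    use_cache=False is applied; not a real transformers class."""

    def __init__(self):
        self.use_cache = True  # transformers exposes this on model.config

    def forward(self, input_ids, use_cache=None):
        effective = self.use_cache if use_cache is None else use_cache
        # A real model returns past_key_values only when caching is enabled.
        return {"past_key_values": tuple() if effective else None}


model = DummyModel()
out = model.forward([1, 2, 3], use_cache=False)  # per-call override for training
```

With caching disabled, no past key/value tensors are retained between forward passes, which is what avoids the extra memory during training, where generation-style caching is not needed anyway.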


3 participants