Default synced_gpus to True when using FullyShardedDataParallel #33483
Conversation
Fixes huggingface#30228

Related:
* pytorch/pytorch#100069
* pytorch/pytorch#123962

Similar to DeepSpeed ZeRO Stage 3, when using FSDP with multiple GPUs and differently sized data per rank, the ranks reach different synchronization points at the same time, leading to deadlock.

To avoid this, we can automatically set `synced_gpus` to `True` if we detect that a `PreTrainedModel` is being managed by FSDP using `_is_fsdp_managed_module`, which was added in 2.0.0 for `torch.compile`: https://github.com/pytorch/pytorch/blob/v2.0.0/torch/distributed/fsdp/_dynamo_utils.py
I should also mention that my script fails when using
The CI failures seem to be from flaky tests. This should be good for review!
Hey @gante and @ArthurZucker, just following up. Do you think you will be able to review this?
@SunMarc and @muellerzr could you take a look here please? |
ArthurZucker left a comment
Looks great to me! I am just wondering if we can add a small test. Fine to merge first, as this looks like it would affect a lot of users!
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Before huggingface#33483, these tests would have hung for 10 minutes before crashing due to a timeout error
Okay, I gave it a stab based on these files + the example in the description. I added a test for FSDP and FSDP2. https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer_distributed.py
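For readers unfamiliar with how these distributed tests are structured, here is a rough sketch of the torchrun-launcher pattern from `tests/trainer/test_trainer_distributed.py` that such tests follow. The class, test, and worker-script names below are illustrative placeholders, not the code added by this PR.

```python
# Sketch of the torchrun-launcher pattern used by transformers' distributed tests,
# modeled on tests/trainer/test_trainer_distributed.py. The class, test, and script
# names here are illustrative placeholders, not the code added by this PR.
import torch

from transformers.testing_utils import (
    TestCasePlus,
    execute_subprocess_async,
    get_torch_dist_unique_port,
    require_torch_multi_gpu,
)


class FSDPGenerateLauncherTest(TestCasePlus):
    @require_torch_multi_gpu
    def test_fsdp_generate_does_not_deadlock(self):
        # Launch a worker script on every available GPU; if generate() deadlocks
        # (the pre-fix behaviour), the subprocess times out and the test fails.
        distributed_args = (
            f"--nproc_per_node={torch.cuda.device_count()} "
            f"--master_port={get_torch_dist_unique_port()} "
            f"{self.test_file_dir}/fsdp_generate_worker.py"  # hypothetical worker script
        ).split()
        cmd = ["torchrun"] + distributed_args
        execute_subprocess_async(cmd, env=self.get_env())
```

The launcher test only drives `torchrun`; the actual FSDP wrapping and `generate()` call live in the worker script, so a hang in any rank surfaces as a subprocess timeout.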
I think this might cause more problems if one of the workers was killed
Thanks for adding this! This is very helpful! I left a comment. Can you have a second look, @muellerzr?

Also, I think that we can create a doc to explain how to perform generate when the model is initialized under FSDP or DeepSpeed! I saw a lot of users being confused about this.

Also, if you want to dig deeper, it would be nice to have a comparison with DDP/DeepSpeed like the one done here: huggingface/trl#1483 (comment)
```python
def is_fsdp_managed_module(module: nn.Module) -> bool:
    return isinstance(module, torch.distributed.fsdp.FullyShardedDataParallel) or getattr(
        module, "_is_fsdp_managed_module", False
    )
```
Can't we use `is_fsdp_enabled`, just like how we do it for DeepSpeed? I guess users will use FSDP in Trainer + accelerate, so that should work. But I understand that it won't work with your general script. cc @muellerzr
> But I understand that it won't work with your general script
Right, since `is_fsdp_enabled` is a `Trainer` attribute, it would only work if you are using a `Trainer`, whereas this function works in all scenarios, including when using a `Trainer`.
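To make the distinction concrete, here is a small illustrative sketch of how a module-level check lets the default be decided from the model object alone, with no Trainer or accelerate state. The helper name `resolve_synced_gpus` and its exact defaulting logic are placeholders for illustration, not the literal change made to `generate()` in this PR, and it assumes a transformers version that includes `transformers.integrations.fsdp`.

```python
# Illustrative only: the helper name and exact defaulting logic are placeholders,
# not the literal change made to generate() in this PR.
from typing import Optional

import torch.nn as nn

from transformers.integrations.fsdp import is_fsdp_managed_module


def resolve_synced_gpus(model: nn.Module, synced_gpus: Optional[bool] = None) -> bool:
    if synced_gpus is not None:
        return synced_gpus  # an explicit user choice always wins
    # Works on any nn.Module: True if the module is itself an FSDP instance or has
    # been annotated as FSDP-managed, regardless of whether a Trainer is involved.
    return is_fsdp_managed_module(model)
```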
muellerzr left a comment
Thanks! These tests look great!
Hi, I found what may be a bug in the Trainer FSDP init when I use pytest to run the test code. My package versions are:
@YeLuoSuiYou Seems like it was due to the logic added in PR #34032
Thanks a lot! Maybe I should wait for the next release to fix it? By the way, thanks a lot for the functionality provided by this PR!
I commented out this function and the test works well now, thanks!
Default synced_gpus to True when using FullyShardedDataParallel (huggingface#33483)

* Default synced_gpus to True when using FullyShardedDataParallel
* Remove test file
* ruff formatting
* ruff format
* Update copyright year
* Add test for FSDP-wrapped model generation
* Ruff format
* Move argparse import
* Remove barrier
* Move import into function to decrease load time (huggingface#33483 (comment))
* Add test for accelerate and Trainer (huggingface#33483 (comment))
* Refactor imports
* Ruff format
* Use nullcontext

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

What does this PR do?
Fixes #30228
Related:
* pytorch/pytorch#100069
* pytorch/pytorch#123962
Similar to DeepSpeed ZeRO Stage 3, when using FSDP with multiple GPUs and differently sized data per rank, the ranks reach different synchronization points at the same time, leading to deadlock.
To avoid this, we can automatically set `synced_gpus` to `True` if we detect that a `PreTrainedModel` is being managed by FSDP using `_is_fsdp_managed_module`, which was added in 2.0.0 for `torch.compile`: https://github.com/pytorch/pytorch/blob/v2.0.0/torch/distributed/fsdp/_dynamo_utils.py

To facilitate this, I created a module called `transformers.integrations.fsdp` containing the function `is_fsdp_managed_module`, which returns `True` if a `Module` has `_is_fsdp_managed_module` set to `True` on it or if the `Module` itself is a `FullyShardedDataParallel` instance.

Here is the script I used to test my fix:
```bash
OMP_NUM_THREADS=2 \
TOKENIZERS_PARALLELISM=false \
CUDA_VISIBLE_DEVICES=6,7 \
torchrun \
    --rdzv-backend=c10d \
    --rdzv-endpoint=localhost:0 \
    --nnodes=1 \
    --nproc-per-node=2 \
    fsdp_generate.py
```

fsdp_generate.py
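The collapsed `fsdp_generate.py` is not reproduced here. As a stand-in, the following minimal sketch (my own illustration, not the author's script) shows the underlying failure mode with a plain collective: each decoding step of an FSDP-managed model issues collective ops, so if one rank stops stepping before the others, the remaining ranks block forever, and `synced_gpus=True` keeps every rank stepping until all ranks have finished.

```python
# Minimal sketch of the deadlock mechanism (illustration only, not fsdp_generate.py).
# Run with: torchrun --nproc-per-node=2 sketch_synced_loop.py
import torch
import torch.distributed as dist


def decode_steps_for(rank: int) -> int:
    # Stand-in for "differently sized data per rank": rank 0 would finish
    # generating after 3 steps, any other rank after 6.
    return 3 if rank == 0 else 6


def main() -> None:
    dist.init_process_group("gloo")  # gloo so the sketch also runs on CPU-only machines
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    my_steps = decode_steps_for(rank)
    max_steps = max(decode_steps_for(r) for r in range(world_size))

    # Unsynced behaviour (synced_gpus=False): looping only `my_steps` times would let
    # rank 0 exit after 3 iterations while the other ranks block on their 4th all_reduce.
    # Synced behaviour (synced_gpus=True): every rank loops `max_steps` times, so the
    # collectives stay matched; ranks that finished early just run dummy steps.
    for step in range(max_steps):
        finished = step >= my_steps  # this rank's "real" generation is already done
        t = torch.ones(1)
        dist.all_reduce(t)  # stand-in for the all-gathers FSDP issues every forward pass
        _ = finished  # a real decoding loop would stop updating its sequence when finished

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```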
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@gante
@ArthurZucker