
allow TP to work in ND-parallel with fsdp cpu ram efficient loading #39999

Draft

winglian wants to merge 3 commits into huggingface:main from winglian:tp-with-device-mesh

Conversation

winglian (Collaborator) commented Aug 7, 2025

What does this PR do?

For N-D parallelism, when using FSDP2+TP with cpu_ram_efficient_loading, we have to specify the device_map as "meta" for non-rank0 processes. Additionally, even though the device_mesh already tells us which device the model will ultimately end up on, we don't want to overwrite the device_map, since we've deliberately set it to the meta device.
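As an illustration of what this enables, here is a minimal loading sketch (not part of the PR itself; the checkpoint name and mesh shape are placeholders, and it assumes `from_pretrained` accepts `device_mesh` alongside `tp_plan`, which is the code path this PR touches):

```python
# Illustrative sketch only: FSDP2 + TP loading with cpu_ram_efficient_loading.
# Launch with e.g. `torchrun --nproc_per_node 8 load.py`.
import os

import torch
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

rank = int(os.environ.get("RANK", "0"))

# 2-D mesh: 4-way FSDP ("dp") x 2-way tensor parallel ("tp").
device_mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder checkpoint
    tp_plan="auto",             # apply the model's built-in tensor-parallel plan
    device_mesh=device_mesh,    # the "tp" dim of the mesh drives the sharding
    # With cpu_ram_efficient_loading, only rank 0 materializes weights;
    # every other rank builds the model on the meta device.
    device_map=None if rank == 0 else "meta",
    torch_dtype=torch.bfloat16,
)
```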

@SunMarc @S1ro1 @ArthurZucker

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

SunMarc (Member) left a comment


Thanks, left a comment

Comment thread: src/transformers/modeling_utils.py (Outdated)

```diff
     # TODO: we can relax this check when we support taking tp_plan from a json file, for example.
     raise ValueError(f"tp_plan supports 'auto' only for now but got {tp_plan}.")
-if tp_plan is not None and device_map is not None:
+if tp_plan is not None and device_map is not None and device_mesh is not None:
```
SunMarc (Member) commented:

Suggested change:

```diff
-if tp_plan is not None and device_map is not None and device_mesh is not None:
+if tp_plan is not None and device_map is not None and device_mesh is None:
```

we should check for device_mesh is None instead, no?
Also, maybe we can add is_fsdp_enabled() somewhere here to make it easier to understand, and add some comments

salmanmohammadi (Contributor) commented Aug 7, 2025:

The workflow here is:

  • a TP plan is provided (we are applying tensor parallelism)
  • device_map is set, and it's the meta device

We don't want to error out here, and we also don't want to infer the device map from the device mesh. I think a clearer check would be something like:

Suggested change:

```diff
-if tp_plan is not None and device_map is not None and device_mesh is not None:
+# device_map should be permitted if the user wishes to instantiate the model on meta device
+if tp_plan is not None and device_map is not None and device_map != "meta":
```

What do you think? Is there a better check for meta device instantiation? @SunMarc @winglian

winglian (Collaborator, Author) commented:

We still need a device_mesh check in there. Maybe:

```python
if tp_plan is not None and device_map is not None and device_map != "meta" and device_mesh is None:
```
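Putting the pieces of this thread together, the guard might read something like the sketch below (illustrative only; the error message is invented, and this is not necessarily the exact merged diff):

```python
# Sketch of the combined guard discussed above (illustrative, not the merged code).
# A user-supplied device_map normally conflicts with tp_plan, but two cases are fine:
#   - device_map == "meta": cpu_ram_efficient_loading keeps non-rank0 processes on meta
#   - device_mesh is not None: the target device is already known from the mesh
if tp_plan is not None and device_map is not None and device_map != "meta" and device_mesh is None:
    raise ValueError("tp_plan and an explicit device_map are mutually exclusive.")
```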

ArthurZucker (Collaborator) left a comment:

Can we add what this enables somewhere? 🤗 A small snippet would be very nice. Thanks for the PR!

winglian marked this pull request as draft on August 8, 2025 02:49
winglian force-pushed the tp-with-device-mesh branch from 0896ad6 to 8af2853 on August 9, 2025 12:22
ArthurZucker (Collaborator) left a comment:

LGTM otherwise, but it would be nice to have a snippet of how to run!


```diff
 # Post-processing for tensor parallelism
-if device_mesh is not None:
+if device_mesh is not None and "tp" in device_mesh.mesh_dim_names:
```
ArthurZucker (Collaborator) commented:

Not 100% sure we have to prevent cases where there is no "tp" in mesh_dim_names; it happens a lot in inference.

A Contributor commented:

Agreed; we explicitly require "tp" in the mesh dim names only in the n-d parallelism case, so this would skip the post-processing for every 1-d parallelism case.
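For reference, a minimal sketch of the two mesh shapes under discussion (the dim names like "dp" and "tp" are conventions chosen by the caller, and an 8-GPU torchrun launch is assumed):

```python
# Minimal sketch: the check `"tp" in device_mesh.mesh_dim_names` passes for the
# n-d mesh below, but dim names are caller-chosen (and None when omitted).
from torch.distributed.device_mesh import init_device_mesh

# n-d parallelism: 4-way data parallel x 2-way tensor parallel
nd_mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))
assert "tp" in nd_mesh.mesh_dim_names

# 1-d tensor parallelism, common for inference: this mesh passes the check only
# because its single dim happens to be named "tp"; otherwise the guarded
# post-processing would be skipped.
tp_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("tp",))
```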

ArthurZucker (Collaborator) commented:

cc @winglian if this is still breaking!

