Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available for 30 days after the last update.

[For maintainers] Suggested jobs to run (before merge): run-slow: llama
| if "tp" not in device_mesh.mesh_dim_names: | ||
| raise ValueError( | ||
| "When using `tp_plan`, the `device_mesh` must contain a 'tp' dimension. " | ||
| "Please provide a valid `device_mesh`." | ||
| ) |
I don't think we should enforce 'tp' in the device mesh! For inference we never use that!
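For context, a minimal sketch (mine, not from the PR) of why a hard 'tp' requirement breaks the basic path: a mesh created without dimension names reports `mesh_dim_names` as `None`, so there is nothing to match against.

```python
# Hedged sketch: assumes torch.distributed is already initialized with 8 ranks.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (8,))  # no mesh_dim_names provided
print(mesh.mesh_dim_names)             # None for an unnamed mesh
# `"tp" in mesh.mesh_dim_names` would raise TypeError on None, and even a
# named inference mesh may not call its only dimension "tp".
```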
```python
device_mesh = device_mesh["tp"]
tp_size = device_mesh.size()  # the submesh is already 1-D, no need to re-index by "tp"
device_map = torch.device(f"{device_mesh.device_type}:{int(os.environ['LOCAL_RANK'])}")
```
Only do this if 'tp' exists in it!
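A minimal sketch of the guarded variant this comment asks for, assuming the same `device_mesh` name as the snippet above: take the 'tp' submesh only when the mesh actually names one.

```python
import os
import torch

# Guarded submesh selection: fall back to using the whole mesh as the TP mesh
# when no 'tp' dimension is named (e.g. the basic inference initialization).
if device_mesh.mesh_dim_names is not None and "tp" in device_mesh.mesh_dim_names:
    device_mesh = device_mesh["tp"]
tp_size = device_mesh.size()  # size of whichever mesh we ended up with
device_map = torch.device(f"{device_mesh.device_type}:{int(os.environ['LOCAL_RANK'])}")
```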
```python
# 'user_content.pt' indicates the model state_dict was saved with smp >= 1.10
Path(os.path.join(output_dir, "user_content.pt")).touch()
# We are in N-D parallelism if parallelism_config is set, so we ask accelerate whether this rank should save
elif getattr(self.accelerator, "parallelism_config", None) is not None:
```
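To illustrate the gating described in that comment, here is a hedged sketch of a Trainer-style save path; `should_save_on_this_rank` is a hypothetical stand-in for whatever accelerate exposes to mark the writer rank(s) under N-D parallelism.

```python
# Hedged sketch, not the PR's code. `should_save_on_this_rank` is hypothetical.
parallelism_config = getattr(self.accelerator, "parallelism_config", None)
if parallelism_config is not None:
    # N-D parallelism: defer to accelerate to decide which rank(s) write
    if should_save_on_this_rank(self.accelerator):  # hypothetical helper
        self._save(output_dir)
elif self.args.should_save:
    # Usual single-dimension case: only the designated rank writes
    self._save(output_dir)
```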
| if "tp" not in device_mesh.mesh_dim_names: | ||
| raise ValueError( | ||
| "When using `tp_plan`, the `device_mesh` must contain a 'tp' dimension. " | ||
| "Please provide a valid `device_mesh`." | ||
| ) | ||
| device_mesh = device_mesh["tp"] |
Can it be ndim > 1 but not have mesh dim names?

Nope, it can't IMO. How do you think we should take the correct submesh then?

No no, I just want to be sure, as the basic initialization we do is without providing a mesh name!

Oh, that works: we check for "tp" only if mesh.ndim > 1 AND the mesh is user-provided, so the basic initialization still works. If the check passes, we select the correct submesh ("tp") and use it as a 1-D mesh afterwards, as if it had been created by us in `initialize_tensor_parallelism`.
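Putting the thread's conclusion in one place, a hedged sketch of the agreed control flow (names follow the diff above):

```python
# Validate 'tp' only for multi-dimensional (i.e. user-provided) meshes;
# a 1-D mesh from the basic initialization is used as-is.
if device_mesh is not None and device_mesh.ndim > 1:
    if device_mesh.mesh_dim_names is None or "tp" not in device_mesh.mesh_dim_names:
        raise ValueError(
            "When using `tp_plan` with a multi-dimensional `device_mesh`, it must "
            "contain a 'tp' dimension. Please provide a valid `device_mesh`."
        )
    device_mesh = device_mesh["tp"]  # continue with the 1-D TP submesh
```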
Fails unrelated, merging
…e#39693)

* Feat: something
* Feat: initial changes
* tmp changes to unblock
* Refactor
* remove todo
* Feat: docstring
* Fix: saving of distributed model in trainer
* Fix: distributed saving with trainer
* Feat: add pure tp saving
* Only require tp dim if ndim > 1
* Fix: default to None
* Fix: better comments/errors
* Fix: properly check tp_size attribute
* Fix: properly check for None in tp_size

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
This lets `device_mesh` have multiple dims (#38949), which was by mistake reverted by "Add ep" (#39501); we need this for the upcoming accelerate/axolotl release. cc @ArthurZucker @SunMarc