Fix loss scaling and token aggregation to use only data parallel group #39674
Krish0909 wants to merge 3 commits into huggingface:main
Conversation
We aren't 100% sure of the API we'll go with in the PR you mentioned, so it's subject to change. Thank you for the contribution though! Also, mind me asking: afaik we don't have `num_data_parallel_processes` in Accelerate, do we?
You're absolutely right: `num_data_parallel_processes` isn't currently in Accelerate. I added it as part of a forward-looking design to align with how `AcceleratorState` handles other parallelism dimensions like `num_processes` and `num_mixed_precision_processes`. I thought having an explicit separation could be useful in hybrid parallelism setups. That said, I completely understand that the API is still evolving. I'm happy to adapt this PR once there's a clearer direction from the core team, or to refactor it to avoid the placeholder for now if you'd prefer. Let me know how you'd like me to proceed!
Sorry to interrupt. If you are looking to add a new attribute to the `Accelerator`, why don't we use new attributes on the model instead (e.g. `model.dp_size`, `model.tp_size`)? When model parallelism is applied via `fully_shard`, we can use the device mesh names and their shapes.
@Krish0909 It's totally fine, we aim to add properties like this in the PR you mentioned anyway, so this is probably going to be very close to final. @srrk-GreenMan we'd like to avoid adding this to the model itself, as it becomes transformers-specific. We'll probably allow users to take such properties from the `Accelerator` instead.
What does this PR do?
This PR fixes a bug in the Trainer where loss and token counts were scaled across all Accelerate processes, including the tensor parallel (TP) and context parallel (CP) meshes, leading to inflated training losses when using composable parallelism. After this change, loss scaling and token aggregation only consider the data parallel group, so TP/CP runs match pure DDP behavior.
Fixes #39648
Changes
Loss scaling: Replaced self.accelerator.num_processes with self.accelerator.state.num_data_parallel_processes when applying average_tokens_across_devices.
Token aggregation: Updated batching logic to use accelerator.reduce(..., group_type="data") so tokens are summed only across the data parallel group (see the sketch below).
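To make the intent concrete, here is a minimal sketch of the behavior these two changes aim for. Neither `num_data_parallel_processes` nor a `group_type` argument to `accelerator.reduce` exists in released Accelerate, so the snippet expresses the idea with plain `torch.distributed` over an assumed data-parallel process group (`dp_group` is a hypothetical handle, e.g. one dimension of a device mesh); it is not the actual Trainer code.

```python
import torch
import torch.distributed as dist

def aggregate_over_dp_only(loss: torch.Tensor,
                           num_items_in_batch: torch.Tensor,
                           dp_group: dist.ProcessGroup):
    """Sketch: scale the loss and sum token counts over the data-parallel
    group only, so TP/CP replicas of the same batch are not counted twice."""
    dp_world_size = dist.get_world_size(group=dp_group)

    # Token aggregation: sum per-rank token counts across DP ranks only
    # (stand-in for the proposed accelerator.reduce(..., group_type="data")).
    num_items_in_batch = num_items_in_batch.clone()
    dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM, group=dp_group)

    # Loss scaling: multiply by the DP world size rather than the global
    # process count (stand-in for the proposed num_data_parallel_processes),
    # mirroring what average_tokens_across_devices does under pure DDP.
    loss = loss * dp_world_size

    return loss, num_items_in_batch
```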
Motivation and Context
When using Accelerate's composable parallelism (TP/CP), the original implementation erroneously multiplied the loss by the total number of processes (DP × TP × CP). This resulted in losses that were N× larger (where N = TP × CP), making training logs and LR schedulers behave incorrectly. By restricting scaling to the data parallel group, we restore consistency with pure DDP runs.
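As a quick numeric illustration (the parallelism sizes here are chosen for the example, not taken from the runs below):

```python
# Illustrative numbers only.
dp, tp, cp = 2, 2, 2
total_processes = dp * tp * cp   # 8 = what the old code scaled the loss by
correct_scale = dp               # 2 = size of the data parallel group
inflation = total_processes / correct_scale
print(inflation)                 # 4.0 -> reported loss was 4x too large
```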
Testing
All existing Trainer integration tests pass (no regressions).
Manual verification:
Ran run_glue.py on MRPC with --tensor_parallel_size 2 --context_parallel_size 2. Logged losses every 10 steps.
Compared against a pure DDP run (no TP/CP flags). Loss trajectories matched within floating-point tolerance.
Before submitting
Who can review?
Trainer: @zach-huggingface, @SunMarc
Accelerate integration: @SunMarc, @zach-huggingface