
fix bug when using DP in trl, the batch size of input and output dism…#38938

Closed
kaixuanliu wants to merge 20 commits into huggingface:main from kaixuanliu:ddp-trl-fix

Conversation

@kaixuanliu
Contributor

No description provided.

@kaixuanliu
Contributor Author

kaixuanliu commented Jun 20, 2025

Steps to reproduce the bug:

git clone https://github.com/huggingface/trl.git
cd trl
git checkout 3ef9faf257
pip install .
export CUDA_VISIBLE_DEVICES=0,1,2
pytest -sv -rA tests/slow/test_sft_slow.py::SFTTrainerSlowTester::test_train_offloading_0_trl_internal_testing_tiny_LlamaForCausalLM_3_2

It will fail with the following error:

def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        Compute training loss and additionally compute token accuracies
        """
        mode = "train" if self.model.training else "eval"
        (loss, outputs) = super().compute_loss(
            model, inputs, return_outputs=True, num_items_in_batch=num_items_in_batch
        )
        if mode == "train":
            # When using padding-free, the attention_mask is not present in the inputs, instead we have cu_seq_lens_q,
            # cu_seq_lens_k, and max_length_k, max_length_q and position_ids.
            if "attention_mask" in inputs:
                num_tokens_in_batch = self.accelerator.gather_for_metrics(inputs["attention_mask"].sum()).sum().item()
            elif "position_ids" in inputs:
                local_num_tokens = torch.tensor(inputs["position_ids"].size(1), device=inputs["position_ids"].device)
                num_tokens_in_batch = self.accelerator.gather_for_metrics(local_num_tokens).sum().item()
            else:
                raise ValueError("Expected 'attention_mask' or 'position_ids' in inputs.")
            self._total_train_tokens += num_tokens_in_batch
        self._metrics[mode]["num_tokens"] = [self._total_train_tokens]

        # Compute token accuracy if we have labels and if the model is not using Liger (no logits)
        if "labels" in inputs and not self.args.use_liger_kernel:
            shift_logits = outputs.logits[..., :-1, :].contiguous()
            shift_labels = inputs["labels"][..., 1:].contiguous()

            # Get predictions
            predictions = shift_logits.argmax(dim=-1)

            # Create mask for non-padding tokens (assuming ignore_index is -100)
            mask = shift_labels != -100

            # Calculate accuracy only on non-padding tokens
>           correct_predictions = (predictions == shift_labels) & mask
E           RuntimeError: The size of tensor a (2) must match the size of tensor b (6) at non-singleton dimension 0

It crashes because num_items_in_batch at L3837 is a 1-D tensor that cannot be scattered correctly across multiple GPUs, so even though the input batch size is 6 at L3839, the output batch size is 2, and the test case fails.
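
For reference, a minimal CPU-only sketch of the chunking behavior that appears to be at play (this assumes nn.DataParallel splits tensor arguments along dim 0; the variable names and sizes are illustrative, not taken from the trainer code):

```python
import torch

# Illustrative sketch, assuming nn.DataParallel chunks every tensor argument
# along dim 0 before dispatching one chunk per replica.
n_gpu = 3
labels = torch.zeros(6, 128)             # per-step batch of 6 examples
num_items_in_batch = torch.tensor([42])  # 1-D tensor with a single element

print(len(labels.chunk(n_gpu)))              # 3 chunks of batch size 2 each
print(len(num_items_in_batch.chunk(n_gpu)))  # 1 chunk -- cannot be split three ways

# With mismatched chunk counts, fewer replicas end up running, so the gathered
# logits cover a batch of 2 while inputs["labels"] still has 6, producing the
# size mismatch shown in the traceback above.
```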

@kaixuanliu
Contributor Author

@zach-huggingface, @SunMarc and @qgallouedec, please help review.

Member

@SunMarc SunMarc left a comment


Thanks! Can you add a test that covers this specific case?

@kaixuanliu
Contributor Author

@SunMarc, hi, thanks for the advice. I think the existing test already covers this case:
pytest -sv -rA tests/trainer/test_trainer.py::TrainerIntegrationTest::test_num_batches_in_training_with_gradient_accumulation
I added the related assertion in the latest commit. Please check whether it is OK.

@yao-matrix
Contributor

@kaixuanliu, CI has failing cases, please take a look.

@kaixuanliu
Contributor Author

@yao-matrix, I updated the code and the failing case now passes. I also double-checked the failing case on my own machine. @SunMarc, can you review again? Thanks!

@kaixuanliu
Contributor Author

@SunMarc Hi, this PR has been open for two weeks now; could you help review it? Many thanks!

@SunMarc SunMarc requested a review from qgallouedec July 15, 2025 13:09
@kaixuanliu
Contributor Author

@qgallouedec, hi, can you help review? Thanks.

@yao-matrix
Contributor

@qgallouedec, could you help review this PR?

Comment on lines 3787 to 3789
actual_bs = None
if "labels" in inputs and isinstance(inputs["labels"], torch.Tensor):
    actual_bs = inputs["labels"].shape[0]
Contributor


actual_bs could be defined inside the IF block at line 3819 since it is only used there

Contributor Author

@kaixuanliu kaixuanliu Aug 11, 2025


@regisss, thanks for the review! I put actual_bs here because inputs is deleted at L3798. It is also fine to move the delete-and-free-memory operation later. I have updated the code.

@qgallouedec
Member

Shouldn't this be fixed in trl instead?

@qgallouedec
Member

I'm not sure I understand: you use 3 devices, but you don't run the test in a distributed manner?

@kaixuanliu
Contributor Author

@qgallouedec Hi, I think this is a corner case that is not handled properly in transformers. Although it could be worked around at the trl level or in other upper application layers, it is best to handle it in their common base layer (transformers). DP may be a dated approach, but since it has not been formally deprecated, it's best to fix it. WDYT?

@kaixuanliu kaixuanliu closed this Aug 19, 2025
@kaixuanliu kaixuanliu reopened this Aug 20, 2025
Member

@SunMarc SunMarc left a comment


Sorry for the delay, I have left a few questions!

Comment on lines +3883 to +3891
    assert loss_bs == self.args.n_gpu, (
        f"Expected loss to have {self.args.n_gpu} elements, but got {loss_bs} elements. "
        "This usually happens when the model does not return a loss for each device."
    )
else:
    assert loss_bs == actual_bs, (
        f"Expected loss to have {actual_bs} elements, but got {loss_bs} elements. "
        "This usually happens when the model does not return a loss for each device."
    )
Member


Instead of assert, let's just raise a RuntimeError. Also, the error message doesn't help that much; is there an actionable step here for users to fix the issue?
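
For example, something along these lines (just a sketch of the suggestion; the helper name, message wording, and placement are placeholders, not the code proposed in this PR):

```python
def check_dataparallel_loss(loss_bs: int, n_gpu: int) -> None:
    # Hypothetical helper sketching the suggestion: raise an explicit, actionable
    # error instead of a bare assert when the gathered loss does not have one
    # element per DataParallel replica.
    if loss_bs != n_gpu:
        raise RuntimeError(
            f"Expected the DataParallel-gathered loss to have {n_gpu} elements, got {loss_bs}. "
            "If you intended multi-GPU training, launch the script with `accelerate launch`; "
            "otherwise make only one device visible, e.g. CUDA_VISIBLE_DEVICES=0."
        )
```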

Comment on lines +3878 to +3882
if self.args.n_gpu > 1:
    if "labels" in inputs and isinstance(inputs["labels"], torch.Tensor):
        actual_bs = inputs["labels"].shape[0]
        loss_bs = loss.shape[0] if isinstance(loss, torch.Tensor) else len(loss)
        if actual_bs >= self.args.n_gpu:
Member


Do we really need these checks, given that we didn't need them until now? I feel like the issue was with num_items_in_batch, not with the labels or the loss batch size.

Comment on lines +3892 to +3893
loss = loss.mean() # mean() to average on multi-gpu parallel training

Member


We are already averaging the loss somewhere else, no?

Comment on lines +5486 to +5487
# In the DataParallel case, convert the scalar tensor into a 2-dim tensor with bs = n_gpu
num_items_in_batch = num_items_in_batch.unsqueeze(0).expand(self.args.n_gpu, -1)
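
(For context: under the same dim-0 chunking assumption as the sketch earlier in the thread, a small illustration of why this expansion avoids the truncated scatter; the sizes are illustrative.)

```python
import torch

# Illustrative only: after unsqueeze(0).expand(n_gpu, -1) the tensor has n_gpu
# rows, so chunking along dim 0 yields exactly one row per DataParallel replica
# instead of collapsing the scatter to a single replica.
n_gpu = 3
num_items_in_batch = torch.tensor([42])                       # 1-element tensor, as in the report
expanded = num_items_in_batch.unsqueeze(0).expand(n_gpu, -1)  # shape (3, 1)
print(len(expanded.chunk(n_gpu)))  # 3 -> one chunk per replica
```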
Member


Happy to have that, but I ran the test you mentioned and it passed without this PR; maybe I'm doing something wrong? pytest -sv -rA tests/trainer/test_trainer.py::TrainerIntegrationTest::test_num_batches_in_training_with_gradient_accumulation

Contributor Author


Hi @SunMarc, you may need to revert trl to commit 3ef9faf257 (see my comment above) to reproduce it. Anyway, I think it makes sense, as @qgallouedec mentioned, that when using DP it's best to use accelerate launch, so we can close this PR.

@qgallouedec
Member

If you want to use DP, you should launch the training with accelerate launch, not directly with python (or pytest).
When not using DP, make sure that only one device is visible (CUDA_VISIBLE_DEVICES=0).

@kaixuanliu kaixuanliu closed this Sep 3, 2025
@SunMarc SunMarc mentioned this pull request Sep 10, 2025