
fix bug when using DP in trl, the batch size of input and output dism…#38938

Closed
kaixuanliu wants to merge 20 commits into huggingface:main from kaixuanliu:ddp-trl-fix

Conversation

@kaixuanliu
Contributor

No description provided.

@kaixuanliu
Contributor Author

kaixuanliu commented Jun 20, 2025

Steps to reproduce the bug:

git clone https://github.com/huggingface/trl.git
cd trl
git checkout 3ef9faf257
pip install .
export CUDA_VISIBLE_DEVICES=0,1,2
pytest -sv -rA tests/slow/test_sft_slow.py::SFTTrainerSlowTester::test_train_offloading_0_trl_internal_testing_tiny_LlamaForCausalLM_3_2

It will fail with the following error:

def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        Compute training loss and additionally compute token accuracies
        """
        mode = "train" if self.model.training else "eval"
        (loss, outputs) = super().compute_loss(
            model, inputs, return_outputs=True, num_items_in_batch=num_items_in_batch
        )
        if mode == "train":
            # When using padding-free, the attention_mask is not present in the inputs, instead we have cu_seq_lens_q,
            # cu_seq_lens_k, and max_length_k, max_length_q and position_ids.
            if "attention_mask" in inputs:
                num_tokens_in_batch = self.accelerator.gather_for_metrics(inputs["attention_mask"].sum()).sum().item()
            elif "position_ids" in inputs:
                local_num_tokens = torch.tensor(inputs["position_ids"].size(1), device=inputs["position_ids"].device)
                num_tokens_in_batch = self.accelerator.gather_for_metrics(local_num_tokens).sum().item()
            else:
                raise ValueError("Expected 'attention_mask' or 'position_ids' in inputs.")
            self._total_train_tokens += num_tokens_in_batch
        self._metrics[mode]["num_tokens"] = [self._total_train_tokens]

        # Compute token accuracy if we have labels and if the model is not using Liger (no logits)
        if "labels" in inputs and not self.args.use_liger_kernel:
            shift_logits = outputs.logits[..., :-1, :].contiguous()
            shift_labels = inputs["labels"][..., 1:].contiguous()

            # Get predictions
            predictions = shift_logits.argmax(dim=-1)

            # Create mask for non-padding tokens (assuming ignore_index is -100)
            mask = shift_labels != -100

            # Calculate accuracy only on non-padding tokens
>           correct_predictions = (predictions == shift_labels) & mask
E           RuntimeError: The size of tensor a (2) must match the size of tensor b (6) at non-singleton dimension 0

It crashes because num_items_in_batch at L3837 is a 1-D tensor that cannot be scattered correctly across multiple GPUs, so even though the input batch size is 6 at L3839, the output batch size is 2, and the test case fails.
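
For reference, a minimal CPU-only sketch of the chunking behavior that appears to be at play (this assumes nn.DataParallel splits tensor arguments along dim 0; the variable names and sizes are illustrative, not taken from the trainer code):

```python
import torch

# Illustrative sketch, assuming nn.DataParallel chunks every tensor argument
# along dim 0 before dispatching one chunk per replica.
n_gpu = 3
labels = torch.zeros(6, 128)             # per-step batch of 6 examples
num_items_in_batch = torch.tensor([42])  # 1-D tensor with a single element

print(len(labels.chunk(n_gpu)))              # 3 chunks of batch size 2 each
print(len(num_items_in_batch.chunk(n_gpu)))  # 1 chunk -- cannot be split three ways

# With mismatched chunk counts, fewer replicas end up running, so the gathered
# logits cover a batch of 2 while inputs["labels"] still has 6, producing the
# size mismatch shown in the traceback above.
```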

@kaixuanliu
Contributor Author

@zach-huggingface, @SunMarc and @qgallouedec, please help review.

Member

@SunMarc SunMarc left a comment


Thanks! Can you add a test that covers this specific case?

@kaixuanliu
Contributor Author

@SunMarc, hi, thanks for the advice. I think the existing test already covers this case:
pytest -sv -rA tests/trainer/test_trainer.py::TrainerIntegrationTest::test_num_batches_in_training_with_gradient_accumulation
I added the related assertion in the latest commit. Please check whether it is OK.

@yao-matrix
Contributor

@kaixuanliu, CI has failing cases, please take a look.

@kaixuanliu
Contributor Author

@yao-matrix, I updated the code and the failing case now passes. I also double-checked the failing case on my own machine. @SunMarc, can you review again? Thanks!

@kaixuanliu
Contributor Author

@SunMarc Hi, this PR has been open for two weeks now; could you help review it? Many thanks!

@SunMarc SunMarc requested a review from qgallouedec July 15, 2025 13:09
@kaixuanliu
Contributor Author

@qgallouedec, hi, can you help review? Thanks.

@yao-matrix
Contributor

@qgallouedec, could you help review this PR?

Comment on lines 3787 to 3789
actual_bs = None
if "labels" in inputs and isinstance(inputs["labels"], torch.Tensor):
    actual_bs = inputs["labels"].shape[0]
Contributor


actual_bs could be defined inside the IF block at line 3819 since it is only used there

Contributor Author

@kaixuanliu kaixuanliu Aug 11, 2025


@regisss, thanks for the review! I put actual_bs here because inputs is deleted at L3798. It is also fine to move the delete-and-free-memory operation later. I have updated the code.

@qgallouedec
Member

Shouldn't this be fixed in trl instead?

@qgallouedec
Member

I'm not sure I understand: you use 3 devices, but you don't run the test in a distributed manner?

@kaixuanliu
Contributor Author

@qgallouedec Hi, I think this is a corner case that is not handled properly in transformers. Although it could be worked around at the trl level or in other upper application layers, it is best to handle it in their common base layer (transformers). DP may be a dated approach, but since it has not been formally deprecated, it's best to fix it. WDYT?

@kaixuanliu kaixuanliu closed this Aug 19, 2025
@kaixuanliu kaixuanliu reopened this Aug 20, 2025
Member

@SunMarc SunMarc left a comment


Sorry for the delay, I have left a few questions!

Comment on lines +3883 to +3891
    assert loss_bs == self.args.n_gpu, (
        f"Expected loss to have {self.args.n_gpu} elements, but got {loss_bs} elements. "
        "This usually happens when the model does not return a loss for each device."
    )
else:
    assert loss_bs == actual_bs, (
        f"Expected loss to have {actual_bs} elements, but got {loss_bs} elements. "
        "This usually happens when the model does not return a loss for each device."
    )
Member


Instead of assert, let's just raise a RuntimeError. Also, the error message doesn't help that much; is there an actionable step here for users to fix the issue?
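
For example, something along these lines (just a sketch of the suggestion; the helper name, message wording, and placement are placeholders, not the code proposed in this PR):

```python
def check_dataparallel_loss(loss_bs: int, n_gpu: int) -> None:
    # Hypothetical helper sketching the suggestion: raise an explicit, actionable
    # error instead of a bare assert when the gathered loss does not have one
    # element per DataParallel replica.
    if loss_bs != n_gpu:
        raise RuntimeError(
            f"Expected the DataParallel-gathered loss to have {n_gpu} elements, got {loss_bs}. "
            "If you intended multi-GPU training, launch the script with `accelerate launch`; "
            "otherwise make only one device visible, e.g. CUDA_VISIBLE_DEVICES=0."
        )
```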

Comment on lines +3878 to +3882
if self.args.n_gpu > 1:
    if "labels" in inputs and isinstance(inputs["labels"], torch.Tensor):
        actual_bs = inputs["labels"].shape[0]
        loss_bs = loss.shape[0] if isinstance(loss, torch.Tensor) else len(loss)
        if actual_bs >= self.args.n_gpu:
Member


Do we really need these checks, given that we didn't need them until now? I feel like the issue was with num_items_in_batch, not with the labels or the loss batch size.

Comment on lines +3892 to +3893
loss = loss.mean() # mean() to average on multi-gpu parallel training

Member


We are already averaging the loss somewhere else, no?

Comment on lines +5486 to +5487
# In the DataParallel case, convert the scalar tensor into a 2-dim tensor with bs = n_gpu
num_items_in_batch = num_items_in_batch.unsqueeze(0).expand(self.args.n_gpu, -1)
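
(For context: under the same dim-0 chunking assumption as the sketch earlier in the thread, a small illustration of why this expansion avoids the truncated scatter; the sizes are illustrative.)

```python
import torch

# Illustrative only: after unsqueeze(0).expand(n_gpu, -1) the tensor has n_gpu
# rows, so chunking along dim 0 yields exactly one row per DataParallel replica
# instead of collapsing the scatter to a single replica.
n_gpu = 3
num_items_in_batch = torch.tensor([42])                       # 1-element tensor, as in the report
expanded = num_items_in_batch.unsqueeze(0).expand(n_gpu, -1)  # shape (3, 1)
print(len(expanded.chunk(n_gpu)))  # 3 -> one chunk per replica
```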
Member


Happy to have that, but I ran the test you mentioned and it passed without this PR; maybe I'm doing something wrong? pytest -sv -rA tests/trainer/test_trainer.py::TrainerIntegrationTest::test_num_batches_in_training_with_gradient_accumulation

Contributor Author


Hi @SunMarc, you may need to revert trl to commit 3ef9faf257 (see my comment above) to reproduce it. Anyway, I think it makes sense, as @qgallouedec mentioned, that when using DP it's best to use accelerate launch, so we can close this PR.

@qgallouedec
Member

If you want to use DP, you should launch the training with accelerate launch, not directly with python (or pytest).
When not using DP, make sure that only one device is visible (CUDA_VISIBLE_DEVICES=0).

@kaixuanliu kaixuanliu closed this Sep 3, 2025
@SunMarc SunMarc mentioned this pull request Sep 10, 2025