[CP] Add attention_mask to the buffer when the mask is causal #40619
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @S1ro1 for visibility
It's useless to add this to buffers, no? CP doesn't work with an explicitly passed attention mask, so in accelerate we attach a hook that pops it out.
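For context, a minimal sketch of what such a hook could look like (illustrative names only; not the exact accelerate implementation):

```python
import torch.nn as nn

def _drop_attention_mask(module: nn.Module, args: tuple, kwargs: dict):
    # Pop any explicitly passed attention_mask before the forward runs,
    # since context parallelism cannot handle an explicit mask.
    kwargs.pop("attention_mask", None)
    return args, kwargs

# Hypothetical registration: a forward pre-hook that also receives kwargs.
# model.register_forward_pre_hook(_drop_attention_mask, with_kwargs=True)
```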
I was getting errors in the SFT Trainer since we use the attention mask for metrics and its shape was not matching.
@S1ro1 the entropy metric, for example, is erroring out: https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py#L1041-L1056
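Roughly what that metric does, paraphrased as a sketch (not the verbatim TRL code): per-token entropy is computed from the logits and then masked with attention_mask, so once CP shards the logits along the sequence dimension but leaves the mask full-length, the shapes no longer line up:

```python
import torch
import torch.nn.functional as F

def masked_mean_entropy(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # logits: (batch, local_seq_len, vocab) after CP sharding;
    # attention_mask: (batch, full_seq_len) if it was never sharded.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)  # (batch, local_seq_len)
    mask = attention_mask.bool()
    # Raises a shape error when local_seq_len != full_seq_len.
    return (entropy * mask).sum() / mask.sum()
```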
Okay, I guess it makes sense. But in this case it's not entirely correct to shard the mask across dim(1), as the mask should be sharded across both dim(1) and dim(2), which is not expressible in torch DTensor placements, so just keep an eye on that.
@S1ro1 good point! How about we only split the 2D masks?
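A hedged sketch of that idea (hypothetical helper, not the actual transformers code): only a 2D (batch, seq_len) padding mask gets sharded along its single sequence dimension, while higher-dimensional masks are skipped because both their query and key dimensions would need sharding:

```python
import torch

def maybe_buffer_attention_mask(attention_mask: torch.Tensor | None,
                                buffers: list[torch.Tensor],
                                buffer_seq_dims: list[int]) -> None:
    # Only a 2D mask of shape (batch, seq_len) can be sharded along one
    # sequence dimension; a (q_len, kv_len)-shaped mask would need two
    # sharded dims, which DTensor placements cannot express.
    if attention_mask is not None and attention_mask.ndim == 2:
        buffers.append(attention_mask)
        buffer_seq_dims.append(1)  # shard along the sequence dimension
```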
What does this PR do?
- attention_mask is always checked for causality before being added to the buffer, and this validation is performed only once for performance.
- attention_mask is appended to the buffers only after successful validation, preventing non-causal masks from being used.
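As a rough illustration of that flow (hypothetical names; "causal" is assumed here to mean a 2D mask with no padding zeros, which is only an assumption for the sketch):

```python
import torch

def add_mask_to_buffers_if_causal(attention_mask: torch.Tensor | None,
                                  buffers: list[torch.Tensor],
                                  buffer_seq_dims: list[int]) -> bool:
    # Validate once up front: treat the mask as "causal" when it is a 2D
    # (batch, seq_len) mask containing only ones, i.e. no padding.
    is_causal = (
        attention_mask is not None
        and attention_mask.ndim == 2
        and bool(attention_mask.all())
    )
    if is_causal:
        # Append only after the validation succeeded, so non-causal masks
        # never end up in the context-parallel buffers.
        buffers.append(attention_mask)
        buffer_seq_dims.append(1)
    return is_causal
```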