
Delay float32 upcast in ForCausalLMLoss after filtering ignore_index #40065

Open

starcatmeow wants to merge 1 commit into huggingface:main from starcatmeow:for-casual-lm-loss-optim

Conversation

@starcatmeow
Contributor

What does this PR do?

This PR implements the optimization discussed in #38452, originally proposed by @harshit2997.
Thanks for the original suggestion and discussion.

  • Moves the float32 upcast in ForCausalLMLoss to after filtering out ignore_index labels.
  • Ensures only relevant logits are upcast, reducing VRAM usage without affecting correctness (see the sketch below).
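
A minimal sketch of the filter-then-upcast ordering (a hypothetical standalone helper for illustration, not the exact code in transformers or in this PR's diff):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss_sketch(logits, labels, vocab_size, ignore_index=-100):
    # Shift so that tokens < n predict token n.
    shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
    shift_labels = labels[..., 1:].contiguous().view(-1)

    # Filter out ignore_index positions BEFORE the float32 upcast, so only
    # rows that actually contribute to the loss are materialized in fp32.
    keep = shift_labels != ignore_index
    kept_logits = shift_logits[keep].float()
    kept_labels = shift_labels[keep]

    return F.cross_entropy(kept_logits, kept_labels)
```

The baseline instead calls `logits.float()` up front, allocating a float32 copy of the full `(batch, seq, vocab)` tensor even for positions whose label is ignore_index.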

Fixes #38452

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

…uggingface#38452)

This avoids upcasting logits corresponding to ignore_index positions,
reducing unnecessary memory usage during loss computation.
Particularly useful when fine-tuning causal LMs with prompt tokens
set to ignore_index (e.g., -100).
@Rocketknight1 force-pushed the for-casual-lm-loss-optim branch from 2235288 to 2b9e9cf on August 13, 2025 at 13:47
@Rocketknight1
Member

Hi @starcatmeow, I just took a look and I'm not sure we can accept this! Although I thought it was a good optimization in the issue at #38452, that was because I misread the code - I thought it was already selecting masked labels, just doing so after casting to float32 instead of before. In that case, doing the select first would save memory with no issues.

However, adding a mask-select step where we didn't use one before introduces some problems - in particular, it makes the sizes of the output tensors data-dependent, which can force recompilations or a fallback to less efficient dynamic shapes. I'm not sure it's worth it for the memory saving here! cc core maintainers @ArthurZucker @Cyrilvallez for their opinion
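
To make the shape concern concrete, a toy example (an illustration, not code from this PR): the number of rows surviving a boolean-mask select depends on the values in the labels, not just their shapes, which tracing compilers such as torch.compile see as a data-dependent size.

```python
import torch

x = torch.randn(4, 8)
labels = torch.tensor([-100, 3, -100, 7])

# How many rows survive depends on the *values* in labels, not its shape,
# so a tracing compiler sees a data-dependent output size here and may
# recompile or fall back to dynamic shapes.
kept = x[labels != -100]
print(kept.shape)  # torch.Size([2, 8])
```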

@Cyrilvallez
Member

Cyrilvallez commented Aug 14, 2025

Interesting idea. I'm not sure how often we expect to see ignore_index in the labels - cc @ArthurZucker, do you have an idea? It could actually save a LOT of memory, as vocabularies tend to be large now (200k+)

Mathematically, it's fully equivalent
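
A back-of-envelope estimate of what is at stake, using assumed illustrative numbers (batch 4, sequence length 4096, 256k vocabulary - not measurements from this PR):

```python
# Assumed illustrative numbers, not measurements from this PR.
batch, seq, vocab = 4, 4096, 256_000
full_fp32 = batch * seq * vocab * 4   # bytes for a full float32 upcast
print(full_fp32 / 2**30)              # ~15.6 GiB
# If half the positions are ignore_index (e.g. prompt tokens), the
# filtered upcast materializes roughly half of that, ~7.8 GiB.
```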

@harshit2997

Thanks @starcatmeow for picking this change up. @Cyrilvallez, my main motivation for proposing the change was to account for fine-tuning cases where one doesn't want to fine-tune on prompt tokens. In addition, won't this also help with getting rid of padding tokens before the upcast, which is a common scenario?


Development

Successfully merging this pull request may close these issues.

Memory saving by upcasting logits for only non-ignored positions
