
Trainer: set skip_logits for loss-only eval when liger enabled #44981

Open
AkshajKashyap wants to merge 7 commits into huggingface:main from AkshajKashyap:fix/gh-43039-skip-logits-eval

Conversation

@AkshajKashyap

Fixes #43039

What does this PR do?

When prediction_loss_only=True during evaluation and use_liger_kernel=True, Trainer.prediction_step now passes skip_logits=True to the model forward if the forward signature supports it and labels are present.

This avoids materializing logits during loss-only eval and enables fused loss paths for implementations that use skip_logits (for example, Liger kernel integrations), which can reduce memory usage during evaluation.

Implementation details

  • Injects skip_logits=True into inputs only when:
    • prediction_loss_only is True
    • use_liger_kernel is enabled
    • labels are present
    • the model's forward accepts a skip_logits parameter (checked via its signature)
  • No behavior change when labels are missing or when the model does not support skip_logits (see the sketch below).
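For reference, a minimal standalone sketch of the guard's logic (the helper name maybe_skip_logits is hypothetical; in the PR the check lives inline in Trainer.prediction_step, as the review excerpts further down show):

```python
import inspect

def maybe_skip_logits(model, inputs, prediction_loss_only, use_liger_kernel):
    # Hypothetical standalone version of the PR's inline guard, for illustration.
    # Inject skip_logits=True only for loss-only eval with Liger enabled,
    # labels present, and a forward signature that accepts the argument.
    if (
        prediction_loss_only
        and use_liger_kernel
        and inputs.get("labels") is not None
        and "skip_logits" in inspect.signature(model.forward).parameters
    ):
        inputs["skip_logits"] = True
    return inputs
```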

Tests

Added CPU-only unit tests:

  • test_trainer_sets_skip_logits_for_loss_only_eval_when_liger_enabled
  • test_trainer_does_not_set_skip_logits_when_no_labels_but_return_loss_true

Run with:
```
python -m pytest -q tests/trainer/test_skip_logits_eval.py -x
```
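As a rough illustration of what the first test asserts, here is a hedged, self-contained sketch (TinyLigerLikeModel is a toy stand-in, not the model used in the actual test, which goes through Trainer.prediction_step):

```python
import inspect
import torch

class TinyLigerLikeModel(torch.nn.Module):
    # Toy stand-in for a Liger-patched model: forward accepts skip_logits
    # and returns no logits when it is set.
    def forward(self, input_ids=None, labels=None, skip_logits=False):
        loss = (input_ids.float() ** 2).mean()
        return {"loss": loss, "logits": None if skip_logits else input_ids}

def test_skip_logits_for_loss_only_eval_sketch():
    model = TinyLigerLikeModel()
    inputs = {"input_ids": torch.zeros(1, 4), "labels": torch.zeros(1, 4)}
    # Mirror the Trainer guard: loss-only eval, Liger enabled, labels present,
    # and a forward that supports skip_logits.
    if "skip_logits" in inspect.signature(model.forward).parameters:
        inputs["skip_logits"] = True
    outputs = model(**inputs)
    assert outputs["loss"] is not None and outputs["logits"] is None
```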

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Issue: When using the Liger Kernel, torch.nn.functional.cross_entropy is called #43039
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Tagging: @SunMarc

Member

@SunMarc SunMarc left a comment


did you see memory gain with this?

@AkshajKashyap
Author

> did you see memory gain with this?

Yep, I measured a clear peak GPU memory reduction on my RTX 3050 (4GB) using a small Llama model with Liger enabled.

Benchmark setup:

  • Model: hf-internal-testing/tiny-random-LlamaForCausalLM
  • batch=1, seq_len=1024, fp16
  • Measured peak allocated GPU memory during loss-only eval

Results:

  • baseline (forced skip_logits=False): 324.7 MB peak
  • with this PR behavior (Trainer injects skip_logits=True when supported): 10.6 MB peak
  • delta: 314.1 MB (~96.7%)

This matches the intent of the fix: in loss-only eval, skipping logits avoids materializing the logits tensor and enables fused loss paths for implementations that use skip_logits (like Liger integrations). Savings should generally increase with longer sequence length / larger vocab.
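For anyone who wants to reproduce this kind of number, a hedged sketch of the measurement (model loading details are assumptions; skip_logits only has an effect on a Liger-patched model, and this needs a CUDA device):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the tiny test model in fp16 on the GPU (sketch; Liger patching not shown).
model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM", torch_dtype=torch.float16
).cuda().eval()

input_ids = torch.randint(0, model.config.vocab_size, (1, 1024), device="cuda")

# Measure peak allocated memory for a single loss-only forward pass.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(input_ids=input_ids, labels=input_ids)  # pass skip_logits=True on a Liger-patched model
peak_mb = torch.cuda.max_memory_allocated() / 2**20
print(f"peak allocated: {peak_mb:.1f} MB")
```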

Appreciate your reply; I'm really just trying to help and make myself useful.

@AkshajKashyap
Author

> did you see memory gain with this?

Quick follow-up since it’s been about a week and CI is still green: is this change directionally OK, or would you prefer a different guard / placement?

Happy to adjust quickly if you want this behind an additional condition (for example, only for specific model types) or moved to a different part of the eval path. @SunMarc

Member

@SunMarc SunMarc left a comment


Thanks, just a few nits

Comment on lines +2937 to +2942
```python
try:
    forward_sig = inspect.signature(unwrap_model(model).forward)
    if "skip_logits" in forward_sig.parameters:
        inputs["skip_logits"] = True
except (TypeError, ValueError):
    pass
```
Member


why try/except? in which cases do we hit ValueError or TypeError?

Member


Also, this is an arg from Liger, so it should always be there, no?

@@ -0,0 +1,94 @@
import tempfile
Member


don't create a new file, put it in an existing file. Also, add one test only; it should be enough. Try to make the test as simple and small as possible.

```python
# Enable Liger fused loss path during eval when we only need the loss (no logits).
if (
    prediction_loss_only
    and getattr(self.args, "use_liger_kernel", False)
```
Member


we don't need the getattr

```python
if (
    prediction_loss_only
    and getattr(self.args, "use_liger_kernel", False)
    and inputs.get("labels") is not None
```
Member


like here, we can just put this piece of code in the correct place so that we don't have to check that: https://github.com/huggingface/transformers/pull/45273/changes

Comment on lines +2935 to +2936
and "skip_logits" not in inputs
):
Member


not sure why skip_logits would be in inputs
