
[generate] Always pass full input_ids in prepare_inputs_for_generation #44226

Merged
Cyrilvallez merged 14 commits into main from send-full-inputs on Feb 24, 2026

Conversation

@Cyrilvallez (Member) commented Feb 23, 2026

What does this PR do?

As per the title. It looks like some models (xlnet and kosmos2_5) and most audio models sometimes rely on the full previous input_ids to prepare inputs. Note that this is not compatible with restarting generation from a previously filled cache with new inputs, so those models are not well-behaved in general (if they use a cache, they should be able to restart from it with any input). However, forwarding the full inputs and slicing inside prepare_inputs_for_generation is a simple fix for those models.
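As a minimal, hedged sketch of the idea (simplified, and not the exact transformers implementation): generate() now always forwards the full input_ids, and a model's prepare_inputs_for_generation can look at the whole history before slicing down to the tokens it actually needs to run.

```python
# Minimal sketch, assuming a standard `Cache` object with `get_seq_length()`;
# the names and structure are illustrative, not the real transformers code.
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
    # `input_ids` always holds the full sequence generated so far, so models
    # that depend on the history can inspect it here.
    if past_key_values is not None and past_key_values.get_seq_length() > 0:
        # Keep only the tokens not yet present in the cache for the forward pass.
        input_ids = input_ids[:, past_key_values.get_seq_length() :]
    return {"input_ids": input_ids, "past_key_values": past_key_values, **kwargs}
```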

This should fix all audio models, cc @eustlb. Please note my comment about how that means audio models cannot restart from an old cache; not sure if that's known/intended/fixable.
As for the only two other non-audio models requiring past input_ids (xlnet and kosmos2_5), it's in general the same issue. Kosmos could be fixed; for the other I'm not sure.

See https://huggingface.co/datasets/hf-internal-testing/transformers_daily_ci/raw/8785954cca2fdca181de0b9567059471bcadb959/2026-02-21/ci_results_run_models_gpu/new_failures_with_bad_commit_grouped_by_authors.json for details on the failing tests.

cc @zucchini-nlp @vasqu

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread: src/transformers/generation/utils.py (Outdated)

```python
use_cache = model_kwargs.get("use_cache", True)
new_inputs_ids = input_ids[:, -1:] if use_cache else input_ids
model_inputs = self.prepare_inputs_for_generation(new_inputs_ids, **model_kwargs)
next_sequence_length = 1 if model_kwargs.get("use_cache", True) else None
```
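As a rough, hedged sketch of how that next_sequence_length value is meant to be consumed (the real slicing lives inside prepare_inputs_for_generation, see the model-side code quoted further down; the tensor below is only an example):

```python
import torch

# Hedged sketch: with a cache, only the newly sampled token still needs to be
# processed (next_sequence_length = 1); without a cache, None means "keep everything".
input_ids = torch.tensor([[5, 9, 2, 7]])  # full sequence generated so far
use_cache = True
next_sequence_length = 1 if use_cache else None
input_ids_for_model = (
    input_ids[:, -next_sequence_length:] if next_sequence_length is not None else input_ids
)
# use_cache=True  -> tensor([[7]]) (last token only)
# use_cache=False -> the full input_ids
```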
Member
nit: is it okay to default to use_cache=True? Previously we had a fallback to a config attribute.

Member Author

It's actually guaranteed to exist in the model_kwargs. I updated the code to show it explicitly.

@Cyrilvallez (Member Author)

run-slow: kosmos2_5

@github-actions (Contributor)

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/kosmos2_5"]
quantizations: []

@github-actions (Contributor)

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                    |
|---------|----------|--------------------------------|
| RUN     | 2e262fed | workflow commit (merge commit) |
| PR      | 3bd0a68b | branch commit (from PR)        |
| main    | 6ed9ee36 | base commit (on main)          |

✅ No failing test specific to this PR 🎉 👏 !

@zucchini-nlp (Member) left a comment

The only question I have is about input embeddings; it's not really clear why we need to slice them before prefill. Otherwise LGTM.

Comment on lines +3762 to +3763

```python
# The cache is already taken into account in `_get_initial_cache_position`, so the length is only the new tokens if we slice
effective_input_length = next_sequence_length if next_sequence_length is not None else input_ids.shape[1]
```
Member

I'm lost about this part: why can't inputs_embeds be sliced as is, and does this work when both ids and embeds are passed?

Member Author

We never use them both simultaneously in prepare_inputs_for_generation. The issue is that _get_initial_cache_position will override the given sequence_length if inputs_embeds are present, so if we want the correct position we need to slice!

Member

ahh I see, it's surprising that _get_initial_cache_position treats embeds and ids differently

Member Author

Yes, way too surprising IMO. It should only take the sequence_length and/or the cache, and that's it, but that's for another time.
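For readers following this thread, here is a hedged illustration of the point above (simplified; not the actual _get_initial_cache_position code): with a filled cache, the positions should continue after the cached tokens, so deriving the length from full, unsliced inputs_embeds would yield the wrong positions.

```python
import torch

# Illustrative numbers only.
past_len = 10  # tokens already in the cache
n_new = 1      # newly sampled token

# Length inferred from the sliced inputs: positions continue after the cache.
positions_correct = torch.arange(past_len, past_len + n_new)             # tensor([10])
# Length inferred from the full, unsliced inputs: positions span the whole
# sequence again, which is wrong once a cache is present.
positions_wrong = torch.arange(past_len, past_len + (past_len + n_new))  # tensor([10, ..., 20])
```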

Comment thread: src/transformers/models/kosmos2_5/modeling_kosmos2_5.py
Comment on lines +496 to 500

```python
next_sequence_length: int | None = None,
past_key_values: Cache | None = None,
attention_mask: torch.LongTensor | None = None,
inputs_embeds: torch.FloatTensor | None = None,
cache_position: torch.LongTensor | None = None,
```
Member

IMO we now have next_sequence_length, which has similar functionality to cache_position (or even just past_key_values in simple cases). Let's clean up the redundant args and deprecate them properly in subsequent PRs.

Member Author

Yeah, the whole goal of those PRs was to remove cache_position. Even though those two kwargs are not really the same (next_sequence_length is not intended to be used in modeling code at all), I've already started removing cache_position everywhere in #44181 🤗

@vasqu (Contributor) left a comment

LGTM, just some smaller comments. I guess this is a pre-step to removing cache positions in #44181?

Comment on lines +527 to 530

```python
input_ids = input_ids[:, -next_sequence_length:] if next_sequence_length is not None else input_ids
model_inputs[input_ids_key] = input_ids.clone(memory_format=torch.contiguous_format)
batch_size, sequence_length = input_ids.shape[:2]  # we slice here as some models may have them 3D
```
Contributor

Not super relevant to this PR, but this is probably a limitation for audio models, no? We can have 3D input ids with the codebook channel dim, so something along the lines of [bsz, channels, seq_len].

@Cyrilvallez (Member Author) commented Feb 24, 2026

It's actually not, as omitting the dim is equivalent to slicing it all with :!
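A quick check of that claim in plain PyTorch (the shape below is illustrative, not taken from a specific model):

```python
import torch

# Omitting trailing dims in an index is the same as slicing them fully with `:`.
x = torch.arange(2 * 5 * 4).reshape(2, 5, 4)  # e.g. a 3D ids tensor
n = 3
assert torch.equal(x[:, -n:], x[:, -n:, :])   # identical results
```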

Comment on lines +544 to +547

```python
if model_input is not None and model_input.shape[-1] != sequence_length:
    # Input can be 2D or 3D, and we always slice on `seq-length` (last dim)
    model_input = model_input[..., -sequence_length:].clone(memory_format=torch.contiguous_format)
    model_inputs[model_input_name] = model_input
```
Contributor

Suggested change:

```diff
-if model_input is not None and model_input.shape[-1] != sequence_length:
+if model_input is not None:
     # Input can be 2D or 3D, and we always slice on `seq-length` (last dim)
     model_input = model_input[..., -sequence_length:].clone(memory_format=torch.contiguous_format)
     model_inputs[model_input_name] = model_input
```

Not a fan of data-dependent control flow here; it should work in any case, no?

Member Author

I don't think shape checks count as data-dependent control flow as far as dynamo is concerned. We can remove it anyway, but then we always clone, which is unnecessary (though not really a big deal).

Comment thread: src/transformers/generation/utils.py
Comment thread: src/transformers/models/paligemma/modeling_paligemma.py
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: csm, higgs_audio_v2, kosmos2_5, paligemma, rag, xglm, xlm

@Cyrilvallez (Member Author)

Well, technically #44130 was the pre-step for removing cache_position; this is a fix after said PR, as audio models unfortunately need to access the full input_ids in prepare_inputs_for_generation, so we have to delay the slicing a little bit!

@Cyrilvallez Cyrilvallez merged commit 3c52b78 into main Feb 24, 2026
26 checks passed
@Cyrilvallez Cyrilvallez deleted the send-full-inputs branch February 24, 2026 10:45
dacorvo added a commit that referenced this pull request Mar 18, 2026
…chmarks

Rewrite static_sample_investigation.md with:
- Context: goal is to determine neuron-only vs general static path
- Methodology: align on newest _sample algorithm, not neuron_sample fork
- Full comparison table: _static_sample vs neuron_sample (14 items)
- Benchmark results for Items A (output_ids CPU) and B (4D mask)
- Recent PRs affecting _sample (#44226, #44130, #44181, #44126)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>