
Remove cache_position in more models #44330

Merged
Cyrilvallez merged 11 commits into main from remove-cache-pos-more-models on Mar 11, 2026

Conversation

@Cyrilvallez (Member) commented Feb 27, 2026

What does this PR do?

As per the title! Follow-up of #44181 with more models!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Cyrilvallez Cyrilvallez force-pushed the remove-cache-pos-more-models branch from e69f373 to 2c58c77 on March 4, 2026 20:41
@Cyrilvallez Cyrilvallez changed the title Remove cache_position in more models Remove cache_position in more models Mar 9, 2026
@Cyrilvallez (Member Author)

run-slow: got_ocr2 marian lfm2_vl modernbert_decoder video_llama_3 qwen2_5_omni qwen3_omni_moe biogpt janus dbrx fast_vlm bart perception_lm blenderbot llava_onevision roberta gemma2 llava ovis2 deepseek_vl_hybrid mistral3 paligemma glm_ocr t5gemma2 ernie bigbird_pegasus pegasus t5gemma glm4v_moe time_series_transformer timesfm2_5 ernie4_5_moe voxtral llava_next_video nanochat qwen2_5_vl gpt_neox xlm_roberta_xl deepseek_vl vipllava cohere2 big_bird gemma3n granitemoehybrid internvl git lighton_ocr gpt_oss colqwen2 m2m_100 glm46v plbart aya_vision bamba metaclip_2 camembert zamba2 blt informer glm_image gemma3 mbart blenderbot_small cohere2_vision phi4_multimodal glm4v roberta_prelayernorm aimv2 smolvlm florence2 vaultgemma data2vec_text jetmoe pegasus_x

@github-actions bot commented Mar 9, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/aimv2", "models/aya_vision", "models/bamba", "models/bart", "models/big_bird", "models/bigbird_pegasus", "models/biogpt", "models/blenderbot", "models/blenderbot_small", "models/blt", "models/camembert", "models/cohere2", "models/cohere2_vision", "models/colqwen2", "models/dbrx", "models/deepseek_vl", "models/deepseek_vl_hybrid", "models/ernie", "models/ernie4_5_moe", "models/fast_vlm", "models/florence2", "models/gemma2", "models/gemma3", "models/gemma3n", "models/git", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/got_ocr2", "models/gpt_neox", "models/gpt_oss", "models/granitemoehybrid", "models/informer", "models/internvl", "models/janus", "models/jetmoe", "models/lfm2_vl", "models/lighton_ocr", "models/llava", "models/llava_next_video", "models/llava_onevision", "models/m2m_100", "models/marian", "models/mbart", "models/metaclip_2", "models/mistral3", "models/modernbert_decoder", "models/nanochat", "models/ovis2", "models/paligemma", "models/pegasus", "models/pegasus_x", "models/perception_lm", "models/phi4_multimodal", "models/plbart", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen3_omni_moe", "models/roberta", "models/roberta_prelayernorm", "models/smolvlm", "models/t5gemma", "models/t5gemma2", "models/time_series_transformer", "models/timesfm2_5", "models/vaultgemma", "models/video_llama_3", "models/vipllava", "models/voxtral", "models/xlm_roberta_xl", "models/zamba2"]
quantizations: []

@github-actions bot commented Mar 9, 2026

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                    |
|---------|----------|--------------------------------|
| RUN     | 693885da | workflow commit (merge commit) |
| PR      | e13feda8 | branch commit (from PR)        |
| main    | 0a0ac7a2 | base commit (on main)          |

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@Cyrilvallez (Member Author)

run-slow: got_ocr2 marian lfm2_vl modernbert_decoder video_llama_3 qwen2_5_omni qwen3_omni_moe janus dbrx fast_vlm bart perception_lm blenderbot roberta gemma2 llava ovis2 deepseek_vl_hybrid mistral3 paligemma glm_ocr t5gemma2 pegasus t5gemma glm4v_moe time_series_transformer timesfm2_5 ernie4_5_moe voxtral llava_next_video nanochat qwen2_5_vl gemma3n

@github-actions bot

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/bart", "models/blenderbot", "models/dbrx", "models/deepseek_vl_hybrid", "models/ernie4_5_moe", "models/fast_vlm", "models/gemma2", "models/gemma3n", "models/glm4v_moe", "models/glm_ocr", "models/got_ocr2", "models/janus", "models/lfm2_vl", "models/llava", "models/llava_next_video", "models/marian", "models/mistral3", "models/modernbert_decoder", "models/nanochat", "models/ovis2", "models/paligemma", "models/pegasus", "models/perception_lm", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen3_omni_moe", "models/roberta", "models/t5gemma", "models/t5gemma2", "models/time_series_transformer", "models/timesfm2_5", "models/video_llama_3", "models/voxtral"]
quantizations: []

@github-actions bot

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                    |
|---------|----------|--------------------------------|
| RUN     | c43a2d40 | workflow commit (merge commit) |
| PR      | 1df3aa4e | branch commit (from PR)        |
| main    | 0e3be387 | base commit (on main)          |

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@Cyrilvallez (Member Author)

run-slow: lfm2_vl modernbert_decoder qwen2_5_omni qwen3_omni_moe bart gemma2 llava gemma3n

@github-actions bot

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/bart", "models/gemma2", "models/gemma3n", "models/lfm2_vl", "models/llava", "models/modernbert_decoder", "models/qwen2_5_omni", "models/qwen3_omni_moe"]
quantizations: []

@github-actions bot

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit   | Description                    |
|---------|----------|--------------------------------|
| RUN     | c43a2d40 | workflow commit (merge commit) |
| PR      | 1df3aa4e | branch commit (from PR)        |
| main    | 0e3be387 | base commit (on main)          |

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@Cyrilvallez (Member Author) commented Mar 11, 2026

Because the run-slow command does not work, I checked locally and all those models:
got_ocr2 marian lfm2_vl modernbert_decoder video_llama_3 qwen2_5_omni qwen3_omni_moe janus dbrx fast_vlm bart perception_lm blenderbot roberta gemma2 llava ovis2 deepseek_vl_hybrid mistral3 paligemma glm_ocr t5gemma2 pegasus t5gemma glm4v_moe time_series_transformer timesfm2_5 ernie4_5_moe voxtral llava_next_video nanochat qwen2_5_vl gemma3n
have no diffs with main on their IntegrationTest!
So all good!
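
For reference, one hypothetical way to reproduce this kind of local check for a single model (assumed invocation, not taken from this PR; `RUN_SLOW=1` is what enables the slow integration tests in the transformers test suite):

```python
# Hypothetical local check for one model's integration tests (gemma2 used as an example).
# RUN_SLOW=1 enables tests decorated with @slow; -k filters for the IntegrationTest classes.
import os
import pytest

os.environ["RUN_SLOW"] = "1"
exit_code = pytest.main(
    ["tests/models/gemma2/test_modeling_gemma2.py", "-k", "IntegrationTest", "-x"]
)
print("integration tests passed" if exit_code == 0 else f"pytest exited with {exit_code}")
```

The same run on main would then be compared against the branch to confirm there are no diffs.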

@vasqu (Contributor) left a comment


Only things I would address:

  • Hybrid models nit
  • Some simplification of the input_ids/embeds check on a small number of models (IIRC 2, which I commented on)

The rest is mostly me commenting to myself a bit, or nits.

Comment on lines -614 to -615
and cache_position is not None
and cache_position[0] > 0
Contributor

Honestly we never needed that condition at all, given we have cache_params.has_previous_state

Member Author

Yup, no idea why it was there... But that's why I fully removed it!
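
For reference, a minimal runnable sketch (mock cache class with hypothetical names, apart from has_previous_state) of why the extra cache_position check was redundant once the cache itself tracks whether a previous state exists:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MambaStyleCache:
    # Mock of the relevant attribute; the real cache_params object maintains this flag itself.
    has_previous_state: bool = False

def use_precomputed_states(cache_params: Optional[MambaStyleCache]) -> bool:
    # The old check additionally required: cache_position is not None and cache_position[0] > 0,
    # which is already implied once has_previous_state is tracked by the cache.
    return cache_params is not None and cache_params.has_previous_state

print(use_precomputed_states(None))                                      # False: no cache at all
print(use_precomputed_states(MambaStyleCache()))                         # False: prefill step
print(use_precomputed_states(MambaStyleCache(has_previous_state=True)))  # True: decoding step
```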

Comment thread src/transformers/models/bamba/modeling_bamba.py Outdated
Comment thread src/transformers/models/bamba/modeling_bamba.py
cache_position = torch.arange(
past_key_values_length, past_key_values_length + seq_length, device=inputs_embeds.device
)
position_ids = torch.arange(seq_length, device=inputs_embeds.device) + past_key_values_length
Contributor

Position ids are always None I assume?

@Cyrilvallez (Member Author) Mar 11, 2026

Yup! Since for some reason later on it was doing positions = self.embed_positions(..., position_ids=cache_position), and cache_position can never be different from a very simple arange based on the current length (i.e. it does not have the subtlety of position_ids, which differ based on padding, packing format, etc.), it's the same to recreate the arange in this way, independently of whether the "real" position_ids are passed as a kwarg. This is because the model recreates the positions anyway:

# embed positions
if position_ids is None:
position_ids = cache_position.unsqueeze(0)
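
To make the distinction concrete, a tiny standalone illustration (hypothetical tensors, not from this PR) of why position_ids carry padding information while cache_position is only ever a plain arange over the current chunk:

```python
import torch

# A left-padded batch: the first sequence has two padding tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# "Real" position_ids account for padding: each row starts counting at its first real token.
position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)

# cache_position is just the running index over the current chunk, identical for every row.
cache_position = torch.arange(attention_mask.shape[1])

print(position_ids)    # tensor([[0, 0, 0, 1, 2], [0, 1, 2, 3, 4]])
print(cache_position)  # tensor([0, 1, 2, 3, 4])
```
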
position_ids = torch.arange(seq_length, device=inputs_embeds.device) + past_key_values_length
Contributor

Seeing it multiple times but is there any difference for torch to do arange + int vs arange only?

Member Author

Yes! It's a subtlety, but because past_key_values_length will be a tensor if and only if the cache is a StaticCache, doing it this way avoids a costly CUDA synchronization in those cases! See #42433 and the issue mentioned there for more details!
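
Roughly, the difference looks like this (hypothetical standalone snippet assuming a CUDA device; not the PR code):

```python
import torch

seq_length = 8
# With a StaticCache, the past length can be a device tensor rather than a Python int.
past_key_values_length = torch.tensor(16, device="cuda")

# Using the device scalar as an arange bound needs its concrete value on the host
# (an implicit .item()), which forces a CUDA synchronization.
positions_sync = torch.arange(
    past_key_values_length, past_key_values_length + seq_length, device="cuda"
)

# Building the arange from the Python int and adding the device scalar afterwards
# stays entirely on the GPU, so no synchronization is needed.
positions_no_sync = torch.arange(seq_length, device="cuda") + past_key_values_length

assert torch.equal(positions_sync, positions_no_sync)  # both are [16, ..., 23]
```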

Comment thread src/transformers/models/ernie/modeling_ernie.py Outdated
)

# Copied from transformers.models.bert.modeling_bert.BertModel._create_attention_masks
# No longer Copied from transformers.models.bert.modeling_bert.BertModel._create_attention_masks
Contributor

Hmm a bit weird to change this model but not bert 🤔 how come?

Member Author

I was focusing on finishing the models that use modular, and a LOT of other models are copied from bert, so I will keep it for the next batch!

Comment thread src/transformers/models/glm_image/modeling_glm_image.py
Comment thread src/transformers/models/modernbert_decoder/modeling_modernbert_decoder.py Outdated
@github-actions bot

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, aya_vision, bamba, bart, big_bird, bigbird_pegasus, biogpt, blenderbot, blenderbot_small, blt, camembert, cohere2, cohere2_vision, colqwen2, data2vec, dbrx

@Cyrilvallez Cyrilvallez merged commit 50cc0b8 into main Mar 11, 2026
29 checks passed
@Cyrilvallez Cyrilvallez deleted the remove-cache-pos-more-models branch March 11, 2026 14:47