Remove cache_position in more models #44330
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
e69f373 to 2c58c77
cache_position in more models
run-slow: got_ocr2 marian lfm2_vl modernbert_decoder video_llama_3 qwen2_5_omni qwen3_omni_moe biogpt janus dbrx fast_vlm bart perception_lm blenderbot llava_onevision roberta gemma2 llava ovis2 deepseek_vl_hybrid mistral3 paligemma glm_ocr t5gemma2 ernie bigbird_pegasus pegasus t5gemma glm4v_moe time_series_transformer timesfm2_5 ernie4_5_moe voxtral llava_next_video nanochat qwen2_5_vl gpt_neox xlm_roberta_xl deepseek_vl vipllava cohere2 big_bird gemma3n granitemoehybrid internvl git lighton_ocr gpt_oss colqwen2 m2m_100 glm46v plbart aya_vision bamba metaclip_2 camembert zamba2 blt informer glm_image gemma3 mbart blenderbot_small cohere2_vision phi4_multimodal glm4v roberta_prelayernorm aimv2 smolvlm florence2 vaultgemma data2vec_text jetmoe pegasus_x

This comment contains models: ["models/aimv2", "models/aya_vision", "models/bamba", "models/bart", "models/big_bird", "models/bigbird_pegasus", "models/biogpt", "models/blenderbot", "models/blenderbot_small", "models/blt", "models/camembert", "models/cohere2", "models/cohere2_vision", "models/colqwen2", "models/dbrx", "models/deepseek_vl", "models/deepseek_vl_hybrid", "models/ernie", "models/ernie4_5_moe", "models/fast_vlm", "models/florence2", "models/gemma2", "models/gemma3", "models/gemma3n", "models/git", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/got_ocr2", "models/gpt_neox", "models/gpt_oss", "models/granitemoehybrid", "models/informer", "models/internvl", "models/janus", "models/jetmoe", "models/lfm2_vl", "models/lighton_ocr", "models/llava", "models/llava_next_video", "models/llava_onevision", "models/m2m_100", "models/marian", "models/mbart", "models/metaclip_2", "models/mistral3", "models/modernbert_decoder", "models/nanochat", "models/ovis2", "models/paligemma", "models/pegasus", "models/pegasus_x", "models/perception_lm", "models/phi4_multimodal", "models/plbart", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen3_omni_moe", "models/roberta", "models/roberta_prelayernorm", "models/smolvlm", "models/t5gemma", "models/t5gemma2", "models/time_series_transformer", "models/timesfm2_5", "models/vaultgemma", "models/video_llama_3", "models/vipllava", "models/voxtral", "models/xlm_roberta_xl", "models/zamba2"]

CI Results · Commit Info

The test failure analysis could not be completed. Please check the workflow run for details.
run-slow: got_ocr2 marian lfm2_vl modernbert_decoder video_llama_3 qwen2_5_omni qwen3_omni_moe janus dbrx fast_vlm bart perception_lm blenderbot roberta gemma2 llava ovis2 deepseek_vl_hybrid mistral3 paligemma glm_ocr t5gemma2 pegasus t5gemma glm4v_moe time_series_transformer timesfm2_5 ernie4_5_moe voxtral llava_next_video nanochat qwen2_5_vl gemma3n

This comment contains models: ["models/bart", "models/blenderbot", "models/dbrx", "models/deepseek_vl_hybrid", "models/ernie4_5_moe", "models/fast_vlm", "models/gemma2", "models/gemma3n", "models/glm4v_moe", "models/glm_ocr", "models/got_ocr2", "models/janus", "models/lfm2_vl", "models/llava", "models/llava_next_video", "models/marian", "models/mistral3", "models/modernbert_decoder", "models/nanochat", "models/ovis2", "models/paligemma", "models/pegasus", "models/perception_lm", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen3_omni_moe", "models/roberta", "models/t5gemma", "models/t5gemma2", "models/time_series_transformer", "models/timesfm2_5", "models/video_llama_3", "models/voxtral"]

CI Results · Commit Info

The test failure analysis could not be completed. Please check the workflow run for details.
run-slow: lfm2_vl modernbert_decoder qwen2_5_omni qwen3_omni_moe bart gemma2 llava gemma3n

This comment contains models: ["models/bart", "models/gemma2", "models/gemma3n", "models/lfm2_vl", "models/llava", "models/modernbert_decoder", "models/qwen2_5_omni", "models/qwen3_omni_moe"]

CI Results · Commit Info

The test failure analysis could not be completed. Please check the workflow run for details.
vasqu left a comment
The only things I would address:
- Hybrid models nit
- Some simplification of the input_ids/embeds check on a small number of models (IIRC the 2 I commented on)

The rest is mostly me commenting to myself a bit, or nits.
```diff
-    and cache_position is not None
-    and cache_position[0] > 0
```
Honestly we never needed that condition at all, given we have `cache_params.has_previous_state`.
Yup, no idea why it was there... but that's why I fully removed it!
```diff
-cache_position = torch.arange(
-    past_key_values_length, past_key_values_length + seq_length, device=inputs_embeds.device
-)
+position_ids = torch.arange(seq_length, device=inputs_embeds.device) + past_key_values_length
```
Position ids are always None, I assume?
Yup! For some reason, later on it was doing `positions = self.embed_positions(..., position_ids=cache_position)`, and `cache_position` can never be anything other than a very simple arange based on the current length (i.e. it does not have the subtlety of `position_ids`, which differ based on padding, packing format, etc.), so it's the same to recreate the arange this way, independently of whether the "real" `position_ids` are passed as a kwarg. This is because the model recreates the positions anyway.
```diff
 # embed positions
-if position_ids is None:
-    position_ids = cache_position.unsqueeze(0)
+position_ids = torch.arange(seq_length, device=inputs_embeds.device) + past_key_values_length
```
Seeing it multiple times, but is there any difference for torch between doing `arange` + int vs a single `arange`?
Yes! It's a subtlety, but because `past_key_values_length` will be a tensor if and only if the cache is a `StaticCache`, doing it this way avoids a precious CUDA synchronization in those cases! See #42433 and the issue mentioned there for more details!
```diff
-# Copied from transformers.models.bert.modeling_bert.BertModel._create_attention_masks
+# No longer Copied from transformers.models.bert.modeling_bert.BertModel._create_attention_masks
```
Hmm a bit weird to change this model but not bert 🤔 how come?
I was focusing on finishing the models using modular, and a LOT of other models are copied from bert, so I'll keep it for the next batch!
[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, aya_vision, bamba, bart, big_bird, bigbird_pegasus, biogpt, blenderbot, blenderbot_small, blt, camembert, cohere2, cohere2_vision, colqwen2, data2vec, dbrx
What does this PR do?
As per the title! Follow-up of #44181 with more models!