Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active#45395
ArthurZucker wants to merge 3 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
When `kernels` is installed, `@use_kernelized_func` attaches a `rotary_fn` child `nn.Module` to attention layers. DeepSpeed ZeRO-3's parameter coordinator traces the module graph at init and expects every registered submodule to be invoked during forward. The model's forward still calls the plain Python `apply_rotary_pos_emb`, so `rotary_fn` is never executed and the trace desynchronizes, raising `IndexError: pop from an empty deque` on the second forward. Skip attaching the kernelized submodule when ZeRO-3 is enabled; users running under ZeRO-3 fall back to the Python implementation, which is what they were getting before huggingface#41147. Fixes huggingface#45137
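A minimal sketch of the shape of the fix, assuming a hypothetical attach helper (`_FnModule` and `attach_kernelized_rotary` below are illustrative stand-ins, not the actual `transformers`/`kernels` internals):

```python
import torch.nn as nn
from transformers.integrations import is_deepspeed_zero3_enabled


class _FnModule(nn.Module):
    """Toy stand-in for the wrapper that `kernels` registers around a function."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, *args, **kwargs):
        return self.fn(*args, **kwargs)


def attach_kernelized_rotary(module: nn.Module, python_fn) -> None:
    """Register a kernelized wrapper of `python_fn` as a child module, unless ZeRO-3 is active."""
    if is_deepspeed_zero3_enabled():
        # ZeRO-3's parameter coordinator expects every registered submodule to be
        # invoked during forward; an unused `rotary_fn` child desynchronizes the
        # fetch trace, so skip the attach and keep the plain Python implementation.
        return
    module.rotary_fn = _FnModule(python_fn)
```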
vasqu left a comment
My only real question: does this mean DeepSpeed will no longer use a kernelized RoPE, or does it still?
(Also what's up with CI 👀)
- *This model was released on 2023-02-06 and added to Hugging Face Transformers on 2026-03-19.*
+ *This model was released on 2023-02-06 and added to Hugging Face Transformers on 2026-03-21.*
Unrelated? Not a big deal either way
+ from .deepspeed import is_deepspeed_zero3_enabled
  ...
+ if is_deepspeed_zero3_enabled():
+     return
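For context, a sketch of when that guard fires: under the HF integration, `is_deepspeed_zero3_enabled()` reflects whether a ZeRO stage-3 `HfDeepSpeedConfig` is currently alive, and the config has to be created before the model is instantiated, which is why an init-time check can already see ZeRO-3 (assuming `accelerate`/`deepspeed` are installed; the minimal config dict below is illustrative):

```python
from transformers.integrations import HfDeepSpeedConfig, is_deepspeed_zero3_enabled

# Created (and kept referenced) before the model is built, per the HF ZeRO-3 docs.
ds_config = {"zero_optimization": {"stage": 3}}
dschf = HfDeepSpeedConfig(ds_config)  # keep this reference alive

assert is_deepspeed_zero3_enabled()
```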
Does this mean the kernels version of RoPE cannot be used with DeepSpeed this way?
The whole kernelizing process seems to go through multiple paths and it's a bit hard to keep track of:
- Kernelized RoPE only happens later (after init?)
- We register it under the module (as it's a requirement for kernels, I guess)
It should be fine as a quick fix, for sure.
I think it's the way we kernelize functions (today only RoPE).
but yeah, we do it at init as a module, and it's only triggered lazily I think
I thought a bit more about this and it seems this whole approach might be a bit outdated? Imo it would make more sense to use the lazy-load approach as in e.g. Mamba: transformers/src/transformers/models/mamba/modeling_mamba.py, lines 107 to 118 in def8e6a. No decorator anymore, but keeping it in global variable state. I'm surprised either way that we had to set this as an attribute, to be honest 😬
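A rough sketch of that lazy-load, global-state pattern (names are illustrative, not the actual Mamba or kernels code; `get_kernel("rotary")` is a placeholder lookup):

```python
# Keep the (possibly kernelized) rotary function in module-level state instead of
# registering an extra nn.Module on every attention layer.
_rotary_impl = None  # resolved on first use


def _python_rotary(q, k, cos, sin):
    # toy stand-in for transformers' apply_rotary_pos_emb (not the real math)
    return q * cos, k * sin


def get_rotary_impl():
    global _rotary_impl
    if _rotary_impl is None:
        try:
            from kernels import get_kernel  # placeholder repo id below; real lookup differs

            _rotary_impl = get_kernel("rotary")
        except Exception:
            _rotary_impl = _python_rotary  # fall back to plain Python
    return _rotary_impl
```

Nothing ends up in the module graph this way, so ZeRO-3's trace never sees an extra child.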
Haha yeah, it's not outdated, that's the issue. It's "new":
Summary
Fixes #45137.
Since #41147, attention layers are decorated with `@use_kernelized_func(apply_rotary_pos_emb)`, which attaches a `rotary_fn` child `nn.Module` at init when the `kernels` library is available. DeepSpeed ZeRO-3's parameter coordinator traces the module graph at init and expects every registered submodule to run during forward. The attention forward still calls the Python `apply_rotary_pos_emb`, so `rotary_fn` is never invoked and the parameter-fetch trace desynchronizes, raising `IndexError: pop from an empty deque` on the second forward (reproducible via TRL's RLOO/GRPO trainers under ZeRO-3, see huggingface/trl#4899).
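A toy illustration of the failing pattern (plain PyTorch, no DeepSpeed; `KernelizedFn` and `ToyAttention` are invented stand-ins): the wrapper gets registered as a child at init, but forward never calls it, which is exactly the registered-but-unused submodule that ZeRO-3's prefetch trace cannot tolerate.

```python
import torch
from torch import nn


def apply_rotary_pos_emb_python(q, cos, sin):
    # toy stand-in for the plain Python rotary implementation
    return q * cos + q * sin


class KernelizedFn(nn.Module):
    """Stand-in for the wrapper that @use_kernelized_func registers as `rotary_fn`."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, *args):
        return self.fn(*args)


class ToyAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        # Registered at init when `kernels` is available ...
        self.rotary_fn = KernelizedFn(apply_rotary_pos_emb_python)

    def forward(self, x, cos, sin):
        q = self.q_proj(x)
        # ... but the forward keeps calling the plain function, so the registered
        # `rotary_fn` child never runs. Under ZeRO-3 the parameter coordinator's
        # per-module trace then goes out of sync and the second forward raises
        # "IndexError: pop from an empty deque".
        return apply_rotary_pos_emb_python(q, cos, sin)


if __name__ == "__main__":
    layer = ToyAttention(8)
    x = cos = sin = torch.randn(2, 8)
    print(layer(x, cos, sin).shape)  # runs fine without ZeRO-3
```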