add rotary kernel support to Qwen3 model #41147
Conversation
I made a benchmark for …
MekkCyber left a comment
Thanks for this integration @kaixuanliu! I left a few nits to consider.
```python
global use_kernels
use_kernels = getattr(self, "use_kernels", False)
```
It's better to pass an attention kwarg, e.g. `use_rotary_kernel`, than to define a global variable like this.
You mean add a param called `use_rotary_kernel` to kwargs here, and pass it down to Qwen3Attention?
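For illustration only, a minimal sketch of the suggested kwarg plumbing; the class names and the `use_rotary_kernel` parameter are hypothetical, not the final API:

```python
# Hypothetical sketch: thread a use_rotary_kernel kwarg down through the
# layers instead of mutating a module-level global.
import torch
from torch import nn


class AttentionSketch(nn.Module):
    def forward(self, hidden_states: torch.Tensor, use_rotary_kernel: bool = False, **kwargs):
        if use_rotary_kernel:
            pass  # dispatch to the hub rotary kernel here
        return hidden_states  # otherwise take the plain apply_rotary_pos_emb path


class ModelSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = AttentionSketch()

    def forward(self, hidden_states, **kwargs):
        # kwargs (including use_rotary_kernel) flow down to every attention layer
        return self.attn(hidden_states, **kwargs)
```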
```python
from ...cache_utils import Cache, DynamicCache
from ...generation import GenerationMixin
from ...integrations import use_kernel_forward_from_hub
from ...integrations.hub_kernels import rotary_kernel
```
I think we need to lazily load the kernel, because here we are loading it before even knowing if the user wants to use kernels or not
Thanks for your advice! I have updated the related code.
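A minimal sketch of what lazy loading can look like, assuming the `kernels-community/rotary` hub repo and a module-level cache (the helper name here is illustrative; the PR's commit history mentions a `lazy_load_kernel` helper):

```python
# Load the hub kernel only on first use, so importing the modeling file
# never triggers a download when kernels are not wanted.
_rotary_kernel = None


def get_rotary_kernel():
    global _rotary_kernel
    if _rotary_kernel is None:
        from kernels import get_kernel  # deferred import: kernels is optional

        _rotary_kernel = get_kernel("kernels-community/rotary")
    return _rotary_kernel
```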
```python
def apply_rotary_kernel(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """
    Rotary kernel implementation wrapper.
    Adapts the rotary kernel implementation to match the HuggingFace
    apply_rotary_pos_emb signature.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    q_rotated = q.clone()
    k_rotated = k.clone()

    # Get half dimension for rotation
    half_dim = q.shape[-1] // 2
    q1 = q_rotated[..., :half_dim]
    q2 = q_rotated[..., half_dim:]
    k1 = k_rotated[..., :half_dim]
    k2 = k_rotated[..., half_dim:]
    if cos.shape[-1] != half_dim:
        # Trim cos/sin to match half_dim
        cos = cos[..., :half_dim]
        sin = sin[..., :half_dim]

    # Apply rotary embedding using our kernel
    rotary_kernel.apply_rotary(q1, q2, cos, sin, q1, q2, False)
```
Did you try to benchmark the performance with and without this kernel?
Yes. On Intel XPU, a single rotary op takes 0.22 ms, and it drops to 0.1 ms after applying this patch: above a 2x speedup.
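For context, a rough sketch of how such a micro-benchmark could be timed; the iteration counts and device handling are illustrative assumptions, not the setup actually used:

```python
import time

import torch


def bench_ms(fn, *args, iters=1000, warmup=50):
    # Wall-clock micro-benchmark; synchronize so queued kernels are counted.
    # Assumes an XPU or CUDA device is available.
    sync = torch.xpu.synchronize if torch.xpu.is_available() else torch.cuda.synchronize
    for _ in range(warmup):
        fn(*args)
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    sync()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call


# e.g. bench_ms(apply_rotary_pos_emb, q, k, cos, sin) vs.
#      bench_ms(apply_rotary_kernel, q, k, cos, sin)
```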
ArthurZucker left a comment
Hey! Unfortunately this is not how we want to be adding support for kernels in general!
There should be 0 modeling changes involved; especially here, it does not even seem to be required!
We'd rather import once from kernels to replace the rotary embed if the function is defined, or something like that. But in the broad scheme of things, we want a mapping for functions, like we do for classes!
@ArthurZucker Thanks for the comment, it makes sense. Since at the function level we do not have a schema for mapping like we have for classes in kernels, we will add the related support first, and then I will adjust this PR based on it.
@ArthurZucker @danieldk, could you comment on the feasibility of Kaixuan's proposal?
I think @MekkCyber is working on that feature specifically!
@ArthurZucker @MekkCyber, do you mean this PR: #41577?
Hey @kaixuanliu, yes, we will start using the hub mapping in the PR you linked, but the kernel needs to be a drop-in replacement for the function in the modeling, so we don't have to change the modeling files apart from lazily loading the kernel, in case you need a special function, for example in the case of …
ArthurZucker left a comment
Very very nice! IDK if we "have" to use the self.rotary func? If not, it would be perfect hehe.
Kudos everyone 🚀
```python
self.q_norm = Dots1RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # unlike olmo, only on the head dim!
self.k_norm = Dots1RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # thus post q_norm does not need reshape
self.sliding_window = config.sliding_window if self.layer_type == "sliding_attention" else None
self.rotary_fn = apply_rotary_pos_emb
```
Do we need `self`? If not, we can just use the func directly? (I did not follow precisely!)
This is generated by modular; if we modify it, it will fail utils/check_modular_conversion.py.
I know, I mean for the original model!
`self` here is necessary: the kernelized function must be included in the module so that calling `kernelize` on the model can detect it.
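A toy sketch of that point, under the simplifying assumption that `kernelize` can only swap callables it finds on module instances (in the actual PR the function is wrapped in a small child `nn.Module`):

```python
from torch import nn


def apply_rotary_pos_emb(q, k, cos, sin):
    ...  # plain Python fallback


class AttentionSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stored on the instance so kernelize(model) can discover and replace it.
        self.rotary_fn = apply_rotary_pos_emb

    def forward(self, q, k, cos, sin):
        # Must be called through self, not the module-level name; otherwise a
        # swapped-in kernel would never run.
        return self.rotary_fn(q, k, cos, sin)
```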
```python
    return lambda cls: cls


def use_kernel_func_from_hub(func_name: str):
    if _kernels_enabled and _has_use_kernel_func_from_hub:
```
@MekkCyber we need some docs here on usage etc!
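Pending proper docs, a hedged sketch of the intended usage; the import path for the decorator is assumed from the diff above (it is not a documented public API), and `kernelize`/`Mode` come from the `kernels` package:

```python
from kernels import Mode, kernelize

from transformers import AutoModelForCausalLM
from transformers.integrations.hub_kernels import use_kernel_func_from_hub  # assumed path


@use_kernel_func_from_hub("rotary_pos_emb")  # no-op when kernels are unavailable/disabled
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    ...  # the plain Python implementation stays the fallback


# Opting in is explicit: nothing changes until the model is kernelized.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
model = kernelize(model, mode=Mode.INFERENCE)
```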
```python
    return attn_output, attn_weights


@use_kernel_func_from_hub("rotary_pos_emb")
```
Let's put it on llama and all models that have the same, no?
Thanks for the advice! Since the current implementation does not use the kernels for functions by default, unlike the former version, I think it is ok to add this to all models. I have updated the code.
force-pushed from 8b420e4 to 898e36e
This reverts commit b8b68c7.
[For maintainers] Suggested jobs to run (before merge):
run-slow: apertus, arcee, aria, bamba, bitnet, cohere, csm, cwm, dbrx, deepseek_v3, dia, diffllama, doge, dots1, emu3, ernie4_5
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
MekkCyber left a comment
Thanks for your patience @kaixuanliu! LGTM
* add rotary kernel support to Qwen3 model
* delete unnecessary import
* adjust code
* adjust code
* put get rotary kernel to hub_kernels.py
* fix wrong import
* refine code and adjust related modular code
* fix modular mismatch bug
* update code, use lazy load kernels
* fix check modular conversion issue
* fix CI bug for qwen3-next
* fix CI issue
* delete unused code
* rename to `apply_rotary_transformers`
* adjust import `lazy_load_kernel` location
* Update modular-generated modeling files with lazy_load_kernel import location
* fix conflicts
* add more check
* use decorator to map kernels for functions
* small fix
* small adjustment
* update code
* fix LINT issue
* update code to adapt to new `use_kernel_func_from_hub` API in kernels
* do not consider check_modular first
* update
* fix
* add compatibility for old version `kernels`
* add rotary fn kernel to all models
* update modular part
* Revert "update modular part" (reverts commit b8b68c7)
* update code

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Hmm, this adds a random …
Hi @Cyrilvallez, …
It's not really necessary to use the …
Yeah, I know it's needed, I was just saying that it's a bit awkward rn as it's not being used! But all good; if @MekkCyber is looking for a better way, then it can wait in the meantime!
Yes, I agree. It would be much better if we use …
…45414)

* Fix `IndexError: pop from an empty deque` under DeepSpeed ZeRO-3

  When `kernels` is installed, `@use_kernelized_func` attaches a `rotary_fn` child `nn.Module` to attention layers. DeepSpeed ZeRO-3's parameter coordinator traces the module graph at init and expects every registered submodule to be invoked during forward. The model's forward still calls the plain Python `apply_rotary_pos_emb`, so `rotary_fn` is never executed and the trace desynchronizes, raising `IndexError: pop from an empty deque` on the second forward.

  Skip attaching the kernelized submodule when ZeRO-3 is enabled; users running under ZeRO-3 fall back to the Python implementation, which is what they were getting before #41147.

  Fixes #45137

* Add dates to new model cards to satisfy check-repository-consistency
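A hedged sketch of the guard described in that fix, using transformers' ZeRO-3 detection helper; the attach function itself is illustrative, not the code from the fix:

```python
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled


def maybe_attach_kernelized_rotary(module, kernelized_rotary):
    # Under ZeRO-3, registering a submodule that forward never calls
    # desynchronizes the parameter-coordinator trace, so skip the
    # attachment and keep the plain Python rotary path instead.
    if is_deepspeed_zero3_enabled():
        return
    module.rotary_fn = kernelized_rotary
```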
Adds rotary kernels from https://huggingface.co/kernels-community/rotary to Qwen3 series models.
Here are some benchmarks comparing performance between the rotary kernels and the `apply_rotary_pos_emb` func in transformers.

For A100:

*(benchmark chart)*

And for Intel XPU:

*(benchmark chart)*
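To try this on your own hardware, a minimal end-to-end sketch; the checkpoint and generation settings are illustrative, and `kernelize` is the explicit opt-in from the `kernels` package:

```python
import torch
from kernels import Mode, kernelize

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # any Qwen3 checkpoint should work
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model = kernelize(model, mode=Mode.INFERENCE)  # swaps in hub kernels, incl. rotary

tok = AutoTokenizer.from_pretrained(model_id)
inputs = tok("Hello, world!", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```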