add rotary kernel support to Qwen3 model #41147

Merged
MekkCyber merged 43 commits into huggingface:main from kaixuanliu:rotary-kernel
Nov 28, 2025

Conversation

@kaixuanliu (Contributor) commented Sep 25, 2025

This PR adds the rotary kernels from https://huggingface.co/kernels-community/rotary to the Qwen3 series models.

Here are some benchmarks comparing the performance of the rotary kernels against the apply_rotary_pos_emb function in transformers.
For A100:
[figure: rotary_a100_combined_comparison]
And for Intel XPU:
[figure: rotary_xpu_combined_comparison]
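
For reference, a minimal sketch of pulling this kernel from the Hub (assuming the `kernels` package is installed; the `apply_rotary` call signature below is the one used by the wrapper discussed later in this thread):

```python
from kernels import get_kernel

# Download the kernel from the Hub (cached locally after the first call);
# requires the `kernels` package and a supported accelerator.
rotary = get_kernel("kernels-community/rotary")

# Low-level entry point that this PR wraps to match transformers'
# apply_rotary_pos_emb; it rotates the two input halves in place:
#   rotary.apply_rotary(q1, q2, cos, sin, out1, out2, conjugate)
print(rotary.apply_rotary)
```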

@kaixuanliu kaixuanliu marked this pull request as ready for review September 25, 2025 07:45
@kaixuanliu (Contributor, Author) commented Sep 25, 2025

I benchmarked the Qwen/Qwen3-4B-Instruct-2507 model: on Intel XPU it gets a ~10% end-to-end (E2E) performance improvement, while on A100 there is no obvious improvement or regression. Please let me know if it is OK to apply the rotary kernel in this manner, and I will then add support for more models.

@kaixuanliu kaixuanliu marked this pull request as draft September 25, 2025 08:43
@kaixuanliu kaixuanliu marked this pull request as ready for review September 25, 2025 10:26
@Rocketknight1 (Member)

cc @ArthurZucker

@MekkCyber (Contributor) left a comment

Thanks for this integration @kaixuanliu! I left a few nits to consider.

Comment on lines +517 to +519
global use_kernels
use_kernels = getattr(self, "use_kernels", False)

Contributor:

It's better to pass an attention kwarg, for example use_rotary_kernel, than to define a global variable like this.

Contributor Author:

You mean add a param called use_rotary_kernel to kwargs here and pass it down to Qwen3Attention?
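
For illustration, a sketch of that kwarg-threading idea (the `use_rotary_kernel` name and the dispatch helper are hypothetical, not the final API; `kernel_fn` stands in for this PR's apply_rotary_kernel wrapper):

```python
from transformers.models.qwen3.modeling_qwen3 import apply_rotary_pos_emb

def rotate_qk(q, k, cos, sin, kernel_fn=None, **kwargs):
    """Dispatch on a per-call attention kwarg instead of a module-level global.

    `use_rotary_kernel` would be threaded from the model's forward down into
    Qwen3Attention via **kwargs, so nothing is decided at import time.
    """
    if kwargs.get("use_rotary_kernel", False) and kernel_fn is not None:
        return kernel_fn(q, k, cos, sin)
    return apply_rotary_pos_emb(q, k, cos, sin)
```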

from ...cache_utils import Cache, DynamicCache
from ...generation import GenerationMixin
from ...integrations import use_kernel_forward_from_hub
from ...integrations.hub_kernels import rotary_kernel
Contributor:

I think we need to lazily load the kernel, because here we are loading it before even knowing if the user wants to use kernels or not
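
e.g. something along these lines (a sketch of the lazy-loading pattern; per the commit log below, the helper that eventually landed is called `lazy_load_kernel` in hub_kernels.py):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def _get_rotary_kernel():
    """Fetch the hub kernel the first time it is actually needed, instead of
    at import time; returns None when `kernels` is unavailable so callers
    can fall back to the pure-PyTorch path."""
    try:
        from kernels import get_kernel
    except ImportError:
        return None
    return get_kernel("kernels-community/rotary")
```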

@kaixuanliu (Contributor, Author) commented Sep 26, 2025

Thanks for your advice! I have updated the related code.

Comment on lines +125 to +148

def apply_rotary_kernel(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """
    Rotary kernel implementation wrapper.
    Adapts the rotary kernel implementation to match the HuggingFace
    apply_rotary_pos_emb signature.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    q_rotated = q.clone()
    k_rotated = k.clone()

    # Get half dimension for rotation
    half_dim = q.shape[-1] // 2
    q1 = q_rotated[..., :half_dim]
    q2 = q_rotated[..., half_dim:]
    k1 = k_rotated[..., :half_dim]
    k2 = k_rotated[..., half_dim:]
    if cos.shape[-1] != half_dim:
        # Trim cos/sin to match half_dim
        cos = cos[..., :half_dim]
        sin = sin[..., :half_dim]

    # Apply rotary embedding using our kernel (in place on the q halves)
    rotary_kernel.apply_rotary(q1, q2, cos, sin, q1, q2, False)
    # (diff excerpt truncated here; lines +125 to +148 continue with the
    # matching call for the k halves and the return of the rotated tensors)
Contributor:

Did you try to benchmark the performance with and without this kernel?

Contributor Author:

Yes. On Intel XPU a single rotary op takes 0.22 ms, dropping to 0.1 ms after applying this patch: above a 2x speedup.
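
A sketch of how such a microbenchmark can be reproduced (shapes, dtype, and iteration counts are illustrative; this times the eager baseline, and the kernel path can be swapped into the same harness):

```python
import time
import torch
from transformers.models.qwen3.modeling_qwen3 import apply_rotary_pos_emb

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cuda"
sync = torch.xpu.synchronize if device == "xpu" else torch.cuda.synchronize

# Qwen3-4B-like shapes: 32 query heads, 8 KV heads, head_dim 128, seq 1024.
q = torch.randn(1, 32, 1024, 128, device=device, dtype=torch.bfloat16)
k = torch.randn(1, 8, 1024, 128, device=device, dtype=torch.bfloat16)
cos = torch.randn(1, 1024, 128, device=device, dtype=torch.bfloat16)
sin = torch.randn(1, 1024, 128, device=device, dtype=torch.bfloat16)

def bench(fn, iters=100, warmup=10):
    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call

print(f"eager apply_rotary_pos_emb: {bench(lambda: apply_rotary_pos_emb(q, k, cos, sin)):.3f} ms")
```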

@ArthurZucker (Collaborator) left a comment

Hey! Unfortunately this is not how we want to be adding support for kernels in general!
There should be 0 modeling changes involved; especially here, it does not even seem to be required!

We'd rather import once from kernels to replace the rotary embed if the function is defined, or something like that. But in the broad scheme of things, we want a mapping for functions, like we do for classes!

@kaixuanliu (Contributor, Author) commented Oct 10, 2025

@ArthurZucker Thanks for the comment, it makes sense. Since we do not have a schema for mapping functions the way kernels maps classes, we will add that support first, and based on it I will adjust this PR.

@yao-matrix (Contributor)

@ArthurZucker @danieldk, could you comment on the feasibility of Kaixuan's proposal?

@ArthurZucker (Collaborator)

I think @MekkCyber is working on that feature specifically!

@kaixuanliu (Contributor, Author)

> I think @MekkCyber is working on that feature specifically!

@ArthurZucker @MekkCyber, do you mean this PR: #41577?

@MekkCyber (Contributor) commented Oct 17, 2025

Hey @kaixuanliu, yes, we will start using the hub mapping in the PR you linked, but the kernel needs to be a drop-in replacement for the function in the modeling, so we don't have to change the modeling files apart from lazily loading the kernel. In case you need a special function, as in the case of rotary, we can expose it directly in the kernel.
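
In other words, the hub function would mirror the transformers signature exactly. As a sketch of that contract (the body shown is the reference PyTorch math that a compiled kernel would replace):

```python
import torch

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """A drop-in hub kernel must keep exactly this name, argument list, and
    (q_embed, k_embed) return contract so no modeling code has to change."""
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```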

@kaixuanliu kaixuanliu marked this pull request as draft October 21, 2025 06:36
@ArthurZucker (Collaborator) left a comment

Very very nice! IDK if we "have" to use the self.rotary_fn func? If not it would be perfect hehe

Kudos everyone 🚀

        self.q_norm = Dots1RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # unlike olmo, only on the head dim!
        self.k_norm = Dots1RMSNorm(self.head_dim, eps=config.rms_norm_eps)  # thus post q_norm does not need reshape
        self.sliding_window = config.sliding_window if self.layer_type == "sliding_attention" else None
        self.rotary_fn = apply_rotary_pos_emb
Collaborator:

Do we need self? If not, we can just directly use the func? (I did not follow precisely!)

Contributor Author:

This is generated by modular; if we modify it, utils/check_modular_conversion.py will fail.

Collaborator:

I know, I mean for the original model!

Contributor:

self here is necessary: the kernelized function must be included in the module so that calling kernelize on the model can detect it.
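
A minimal sketch of that constraint (class name is illustrative): kernelize walks the model's module tree, so the function has to be reachable as a module attribute to be swapped.

```python
import torch.nn as nn
from transformers.models.qwen3.modeling_qwen3 import apply_rotary_pos_emb

class ToyAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # Bound on the module so kernelize(model), which traverses
        # submodules and their attributes, can find and replace it; a bare
        # module-level call site would be invisible to that traversal.
        self.rotary_fn = apply_rotary_pos_emb

    def forward(self, q, k, cos, sin):
        return self.rotary_fn(q, k, cos, sin)
```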

    return lambda cls: cls


def use_kernel_func_from_hub(func_name: str):
    if _kernels_enabled and _has_use_kernel_func_from_hub:
Collaborator:

@MekkCyber we need some docs here on usage etc!
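
Until those docs land, a rough usage sketch (assuming the `kernels` package is installed; `kernelize` and `Mode` are that library's public API, and kernel swapping stays opt-in):

```python
import torch
from kernels import Mode, kernelize
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", dtype=torch.bfloat16, device_map="auto"
)
# Functions decorated with @use_kernel_func_from_hub keep their pure-Python
# behavior by default; the hub kernels are only swapped in when the model
# is explicitly kernelized.
model = kernelize(model, mode=Mode.INFERENCE)
```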

    return attn_output, attn_weights


@use_kernel_func_from_hub("rotary_pos_emb")
Collaborator:

Let's put it on llama and all models that have the same, no?

Contributor Author:

Thanks for the advice! Since the current implementation does not use kernels for functions by default, unlike the former version, I think it is OK to add this to all models. I have updated the code.

@kaixuanliu kaixuanliu marked this pull request as draft November 28, 2025 14:43
@github-actions (bot)

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, arcee, aria, bamba, bitnet, cohere, csm, cwm, dbrx, deepseek_v3, dia, diffllama, doge, dots1, emu3, ernie4_5

@ArthurZucker ArthurZucker marked this pull request as ready for review November 28, 2025 17:02
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@MekkCyber (Contributor) left a comment

Thanks for your patience @kaixuanliu! LGTM

@MekkCyber MekkCyber merged commit 6587d77 into huggingface:main Nov 28, 2025
24 checks passed
@vasqu vasqu mentioned this pull request Nov 28, 2025
sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025
* add rotary kernel support to Qwen3 model
* delete unnecessary import
* adjust code
* adjust code
* put get rotary kernel to hub_kernels.py
* fix wrong import
* refine code and adjust related modular code
* fix modular mismatch bug
* update code, use lazy load kernels
* fix check modular conversion issue
* fix CI bug for qwen3-next
* fix CI issue
* delete unused code
* rename to `apply_rotary_transformers`
* adjust import `lazy_load_kernel` location
* Update modular-generated modeling files with lazy_load_kernel import location
* fix conflicts
* add more check
* use decorator to map kernels for functions
* small fix
* small adjustment
* update code
* fix LINT issue
* update code to adapt to new `use_kernel_func_from_hub` API in kernels
* do not consider check_modular first
* update
* fix
* add compatibility for old version `kernels`
* add rotary fn kernel to all models
* update modular part
* Revert "update modular part" (this reverts commit b8b68c7.)
* update code

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@Cyrilvallez (Member)

Hmm, this adds a random self.rotary_fn to the module which is not used... IMO forward should be changed to use self.rotary_fn then!

@kaixuanliu (Contributor, Author)

Hi @Cyrilvallez, self.rotary_fn is needed here, as the current design of the use_kernel_func_from_hub decorator needs to bind the function to a Module. Maybe we can add a line of comment here?

@MekkCyber (Contributor)

It's not really necessary to use self.rotary_fn, since it's only there to make the function discoverable by the kernelize process. Btw, I'm trying to think of a better way to do that.

@Cyrilvallez (Member)

Yeah, I know it's needed; I was just saying that it's a bit awkward right now as it's not being used! But all good, if @MekkCyber is looking for a better way then it can wait in the meantime!

@kaixuanliu (Contributor, Author)

Yes, I agree. It would be much better to use self.rotary_fn in place of apply_rotary_pos_emb in forward. But let's wait and see if @MekkCyber comes up with a better design.

SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
ArthurZucker added a commit to ArthurZucker/transformers that referenced this pull request Apr 13, 2026
ArthurZucker added a commit that referenced this pull request Apr 13, 2026
…45414)

* Fix `IndexError: pop from an empty deque` under DeepSpeed ZeRO-3

When `kernels` is installed, `@use_kernelized_func` attaches a
`rotary_fn` child `nn.Module` to attention layers. DeepSpeed ZeRO-3's
parameter coordinator traces the module graph at init and expects
every registered submodule to be invoked during forward. The model's
forward still calls the plain Python `apply_rotary_pos_emb`, so
`rotary_fn` is never executed and the trace desynchronizes, raising
`IndexError: pop from an empty deque` on the second forward.

Skip attaching the kernelized submodule when ZeRO-3 is enabled; users
running under ZeRO-3 fall back to the Python implementation, which is
what they were getting before #41147.

Fixes #45137

* Add dates to new model cards to satisfy check-repository-consistency
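
A sketch of the guard described in that commit message (`is_deepspeed_zero3_enabled` is the existing transformers helper; the attach helper around it is illustrative):

```python
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

def maybe_attach_kernel_fn(module, name, kernel_module):
    """Skip registering the kernelized rotary fn under ZeRO-3: a child
    nn.Module that is registered but never called during forward would
    desynchronize DeepSpeed's parameter-coordinator trace."""
    if is_deepspeed_zero3_enabled():
        return  # fall back to the plain Python apply_rotary_pos_emb
    setattr(module, name, kernel_module)
```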
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026