
[dump] support npu fusion patch#39238

Closed
zheliuyu wants to merge 2 commits into huggingface:main from zheliuyu:main

Conversation

@zheliuyu
Contributor

@zheliuyu zheliuyu commented Jul 6, 2025

What does this PR do?

An attempt for #39105

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

WIP

@zheliuyu zheliuyu marked this pull request as draft July 6, 2025 08:35
Collaborator

@ArthurZucker ArthurZucker left a comment


I recommend using something like #36853! We can add documentation about this if you want!

@zheliuyu
Contributor Author

zheliuyu commented Jul 8, 2025

I recommend using something like #36853! We can add documentation about this if you want!

Do you mean loading the accelerated APIs of npu through kernels?

@ArthurZucker
Collaborator

yes, via the _KERNEL_MAPPING

@zheliuyu
Contributor Author

yes, via the _KERNEL_MAPPING

Thanks for your suggestion.
I prefer this simple patch method over going through _KERNEL_MAPPING for two reasons:

  • 1. Some users may not be able to access the huggingface-hub, which the npu fusion kernels would have to be fetched from if they went through _KERNEL_MAPPING.
  • 2. _KERNEL_MAPPING already contains many acceleration modules for GPUs, and adding acceleration modules for third-party devices disrupts its original architecture; I am worried it will make _KERNEL_MAPPING increasingly complex.

Please consider these points. I look forward to your reply.
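[Editor's note] The "simple patch method" discussed here is essentially a class-level forward swap (monkey patch). A minimal, self-contained sketch of the pattern is below; the class and function names (`RMSNorm`, `fused_rms_norm_forward`, `apply_npu_patch`) are illustrative stand-ins, not the actual transformers or torch_npu APIs, and the "fused" path just reproduces the reference math instead of calling a real NPU kernel.

```python
class RMSNorm:
    """Stand-in for a model's RMSNorm layer (eager reference forward)."""

    def __init__(self, weight, eps=1e-6):
        self.weight = weight
        self.eps = eps

    def forward(self, x):
        # Reference path: normalize by root-mean-square, then scale.
        mean_sq = sum(v * v for v in x) / len(x)
        rms = (mean_sq + self.eps) ** 0.5
        return [self.weight * v / rms for v in x]


def fused_rms_norm_forward(self, x):
    # In the real patch this would call a fused NPU operator (e.g. via
    # torch_npu); here it just computes the same result in one pass.
    rms = (sum(v * v for v in x) / len(x) + self.eps) ** 0.5
    scale = self.weight / rms
    return [scale * v for v in x]


def apply_npu_patch():
    # The patch itself: rebind the class attribute so every existing and
    # future instance takes the fused path.
    RMSNorm.forward = fused_rms_norm_forward
```

The appeal of this approach is that it needs no network access and no registry changes; the drawback, as noted later in the thread, is that each patched layer has to be wired up by hand per model.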

@zheliuyu
Contributor Author

[2025.07.16] Experiment: measure the time cost after adding different npu fusion kernels.

Experimental design

Start an SFT task with verl's run_qwen2_5_05b_sft_peft_sp2_npu.sh script. The task fine-tunes Qwen/Qwen2-7B-Instruct from transformers. We run 5 epochs with the configuration below, record the time cost of each epoch, and report the mean over the 5 epochs.

torchrun --standalone --nnodes=1 --nproc_per_node=8 \
     -m verl.trainer.fsdp_sft_trainer \
    data.train_files=/data/gsm8k/train.parquet \
    data.val_files=/data/gsm8k/test.parquet \
    data.prompt_key=extra_info \
    data.response_key=extra_info \
    optim.lr=1e-4 \
    data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=64 \
    model.partial_pretrain=Qwen/Qwen2-7B-Instruct \
    trainer.default_local_dir=./save_dir \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-qwen-2-7b-instruct \
    trainer.logger=console \
    trainer.total_epochs=5 $@ \
    model.lora_rank=32 \
    model.lora_alpha=16 \
    model.target_modules=all-linear \
    model.strategy=fsdp \
    ulysses_sequence_parallel_size=2 \
    use_remove_padding=true

Result

(Figures: per-epoch time-cost statistics and mean time-cost statistics.)

Legend:

  • ori: original configuration.
  • add_rms_norm: patch of rms norm forward added.
  • add_silu: patch of silu forward added.
  • add_rms_norm_silu: both patches enabled.

On the mean epoch time, the rms norm patch gives a ~5.49% speed-up and the silu patch a ~0.72% speed-up; enabling both patches together gives ~6.21%.
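[Editor's note] For clarity on how such percentages are derived from mean epoch times, here is a small sketch. The epoch times used below are hypothetical placeholders, not the actual measurements (which are in the figures above).

```python
def improvement_pct(t_baseline, t_patched):
    """Speed-up of the patched run relative to baseline, in percent."""
    return (t_baseline - t_patched) / t_baseline * 100


# Hypothetical example: a 1000 s baseline epoch dropping to 937.9 s
# corresponds to a ~6.21% improvement.
print(round(improvement_pct(1000.0, 937.9), 2))
```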

@zheliuyu
Contributor Author


Could you give me some suggestions on the modification plan for this part? :) Thanks so much. @ArthurZucker @FightingZhen

Collaborator

@ArthurZucker ArthurZucker left a comment


Hey!
Thanks for the feedback!

1. Some users may not be able to access the huggingface-hub, if npu fusion kernels are obtained through _KERNEL_MAPPING.
2. _KERNEL_MAPPING contains many acceleration modules for GPUs, and adding acceleration modules for third-party devices disrupts its original architecture; I am worried it will make _KERNEL_MAPPING increasingly complex.

Regarding your comments, we want to make sure that both points are addressed!
So:

  1. Let's isolate the kernels and make sure we register them in _KERNEL_MAPPING using npu as the device.
  2. Let's maybe think of a better API / design! The goal is to have many kernel mappings as good defaults, and to allow users to register their own mapping!

In a way, even if it is not via kernels, I want to make sure we set a good precedent! The current PR does not really scale well to new models or to the rest of our code!
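[Editor's note] The design suggested here, device-keyed kernel entries that users can register themselves, can be sketched in a few lines. This is a simplified illustration of the idea, not the actual transformers `_KERNEL_MAPPING` or the kernels-library API; the repo ids (`kernels-community/example-rms-norm`, `my-org/npu-fused-rms-norm`) are made up for the example.

```python
_KERNEL_MAPPING = {
    # layer name -> {device -> kernel reference (e.g. a Hub repo id)}
    "RMSNorm": {"cuda": "kernels-community/example-rms-norm"},
}


def register_kernel_mapping(mapping):
    """Merge user-provided entries (e.g. npu kernels) into the defaults."""
    for layer, per_device in mapping.items():
        _KERNEL_MAPPING.setdefault(layer, {}).update(per_device)


def resolve_kernel(layer, device):
    """Look up the kernel for a layer on a device; None means eager fallback."""
    return _KERNEL_MAPPING.get(layer, {}).get(device)


# Registering npu kernels keeps them isolated under their own device key,
# so they do not disturb the existing GPU entries:
register_kernel_mapping({"RMSNorm": {"npu": "my-org/npu-fused-rms-norm"}})
```

Keying on the device string is what lets npu entries coexist with cuda ones without complicating either side, which addresses the architecture concern raised earlier in the thread.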

@zheliuyu
Contributor Author

Hey! Thanks for the feedback!

1. Some users may not be able to access the huggingface-hub, if npu fusion kernels are obtained through _KERNEL_MAPPING.
2. _KERNEL_MAPPING contains many acceleration modules for GPUs, and adding acceleration modules for third-party devices disrupts its original architecture; I am worried it will make _KERNEL_MAPPING increasingly complex.

Regarding your comments, we want to make sure that both points are addressed! So:

  1. Let's isolate the kernels and make sure we register them in _KERNEL_MAPPING using npu as the device.
  2. Let's maybe think of a better API / design! The goal is to have many kernel mappings as good defaults, and to allow users to register their own mapping!

In a way, even if it is not via kernels, I want to make sure we set a good precedent! The current PR does not really scale well to new models or to the rest of our code!

I agree with your viewpoint. The release of transformers v0.45.0 gave me some inspiration, and I am currently refactoring this PR.

@ArthurZucker
Collaborator

Nice! Eager to see 🤗

@zheliuyu zheliuyu closed this Aug 3, 2025
@ArthurZucker
Collaborator

Hey @zheliuyu, any follow-up here? It seems like the community is interested!

@zheliuyu
Contributor Author

Hey @zheliuyu, any follow-up here? It seems like the community is interested!

Progress was delayed by some other tasks.

Let's restart with this PR: huggingface/kernels#146 \(^▽^)/

@zheliuyu zheliuyu changed the title [WIP]support npu fusion patch [dump] support npu fusion patch Sep 17, 2025