
[dump] support npu fusion patch#39238

Closed
zheliuyu wants to merge 2 commits into huggingface:main from zheliuyu:main

Conversation

@zheliuyu
Contributor

@zheliuyu zheliuyu commented Jul 6, 2025

What does this PR do?

An attempt for #39105

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

WIP

@zheliuyu zheliuyu marked this pull request as draft July 6, 2025 08:35
Collaborator

@ArthurZucker ArthurZucker left a comment


I recommend using something like #36853! We can add documentation about this if you want!

@zheliuyu
Contributor Author

zheliuyu commented Jul 8, 2025

I recommend using something like #36853! We can add documentation about this if you want!

Do you mean loading the accelerated APIs of npu through kernels?

@ArthurZucker
Collaborator

yes, via the _KERNEL_MAPPING

@zheliuyu
Contributor Author

yes, via the _KERNEL_MAPPING

Thanks for your suggestion.
I prefer this simple patch method over going through _KERNEL_MAPPING for two reasons:

  • 1. Some users may not be able to access the huggingface-hub, which the npu fusion kernels would have to be fetched from if they went through _KERNEL_MAPPING.
  • 2. _KERNEL_MAPPING already contains many acceleration modules for GPUs, and adding acceleration modules for third-party devices disrupts its original architecture; I am worried it will make _KERNEL_MAPPING increasingly complex.

Please consider these points. I look forward to your reply.
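[Editor's note] The "simple patch method" discussed here is essentially a class-level forward swap (monkey patch). A minimal, self-contained sketch of the pattern is below; the class and function names (`RMSNorm`, `fused_rms_norm_forward`, `apply_npu_patch`) are illustrative stand-ins, not the actual transformers or torch_npu APIs, and the "fused" path just reproduces the reference math instead of calling a real NPU kernel.

```python
class RMSNorm:
    """Stand-in for a model's RMSNorm layer (eager reference forward)."""

    def __init__(self, weight, eps=1e-6):
        self.weight = weight
        self.eps = eps

    def forward(self, x):
        # Reference path: normalize by root-mean-square, then scale.
        mean_sq = sum(v * v for v in x) / len(x)
        rms = (mean_sq + self.eps) ** 0.5
        return [self.weight * v / rms for v in x]


def fused_rms_norm_forward(self, x):
    # In the real patch this would call a fused NPU operator (e.g. via
    # torch_npu); here it just computes the same result in one pass.
    rms = (sum(v * v for v in x) / len(x) + self.eps) ** 0.5
    scale = self.weight / rms
    return [scale * v for v in x]


def apply_npu_patch():
    # The patch itself: rebind the class attribute so every existing and
    # future instance takes the fused path.
    RMSNorm.forward = fused_rms_norm_forward
```

The appeal of this approach is that it needs no network access and no registry changes; the drawback, as noted later in the thread, is that each patched layer has to be wired up by hand per model.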

@zheliuyu
Contributor Author

[2025.07.16] Experiment: measure the time cost after adding different npu fusion kernels.

Experimental design

Start an SFT task with verl's run_qwen2_5_05b_sft_peft_sp2_npu.sh script. The task fine-tunes Qwen/Qwen2-7B-Instruct from transformers. We run 5 epochs with the configuration below, record the time cost of each epoch, and report the mean over the 5 epochs.

torchrun --standalone --nnodes=1 --nproc_per_node=8 \
     -m verl.trainer.fsdp_sft_trainer \
    data.train_files=/data/gsm8k/train.parquet \
    data.val_files=/data/gsm8k/test.parquet \
    data.prompt_key=extra_info \
    data.response_key=extra_info \
    optim.lr=1e-4 \
    data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=64 \
    model.partial_pretrain=Qwen/Qwen2-7B-Instruct \
    trainer.default_local_dir=./save_dir \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-qwen-2-7b-instruct \
    trainer.logger=console \
    trainer.total_epochs=5 $@ \
    model.lora_rank=32 \
    model.lora_alpha=16 \
    model.target_modules=all-linear \
    model.strategy=fsdp \
    ulysses_sequence_parallel_size=2 \
    use_remove_padding=true

Result

(Figures: per-epoch time-cost statistics and mean time-cost statistics.)

Legend:

  • ori: original configuration.
  • add_rms_norm: patch of rms norm forward added.
  • add_silu: patch of silu forward added.
  • add_rms_norm_silu: both patches enabled.

On the mean epoch time, the rms norm patch gives a ~5.49% speed-up and the silu patch a ~0.72% speed-up; enabling both patches together gives ~6.21%.
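[Editor's note] For clarity on how such percentages are derived from mean epoch times, here is a small sketch. The epoch times used below are hypothetical placeholders, not the actual measurements (which are in the figures above).

```python
def improvement_pct(t_baseline, t_patched):
    """Speed-up of the patched run relative to baseline, in percent."""
    return (t_baseline - t_patched) / t_baseline * 100


# Hypothetical example: a 1000 s baseline epoch dropping to 937.9 s
# corresponds to a ~6.21% improvement.
print(round(improvement_pct(1000.0, 937.9), 2))
```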

@zheliuyu
Contributor Author


Could you give me some suggestions on the modification plan for this part? :) Thanks so much. @ArthurZucker @FightingZhen

Collaborator

@ArthurZucker ArthurZucker left a comment


Hey!
Thanks for the feedback!

1. Some users may not be able to access the huggingface-hub, if npu fusion kernels are obtained through _KERNEL_MAPPING.
2. _KERNEL_MAPPING contains many acceleration modules for GPUs, and adding acceleration modules for third-party devices disrupts its original architecture; I am worried it will make _KERNEL_MAPPING increasingly complex.

Regarding your comments, we want to make sure that both points are addressed!
So:

  1. Let's isolate the kernels and make sure we register them in _KERNEL_MAPPING using npu as the device.
  2. Let's maybe think of a better API / design! The goal is to have many kernel mappings as good defaults, and to allow users to register their own mapping!

In a way, even if it is not via kernels, I want to make sure we set a good precedent! The current PR does not really scale well to new models or to the rest of our code!
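[Editor's note] The design suggested here, device-keyed kernel entries that users can register themselves, can be sketched in a few lines. This is a simplified illustration of the idea, not the actual transformers `_KERNEL_MAPPING` or the kernels-library API; the repo ids (`kernels-community/example-rms-norm`, `my-org/npu-fused-rms-norm`) are made up for the example.

```python
_KERNEL_MAPPING = {
    # layer name -> {device -> kernel reference (e.g. a Hub repo id)}
    "RMSNorm": {"cuda": "kernels-community/example-rms-norm"},
}


def register_kernel_mapping(mapping):
    """Merge user-provided entries (e.g. npu kernels) into the defaults."""
    for layer, per_device in mapping.items():
        _KERNEL_MAPPING.setdefault(layer, {}).update(per_device)


def resolve_kernel(layer, device):
    """Look up the kernel for a layer on a device; None means eager fallback."""
    return _KERNEL_MAPPING.get(layer, {}).get(device)


# Registering npu kernels keeps them isolated under their own device key,
# so they do not disturb the existing GPU entries:
register_kernel_mapping({"RMSNorm": {"npu": "my-org/npu-fused-rms-norm"}})
```

Keying on the device string is what lets npu entries coexist with cuda ones without complicating either side, which addresses the architecture concern raised earlier in the thread.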

@zheliuyu
Contributor Author

Hey! Thanks for the feedback!

1. Some users may not be able to access the huggingface-hub, if npu fusion kernels are obtained through _KERNEL_MAPPING.
2. _KERNEL_MAPPING contains many acceleration modules for GPUs, and adding acceleration modules for third-party devices disrupts its original architecture; I am worried it will make _KERNEL_MAPPING increasingly complex.

Regarding your comments, we want to make sure that both points are addressed! So:

  1. Let's isolate the kernels and make sure we register them in _KERNEL_MAPPING using npu as the device.
  2. Let's maybe think of a better API / design! The goal is to have many kernel mappings as good defaults, and to allow users to register their own mapping!

In a way, even if it is not via kernels, I want to make sure we set a good precedent! The current PR does not really scale well to new models or to the rest of our code!

I agree with your viewpoint. The release of transformers v0.45.0 gave me some inspiration, and I am currently refactoring this PR.

@ArthurZucker
Collaborator

Nice! Eager to see 🤗

@zheliuyu zheliuyu closed this Aug 3, 2025
@ArthurZucker
Collaborator

Hey @zheliuyu, any follow-up here? It seems like the community is interested!

@zheliuyu
Contributor Author

Hey @zheliuyu, any follow-up here? It seems like the community is interested!

Progress was delayed by some other tasks.

Let's restart with this PR: huggingface/kernels#146 \(^▽^)/

@zheliuyu zheliuyu changed the title [WIP]support npu fusion patch [dump] support npu fusion patch Sep 17, 2025