ggml-webgpu: Add fused RMS_NORM + MUL #21983
Conversation
@reeselevine This PR conflicts with #21873 (especially in …)
Force-pushed 8fd976c to 5fef017
reeselevine left a comment
Thanks for starting to work on fusion, this is a big step!
The performance not changing much is a little disappointing, but also not a blocker. Once we get the fusion framework working we can optimize. I wonder if the reason for the lack of improvement is just that the current RMS_NORM is not very well optimized, so the reduction in bandwidth ends up being hidden because RMS_NORM is too slow.
Do you know if this fusion path leads to significant performance gains in other backends?
```cpp
    size_t memset_bytes_per_thread;

    bool     disable_fusion;
    uint32_t num_additional_fused_ops;
```
this general structure comes from the vulkan backend right? I haven't looked into it too closely, but my first thought is that it seems too general, at least based on this PR, because you end up having to check which ops you are actually fusing, and this doesn't encode that at all.
Yes, it is almost the same as the vulkan backend, but I agree that this is too general. I removed this as I explain below.
```cpp
static bool ggml_webgpu_can_fuse_check(webgpu_context & ctx, const struct ggml_cgraph * cgraph, int node_idx) {
    // RMS_NORM + MUL
    if (ggml_webgpu_can_fuse(cgraph, node_idx, { GGML_OP_RMS_NORM, GGML_OP_MUL })) {
```
Under the hood, can_fuse ends up repeating the check on RMS_NORM + MUL, so shouldn't we really have separate functions for each set of potential fusions?
You're right, it looks better to have a specific can_fuse function for each fusion.
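To illustrate the agreed shape, here is a minimal, self-contained sketch of a dedicated checker per fusion pattern. The types and helpers below (`Op`, `ops_match`, `can_fuse_rms_norm_mul`) are toy stand-ins for the real ggml structures, not the actual backend code:

```cpp
#include <initializer_list>
#include <vector>

// Toy op tags standing in for ggml's GGML_OP_* values.
enum Op { RMS_NORM, MUL, ADD };

// Generic helper: do the ops starting at node_idx match the pattern?
static bool ops_match(const std::vector<Op> & graph, int node_idx,
                      std::initializer_list<Op> pattern) {
    if (node_idx + (int) pattern.size() > (int) graph.size()) {
        return false;
    }
    int i = node_idx;
    for (Op op : pattern) {
        if (graph[i++] != op) {
            return false;
        }
    }
    return true;
}

// One dedicated checker per fusion, as suggested in the review. The real
// check would also verify that MUL actually consumes RMS_NORM's output.
static bool can_fuse_rms_norm_mul(const std::vector<Op> & graph, int node_idx) {
    return ops_match(graph, node_idx, { RMS_NORM, MUL });
}
```

With this shape, adding a second fusion pattern later means adding a second small named function rather than growing one generic checker.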
```cpp
    if (!ctx->disable_fusion) {
        ggml_webgpu_can_fuse_check(ctx, cgraph, i);
    }
    if (auto cmd = ggml_webgpu_encode_node(ctx, cgraph->nodes, i)) {
```
Right now we're hiding whether encode_node encodes one node or two based on whether the next node will be fused.
Instead, what do you think of just calling encode_node (maybe the function name should change to encode, to reflect that it might encode multiple nodes) and updating i based on the number of fused operations? So for the new RMS_NORM + MUL, we end up updating i by 2. That avoids hiding the fusion in the additional_fused_ops variable. That seems cleaner for now, but maybe I'm missing something that doesn't translate well to future fusions?
Sounds good. It looks like whether the ops can be fused can basically be determined from cgraph->nodes, at least from my understanding. So I moved the can_fuse logic into ggml_webgpu_encode, which removes num_additional_fused_ops and also eliminates the double-checking for fusion (ggml_webgpu_can_fuse_check in ggml_backend_webgpu_graph_compute and ctx->num_additional_fused_ops > 0). I think this addresses your comment. How about this?
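The control flow being discussed can be sketched with toy types (not the real ggml/WebGPU structures): the encoder reports how many graph nodes it consumed, and the compute loop advances `i` by that count instead of tracking fusion state in the context:

```cpp
#include <vector>

// Toy op tags standing in for ggml's GGML_OP_* values.
enum Op { RMS_NORM, MUL, ADD };

// Pretend-encoder: fuses an adjacent RMS_NORM + MUL pair into one command,
// otherwise encodes a single node. Returns how many nodes it consumed.
static int encode(const std::vector<Op> & nodes, int i) {
    if (i + 1 < (int) nodes.size() && nodes[i] == RMS_NORM && nodes[i + 1] == MUL) {
        return 2;  // one fused command covering two graph nodes
    }
    return 1;      // one command for one node
}

// Sketch of the graph-compute loop: i advances past everything the
// command covered, so no num_additional_fused_ops field is needed.
static int count_commands(const std::vector<Op> & nodes) {
    int commands = 0;
    for (int i = 0; i < (int) nodes.size(); ) {
        i += encode(nodes, i);
        commands++;
    }
    return commands;
}
```

For a graph `{ADD, RMS_NORM, MUL, ADD}` this encodes three commands, with the middle pair collapsed into one.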
```cpp
    (uint32_t) dst->ne[1],
    (uint32_t) dst->ne[2],
    (uint32_t) dst->ne[3],
    *(uint32_t *) rn_dst->op_params  // epsilon, treated as f32 in the shader
```
This leads to compiler warnings and will fail when the new ggml-webgpu-nvidia-ci is enabled; I moved to a new format: https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-webgpu/ggml-webgpu.cpp#L1910
Got it, I’ll update after this PR is rebased because this fix depends on the new function ggml_webgpu_u32_from_f32.
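The idea behind that helper is to bit-reinterpret an f32 parameter (like RMS_NORM's epsilon) as a u32 so it can travel through a uniform buffer of u32s without pointer-punning. A minimal sketch, assuming only standard C++ (the body here is illustrative, not the actual `ggml_webgpu_u32_from_f32` implementation):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of an f32 as a u32. std::memcpy makes this
// well-defined, unlike the `*(uint32_t *) &f` cast it replaces.
static uint32_t u32_from_f32(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}
```

The shader then does the inverse reinterpretation (e.g. WGSL's `bitcast<f32>`) to recover epsilon.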
I re-ran the benchmarks on the same three models. As in the first test, the end-to-end numbers are about the same. The command is like this:

- Llama-3.2-3B-Instruct Q4_K_M
- Qwen3.5-4B Q4_K_M
- Gemma-4-E4B-it Q4_K_M

Using

Notes on the table:
That's plausible, yes. In addition, the RMS_NORM + MUL path's share of total
I haven't measured other backends in my environment, but the corresponding
reeselevine left a comment
Needs a merge or rebase but otherwise this is looking good!
```cpp
static std::optional<webgpu_encoded_op> ggml_webgpu_encode(webgpu_context ctx,
                                                           ggml_cgraph *  cgraph,
                                                           int            node_idx,
                                                           int &          num_encoded_ops) {
```
I think there might be a cleaner way of returning the number of encoded ops than passing the extra parameter by reference, but I also realize that returning the optional encoded_op makes it slightly more complex. One option would be to just put num_encoded_ops as a field in webgpu_encoded_op. But I think this is also ok for now; I might think about it a bit more.
Got it, thanks. I'll leave the current form for now then.
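The alternative floated above (carrying the consumed-node count inside the returned value instead of an out-parameter) can be sketched with illustrative stand-in types, not the real webgpu_encoded_op:

```cpp
#include <optional>

// Stand-in for an encoded command plus how many graph nodes it covers.
struct encoded_op {
    int command_id;       // placeholder for the encoded GPU command
    int num_encoded_ops;  // nodes consumed: 1 normally, 2 for a fused pair
};

// Pretend-encoder: returns nullopt when nothing is encoded (e.g. a no-op
// node); otherwise the count travels with the result, so no reference
// parameter is needed.
static std::optional<encoded_op> encode(int node_idx, bool fuse_next) {
    if (node_idx < 0) {
        return std::nullopt;
    }
    return encoded_op{ node_idx, fuse_next ? 2 : 1 };
}
```

The caller then advances its loop index by `result->num_encoded_ops`; the trade-off is that a "nothing encoded" result must still communicate how far to advance, which is why the out-parameter form also remains reasonable.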
```cpp
        case GGML_OP_REPEAT:
            return ggml_webgpu_repeat(ctx, src0, node);
        case GGML_OP_RMS_NORM:
            if (ggml_webgpu_can_fuse_rms_norm_mul(ctx, cgraph, node_idx)) {
```
I think this is a nice way of doing the check, thanks!
```cpp
    webgpu_ctx->global_ctx     = dev_ctx->webgpu_global_ctx;
    webgpu_ctx->shader_lib     = std::make_unique<ggml_webgpu_shader_lib>(dev_ctx->webgpu_global_ctx->device);
    webgpu_ctx->disable_fusion = getenv("GGML_WEBGPU_DISABLE_FUSION") != nullptr;
```
I'm hesitant to include environment variables in the WebGPU backend, ideally all the logic for why we should/should not do something can exist in the backend code itself. Is there a specific reason for including disabling fusion as an option like this?
This is because I followed the other backends (vulkan, cuda, and opencl have this env var), so I didn't think about it too deeply. I confirmed that PR #14545 added the same env var to the vulkan backend because an execution error was reported after the RMS_NORM + MUL fusion PR was merged. However, that error was actually due to bugs in the fusion implementation, so it seems they added the env var only as a tentative guard. I realize we should be cautious about adding more external dependencies too, and I'm not sure disabling fusion is necessary for now. So I'm starting to think we should remove this env var and disable_fusion for now. What do you think?
I agree, I think we can remove it, and if we do run into issues due to fusion, we can try to fix forward or try to add guards within the WebGPU backend itself to disable it.
Force-pushed 8d925e6 to ac69ca8
```cpp
    key.overlap     = context.overlap;
    key.src_overlap = context.src_overlap;
    ggml_webgpu_binary_pipeline_key key = {
        .type = context.dst->type,
```
I'm trying to move away from the C++20 designated initializers; they lead to compiler warnings right now, and when I enable the ggml-ci-webgpu-nvidia node they will cause errors, as that CI is stricter.
Got it, I updated accordingly.
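The fix being described swaps designated initializers for plain field assignment after value-initialization, which compiles cleanly under pedantic pre-C++20 settings. A minimal sketch with illustrative stand-in types (not the real ggml_webgpu_binary_pipeline_key):

```cpp
// Stand-in for a pipeline-cache key; the real struct lives in ggml-webgpu.cpp.
struct pipeline_key {
    int  type;
    bool overlap;
    bool src_overlap;
};

// Instead of `pipeline_key key = { .type = t, ... }` (a C++20 designated
// initializer), value-initialize and assign each field explicitly.
static pipeline_key make_key(int type, bool overlap, bool src_overlap) {
    pipeline_key key{};
    key.type        = type;
    key.overlap     = overlap;
    key.src_overlap = src_overlap;
    return key;
}
```

Value-initializing with `{}` first also guarantees any padding-adjacent fields added later start zeroed, which matters if the key is ever hashed bytewise.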
Force-pushed be84398 to 89cc0bb
* fused rms_norm_mul + mul
* Add GGML_WEBGPU_DISABLE_FUSION for being able to disable kernel fusion.
* Decouple num_fused_ops from webgpu_context; misc cleanup
* Fix eps handling and remove disable_fusion.
* Fix not to use c++20 initializers.
Overview
This PR adds initial kernel fusion to the WebGPU backend, starting with RMS_NORM + MUL (similar to #14800).
The performance on the major models on my device (M2, Metal 4) is as follows; unfortunately, performance is almost unchanged with this implementation.
The command is like this:
`llama-bench -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -fa 1 -p 512 -n 0`