Optimize MOE GEMV kernel for BS > 1. #20905
Conversation
You can check up to bs=8; the earlier kernel was not better performance-wise beyond bs=4, but this kernel might be.
I thought the max batch size for MOE is
Yes, that was because the current kernel didn't show any benefit beyond bs=4; we can increase it if there is some benefit from this kernel. The added benefit is that those batch sizes get CUDA graphs enabled for them.
Will check and get back. @am17an As I mentioned at #20885 (comment), this PR shows better perf across models and GPUs. Your
@gaugarg-nv you can add that check to this PR. However, we would need to do another round of tests with all devices + quants, like @JohannesGaessler did earlier, with this new kernel. Also, did you check performance for bs=1 with this new kernel?
Changes look reasonable to me from static analysis. A performance spot check of the current version looks OK for CDNA.
Never mind the 0.98 result at batch 1; it is noise. Will run a larger sweep once the change from #20885 is added.
@gaugarg-nv AI usage disclosure?
Performance
In my testing of this PR the performance is basically universally better; I don't think it makes sense to add special cases for some combinations of models and GPUs. The small-k logic for batch size 1 still needs to be adjusted, however:
The previous MOE kernel for BS > 1 launched too many thread blocks, with a grid of (nrows_x, nchannels_dst, ncols_dst) and very little work per block: a block of (32, 4) threads computed the inner dot product for a single row. The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently, using warp-level reduction only (no shared-memory sync). This change doesn't increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.
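To make the shape of the new kernel easier to follow, here is a minimal, hedged sketch of a warp-per-row reduction in the same spirit. It uses plain floats instead of the quantized dot products, and the kernel and parameter names are placeholders rather than the actual llama.cpp identifiers; the real mul_mat_vec_q_moe kernel is more involved (expert-id indexing, quantized vec-dot, etc.):

```cuda
// Hedged sketch, not the PR's code. Grid:  (ceil(nrows_x/rows_per_block), nchannels_dst)
//                                   Block: (warp_size, ncols_dst)
// Each warp owns one destination column and rows_per_block rows; the per-row partial
// sums are reduced with warp shuffles only, so no shared memory or __syncthreads().
template <int rows_per_block>
static __global__ void mul_mat_vec_moe_sketch(
        const float * __restrict__ x,   // [nchannels_dst, nrows_x,   ncols_x] weights
        const float * __restrict__ y,   // [nchannels_dst, ncols_dst, ncols_x] activations
        float       * __restrict__ dst, // [nchannels_dst, ncols_dst, nrows_x]
        const int ncols_x, const int nrows_x, const int ncols_dst) {
    constexpr int warp_size = 32;                  // assumes blockDim.x == 32
    const int channel   = blockIdx.y;
    const int col_dst   = threadIdx.y;             // one warp per destination column
    const int row_first = blockIdx.x*rows_per_block;
    const int lane      = threadIdx.x;

    const float * x_c = x   + (size_t) channel*nrows_x  *ncols_x;
    const float * y_c = y   + ((size_t) channel*ncols_dst + col_dst)*ncols_x;
    float       * d_c = dst + ((size_t) channel*ncols_dst + col_dst)*nrows_x;

    float sum[rows_per_block] = {0.0f};

#pragma unroll
    for (int r = 0; r < rows_per_block; ++r) {
        const int row = row_first + r;
        if (row < nrows_x) {
            for (int i = lane; i < ncols_x; i += warp_size) {
                sum[r] += x_c[(size_t) row*ncols_x + i] * y_c[i];
            }
        }
    }

#pragma unroll
    for (int r = 0; r < rows_per_block; ++r) {
        // Warp-level tree reduction; each warp works alone.
        for (int offset = warp_size/2; offset > 0; offset >>= 1) {
            sum[r] += __shfl_xor_sync(0xFFFFFFFF, sum[r], offset, warp_size);
        }
        if (lane == 0 && row_first + r < nrows_x) {
            d_c[row_first + r] = sum[r];
        }
    }
}
```

The point of this geometry is that each warp does a meaningful chunk of work (full rows for one destination column) and never has to synchronize with other warps.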
Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
Sorry for the late follow-up on the comments. I was on vacation.
I see a significant speed-up for the BS=8 case with this kernel, so I have changed the value of
Yes, I tried implementing a separate CUDA kernel for BS=1 based on similar ideas of doing warp-only reduction with kernel fusion enabled, but I was not able to beat the performance of the master branch (and there were slight regressions for a few models).
Added to the PR description
I benchmarked the performance:
The maximum value at which MMVQ should be used over MMQ for
Please make one
I forgot: a "_" in the table means to just use the default value of 8. For Ada Lovelace and Blackwell the new code seems to always be faster. The same goes for Volta, since it lacks int8 tensor cores, so MMQ is comparatively slower.
Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
@JohannesGaessler Made the max batch size dependent on GPU arch and datatype. Please take a look and see if this is what you were expecting. |
```diff
 // No shared memory reduction needed since each warp works alone.
 template <ggml_type type, int c_rows_per_block>
-__launch_bounds__(MMVQ_MMID_MAX_BATCH_SIZE*ggml_cuda_get_physical_warp_size(), 1)
+__launch_bounds__(get_mmvq_mmid_max_batch_for_device<type>()*ggml_cuda_get_physical_warp_size(), 1)
```
Ah sorry, I forgot that this kernel only operates on a single column at a time so there are no template specializations where the skipping would have been needed. Still, it's good to have a tighter bound on the launch bounds.
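For readers following the launch-bounds discussion, here is a hedged sketch of what a per-type max-batch helper wired into `__launch_bounds__` can look like. The threshold values and names below are made up for illustration; the actual `get_mmvq_mmid_max_batch_for_device` in the PR may also take the GPU architecture into account (omitted here, e.g. via `__CUDA_ARCH__`):

```cuda
// Hedged sketch only; the values are illustrative, not the ones used in the PR.
// A constexpr helper gives __launch_bounds__ a tight, per-type upper bound on the
// block size instead of one global MMVQ_MMID_MAX_BATCH_SIZE constant.
constexpr int sketch_warp_size = 32;  // assumed NVIDIA warp size

template <int type_id>  // stand-in for ggml_type, e.g. 0 = Q4_0, 1 = Q8_0, 2 = Q6_K
constexpr int sketch_mmid_max_batch() {
    // Hypothetical rule: the K-quant only benefits up to a smaller batch size.
    return type_id == 2 ? 4 : 8;
}

template <int type_id>
__launch_bounds__(sketch_mmid_max_batch<type_id>()*sketch_warp_size, 1)
static __global__ void sketch_mmvq_mmid_kernel(const float * x, float * dst, const int n) {
    // Placeholder body; the real kernel performs the quantized dot products.
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = 2.0f*x[i];
    }
}
```

The benefit of the tighter bound is that the compiler budgets registers for the real worst-case block size of each instantiation rather than for one global maximum.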
* Optimize MOE GEMV kernel for BS > 1.

  The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block: a block of (32, 4) threads computed the inner dot product for a single row. The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently, using warp-level reduction only (no shared-memory sync). This change doesn't increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.

* Remove em-dashes

* Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits

  Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
This PR is a follow-up to #20635
The previous MOE kernel for BS > 1 launched too many thread blocks, with a grid of (nrows_x, nchannels_dst, ncols_dst) and very little work per thread block: a thread block of (32, 4) threads computed the inner dot product for a single row.
The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently, using warp-level reduction only (no shared-memory sync); a hedged host-side sketch of this launch shape is included below.
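To make the launch geometry concrete, a small hedged host-side sketch follows, with rows_per_block = 2 as described. `mul_mat_vec_moe_sketch` refers to the toy kernel sketched earlier in the thread, not to the actual ggml-cuda dispatch code, which also has to handle the expert-id mapping:

```cuda
#include <cuda_runtime.h>

// Hedged host-side sketch of the launch shape described above; illustrative only.
static void launch_moe_gemv_sketch(const float * x, const float * y, float * dst,
                                   const int ncols_x, const int nrows_x,
                                   const int ncols_dst, const int nchannels_dst,
                                   cudaStream_t stream) {
    constexpr int rows_per_block = 2;   // rpb: each warp reduces two rows
    constexpr int warp_size      = 32;  // assumed NVIDIA warp size

    const dim3 grid ((nrows_x + rows_per_block - 1)/rows_per_block,  // ceil(nrows_x/rpb)
                     nchannels_dst);
    const dim3 block(warp_size, ncols_dst);                          // one warp per dst column

    mul_mat_vec_moe_sketch<rows_per_block><<<grid, block, 0, stream>>>(
        x, y, dst, ncols_x, nrows_x, ncols_dst);
}
```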
This change doesn't increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.
Performance
Requirements
`rows_per_block` value.