vulkan: Use mul_mat_vec_id for small values of n by jeffbolznv · Pull Request #18918 · ggml-org/llama.cpp

jeffbolznv · 2026-01-18T16:24:18Z

Change ggml_vk_mul_mat_vec_id_q_f16 to loop over the batch dimension and update the indexing calculations in get_offsets.

Mat-vec is faster than mat-mat for small values of n. We don't get the same reuse of the weights as in the non-ID path, but with this the cost is linear in n rather than n>1 being far slower than n==1.
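
To illustrate the idea (this is not the Vulkan shader or the actual `ggml_vk_mul_mat_vec_id_q_f16` code), here is a minimal reference sketch in plain Python of MUL_MAT_ID computed as one mat-vec per column and per selected expert. The function names, argument layout, and list-based tensors are all hypothetical and chosen only for readability.

```python
def mat_vec(a, x):
    """Multiply an m x k matrix (list of rows) by a length-k vector."""
    return [sum(a_rk * x_k for a_rk, x_k in zip(row, x)) for row in a]


def mul_mat_id_by_mat_vec(experts, b, ids):
    """Hypothetical reference for MUL_MAT_ID built from mat-vec calls.

    experts: n_mats matrices, each m x k
    b:       b[t][s] is the length-k input for column t, expert slot s
    ids:     ids[t][s] is the expert index used by column t, slot s
    Returns out[t][s], a length-m vector.
    """
    out = []
    for t, slots in enumerate(ids):        # loop over the batch dimension n
        row = []
        for s, e in enumerate(slots):      # loop over the n_used expert slots
            row.append(mat_vec(experts[e], b[t][s]))
        out.append(row)
    return out


def mat_vec_calls(n, n_used):
    """Number of mat-vec products issued: grows linearly with n."""
    return n * n_used
```

Each column reads its experts' weights again, so nothing is reused across columns, but the amount of work grows in proportion to n.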

Perf on 5090:

test-backend-ops.exe perf -o MUL_MAT_ID -p mxfp4

before:

  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880):                75400 runs -    13.37 us/run -  66.36 MFLOP/run -   4.96 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880):                 5278 runs -   190.94 us/run - 265.42 MFLOP/run -   1.39 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880):                 4347 runs -   239.34 us/run - 530.84 MFLOP/run -   2.22 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880):               1878 runs -   533.28 us/run -  33.97 GFLOP/run -  63.71 TFLOPS

after:

  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880):                73892 runs -    13.57 us/run -  66.36 MFLOP/run -   4.89 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880):                21112 runs -    48.15 us/run - 265.42 MFLOP/run -   5.51 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880):                10395 runs -    96.23 us/run - 530.84 MFLOP/run -   5.52 TFLOPS
  MUL_MAT_ID(type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880):               1941 runs -   515.45 us/run -  33.97 GFLOP/run -  65.91 TFLOPS
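
As a quick check of the "linear in n" claim, the timings above can be reduced to a speedup and a per-column cost. The sketch below copies the us/run values from the output; the FLOP formula `2 * m * k * n_used * n` is an inference that happens to reproduce the reported MFLOP/run figures, not something taken from the test-backend-ops source.

```python
# (n -> microseconds per run), copied from the test-backend-ops output above.
BEFORE_US = {1: 13.37, 4: 190.94, 8: 239.34, 512: 533.28}
AFTER_US = {1: 13.57, 4: 48.15, 8: 96.23, 512: 515.45}

M, K, N_USED = 2880, 2880, 4


def flop_per_run(n):
    # Assumed formula: one multiply and one add per weight element,
    # per selected expert, per column.
    return 2 * M * K * N_USED * n


def tflops(n, us_per_run):
    return flop_per_run(n) / (us_per_run * 1e-6) / 1e12


def speedup(n):
    return BEFORE_US[n] / AFTER_US[n]


def us_per_column(n, table):
    return table[n] / n


for n in sorted(AFTER_US):
    print(
        f"n={n:3d}  speedup={speedup(n):.2f}x  "
        f"after={us_per_column(n, AFTER_US):.2f} us/column  "
        f"{tflops(n, AFTER_US[n]):.2f} TFLOPS"
    )
```

With these numbers the per-column cost after the change is about 12 us at n=4 and n=8, close to the 13.57 us measured at n=1, while n=1 and n=512 are essentially unchanged.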

This came up when testing #18892. There are small batches that dominate the runtime:

before:
MUL_MAT_ID mxfp4 m=2880 n=4 k=2880 n_expert=32 batch=4: 71436 x 218.869 us = 1.56352e+07 us (1212.48 GFLOPS/s)

after:
MUL_MAT_ID_ADD_ID MUL_MAT_ID mxfp4 m=2880 n=4 k=2880 n_expert=32 batch=4: 71508 x 53.676 us = 3.83833e+06 us (4943.93 GFLOPS/s)

That test also shows a lot of time in small batches of FA, but I'll fix that separately.

0cc4m

Nice, good idea. Improvements all around, in my tests.

jeffbolznv requested a review from 0cc4m as a code owner January 18, 2026 16:24

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jan 18, 2026

0cc4m approved these changes Jan 21, 2026

0cc4m merged commit 50b7f07 into ggml-org:master Jan 21, 2026
77 of 78 checks passed

0cc4m merged 1 commit into ggml-org:master from jeffbolznv:mul_mat_id_small_batch