vulkan: optimize mul_mat_id loading row ids into shared memory #15427
Merged
0cc4m merged 1 commit into ggml-org:master on Aug 23, 2025
Contributor
jeffbolznv
commented
Aug 19, 2025
- Spread the work across the whole workgroup. Using more threads seems to far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
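
The two ideas above can be sketched on the CPU. This is only an illustration under assumptions, not the shader code from this PR: the function names (`div_mod`, `indices_for_thread`) are hypothetical, and the sketch just shows the general technique of replacing division by a power of two with a shift and a mask, plus a strided loop in which every thread of the workgroup takes a share of the indices.

```python
def is_power_of_two(n):
    """True if n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def div_mod(i, divisor):
    """Return (i // divisor, i % divisor) for non-negative i.

    Hypothetical helper: when the divisor is a power of two, the quotient
    is a right shift and the remainder is a bit mask, which avoids the
    comparatively expensive integer division.
    """
    if is_power_of_two(divisor):
        shift = divisor.bit_length() - 1
        return i >> shift, i & (divisor - 1)
    return i // divisor, i % divisor


def indices_for_thread(thread_id, workgroup_size, total):
    """Indices handled by one thread in a strided loop over `total` items.

    Hypothetical helper: every thread in the workgroup participates, so the
    loop trip count per thread is about total / workgroup_size.
    """
    return list(range(thread_id, total, workgroup_size))


if __name__ == "__main__":
    print(div_mod(13, 4))   # shift/mask path
    print(div_mod(13, 3))   # generic path
    print(indices_for_thread(2, 8, 30))
```

On a GPU the strided loop would be followed by a workgroup barrier before the shared-memory results are read, which is the synchronization overhead the description refers to.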
0cc4m
approved these changes
Aug 23, 2025
Contributor
0cc4m
left a comment
Wow, great improvement. This closes the gap between CUDA and Vulkan MMID significantly. In some cases Vulkan even beats CUDA in pp512 now on my RTX 3090.
Example result, on Master:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 1254.06 ± 7.48 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 0 | tg128 | 140.24 ± 0.75 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 1284.92 ± 6.66 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 143.79 ± 0.53 |
PR:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 0 | pp512 | 1986.57 ± 16.09 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 0 | tg128 | 137.40 ± 1.86 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 1 | pp512 | 2055.43 ± 15.58 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan | 99 | 1 | tg128 | 139.03 ± 0.14 |
CUDA:
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 0 | pp512 | 1972.23 ± 16.20 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 0 | tg128 | 139.29 ± 0.31 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | pp512 | 2106.30 ± 9.63 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 141.42 ± 0.42 |
No change on (non-coopmat) AMD and Intel, of course.
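
For a quick read of the tables above, the ratios can be computed from the reported mean t/s values. This is a small sketch that ignores the ± error bars; the dictionary layout and helper name are assumptions made for illustration.

```python
# Mean tokens/s copied from the tables above, keyed by (test, fa).
master = {("pp512", 0): 1254.06, ("tg128", 0): 140.24,
          ("pp512", 1): 1284.92, ("tg128", 1): 143.79}
pr = {("pp512", 0): 1986.57, ("tg128", 0): 137.40,
      ("pp512", 1): 2055.43, ("tg128", 1): 139.03}
cuda = {("pp512", 0): 1972.23, ("tg128", 0): 139.29,
        ("pp512", 1): 2106.30, ("tg128", 1): 141.42}


def ratio(a, b, key):
    """Throughput of run `a` relative to run `b` for one (test, fa) pair."""
    return a[key] / b[key]


if __name__ == "__main__":
    for key in sorted(master):
        test, fa = key
        print(f"{test} fa={fa}: "
              f"PR/master = {ratio(pr, master, key):.2f}x, "
              f"PR/CUDA = {ratio(pr, cuda, key):.2f}x")
```

The pp512 rows show the gain from this PR (roughly 1.6x over master), while the tg128 rows stay within a few percent of master.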
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request on Aug 25, 2025
blime4 referenced this pull request in blime4/llama.cpp on Feb 5, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026