
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron#18295

Merged
0cc4m merged 3 commits into ggml-org:master from jeffbolznv:topk_moe_sigmoid_bias on Jan 1, 2026

Conversation

@jeffbolznv
Contributor

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified.
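For orientation, here is a minimal NumPy sketch of the sequence this fusion covers: sigmoid gating, a 1D per-expert bias (`exp_probs_b`) applied for selection only, top-k, weight normalization, and a trailing `GGML_OP_SCALE`. The function name, the `eps` term, and the exact normalization are illustrative assumptions modeled on DeepSeek-style sigmoid routing, not the shader code itself:

```python
import numpy as np

def topk_moe_sigmoid_bias(logits, exp_probs_b, k, scale=1.0, eps=1e-9):
    """Hypothetical reference for the fused op.
    logits: (n_tokens, n_experts); exp_probs_b: (n_experts,) 1D bias."""
    probs = 1.0 / (1.0 + np.exp(-logits))              # sigmoid over expert logits
    biased = probs + exp_probs_b                       # bias influences selection only
    idx = np.argsort(-biased, axis=-1)[:, :k]          # top-k experts per token
    weights = np.take_along_axis(probs, idx, axis=-1)  # unbiased probs become weights
    weights = weights / (weights.sum(axis=-1, keepdims=True) + eps)  # normalize
    return idx, weights * scale                        # GGML_OP_SCALE at the end
```

Note the two outputs (selected indices and routing weights), which is why the fusion test has to verify more than the final node.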

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       269.32 ± 13.22 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        260.52 ± 1.17 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        267.10 ± 5.18 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       340.67 ± 22.33 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        356.88 ± 9.24 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       333.40 ± 12.02 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       288.13 ± 13.10 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        284.81 ± 2.36 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        289.09 ± 3.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       343.03 ± 19.78 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        355.02 ± 4.88 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        353.27 ± 0.69 |

@github-actions github-actions bot added labels on Dec 22, 2025: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@ggerganov ggerganov (Member) left a comment


Ack on the ggml-backend changes

@0cc4m
Contributor

0cc4m commented Dec 26, 2025

From my side it looks fine, but the Vulkan Mac CI is reporting an issue. Can you look into that?

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.
@jeffbolznv jeffbolznv force-pushed the topk_moe_sigmoid_bias branch from 4a17402 to 75bcc84 Compare December 26, 2025 19:14
@jeffbolznv
Contributor Author

I don't know why the Mac system is failing. It's failing in test cases that should be fused, so I don't think it's a fluke, but I can't reproduce it locally on NVIDIA or lavapipe, and I can't find anything from code inspection.

While investigating, I did find that ties can sometimes lead to spurious failures, so I've updated the tests to avoid that. I doubt this is related to the Mac failures. If it still fails in CI, I'll probably need to disable this fusion for MoltenVK.
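The tie problem mentioned above can be illustrated with a tiny hypothetical example (not the test-harness code): when two experts share the exact score at the k-th position, any tie-breaking order yields an equally valid top-k set, so comparing a backend's selection against a reference can fail spuriously. Using distinct scores makes the expected result unique:

```python
import numpy as np

def top_k(scores, k):
    """Top-k indices by descending score; stable sort fixes one tie-break order."""
    return np.argsort(-scores, kind="stable")[:k]

# A tie at the k-th position: indices 1 and 2 share the score 0.5,
# so [0, 1] and [0, 2] are both valid top-2 answers.
tied = np.array([0.9, 0.5, 0.5, 0.1])
print(top_k(tied, 2))    # stable sort happens to pick [0, 1]

# Distinct scores make the top-2 result unambiguous.
unique = np.array([0.9, 0.6, 0.5, 0.1])
print(top_k(unique, 2))  # [0, 1], regardless of tie-breaking
```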

@jeffbolznv jeffbolznv force-pushed the topk_moe_sigmoid_bias branch 2 times, most recently from bfbd40e to 03b18c9 Compare December 27, 2025 02:48
@jeffbolznv jeffbolznv force-pushed the topk_moe_sigmoid_bias branch from 03b18c9 to 86df563 Compare December 27, 2025 03:18
@jeffbolznv
Contributor Author

I tried a couple of experiments through CI, but I don't have a workaround for the MoltenVK failures. I've disabled the new fusion for MoltenVK.

@0cc4m 0cc4m merged commit be47fb9 into ggml-org:master Jan 1, 2026
67 of 71 checks passed
srogmann pushed a commit to srogmann/llama.cpp that referenced this pull request Jan 1, 2026
…gml-org#18295)

* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
XXjcontiniXX pushed a commit to XXjcontiniXX/llama.cpp that referenced this pull request Feb 21, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
3 participants