
vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron#18295

Merged
0cc4m merged 3 commits into ggml-org:master from jeffbolznv:topk_moe_sigmoid_bias on Jan 1, 2026

Conversation

@jeffbolznv
Contributor

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs in a fusion test (topk_moe has two outputs). Previously only the final node was verified.
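For orientation, here is a minimal NumPy sketch of the sequence this fusion covers: sigmoid gating, a 1D per-expert bias (`exp_probs_b`) applied for selection only, top-k, weight normalization, and a trailing `GGML_OP_SCALE`. The function name, the `eps` term, and the exact normalization are illustrative assumptions modeled on DeepSeek-style sigmoid routing, not the shader code itself:

```python
import numpy as np

def topk_moe_sigmoid_bias(logits, exp_probs_b, k, scale=1.0, eps=1e-9):
    """Hypothetical reference for the fused op.
    logits: (n_tokens, n_experts); exp_probs_b: (n_experts,) 1D bias."""
    probs = 1.0 / (1.0 + np.exp(-logits))              # sigmoid over expert logits
    biased = probs + exp_probs_b                       # bias influences selection only
    idx = np.argsort(-biased, axis=-1)[:, :k]          # top-k experts per token
    weights = np.take_along_axis(probs, idx, axis=-1)  # unbiased probs become weights
    weights = weights / (weights.sum(axis=-1, keepdims=True) + eps)  # normalize
    return idx, weights * scale                        # GGML_OP_SCALE at the end
```

Note the two outputs (selected indices and routing weights), which is why the fusion test has to verify more than the final node.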

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       269.32 ± 13.22 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        260.52 ± 1.17 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        267.10 ± 5.18 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       340.67 ± 22.33 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        356.88 ± 9.24 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       333.40 ± 12.02 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 0 -n 128,128,128 -m c:\models\Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |       288.13 ± 13.10 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        284.81 ± 2.36 |
| nemotron_h_moe 31B.A3.5B Q4_K - Medium |  22.88 GiB |    31.58 B | Vulkan     |  99 |  1 |           tg128 |        289.09 ± 3.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       343.03 ± 19.78 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        355.02 ± 4.88 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        353.27 ± 0.69 |

@github-actions github-actions bot added labels on Dec 22, 2025: testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)
@ggerganov ggerganov (Member) left a comment


Ack on the ggml-backend changes

@0cc4m
Contributor

0cc4m commented Dec 26, 2025

From my side it looks fine, but the Vulkan Mac CI is reporting an issue. Can you look into that?

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.
@jeffbolznv jeffbolznv force-pushed the topk_moe_sigmoid_bias branch from 4a17402 to 75bcc84 Compare December 26, 2025 19:14
@jeffbolznv
Contributor Author

I don't know why the Mac system is failing. It's failing in test cases that should be fused, so I don't think it's a fluke, but I can't reproduce it locally on NVIDIA or lavapipe, and I can't find anything from code inspection.

While investigating, I did find that ties can sometimes lead to spurious failures, so I've updated the tests to avoid that. I doubt this is related to the Mac failures. If it still fails in CI, I'll probably need to disable this fusion for MoltenVK.
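The tie problem mentioned above can be illustrated with a tiny hypothetical example (not the test-harness code): when two experts share the exact score at the k-th position, any tie-breaking order yields an equally valid top-k set, so comparing a backend's selection against a reference can fail spuriously. Using distinct scores makes the expected result unique:

```python
import numpy as np

def top_k(scores, k):
    """Top-k indices by descending score; stable sort fixes one tie-break order."""
    return np.argsort(-scores, kind="stable")[:k]

# A tie at the k-th position: indices 1 and 2 share the score 0.5,
# so [0, 1] and [0, 2] are both valid top-2 answers.
tied = np.array([0.9, 0.5, 0.5, 0.1])
print(top_k(tied, 2))    # stable sort happens to pick [0, 1]

# Distinct scores make the top-2 result unambiguous.
unique = np.array([0.9, 0.6, 0.5, 0.1])
print(top_k(unique, 2))  # [0, 1], regardless of tie-breaking
```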

@jeffbolznv jeffbolznv force-pushed the topk_moe_sigmoid_bias branch 2 times, most recently from bfbd40e to 03b18c9 Compare December 27, 2025 02:48
@jeffbolznv jeffbolznv force-pushed the topk_moe_sigmoid_bias branch from 03b18c9 to 86df563 Compare December 27, 2025 03:18
@jeffbolznv
Contributor Author

I tried a couple of experiments through CI, but I don't have a workaround for the MoltenVK failures. I've disabled the new fusion for MoltenVK.

@0cc4m 0cc4m merged commit be47fb9 into ggml-org:master Jan 1, 2026
67 of 71 checks passed
srogmann pushed a commit to srogmann/llama.cpp that referenced this pull request Jan 1, 2026
…gml-org#18295)

* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron

Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).

Fewer pipeline variants and spec constants, just use push constants.

In test_topk_moe, change exp_probs_b to be 1D, matching real networks.

Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.

* change test_topk_moe to allow results in arbitrary order

* disable sigmoid fusion for moltenvk
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
XXjcontiniXX pushed a commit to XXjcontiniXX/llama.cpp that referenced this pull request Feb 21, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
3 participants