
llama-bench: add --n-cpu-moe support #15952

Merged
slaren merged 2 commits into ggml-org:master from jacekpoplawski:llama-bench-n-cpu-moe
Sep 16, 2025

Conversation

@jacekpoplawski
Contributor

@jacekpoplawski jacekpoplawski commented Sep 12, 2025

Follow-up to #15077.

This changeset addresses three items (I can split it into atomic commits if needed):

  • Support --n-cpu-moe in llama-bench the same way it is supported by llama-server.

  • Fix the table by trimming tensor_buft_overrides output in markdown_printer.

  • Refactor duplicated ffn_(up|down|gate)_exps regex into helper functions.
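The refactored layer-range override could look roughly like this: a minimal sketch, where `build_moe_cpu_override` is a hypothetical name for illustration, not the actual helper added in this patch.

```cpp
#include <string>

// Hypothetical sketch: build a tensor-buffer-override pattern that keeps
// the expert FFN tensors (up/down/gate) of the first n_layers layers on CPU,
// one "blk\.<i>\.ffn_(up|down|gate)_exps=CPU" entry per layer, ';'-separated.
static std::string build_moe_cpu_override(int n_layers) {
    std::string result;
    for (int i = 0; i < n_layers; i++) {
        if (!result.empty()) {
            result += ";";
        }
        result += "blk\\." + std::to_string(i) + "\\.ffn_(up|down|gate)_exps=CPU";
    }
    return result;
}
```

With `n_layers = 2` this yields the kind of string visible (truncated) in the `ot` column below.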

tested on Windows:

PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |           pp512 |        378.46 ± 1.43 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |           tg128 |         11.09 ± 0.00 |

build: 08a216474 (6457)
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |                                       ot |            test |                  t/s |     
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------------------------------------: | --------------: | -------------------: |     
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 | blk\.0\.ffn_(up|down|gate)_exps=CPU;blk\ |           pp512 |      1253.62 ± 20.28 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 | blk\.0\.ffn_(up|down|gate)_exps=CPU;blk\ |           tg128 |         54.04 ± 0.27 |

build: 08a216474 (6457)

@pwilkin
Member

pwilkin commented Sep 12, 2025

Nice, I had this on my TODO list as well :)

@pwilkin pwilkin self-assigned this Sep 12, 2025
Member

@pwilkin pwilkin left a comment


Looks good, but would be nice if this had --cpu-moe support as well so that we can get it done in one go.

Comment thread tools/llama-bench/llama-bench.cpp Outdated
Member


Can we have the typical behavior of "strip last 3 characters + add '...'" instead to make it clear to the user we're truncating?
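The suggested "trim and append '...'" behavior could be sketched like this (the helper name `truncate_value` is illustrative, not the actual markdown_printer code; assumes `max_len >= 3`):

```cpp
#include <string>

// Truncate a cell value for the markdown table: if it exceeds max_len,
// keep the first max_len - 3 characters and append "..." so the user
// can see that the value was cut.
static std::string truncate_value(const std::string & s, size_t max_len) {
    if (s.size() <= max_len) {
        return s;
    }
    return s.substr(0, max_len - 3) + "...";
}
```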

Contributor Author


I was thinking about several ways to handle it:

  • simple cut
  • “nice” cut (with “…”)
  • using a special string like “-” instead of the value
  • renaming the “ot” column to “ncmoe”
  • removing the “ot” column

I went with the simplest option just to start the discussion.
I think the last option (skipping that column) is the best.
However, since someone added “-ot” to llama-bench, I don’t want to introduce a regression.
Still, if you look at the table without the cut, it only works correctly for short strings.

Is this column useful? The way I implemented --n-cpu-moe, it can have only one value, so each row will have the same value in that column.

Member


Okay, final version: I think we should unify how parameters to llama-bench behave:

  • make --n-cpu-moe a vector-type parameter, similar to all the other switches (like -fa, -p, -n and the like), so that it's suitable for comparison
  • keep -ot for now, but:
  • move all the parameters that don't take multiple options, i.e. don't take part in the comparison, above the table. Should look something like this:
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
model: qwen3moe 30B.A3B Q3_K - Medium
model size: 13.70 GiB
model parameters: 30.53 B
override_tensors: blk\.0\.ffn_(up|down|gate)_exps=CPU;blk... (full expression here if -ot was actually set)

| backend | ngl | test  | t/s             |
| ------- | --: | ----- | --------------: |
| CUDA    |  99 | pp512 | 1253.62 ± 20.28 |
| CUDA    |  99 | tg128 |    54.04 ± 0.27 |

This way, all parameters that are actually part of the benchmark are in the table, while the table isn't cluttered with repeated values that don't actually change during the test.

(if you don't want to implement all of this, just do the --n-cpu-moe as a vector part and move override_tensors above and we can add another PR for the rest)

@ggerganov @slaren @CISC what do you think? does this sound like a reasonable idea?

Member


n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.

Contributor Author

@jacekpoplawski jacekpoplawski Sep 13, 2025


n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.

I agree that this would be the ideal behavior, and I will try to implement it that way.
n_cpu_moe will probably be part of struct test (not just an alias for ot in the argument parser), so the change will be bigger.

Contributor Author


PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 0,6,12,18
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |  n_cpu_moe |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          0 |           pp512 |        433.44 ± 2.62 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          0 |           tg128 |         10.85 ± 0.00 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          6 |           pp512 |        388.82 ± 8.21 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          6 |           tg128 |         13.56 ± 0.01 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         12 |           pp512 |      1239.89 ± 11.49 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         12 |           tg128 |         51.03 ± 1.88 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         18 |           pp512 |       652.27 ± 20.82 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         18 |           tg128 |         33.97 ± 1.17 |

build: c0bb1ae67 (6457)
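A vector-type parameter like `--n-cpu-moe 0,6,12,18` above could be parsed with a helper along these lines (a sketch with an illustrative name, not the actual llama-bench parser):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated argument such as "0,6,12,18" into a list of
// integers; llama-bench runs one benchmark per value so results can be
// compared row by row.
static std::vector<int> parse_int_list(const std::string & arg) {
    std::vector<int> values;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        values.push_back(std::stoi(item));
    }
    return values;
}
```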

Comment thread tools/llama-bench/llama-bench.cpp Outdated
Member


What's the reason to change this to unsigned?

Contributor Author


I tried to fix the CI (but it’s still failing on mac for an unrelated reason)

Member


The CI issue is not related. Please revert this change if it is not necessary.

Contributor Author


done

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.

Refactor duplicated ffn_(up|down|gate)_exps regex into helper functions.
Comment thread common/common.h Outdated
@slaren
Member

slaren commented Sep 16, 2025

Looks good, but would be nice if this had --cpu-moe support as well so that we can get it done in one go.

Maybe we can add --cpu-moe 0|1 later as an alias for --n-cpu-moe 0|999 or similar, but for now this should be good.

@slaren slaren merged commit 8ff2060 into ggml-org:master Sep 16, 2025
45 checks passed
@jacekpoplawski
Contributor Author

looks like it's actually useful :)
https://www.reddit.com/r/LocalLLaMA/comments/1nt2c38/llamacpp_moe_models_find_best_ncpumoe_value/

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
