
llama : add --n-cpu-moe option#15077

Merged
slaren merged 2 commits into master from sl/ncmoe on Aug 4, 2025

Conversation

@slaren
Member

@slaren commented Aug 4, 2025

Following @jacekpoplawski's suggestion in #14992, this adds an option to keep the MoE weights of the first N layers in the CPU. You can use:

  • --cpu-moe to keep all MoE weights in the CPU
  • --n-cpu-moe N to keep the MoE weights of the first N layers in the CPU

The goal is to avoid having to write complex regular expressions when trying to optimize the number of MoE layers to keep in the CPU.

These options work by adding the necessary tensor overrides. If you use --override-tensor before these options, your overrides will take priority.
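For illustration, here is a rough sketch of the kind of `--override-tensor` patterns these options replace. The expert tensor names (`ffn_up_exps`, `ffn_down_exps`, `ffn_gate_exps`) follow the usual GGUF naming for MoE models, and `model.gguf` is a hypothetical path; the exact patterns generated internally may differ:

```shell
# Equivalent of --cpu-moe: keep all MoE expert weights in the CPU
llama-server -m model.gguf --n-gpu-layers 99 \
  --override-tensor "\.ffn_(up|down|gate)_exps\.=CPU"

# Equivalent of --n-cpu-moe 3: keep only the first 3 layers' experts in the CPU
llama-server -m model.gguf --n-gpu-layers 99 \
  --override-tensor "blk\.[0-2]\.ffn_(up|down|gate)_exps\.=CPU"
```

The new flags simply spare you from writing and adjusting these regular expressions by hand when tuning N.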

slaren added 2 commits August 4, 2025 23:41
Keeps the MoE weights of the first N layers in the CPU
adding a destructor to common_params would cause issues when the object is copied
@slaren slaren merged commit ec428b0 into master Aug 4, 2025
45 of 47 checks passed
@slaren slaren deleted the sl/ncmoe branch August 4, 2025 23:05
@jacekpoplawski
Contributor

Thank you :)

@SlavikCA

SlavikCA commented Aug 5, 2025

Should these options be added to this page, too:
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
?

thad0ctor added a commit to thad0ctor/llama-server-launcher that referenced this pull request Aug 6, 2025
--cpu-moe to keep all MoE weights in the CPU
--n-cpu-moe N to keep the MoE weights of the first N layers in the CPU

ggml-org/llama.cpp#15077
@g0t4

g0t4 commented Aug 7, 2025

thank you! just got 108 T/s with gpt-oss:120b on my dual 5090s with --n-cpu-moe 3... so awesome I haven't had time to see if I should tweak it further :)

llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja --flash-attn --n-gpu-layers 99 --reasoning-format none --n-cpu-moe 3

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
* llama : add --n-cpu-moe option

Keeps the MoE weights of the first N layers in the CPU
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* llama : add --n-cpu-moe option

Keeps the MoE weights of the first N layers in the CPU
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment


4 participants