llama : allow other bufts when overriding to CPU, add --no-repack option #14990

Merged
slaren merged 1 commit into master from sl/ot-repacking
Jul 31, 2025

@slaren slaren commented Jul 31, 2025

  • When --override-tensor is used to override tensors to the CPU, other buffer types are now considered as well. In practice, this means that the host buffer types will be used, which may improve performance when prompt processing is offloaded. Note that mmap must be disabled to use host buffers.
  • Adds a --no-repack (-nr) option to disable weight repacking.
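As a rough illustration of how these options combine, the sketch below assembles a command line that overrides the expert tensors to the CPU, disables mmap so that host buffers can be used, and optionally disables repacking. The helper name `build_llama_cli_args`, the choice of the `llama-cli` binary, and the use of `--no-mmap` for that binary are assumptions made for this example; only `--override-tensor`, the `exps=CPU` pattern, and `--no-repack` come from this PR.

```python
import shlex


def build_llama_cli_args(model_path, cpu_pattern="exps=CPU",
                         use_host_buffers=True, repack=True):
    """Hypothetical helper: build an argument list for llama-cli.

    Assumes llama-cli accepts --no-mmap; --override-tensor and
    --no-repack are the options described in this PR.
    """
    args = ["llama-cli", "-m", model_path, "--override-tensor", cpu_pattern]
    if use_host_buffers:
        # Host buffers require mmap to be disabled.
        args.append("--no-mmap")
    if not repack:
        # Disable weight repacking.
        args.append("--no-repack")
    return args


cmd = build_llama_cli_args("Qwen3-30B-A3B-Q4_0.gguf", repack=False)
print(shlex.join(cmd))
```

Leaving `repack=True` keeps the default behavior and only adds the override and the mmap setting.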

llama-bench -m Qwen3-30B-A3B-Q4_0.gguf -ot exps=CPU -n 0 -p 32,64,128,256,512,1024 -ub 1024 -mmp 0:

| Model | Test | t/s master | t/s sl/ot-repacking | Speedup |
| --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_0 | pp32 | 15.03 | 22.62 | 1.50 |
| qwen3moe 30B.A3B Q4_0 | pp64 | 28.87 | 45.04 | 1.56 |
| qwen3moe 30B.A3B Q4_0 | pp128 | 61.06 | 89.35 | 1.46 |
| qwen3moe 30B.A3B Q4_0 | pp256 | 121.44 | 173.97 | 1.43 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 227.41 | 309.59 | 1.36 |
| qwen3moe 30B.A3B Q4_0 | pp1024 | 421.50 | 594.32 | 1.41 |
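The Speedup column appears to be the ratio of the branch throughput to the master throughput, rounded to two decimals. A minimal sketch that recomputes it from the figures reported above (the variable and function names are chosen for this example only):

```python
# Throughput in tokens/s, copied from the benchmark results above:
# test -> (master, sl/ot-repacking)
results = {
    "pp32": (15.03, 22.62),
    "pp64": (28.87, 45.04),
    "pp128": (61.06, 89.35),
    "pp256": (121.44, 173.97),
    "pp512": (227.41, 309.59),
    "pp1024": (421.50, 594.32),
}


def speedup(master_tps, branch_tps):
    """Ratio of branch throughput to master throughput, rounded to 2 decimals."""
    return round(branch_tps / master_tps, 2)


speedups = {test: speedup(m, b) for test, (m, b) in results.items()}

for test, s in speedups.items():
    print(f"{test}: {s:.2f}x")
```

The recomputed values agree with the reported column, from 1.36x at pp512 up to 1.56x at pp64.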

@slaren slaren merged commit d6818d0 into master Jul 31, 2025
47 checks passed
@slaren slaren deleted the sl/ot-repacking branch July 31, 2025 16:11
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 1, 2025
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026