model : add GroveMoE support by CISC · Pull Request #15510 · ggml-org/llama.cpp

CISC · 2025-08-22T17:58:47Z

Adds support for inclusionAI/GroveMoE, a novel architecture that groups adjugate experts with ordinary experts (paper).

The PR is in a fully working state, but I am submitting it as a draft because it requires a scalar div implementation that was quickly hacked together just to get the model running. Only div is (very crudely) implemented, and only for CPU (which doesn't matter, since not much computation is spent here). I'm also not satisfied that the API makes sense. In short, this requires more thought!

CISC · 2025-08-22T18:23:41Z

Looks like ccache breaks the build (using cached files newer than this branch), not important right now though...

ggerganov · 2025-09-24T07:03:12Z


-    ggml_tensor * weights = ggml_get_rows(ctx0,
-            ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), selected_experts); // [1, n_expert_used, n_tokens]
+    if (arch == LLM_ARCH_GROVEMOE && n_expert != hparams.n_expert) {


When is n_expert != hparams.n_expert?

When doing the adjugate experts pass:

llama.cpp/src/llama-model.cpp

Lines 19025 to 19038 in ee51669

    // TODO: Only do the expert selection and weights once
    moe_out =
        build_moe_ffn(cur,
                nullptr,
                model.layers[il].ffn_up_chexps,
                model.layers[il].ffn_gate_chexps,
                model.layers[il].ffn_down_chexps,
                nullptr,
                n_chunk_expert, n_expert_used > n_chunk_expert ? n_chunk_expert : n_expert_used,
                LLM_FFN_SILU, true,
                false, 0.0,
                LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
                il, probs);
    cb(moe_out, "ffn_adj_moe_out", il);
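
To spell out the two count arguments in the call above: the adjugate pass routes over n_chunk_expert experts rather than hparams.n_expert, and the number of experts used is capped so that it never exceeds that count. A minimal sketch of the cap follows; clamp_expert_used is a hypothetical name used only for illustration and is not part of llama.cpp.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, for illustration only. It mirrors the ternary
//   n_expert_used > n_chunk_expert ? n_chunk_expert : n_expert_used
// from the call above, which is equivalent to std::min.
inline int64_t clamp_expert_used(int64_t n_expert_used, int64_t n_chunk_expert) {
    return std::min(n_expert_used, n_chunk_expert);
}
```

With 8 used experts and 64 chunk experts (the shapes that the debug dump later in this thread appears to show), the cap leaves the value at 8; it only takes effect when a model has fewer chunk experts than used experts.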

CISC · 2025-09-24T07:27:52Z

Oh, I just noticed it breaks (just outputs endless ?) if you're unlucky with the distribution of ch/exps with -ncmoe (and not always on the first pass), I'm guessing because of some buffer not being on the same backend? Works fine with -cmoe...

ggerganov · 2025-09-24T07:36:55Z

Are you running F16 weights? If yes, there is a chance you are hitting this assert:

llama.cpp/ggml/src/ggml-cpu/vec.cpp

Lines 327 to 328 in 152729f

    
    // if you hit this, you are likely running outside the FP range
    assert(!isnan(sumf) && !isinf(sumf));

Build in Debug to confirm that.
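
For context, assert is compiled out when NDEBUG is defined, which is typical for Release builds, so a check like the one above only fires in a Debug build. A minimal sketch of the same kind of check on a float dot product follows; dot_is_finite is a hypothetical name, not a ggml function.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: accumulate a dot product in float and report
// whether the sum stayed finite (neither NaN nor +/-inf).
inline bool dot_is_finite(const float * x, const float * y, std::size_t n) {
    float sumf = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sumf += x[i] * y[i];
    }
    return !std::isnan(sumf) && !std::isinf(sumf);
}
```

Unlike the assert, this variant returns the result instead of aborting, so it behaves the same in Debug and Release builds.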

CISC · 2025-09-24T09:28:54Z

> Are you running F16 weights? If yes, there is a chance you are hitting this assert:

Nope, Q8_0.

> Build in Debug to confirm that.

I will do that and try to figure out the issue later.

CISC · 2025-09-25T17:17:42Z

> Build in Debug to confirm that.
>
> I will do that and try to figure out the issue later.

Didn't catch anything; however, when I run it through llama-eval-callback I get "encountered NaN - aborting" whenever I'm using -ncmoe, but -cmoe works fine, so there is definitely something odd going on...

CISC · 2025-09-25T17:49:38Z

Ok, fully offloading works fine too, so this is unlikely to be a model issue; it just seems to be triggering some problem with partial offloading of experts.

Merging.

ggerganov · 2025-09-25T17:52:50Z

Btw, which was the first op that produced the NaN when you ran the eval-callback?

CISC · 2025-09-25T18:10:02Z

> Btw, which was the first op that produced the NaN when you ran the eval-callback?

ffn_moe_weighted-35 = (f32) SWIGLU(ffn_moe_gate-35, contains -inf, but the 2 previous ops have very suspicious values:

ggml_debug:          ffn_moe_gate-35 = (f32) MUL_MAT_ID(blk.35.ffn_gate_chexps.weight{2048, 128, 64, 1}, ffn_moe_out-35 (reshaped){2048, 1, 11, 1}}) = {128, 8, 11, 1}
                                     [
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                       [692609189888710725876244691447447552.0000, 249074347992884282519164127694290944.0000, 1018055346873854459523945924594237440.0000, ..., 51586575710833024190924974872068096.0000, -2067560154401421119326973933186449408.0000, -1679141552374039747815821723257798656.0000],
                                       [      0.0699,       0.0112,      -0.0616, ...,       0.0889,       0.1895,      -0.0447],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                      ],
                                      ..., 
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                      ],
                                     ]
                                     sum = -34555508860415417087805366059915018240.000000
ggml_debug:            ffn_moe_up-35 = (f32) MUL_MAT_ID(blk.35.ffn_up_chexps.weight{2048, 128, 64, 1}, ffn_moe_out-35 (reshaped){2048, 1, 11, 1}}) = {128, 8, 11, 1}
                                     [
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                       [946507641661885421081057718347759616.0000, -2441153741115458376516297386216128512.0000, -407195320016530705331302955210506240.0000, ..., 764541706286011546206824740660707328.0000, -81787806567653304114262973125492736.0000, -1564830842585809104478506137537216512.0000],
                                       [     -0.0081,       0.0035,       0.0455, ...,       0.0669,       0.1087,      -0.0017],
                                       [     -0.0052,      -0.0083,       0.0079, ...,       0.0068,       0.0106,      -0.0006],
                                      ],
                                      ..., 
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                      ],
                                      [
                                       ..., 
                                      ],
                                     ]
                                     sum = 16824229610265255367388010517576548352.000000

Edit: probs and topk look normal.
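
Values of this magnitude are still finite in f32, but the product of two of them is not. As an illustration only, assuming SWIGLU computes silu(gate) * up (the actual ggml kernel may differ in detail), the second elements of the two rows shown above already overflow to -inf. The helper names below are hypothetical, and the inputs in the usage note are rounded from the dump.

```cpp
#include <cmath>

// Hypothetical scalar reference: silu(x) = x / (1 + exp(-x)).
inline float silu_ref(float x) {
    return x / (1.0f + std::exp(-x));
}

// Assumed definition of the gated activation: silu(gate) * up.
// The result is stored in a float so that overflow becomes +/-inf.
inline float swiglu_ref(float gate, float up) {
    const float out = silu_ref(gate) * up;
    return out;
}
```

Calling swiglu_ref(2.49e35f, -2.44e36f) gives -inf, because silu leaves the large positive gate essentially unchanged and the product exceeds the f32 range. This would be consistent with the -inf reported for ffn_moe_weighted-35, and it points at the MUL_MAT_ID outputs rather than at SWIGLU itself.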

ggerganov · 2025-09-25T18:18:49Z

Ok, I'll likely take a look when the GGUFs appear and if I don't forget.

gabriellarson · 2025-09-25T23:50:54Z

I'm attempting to make an imatrix and I'm getting this error:

/workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2163: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
[New LWP 12890]
[New LWP 12892]
[New LWP 12893]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fe90a96f42f in wait4 () from /usr/lib/x86_64-linux-gnu/libc.so.6
#0  0x00007fe90a96f42f in wait4 () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fe90ae0784b in ggml_print_backtrace () from /workspace/llama.cpp/build/bin/libggml-base.so
#2  0x00007fe90ae079e2 in ggml_abort () from /workspace/llama.cpp/build/bin/libggml-base.so
#3  0x00007fe8f6a5a1d1 in ggml_cuda_mul_mat_id(ggml_backend_cuda_context&, ggml_tensor*) () from /workspace/llama.cpp/build/bin/libggml-cuda.so
#4  0x00007fe8f6a5bc6b in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /workspace/llama.cpp/build/bin/libggml-cuda.so
#5  0x00007fe90ae22f9a in ggml_backend_sched_graph_compute_async () from /workspace/llama.cpp/build/bin/libggml-base.so
#6  0x00007fe90af3fdd1 in llama_context::graph_compute(ggml_cgraph*, bool) () from /workspace/llama.cpp/build/bin/libllama.so
#7  0x00007fe90af40165 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /workspace/llama.cpp/build/bin/libllama.so
#8  0x00007fe90af469a7 in llama_context::decode(llama_batch const&) () from /workspace/llama.cpp/build/bin/libllama.so
#9  0x00007fe90af47840 in llama_decode () from /workspace/llama.cpp/build/bin/libllama.so
#10 0x0000555ac1cedbc1 in main ()
[Inferior 1 (process 12889) detached]

CISC · 2025-09-26T07:46:14Z

> I'm attempting to make an imatrix and I'm getting this error:

Interesting, is it fully offloaded?

k3d3 · 2025-09-26T12:34:47Z

I'm also receiving a similar error:

/opt/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2163: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
/usr/local/lib64/libggml-base.so(+0x3525) [0x7f985d6b4525]
/usr/local/lib64/libggml-base.so(ggml_print_backtrace+0x1eb) [0x7f985d6b48eb]
/usr/local/lib64/libggml-base.so(ggml_abort+0x11f) [0x7f985d6b4a6f]
/usr/local/lib64/libggml-hip.so(+0x25207d9) [0x7f985fc747d9]
/usr/local/lib64/libggml-base.so(ggml_backend_sched_graph_compute_async+0x7f3) [0x7f985d6ce203]
/usr/local/lib64/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa0) [0x7f986019ff10]
/usr/local/lib64/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe2) [0x7f98601a1b72]
/usr/local/lib64/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x3af) [0x7f98601a688f]
/usr/local/lib64/libllama.so(llama_decode+0xe) [0x7f98601a76ce]
llama-server() [0x5c1058]
llama-server() [0x496bfe]
llama-server() [0x43731a]
/lib64/libc.so.6(+0x35b5) [0x7f985d1295b5]
/lib64/libc.so.6(__libc_start_main+0x88) [0x7f985d129668]
llama-server() [0x439525]
Aborted                    (core dumped) llama-server --no-mmap -ngl 999 -fa on -c 128000 --jinja -m models/grovemoe-inst/GroveMoE-Inst-128x4.2B-F16.gguf

This is running Gabriel Larson's F16 GGUF, fully offloaded, in a ROCm rc7-rocwmma Docker/toolbox container (I'm using a Strix Halo APU), with llama.cpp version 6588 (a86a580a).

Unfortunately it appears the core dump is getting snatched up by Fedora (the joys of Bazzite), but if it's useful, I could try to finagle my settings to get the file.

If I use this Q8 GGUF instead, it loads and runs without issue.

CISC · 2025-09-26T12:40:32Z

Even more interesting, so it seems to be an issue with that GGUF.

@gabriellarson Can you try creating a BF16 GGUF instead?

* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes

bartowski1182 · 2025-10-09T22:31:09Z

@CISC I get the same issue with bf16

bartowski1182 · 2025-10-09T23:05:36Z

I tried doing imatrix with Q8 and got:

inf detected in blk.0.ffn_down_chexps.weight
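
A check of this kind can be reproduced on any float buffer. The following is a minimal sketch, assuming the tensor data is already available as f32 (quantized weights such as Q8_0 would need to be dequantized first); count_non_finite and non_finite_counts are hypothetical names, not the imatrix tool's code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: how many entries are +/-inf and how many are NaN.
struct non_finite_counts {
    std::size_t n_inf = 0;
    std::size_t n_nan = 0;
};

// Hypothetical helper: scan a buffer of f32 values and count non-finite entries.
inline non_finite_counts count_non_finite(const std::vector<float> & data) {
    non_finite_counts res;
    for (const float v : data) {
        if (std::isinf(v)) {
            res.n_inf++;
        } else if (std::isnan(v)) {
            res.n_nan++;
        }
    }
    return res;
}
```

Counting inf and NaN separately helps tell an overflow (inf) apart from an invalid operation or uninitialized data (NaN).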

CISC · 2025-10-10T07:46:33Z

@bartowski1182 Thanks for testing. It could be that the chunked experts contain junk, though that doesn't fully explain the partial offload issue.

bartowski1182 · 2025-10-10T15:08:31Z

Let me know if there's any other info I can provide

add GroveMoE support

25963a8

remove constexpr that fails on certain compilers

9b8a31a

ngxson reviewed Aug 22, 2025

Comment thread src/llama-graph.cpp Outdated

re-apply ffn_exps regex changes

ee51669

ggerganov reviewed Sep 24, 2025

ggerganov approved these changes Sep 24, 2025

CISC merged commit 835b2b9 into master Sep 25, 2025
67 of 69 checks passed

CISC deleted the cisc/grovemoe branch September 25, 2025 17:50

github-actions Bot mentioned this pull request Sep 26, 2025

Reddit News Daily 2025-09-26 gitlawr/reddit-daily-news#14

Open

CISC mentioned this pull request Nov 2, 2025

Eval bug: data corruption on CUDA experts offload #16945

Closed
