Looks like ccache breaks the build (it uses cached files newer than this branch), but that's not important right now...
ggml_tensor * weights = ggml_get_rows(ctx0,
        ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), selected_experts); // [1, n_expert_used, n_tokens]
if (arch == LLM_ARCH_GROVEMOE && n_expert != hparams.n_expert) {
When is n_expert != hparams.n_expert?
When doing the adjugate experts pass; see lines 19025 to 19038 in ee51669.
Oh, I just noticed it breaks (just outputs endless
Are you running F16 weights? If so, there is a chance you are hitting this assert: llama.cpp/ggml/src/ggml-cpu/vec.cpp, lines 327 to 328 in 152729f. Build in Debug to confirm that.
Nope, Q8_0.
I will do that and try to figure out the issue later.
Didn't catch anything; however, when I run it through
OK, full offloading works fine too, so this is unlikely to be a model issue; it just seems to trigger some problem with partial offloading of experts. Merging.
Btw, which was the first op that produced the NaN when you ran the
Edit:
OK, I'll likely take a look when the GGUFs appear, if I don't forget.
I'm attempting to make an imatrix and I'm getting this error:
Interesting, is it fully offloaded?
I'm also receiving a similar error. This is running Gabriel Larson's F16 GGUF, fully offloaded, using a ROCm rc7-rocwmma docker/toolbox (I'm on a Strix Halo APU), running llama.cpp. Unfortunately, it appears the core dump is being snatched up by Fedora (the joys of Bazzite), but if it's useful I could try to adjust my settings to get the file. If I use this Q8 GGUF instead, it loads and runs without issue.
Even more interesting, so it seems to be an issue with that GGUF. @gabriellarson Can you try creating a
* add GroveMoE support
* remove constexpr that fails on certain compilers
* revert crude scalar div implementation, use cast
* build_attn_inp_kv_unified -> build_attn_inp_kv
* fix build_attn
* re-apply ffn_exps regex changes
@CISC I get the same issue with BF16.
I tried generating an imatrix with Q8 and got: inf detected in blk.0.ffn_down_chexps.weight
@bartowski1182 Thanks for testing. It could be that the chunked experts contain junk, though that doesn't fully explain the partial offload issue.
Let me know if there's any other info I can provide |
Adds support for inclusionAI/GroveMoE, a novel architecture that groups adjugate experts with ordinary experts (paper).
The PR is in a fully working state, but I am submitting it as a draft because it requires a scalar div implementation that was quickly hacked together just to get the model running. Only div is (very crudely) implemented, and only for CPU (which doesn't matter much, since little computation is spent there). I'm also not satisfied that the API makes sense; in short, this requires more thought!