DeepSeek V2/V3 implementation refactored to allow non-MLA and MLA #12313
jukofyork wants to merge 11 commits into ggml-org:master from jukofyork:mla-final-refactor
Conversation
@jukofyork I wanted to try this, but there seems to be a problem with DeepSeek R1 model conversion in your branch:
@fairydreaming I'm actually just reverting this as I realised it was going to be really hard to maintain. I'm now just merging the older "with flash attention" PR with the `-mla` options, but trying to use at least:

```cpp
struct ggml_tensor * q_states = ggml_concat(ctx0, q_nope_absorbed, q_pe, 0);
cb(q_states, "q_states", il);

struct ggml_tensor * k_states = ggml_concat(ctx0, kv_compressed, k_pe_view, 0);
cb(k_states, "k_states", il);

struct ggml_tensor * v_states = kv_compressed;
cb(v_states, "v_states", il);

// these nodes are added to the graph together so that they are not reordered
// by doing so, the number of splits in the graph is reduced
ggml_build_forward_expand(gf, q_states);
ggml_build_forward_expand(gf, k_states);
ggml_build_forward_expand(gf, v_states);

llm_build_kv_store(ctx0, hparams, cparams, kv_self, gf, k_states, v_states, n_tokens, kv_head, cb, il);
```

I'll have it done in a couple of hours and there won't be any need to requant then too (closing this for now).
IMPORTANT: This will require re-quantising all models that use this PR!!!
This is a vastly tidied up continuation of #11446 and #12227 which allows the use of the `-mla` (`--mla-attn`) option:

- With the `-mla` option it essentially converts MLA into MQA (with very low KV-cache overhead, but at the cost of more compute; see the rough numbers sketched below).
- The `build_deepseek2()` code now uses the proper `llm_build_kv()` calls for both the non-MLA and MLA branches.
- The forced `F32` upcast, no 2D x 2D optimisations, and the splitting of the `q_b` and `kv_b` tensors to extract the MQA (ie: RoPE part) separately (see below).

NOTE: This will require re-quantising all models that use this, but this won't change, and I intend to run some experiments over the next few days to find better quant rules for the newly split-up tensors (to hopefully avoid so many of the numerical problems that seem to plague this model).
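To put the "very low KV-cache overhead, but more compute" trade-off into rough numbers, here is a small back-of-the-envelope sketch. The dimensions are my assumptions based on the published DeepSeek-V3 config (not something this PR states), and the counts are in elements, ignoring the element type:

```cpp
#include <cstdio>

// Rough per-token, per-layer KV-cache element counts (assumed DeepSeek-V3 dims).
int main() {
    const int n_head           = 128; // assumed
    const int qk_nope_head_dim = 128; // assumed
    const int qk_rope_head_dim = 64;  // assumed
    const int v_head_dim       = 128; // assumed
    const int kv_lora_rank     = 512; // assumed

    // Non-MLA path: cache the full per-head K and V (MHA-style).
    const int mha_k = n_head * (qk_nope_head_dim + qk_rope_head_dim);
    const int mha_v = n_head * v_head_dim;

    // MLA-as-MQA path: cache only the compressed KV latent plus the shared RoPE part.
    const int mla_kv = kv_lora_rank + qk_rope_head_dim;

    printf("MHA cache/token/layer : %d elements\n", mha_k + mha_v); // 40960
    printf("MLA cache/token/layer : %d elements\n", mla_kv);        // 576
    printf("ratio                 : ~%.0fx smaller\n",
           (double)(mha_k + mha_v) / mla_kv);                       // ~71x
    return 0;
}
```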
I also plan to see if I can get back some of the lost performance my previous PR gave (but at the cost of a vastly more complex/unmaintainable `build_deepseek2()` due to all the 2D/3D views it used). DONE

I have left context shifting disabled for now, but I have been careful to move the RoPE parts to the first `n_rot` parameters, so it should be possible eventually to get this working with `build_k_shift()` and `build_defrag()`, etc. I can't cleanly add this currently though, and if I try it will likely end up a confusing mess of overriding the GGUF file parameters for `n_embd_k_gqa` and `n_embd_v_gqa`. I've tried to do this as cleanly as the current code allows in: `llama-kv-cache.cpp::llama_kv_cache_init()`, `llama.cpp::llm_build_kv_store()` and `llama.cpp::llm_build_kqv()`. I'm also not 100% clear on the ins-and-outs of the YaRN implementation and how it works for context shifting, etc.
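To illustrate why keeping the RoPE'd dimensions first should make `build_k_shift()` feasible later: a K-shift only needs to re-rotate the leading `n_rot` elements of each cached row and can leave the compressed part untouched. The following is just a minimal sketch of that idea; the variable names (`kv_self`, `inp_K_shift`, the hparams) and the use of `ggml_rope_ext_inplace()` on a 3D view are my assumptions for illustration, not code from this PR:

```cpp
// Hypothetical sketch: re-rotate only the first n_rot dims of the cached K for layer il.
struct ggml_tensor * k_cache = kv_self.k_l[il];

// View the cache as [n_embd_k_gqa, 1, n_ctx], i.e. a single MQA "head" per token.
struct ggml_tensor * k_view = ggml_view_3d(ctx0, k_cache,
        n_embd_k_gqa, 1, n_ctx,
        ggml_row_size(k_cache->type, n_embd_k_gqa),
        ggml_row_size(k_cache->type, n_embd_k_gqa),
        0);

// n_dims = n_rot, so only the leading n_rot elements of each row get re-rotated;
// the compressed (non-RoPE) part of the row is left as-is.
struct ggml_tensor * k_shifted = ggml_rope_ext_inplace(ctx0, k_view, inp_K_shift, nullptr,
        n_rot, rope_type, n_ctx_orig,
        freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow);

ggml_build_forward_expand(gf, k_shifted);
```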
Things in `llama.cpp` and `ggml` I'm still a bit unsure of:

- All the places that need to check the `-mla` option: I'm not entirely confident I have them all (I looked at how the `-fa` option was used and tried to copy that as best I could).
- Should I be using the `nb[]` values? I'm currently just quantising everything to `BF16` (for the attention tensors anyway), so it's possible some of my views are not going to work when quantised (see the sketch below)... 😕
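On the `nb[]` question, a minimal sketch of the trade-off, assuming a hypothetical 2D tensor `t` of shape [`n_embd`, `n_tokens`] whose leading `n_rot` elements per row are the RoPE part (all names here are illustrative, not from this PR): hand-computed element offsets only work for plain types like `BF16`/`F32`, whereas deriving the byte offset and row stride from `ggml_row_size()` and the tensor's own `nb[]` stays correct for quantised rows, although such a split is only representable at all if it lands on a block boundary (`ggml_blck_size()`).

```cpp
// Hypothetical illustration: split off the non-RoPE part of t, skipping the
// first n_rot elements of each row.

// (a) Only valid for plain element types (F32/F16/BF16), where every element
//     is individually addressable:
const size_t offset_plain = (size_t) n_rot * ggml_type_size(t->type);

// (b) Type-aware byte offset; for block-quantised types this is only meaningful
//     when n_rot is a multiple of ggml_blck_size(t->type):
const size_t offset_typed = ggml_row_size(t->type, n_rot);

struct ggml_tensor * t_nope = ggml_view_2d(ctx0, t,
        n_embd - n_rot, t->ne[1],
        t->nb[1],          // per-row byte stride taken from the tensor itself
        offset_typed);
cb(t_nope, "t_nope", il);
```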