Conversation
Just linking this old attempt at doing this: judging from @fairydreaming's CPU test and my CUDA test, the tile size was simply too large to be useful. @JohannesGaessler explained in #12227 (comment) why this likely failed for CUDA, and for the CPU I suspect the quadratic jump over the previous maximum tile size (256^2 --> 512^2) was so large that it no longer fits in cache.

One other very MLA-specific thing to think about: if the V-cache doesn't need transposing, the last 512 elements of each K-cache row hold the same values as the V-cache, so there would be no need to store these; a 2D view starting at element 64 with a row stride of 576 would yield the same data untransposed.
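A minimal sketch of the view trick described above, assuming the DeepSeek-style layout where each K-cache row is 576 elements (64 RoPE dims followed by a 512-dim latent) and the V-cache duplicates that latent part; the shapes and float32 storage here are illustrative assumptions, not the actual ggml buffers:

```python
import numpy as np

# Assumed MLA cache layout: each K row = 64 RoPE dims + 512 latent dims,
# and the V-cache stores the same 512 latent dims again.
n_tokens, n_rope, n_latent = 8, 64, 512

latent = np.random.rand(n_tokens, n_latent).astype(np.float32)
rope = np.random.rand(n_tokens, n_rope).astype(np.float32)

k_cache = np.concatenate([rope, latent], axis=1)  # rows of 576 elements
v_cache = latent                                  # duplicated data

# A 2D view into the flat K buffer, starting at element 64 with a row
# stride of 576, recovers the V values without storing them a second time.
flat = k_cache.reshape(-1)
v_view = np.lib.stride_tricks.as_strided(
    flat[n_rope:],
    shape=(n_tokens, n_latent),
    strides=(576 * 4, 4),  # float32: 4 bytes per element
)

assert np.array_equal(v_view, v_cache)
```

The same offset/stride arithmetic would apply to a `ggml` 2D view over the K buffer, which is why no transpose (and no separate V storage) would be needed.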
…-org#12953)
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models (ggml-ci)
* llama : minor naming updates (ggml-ci)
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes (ggml-ci)
Hi there, sorry to bother. I was testing DeepSeek V3 0324 on CPU + GPU, but when using FA I get this issue: … I'm running the model like this: … I did build from source with: … When not using -fa, it works correctly. Did I do the setup incorrectly? I raised an issue with more info here: #13252
cont #12801
For backends that support FA with different K and V head sizes, the FA path can be used. To support that, we decompress the FA result using the `v_mla` tensor.