sgemm for IQ4_NL #8049
netrunnereve wants to merge 22 commits into ggml-org:master from netrunnereve:sgemm_iq4_nl
Conversation
After further testing on my desktop (not the inconsistent server VM that I posted my original results with) I'm seeing a clear 5% degradation in inference speed with sgemm on IQ4_NL, while prompt processing speed is improved by around 10%. On the server I have seen up to a 15% prompt processing boost in some cases, but the 5% inference slowdown is present as well. What's happening here is that sgemm overrides the existing implementation.

Desktop results (Xeon E3 v2, 4c/8t): [benchmark table omitted]
Server results (8 core VM on Xeon E5 v2, 8c/16t, unloaded rerun): [benchmark table omitted]
I'm not interested in modifying sgemm to do two blocks per loop, as that would also mess with how tiling is set up. Right now I guess the question is whether or not a 10-15% improvement in prompt processing is worth a 5% regression in inference speed.
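For context on what "two blocks per loop" means, here is a minimal scalar sketch, with names and layout simplified for illustration (the real ggml kernel dots Q4_0 against Q8_0 using AVX intrinsics): the dot-product loop is unrolled so two blocks are handled per iteration with independent accumulators, the approach referenced in ggml-org/llama.cpp#8549.

```c
#include <stdint.h>

#define QK4_0 32                       // values per Q4_0 block

typedef struct {
    float   d;                         // per-block scale (fp16 in ggml, fp32 here)
    uint8_t qs[QK4_0 / 2];             // 32 packed 4-bit quants
} blk_t;                               // simplified stand-in for block_q4_0

// Dot product of nb quantized blocks x against fp32 y, two blocks per pass.
// For brevity this assumes nb is even; a real version needs a tail iteration.
static float vec_dot_two_blocks(int nb, const blk_t * x, const float * y) {
    float acc0 = 0.0f, acc1 = 0.0f;    // independent accumulators break the
                                       // single long add dependency chain
    for (int i = 0; i < nb; i += 2) {
        for (int j = 0; j < QK4_0 / 2; ++j) {
            // low nibble is element j, high nibble is element j + QK4_0/2
            acc0 += x[i].d     * (((int)(x[i].qs[j]     & 0x0F)) - 8) * y[(i)    *QK4_0 + j]
                  + x[i].d     * (((int)(x[i].qs[j]     >>   4)) - 8) * y[(i)    *QK4_0 + j + QK4_0/2];
            acc1 += x[i + 1].d * (((int)(x[i + 1].qs[j] & 0x0F)) - 8) * y[(i + 1)*QK4_0 + j]
                  + x[i + 1].d * (((int)(x[i + 1].qs[j] >>   4)) - 8) * y[(i + 1)*QK4_0 + j + QK4_0/2];
        }
    }
    return acc0 + acc1;
}
```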
I'm closing this, as IQ4_XS and Q4_K_S completely trump IQ4_NL performance-wise on CPU even without sgemm, while having the same or better perplexity and KL divergence. IQ4_NL was made for the special case where we can't use the I- or K-quant superblocks, and pretty much all modern models don't have this issue. If anyone's interested, feel free to reopen this or improve on my code, but I really don't see the point in it.
* squashed: re-add my iq4_nl sgemm PR ggml-org/llama.cpp#8049; have ggml_vec_dot_q4_0 do two blocks per loop for AVX; try out an F16C ggml_vec_dot_iq4_nl, but it's not really faster. As per ggml-org/llama.cpp#8549 we can calculate several blocks at a time with no issue
* shuffle
* remove F16C iq4_nl as I can't make it faster than before
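The "shuffle" commit above refers to doing the 16-entry IQ4_NL codebook lookup with a byte shuffle. A minimal sketch of that idea, assuming SSSE3's _mm_shuffle_epi8 (available on Ivy Bridge); this is an illustration, not the PR's exact code:

```c
#include <stdint.h>
#include <immintrin.h>                 // SSSE3: _mm_shuffle_epi8

// IQ4_NL's non-linear codebook (kvalues_iq4nl in ggml-common.h).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Expand 16 packed bytes (32 nibbles, one IQ4_NL block) into two vectors of
// signed 8-bit weights via a single table-lookup shuffle per half.
static void iq4nl_lookup(const uint8_t qs[16], __m128i * lo, __m128i * hi) {
    const __m128i tbl = _mm_loadu_si128((const __m128i *) kvalues_iq4nl);
    const __m128i q   = _mm_loadu_si128((const __m128i *) qs);
    const __m128i m4  = _mm_set1_epi8(0x0F);
    // pshufb uses the low 4 bits of each byte as a table index
    *lo = _mm_shuffle_epi8(tbl, _mm_and_si128(q, m4));
    *hi = _mm_shuffle_epi8(tbl, _mm_and_si128(_mm_srli_epi16(q, 4), m4));
}
```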
Since IQ4_NL is basically Q4_0 with an additional look-up table on the weights, we can easily add it to sgemm alongside the existing Q4_0 implementation. Currently prompt processing is around 10% faster with this change, but inference becomes 5% slower.
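As a sketch of why this works (simplified scalar code, not ggml's actual dequantization routines; the codebook values are kvalues_iq4nl from ggml-common.h, and the fp16 block scale is shown as fp32 for brevity): both formats store one scale and 32 packed 4-bit quants per block, and only the nibble-to-weight mapping differs.

```c
#include <stdint.h>

#define QK 32                          // block size for both Q4_0 and IQ4_NL

// IQ4_NL's codebook (kvalues_iq4nl in ggml-common.h).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Dequantize one block given its scale d and 16 packed bytes qs.
// is_iq4_nl selects the nibble -> weight mapping; everything else is shared,
// which is why IQ4_NL can slot into the existing Q4_0 sgemm path.
static void dequant_block(float d, const uint8_t qs[QK / 2], float y[QK], int is_iq4_nl) {
    for (int j = 0; j < QK / 2; ++j) {
        const int lo = qs[j] & 0x0F;   // element j
        const int hi = qs[j] >> 4;     // element j + QK/2
        if (is_iq4_nl) {               // non-linear: table lookup
            y[j]          = d * kvalues_iq4nl[lo];
            y[j + QK / 2] = d * kvalues_iq4nl[hi];
        } else {                       // Q4_0: linear, offset by 8
            y[j]          = d * (lo - 8);
            y[j + QK / 2] = d * (hi - 8);
        }
    }
}
```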
As I only have an Ivy Bridge machine, I'll need someone to benchmark this with AVX2 and check whether it's actually faster than master for prompt processing. I think it is, but if it isn't I'll make this change AVX-only.
(llama-bench chart removed as the numbers were off; see the comment below for my new results)