sgemm for IQ4_NL #8049
netrunnereve wants to merge 22 commits into ggml-org:master from netrunnereve:sgemm_iq4_nl
Conversation
After further testing on my desktop (not the inconsistent server VM that I posted my original results with) I'm seeing a clear 5% degradation in inference speed with sgemm on IQ4_NL, while prompt processing speed is improved by around 10%. On the server I have seen up to a 15% prompt processing boost in some cases, but the 5% inference slowdown is present as well. What's happening here is that sgemm overrides the existing implementation.

Desktop results (Xeon E3 v2, 4c/8t): [benchmark table omitted]
Server results (8 core VM on Xeon E5 v2, 8c/16t, unloaded rerun): [benchmark table omitted]
I'm not interested in modifying sgemm to do two blocks per loop, as that would also mess with how tiling is set up. Right now I guess the question is whether or not a 10-15% improvement in prompt processing is worth a 5% regression in inference speed.
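For context on what "two blocks per loop" means, here is a minimal scalar sketch, with names and layout simplified for illustration (the real ggml kernel dots Q4_0 against Q8_0 using AVX intrinsics): the dot-product loop is unrolled so two blocks are handled per iteration with independent accumulators, the approach referenced in ggml-org/llama.cpp#8549.

```c
#include <stdint.h>

#define QK4_0 32                       // values per Q4_0 block

typedef struct {
    float   d;                         // per-block scale (fp16 in ggml, fp32 here)
    uint8_t qs[QK4_0 / 2];             // 32 packed 4-bit quants
} blk_t;                               // simplified stand-in for block_q4_0

// Dot product of nb quantized blocks x against fp32 y, two blocks per pass.
// For brevity this assumes nb is even; a real version needs a tail iteration.
static float vec_dot_two_blocks(int nb, const blk_t * x, const float * y) {
    float acc0 = 0.0f, acc1 = 0.0f;    // independent accumulators break the
                                       // single long add dependency chain
    for (int i = 0; i < nb; i += 2) {
        for (int j = 0; j < QK4_0 / 2; ++j) {
            // low nibble is element j, high nibble is element j + QK4_0/2
            acc0 += x[i].d     * (((int)(x[i].qs[j]     & 0x0F)) - 8) * y[(i)    *QK4_0 + j]
                  + x[i].d     * (((int)(x[i].qs[j]     >>   4)) - 8) * y[(i)    *QK4_0 + j + QK4_0/2];
            acc1 += x[i + 1].d * (((int)(x[i + 1].qs[j] & 0x0F)) - 8) * y[(i + 1)*QK4_0 + j]
                  + x[i + 1].d * (((int)(x[i + 1].qs[j] >>   4)) - 8) * y[(i + 1)*QK4_0 + j + QK4_0/2];
        }
    }
    return acc0 + acc1;
}
```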
I'm closing this, as IQ4_XS and Q4_K_S completely trump IQ4_NL performance-wise on CPU even without sgemm, while having the same or better perplexity and KL divergence. IQ4_NL was made for the special case where we can't use the I- or K-quant superblocks, and pretty much all modern models don't have this issue. If anyone's interested, feel free to reopen this or improve on my code, but I really don't see the point in it.
* squashed: re-add my iq4_nl sgemm PR ggml-org/llama.cpp#8049; have ggml_vec_dot_q4_0 do two blocks per loop for AVX; try out an F16C ggml_vec_dot_iq4_nl, but it's not really faster. As per ggml-org/llama.cpp#8549 we can calculate several blocks at a time with no issue
* shuffle
* remove F16C iq4_nl as I can't make it faster than before
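The "shuffle" commit above refers to doing the 16-entry IQ4_NL codebook lookup with a byte shuffle. A minimal sketch of that idea, assuming SSSE3's _mm_shuffle_epi8 (available on Ivy Bridge); this is an illustration, not the PR's exact code:

```c
#include <stdint.h>
#include <immintrin.h>                 // SSSE3: _mm_shuffle_epi8

// IQ4_NL's non-linear codebook (kvalues_iq4nl in ggml-common.h).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Expand 16 packed bytes (32 nibbles, one IQ4_NL block) into two vectors of
// signed 8-bit weights via a single table-lookup shuffle per half.
static void iq4nl_lookup(const uint8_t qs[16], __m128i * lo, __m128i * hi) {
    const __m128i tbl = _mm_loadu_si128((const __m128i *) kvalues_iq4nl);
    const __m128i q   = _mm_loadu_si128((const __m128i *) qs);
    const __m128i m4  = _mm_set1_epi8(0x0F);
    // pshufb uses the low 4 bits of each byte as a table index
    *lo = _mm_shuffle_epi8(tbl, _mm_and_si128(q, m4));
    *hi = _mm_shuffle_epi8(tbl, _mm_and_si128(_mm_srli_epi16(q, 4), m4));
}
```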
Since IQ4_NL is basically Q4_0 with an additional look-up table on the weights, we can easily add it to sgemm alongside the existing Q4_0 implementation. Currently prompt processing is around 10% faster with this change, but inference becomes 5% slower.
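As a sketch of why this works (simplified scalar code, not ggml's actual dequantization routines; the codebook values are kvalues_iq4nl from ggml-common.h, and the fp16 block scale is shown as fp32 for brevity): both formats store one scale and 32 packed 4-bit quants per block, and only the nibble-to-weight mapping differs.

```c
#include <stdint.h>

#define QK 32                          // block size for both Q4_0 and IQ4_NL

// IQ4_NL's codebook (kvalues_iq4nl in ggml-common.h).
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Dequantize one block given its scale d and 16 packed bytes qs.
// is_iq4_nl selects the nibble -> weight mapping; everything else is shared,
// which is why IQ4_NL can slot into the existing Q4_0 sgemm path.
static void dequant_block(float d, const uint8_t qs[QK / 2], float y[QK], int is_iq4_nl) {
    for (int j = 0; j < QK / 2; ++j) {
        const int lo = qs[j] & 0x0F;   // element j
        const int hi = qs[j] >> 4;     // element j + QK/2
        if (is_iq4_nl) {               // non-linear: table lookup
            y[j]          = d * kvalues_iq4nl[lo];
            y[j + QK / 2] = d * kvalues_iq4nl[hi];
        } else {                       // Q4_0: linear, offset by 8
            y[j]          = d * (lo - 8);
            y[j + QK / 2] = d * (hi - 8);
        }
    }
}
```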
As I only have an Ivy Bridge machine, I'll need someone to benchmark this with AVX2 and check whether it's actually faster than master for prompt processing. I think it is, but if it isn't I'll make this change AVX-only.
(llama-bench chart removed as the numbers were off; see the comment below for my new results)