
avx2: compute ksigns instead of loading from table #19657

Open
dfriehs wants to merge 5 commits into ggml-org:master from dfriehs:iq2xxs-avx2

Conversation

dfriehs (Contributor) commented on Feb 16, 2026

This is my attempt at implementing the ksign computation like in #19624, but for the CPU AVX2 backend. Unfortunately I'm not sure whether it's an improvement: test-backend-ops perf shows slightly higher FLOPS (but it uses hyperthreading), while llama-bench shows a reduction in performance.

Disclaimer: this is the first time I've seriously worked with AVX instructions. The flow matches the one implemented in ggml_vec_dot_iq2_xs_q8_K quite closely (which I realized far too late), but the IQ2_XS signs seem to be packed in a way that is more beneficial for SIMD.

As I'm not sure the computation is actually faster, and I'm not sure how (or whether it is even possible) to tune this further, I would appreciate it if someone experienced with AVX2 could give me a review, or if someone who runs IQ2_XXS models with layers on the CPU could benchmark the changes. Right now I'd err on the side of closing this PR. A sketch of the idea is below.
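
For readers unfamiliar with the trick: each entry of the ksigns_iq2xs table is its own index with bit 7 set to the odd parity of the low 7 bits, so every table byte has even popcount. The 128-byte table load can therefore be replaced by a vectorized parity computation. Below is a minimal sketch of that idea, not the actual code from this PR; it assumes the 7-bit sign indices have already been unpacked into the byte lanes of a vector, and the helper name is made up:

```c
#include <immintrin.h>

// Hypothetical sketch: reconstruct ksigns_iq2xs[idx] for 32 byte lanes at
// once. ksigns_iq2xs[i] == i | (odd_parity(i) << 7), so instead of a table
// lookup we compute a per-byte popcount (nibble LUT via PSHUFB) and move its
// low bit into bit 7.
static inline __m256i compute_ksigns(const __m256i idx) {
    const __m256i pc_lut = _mm256_setr_epi8(
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i m4 = _mm256_set1_epi8(0x0f);
    // popcount of low and high nibbles, summed per byte
    const __m256i lo  = _mm256_shuffle_epi8(pc_lut, _mm256_and_si256(idx, m4));
    const __m256i hi  = _mm256_shuffle_epi8(pc_lut,
            _mm256_and_si256(_mm256_srli_epi16(idx, 4), m4));
    const __m256i cnt = _mm256_add_epi8(lo, hi);
    // parity = bit 0 of the popcount; masking first makes the 16-bit shift
    // safe per byte, then the OR plants it as the eighth sign bit
    const __m256i par = _mm256_slli_epi16(
            _mm256_and_si256(cnt, _mm256_set1_epi8(1)), 7);
    return _mm256_or_si256(idx, par);
}
```

The remaining step, expanding each ksigns byte into eight per-weight 0x00/0xFF sign masks, can follow the same broadcast-and-test-bit pattern the existing iq2_xs AVX2 path already uses.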


All performance tests were built without BLAS and with only the CPU backend. test-backend-ops test -b CUDA0 -p iq2_xxs passes for me, so the computation should be correct (test-backend-ops compares the tested backend against the CPU backend, so that run exercises the modified CPU code as the reference).

nice -20 test-backend-ops perf -b CPU -p iq2_xxs on ff4affb (master)

Backend 1/1: CPU
  Device description: AMD Ryzen 9 3950X 16-Core Processor
  Device memory: 128711 MB (128711 MB free)

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    163185 runs -    367.74 us/run - 117.44 MFLOP/run - 319.36 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     79135 runs -    758.39 us/run - 234.88 MFLOP/run - 309.71 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     52256 runs -   1148.44 us/run - 352.32 MFLOP/run - 306.78 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     39780 runs -   1508.89 us/run - 469.76 MFLOP/run - 311.33 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     31724 runs -   1891.36 us/run - 587.20 MFLOP/run - 310.47 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     19962 runs -   3006.93 us/run - 939.52 MFLOP/run - 312.45 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     318 runs - 189173.98 us/run -  60.13 GFLOP/run - 317.85 GFLOPS
  Backend CPU: OK
1/1 backends passed
OK

on 7fe317f (this branch)

Backend 1/1: CPU
  Device description: AMD Ryzen 9 3950X 16-Core Processor
  Device memory: 128711 MB (128711 MB free)

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    167187 runs -    358.97 us/run - 117.44 MFLOP/run - 327.16 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     79240 runs -    757.28 us/run - 234.88 MFLOP/run - 310.16 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     53636 runs -   1118.74 us/run - 352.32 MFLOP/run - 314.93 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     40212 runs -   1492.56 us/run - 469.76 MFLOP/run - 314.74 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     32508 runs -   1846.09 us/run - 587.20 MFLOP/run - 318.08 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     20412 runs -   2940.07 us/run - 939.52 MFLOP/run - 319.56 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     322 runs - 186651.46 us/run -  60.13 GFLOP/run - 322.15 GFLOPS
  Backend CPU: OK
1/1 backends passed
OK

taskset -c 0-15 nice -20 llama-bench -m Qwen2.5-Coder-32B-Instruct-IQ2_XXS.gguf -fa 0,1 -p 128 -n 64 --threads 16 on ff4affb (master)

| model | size | params | backend | threads | fa | test | t/s |
| ----- | ---- | ------ | ------- | ------- | -- | ---- | --- |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | pp128 | 5.95 ± 0.07 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | tg64 | 3.90 ± 0.00 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | pp128 | 5.81 ± 0.01 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | tg64 | 3.96 ± 0.00 |

on 7fe317f (this branch)

| model | size | params | backend | threads | fa | test | t/s |
| ----- | ---- | ------ | ------- | ------- | -- | ---- | --- |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | pp128 | 5.56 ± 0.07 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | tg64 | 3.88 ± 0.00 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | pp128 | 5.53 ± 0.01 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | tg64 | 3.95 ± 0.00 |

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 16, 2026
dfriehs (Contributor, Author) commented on Feb 16, 2026

I found one more trick that seems to speed things up a bit; performance now looks better in both benchmarks. I'd still appreciate benchmarks on different systems, though.

There are still some things left to do (the mask >> 27 | 1 trick, unifying the arrays with ggml_vec_dot_iq2_xs_q8_K, applying the same optimization to iq3_xxs, ...), but I would appreciate a review first, to know whether I should continue with this AVX2 code. A note on the shift trick follows below.
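
For context on the mask >> 27 | 1 trick, assuming mask is the 32-bit word (aux32 in the existing scalar code) that packs four 7-bit sign indices into bits 0-27 and the 4-bit block scale into bits 28-31: dequantization needs the odd factor 2*(aux32 >> 28) + 1, and the shift-OR form computes it in two cheap operations, because the stray sign bit that lands in bit 0 is overwritten by the OR anyway. A minimal self-check of the identity:

```c
#include <assert.h>
#include <stdint.h>

// Verifies (aux32 >> 27) | 1 == 2*(aux32 >> 28) + 1 for every scale value.
// (aux32 >> 27) == (scale << 1) | bit27, and OR-ing with 1 forces bit 0 to 1,
// so the leftover sign bit (bit 27) never matters.
int main(void) {
    const uint32_t sign_bits = 0x055aa55u; // arbitrary junk in bits 0..26
    for (uint32_t scale = 0; scale < 16; ++scale) {
        for (uint32_t bit27 = 0; bit27 < 2; ++bit27) {
            const uint32_t aux32 = (scale << 28) | (bit27 << 27) | sign_bits;
            assert(((aux32 >> 27) | 1) == 2*scale + 1);
        }
    }
    return 0;
}
```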


nice -20 test-backend-ops perf -b CPU -p iq2_xxs on 4515987 (this branch)

Backend 1/1: CPU
  Device description: AMD Ryzen 9 3950X 16-Core Processor
  Device memory: 128711 MB (128711 MB free)

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    189336 runs -    316.95 us/run - 117.44 MFLOP/run - 370.53 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     89425 runs -    671.07 us/run - 234.88 MFLOP/run - 350.01 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     59202 runs -   1013.48 us/run - 352.32 MFLOP/run - 347.63 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     44352 runs -   1353.04 us/run - 469.76 MFLOP/run - 347.19 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     35546 runs -   1688.08 us/run - 587.20 MFLOP/run - 347.85 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     22158 runs -   2708.75 us/run - 939.52 MFLOP/run - 346.85 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     349 runs - 172106.64 us/run -  60.13 GFLOP/run - 349.37 GFLOPS
  Backend CPU: OK
1/1 backends passed
OK

taskset -c 0-15 nice -20 llama-bench -m Qwen2.5-Coder-32B-Instruct-IQ2_XXS.gguf -fa 0,1 -p 128 -n 64 --threads 16 on 4515987 (this branch)

| model | size | params | backend | threads | fa | test | t/s |
| ----- | ---- | ------ | ------- | ------- | -- | ---- | --- |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | pp128 | 6.26 ± 0.07 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | tg64 | 3.89 ± 0.00 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | pp128 | 6.06 ± 0.01 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | tg64 | 3.96 ± 0.00 |
