
avx2: compute ksigns instead of loading from table #19657

Open
dfriehs wants to merge 5 commits into ggml-org:master from dfriehs:iq2xxs-avx2

Conversation

dfriehs (Contributor) commented on Feb 16, 2026

This is my attempt at implementing the ksign computation like in #19624, but for the CPU AVX2 backend. Unfortunately I'm not sure whether it's an improvement: test-backend-ops perf shows slightly higher FLOPS (but it uses hyperthreading), while llama-bench shows a reduction in performance.

Disclaimer: this is the first time I've seriously worked with AVX instructions. The flow matches the one implemented in ggml_vec_dot_iq2_xs_q8_K quite closely (which I realized far too late), but the IQ2_XS signs seem to be packed in a way that is more beneficial for SIMD.

As I'm not sure the computation is actually faster, and I'm not sure how (or whether it is even possible) to tune this further, I would appreciate it if someone experienced with AVX2 could give me a review, or if someone who runs IQ2_XXS models with layers on the CPU could benchmark the changes. Right now I'd err on the side of closing this PR. A sketch of the idea is below.
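
For readers unfamiliar with the trick: each entry of the ksigns_iq2xs table is its own index with bit 7 set to the odd parity of the low 7 bits, so every table byte has even popcount. The 128-byte table load can therefore be replaced by a vectorized parity computation. Below is a minimal sketch of that idea, not the actual code from this PR; it assumes the 7-bit sign indices have already been unpacked into the byte lanes of a vector, and the helper name is made up:

```c
#include <immintrin.h>

// Hypothetical sketch: reconstruct ksigns_iq2xs[idx] for 32 byte lanes at
// once. ksigns_iq2xs[i] == i | (odd_parity(i) << 7), so instead of a table
// lookup we compute a per-byte popcount (nibble LUT via PSHUFB) and move its
// low bit into bit 7.
static inline __m256i compute_ksigns(const __m256i idx) {
    const __m256i pc_lut = _mm256_setr_epi8(
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i m4 = _mm256_set1_epi8(0x0f);
    // popcount of low and high nibbles, summed per byte
    const __m256i lo  = _mm256_shuffle_epi8(pc_lut, _mm256_and_si256(idx, m4));
    const __m256i hi  = _mm256_shuffle_epi8(pc_lut,
            _mm256_and_si256(_mm256_srli_epi16(idx, 4), m4));
    const __m256i cnt = _mm256_add_epi8(lo, hi);
    // parity = bit 0 of the popcount; masking first makes the 16-bit shift
    // safe per byte, then the OR plants it as the eighth sign bit
    const __m256i par = _mm256_slli_epi16(
            _mm256_and_si256(cnt, _mm256_set1_epi8(1)), 7);
    return _mm256_or_si256(idx, par);
}
```

The remaining step, expanding each ksigns byte into eight per-weight 0x00/0xFF sign masks, can follow the same broadcast-and-test-bit pattern the existing iq2_xs AVX2 path already uses.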


All performance tests were built without BLAS and with only the CPU backend. test-backend-ops test -b CUDA0 -p iq2_xxs passes for me, so the computation should be correct (test-backend-ops compares the tested backend against the CPU backend, so that run exercises the modified CPU code as the reference).

nice -20 test-backend-ops perf -b CPU -p iq2_xxs on ff4affb (master)

Backend 1/1: CPU
  Device description: AMD Ryzen 9 3950X 16-Core Processor
  Device memory: 128711 MB (128711 MB free)

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    163185 runs -    367.74 us/run - 117.44 MFLOP/run - 319.36 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     79135 runs -    758.39 us/run - 234.88 MFLOP/run - 309.71 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     52256 runs -   1148.44 us/run - 352.32 MFLOP/run - 306.78 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     39780 runs -   1508.89 us/run - 469.76 MFLOP/run - 311.33 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     31724 runs -   1891.36 us/run - 587.20 MFLOP/run - 310.47 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     19962 runs -   3006.93 us/run - 939.52 MFLOP/run - 312.45 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     318 runs - 189173.98 us/run -  60.13 GFLOP/run - 317.85 GFLOPS
  Backend CPU: OK
1/1 backends passed
OK

on 7fe317f (this branch)

Backend 1/1: CPU
  Device description: AMD Ryzen 9 3950X 16-Core Processor
  Device memory: 128711 MB (128711 MB free)

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    167187 runs -    358.97 us/run - 117.44 MFLOP/run - 327.16 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     79240 runs -    757.28 us/run - 234.88 MFLOP/run - 310.16 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     53636 runs -   1118.74 us/run - 352.32 MFLOP/run - 314.93 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     40212 runs -   1492.56 us/run - 469.76 MFLOP/run - 314.74 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     32508 runs -   1846.09 us/run - 587.20 MFLOP/run - 318.08 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     20412 runs -   2940.07 us/run - 939.52 MFLOP/run - 319.56 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     322 runs - 186651.46 us/run -  60.13 GFLOP/run - 322.15 GFLOPS
  Backend CPU: OK
1/1 backends passed
OK

taskset -c 0-15 nice -20 llama-bench -m Qwen2.5-Coder-32B-Instruct-IQ2_XXS.gguf -fa 0,1 -p 128 -n 64 --threads 16 on ff4affb (master)

| model | size | params | backend | threads | fa | test | t/s |
| ----- | ---- | ------ | ------- | ------- | -- | ---- | --- |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | pp128 | 5.95 ± 0.07 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | tg64 | 3.90 ± 0.00 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | pp128 | 5.81 ± 0.01 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | tg64 | 3.96 ± 0.00 |

on 7fe317f (this branch)

| model | size | params | backend | threads | fa | test | t/s |
| ----- | ---- | ------ | ------- | ------- | -- | ---- | --- |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | pp128 | 5.56 ± 0.07 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | tg64 | 3.88 ± 0.00 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | pp128 | 5.53 ± 0.01 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | tg64 | 3.95 ± 0.00 |

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 16, 2026
dfriehs (Contributor, Author) commented on Feb 16, 2026

I found one more trick that seems to speed things up a bit; performance now looks better in both benchmarks. I'd still appreciate benchmarks on different systems, though.

There are still some things left to do (the mask >> 27 | 1 trick, unifying the arrays with ggml_vec_dot_iq2_xs_q8_K, applying the same optimization to iq3_xxs, ...), but I would appreciate a review first, to know whether I should continue with this AVX2 code. A note on the shift trick follows below.
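
For context on the mask >> 27 | 1 trick, assuming mask is the 32-bit word (aux32 in the existing scalar code) that packs four 7-bit sign indices into bits 0-27 and the 4-bit block scale into bits 28-31: dequantization needs the odd factor 2*(aux32 >> 28) + 1, and the shift-OR form computes it in two cheap operations, because the stray sign bit that lands in bit 0 is overwritten by the OR anyway. A minimal self-check of the identity:

```c
#include <assert.h>
#include <stdint.h>

// Verifies (aux32 >> 27) | 1 == 2*(aux32 >> 28) + 1 for every scale value.
// (aux32 >> 27) == (scale << 1) | bit27, and OR-ing with 1 forces bit 0 to 1,
// so the leftover sign bit (bit 27) never matters.
int main(void) {
    const uint32_t sign_bits = 0x055aa55u; // arbitrary junk in bits 0..26
    for (uint32_t scale = 0; scale < 16; ++scale) {
        for (uint32_t bit27 = 0; bit27 < 2; ++bit27) {
            const uint32_t aux32 = (scale << 28) | (bit27 << 27) | sign_bits;
            assert(((aux32 >> 27) | 1) == 2*scale + 1);
        }
    }
    return 0;
}
```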


nice -20 test-backend-ops perf -b CPU -p iq2_xxs on 4515987 (this branch)

Backend 1/1: CPU
  Device description: AMD Ryzen 9 3950X 16-Core Processor
  Device memory: 128711 MB (128711 MB free)

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):    189336 runs -    316.95 us/run - 117.44 MFLOP/run - 370.53 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     89425 runs -    671.07 us/run - 234.88 MFLOP/run - 350.01 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     59202 runs -   1013.48 us/run - 352.32 MFLOP/run - 347.63 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     44352 runs -   1353.04 us/run - 469.76 MFLOP/run - 347.19 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     35546 runs -   1688.08 us/run - 587.20 MFLOP/run - 347.85 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     22158 runs -   2708.75 us/run - 939.52 MFLOP/run - 346.85 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):     349 runs - 172106.64 us/run -  60.13 GFLOP/run - 349.37 GFLOPS
  Backend CPU: OK
1/1 backends passed
OK

taskset -c 0-15 nice -20 llama-bench -m Qwen2.5-Coder-32B-Instruct-IQ2_XXS.gguf -fa 0,1 -p 128 -n 64 --threads 16 on 4515987 (this branch)

| model | size | params | backend | threads | fa | test | t/s |
| ----- | ---- | ------ | ------- | ------- | -- | ---- | --- |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | pp128 | 6.26 ± 0.07 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 0 | tg64 | 3.89 ± 0.00 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | pp128 | 6.06 ± 0.01 |
| qwen2 32B IQ2_XXS - 2.0625 bpw | 8.40 GiB | 32.76 B | CPU | 16 | 1 | tg64 | 3.96 ± 0.00 |
