
Bit-interleaved Q1_0 8x32 repack kernels for x86 AVX2#29

Open
pl752 wants to merge 1 commit into PrismML-Eng:prism from pl752:perf/q1_0_8x32_repack_AVX2

Conversation


@pl752 pl752 commented May 2, 2026

Continuation of #21 and #10

Been a hot minute

Decided to drop nrc==2 (might revisit it if plain AVX and SSSE3 support is needed), since it is mostly used in specific situations for ARM_DOTPROD, and to focus on optimized gemv and gemm instead.

Also, I have finally moved from WSL2 to native Linux, so benchmarks are now run with -fa 1 -mmp 0 -r 5 -t 6 instead of -t 10, since SMT threads no longer help performance significantly but do increase memory pressure. Benchmark baselines have therefore shifted again.

| flow | run | dot | repack | delta |
|------|-----|-----|--------|-------|
| AVX2 | pp512 | 139.80 t/s | 190.98 t/s | +36.61% |
| AVX2 | tg128 | 91.70 t/s | 115.17 t/s | +25.59% |
| AVX512\* | pp512 | 145.09 t/s | 219.96 t/s | +51.60% |
| AVX512\* | tg128 | 93.34 t/s | 120.47 t/s | +29.07% |

\* register file increase only, no special kernel

AVX512 is usable in theory, but I haven't yet managed to implement a kernel that doesn't regress Zen 4 AVX512 performance, so the code currently relies on the AVX2 path.

Perplexity:

```
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9558 ±    3.1805      -0.00009 ±    0.00239       0.00021 ±    0.00003     0.396 ±  0.056 %    99.608 ±  0.392 %
   2      20.2053 ±    3.4389       0.01465 ±    0.01152       0.00022 ±    0.00002     0.386 ±  0.034 %    99.412 ±  0.339 %
   3      20.8472 ±    2.7882       0.00892 ±    0.00770       0.00022 ±    0.00001     0.365 ±  0.026 %    99.085 ±  0.344 %
   4      21.1986 ±    2.3887       0.00633 ±    0.00579       0.00022 ±    0.00001     0.377 ±  0.026 %    99.216 ±  0.276 %
   5      21.0772 ±    2.1025       0.00518 ±    0.00466       0.00023 ±    0.00001     0.365 ±  0.022 %    99.216 ±  0.247 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  21.077184 ±   2.102473
Mean PPL(base)                :  20.968387 ±   2.074795
Cor(ln(PPL(Q)), ln(PPL(base))):  99.89%
Mean ln(PPL(Q)/PPL(base))     :   0.005175 ±   0.004663
Mean PPL(Q)/PPL(base)         :   1.005189 ±   0.004688
Mean PPL(Q)-PPL(base)         :   0.108796 ±   0.100463

====== KL divergence statistics ======
Mean    KLD:   0.000226 ±   0.000011
Maximum KLD:   0.006768
99.9%   KLD:   0.005245
99.0%   KLD:   0.001404
95.0%   KLD:   0.000682
90.0%   KLD:   0.000481
Median  KLD:   0.000135
10.0%   KLD:   0.000002
 5.0%   KLD:   0.000000
 1.0%   KLD:  -0.000010
 0.1%   KLD:  -0.000033
Minimum KLD:  -0.000039

====== Token probability statistics ======
Mean    Δp:  0.020 ± 0.010 %
Maximum Δp:  3.536%
99.9%   Δp:  2.703%
99.0%   Δp:  1.293%
95.0%   Δp:  0.595%
90.0%   Δp:  0.300%
75.0%   Δp:  0.065%
Median  Δp:  0.000%
25.0%   Δp: -0.041%
10.0%   Δp: -0.277%
 5.0%   Δp: -0.472%
 1.0%   Δp: -1.087%
 0.1%   Δp: -1.576%
Minimum Δp: -1.698%
RMS Δp    :  0.365 ± 0.022 %
Same top p: 99.216 ± 0.247 %
```

For some reason the model identifies its type as Q2_0.

Benchmarks at various thread counts for the repack AVX512 build:

| model | size | params | backend | threads | fa | mmap | test | t/s |
|-------|------|--------|---------|---------|----|------|------|-----|
| qwen3 1.7B Q2_0 (HUH!? Y?) | 231.13 MiB | 1.72 B | CPU | 4 | 1 | 0 | pp512 | 167.58 ± 2.59 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 4 | 1 | 0 | tg128 | 94.55 ± 0.14 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 6 | 1 | 0 | pp512 | 219.96 ± 0.17 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 6 | 1 | 0 | tg128 | 120.47 ± 0.16 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | pp512 | 200.69 ± 0.23 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | tg128 | 120.49 ± 0.08 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 10 | 1 | 0 | pp512 | 197.99 ± 1.67 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 10 | 1 | 0 | tg128 | 116.79 ± 1.11 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 12 | 1 | 0 | pp512 | 210.22 ± 0.35 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 12 | 1 | 0 | tg128 | 121.91 ± 0.16 |

@github-actions github-actions Bot added the ggml label May 2, 2026
@pl752 pl752 marked this pull request as ready for review May 2, 2026 11:09