
Bit-interleaved Q1_0 8x32 repack kernels for x86 AVX2#29

Open
pl752 wants to merge 1 commit into PrismML-Eng:prism from pl752:perf/q1_0_8x32_repack_AVX2

Conversation


@pl752 pl752 commented May 2, 2026

Continuation of #21 and #10

Been a hot minute

Decided to drop nrc==2 (might revisit it if plain AVX and SSSE3 support is needed), since it is mostly used in specific situations for ARM_DOTPROD, and to focus on optimized gemv and gemm instead.

Also, I have finally moved from WSL2 to native Linux, so benchmarks are now run with -fa 1 -mmp 0 -r 5 -t 6 instead of -t 10, since SMT threads no longer help performance significantly but do increase memory pressure. Benchmark baselines have therefore shifted again.

| flow | run | dot | repack | delta |
|------|-----|-----|--------|-------|
| AVX2 | pp512 | 139.80 t/s | 190.98 t/s | +36.61% |
| AVX2 | tg128 | 91.70 t/s | 115.17 t/s | +25.59% |
| AVX512\* | pp512 | 145.09 t/s | 219.96 t/s | +51.60% |
| AVX512\* | tg128 | 93.34 t/s | 120.47 t/s | +29.07% |

\* register file increase only, no special kernel

AVX512 is usable in theory, but I haven't yet managed to implement a kernel that doesn't regress Zen 4 AVX512 performance, so the code currently relies on the AVX2 path.

Perplexity:

```
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9558 ±    3.1805      -0.00009 ±    0.00239       0.00021 ±    0.00003     0.396 ±  0.056 %    99.608 ±  0.392 %
   2      20.2053 ±    3.4389       0.01465 ±    0.01152       0.00022 ±    0.00002     0.386 ±  0.034 %    99.412 ±  0.339 %
   3      20.8472 ±    2.7882       0.00892 ±    0.00770       0.00022 ±    0.00001     0.365 ±  0.026 %    99.085 ±  0.344 %
   4      21.1986 ±    2.3887       0.00633 ±    0.00579       0.00022 ±    0.00001     0.377 ±  0.026 %    99.216 ±  0.276 %
   5      21.0772 ±    2.1025       0.00518 ±    0.00466       0.00023 ±    0.00001     0.365 ±  0.022 %    99.216 ±  0.247 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  21.077184 ±   2.102473
Mean PPL(base)                :  20.968387 ±   2.074795
Cor(ln(PPL(Q)), ln(PPL(base))):  99.89%
Mean ln(PPL(Q)/PPL(base))     :   0.005175 ±   0.004663
Mean PPL(Q)/PPL(base)         :   1.005189 ±   0.004688
Mean PPL(Q)-PPL(base)         :   0.108796 ±   0.100463

====== KL divergence statistics ======
Mean    KLD:   0.000226 ±   0.000011
Maximum KLD:   0.006768
99.9%   KLD:   0.005245
99.0%   KLD:   0.001404
95.0%   KLD:   0.000682
90.0%   KLD:   0.000481
Median  KLD:   0.000135
10.0%   KLD:   0.000002
 5.0%   KLD:   0.000000
 1.0%   KLD:  -0.000010
 0.1%   KLD:  -0.000033
Minimum KLD:  -0.000039

====== Token probability statistics ======
Mean    Δp:  0.020 ± 0.010 %
Maximum Δp:  3.536%
99.9%   Δp:  2.703%
99.0%   Δp:  1.293%
95.0%   Δp:  0.595%
90.0%   Δp:  0.300%
75.0%   Δp:  0.065%
Median  Δp:  0.000%
25.0%   Δp: -0.041%
10.0%   Δp: -0.277%
 5.0%   Δp: -0.472%
 1.0%   Δp: -1.087%
 0.1%   Δp: -1.576%
Minimum Δp: -1.698%
RMS Δp    :  0.365 ± 0.022 %
Same top p: 99.216 ± 0.247 %
```

For some reason the model identifies its type as Q2_0.

Benchmarks at various thread counts for the repack AVX512 build:

| model | size | params | backend | threads | fa | mmap | test | t/s |
|-------|------|--------|---------|---------|----|------|------|-----|
| qwen3 1.7B Q2_0 (HUH!? Y?) | 231.13 MiB | 1.72 B | CPU | 4 | 1 | 0 | pp512 | 167.58 ± 2.59 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 4 | 1 | 0 | tg128 | 94.55 ± 0.14 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 6 | 1 | 0 | pp512 | 219.96 ± 0.17 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 6 | 1 | 0 | tg128 | 120.47 ± 0.16 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | pp512 | 200.69 ± 0.23 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | tg128 | 120.49 ± 0.08 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 10 | 1 | 0 | pp512 | 197.99 ± 1.67 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 10 | 1 | 0 | tg128 | 116.79 ± 1.11 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 12 | 1 | 0 | pp512 | 210.22 ± 0.35 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 12 | 1 | 0 | tg128 | 121.91 ± 0.16 |

@github-actions github-actions Bot added the ggml label May 2, 2026
@pl752 pl752 marked this pull request as ready for review May 2, 2026 11:09