(Prototype) q1_0 nrc = 2 and diabolic tiles branches #21
pl752 wants to merge 9 commits into PrismML-Eng:prism
Conversation
Hi @pl752, impressive. Initially the tps gain is not visible, since my test prompt does not have a long context. Have you considered a 4x4 dot kernel?
@zcattacz I was experimenting with repack and other shapes of the 2x2 dot for AVX2, but not very successfully; a 4x4 dot is likely my next target.
Most likely I will implement 4x1, 8x1, 4x2, or other kernel shapes outside the default mul_mat.
Some points from the AI's analysis that seem to make sense:
Also, I have found the reason for the lower-than-usual results: I had somehow missed that the later tests were run with …
Also cooking …
I have completed trying various tile shapes (the final forms used are 1x1, 2x2, 2x1, 4x2, and 4x4); large tiles are only used where it is reasonable given register counts and memory bandwidth limitations. The resulting code is pretty cursed/diabolic (of course it is vibe coded and won't go anywhere near mainline), but it seems to more or less max out my CPU, barring other significant refinements. Results since nrc=2 are as follows (SSSE3 was not affected code-wise); most benefits are from AVX-512 (4x4) in pp and AVX2/AVX-512 (2x1) in tg:
I have tried other (larger or wider/longer) shapes and didn't obtain notable improvements.
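For illustration, here is a minimal sketch of what a 2x2 tile inner loop could look like with AVX2 FMA. This is an assumed shape based on the description above, not the PR's actual kernel; the register-budget point is that 4 accumulators plus 4 operand loads stay well under the 16 ymm registers, while each load feeds two FMAs.

```c
#include <immintrin.h>

// Horizontal sum of an 8-float AVX2 register.
static float hsum256(__m256 v) {
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(v), _mm256_extractf128_ps(v, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

// Illustrative 2x2 tile: two x rows against two y columns. Each of the
// 4 loads per iteration feeds two FMAs, doubling compute density per
// byte loaded. Assumes n is a multiple of 8; tail handling omitted.
static void dot_tile_2x2_f32(int n, const float *x0, const float *x1,
                             const float *y0, const float *y1, float out[4]) {
    __m256 a00 = _mm256_setzero_ps(), a01 = _mm256_setzero_ps();
    __m256 a10 = _mm256_setzero_ps(), a11 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 vx0 = _mm256_loadu_ps(x0 + i), vx1 = _mm256_loadu_ps(x1 + i);
        __m256 vy0 = _mm256_loadu_ps(y0 + i), vy1 = _mm256_loadu_ps(y1 + i);
        a00 = _mm256_fmadd_ps(vx0, vy0, a00);
        a01 = _mm256_fmadd_ps(vx0, vy1, a01);
        a10 = _mm256_fmadd_ps(vx1, vy0, a10);
        a11 = _mm256_fmadd_ps(vx1, vy1, a11);
    }
    out[0] = hsum256(a00); out[1] = hsum256(a01);
    out[2] = hsum256(a10); out[3] = hsum256(a11);
}
```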
@pl752 Also going to point this PR at the prism branch; just cleaned up the branch from fresh llama.cpp master and applied your pending PR #86 and our CUDA PR on top. There might be some merge conflicts. This is a lot of code, but glad to see it's 17% better on some CPUs. Will keep this PR open until we figure out the initial x86 PR.
@khosravipasha For now I am looking for a better solution than the current additional kernels injected into the regular loop; I think I will stick with nrc==2 support and add repack with 8x8 (might be changed) panel gemm and gemv kernels. I have already found a way to boost tg further (gemv), but I am still a little bit stuck with the gemm implementation. Plus, I won't be as active because I have a small problem to sort out at my real job.
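To make the repack idea concrete, here is a hypothetical f32 sketch of an 8x8 panel layout. The 8x8 size is taken from the comment above; the function name, layout, and element type are illustrative only, and the actual kernels would operate on quantized blocks.

```c
// Hypothetical 8x8 repack sketch (illustrative, not the PR's layout):
// src is row-major with leading dimension ld; dst receives 8-row panels
// where each group of 8 floats holds one column slice of the 8 rows, so
// the gemm microkernel streams one contiguous panel instead of striding
// across eight separate row pointers.
static void repack_panel_8x8_f32(const float *src, int ld,
                                 float *dst, int ncols /* multiple of 8 */) {
    for (int j0 = 0; j0 < ncols; j0 += 8) {   // one 8-column panel at a time
        for (int j = j0; j < j0 + 8; ++j) {   // columns within the panel
            for (int r = 0; r < 8; ++r) {     // interleave the 8 rows
                *dst++ = src[r * ld + j];
            }
        }
    }
}
```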
@pl752 awesome, thanks for the update. Yeah, there are lots of things to optimize; I am sure you know CPUs better than me :) Also recently added a community-benchmark section if you are curious how it runs on other hardware.
…Eng#21 Added Metal shader implementations:
- quantize_turbo3_0 / quantize_turbo4_0 (per-block quantization)
- dequantize_turbo3_0 / dequantize_turbo4_0 (type4x4 and type4 variants)
- kernel_set_rows_turbo template (128-element block size)
- Flash attention instantiations for all dk/dv variants

Added TURBO3_0/TURBO4_0 to Metal device SET_ROWS validation. Builds successfully. Testing with Qwen 3.5 35B-A3B MoE on M5 Max.

Note: Initial version uses simplified quantization (no rotation matrix) for Metal compatibility. Full rotation requires a custom kernel with extra buffer bindings; tracked for follow-up.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-Eng#21 Embedded pre-computed 128×128 rotation and QJL matrices (256KB constant memory) directly in the Metal shader. Both quantize and dequantize now perform the full TurboQuant algorithm:
- Quantize: normalize → rotate → codebook → inverse rotate → residual → QJL
- Dequantize: codebook → inverse rotate → QJL correction → rescale

Previous version (no rotation) produced garbage. This should produce meaningful output, since the rotation Gaussianizes the KV distribution.

Note: dequantize does a full 128-element rotation per chunk (8× work). Optimization is possible with caching or a restructured kernel in a follow-up.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
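A schematic C sketch of the dequantize path as described in the commit message (codebook → inverse rotate → QJL correction → rescale). Every identifier here is a placeholder rather than the real kernel's code, and the QJL correction step is elided because its inputs are not described above.

```c
#include <stdint.h>

#define TQ_D 128  // block size from the commit messages

// Placeholder dequantize sketch; not the actual Metal kernel.
static void dequantize_turbo_block(const uint8_t codes[TQ_D], float scale,
                                   const float rot_inv[TQ_D][TQ_D],
                                   const float *codebook,
                                   float out[TQ_D]) {
    float decoded[TQ_D];
    // 1. codebook lookup
    for (int i = 0; i < TQ_D; ++i) decoded[i] = codebook[codes[i]];
    // 2. inverse 128x128 rotation; doing this per chunk is the "8x work"
    //    the commit message flags as a follow-up optimization
    for (int i = 0; i < TQ_D; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < TQ_D; ++j) acc += rot_inv[i][j] * decoded[j];
        out[i] = acc;
    }
    // 3. QJL residual correction would go here (omitted)
    // 4. rescale by the per-block scale
    for (int i = 0; i < TQ_D; ++i) out[i] *= scale;
}
```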
…ismML-Eng#21
- Inlined turbo-matrices.h directly into ggml-metal.metal (256KB) to fix a JIT compilation failure with #include
- Added a C round-trip test (test-turbo-quant.c): turbo3 cosine=0.906, turbo4 cosine=0.966; matches the Python prototype
- Metal library loads successfully ("loaded in 5.9 sec")
- Model runs on Metal but output quality needs debugging (Metal quantize/dequantize may have a bug vs the working C version)

The C round-trip test proves the algorithm works in C. The Metal shader needs debugging; likely an issue with the dequantize chunk addressing or the large constant arrays in thread-local memory.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
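The cosine metric behind that round-trip check is straightforward; a minimal sketch (the quantize/dequantize calls it would wrap are placeholders for whatever test-turbo-quant.c actually invokes):

```c
#include <math.h>

// Cosine similarity between an original block and its round-tripped
// reconstruction; the commit above reports ~0.906 for turbo3 and
// ~0.966 for turbo4 with this kind of measurement.
static float cosine_sim(const float *a, const float *b, int n) {
    double ab = 0.0, aa = 0.0, bb = 0.0;
    for (int i = 0; i < n; ++i) {
        ab += (double)a[i] * b[i];
        aa += (double)a[i] * a[i];
        bb += (double)b[i] * b[i];
    }
    return (float)(ab / (sqrt(aa * bb) + 1e-12));
}
```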
Pretty much a direct continuation of #10.
A vibe-coded prototype (proof of concept only, needs refining) of nrows = 2 branches for x86 SIMD.
It yields significant PP improvements, as it allows better utilization of memory bandwidth (hot y operand, high compute density).
I also think ARM NEON is worth expanding with nrows = 2.
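To show where the bandwidth win comes from, here is a minimal scalar sketch of the nrows = 2 idea (simplified, hypothetical signature; the actual branches are per-ISA SIMD kernels): every load of the hot y operand is shared by two x rows, so y-side memory traffic per multiply-add is roughly halved.

```c
#include <stddef.h>

// Scalar sketch of an nrc == 2 dot product (illustrative only): two x
// rows, bx bytes apart, share every y load; the two results land bs
// bytes apart in s.
static void vec_dot_f32_nrc2(int n, float *s, size_t bs,
                             const float *x, size_t bx, const float *y) {
    const float *x0 = x;
    const float *x1 = (const float *)((const char *)x + bx);
    float sum0 = 0.0f, sum1 = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float yi = y[i];  // loaded once, used by both rows
        sum0 += x0[i] * yi;
        sum1 += x1[i] * yi;
    }
    s[0] = sum0;
    *(float *)((char *)s + bs) = sum1;
}
```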
[Benchmark table: pp512 and tg128 results for SSSE3, AVX, AVX+F16C, AVX2+FMA, and AVX512BW; data not recoverable]

Also, for some reason the AVX-512 opts consistently hurt PP performance for nrows = 2, and results are sometimes inconsistent.
The code for these branches is enormous and most likely suboptimal, so suggestions are welcome; register spills occur, of course.
The funny part is that I have tried iterating on the AVX2 prototype, but haven't managed to achieve any improvements.
I have also tried altering the tile geometry to use rectangular blocks, due to the significant operand size asymmetry, as was attempted in #4 by @Marxist-Leninist; this yields some changes, but is inconclusive.
[Benchmark table: pp512 and tg128 results per blck_0 value; data not recoverable]