(Prototype) q1_0 nrc = 2 and diabolic tiles branches #21
pl752 wants to merge 9 commits into PrismML-Eng:prism
Conversation
Hi @pl752, impressive. Initially the tps gain is not visible, since my test prompt does not have a long context. Have you considered a 4x4 dot kernel?
@zcattacz I was experimenting with repack and other shapes of the 2x2 dot for AVX2, but not very successfully; a 4x4 dot is likely my next target.
Most likely I will implement 4x1, 8x1, 4x2, or other kernel shapes outside the default mul_mat.
Some points from the AI's analysis that seem to make sense:
Also, I have found the reason for the lower-than-usual results: I had somehow missed that the later tests were run with …
Also cooking …
I have completed trying various tile shapes (the final forms used are 1x1, 2x2, 2x1, 4x2, and 4x4); large tiles are only used where it is reasonable given register counts and memory bandwidth limitations. The resulting code is pretty cursed/diabolic (of course it is vibe coded and won't go anywhere near mainline), but it seems to more or less max out my CPU, barring other significant refinements. Results since nrc=2 are as follows (SSSE3 was not affected code-wise); most benefits are from AVX-512 (4x4) in pp and AVX2/AVX-512 (2x1) in tg:
I have tried other (larger or wider/longer) shapes and didn't obtain notable improvements.
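For illustration, here is a minimal sketch of what a 2x2 tile inner loop could look like with AVX2 FMA. This is an assumed shape based on the description above, not the PR's actual kernel; the register-budget point is that 4 accumulators plus 4 operand loads stay well under the 16 ymm registers, while each load feeds two FMAs.

```c
#include <immintrin.h>

// Horizontal sum of an 8-float AVX2 register.
static float hsum256(__m256 v) {
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(v), _mm256_extractf128_ps(v, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

// Illustrative 2x2 tile: two x rows against two y columns. Each of the
// 4 loads per iteration feeds two FMAs, doubling compute density per
// byte loaded. Assumes n is a multiple of 8; tail handling omitted.
static void dot_tile_2x2_f32(int n, const float *x0, const float *x1,
                             const float *y0, const float *y1, float out[4]) {
    __m256 a00 = _mm256_setzero_ps(), a01 = _mm256_setzero_ps();
    __m256 a10 = _mm256_setzero_ps(), a11 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 vx0 = _mm256_loadu_ps(x0 + i), vx1 = _mm256_loadu_ps(x1 + i);
        __m256 vy0 = _mm256_loadu_ps(y0 + i), vy1 = _mm256_loadu_ps(y1 + i);
        a00 = _mm256_fmadd_ps(vx0, vy0, a00);
        a01 = _mm256_fmadd_ps(vx0, vy1, a01);
        a10 = _mm256_fmadd_ps(vx1, vy0, a10);
        a11 = _mm256_fmadd_ps(vx1, vy1, a11);
    }
    out[0] = hsum256(a00); out[1] = hsum256(a01);
    out[2] = hsum256(a10); out[3] = hsum256(a11);
}
```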
@pl752 Also going to point this PR at the prism branch; just cleaned up the branch from fresh llama.cpp master and applied your pending PR #86 and our CUDA PR on top. There might be some merge conflicts. This is a lot of code, but glad to see it's 17% better on some CPUs. Will keep this PR open until we figure out the initial x86 PR.
@khosravipasha For now I am looking for a better solution than the current additional kernels injected into the regular loop; I think I will stick with nrc==2 support and add repack with 8x8 (might be changed) panel gemm and gemv kernels. I have already found a way to boost tg further (gemv), but I am still a little bit stuck with the gemm implementation. Plus, I won't be as active because I have a small problem to sort out at my real job.
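To make the repack idea concrete, here is a hypothetical f32 sketch of an 8x8 panel layout. The 8x8 size is taken from the comment above; the function name, layout, and element type are illustrative only, and the actual kernels would operate on quantized blocks.

```c
// Hypothetical 8x8 repack sketch (illustrative, not the PR's layout):
// src is row-major with leading dimension ld; dst receives 8-row panels
// where each group of 8 floats holds one column slice of the 8 rows, so
// the gemm microkernel streams one contiguous panel instead of striding
// across eight separate row pointers.
static void repack_panel_8x8_f32(const float *src, int ld,
                                 float *dst, int ncols /* multiple of 8 */) {
    for (int j0 = 0; j0 < ncols; j0 += 8) {   // one 8-column panel at a time
        for (int j = j0; j < j0 + 8; ++j) {   // columns within the panel
            for (int r = 0; r < 8; ++r) {     // interleave the 8 rows
                *dst++ = src[r * ld + j];
            }
        }
    }
}
```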
@pl752 awesome, thanks for the update. Yeah, there are lots of things to optimize; I am sure you know CPUs better than me :) Also recently added a community-benchmark section if you are curious how it runs on other hardware.
…Eng#21 Added Metal shader implementations:
- quantize_turbo3_0 / quantize_turbo4_0 (per-block quantization)
- dequantize_turbo3_0 / dequantize_turbo4_0 (type4x4 and type4 variants)
- kernel_set_rows_turbo template (128-element block size)
- Flash attention instantiations for all dk/dv variants

Added TURBO3_0/TURBO4_0 to Metal device SET_ROWS validation. Builds successfully. Testing with Qwen 3.5 35B-A3B MoE on M5 Max.

Note: Initial version uses simplified quantization (no rotation matrix) for Metal compatibility. Full rotation requires a custom kernel with extra buffer bindings; tracked for follow-up.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-Eng#21 Embedded pre-computed 128×128 rotation and QJL matrices (256KB constant memory) directly in the Metal shader. Both quantize and dequantize now perform the full TurboQuant algorithm:
- Quantize: normalize → rotate → codebook → inverse rotate → residual → QJL
- Dequantize: codebook → inverse rotate → QJL correction → rescale

Previous version (no rotation) produced garbage. This should produce meaningful output, since the rotation Gaussianizes the KV distribution.

Note: dequantize does a full 128-element rotation per chunk (8× work). Optimization is possible with caching or a restructured kernel in a follow-up.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
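A schematic C sketch of the dequantize path as described in the commit message (codebook → inverse rotate → QJL correction → rescale). Every identifier here is a placeholder rather than the real kernel's code, and the QJL correction step is elided because its inputs are not described above.

```c
#include <stdint.h>

#define TQ_D 128  // block size from the commit messages

// Placeholder dequantize sketch; not the actual Metal kernel.
static void dequantize_turbo_block(const uint8_t codes[TQ_D], float scale,
                                   const float rot_inv[TQ_D][TQ_D],
                                   const float *codebook,
                                   float out[TQ_D]) {
    float decoded[TQ_D];
    // 1. codebook lookup
    for (int i = 0; i < TQ_D; ++i) decoded[i] = codebook[codes[i]];
    // 2. inverse 128x128 rotation; doing this per chunk is the "8x work"
    //    the commit message flags as a follow-up optimization
    for (int i = 0; i < TQ_D; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < TQ_D; ++j) acc += rot_inv[i][j] * decoded[j];
        out[i] = acc;
    }
    // 3. QJL residual correction would go here (omitted)
    // 4. rescale by the per-block scale
    for (int i = 0; i < TQ_D; ++i) out[i] *= scale;
}
```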
…ismML-Eng#21
- Inlined turbo-matrices.h directly into ggml-metal.metal (256KB) to fix a JIT compilation failure with #include
- Added a C round-trip test (test-turbo-quant.c): turbo3 cosine=0.906, turbo4 cosine=0.966; matches the Python prototype
- Metal library loads successfully ("loaded in 5.9 sec")
- Model runs on Metal but output quality needs debugging (Metal quantize/dequantize may have a bug vs the working C version)

The C round-trip test proves the algorithm works in C. The Metal shader needs debugging; likely an issue with the dequantize chunk addressing or the large constant arrays in thread-local memory.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
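The cosine metric behind that round-trip check is straightforward; a minimal sketch (the quantize/dequantize calls it would wrap are placeholders for whatever test-turbo-quant.c actually invokes):

```c
#include <math.h>

// Cosine similarity between an original block and its round-tripped
// reconstruction; the commit above reports ~0.906 for turbo3 and
// ~0.966 for turbo4 with this kind of measurement.
static float cosine_sim(const float *a, const float *b, int n) {
    double ab = 0.0, aa = 0.0, bb = 0.0;
    for (int i = 0; i < n; ++i) {
        ab += (double)a[i] * b[i];
        aa += (double)a[i] * a[i];
        bb += (double)b[i] * b[i];
    }
    return (float)(ab / (sqrt(aa * bb) + 1e-12));
}
```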
Pretty much a direct continuation of #10.
A vibe-coded prototype (proof of concept only, needs refining) of nrows = 2 branches for x86 SIMD.
It yields significant PP improvements, as it allows better utilization of memory bandwidth (hot y operand, high compute density).
I also think ARM NEON is worth expanding with nrows = 2.
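To show where the bandwidth win comes from, here is a minimal scalar sketch of the nrows = 2 idea (simplified, hypothetical signature; the actual branches are per-ISA SIMD kernels): every load of the hot y operand is shared by two x rows, so y-side memory traffic per multiply-add is roughly halved.

```c
#include <stddef.h>

// Scalar sketch of an nrc == 2 dot product (illustrative only): two x
// rows, bx bytes apart, share every y load; the two results land bs
// bytes apart in s.
static void vec_dot_f32_nrc2(int n, float *s, size_t bs,
                             const float *x, size_t bx, const float *y) {
    const float *x0 = x;
    const float *x1 = (const float *)((const char *)x + bx);
    float sum0 = 0.0f, sum1 = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float yi = y[i];  // loaded once, used by both rows
        sum0 += x0[i] * yi;
        sum1 += x1[i] * yi;
    }
    s[0] = sum0;
    *(float *)((char *)s + bs) = sum1;
}
```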
[Benchmark table: pp512 and tg128 results for SSSE3, AVX, AVX+F16C, AVX2+FMA, and AVX512BW; data not recoverable]

Also, for some reason the AVX-512 opts consistently hurt PP performance for nrows = 2, and results are sometimes inconsistent.
The code for these branches is enormous and most likely suboptimal, so suggestions are welcome; register spills occur, of course.
The funny part is that I have tried iterating on the AVX2 prototype, but haven't managed to achieve any improvements.
I have also tried altering the tile geometry to use rectangular blocks, due to the significant operand size asymmetry, as was attempted in #4 by @Marxist-Leninist; this yields some changes, but is inconclusive.
[Benchmark table: pp512 and tg128 results per blck_0 value; data not recoverable]