Q2k interleaving implementation - x86/x64 SIMD by Srihari-mcw · Pull Request #14373 · ggml-org/llama.cpp

Srihari-mcw · 2025-06-25T12:52:00Z

This PR adds a block-interleaving approach for Q2_K quantization on x86/x64 with AVX2 and AVX512 SIMD.
The GEMM function has AVX512 and AVX2 versions, whereas GEMV is implemented with AVX2 intrinsics.
The existing quantize_q8_K_4x8 function quantizes the float values to the block_q8_Kx4 format.
The repack_q2_K_to_q2_K_8_bl function rearranges the weights from the Q2_K format into the Q2_Kx8 format (block_q2_Kx8).
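For readers unfamiliar with Q2_K, the following is a scalar reference sketch of the per-super-block dot product that GEMV/GEMM kernels for this format have to reproduce. It is not the SIMD code from this PR. The struct name `Q2KBlockF32`, the function name `dot_q2k_block`, and the use of `float` for `d`/`dmin` (ggml stores them as fp16) are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdint>

constexpr int QK_K = 256;  // values per Q2_K super-block

// Hypothetical stand-in for a Q2_K super-block; d/dmin are float here, not fp16.
struct Q2KBlockF32 {
    uint8_t scales[QK_K / 16];  // per sub-block: low nibble = scale, high nibble = min
    uint8_t qs[QK_K / 4];       // 2-bit quants, four per byte
    float   d;                  // super-block scale applied to the 4-bit scales
    float   dmin;               // super-block scale applied to the 4-bit mins
};

// Dot product of one super-block with 256 int8 activations that share the scale act_d.
// Each weight dequantizes to d * scale * q - dmin * min, so the sum splits into two
// integer accumulators that are scaled once at the end.
float dot_q2k_block(const Q2KBlockF32& w, const int8_t* a, float act_d) {
    int32_t sum_scaled = 0;  // sum over sub-blocks of scale * sum(q * a)
    int32_t sum_mins   = 0;  // sum over sub-blocks of min * sum(a)
    int sub = 0;
    const uint8_t* q = w.qs;
    for (int half = 0; half < QK_K; half += 128) {      // two halves of 128 values
        for (int shift = 0; shift < 8; shift += 2) {    // four 2-bit planes per byte
            for (int part = 0; part < 2; ++part) {      // two 16-value sub-blocks per plane
                const uint8_t sc = w.scales[sub++];
                int32_t qa = 0, asum = 0;
                for (int l = 0; l < 16; ++l) {
                    const int quant = (q[16 * part + l] >> shift) & 3;
                    qa   += quant * a[l];
                    asum += a[l];
                }
                sum_scaled += (sc & 0xF) * qa;
                sum_mins   += (sc >> 4) * asum;
                a += 16;
            }
        }
        q += 32;  // each half consumes 32 quant bytes
    }
    return act_d * (w.d * sum_scaled - w.dmin * sum_mins);
}
```

Keeping the two accumulators in integers until the final multiply is what makes this computation a natural fit for wide integer SIMD.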

Block Interleaving Formats

Block_Q2_Kx8 :

Holds the data of 8 Q2_K blocks in interleaved fashion.
uint8 scales[128] - Scales and mins taken from the source Q2_K blocks. Each group of 16 bytes is packed so that it contains the scales and mins of the corresponding sub-blocks of the source Q2_K blocks. The original Q2_K structure has 16 sub-blocks.
The d and dmin values from the source Q2_K blocks are stored together in an array.
Quant values from the source Q2_K blocks are extracted sequentially and interleaved in groups of eight bytes.
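The layout described above can be sketched as follows. This is a minimal illustration, not the PR's repack code: the struct and function names are hypothetical, fp16 fields are kept as raw 16-bit values, and the exact ordering inside each 16-byte scale group (two consecutive sub-blocks from each of the 8 source blocks) is an assumption consistent with the description.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int QK_K = 256;  // values per Q2_K super-block

// Source layout: one Q2_K super-block (fp16 fields kept as raw bits).
struct BlockQ2K {
    uint8_t  scales[QK_K / 16];  // 16 sub-blocks, 4-bit scale + 4-bit min each
    uint8_t  qs[QK_K / 4];       // 2-bit quants
    uint16_t d;
    uint16_t dmin;
};

// Interleaved layout holding 8 source super-blocks.
struct BlockQ2Kx8 {
    uint16_t d[8];
    uint16_t dmin[8];
    uint8_t  scales[128];
    uint8_t  qs[512];
};

static_assert(sizeof(BlockQ2Kx8) == 8 * sizeof(BlockQ2K), "repacking keeps the total size");

BlockQ2Kx8 interleave_q2k_x8(const std::array<BlockQ2K, 8>& in) {
    BlockQ2Kx8 out{};
    for (int b = 0; b < 8; ++b) {
        out.d[b]    = in[b].d;
        out.dmin[b] = in[b].dmin;
    }
    // Assumed scale layout: 16-byte group g holds sub-blocks 2g and 2g+1 of each block.
    for (int i = 0; i < 128; ++i) {
        const int block = (i % 16) / 2;
        const int sub   = (i / 16) * 2 + (i % 2);
        out.scales[i] = in[block].scales[sub];
    }
    // Quants: take 8 bytes from block 0, then 8 from block 1, ..., then the next 8 from block 0.
    constexpr int chunk = 8;
    for (int i = 0; i < 512 / chunk; ++i) {
        const int block   = i % 8;
        const int src_off = (i / 8) * chunk;
        std::memcpy(out.qs + i * chunk, in[block].qs + src_off, chunk);
    }
    return out;
}
```

With this arrangement, one contiguous 64-byte load covers the same 8-byte slice of all 8 source blocks, which is the kind of access pattern a 512-bit register can consume directly.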

Performance Impact :

Prompt processing (pp 512) gains of ~5.5% with the AVX2 version and ~25.5% with the AVX512 version over the base commit with GCC on Linux. Text generation (tg 128) is essentially unchanged.

GCC Linux :

Q2_K Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	84.64 ± 0.20		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	89.26 ± 0.21	5.45%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	106.27 ± 0.32	25.54%	ef03580 - AVX512 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.81 ± 0.02		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.80 ± 0.02	-0.03%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.64 ± 0.01	-0.46%	ef03580 - AVX512 Commit

GCC Version = 12.3
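The speedup column in these tables appears to be the relative change in tokens per second against the base commit. A minimal sketch of that calculation follows (the function name `speedup_percent` is hypothetical). Recomputing from the rounded t/s values shown in the table agrees with the listed percentages to within a few hundredths of a percent.

```cpp
#include <cassert>
#include <cmath>

// Relative change in throughput, in percent, of a new build versus the base build.
double speedup_percent(double base_tps, double new_tps) {
    return (new_tps / base_tps - 1.0) * 100.0;
}
```

For example, `speedup_percent(84.64, 89.26)` corresponds to the AVX2 pp 512 row above.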

Clang Linux:

Larger prompt processing gains with Clang on Linux: ~26.4% with the AVX2 version and ~53.9% with the AVX512 version over the base commit

Q2_K Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	92.33 ± 0.20		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	116.68 ± 0.40	26.37%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	pp 512	142.13 ± 0.63	53.93%	ef03580 - AVX512 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	38.26 ± 0.00		38de3fb - Base Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	38.11 ± 0.01	-0.38%	ef03580 - AVX2 Commit
phi3 3B Q2_K - Medium	1.32 GiB	3.82 B	CPU	6	tg 128	37.98 ± 0.01	-0.71%	ef03580 - AVX512 Commit

Clang Version = 20.1.0

Model tested: https://huggingface.co/bartowski/Phi-3-mini-4k-instruct-GGUF

The PR was tested on an AMD Ryzen 5 9600X, which supports the following flags by default:

CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Perplexity was also measured with the Q2_K model and found to be similar before and after the change.

The perplexity results are tabulated below:

model	perplexity (Final estimate PPL)	Commit id
phi3 3B Q2_K - Medium	9.5511 +/- 0.064212	38de3fb - Base Commit
phi3 3B Q2_K - Medium	9.5488 +/- 0.06419	ef03580 - Updated Commit

slaren · 2025-07-04T13:01:04Z

I tested this on a 13900k with gcc 13 and clang 19, but the improvement is not very significant. Repacking has a significant cost, since it increases load time and prevents usage of mmap, and as it is, I find this very hard to justify for AVX2. It may make sense for AVX512, but I cannot test that.

Details

GCC-13:

Model	Threads	Test	t/s master	t/s q2k_interleaving_implementation	Speedup
llama 7B Q2_K_M	8	pp64	47.56	48.55	1.02
llama 7B Q2_K_M	8	tg32	19.92	19.38	0.97
llama 7B Q2_K_M	16	pp64	63.04	60.08	0.95
llama 7B Q2_K_M	16	tg32	21.22	20.44	0.96
llama 7B Q2_K_M	24	pp64	68.39	68.07	1.00
llama 7B Q2_K_M	24	tg32	19.72	19.76	1.00
llama 7B Q2_K_M	32	pp64	71.18	71.62	1.01
llama 7B Q2_K_M	32	tg32	17.87	17.51	0.98

Clang-19:

Model	Threads	Test	t/s master	t/s q2k_interleaving_implementation	Speedup
llama 7B Q2_K_M	8	pp64	48.28	52.27	1.08
llama 7B Q2_K_M	8	tg32	20.78	19.08	0.92
llama 7B Q2_K_M	16	pp64	65.23	61.42	0.94
llama 7B Q2_K_M	16	tg32	20.94	19.79	0.95
llama 7B Q2_K_M	24	pp64	69.69	71.17	1.02
llama 7B Q2_K_M	24	tg32	19.90	19.59	0.98
llama 7B Q2_K_M	32	pp64	71.04	75.26	1.06
llama 7B Q2_K_M	32	tg32	16.91	17.30	1.02

Srihari-mcw · 2025-07-11T16:08:32Z

Hi @slaren ,
Thanks for the reply. Based on your feedback and further internal testing, we have now updated the patch to enable repacking only on machines with AVX512 support, so that the patch can be considered as an optimization for AVX512-based machines. We will continue investigating the AVX2 performance.
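A hedged sketch of what such gating could look like is given below. It is not the code from the patch: the names `cpu_has_avx512f`, `Q2KLayout` and `choose_q2k_layout` are hypothetical, and the requirement that the row count be a multiple of 8 is an assumption that follows from interleaving 8 blocks at a time. The detection helper uses the GCC/Clang builtin `__builtin_cpu_supports` and reports `false` on other compilers or architectures.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when the running CPU reports AVX-512 Foundation support.
inline bool cpu_has_avx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;  // unknown compiler or non-x86 target: stay on the plain layout
#endif
}

enum class Q2KLayout { Plain, Interleaved8 };

// Decide at load time whether a Q2_K tensor should be repacked.
// Assumption: repacking is only worthwhile with AVX-512 and needs rows in multiples of 8.
Q2KLayout choose_q2k_layout(bool has_avx512, int64_t n_rows) {
    if (has_avx512 && n_rows % 8 == 0) {
        return Q2KLayout::Interleaved8;
    }
    return Q2KLayout::Plain;  // keeps the original layout, so the file can stay mmap-ed
}
```

A caller would pass `cpu_has_avx512f()` as the first argument. Keeping the decision in a pure function makes it easy to test both outcomes on any machine.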

Thanks

Srihari-mcw · 2025-07-30T09:33:01Z

Hi @slaren , @ggerganov ,
With regard to the AVX512 changes, are there any other steps needed to get this PR merged? Kindly share your thoughts. Thanks

ggerganov · 2025-07-30T10:31:13Z

+            }
+            // Store the accumulated values
+            for (int i = 0; i < 16; i++) {
+


Deduplicate the generic GEMV and GEMM implementations following #14897.

After that, feel free to merge.

Srihari-mcw · 2025-08-01T05:21:43Z

Hi @slaren , @ggerganov ,

The code has been updated with de-duplication of generic code. Please let us know if the code is good for merging. Thanks

* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments

---------

Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>

github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 25, 2025

Srihari-mcw changed the title ~~Q2k interleaving implementation~~ Q2k interleaving implementation - x86/x64 SIMD Jun 25, 2025

Srihari-mcw force-pushed the q2k_interleaving_implementation branch 3 times, most recently from 39ab344 to c2c53bc Compare June 26, 2025 06:02

Srihari-mcw force-pushed the q2k_interleaving_implementation branch from 75dd04b to 3f6c61d Compare July 11, 2025 11:28

slaren approved these changes Jul 17, 2025

ggerganov approved these changes Jul 17, 2025

ggerganov reviewed Jul 30, 2025

Srihari-mcw and others added 9 commits July 30, 2025 22:43

Initial Q2_K Block Interleaving Implementation

4039c22

Addressed review comments and clean up of the code

39a7590

Post rebase fixes

91d216c

Initial CI/CD fixes

2926cfb

Update declarations in arch-fallback.h

7a5e23a

Changes for GEMV Q2_K in arch-fallback.h

7023709

Enable repacking only on AVX-512 machines

d45c9f0

Update comments in repack.cpp

d6ee6da

Address q2k comments

a1053fb

Srihari-mcw force-pushed the q2k_interleaving_implementation branch from 6c758bb to a1053fb Compare July 31, 2025 05:47

ggerganov merged commit baad948 into ggml-org:master Aug 1, 2025
47 checks passed

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 7, 2025

Revert "ggml : Q2k interleaving implementation - x86/x64 SIMD (ggml-o…

df3c79d

…rg#14373)" This reverts commit baad948.
