ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot #11917

Merged
ggerganov merged 5 commits into ggml-org:master from Vithulep:Q3_SVE_Kernel on Feb 20, 2025

@Vithulep commented Feb 17, 2025

This PR adds SVE (Scalable Vector Extension) kernels for the q3_K_q8_K vector dot product on the Arm architecture. Similar SVE support was proposed in PRs #7433 and #11227.

This PR contains the SVE implementation of the vector dot product used with Q3_K quantization.
Accuracy and performance were measured by running a Q3_K quantized model of mistral-7b-v01 on Graviton 3 (Perf 01 XL).
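
To illustrate the general idea of a vector-length-agnostic SVE dot product, here is a minimal sketch. It is not the kernel from this PR: it skips the Q3_K bit unpacking and per-block scales entirely and only shows an int8 dot product. The helper name `dot_i8` is hypothetical and is not a ggml function. The SVE path uses ACLE intrinsics and is compiled only when `__ARM_FEATURE_SVE` is defined; otherwise a scalar fallback is used.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper (not part of ggml): dot product of two int8 arrays. */
static int32_t dot_i8(const int8_t *a, const int8_t *b, int n) {
    assert(n >= 0);
#if defined(__ARM_FEATURE_SVE)
    /* One 32-bit accumulator lane per four int8 lanes. */
    svint32_t acc = svdup_n_s32(0);
    const int step = (int)svcntb(); /* int8 lanes per vector, known at run time */
    for (int i = 0; i < n; i += step) {
        /* Predicate is true for lanes i..n-1, so the tail needs no special case. */
        const svbool_t pg = svwhilelt_b8(i, n);
        /* Inactive lanes load as zero and therefore add nothing to the sum. */
        const svint8_t va = svld1_s8(pg, a + i);
        const svint8_t vb = svld1_s8(pg, b + i);
        acc = svdot_s32(acc, va, vb);
    }
    /* Horizontal add of all 32-bit lanes. */
    return (int32_t)svaddv_s32(svptrue_b32(), acc);
#else
    /* Scalar fallback for targets without SVE. */
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
#endif
}
```

Because the loop step comes from `svcntb()` rather than a fixed width, the same binary should work on SVE hardware with any vector length, which is the main difference from a fixed 128-bit NEON loop.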

Performance

With this PR, the SVE implementation is roughly 1.02x to 1.15x as fast as the NEON implementation.

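  • Decoding Throughput (TPOT)

| Threads | NEON (original) | This PR (SVE) | Ratio |
|---------|-----------------|---------------|-------|
| 2       | 4.21            | 4.86          | 1.15  |
| 4       | 8.26            | 9.37          | 1.13  |
| 8       | 15.90           | 17.49         | 1.10  |
| 16      | 29.09           | 31.05         | 1.07  |
| 32      | 42.59           | 43.80         | 1.03  |
| 48      | 48.36           | 49.41         | 1.02  |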
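
The Ratio column is the SVE throughput divided by the NEON throughput for the same thread count. A minimal sketch of that arithmetic follows; the function name `speedup_x100` is hypothetical, and it assumes the ratio is rounded to two decimals (returned here as an integer number of hundredths to avoid floating-point comparison issues).

```c
#include <assert.h>

/* Hypothetical helper: SVE/NEON throughput ratio, rounded to two decimals
 * and returned in hundredths (e.g. 115 means 1.15). */
static long speedup_x100(double neon, double sve) {
    assert(neon > 0.0 && sve >= 0.0);
    /* Adding 0.5 before truncation rounds to nearest for non-negative values. */
    return (long)(sve / neon * 100.0 + 0.5);
}
```

For example, calling `speedup_x100(4.21, 4.86)` evaluates the 2-thread row of the table.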

The command used to measure performance is:

./llama-bench  -m ${PATH_TO_MODEL} -n 0 -n 16 -p 64 -t 2,4,8,16,32,48

Perplexity

I also verified that perplexity matches between the NEON and SVE implementations.

| NEON (original)    | SVE (this PR)      |
|--------------------|--------------------|
| 2.9394 +/- 0.35779 | 2.9394 +/- 0.35779 |

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Feb 17, 2025
Member

@ggerganov left a comment

Improve the formatting of the code to be more consistent with the rest of the code. I've given a few hints below.

Comment thread ggml/src/ggml-cpu/ggml-cpu-quants.c Outdated
@Vithulep
Contributor Author

> Improve the formatting of the code to be more consistent with the rest of the code. I've given a few hints below.

Thank you. Improved the code formatting for consistency.

Member

@ggerganov left a comment

Haven't run any tests myself, so taking a small leap of faith here, assuming you've done all the necessary tests for this change.

Comment thread ggml/src/ggml-cpu/ggml-cpu-quants.c Outdated
Comment thread ggml/src/ggml-cpu/ggml-cpu-quants.c
@ggerganov ggerganov merged commit 4806498 into ggml-org:master Feb 20, 2025
@Vithulep
Contributor Author

> Haven't run any tests myself, so taking a small leap of faith here, assuming you've done all the necessary tests for this change.

Thank you! We've done all the necessary tests for this change.

orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
…rg#11917)

* Added SVE Implementation for Q3_K Kernel in ggml-cpu-quants.c file

* Improved Formating of code in  ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespaces

* style : ptr spaceing

---------

Co-authored-by: vithulep <p.m.vithule1517@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026