Conversation
prusnak
left a comment
One nitpick: the debug output contains:
llama_model_load_internal: ftype = 5 (unknown, may not work)
Fix:
diff --git a/llama.cpp b/llama.cpp
index ef8ee20..dd970f7 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -838,6 +838,7 @@ static const char *llama_ftype_name(enum llama_ftype ftype) {
         case LLAMA_FTYPE_MOSTLY_F16:  return "mostly F16";
         case LLAMA_FTYPE_MOSTLY_Q4_0: return "mostly Q4_0";
         case LLAMA_FTYPE_MOSTLY_Q4_1: return "mostly Q4_1";
+        case LLAMA_FTYPE_MOSTLY_Q4_2: return "mostly Q4_2";
         case LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16:
                                       return "mostly Q4_1, some F16";
         default: return "unknown, may not work";
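For context on why the log shows 5: llama_ftype_name() switches on the llama_ftype value stored in the model header, and Q4_2 takes the next free value. A rough sketch of how the enum lines up, as I read llama.h (treat the exact values as assumptions and check the header):

enum llama_ftype {
    LLAMA_FTYPE_ALL_F32              = 0,
    LLAMA_FTYPE_MOSTLY_F16           = 1,
    LLAMA_FTYPE_MOSTLY_Q4_0          = 2,
    LLAMA_FTYPE_MOSTLY_Q4_1          = 3,
    LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16
    LLAMA_FTYPE_MOSTLY_Q4_2          = 5, // the value printed in the log above
};

Without the added case, value 5 falls through to the default branch and prints "unknown, may not work".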
Benchmark on MacBook M1 16 GB: 7B q4_0: 75 ms/token
I guess this is with 4 threads?
This should probably use …

Otherwise, looking great! I get 1s/token :-(
Yes, 8 threads are around 2 times slower on M1 (which has 4 performance cores and 4 efficiency cores).

More tests on M1 with the speed-up in f30dbf9a8be4755f8d0bd9575ba5540ccf9335a9:

7B q4_0 4 threads: 75 ms/token

This is great!
- 4 threads: ~100 ms -> ~90 ms
- 8 threads: ~55 ms -> ~50 ms
Try again with ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32

I just found that …

I have a feeling that the next …
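For anyone not familiar with the intrinsics named in that commit message, here is a minimal standalone sketch of the pattern (an illustration, not the actual ggml kernel): vmlaq_n_f32 fuses a multiply-by-scalar with the accumulate, and vmulq_n_f32 multiplies a vector by a scalar, so the block scale does not have to be broadcast with an explicit vdupq_n_f32 first.

#include <arm_neon.h>

// Sketch only: scale a 4-lane partial sum by the block scale, the kind of
// step a q4_2 dot product needs after summing the quantized products.
static inline float32x4_t scale_mla(float32x4_t acc, float32x4_t v, float scale) {
    return vmlaq_n_f32(acc, v, scale);   // acc + v * scale, lane-wise
}

static inline float32x4_t scale_mul(float32x4_t v, float scale) {
    return vmulq_n_f32(v, scale);        // v * scale, lane-wise
}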
Post-merge tests (from master 77a7340): I see no significant change from earlier tests for 7B q4_2 4 threads:

7B q4_2 4 threads: 89 ms/token

But I see a small improvement for 7B q4_2 8 threads:

7B q4_2 8 threads: 180 -> 173 ms/token
* ggml : Q4_2 ARM
* ggml : add ggml_is_quantized()
* llama : update llama_type_name() with Q4_2 entry
* ggml : speed-up q4_2
  - 4 threads: ~100ms -> ~90ms
  - 8 threads: ~55ms -> ~50ms
* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
ref #959
This is a reimplementation of #1026, introducing the new quantization type Q4_2.

This PR implements only ARM NEON. The plan is to merge this soon and add the rest of the SIMD implementations.

For now, there is no need for SIMD quantize/dequantize - it will be added later when needed.
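To make the format concrete, here is a hypothetical sketch of what a Q4_2 block and a scalar reference dequantization could look like, along the lines of the existing Q4_0 layout but with 16 weights per block and an fp16 scale. The block size, field names, and the helper below are illustrative assumptions; the real definitions live in ggml.c / ggml.h.

#include <stdint.h>
#include "ggml.h"   // assumed here for ggml_fp16_t and ggml_fp16_to_fp32()

// Hypothetical layout: 16 weights per block, one fp16 scale,
// two 4-bit quants packed per byte -> 10 bytes per 16 weights.
#define QK4_2 16

typedef struct {
    ggml_fp16_t d;             // block scale
    uint8_t     qs[QK4_2 / 2]; // 4-bit quantized values, two per byte
} block_q4_2;

// Scalar reference dequantization of one block (no SIMD, matching the note
// above that SIMD quantize/dequantize is not needed yet).
static void dequantize_block_q4_2_ref(const block_q4_2 * x, float * y) {
    const float d = ggml_fp16_to_fp32(x->d);
    for (int i = 0; i < QK4_2 / 2; ++i) {
        const int v0 = (x->qs[i] & 0x0F) - 8; // low nibble, re-centered at 0
        const int v1 = (x->qs[i] >>   4) - 8; // high nibble
        y[2*i + 0] = v0 * d;
        y[2*i + 1] = v1 * d;
    }
}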