
ggml : alternative Q4_3 implementation using modified Q8_0 #1109

Merged
ggerganov merged 5 commits into master from q4_3b on Apr 22, 2023
Conversation

ggerganov (Member) commented Apr 21, 2023

This one looks promising: it does not change the Q4_3 format from master and only slightly modifies Q8_0 by adding low and high sums. The results should be identical, but the Q4_3 dot product now evaluates much faster:

#define QK8_0 32
typedef struct {
    float   d;          // delta
    float   s0;         // d * sum(qs[i]) low
    float   s1;         // d * sum(qs[i]) high
    int8_t  qs[QK8_0];  // quants
} block_q8_0;

llama_print_timings:      sample time =    47.11 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   482.44 ms /     8 tokens (   60.30 ms per token)
llama_print_timings:        eval time =  3419.36 ms /    63 runs   (   54.28 ms per run)
llama_print_timings:       total time =  3959.05 ms
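To illustrate why the precomputed low/high sums speed things up, here is a minimal scalar sketch of the Q4_3 × Q8_0 dot product. The block layouts are simplified assumptions for illustration (the real ggml.c uses fp16 scales and a different nibble packing): since a dequantized Q4_3 value is `d*q + m`, the min term factors out as `m * (y_d * sum(y_q))`, which is exactly the stored `s0`/`s1`, so the inner loop only multiplies quants.

```c
#include <stdint.h>
#include <stddef.h>

#define QK4_3 16
#define QK8_0 32

// Simplified, hypothetical block layouts (for illustration only).
typedef struct {
    float   d;            // delta
    float   m;            // min
    uint8_t qs[QK4_3/2];  // 4-bit quants, two per byte (assumed interleaved)
} block_q4_3;

typedef struct {
    float  d;             // delta
    float  s0;            // d * sum(qs[0..15])  -- precomputed low sum
    float  s1;            // d * sum(qs[16..31]) -- precomputed high sum
    int8_t qs[QK8_0];     // quants
} block_q8_0;

// Scalar sketch: one block_q8_0 covers two block_q4_3 (32 = 2 * 16),
// hence the separate low/high sums. Dequantized x = d*q + m, so
//   sum(x * y) = x_d * y_d * sum(q * y_q) + m * (y_d * sum(y_q)),
// where the second factor is the precomputed s0 or s1.
static float dot_q4_3_q8_0(size_t n, const block_q4_3 *x, const block_q8_0 *y) {
    float sumf = 0.0f;
    for (size_t i = 0; i < n / QK8_0; i++) {
        const float s[2] = { y[i].s0, y[i].s1 };
        for (int j = 0; j < 2; j++) {                 // low half, high half
            const block_q4_3 *b = &x[2*i + j];
            int sumi = 0;
            for (int k = 0; k < QK4_3/2; k++) {
                const int q0 = b->qs[k] & 0x0F;
                const int q1 = b->qs[k] >> 4;
                sumi += q0 * y[i].qs[j*QK4_3 + 2*k + 0]
                      + q1 * y[i].qs[j*QK4_3 + 2*k + 1];
            }
            // min term costs one multiply per sub-block instead of a loop
            sumf += b->d * y[i].d * (float) sumi + b->m * s[j];
        }
    }
    return sumf;
}
```

This is why the Q4_3 format itself did not need to change: only the quantizer for Q8_0 gained the two extra sums.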

I think this is the way to go. But let's first see the perplexity results from the Q4_3a approach in #1108.

@ggerganov ggerganov marked this pull request as ready for review April 21, 2023 20:14
ggerganov (Member Author)

Will fix the AVX2 implementation tomorrow and merge it.

Comment thread: ggml.c (Outdated)
Contributor


As mentioned in #1099, where I intend to fix this, the #if condition is wrong here, causing the code below to also run on AVX2 and essentially duplicating the work. Just something to keep in mind, or to fix, when measuring performance.
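The pitfall the contributor describes is generic to SIMD dispatch in C. This is a hedged sketch (not the actual ggml.c code, whose exact guard is not shown here): when the scalar fallback is guarded by a separate `#if` whose condition does not exclude the vectorized build, both paths execute and the work is done twice.

```c
// Hypothetical illustration of the guard pitfall; the function names and
// conditions are invented for this sketch.
static int dispatch_correct(void) {
    int paths_run = 0;
#if defined(__AVX2__)
    paths_run += 1;   // vectorized path (stub)
#else
    paths_run += 1;   // scalar fallback (stub) -- mutually exclusive via #else
#endif
    return paths_run; // always exactly 1
}

static int dispatch_buggy(void) {
    int paths_run = 0;
#if defined(__AVX2__)
    paths_run += 1;   // vectorized path (stub)
#endif
#if !defined(__ARM_NEON)
    // Wrong guard: on an x86 AVX2 build this is also true, so the
    // "fallback" runs in addition to the AVX2 path above.
    paths_run += 1;
#endif
    return paths_run; // 2 on an AVX2 build without NEON
}
```

The fix is to keep all architecture paths in one `#if` / `#elif` / `#else` chain so exactly one branch survives preprocessing.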

@ggerganov ggerganov merged commit 955ef9a into master Apr 22, 2023
@ggerganov ggerganov deleted the q4_3b branch April 22, 2023 07:55
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…1109)

* ggml : prefer vzip to vuzp

This way we always use the same type of instruction across all quantizations

* ggml : alternative Q4_3 implementation using modified Q8_0

* ggml : fix Q4_3 scalar implementation

* ggml : slight improvement of Q4_3 - no need for loop unrolling

* ggml : fix AVX paths for Q8_0 quantization
