
ggml : alternative Q4_3 implementation using modified Q8_0 #1109

Merged
ggerganov merged 5 commits into master from q4_3b on Apr 22, 2023
Conversation

ggerganov (Member) commented Apr 21, 2023

This one looks promising: it does not change the Q4_3 format from master and only slightly modifies Q8_0 by adding low and high sums. The results should be identical, but the Q4_3 dot product now evaluates much faster:

#define QK8_0 32
typedef struct {
    float   d;          // delta
    float   s0;         // d * sum(qs[i]) low
    float   s1;         // d * sum(qs[i]) high
    int8_t  qs[QK8_0];  // quants
} block_q8_0;

llama_print_timings:      sample time =    47.11 ms /    64 runs   (    0.74 ms per run)
llama_print_timings: prompt eval time =   482.44 ms /     8 tokens (   60.30 ms per token)
llama_print_timings:        eval time =  3419.36 ms /    63 runs   (   54.28 ms per run)
llama_print_timings:       total time =  3959.05 ms
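To illustrate why the precomputed low/high sums speed things up, here is a minimal scalar sketch of the Q4_3 × Q8_0 dot product. The block layouts are simplified assumptions for illustration (the real ggml.c uses fp16 scales and a different nibble packing): since a dequantized Q4_3 value is `d*q + m`, the min term factors out as `m * (y_d * sum(y_q))`, which is exactly the stored `s0`/`s1`, so the inner loop only multiplies quants.

```c
#include <stdint.h>
#include <stddef.h>

#define QK4_3 16
#define QK8_0 32

// Simplified, hypothetical block layouts (for illustration only).
typedef struct {
    float   d;            // delta
    float   m;            // min
    uint8_t qs[QK4_3/2];  // 4-bit quants, two per byte (assumed interleaved)
} block_q4_3;

typedef struct {
    float  d;             // delta
    float  s0;            // d * sum(qs[0..15])  -- precomputed low sum
    float  s1;            // d * sum(qs[16..31]) -- precomputed high sum
    int8_t qs[QK8_0];     // quants
} block_q8_0;

// Scalar sketch: one block_q8_0 covers two block_q4_3 (32 = 2 * 16),
// hence the separate low/high sums. Dequantized x = d*q + m, so
//   sum(x * y) = x_d * y_d * sum(q * y_q) + m * (y_d * sum(y_q)),
// where the second factor is the precomputed s0 or s1.
static float dot_q4_3_q8_0(size_t n, const block_q4_3 *x, const block_q8_0 *y) {
    float sumf = 0.0f;
    for (size_t i = 0; i < n / QK8_0; i++) {
        const float s[2] = { y[i].s0, y[i].s1 };
        for (int j = 0; j < 2; j++) {                 // low half, high half
            const block_q4_3 *b = &x[2*i + j];
            int sumi = 0;
            for (int k = 0; k < QK4_3/2; k++) {
                const int q0 = b->qs[k] & 0x0F;
                const int q1 = b->qs[k] >> 4;
                sumi += q0 * y[i].qs[j*QK4_3 + 2*k + 0]
                      + q1 * y[i].qs[j*QK4_3 + 2*k + 1];
            }
            // min term costs one multiply per sub-block instead of a loop
            sumf += b->d * y[i].d * (float) sumi + b->m * s[j];
        }
    }
    return sumf;
}
```

This is why the Q4_3 format itself did not need to change: only the quantizer for Q8_0 gained the two extra sums.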

I think this is the way to go. But let's first see the perplexity results from the Q4_3a approach in #1108.

@ggerganov ggerganov marked this pull request as ready for review April 21, 2023 20:14
ggerganov (Member Author)

Will fix the AVX2 implementation tomorrow and merge it.

Comment thread: ggml.c (Outdated)
Contributor


As mentioned in #1099, where I intend to fix this, the #if condition is wrong here, causing the code below to also run on AVX2 and essentially duplicating the work. Just something to keep in mind, or to fix, when measuring performance.
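The pitfall the contributor describes is generic to SIMD dispatch in C. This is a hedged sketch (not the actual ggml.c code, whose exact guard is not shown here): when the scalar fallback is guarded by a separate `#if` whose condition does not exclude the vectorized build, both paths execute and the work is done twice.

```c
// Hypothetical illustration of the guard pitfall; the function names and
// conditions are invented for this sketch.
static int dispatch_correct(void) {
    int paths_run = 0;
#if defined(__AVX2__)
    paths_run += 1;   // vectorized path (stub)
#else
    paths_run += 1;   // scalar fallback (stub) -- mutually exclusive via #else
#endif
    return paths_run; // always exactly 1
}

static int dispatch_buggy(void) {
    int paths_run = 0;
#if defined(__AVX2__)
    paths_run += 1;   // vectorized path (stub)
#endif
#if !defined(__ARM_NEON)
    // Wrong guard: on an x86 AVX2 build this is also true, so the
    // "fallback" runs in addition to the AVX2 path above.
    paths_run += 1;
#endif
    return paths_run; // 2 on an AVX2 build without NEON
}
```

The fix is to keep all architecture paths in one `#if` / `#elif` / `#else` chain so exactly one branch survives preprocessing.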

@ggerganov ggerganov merged commit 955ef9a into master Apr 22, 2023
@ggerganov ggerganov deleted the q4_3b branch April 22, 2023 07:55
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…1109)

* ggml : prefer vzip to vuzp

This way we always use the same type of instruction across all quantizations

* ggml : alternative Q4_3 implementation using modified Q8_0

* ggml : fix Q4_3 scalar implementation

* ggml : slight improvement of Q4_3 - no need for loop unrolling

* ggml : fix AVX paths for Q8_0 quantization
