Conversation
prusnak
left a comment
One nitpick: the debug output contains:
llama_model_load_internal: ftype = 5 (unknown, may not work)
Fix:
diff --git a/llama.cpp b/llama.cpp
index ef8ee20..dd970f7 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -838,6 +838,7 @@ static const char *llama_ftype_name(enum llama_ftype ftype) {
         case LLAMA_FTYPE_MOSTLY_F16:  return "mostly F16";
         case LLAMA_FTYPE_MOSTLY_Q4_0: return "mostly Q4_0";
         case LLAMA_FTYPE_MOSTLY_Q4_1: return "mostly Q4_1";
+        case LLAMA_FTYPE_MOSTLY_Q4_2: return "mostly Q4_2";
         case LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16:
                                       return "mostly Q4_1, some F16";
         default: return "unknown, may not work";
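For context on why the log shows 5: llama_ftype_name() switches on the llama_ftype value stored in the model header, and Q4_2 takes the next free value. A rough sketch of how the enum lines up, as I read llama.h (treat the exact values as assumptions and check the header):

enum llama_ftype {
    LLAMA_FTYPE_ALL_F32              = 0,
    LLAMA_FTYPE_MOSTLY_F16           = 1,
    LLAMA_FTYPE_MOSTLY_Q4_0          = 2,
    LLAMA_FTYPE_MOSTLY_Q4_1          = 3,
    LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16 = 4, // tok_embeddings.weight and output.weight are F16
    LLAMA_FTYPE_MOSTLY_Q4_2          = 5, // the value printed in the log above
};

Without the added case, value 5 falls through to the default branch and prints "unknown, may not work".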
Benchmark on MacBook M1 16 GB: 7B q4_0: 75 ms/token
I guess this is with 4 threads?
This should probably use …

Otherwise, looking great! I get 1s/token :-(
Yes, 8 threads are around 2 times slower on M1 (which has 4 performance cores and 4 efficiency cores).

More tests on M1 with the speed-up in f30dbf9a8be4755f8d0bd9575ba5540ccf9335a9:

7B q4_0 4 threads: 75 ms/token

This is great!
- 4 threads: ~100 ms -> ~90 ms
- 8 threads: ~55 ms -> ~50 ms
Try again with ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32

I just found that …

I have a feeling that the next …
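For anyone not familiar with the intrinsics named in that commit message, here is a minimal standalone sketch of the pattern (an illustration, not the actual ggml kernel): vmlaq_n_f32 fuses a multiply-by-scalar with the accumulate, and vmulq_n_f32 multiplies a vector by a scalar, so the block scale does not have to be broadcast with an explicit vdupq_n_f32 first.

#include <arm_neon.h>

// Sketch only: scale a 4-lane partial sum by the block scale, the kind of
// step a q4_2 dot product needs after summing the quantized products.
static inline float32x4_t scale_mla(float32x4_t acc, float32x4_t v, float scale) {
    return vmlaq_n_f32(acc, v, scale);   // acc + v * scale, lane-wise
}

static inline float32x4_t scale_mul(float32x4_t v, float scale) {
    return vmulq_n_f32(v, scale);        // v * scale, lane-wise
}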
Post-merge tests (from master 77a7340): I see no significant change from earlier tests for 7B q4_2 4 threads:

7B q4_2 4 threads: 89 ms/token

But I see a small improvement for 7B q4_2 8 threads:

7B q4_2 8 threads: 180 -> 173 ms/token
* ggml : Q4_2 ARM
* ggml : add ggml_is_quantized()
* llama : update llama_type_name() with Q4_2 entry
* ggml : speed-up q4_2
  - 4 threads: ~100ms -> ~90ms
  - 8 threads: ~55ms -> ~50ms
* ggml : optimize q4_2 using vmlaq_n_f32 + vmulq_n_f32
ref #959
This is a reimplementation of #1026, introducing the new quantization type Q4_2.

This PR implements only ARM NEON. The plan is to merge this soon and add the rest of the SIMD implementations.

For now, there is no need for SIMD quantize/dequantize - it will be added later when needed.
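To make the format concrete, here is a hypothetical sketch of what a Q4_2 block and a scalar reference dequantization could look like, along the lines of the existing Q4_0 layout but with 16 weights per block and an fp16 scale. The block size, field names, and the helper below are illustrative assumptions; the real definitions live in ggml.c / ggml.h.

#include <stdint.h>
#include "ggml.h"   // assumed here for ggml_fp16_t and ggml_fp16_to_fp32()

// Hypothetical layout: 16 weights per block, one fp16 scale,
// two 4-bit quants packed per byte -> 10 bytes per 16 weights.
#define QK4_2 16

typedef struct {
    ggml_fp16_t d;             // block scale
    uint8_t     qs[QK4_2 / 2]; // 4-bit quantized values, two per byte
} block_q4_2;

// Scalar reference dequantization of one block (no SIMD, matching the note
// above that SIMD quantize/dequantize is not needed yet).
static void dequantize_block_q4_2_ref(const block_q4_2 * x, float * y) {
    const float d = ggml_fp16_to_fp32(x->d);
    for (int i = 0; i < QK4_2 / 2; ++i) {
        const int v0 = (x->qs[i] & 0x0F) - 8; // low nibble, re-centered at 0
        const int v1 = (x->qs[i] >>   4) - 8; // high nibble
        y[2*i + 0] = v0 * d;
        y[2*i + 1] = v1 * d;
    }
}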