Conversation
```c
// wdata += row_size;
// }
// }
//}
```
Here I disable the Q8_0 quantization - we don't need it
I tested this a while ago because I was also very surprised that quantizing first is faster, but it is indeed faster. I even tried an AVX2 implementation of `ggml_vec_dot_q4_0_f32`:

```c
static void ggml_vec_dot_q4_0_f32(const int n, float * restrict s, const void * restrict vx, const float * restrict y) {
    const int nb = n / QK;

    assert(n % QK == 0);
    assert(nb % 2 == 0);

    const block_q4_0 * restrict x = (const block_q4_0 *)vx;

    __m256 acc = _mm256_setzero_ps();

    // Main loop
    for (int i = 0; i < nb; ++i) {
        const __m256 d_v = _mm256_broadcast_ss(&x[i].d);

        // Load 32x 4-bit integers into 32x 8-bit integers
        __m256i vx8 = bytesFromNibbles(x[i].qs);

        // Subtract 8 from the integers
        vx8 = _mm256_sub_epi8(vx8, _mm256_set1_epi8(8));

        // Convert to 16-bit int
        const __m256i vx16_lo = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 0));
        const __m256i vx16_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 1));

        // Convert to 32-bit int -> float 32
        const __m256 vf[4] = {
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_lo, 0))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_lo, 1))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_hi, 0))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_hi, 1)))
        };

        // Scale and fma
        const float * yj = y + i*32;
        for (int j = 0; j < 4; j++) {
            const __m256 jx = _mm256_mul_ps(vf[j], d_v);
            const __m256 jy = _mm256_loadu_ps(yj + j*8);
            acc = _mm256_fmadd_ps(jx, jy, acc);
        }
    }

    // Return horizontal sum of the acc vector
    __m128 res = _mm256_extractf128_ps( acc, 1 );
    res = _mm_add_ps( res, _mm256_castps256_ps128( acc ) );
    res = _mm_add_ps( res, _mm_movehl_ps( res, res ) );
    res = _mm_add_ss( res, _mm_movehdup_ps( res ) );

    *s = _mm_cvtss_f32( res );
}
```
The thing that I did not realize is that the number of values being quantized is many times smaller than the number of values used in the dot products (512X or more). This means that my simple test code from #1041 does not adequately measure the actual situation, where the time spent on quantization is small compared to the time spent on dot products between quantized values. If I only consider the dot products, then indeed the quantized version is faster. Sorry for wasting @ggerganov's time.
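To make the amortization concrete, here is a rough sketch with assumed (hypothetical) dimensions, not measurements from this PR: for one token, multiplying an `N x K` weight matrix by the activation vector quantizes the `K` activation values once, but then runs `N` dot products that each read `K` quantized values.

```c
#include <stdio.h>

int main(void) {
    // Hypothetical layer size, only to illustrate the ratio
    const long K = 4096;   // activation length = values quantized per token
    const long N = 4096;   // weight rows = number of dot products per token

    const long quantized = K;       // values passed through quantization
    const long dotted    = N * K;   // values consumed by the dot products

    printf("dot-product work / quantization work = %ld\n", dotted / quantized);
    // Prints 4096 here; even a much smaller N already gives the "512X or more" gap.
    return 0;
}
```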
Plugged @ikawrakow's idea from #1041
On master, I get ~51 ms / token. On this branch I get ~226 ms / token for the same run.
If I have to guess, at 8 threads the computation becomes memory bound and therefore, even though the `Q4_0 x F32` dot product is faster on its own, `Q4_0 x Q8_0` ends up being more performant because it reads less data from memory.