Adding a simple program to measure speed of dot products #1041
Merged
Conversation
Contributor

On my Core i3-8100 (AVX2): […]

Please consider putting it in `examples/benchmark` instead of creating a new folder.
Contributor
Author

@sw Thank you for the measurement. Yes, of course, I can move it to `examples/benchmark`.
ggerganov approved these changes on Apr 18, 2023
Contributor

Yes, […]
I was surprised by the belief that, for the dot product `x * y`, where `x` holds quantized model weights and `y` contains floating point values, it is faster to quantize `y` and to perform the dot product using the quantized `y` values (accepting the associated loss in precision) than to just directly compute `x * y`. So, I had to try it myself.

This PR adds a simple program that measures the time it takes to perform the dot product between vectors holding `1 << 18` values. I picked a relatively large vector size so as not to get involved with the science of accurately measuring elapsed time for short-lasting operations.

Basically, we fill two vectors `x` and `y` with random values and quantize `x` into `q`. We then measure the time for

1. `d = q * y` directly
2. `y' = quantize(y); d = q * y'`

For 2. we use the vectorized (SIMD-ified) functions from `ggml` (or, if requested by a command line argument, the corresponding scalar functions from `ggml`). A simplified sketch of both paths follows.
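To make the two paths concrete, here is a minimal self-contained sketch of what the benchmark measures. This is not the actual `vdot` code: `BlockQ4`, `quantize()`, `dot_direct()` and `dot_quantized()` are simplified stand-ins for `ggml`'s Q4_0 block and functions, and the block layout below is illustrative rather than `ggml`'s exact memory layout.

```cpp
// Minimal sketch of the two measured paths. Illustrative only: the block
// layout and helper names are simplified stand-ins, not ggml's exact
// Q4_0 struct or the functions used by the real vdot program.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

constexpr int kQK = 32; // values per quantization block, as in Q4_0

struct BlockQ4 {        // simplified Q4_0-style block
    float   d;          // scale
    uint8_t qs[kQK/2];  // 32 4-bit quants, two per byte
};

// Quantize a float vector into 4-bit blocks: q = round(x/d) + 8, d = max|x|/7.
static void quantize(const std::vector<float> & x, std::vector<BlockQ4> & out) {
    const int nb = (int)x.size() / kQK;
    out.resize(nb);
    for (int b = 0; b < nb; ++b) {
        const float * xb = x.data() + b*kQK;
        float amax = 0.f;
        for (int i = 0; i < kQK; ++i) amax = std::max(amax, std::fabs(xb[i]));
        const float d  = amax/7.f;
        const float id = d ? 1.f/d : 0.f;
        out[b].d = d;
        for (int i = 0; i < kQK; i += 2) {
            const uint8_t q0 = (uint8_t)(std::lround(xb[i+0]*id) + 8); // in [1,15]
            const uint8_t q1 = (uint8_t)(std::lround(xb[i+1]*id) + 8);
            out[b].qs[i/2] = q0 | (q1 << 4);
        }
    }
}

// Path 1: quantized x against float y directly -- no quantization of y.
static float dot_direct(const std::vector<BlockQ4> & q, const std::vector<float> & y) {
    float sum = 0.f;
    for (size_t b = 0; b < q.size(); ++b) {
        const float * yb = y.data() + b*kQK;
        float s = 0.f;
        for (int i = 0; i < kQK; i += 2) {
            const uint8_t v = q[b].qs[i/2];
            s += ((v & 0x0F) - 8)*yb[i+0] + ((v >> 4) - 8)*yb[i+1];
        }
        sum += q[b].d*s;
    }
    return sum;
}

// Path 2: both operands quantized -- integer products, two scales per block.
static float dot_quantized(const std::vector<BlockQ4> & q, const std::vector<BlockQ4> & p) {
    float sum = 0.f;
    for (size_t b = 0; b < q.size(); ++b) {
        int s = 0;
        for (int i = 0; i < kQK/2; ++i) {
            const uint8_t a = q[b].qs[i], c = p[b].qs[i];
            s += ((a & 0x0F) - 8)*((c & 0x0F) - 8) + ((a >> 4) - 8)*((c >> 4) - 8);
        }
        sum += q[b].d*p[b].d*s;
    }
    return sum;
}

int main() {
    const int n = 1 << 18; // same vector size as in the PR
    std::mt19937 rng(1234);
    std::uniform_real_distribution<float> dist(-1.f, 1.f);
    std::vector<float> x(n), y(n);
    for (auto & v : x) v = dist(rng);
    for (auto & v : y) v = dist(rng);

    std::vector<BlockQ4> q, p;
    quantize(x, q);

    using clk = std::chrono::steady_clock;
    auto t0 = clk::now();
    const float d1 = dot_direct(q, y);
    auto t1 = clk::now();
    quantize(y, p); // path 2 pays for quantizing y (and loses precision)
    const float d2 = dot_quantized(q, p);
    auto t2 = clk::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    // the real program repeats the measurement many times and averages
    printf("direct:    %.4f  %lld us\n", d1, (long long)us(t0, t1));
    printf("quantized: %.4f  %lld us\n", d2, (long long)us(t1, t2));
    return 0;
}
```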
On my Mac, 1. is faster than 2. (~55 us vs ~75 us). On the `x86_64` CPU that I have available (Ryzen 7950X), 1. is somewhat slower than the `AVX2` implementation of 2. (~50 us vs ~35 us).

On both CPUs the direct product 1., as implemented in the `dot()` function in this POC, is much faster than the scalar version of 2. from `ggml` (~15X faster on the Ryzen 7950X and ~6X faster on the Mac).

I think that with some `ARM_NEON` or `AVX2` magic one should be able to further speed up 1.
To use it, `make -j` and then e.g. `./vdot 100` to measure 100 dot products with the SIMD-ified `ggml` functions, or `./vdot 100 1` to measure the scalar `ggml` functions instead.

Added a comparison for `Q4_1` quantization. Here, the direct product 1. is faster than 2. for both `ARM_NEON` and `AVX2`. On my Mac I get ~69 us for 1. and ~121 us for 2. On the Ryzen 7950X I measured ~60 us for 1. and ~62 us for 2. In any case, implemented as in this POC, the dot product of `Q4_1` quantized values is only marginally slower (~20%) than `Q4_0`.
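The ~20% figure makes sense when you look at what the `Q4_1` direct product has to do. A hedged sketch, with the same caveats as above (`BlockQ4_1` and `dot_direct_q4_1` are illustrative, not `ggml`'s exact struct or code; `kQK` is reused from the earlier sketch):

```cpp
// Q4_1 adds a per-block minimum m, so a weight is d*q + m with q in [0,15].
// The direct product then only needs one extra accumulator per block:
//   sum_i (d*q_i + m)*y_i = d * sum_i q_i*y_i + m * sum_i y_i
// Illustrative layout, not ggml's exact struct; reuses kQK from the sketch above.
struct BlockQ4_1 {
    float   d;          // scale
    float   m;          // minimum
    uint8_t qs[kQK/2];  // 32 unsigned 4-bit quants, two per byte
};

static float dot_direct_q4_1(const std::vector<BlockQ4_1> & q, const std::vector<float> & y) {
    float sum = 0.f;
    for (size_t b = 0; b < q.size(); ++b) {
        const float * yb = y.data() + b*kQK;
        float sqy = 0.f; // sum of q_i * y_i
        float sy  = 0.f; // sum of y_i -- the only extra work relative to Q4_0
        for (int i = 0; i < kQK; i += 2) {
            const uint8_t v = q[b].qs[i/2];
            sqy += (v & 0x0F)*yb[i+0] + (v >> 4)*yb[i+1];
            sy  += yb[i+0] + yb[i+1];
        }
        sum += q[b].d*sqy + q[b].m*sy;
    }
    return sum;
}
```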