Adding a simple program to measure speed of dot products #1041
Merged
Conversation
Contributor

On my Core i3-8100 (AVX2): […]

Please consider putting it in `examples/benchmark` instead of creating a new folder.
Contributor
Author

@sw Thank you for the measurement. Yes, of course, I can move it to `examples/benchmark`.
ggerganov approved these changes on Apr 18, 2023
Contributor

Yes, […]
I was surprised by the belief that, for the dot product `x * y`, where `x` holds quantized model weights and `y` contains floating point values, it is faster to quantize `y` and to perform the dot product using the quantized `y` values (accepting the associated loss in precision) than to just directly compute `x * y`. So, I had to try it myself.

This PR adds a simple program that measures the time it takes to perform the dot product between vectors holding `1 << 18` values. I picked a relatively large vector size so as not to get involved with the science of accurately measuring elapsed time for short-lasting operations.

Basically, we fill two vectors `x` and `y` with random values and quantize `x` into `q`. We then measure the time for

1. `d = q * y` directly
2. `y' = quantize(y); d = q * y'`

For 2. we use the vectorized (SIMD-ified) functions from `ggml` (or, if requested by a command line argument, the corresponding scalar functions from `ggml`). A simplified sketch of both paths follows.
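To make the two paths concrete, here is a minimal self-contained sketch of what the benchmark measures. This is not the actual `vdot` code: `BlockQ4`, `quantize()`, `dot_direct()` and `dot_quantized()` are simplified stand-ins for `ggml`'s Q4_0 block and functions, and the block layout below is illustrative rather than `ggml`'s exact memory layout.

```cpp
// Minimal sketch of the two measured paths. Illustrative only: the block
// layout and helper names are simplified stand-ins, not ggml's exact
// Q4_0 struct or the functions used by the real vdot program.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

constexpr int kQK = 32; // values per quantization block, as in Q4_0

struct BlockQ4 {        // simplified Q4_0-style block
    float   d;          // scale
    uint8_t qs[kQK/2];  // 32 4-bit quants, two per byte
};

// Quantize a float vector into 4-bit blocks: q = round(x/d) + 8, d = max|x|/7.
static void quantize(const std::vector<float> & x, std::vector<BlockQ4> & out) {
    const int nb = (int)x.size() / kQK;
    out.resize(nb);
    for (int b = 0; b < nb; ++b) {
        const float * xb = x.data() + b*kQK;
        float amax = 0.f;
        for (int i = 0; i < kQK; ++i) amax = std::max(amax, std::fabs(xb[i]));
        const float d  = amax/7.f;
        const float id = d ? 1.f/d : 0.f;
        out[b].d = d;
        for (int i = 0; i < kQK; i += 2) {
            const uint8_t q0 = (uint8_t)(std::lround(xb[i+0]*id) + 8); // in [1,15]
            const uint8_t q1 = (uint8_t)(std::lround(xb[i+1]*id) + 8);
            out[b].qs[i/2] = q0 | (q1 << 4);
        }
    }
}

// Path 1: quantized x against float y directly -- no quantization of y.
static float dot_direct(const std::vector<BlockQ4> & q, const std::vector<float> & y) {
    float sum = 0.f;
    for (size_t b = 0; b < q.size(); ++b) {
        const float * yb = y.data() + b*kQK;
        float s = 0.f;
        for (int i = 0; i < kQK; i += 2) {
            const uint8_t v = q[b].qs[i/2];
            s += ((v & 0x0F) - 8)*yb[i+0] + ((v >> 4) - 8)*yb[i+1];
        }
        sum += q[b].d*s;
    }
    return sum;
}

// Path 2: both operands quantized -- integer products, two scales per block.
static float dot_quantized(const std::vector<BlockQ4> & q, const std::vector<BlockQ4> & p) {
    float sum = 0.f;
    for (size_t b = 0; b < q.size(); ++b) {
        int s = 0;
        for (int i = 0; i < kQK/2; ++i) {
            const uint8_t a = q[b].qs[i], c = p[b].qs[i];
            s += ((a & 0x0F) - 8)*((c & 0x0F) - 8) + ((a >> 4) - 8)*((c >> 4) - 8);
        }
        sum += q[b].d*p[b].d*s;
    }
    return sum;
}

int main() {
    const int n = 1 << 18; // same vector size as in the PR
    std::mt19937 rng(1234);
    std::uniform_real_distribution<float> dist(-1.f, 1.f);
    std::vector<float> x(n), y(n);
    for (auto & v : x) v = dist(rng);
    for (auto & v : y) v = dist(rng);

    std::vector<BlockQ4> q, p;
    quantize(x, q);

    using clk = std::chrono::steady_clock;
    auto t0 = clk::now();
    const float d1 = dot_direct(q, y);
    auto t1 = clk::now();
    quantize(y, p); // path 2 pays for quantizing y (and loses precision)
    const float d2 = dot_quantized(q, p);
    auto t2 = clk::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    // the real program repeats the measurement many times and averages
    printf("direct:    %.4f  %lld us\n", d1, (long long)us(t0, t1));
    printf("quantized: %.4f  %lld us\n", d2, (long long)us(t1, t2));
    return 0;
}
```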
On my Mac, 1. is faster than 2. (~55 us vs ~75 us). On the `x86_64` CPU that I have available (Ryzen 7950X), 1. is somewhat slower than the `AVX2` implementation of 2. (~50 us vs ~35 us).

On both CPUs the direct product 1., as implemented in the `dot()` function in this POC, is much faster than the scalar version of 2. from `ggml` (~15X faster on the Ryzen 7950X and ~6X faster on the Mac).

I think that with some `ARM_NEON` or `AVX2` magic one should be able to further speed up 1.
To use it, `make -j` and then e.g. `./vdot 100` to measure 100 dot products with the SIMD-ified `ggml` functions, or `./vdot 100 1` to measure the scalar `ggml` functions instead.

Added a comparison for `Q4_1` quantization. Here, the direct product 1. is faster than 2. for both `ARM_NEON` and `AVX2`. On my Mac I get ~69 us for 1. and ~121 us for 2. On the Ryzen 7950X I measured ~60 us for 1. and ~62 us for 2. In any case, implemented as in this POC, the dot product of `Q4_1` quantized values is only marginally slower (~20%) than `Q4_0`.
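The ~20% figure makes sense when you look at what the `Q4_1` direct product has to do. A hedged sketch, with the same caveats as above (`BlockQ4_1` and `dot_direct_q4_1` are illustrative, not `ggml`'s exact struct or code; `kQK` is reused from the earlier sketch):

```cpp
// Q4_1 adds a per-block minimum m, so a weight is d*q + m with q in [0,15].
// The direct product then only needs one extra accumulator per block:
//   sum_i (d*q_i + m)*y_i = d * sum_i q_i*y_i + m * sum_i y_i
// Illustrative layout, not ggml's exact struct; reuses kQK from the sketch above.
struct BlockQ4_1 {
    float   d;          // scale
    float   m;          // minimum
    uint8_t qs[kQK/2];  // 32 unsigned 4-bit quants, two per byte
};

static float dot_direct_q4_1(const std::vector<BlockQ4_1> & q, const std::vector<float> & y) {
    float sum = 0.f;
    for (size_t b = 0; b < q.size(); ++b) {
        const float * yb = y.data() + b*kQK;
        float sqy = 0.f; // sum of q_i * y_i
        float sy  = 0.f; // sum of y_i -- the only extra work relative to Q4_0
        for (int i = 0; i < kQK; i += 2) {
            const uint8_t v = q[b].qs[i/2];
            sqy += (v & 0x0F)*yb[i+0] + (v >> 4)*yb[i+1];
            sy  += yb[i+0] + yb[i+1];
        }
        sum += q[b].d*sqy + q[b].m*sy;
    }
    return sum;
}
```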