A better packNibbles and mul_sum_i8_pairs_float implementation using AVX512 #1119
sw merged 3 commits into ggml-org:master

Conversation
With an AVX512 machine, you may want to look into using …

Rebasing/merging latest master should fix the failing checks.
Force-pushed from 015aeda to 81cb1ee
Thank you very much for your suggestion!
Changed the title: packNibbles implementation using AVX512 → packNibbles and mul_sum_i8_pairs_float implementation using AVX512
```c
#if __AVXVNNIINT8__
    const __m256i zero = _mm256_setzero_si256();
    const __m256i summed_pairs = _mm256_dpbssd_epi32(zero, x, y);
    return _mm256_cvtepi32_ps(summed_pairs);
#else
```
As far as I'm aware, there is no hardware out there supporting AVX-VNNI-INT8 yet, so I don't think it's a good idea to use _mm256_dpbssd_epi32() here (the code is under an #if, so there is no harm in merging this, but it isn't useful either).
What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is the left operand should be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().
Essentially, this would be a backport of what I did with AVX-512 (zmm registers) to AVX-VNNI (ymm registers), which was proposed by @ultoris here.
Maybe we should open a separate PR for the VNNI optimization? It might speed up the quantized dot product, which is on the hot code path, so it would be nice to get some performance measurements independently of the packNibbles() optimization.
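To make the suggestion concrete, here is a scalar model of what one 32-bit lane of `_mm256_dpbusd_epi32` (AVX-VNNI) computes: four unsigned-byte-by-signed-byte products accumulated into an `int32_t`. The helper name is hypothetical, for illustration only; it is not from the PR.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of _mm256_dpbusd_epi32:
 * acc += u[0]*s[0] + u[1]*s[1] + u[2]*s[2] + u[3]*s[3],
 * where u is unsigned and s is signed, exactly the u8 x s8 pairing
 * the intrinsic performs per lane. Hypothetical helper name. */
int32_t dpbusd_lane(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}
```

The real intrinsic does this for all eight 32-bit lanes of a ymm register at once, which is why it can replace the `_mm256_maddubs_epi16()` + `sum_i16_pairs_float()` pair.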
> What we can use is AVX-VNNI. It is present on already existing Intel CPUs starting from Alder Lake (12th gen) and includes _mm256_dpbusd_epi32(). The difference is the left operand should be unsigned, so you have to keep _mm256_sign_epi8(...) and only replace the _mm256_maddubs_epi16() + sum_i16_pairs_float() pair with _mm256_dpbusd_epi32().
Good suggestion! I have changed mul_sum_i8_pairs_float to use AVX-VNNI. I also think packNibbles does not affect the inference speed, so it should be fine to measure inference performance in this PR.
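The reason `_mm256_sign_epi8(...)` has to stay is that `_mm256_dpbusd_epi32` needs an unsigned left operand: the sign of x is moved onto y, and |x| is used as the unsigned side. A scalar sketch of that trick over four byte pairs (illustrative names, not from the PR):

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar sketch of the sign trick used with AVX-VNNI: a signed-by-signed
 * dot product x.y is rewritten as |x| . sign(y, x) so the left operand
 * becomes unsigned, matching what _mm256_sign_epi8 +
 * _mm256_dpbusd_epi32 do per byte. Hypothetical helper name. */
int32_t signed_dot4(const int8_t x[4], const int8_t y[4]) {
    int32_t acc = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t ax = (uint8_t)abs(x[i]);        /* |x|: now a valid u8 operand */
        int8_t  sy = x[i] < 0 ? -y[i] : y[i];   /* sign of x moved onto y      */
        acc += (int32_t)ax * (int32_t)sy;       /* u8 * s8, as in dpbusd       */
    }
    return acc;
}
```

When x[i] is zero, ax is zero, so the product vanishes just as `_mm256_sign_epi8` zeroes the corresponding byte.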
```c
#if __AVX512F__
    const __m256i bytes_srli_4 = _mm256_srli_epi16(bytes, 4); // 0000_0000_abcd_0000
    bytes = _mm256_or_si256(bytes, bytes_srli_4);             // 0000_abcd_abcd_efgh
    return _mm256_cvtepi16_epi8(bytes);                       // abcd_efgh
#else
```
This looks great, thank you! I made a microbenchmark, and it shows a nice improvement on a Tiger Lake CPU:
```
-----------------------------------------------------------------
Benchmark                    Time             CPU     Iterations
-----------------------------------------------------------------
BenchPackNibblesAvx2       583 ns          582 ns        1212627
BenchPackNibblesAvx512     514 ns          513 ns        1313028
```
I don't think packNibbles() is on the hot path during inference, so it will probably not affect the overall inference speed (I might be wrong here). However, it makes the code both more readable and more performant, and is worth merging.
Use only three instructions to implement packNibbles when AVX512 is available. (The _mm256_cvtepi16_epi8 intrinsic requires AVX512 support.)