Conversation
```c
// wdata += row_size;
// }
// }
//}
```
Here I disable the Q8_0 quantization - we don't need it
I tested this a while ago because I was also very surprised that quantizing first is faster, but it is indeed faster. I even tried an AVX2 implementation of `ggml_vec_dot_q4_0_f32`:

```c
static void ggml_vec_dot_q4_0_f32(const int n, float * restrict s, const void * restrict vx, const float * restrict y) {
    const int nb = n / QK;

    assert(n % QK == 0);
    assert(nb % 2 == 0);

    const block_q4_0 * restrict x = (const block_q4_0 *)vx;

    __m256 acc = _mm256_setzero_ps();

    // Main loop
    for (int i = 0; i < nb; ++i) {
        const __m256 d_v = _mm256_broadcast_ss(&x[i].d);

        // Load 32x 4-bit integers into 32x 8-bit integers
        __m256i vx8 = bytesFromNibbles(x[i].qs);

        // Subtract 8 from the integers
        vx8 = _mm256_sub_epi8(vx8, _mm256_set1_epi8(8));

        // Convert to 16-bit int
        const __m256i vx16_lo = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 0));
        const __m256i vx16_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(vx8, 1));

        // Convert to 32-bit int -> float 32
        const __m256 vf[4] = {
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_lo, 0))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_lo, 1))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_hi, 0))),
            _mm256_cvtepi32_ps(_mm256_cvtepi16_epi32(_mm256_extracti128_si256(vx16_hi, 1)))
        };

        // Scale and fma
        const float * yj = y + i*32;
        for (int j = 0; j < 4; j++) {
            const __m256 jx = _mm256_mul_ps(vf[j], d_v);
            const __m256 jy = _mm256_loadu_ps(yj + j*8);
            acc = _mm256_fmadd_ps(jx, jy, acc);
        }
    }

    // Return horizontal sum of the acc vector
    __m128 res = _mm256_extractf128_ps( acc, 1 );
    res = _mm_add_ps( res, _mm256_castps256_ps128( acc ) );
    res = _mm_add_ps( res, _mm_movehl_ps( res, res ) );
    res = _mm_add_ss( res, _mm_movehdup_ps( res ) );

    *s = _mm_cvtss_f32( res );
}
```
The thing that I did not realize is that the number of values being quantized is many times smaller than the number of values used in the dot products (512X or more). This means that my simple test code from #1041 does not adequately measure the actual situation, where the time spent on quantization is small compared to the time spent on dot products between quantized values. If I only consider the dot products, then indeed the quantized version is faster. Sorry for wasting @ggerganov's time.
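To make the amortization concrete, here is a rough sketch with assumed (hypothetical) dimensions, not measurements from this PR: for one token, multiplying an `N x K` weight matrix by the activation vector quantizes the `K` activation values once, but then runs `N` dot products that each read `K` quantized values.

```c
#include <stdio.h>

int main(void) {
    // Hypothetical layer size, only to illustrate the ratio
    const long K = 4096;   // activation length = values quantized per token
    const long N = 4096;   // weight rows = number of dot products per token

    const long quantized = K;       // values passed through quantization
    const long dotted    = N * K;   // values consumed by the dot products

    printf("dot-product work / quantization work = %ld\n", dotted / quantized);
    // Prints 4096 here; even a much smaller N already gives the "512X or more" gap.
    return 0;
}
```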
Plugged @ikawrakow's idea from #1041
On master, I get ~51 ms / token. On this branch I get ~226 ms / token for the same run.
If I have to guess, at 8 threads the computation becomes memory bound and therefore, even though the `Q4_0 x F32` dot product is faster on its own, `Q4_0 x Q8_0` ends up being more performant because it reads less data from memory.