[SYCL] Add q3_s and q1_s#5886
Conversation
|
have you ran |
|
|
@abhilash1910, do you have any performance estimates for the supported types in this PR? |
For
|
Great! The UT will cover every OPs. |
|
ggerganov
left a comment
There was a problem hiding this comment.
I can't give this a test, but seems OK
Btw, might need some of your help in #5940 to help move the tables with quantum constants to the new ggml-common.h header. Will make the change and ping you to give it a try and confirm that it works
|
@abhilash1910 increasing the grid space seems to fix the regression. However. iQ3_S still throwing a ggml_assert. Probably around here: inline void ggml_sycl_op_dequantize_mul_mat_vec(
const ggml_tensor *src0, const ggml_tensor *src1, ggml_tensor *dst,
const char *src0_dd_i, const float *src1_ddf_i, const char *src1_ddq_i,
float *dst_dd_i, const int64_t row_low, const int64_t row_high,
const int64_t src1_ncols, const int64_t src1_padded_row_size,
const dpct::queue_ptr &stream) {
const int64_t ne00 = src0->ne[0];
const int64_t row_diff = row_high - row_low;
GGML_ASSERT(src1->type == GGML_TYPE_F32);
// on some GPUs it is faster to convert src1 to half and to use half precision intrinsics
#ifdef GGML_SYCL_F16
sycl_pool_alloc<sycl::half> src1_dfloat_a;
sycl::half *src1_dfloat = nullptr; // dfloat == half
bool src1_convert_f16 =
src0->type == GGML_TYPE_Q4_0 || src0->type == GGML_TYPE_Q4_1 ||
src0->type == GGML_TYPE_Q5_0 || src0->type == GGML_TYPE_Q5_1 ||
src0->type == GGML_TYPE_Q8_0 || src0->type == GGML_TYPE_F16;
if (src1_convert_f16) {
src1_dfloat = src1_dfloat_a.alloc(ne00);
const to_fp16_sycl_t to_fp16_sycl = ggml_get_to_fp16_sycl(src1->type);
GGML_ASSERT(to_fp16_sycl != nullptr);
to_fp16_sycl(src1_ddf_i, src1_dfloat, ne00, stream);
}
#else
const dfloat * src1_dfloat = (const dfloat *) src1_ddf_i; // dfloat == float, no conversion
#endif // GGML_SYCL_F16
switch (src0->type) {
case GGML_TYPE_Q4_0:
dequantize_mul_mat_vec_q4_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q4_1:
dequantize_mul_mat_vec_q4_1_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q5_0:
dequantize_mul_mat_vec_q5_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q5_1:
dequantize_mul_mat_vec_q5_1_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q8_0:
dequantize_mul_mat_vec_q8_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q2_K:
dequantize_mul_mat_vec_q2_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q3_K:
dequantize_mul_mat_vec_q3_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q4_K:
dequantize_mul_mat_vec_q4_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q5_K:
dequantize_mul_mat_vec_q5_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_Q6_K:
dequantize_mul_mat_vec_q6_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
break;
case GGML_TYPE_F16:
convert_mul_mat_vec_f16_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
default:
GGML_ASSERT(false);
break;
}
(void) src1;
(void) dst;
(void) src1_ddq_i;
(void) src1_ncols;
(void) src1_padded_row_size;
} |
echo, can't work on model level |
* Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space
* Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space
@abhilash1910 Could you check and fix this issue? |
Fix in progress at #6052 . |
ikawrakow
left a comment
There was a problem hiding this comment.
Please remove this incorrect implementation.
| const int ib = tid%8; // 0...7 | ||
| dst_t * y = yy + i*QK_K + 32*ib + 8*il; | ||
| const uint8_t * qs = x[i].qs + 8*ib; | ||
| const uint8_t * grid1 = (const uint8_t *)(iq3s_grid + qs[2*il+0]); |
There was a problem hiding this comment.
This is wrong. When you copy-paste my core without attribution, please make sure you are copy-pasting the correct code.
There was a problem hiding this comment.
I think we are currently reviewing this, and interms of "attribution" I would suggest that we follow cuda code to adapt to our sycl backend and since some parts of the code base is almost similar, I donot find a reason to be defensive about it.
Like I said before, we are working on this because not all cuda code is applicable for us. I hope this makes communication easier .
| const block_iq1_s * x = (const block_iq1_s *) vx; | ||
|
|
||
| const int tid = item_ct1.get_local_id(2); | ||
| #if QK_K == 256 |
There was a problem hiding this comment.
This is wrong. Please see PR #6014 for the correct implementation.
* Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space
* Add q3_s and q1_s * fix compilation * fix build * fix build * fix build * enable ops * rm macro * increase grid space
Support GGML_TYPE_IQ3_S, GGML_TYPE_IQ1_S in mul_mal/dequant.
cc @NeoZhangJianyu @airMeng @ggerganov