Conversation
joeatodd
left a comment
This is a welcome addition since it simplifies the MatMul dispatch 😄
We spotted a couple of issues with the code; I've added comments. I hope they're helpful.
```cpp
use_mul_mat_vec_q = use_mul_mat_vec_q; // Check dp4a
use_mul_mat_q = use_mul_mat_q; // check dp4a
```
These lines don't do anything.
There will be refactoring of the SYCL compute capabilities in #6408; keeping these here as a reminder.
```cpp
use_mul_mat_vec_q = use_mul_mat_vec_q; // Check dp4a
use_mul_mat_q = use_mul_mat_q; // check dp4a
#ifdef SYCL_USE_XMX
use_mul_mat_q = use_mul_mat_q && (!fp16_performance_good || src1->ne[1] <= MMQ_MAX_BATCH_SIZE);
```
The logic in this block is a little confusing:
`(!fp16_performance_good || src1->ne[1] <= MMQ_MAX_BATCH_SIZE)`
It says: use MMQ if either 1) FP16 performance is bad, or 2) the number of columns in src1 is less than or equal to MMQ_MAX_BATCH_SIZE.
Is this the intention?
This just aligns with the CUDA logic.
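For reference, here is a minimal sketch of what that condition expresses, with the diff's variables passed in as parameters so it compiles standalone. This is an illustration of the intent, not the upstream code:

```cpp
#include <cstdint>

// MMQ stays enabled only when the FP16 GEMM path is NOT clearly better:
// either FP16 performance is poor on this device, or the batch
// (src1_ncols, i.e. src1->ne[1]) is small enough that MMQ still wins.
static bool keep_mmq(bool use_mul_mat_q, bool fp16_performance_good,
                     int64_t src1_ncols, int64_t mmq_max_batch_size) {
    const bool prefer_fp16_gemm = fp16_performance_good
                               && src1_ncols > mmq_max_batch_size;
    // By De Morgan, !(a && b) == (!a || !b), which is exactly the form
    // in the diff:
    //   use_mul_mat_q && (!fp16_performance_good || src1->ne[1] <= MMQ_MAX_BATCH_SIZE)
    return use_mul_mat_q && !prefer_fp16_gemm;
}
```

So both readings are the same condition: MMQ is dropped only when FP16 performance is good *and* the batch exceeds MMQ_MAX_BATCH_SIZE.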
I am seeing a new test failure on Arc A770; is this to be expected?
```
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /builds/perseus-performance-libraries/llama_ci/llama.cpp/ggml-sycl.cpp:13858: false
ggml_sycl_op_dequantize_mul_mat_vec unsupported GGML_TYPE 16
```
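For context, the assert fires because the dequantize-mul-mat-vec path has no kernel for IQ2_XXS (GGML_TYPE 16 in the ggml_type enum). One hypothetical shape for a guard that routes such types away from that path instead of asserting; `ggml_sycl_supports_dmmv` is an assumed helper name, not existing upstream code:

```cpp
#include "ggml.h" // for enum ggml_type

// Hypothetical support check (assumed helper, for illustration): return
// true only for types the dequantize-mul-mat-vec kernels actually handle,
// so the dispatcher can pick a supported path for the rest.
static bool ggml_sycl_supports_dmmv(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0: case GGML_TYPE_Q4_1:
        case GGML_TYPE_Q5_0: case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_Q2_K: case GGML_TYPE_Q3_K:
        case GGML_TYPE_Q4_K: case GGML_TYPE_Q5_K:
        case GGML_TYPE_Q6_K:
        case GGML_TYPE_F16:
            return true;
        default:
            return false; // IQ2_XXS (GGML_TYPE 16) and other IQ types land here
    }
}
```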
```cpp
bool ggml_sycl_supports_mmq(enum ggml_type type) {
    // TODO: accuracy issues in MMQ
    return false;
}
```
Could you elaborate on what accuracy issues you are having with MMQ?
Master uses ggml_sycl_op_mul_mat_sycl for these 5 cases; you can try forcing MMQ to reproduce the accuracy issues (see the sketch after the log):
```
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1]): ggml_sycl_op_mul_mat_sycl
```
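One way to force MMQ for that comparison, as a temporary local experiment only (a sketch of the function quoted above with its return changed, not a proposed fix):

```cpp
#include "ggml.h" // for enum ggml_type

bool ggml_sycl_supports_mmq(enum ggml_type type) {
    // TODO: accuracy issues in MMQ
    // Temporary experiment: claim support for Q4_0 so the dispatcher picks
    // the MMQ kernel for the MUL_MAT cases above instead of
    // ggml_sycl_op_mul_mat_sycl.
    return type == GGML_TYPE_Q4_0;
}
```

Then rerun the affected cases, e.g. with `test-backend-ops -o MUL_MAT`, and compare against the ggml_sycl_op_mul_mat_sycl results.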
Force-pushed from bcadd61 to bfed283.
fixed
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
NeoZhangJianyu
left a comment
Performance increases from 34 tokens/s to 37 tokens/s on Arc 770 with llama2-7b-Q4.
It's good!
* align GEMM dispatch
The issue was exposed during #6408.
I will split #6408 into several smaller PRs for easier reviewing; the tasks will be updated according to the issues exposed.