ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration, and fix incorrect zTensor free #15839
Conversation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
…al if guard" This reverts commit 6e780a4.
This reverts commit 0da4b6a.
Are you sure that you are looking at a weight? It might be part of the attention computation.
Sorry I missed this. Yep, I can confirm that I am looking at a weight tensor, unless my debugging code is wrong.

Debug patch:

```diff
diff --git a/ggml/src/ggml-zdnn/ggml-zdnn.cpp b/ggml/src/ggml-zdnn/ggml-zdnn.cpp
index 7947aab87..bd04beb2d 100644
--- a/ggml/src/ggml-zdnn/ggml-zdnn.cpp
+++ b/ggml/src/ggml-zdnn/ggml-zdnn.cpp
@@ -130,7 +130,11 @@ static void ggml_zdnn_mul_mat_op(ggml_backend_zdnn_context * ctx, const ggml_ten
     // TODO: Weights are somehow not going through `ggml_backend_zdnn_buffer_set_tensor` during model loading.
     //       So we need to load the weights here. Remove this when the issue is fixed.
     //       Problem might be residing in `ggml_backend_zdnn_device_supports_buft`.
-    if (weights_extra->ztensor.is_transformed == false) ggml_zdnn_load_tensor(weights_extra->ztensor, weights->data);
+    if (weights_extra->ztensor.is_transformed == false) {
+        GGML_LOG_INFO("%s: tensor->name = %s | tensor->buffer->usage = %d\n", __func__, weights->name, weights->buffer->usage);
+        ggml_zdnn_load_tensor(weights_extra->ztensor, weights->data);
+        std::raise(SIGINT);
+    }
     // GGML_LOG_INFO("%s: tensor '%s' tensor dimensions: [%ld, %ld, %ld, %ld] pre_tfm_desc dimensions: [%ld, %ld, %ld, %ld]\n",
     //               __func__, weights_extra->name,
```

And as logged, the buffer usage is
I did some digging as well and found out that setting
That's expected; of course you cannot enable user-mapped buffers if you need to modify the tensor data.
Got it. Will create a separate PR by tomorrow to fix it. Do let me know if I need to make any changes to this PR.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
…d incorrect zTensor free (ggml-org#15839)
fixes #15414
Not sure if my `.supports_buft` is implemented inaccurately, but the weight tensors are not going through the `.set_tensor` function, and thus we have to re-initialise the weight zTensors on-the-fly during matmul. Not ideal though.

Activates the following data types:
- FP16
- BF16
Fixes:
- `LLAMA_SET_ROWS=1` causing the inference to be incorrect (see: Eval bug: zDNN backend not inferencing correctly after LLAMA_SET_ROWS enablement #15414)
- Incorrect zTensor free when `llama-bench` was used with more than 1 model
- `init_tensor` for performance improvements

Performance
Note
Tests were conducted on an IBM z17 Mainframe with 40 IFLs (cores) and 128 GB Memory on a shared R&D LPAR.
test-backend-ops