
[SYCL] Add q3_s and q1_s #5886

Merged

abhilash1910 merged 9 commits into master from sycl_q3s_q1s on Mar 11, 2024

Conversation

@abhilash1910
Contributor

Support GGML_TYPE_IQ3_S and GGML_TYPE_IQ1_S in mul_mat/dequantize.
cc @NeoZhangJianyu @airMeng @ggerganov

@airMeng
Contributor

airMeng commented Mar 5, 2024

Have you run test-backend-ops or verified this at the model level?

@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Mar 6, 2024

  1. Please update ggml_backend_sycl_supports_op() to support the new types (a sketch is shown below).
    And run ci/run.sh to test it.

  2. How did you test the new types? Which model or test case?
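
A minimal sketch of what point 1 could look like, assuming ggml_backend_sycl_supports_op() gates GGML_OP_MUL_MAT on the src0 quantization type in the same way the CUDA backend does; the real function in ggml-sycl.cpp may be structured differently, and the type list below is illustrative rather than complete:

static bool ggml_backend_sycl_supports_op(ggml_backend_t backend, const ggml_tensor * op) {
    (void) backend; // the check below only depends on the op itself

    switch (op->op) {
        case GGML_OP_MUL_MAT:
        case GGML_OP_MUL_MAT_ID:
            switch (op->src[0]->type) {
                case GGML_TYPE_F32:
                case GGML_TYPE_F16:
                case GGML_TYPE_Q4_0:
                case GGML_TYPE_Q8_0:
                case GGML_TYPE_Q6_K:
                case GGML_TYPE_IQ3_S: // newly enabled type
                case GGML_TYPE_IQ1_S: // newly enabled type
                    return true;
                default:
                    return false;
            }
        default:
            // other ops are assumed supported in this sketch
            return true;
    }
}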

Comment thread ggml-sycl.cpp Outdated
@AlexKoff88

@abhilash1910, do you have any performance estimates for the supported types in this PR?

@abhilash1910
Contributor Author

  1. Please update ggml_backend_sycl_supports_op() to support the new types.
    And run ci/run.sh to test it.
  2. How did you test the new types? Which model or test case?

For

  1. Handled.
  2. I think q3_s should work with Mistral as well as LLaMA v1/v2, and the same for the 1.5-bit q1_s. test-backend-ops is OK.

@NeoZhangJianyu
Contributor

NeoZhangJianyu commented Mar 8, 2024

  1. Please update ggml_backend_sycl_supports_op() to support the new types.
    And run ci/run.sh to test it.
  2. How did you test the new types? Which model or test case?

For

  1. Handled.
  2. I think q3_s should work with Mistral as well as LLaMA v1/v2, and the same for the 1.5-bit q1_s. test-backend-ops is OK.

Great!

The UT will cover every op.
But in real inference, it depends on the conditions.
Have you tested the new types with the Mistral model, and did they pass?

@abhilash1910
Contributor Author

  1. Please update ggml_backend_sycl_supports_op() to support the new types.
    And run ci/run.sh to test it.
  2. How did you test the new types? Which model or test case?

For

  1. Handled.
  2. I think q3_s should work with Mistral as well as LLaMA v1/v2, and the same for the 1.5-bit q1_s. test-backend-ops is OK.

Great!

The UT will cover every op. But in real inference, it depends on the conditions. Have you tested the new types with the Mistral model, and did they pass?

  • Tested on Mistral 7B without errors, functionality-wise.

Member

@ggerganov left a comment


I can't give this a test, but seems OK

Btw, I might need some of your help in #5940 to move the tables with the quantum constants to the new ggml-common.h header. I will make the change and ping you to give it a try and confirm that it works.

@qnixsynapse
Collaborator

qnixsynapse commented Mar 9, 2024

@abhilash1910 increasing the grid space seems to fix the regression. However, IQ3_S is still throwing a GGML_ASSERT:
ggml-sycl.cpp:14654: false

Probably around here:

inline void ggml_sycl_op_dequantize_mul_mat_vec(
    const ggml_tensor *src0, const ggml_tensor *src1, ggml_tensor *dst,
    const char *src0_dd_i, const float *src1_ddf_i, const char *src1_ddq_i,
    float *dst_dd_i, const int64_t row_low, const int64_t row_high,
    const int64_t src1_ncols, const int64_t src1_padded_row_size,
    const dpct::queue_ptr &stream) {

    const int64_t ne00 = src0->ne[0];
    const int64_t row_diff = row_high - row_low;

    GGML_ASSERT(src1->type == GGML_TYPE_F32);

    // on some GPUs it is faster to convert src1 to half and to use half precision intrinsics
#ifdef GGML_SYCL_F16
    sycl_pool_alloc<sycl::half> src1_dfloat_a;
    sycl::half *src1_dfloat = nullptr; // dfloat == half

    bool src1_convert_f16 =
        src0->type == GGML_TYPE_Q4_0 || src0->type == GGML_TYPE_Q4_1 ||
        src0->type == GGML_TYPE_Q5_0 || src0->type == GGML_TYPE_Q5_1 ||
        src0->type == GGML_TYPE_Q8_0 || src0->type == GGML_TYPE_F16;
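    // note: only the legacy quant types and F16 take this half-precision path;
    // the K-quants (and the new IQ types, if they were routed here) keep src1 as float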

    if (src1_convert_f16) {
        src1_dfloat = src1_dfloat_a.alloc(ne00);
        const to_fp16_sycl_t to_fp16_sycl = ggml_get_to_fp16_sycl(src1->type);
        GGML_ASSERT(to_fp16_sycl != nullptr);
        to_fp16_sycl(src1_ddf_i, src1_dfloat, ne00, stream);
    }
#else
    const dfloat * src1_dfloat = (const dfloat *) src1_ddf_i; // dfloat == float, no conversion
#endif // GGML_SYCL_F16

    switch (src0->type) {
        case GGML_TYPE_Q4_0:
            dequantize_mul_mat_vec_q4_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q4_1:
            dequantize_mul_mat_vec_q4_1_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_0:
            dequantize_mul_mat_vec_q5_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_1:
            dequantize_mul_mat_vec_q5_1_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q8_0:
            dequantize_mul_mat_vec_q8_0_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q2_K:
            dequantize_mul_mat_vec_q2_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q3_K:
            dequantize_mul_mat_vec_q3_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q4_K:
            dequantize_mul_mat_vec_q4_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q5_K:
            dequantize_mul_mat_vec_q5_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_Q6_K:
            dequantize_mul_mat_vec_q6_K_sycl(src0_dd_i, src1_ddf_i, dst_dd_i, ne00, row_diff, stream);
            break;
        case GGML_TYPE_F16:
            convert_mul_mat_vec_f16_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
            break;
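        // note: IQ3_S / IQ1_S are not handled in this switch, so if they take this
        // path they fall through to the GGML_ASSERT(false) below, which is likely the
        // ggml-sycl.cpp:14654 assert reported above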
        default:
            GGML_ASSERT(false);
            break;
    }

    (void) src1;
    (void) dst;
    (void) src1_ddq_i;
    (void) src1_ncols;
    (void) src1_padded_row_size;
}

@airMeng
Contributor

airMeng commented Mar 11, 2024

@abhilash1910 increasing the grid space seems to fix the regression. However, IQ3_S is still throwing a GGML_ASSERT: ggml-sycl.cpp:14654: false

Echoing this; it can't work at the model level.

abhilash1910 merged commit ef3ced2 into master on Mar 11, 2024
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* Add q3_s and q1_s

* fix compilation

* fix build

* fix build

* fix build

* enable ops

* rm macro

* increase grid space
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
@NeoZhangJianyu
Contributor

@abhilash1910 increasing the grid space seems to fix the regression. However, IQ3_S is still throwing a GGML_ASSERT: ggml-sycl.cpp:14654: false

Echoing this; it can't work at the model level.

@abhilash1910 Could you check and fix this issue?

@abhilash1910
Contributor Author

@abhilash1910 increasing the grid space seems to fix the regression. However, IQ3_S is still throwing a GGML_ASSERT: ggml-sycl.cpp:14654: false

Echoing this; it can't work at the model level.

@abhilash1910 Could you check and fix this issue?

Fix in progress at #6052.

Contributor

@ikawrakow left a comment


Please remove this incorrect implementation.

Comment thread ggml-sycl.cpp
const int ib = tid%8; // 0...7
dst_t * y = yy + i*QK_K + 32*ib + 8*il;
const uint8_t * qs = x[i].qs + 8*ib;
const uint8_t * grid1 = (const uint8_t *)(iq3s_grid + qs[2*il+0]);
Contributor


This is wrong. When you copy-paste my code without attribution, please make sure you are copy-pasting the correct code.

Contributor Author


I think we are currently reviewing this, and in terms of "attribution" I would suggest that we follow the CUDA code and adapt it to our SYCL backend; since some parts of the code base are almost identical, I do not find a reason to be defensive about it.
Like I said before, we are working on this because not all of the CUDA code is applicable for us. I hope this makes communication easier.

Comment thread ggml-sycl.cpp
const block_iq1_s * x = (const block_iq1_s *) vx;

const int tid = item_ct1.get_local_id(2);
#if QK_K == 256
Contributor


This is wrong. Please see PR #6014 for the correct implementation.

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
