
opencl: Q1_0 support first attempt#25

Draft
khosravipasha wants to merge 1 commit into prism-v1 from prism-android-new

Conversation

@khosravipasha
Collaborator

Just for testing...


Copilot AI left a comment


Pull request overview

Adds initial OpenCL backend support for GGML_TYPE_Q1_0, including AoS↔SoA conversion and new Q1_0 matmul/matvec kernels wired into the backend kernel loader and dispatch paths.

Changes:

  • Added block_q1_0 convert/restore kernels in cvt.cl and integrated them into tensor set/get paths.
  • Added new Q1_0 matvec kernels (*_8x_flat, *_1d_8x_flat) and a Q1_0 GEMM kernel (mul_mat_q1_0_Ab_Bi_8x4), with dispatch integration in ggml-opencl.cpp.
  • Added a new float32 transpose variant kernel (kernel_transpose_32_32) and registered new kernels in OpenCL CMake lists.
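The AoS↔SoA conversion can be modeled host-side. The sketch below is a hypothetical C model only: the actual block_q1_0 definition lives in cvt.cl and is not reproduced in this excerpt, so the exact field layout is an assumption; the 16-byte quant payload plus one fp16 scale per block is inferred from the kernel pointer arithmetic in this PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical AoS layout (assumption, not the cvt.cl definition):
 * one fp16 scale (stored here as raw uint16_t bits) plus 16 bytes
 * (128 bits) of packed quants per block. */
typedef struct {
    uint16_t d;      /* fp16 scale, raw bits */
    uint8_t  qs[16]; /* packed weights */
} block_q1_0;

/* AoS -> SoA: split interleaved blocks into a flat quant buffer and a
 * flat scale buffer, mirroring what a convert kernel would do. */
static void q1_0_aos_to_soa(const block_q1_0 *src, int num_blocks,
                            uint8_t *q_out, uint16_t *d_out) {
    for (int i = 0; i < num_blocks; i++) {
        memcpy(q_out + (size_t)i * 16, src[i].qs, 16);
        d_out[i] = src[i].d;
    }
}

/* SoA -> AoS: the inverse, mirroring a restore kernel. */
static void q1_0_soa_to_aos(const uint8_t *q_in, const uint16_t *d_in,
                            int num_blocks, block_q1_0 *dst) {
    for (int i = 0; i < num_blocks; i++) {
        memcpy(dst[i].qs, q_in + (size_t)i * 16, 16);
        dst[i].d = d_in[i];
    }
}
```

The round trip is lossless, which is the property the tensor set/get paths rely on.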

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

  • ggml/src/ggml-opencl/kernels/transpose.cl: adds a bounds-checked float32 transpose kernel variant intended for padded shapes.
  • ggml/src/ggml-opencl/kernels/mul_mv_q1_0_f32_8x_flat.cl: new Q1_0 SoA matvec kernel (8 outputs per subgroup).
  • ggml/src/ggml-opencl/kernels/mul_mv_q1_0_f32_1d_8x_flat.cl: new Q1_0 SoA matvec kernel variant for 1d/batch dispatch.
  • ggml/src/ggml-opencl/kernels/mul_mat_q1_0_Ab_Bi_8x4.cl: new Q1_0 GEMM kernel computing an 8x4 output tile per work-item.
  • ggml/src/ggml-opencl/kernels/cvt.cl: adds the block_q1_0 definition plus convert/restore kernels (AoS↔SoA).
  • ggml/src/ggml-opencl/ggml-opencl.cpp: wires up the Q1_0 programs/kernels, tensor set/get conversion, and mul_mat dispatch paths.
  • ggml/src/ggml-opencl/CMakeLists.txt: registers the new OpenCL kernels for build/embed.


Comment on lines +51 to +63
// Pointers for 4 weight columns (SOA layout, row-major)
// For Q1_0: each block is 16 bytes (128 bits)
global const uchar* weight_base0 = src0_q + (gx_4 + 0) * num_blocks * 16;
global const uchar* weight_base1 = src0_q + (gx_4 + 1) * num_blocks * 16;
global const uchar* weight_base2 = src0_q + (gx_4 + 2) * num_blocks * 16;
global const uchar* weight_base3 = src0_q + (gx_4 + 3) * num_blocks * 16;

// Scale pointers for 4 columns
global const half* scale_ptr0 = src0_d + (gx_4 + 0) * num_blocks;
global const half* scale_ptr1 = src0_d + (gx_4 + 1) * num_blocks;
global const half* scale_ptr2 = src0_d + (gx_4 + 2) * num_blocks;
global const half* scale_ptr3 = src0_d + (gx_4 + 3) * num_blocks;

Comment on lines +192 to +215
if (row_base + 0 < n_no_padding) {
vstore4((float4)(c0.s0, c1.s0, c2.s0, c3.s0), 0, dst + (row_base + 0) * m + (gx << 2));
}
if (row_base + 1 < n_no_padding) {
vstore4((float4)(c0.s1, c1.s1, c2.s1, c3.s1), 0, dst + (row_base + 1) * m + (gx << 2));
}
if (row_base + 2 < n_no_padding) {
vstore4((float4)(c0.s2, c1.s2, c2.s2, c3.s2), 0, dst + (row_base + 2) * m + (gx << 2));
}
if (row_base + 3 < n_no_padding) {
vstore4((float4)(c0.s3, c1.s3, c2.s3, c3.s3), 0, dst + (row_base + 3) * m + (gx << 2));
}
if (row_base + 4 < n_no_padding) {
vstore4((float4)(c0.s4, c1.s4, c2.s4, c3.s4), 0, dst + (row_base + 4) * m + (gx << 2));
}
if (row_base + 5 < n_no_padding) {
vstore4((float4)(c0.s5, c1.s5, c2.s5, c3.s5), 0, dst + (row_base + 5) * m + (gx << 2));
}
if (row_base + 6 < n_no_padding) {
vstore4((float4)(c0.s6, c1.s6, c2.s6, c3.s6), 0, dst + (row_base + 6) * m + (gx << 2));
}
if (row_base + 7 < n_no_padding) {
vstore4((float4)(c0.s7, c1.s7, c2.s7, c3.s7), 0, dst + (row_base + 7) * m + (gx << 2));
}
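The eight unrolled, guarded vstore4 calls above write an 8×4 output tile row by row, skipping rows at or beyond the unpadded height. The same guard logic can be expressed as a compact host-side C sketch (the function name and the tile parameter layout are illustrative, not taken from the PR):

```c
#include <stddef.h>

/* Host-side model of the guarded writeback: each work-item owns an
 * 8-row x 4-column tile starting at (row_base, gx*4) in a row-major
 * dst of width m.  Rows past the unpadded height n_no_padding are
 * skipped, exactly like the per-row vstore4 guards in the kernel. */
static void store_tile_guarded(float *dst, int m, int n_no_padding,
                               int row_base, int gx,
                               float tile[8][4]) {
    for (int r = 0; r < 8; r++) {
        if (row_base + r >= n_no_padding) continue; /* padded row: skip */
        float *out = dst + (size_t)(row_base + r) * m + (gx << 2);
        for (int c = 0; c < 4; c++) out[c] = tile[r][c];
    }
}
```

In the kernel the tile arrives as four float8 accumulators (one per output column), so the unrolled form avoids a transpose in registers; the loop version above is only for checking the bounds logic.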
Comment on lines +1 to +10
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

#ifdef cl_intel_required_subgroup_size
#pragma OPENCL EXTENSION cl_intel_required_subgroup_size : enable
#define REQD_SUBGROUP_SIZE_16 __attribute__((intel_reqd_sub_group_size(16)))
#elif defined(cl_qcom_reqd_sub_group_size)
#pragma OPENCL EXTENSION cl_qcom_reqd_sub_group_size : enable
#define REQD_SUBGROUP_SIZE_64 __attribute__((qcom_reqd_sub_group_size("half")))
#endif
Comment on lines +1 to +10
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

#ifdef cl_intel_required_subgroup_size
#pragma OPENCL EXTENSION cl_intel_required_subgroup_size : enable
#define REQD_SUBGROUP_SIZE_16 __attribute__((intel_reqd_sub_group_size(16)))
#elif defined(cl_qcom_reqd_sub_group_size)
#pragma OPENCL EXTENSION cl_qcom_reqd_sub_group_size : enable
#define REQD_SUBGROUP_SIZE_64 __attribute__((qcom_reqd_sub_group_size("half")))
#endif
Comment on lines +10789 to +10791
size_t global_work_size[3] = {(size_t)((N + 7) / 8), (size_t)(M / 4), 1};
size_t local_work_size[3] = {1, 128, 1};
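For reference, pre-2.0 OpenCL (and in practice most mobile drivers) requires each global work-size dimension to be evenly divisible by the corresponding local dimension. With local = {1, 128, 1}, the global size {(N + 7) / 8, M / 4, 1} only satisfies this when M / 4 is a multiple of 128. Whether the dispatch path rounds up is not visible in this excerpt; a common host-side helper looks like the following sketch:

```c
#include <stddef.h>

/* Round each global dimension up to a multiple of the local dimension,
 * so that clEnqueueNDRangeKernel's global % local == 0 requirement holds.
 * Whether the PR's dispatch path applies this is not shown above. */
static void round_up_global(size_t global[3], const size_t local[3]) {
    for (int i = 0; i < 3; i++) {
        global[i] = (global[i] + local[i] - 1) / local[i] * local[i];
    }
}
```

The kernel then needs its own bounds checks, since rounded-up work-items fall outside the real problem size.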

bricklc pushed a commit to bricklc/prism-ml-llama.cpp that referenced this pull request Apr 25, 2026
 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bricklc pushed a commit to bricklc/prism-ml-llama.cpp that referenced this pull request Apr 25, 2026
…smML-Eng#25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
