Fix OpenCL kernels for the new formats #1422
Conversation
skidd-level-100 left a comment
Looks legit, please merge!
edit: downloaded and compiled with 'make LLAMA_CLBLAST=1'
it works!
That works fine for me!
My device failed to create an out-of-order queue, so I fell back to an in-order queue with this patch: Now it fails with program source errors (10 long error blocks): These two statements seem to matter: EDIT: platform and device info:
0cc4m left a comment
Thanks, you found the q5_1 problem and fixed the kernels, nice. I tested it and didn't find any issues on AMD and Nvidia.
RTX 3060, Llama 7B:
q5_0: 8.18 ms per token on CLBlast, 4.63 ms per token on CuBLAS
q5_1: 8.25 ms per token on CLBlast, 4.54 ms per token on CuBLAS
@SlyEcho @0cc4m this works for me, but I have noticed a few people mentioning that they get the error regarding variable-length arrays. #1429 (comment) I also noticed that previously the array lengths were indeed hard-coded with a constant. Perhaps this is a platform limitation?
@LostRuins I will take care of it.
@SlyEcho Another thing to add - it seems some people are reporting that the q8_0 dequant kernel is not working correctly - this seems to be the case for me too. Have you observed similar issues? It works correctly on OpenBLAS; only CLBlast returns gibberish, and only for q8_0.
* Fix OpenCL kernels for the new formats
* Fix Q5_0 alignment issues.

This should fix the CLBlast related errors with the new formats.
I also rewrote them to be almost identical to the CUDA versions, so future updates could be easier.
Should fix #1417 #1415
I also figured out the solution for Q5_0, which previously required preconversion to a different format with f32 (and malloc!). The issue was, of course, an alignment issue, which an __attribute__((packed)), as per the OpenCL 1.1 spec, solved.

Test results
Test models:
Test data:
head -n 102 wiki.test.raw > wiki.test.mini
Test command:
Test outputs:
7B Q4_0
7B Q4_1
7B Q5_0
7B Q5_1
7B Q8_0
7B F16