I don't know whether it makes sense to enable both CLBlast (in a bid to speed up prompt ingestion) and Metal, but clearly there's something wrong with this combo:

When Llama.cpp is built with `-DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_CLBLAST=ON -DLLAMA_BUILD_SERVER=ON` (a reproduction sketch follows further below), `server` spews out about 140 lines of `ggml_metal_get_buffer: error: buffer is nil`:
```
$ ./server -ngl 16 --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'Apple M2 Max'
ggml_opencl: device FP16 support: false
{"timestamp":1698317012,"level":"INFO","function":"main","line":2213,"message":"build info","build":1429,"commit":"00ae2aa"}
{"timestamp":1698317012,"level":"INFO","function":"main","line":2220,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 19 key-value pairs and 543 tensors from /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf (version GGUF V2 (latest))
llama_model_loader: (...snip...)
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 6656
llm_load_print_meta: n_head = 52
llm_load_print_meta: n_head_kv = 52
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 17920
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 32.53 B
llm_load_print_meta: model size = 18.27 GiB (4.83 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 18711.64 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 13676.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/61 layers to GPU
llm_load_tensors: VRAM used: 5035.47 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: (...snip loaded...)
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 103.13 MB
llama_new_context_with_model: max tensor size = 166.63 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 18711.64 MB, (23750.41 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 780.02 MB, (24530.42 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 97.02 MB, (24627.44 / 49152.00)
ggml_metal_get_buffer: error: buffer is nil [repeated 140 times or so]
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
{"timestamp":1698317017,"level":"INFO","function":"main","line":2495,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
```
and the conversation (using all of the server's defaults) is pretty wonky:
```
User: Hello, Llama! How are you?
Llama: Hi user friendships! I am doing fine thankfully yoursselfnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessness
```
(repeated forever).
While generating, the console is similarly full of `ggml_metal_get_buffer: error: buffer is nil`.
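For reference, the failing configuration can be reproduced roughly as follows. The CMake flags are the ones quoted above; the build directory, job count, and binary path are my assumptions, and as far as I can tell Metal is enabled by default on Apple Silicon, which is how both backends end up active:

```sh
# Configure with CLBlast enabled; Metal appears to be on by default
# for Apple Silicon builds, so both backends get compiled in.
cmake -B build \
  -DBUILD_SHARED_LIBS=ON \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_CLBLAST=ON \
  -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release -j 8

# Run with partial offload, as in the log above (the binary location
# may differ depending on generator/settings):
./build/bin/server -ngl 16 --no-mmap -m /path/to/model.gguf
```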
After rebuilding with `-DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON` (i.e. without CLBlast) and running with the same command line, there are no `ggml_metal_get_buffer: error: buffer is nil` lines and the conversation is back to normal:
```
User: Hello, Llama! How are you?
Llama: I'm doing great, thank you for asking! And how about yourself?
```
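The working rebuild drops only the CLBlast flag (same assumptions as in the sketch above):

```sh
# Reconfigure without -DLLAMA_CLBLAST=ON; Metal stays enabled.
cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release
```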
The results seem to depend on the `-ngl` setting; without `-ngl` set, the CLBlasted Llama seems to be responding fine, but with e.g. `-ngl 32`:
```
User: Hello, Llama! How are you?
Llama: Hiya! fine tuned ready readyreadyReadyReady readyyyahooahoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
```
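A minimal way to compare the two behaviours (the model path is a placeholder):

```sh
# No -ngl: no layers are offloaded through OpenCL, and replies look sane.
./build/bin/server --no-mmap -m /path/to/model.gguf

# With layers offloaded (16 or 32 in my tests), replies degrade and
# "buffer is nil" floods the console.
./build/bin/server -ngl 32 --no-mmap -m /path/to/model.gguf
```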
Environment and Context
llama.cpp version: b1428 + 1 commit (build 1429, commit `00ae2aa`, per the log above)