I don't know whether it makes sense to enable both CLBlast (in a bid to speed up prompt ingestion) and Metal, but clearly there's something wrong with this combo:

When Llama.cpp is built with `-DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_CLBLAST=ON -DLLAMA_BUILD_SERVER=ON` (a reproduction sketch follows further below), `server` spews out about 140 lines of `ggml_metal_get_buffer: error: buffer is nil`:
```
$ ./server -ngl 16 --no-mmap -m /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf
ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'Apple M2 Max'
ggml_opencl: device FP16 support: false
{"timestamp":1698317012,"level":"INFO","function":"main","line":2213,"message":"build info","build":1429,"commit":"00ae2aa"}
{"timestamp":1698317012,"level":"INFO","function":"main","line":2220,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 19 key-value pairs and 543 tensors from /Users/akx/Documents/Llama/models/ausboss-llama-30b-supercot-q4_k_m.gguf (version GGUF V2 (latest))
llama_model_loader: (...snip...)
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 6656
llm_load_print_meta: n_head = 52
llm_load_print_meta: n_head_kv = 52
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 17920
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 32.53 B
llm_load_print_meta: model size = 18.27 GiB (4.83 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 18711.64 MB
llm_load_tensors: using OpenCL for GPU acceleration
llm_load_tensors: mem required = 13676.17 MB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/61 layers to GPU
llm_load_tensors: VRAM used: 5035.47 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Max
ggml_metal_init: picking default device: Apple M2 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: (...snip loaded...)
ggml_metal_init: GPU name: Apple M2 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 103.13 MB
llama_new_context_with_model: max tensor size = 166.63 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 18711.64 MB, (23750.41 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 780.02 MB, (24530.42 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 97.02 MB, (24627.44 / 49152.00)
ggml_metal_get_buffer: error: buffer is nil [repeated 140 times or so]
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
{"timestamp":1698317017,"level":"INFO","function":"main","line":2495,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
```
and the conversation (using all of the server's defaults) is pretty wonky:
```
User: Hello, Llama! How are you?
Llama: Hi user friendships! I am doing fine thankfully yoursselfnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessnessness
```
(repeated forever).
While generating, the console is similarly full of `ggml_metal_get_buffer: error: buffer is nil`.
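For reference, the failing configuration can be reproduced roughly as follows. The CMake flags are the ones quoted above; the build directory, job count, and binary path are my assumptions, and as far as I can tell Metal is enabled by default on Apple Silicon, which is how both backends end up active:

```sh
# Configure with CLBlast enabled; Metal appears to be on by default
# for Apple Silicon builds, so both backends get compiled in.
cmake -B build \
  -DBUILD_SHARED_LIBS=ON \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_CLBLAST=ON \
  -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release -j 8

# Run with partial offload, as in the log above (the binary location
# may differ depending on generator/settings):
./build/bin/server -ngl 16 --no-mmap -m /path/to/model.gguf
```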
After rebuilding with `-DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON` (i.e. without CLBlast) and running with the same command line, there are no `ggml_metal_get_buffer: error: buffer is nil` lines and the conversation is back to normal:
```
User: Hello, Llama! How are you?
Llama: I'm doing great, thank you for asking! And how about yourself?
```
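The working rebuild drops only the CLBlast flag (same assumptions as in the sketch above):

```sh
# Reconfigure without -DLLAMA_CLBLAST=ON; Metal stays enabled.
cmake -B build -DBUILD_SHARED_LIBS=ON -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release
```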
The results seem to depend on the `-ngl` setting; without `-ngl` set, the CLBlasted Llama seems to be responding fine, but with e.g. `-ngl 32`:
```
User: Hello, Llama! How are you?
Llama: Hiya! fine tuned ready readyreadyReadyReady readyyyahooahoooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
```
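A minimal way to compare the two behaviours (the model path is a placeholder):

```sh
# No -ngl: no layers are offloaded through OpenCL, and replies look sane.
./build/bin/server --no-mmap -m /path/to/model.gguf

# With layers offloaded (16 or 32 in my tests), replies degrade and
# "buffer is nil" floods the console.
./build/bin/server -ngl 32 --no-mmap -m /path/to/model.gguf
```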
Environment and Context
llama.cpp version: b1428 + 1 commit (build 1429, commit `00ae2aa`, per the log above)