Expected Behavior
I own a MacBook Pro M2 with 32 GB of memory and am trying to run inference with a 33B model. Without Metal (i.e., without the -ngl 1 flag) this works fine, and 13B models also work fine both with and without Metal. There is sufficient free memory available, so I would expect the 33B model to load and run with Metal enabled as well.
Current Behavior

With Metal enabled (-ngl 1), inference always fails; the buffer allocation is rejected at the end of the log below:
> llama.cpp git:(master) ./main -m ~/dev2/text-generation-webui/models/guanaco-33B.ggmlv3.q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 1
main: build = 661 (fa84c4b)
main: seed = 1686556467
llama.cpp: loading model from /Users/jp/dev2/models/guanaco-33B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0,13 MB
llama_model_load_internal: mem required = 19756,66 MB (+ 3124,00 MB per state)
.
llama_init_from_file: kv self size = 780,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/jp/dev2/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x139809af0
ggml_metal_init: loaded kernel_mul 0x13980a210
ggml_metal_init: loaded kernel_mul_row 0x138f05fa0
ggml_metal_init: loaded kernel_scale 0x138f06430
ggml_metal_init: loaded kernel_silu 0x13980a610
ggml_metal_init: loaded kernel_relu 0x13980ac50
ggml_metal_init: loaded kernel_gelu 0x138f06830
ggml_metal_init: loaded kernel_soft_max 0x126204210
ggml_metal_init: loaded kernel_diag_mask_inf 0x126204c70
ggml_metal_init: loaded kernel_get_rows_f16 0x138f07030
ggml_metal_init: loaded kernel_get_rows_q4_0 0x138f07830
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13980b4b0
ggml_metal_init: loaded kernel_get_rows_q2_k 0x13980bcb0
ggml_metal_init: loaded kernel_get_rows_q4_k 0x138f07ed0
ggml_metal_init: loaded kernel_get_rows_q6_k 0x138f08690
ggml_metal_init: loaded kernel_rms_norm 0x138f08dd0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x138f09870
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13980c750
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13980d050
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x138f09fd0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x138f0a8d0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x13980da40
ggml_metal_init: loaded kernel_rope 0x126205440
ggml_metal_init: loaded kernel_cpy_f32_f16 0x13980e6b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x13980f130
ggml_metal_add_buffer: buffer 'data' size 18300780544 is larger than buffer maximum of 17179869184
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model '/Users/jp/dev2/models/guanaco-33B.ggmlv3.q4_0.bin'
main: error: unable to load model
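For scale: the rejected 'data' buffer of 18300780544 bytes is about 17.04 GiB, while the reported maximum of 17179869184 bytes is exactly 16 GiB (16 × 1024³), so the q4_0 33B weights overshoot the limit by roughly 1 GiB.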
Is this known/expected, and are there any workarounds? The mentioned "buffer maximum" of 17179869184 stays the same regardless of how much memory is free.
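The constant 16 GiB cap looks like Metal's per-buffer allocation limit, MTLDevice.maxBufferLength, which is presumably what ggml_metal_add_buffer checks before creating the buffer. That limit is a fixed property of the device, not a function of free memory, which would explain why it never changes. A minimal Swift sketch (my own, not code from llama.cpp) to print the relevant device limits on a given Mac:

```swift
import Metal

// Minimal sketch: print the Metal device limits relevant to the error above.
// maxBufferLength caps the size of a single MTLBuffer allocation;
// recommendedMaxWorkingSetSize is the suggested total GPU working-set budget.
// Both are device properties and do not change with free memory.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("no Metal device found")
}
print("GPU:                          \(device.name)")
print("maxBufferLength:              \(device.maxBufferLength) bytes")  // presumably 17179869184 here
print("recommendedMaxWorkingSetSize: \(device.recommendedMaxWorkingSetSize) bytes")
```

If maxBufferLength is indeed the binding constraint, a workaround would have to shrink the single weights buffer below 16 GiB (for example, a more aggressive quantization) or split the weights across several Metal buffers.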