Expected Behavior
I own a MacBook Pro M2 with 32 GB of memory and am trying to run inference with a 33B model. Without Metal (i.e., without the -ngl 1 flag) this works fine, and 13B models also work fine both with and without Metal. There is sufficient free memory available, so I would expect the 33B model to load and run with Metal enabled as well.
Current Behavior

With Metal enabled (-ngl 1), inference always fails; the buffer allocation is rejected at the end of the log below:
> llama.cpp git:(master) ./main -m ~/dev2/text-generation-webui/models/guanaco-33B.ggmlv3.q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 1
main: build = 661 (fa84c4b)
main: seed = 1686556467
llama.cpp: loading model from /Users/jp/dev2/models/guanaco-33B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0,13 MB
llama_model_load_internal: mem required = 19756,66 MB (+ 3124,00 MB per state)
.
llama_init_from_file: kv self size = 780,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/jp/dev2/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x139809af0
ggml_metal_init: loaded kernel_mul 0x13980a210
ggml_metal_init: loaded kernel_mul_row 0x138f05fa0
ggml_metal_init: loaded kernel_scale 0x138f06430
ggml_metal_init: loaded kernel_silu 0x13980a610
ggml_metal_init: loaded kernel_relu 0x13980ac50
ggml_metal_init: loaded kernel_gelu 0x138f06830
ggml_metal_init: loaded kernel_soft_max 0x126204210
ggml_metal_init: loaded kernel_diag_mask_inf 0x126204c70
ggml_metal_init: loaded kernel_get_rows_f16 0x138f07030
ggml_metal_init: loaded kernel_get_rows_q4_0 0x138f07830
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13980b4b0
ggml_metal_init: loaded kernel_get_rows_q2_k 0x13980bcb0
ggml_metal_init: loaded kernel_get_rows_q4_k 0x138f07ed0
ggml_metal_init: loaded kernel_get_rows_q6_k 0x138f08690
ggml_metal_init: loaded kernel_rms_norm 0x138f08dd0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x138f09870
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13980c750
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13980d050
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x138f09fd0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x138f0a8d0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x13980da40
ggml_metal_init: loaded kernel_rope 0x126205440
ggml_metal_init: loaded kernel_cpy_f32_f16 0x13980e6b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x13980f130
ggml_metal_add_buffer: buffer 'data' size 18300780544 is larger than buffer maximum of 17179869184
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model '/Users/jp/dev2/models/guanaco-33B.ggmlv3.q4_0.bin'
main: error: unable to load model
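For scale: the rejected 'data' buffer of 18300780544 bytes is about 17.04 GiB, while the reported maximum of 17179869184 bytes is exactly 16 GiB (16 × 1024³), so the q4_0 33B weights overshoot the limit by roughly 1 GiB.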
Is this known/expected, and are there any workarounds? The mentioned "buffer maximum" of 17179869184 stays the same regardless of how much memory is free.
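The constant 16 GiB cap looks like Metal's per-buffer allocation limit, MTLDevice.maxBufferLength, which is presumably what ggml_metal_add_buffer checks before creating the buffer. That limit is a fixed property of the device, not a function of free memory, which would explain why it never changes. A minimal Swift sketch (my own, not code from llama.cpp) to print the relevant device limits on a given Mac:

```swift
import Metal

// Minimal sketch: print the Metal device limits relevant to the error above.
// maxBufferLength caps the size of a single MTLBuffer allocation;
// recommendedMaxWorkingSetSize is the suggested total GPU working-set budget.
// Both are device properties and do not change with free memory.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("no Metal device found")
}
print("GPU:                          \(device.name)")
print("maxBufferLength:              \(device.maxBufferLength) bytes")  // presumably 17179869184 here
print("recommendedMaxWorkingSetSize: \(device.recommendedMaxWorkingSetSize) bytes")
```

If maxBufferLength is indeed the binding constraint, a workaround would have to shrink the single weights buffer below 16 GiB (for example, a more aggressive quantization) or split the weights across several Metal buffers.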