Conversation
Add fp32 fp16 matmul shader
Fix matmul shader alignment
slaren approved these changes on May 18, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026
* Fix empty Vulkan host buffers
  Add fp32 fp16 matmul shader
  Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request on Apr 28, 2026
* Fix empty Vulkan host buffers
  Add fp32 fp16 matmul shader
  Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers
jolexxa added a commit to superbarn-ai/llama.cpp that referenced this pull request on Apr 28, 2026
A coherent set of fixes to the no_alloc=true code path so prediction tooling can report byte-exact memory sizes without allocating real backend buffers. Discovered while building a per-cell regression suite (real-load vs dry-run) for cow_fit; each fix is independently justified.

Upstream-touching changes:

ggml-backend-impl.h, ggml-backend.h, ggml-backend.cpp: New optional iface method `get_alloc_size_for_buffer(buft, size)` on `ggml_backend_buffer_type_i`. Defaults to identity. Lets size-only paths model per-allocation overhead that backends apply inside `alloc_buffer` (e.g. Vulkan_Host's `size += 32` safety pad from PR ggml-org#7360 "Fix empty Vulkan host buffers"). Public wrapper `ggml_backend_buft_get_alloc_size_for_buffer` mirrors `ggml_backend_buft_alloc_buffer`'s size==0 short-circuit.

ggml-backend.cpp (ggml_backend_sched_reserve_size): Reset pattern matches `ggml_backend_sched_reserve` (synchronize → split_graph → reserve → reset) instead of reset-at-start. The previous leading reset wiped `hv_tensor_copies` state that `split_graph` relies on for pipeline_parallel input tensor duplicate tracking, causing the size-only path's chunk allocations to diverge from the real-alloc path on multi-GPU layouts (+12 MiB on the second GPU buft for the Qwen2-1.5B ngl=999/ctx=2048 reproducer).

ggml-alloc.c: Shadow vbuffer (`ggml_vbuffer_alloc_shadow`) — same shape as `ggml_vbuffer_alloc` but with no backing memory. Each chunk is created via `ggml_backend_buffer_init` directly, bypassing the backend's `alloc_buffer`. Chunks have correct declared sizes via the new `get_alloc_size_for_buffer` API; iface methods are all NULL (free_buffer is gracefully no-op'd in `ggml_backend_buffer_free`; get_base returns NULL via the size>0 + null-iface fallback). `_reserve_n_impl` uses the shadow vbuffer in the `no_alloc=true` branch instead of `buffers[i]=NULL`. Without this, no_alloc=true loses all per-chunk state across calls — every call would re-realloc and the realloc-or-keep monotonic-max logic at lines 904-944 would not accumulate properly. `ggml_gallocr_reserve_n_size` is simplified to read sizes via `ggml_gallocr_get_buffer_size` (which reads the shadow vbuffer with the dedup logic that already existed for real-alloc).

ggml-vulkan.cpp: `ggml_backend_vk_host_buffer_type_get_alloc_size_for_buffer` returns size + 32 to mirror `alloc_buffer`'s safety pad.

llama-memory-recurrent.cpp: Adds a `no_alloc=true` dummy-buffer path mirroring the existing pattern in `llama-kv-cache.cpp:255-262`, and the matching `memory_breakdown` path that uses `ggml_backend_alloc_ctx_tensors_from_buft_size` instead of querying the dummy buffer. Without this, hybrid models (mamba+attention) always real-allocated their recurrent state buffers during prediction.

llama.h, llama-context.cpp: `llama_pipeline_parallel_type` tristate enum (AUTO/DISABLED/ENABLED) plus a `pipeline_parallel_type` field on `llama_context_params`, mirroring `llama_flash_attn_type`'s pattern. Default AUTO preserves existing auto-detection behavior; DISABLED skips the prerequisite check; ENABLED warns if prerequisites force it off. Lets memory-prediction tooling explicitly request the post-fallback configuration without relying on real-alloc's compute-alloc-fail fallback at llama-context.cpp:563-568.

llama-context.cpp init: Pass the tmp_sizes ptr to all 3 graph_reserve calls (PP, TG, final-PP) in `no_alloc=true` mode so each invokes `_reserve_n_impl` and the shadow vbuffer accumulates per-chunk maxes correctly.
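The buffer-type size hook described in the ggml-backend items above lends itself to a short illustration. The following is a minimal C sketch, not the actual patch: only the identifiers quoted in the message (`get_alloc_size_for_buffer`, `ggml_backend_buffer_type_i`, `ggml_backend_buft_get_alloc_size_for_buffer`) are taken from it, while the struct layout, exact signatures, and the Vulkan_Host override are assumptions made for illustration.

```c
// Hypothetical sketch of the optional per-buffer-type size hook.
// Struct layout and signatures are illustrative, not the real ggml headers.
#include <stddef.h>

typedef struct ggml_backend_buffer_type * ggml_backend_buffer_type_t;

struct ggml_backend_buffer_type_i {
    // ... existing methods (get_name, alloc_buffer, get_alignment, ...) ...

    // optional: size of a whole buffer allocation, including any overhead that
    // alloc_buffer would add internally; NULL means identity (no overhead)
    size_t (*get_alloc_size_for_buffer)(ggml_backend_buffer_type_t buft, size_t size);
};

struct ggml_backend_buffer_type {
    struct ggml_backend_buffer_type_i iface;
    void * context;
};

// public wrapper: mirrors the size == 0 short-circuit of buffer allocation,
// so empty buffers never consult the backend hook
size_t ggml_backend_buft_get_alloc_size_for_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    if (size == 0) {
        return 0;
    }
    if (buft->iface.get_alloc_size_for_buffer == NULL) {
        return size; // default: identity, no per-allocation overhead
    }
    return buft->iface.get_alloc_size_for_buffer(buft, size);
}

// example backend override in the spirit of the Vulkan_Host change above:
// report the 32-byte safety pad that the real alloc_buffer applies
static size_t vk_host_get_alloc_size_for_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    (void) buft;
    return size + 32;
}
```

Under these assumptions, a size-only prediction path can ask each buffer type how large a real allocation would turn out without ever calling `alloc_buffer`.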
Also in llama-context.cpp, drop the `if (!model.hparams.no_alloc)` guard at the post-init read — the value is correct in both modes now.

llama-context.h: Public `get_backend_buf_exp_size()` accessor for memory-prediction tooling.

Validation: a 328-cell regression suite (25 models × 2 ngl modes × 8 ctx sizes) runs both real-load and dry-run breakdowns and asserts a per-cell, per-field, per-device byte-exact match. 100% pass rate on Linux/Vulkan with these patches; cow_fit_predict (the tooling that exercises this path) no longer allocates any compute, KV, or recurrent state buffers. No behavior change for the `no_alloc=false` (real-load + inference) paths.
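For the tristate control described in the llama.h/llama-context.cpp item above, here is a hedged sketch of what the public surface might look like, following the AUTO/DISABLED/ENABLED pattern the message attributes to `llama_flash_attn_type`. Only `llama_pipeline_parallel_type` and `pipeline_parallel_type` come from the message; the enumerator names, values, and the illustrative params struct are assumptions.

```c
// Hypothetical shape of the tristate pipeline-parallel control; enumerator
// names/values and the example params struct are assumptions, not the actual
// llama.h declarations.
enum llama_pipeline_parallel_type {
    LLAMA_PIPELINE_PARALLEL_TYPE_AUTO     = -1, // keep existing auto-detection
    LLAMA_PIPELINE_PARALLEL_TYPE_DISABLED =  0, // skip the prerequisite check
    LLAMA_PIPELINE_PARALLEL_TYPE_ENABLED  =  1, // warn if prerequisites force it off
};

// illustrative subset of context params carrying the new field
struct example_context_params {
    enum llama_pipeline_parallel_type pipeline_parallel_type; // default: AUTO
    // ... other context parameters ...
};
```

Memory-prediction tooling would set the field explicitly up front to pin down the post-fallback configuration, instead of depending on the compute-alloc-fail fallback that only fires during a real allocation.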
Fixes #7130