Conversation
Add fp32 fp16 matmul shader
Fix matmul shader alignment
slaren approved these changes on May 18, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026
* Fix empty Vulkan host buffers
  Add fp32 fp16 matmul shader
  Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request on Apr 28, 2026
* Fix empty Vulkan host buffers
  Add fp32 fp16 matmul shader
  Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers
jolexxa added a commit to superbarn-ai/llama.cpp that referenced this pull request on Apr 28, 2026
A coherent set of fixes to the no_alloc=true code path so prediction tooling can report byte-exact memory sizes without allocating real backend buffers. Discovered while building a per-cell regression suite (real-load vs dry-run) for cow_fit; each fix is independently justified.

Upstream-touching changes:

ggml-backend-impl.h, ggml-backend.h, ggml-backend.cpp: New optional iface method `get_alloc_size_for_buffer(buft, size)` on `ggml_backend_buffer_type_i`. Defaults to identity. Lets size-only paths model per-allocation overhead that backends apply inside `alloc_buffer` (e.g. Vulkan_Host's `size += 32` safety pad from PR ggml-org#7360 "Fix empty Vulkan host buffers"). Public wrapper `ggml_backend_buft_get_alloc_size_for_buffer` mirrors `ggml_backend_buft_alloc_buffer`'s size==0 short-circuit.

ggml-backend.cpp (ggml_backend_sched_reserve_size): Reset pattern matches `ggml_backend_sched_reserve` (synchronize → split_graph → reserve → reset) instead of reset-at-start. The previous leading reset wiped `hv_tensor_copies` state that `split_graph` relies on for pipeline_parallel input tensor duplicate tracking, causing the size-only path's chunk allocations to diverge from the real-alloc path on multi-GPU layouts (+12 MiB on the second GPU buft for the Qwen2-1.5B ngl=999/ctx=2048 reproducer).

ggml-alloc.c: Shadow vbuffer (`ggml_vbuffer_alloc_shadow`) — same shape as `ggml_vbuffer_alloc` but with no backing memory. Each chunk is created via `ggml_backend_buffer_init` directly, bypassing the backend's `alloc_buffer`. Chunks have correct declared sizes via the new `get_alloc_size_for_buffer` API; iface methods are all NULL (free_buffer is gracefully no-op'd in `ggml_backend_buffer_free`; get_base returns NULL via the size>0 + null-iface fallback). `_reserve_n_impl` uses the shadow vbuffer in the `no_alloc=true` branch instead of `buffers[i]=NULL`. Without this, no_alloc=true loses all per-chunk state across calls — every call would re-realloc and the realloc-or-keep monotonic-max logic at lines 904-944 would not accumulate properly. `ggml_gallocr_reserve_n_size` is simplified to read sizes via `ggml_gallocr_get_buffer_size` (which reads the shadow vbuffer with the dedup logic that already existed for real-alloc).

ggml-vulkan.cpp: `ggml_backend_vk_host_buffer_type_get_alloc_size_for_buffer` returns size + 32 to mirror `alloc_buffer`'s safety pad.

llama-memory-recurrent.cpp: Adds a `no_alloc=true` dummy-buffer path mirroring the existing pattern in `llama-kv-cache.cpp:255-262`, and the matching `memory_breakdown` path that uses `ggml_backend_alloc_ctx_tensors_from_buft_size` instead of querying the dummy buffer. Without this, hybrid models (mamba+attention) always real-allocated their recurrent state buffers during prediction.

llama.h, llama-context.cpp: `llama_pipeline_parallel_type` tristate enum (AUTO/DISABLED/ENABLED) plus a `pipeline_parallel_type` field on `llama_context_params`, mirroring `llama_flash_attn_type`'s pattern. Default AUTO preserves existing auto-detection behavior; DISABLED skips the prerequisite check; ENABLED warns if prerequisites force it off. Lets memory-prediction tooling explicitly request the post-fallback configuration without relying on real-alloc's compute-alloc-fail fallback at llama-context.cpp:563-568.

llama-context.cpp init: Pass the tmp_sizes ptr to all 3 graph_reserve calls (PP, TG, final-PP) in `no_alloc=true` mode so each invokes `_reserve_n_impl` and the shadow vbuffer accumulates per-chunk maxes correctly.
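The buffer-type size hook described in the ggml-backend items above lends itself to a short illustration. The following is a minimal C sketch, not the actual patch: only the identifiers quoted in the message (`get_alloc_size_for_buffer`, `ggml_backend_buffer_type_i`, `ggml_backend_buft_get_alloc_size_for_buffer`) are taken from it, while the struct layout, exact signatures, and the Vulkan_Host override are assumptions made for illustration.

```c
// Hypothetical sketch of the optional per-buffer-type size hook.
// Struct layout and signatures are illustrative, not the real ggml headers.
#include <stddef.h>

typedef struct ggml_backend_buffer_type * ggml_backend_buffer_type_t;

struct ggml_backend_buffer_type_i {
    // ... existing methods (get_name, alloc_buffer, get_alignment, ...) ...

    // optional: size of a whole buffer allocation, including any overhead that
    // alloc_buffer would add internally; NULL means identity (no overhead)
    size_t (*get_alloc_size_for_buffer)(ggml_backend_buffer_type_t buft, size_t size);
};

struct ggml_backend_buffer_type {
    struct ggml_backend_buffer_type_i iface;
    void * context;
};

// public wrapper: mirrors the size == 0 short-circuit of buffer allocation,
// so empty buffers never consult the backend hook
size_t ggml_backend_buft_get_alloc_size_for_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    if (size == 0) {
        return 0;
    }
    if (buft->iface.get_alloc_size_for_buffer == NULL) {
        return size; // default: identity, no per-allocation overhead
    }
    return buft->iface.get_alloc_size_for_buffer(buft, size);
}

// example backend override in the spirit of the Vulkan_Host change above:
// report the 32-byte safety pad that the real alloc_buffer applies
static size_t vk_host_get_alloc_size_for_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    (void) buft;
    return size + 32;
}
```

Under these assumptions, a size-only prediction path can ask each buffer type how large a real allocation would turn out without ever calling `alloc_buffer`.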
Also in llama-context.cpp, drop the `if (!model.hparams.no_alloc)` guard at the post-init read — the value is correct in both modes now.

llama-context.h: Public `get_backend_buf_exp_size()` accessor for memory-prediction tooling.

Validation: a 328-cell regression suite (25 models × 2 ngl modes × 8 ctx sizes) runs both real-load and dry-run breakdowns and asserts a per-cell, per-field, per-device byte-exact match. 100% pass rate on Linux/Vulkan with these patches; cow_fit_predict (the tooling that exercises this path) no longer allocates any compute, KV, or recurrent state buffers. No behavior change for the `no_alloc=false` (real-load + inference) paths.
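For the tristate control described in the llama.h/llama-context.cpp item above, here is a hedged sketch of what the public surface might look like, following the AUTO/DISABLED/ENABLED pattern the message attributes to `llama_flash_attn_type`. Only `llama_pipeline_parallel_type` and `pipeline_parallel_type` come from the message; the enumerator names, values, and the illustrative params struct are assumptions.

```c
// Hypothetical shape of the tristate pipeline-parallel control; enumerator
// names/values and the example params struct are assumptions, not the actual
// llama.h declarations.
enum llama_pipeline_parallel_type {
    LLAMA_PIPELINE_PARALLEL_TYPE_AUTO     = -1, // keep existing auto-detection
    LLAMA_PIPELINE_PARALLEL_TYPE_DISABLED =  0, // skip the prerequisite check
    LLAMA_PIPELINE_PARALLEL_TYPE_ENABLED  =  1, // warn if prerequisites force it off
};

// illustrative subset of context params carrying the new field
struct example_context_params {
    enum llama_pipeline_parallel_type pipeline_parallel_type; // default: AUTO
    // ... other context parameters ...
};
```

Memory-prediction tooling would set the field explicitly up front to pin down the post-fallback configuration, instead of depending on the compute-alloc-fail fallback that only fires during a real allocation.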
Fixes #7130