
Vulkan Embedding Fix #7360

Merged
0cc4m merged 4 commits into master from 0cc4m/vulkan-embedding-fix on May 19, 2024
Conversation

0cc4m (Contributor) commented May 18, 2024

  • Fix empty Vulkan host buffers (sketched below)
  • Fix embedding calls by adding an fp16/fp32 matmul shader
  • Fix matmul shader alignment
  • Remove deprecated tensor->backend uses in Vulkan code

Fixes #7130 (Embedding fails to run on vulkan backend)
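
The shape of the empty-host-buffer fix is roughly as sketched below. This is illustrative only, not the exact diff: the +32 safety pad is confirmed by the commit message quoted further down this page, while the allocation helper and iface names here are placeholders.

```cpp
// Illustrative sketch: pad every Vulkan host allocation so that a
// zero-byte request still produces a real, mappable buffer.
static ggml_backend_buffer_t vk_host_buffer_type_alloc_buffer(
        ggml_backend_buffer_type_t buft, size_t size) {
    size += 32;                        // safety pad: never create an empty buffer
    void * ptr = vk_host_malloc(size); // placeholder allocation helper
    if (ptr == nullptr) {
        return nullptr;                // caller handles the fallback
    }
    // vk_host_buffer_iface is a placeholder for the backend's buffer iface
    return ggml_backend_buffer_init(buft, vk_host_buffer_iface, ptr, size);
}
```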

0cc4m added 2 commits May 18, 2024 09:51
Add fp32 fp16 matmul shader

Fix matmul shader alignment
@mofosyne added labels on May 18, 2024: Vulkan (issues specific to the Vulkan backend), bugfix (fixes an issue or bug), Review Complexity: High (generally requires in-depth knowledge of LLMs or GPUs)
The github-actions bot added the python (python script changes) label on May 18, 2024
0cc4m merged commit f030ec1 into master on May 19, 2024
0cc4m deleted the 0cc4m/vulkan-embedding-fix branch on May 19, 2024 at 15:19
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* Fix empty Vulkan host buffers

Add fp32 fp16 matmul shader

Fix matmul shader alignment

* Remove deprecated tensor->backend uses

* Fix Vulkan validation errors on embedding models with no offloaded layers

* Fix Vulkan llava segfault when not offloading layers
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
jolexxa added a commit to superbarn-ai/llama.cpp that referenced this pull request Apr 28, 2026
A coherent set of fixes to the no_alloc=true code path so prediction
tooling can report byte-exact memory sizes without allocating real
backend buffers. Discovered while building a per-cell regression suite
(real-load vs dry-run) for cow_fit; each fix is independently
justified.

Upstream-touching changes:

ggml-backend-impl.h, ggml-backend.h, ggml-backend.cpp:
  New optional iface method `get_alloc_size_for_buffer(buft, size)` on
  `ggml_backend_buffer_type_i`. Defaults to identity. Lets size-only
  paths model per-allocation overhead that backends apply inside
  `alloc_buffer` (e.g. Vulkan_Host's `size += 32` safety pad from
  PR ggml-org#7360 "Fix empty Vulkan host buffers"). Public wrapper
  `ggml_backend_buft_get_alloc_size_for_buffer` mirrors
  `ggml_backend_buft_alloc_buffer`'s size==0 short-circuit.
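
A minimal sketch of the described API, assuming the usual ggml iface conventions; only the names quoted above come from the change itself, and the field placement and wrapper body are assumptions:

```cpp
// ggml-backend-impl.h: optional method, NULL means identity.
struct ggml_backend_buffer_type_i {
    // ... existing methods ...
    // report the true allocation size for a whole buffer of `size` bytes,
    // including any per-allocation overhead applied inside alloc_buffer
    size_t (*get_alloc_size_for_buffer)(ggml_backend_buffer_type_t buft,
                                        size_t size);
};

// ggml-backend.cpp: public wrapper with the same size==0 short-circuit
// as ggml_backend_buft_alloc_buffer.
size_t ggml_backend_buft_get_alloc_size_for_buffer(
        ggml_backend_buffer_type_t buft, size_t size) {
    if (size == 0) {
        return 0;    // mirror the alloc path's short-circuit
    }
    if (buft->iface.get_alloc_size_for_buffer == NULL) {
        return size; // default: identity, no overhead
    }
    return buft->iface.get_alloc_size_for_buffer(buft, size);
}
```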

ggml-backend.cpp (ggml_backend_sched_reserve_size):
  Reset pattern matches `ggml_backend_sched_reserve` (synchronize →
  split_graph → reserve → reset) instead of reset-at-start. The
  previous leading reset wiped `hv_tensor_copies` state that
  `split_graph` relies on for pipeline_parallel input tensor duplicate
  tracking, causing the size-only path's chunk allocations to diverge
  from the real-alloc path on multi-GPU layouts (+12 MiB on the second
  GPU buft for Qwen2-1.5B at ngl=999/ctx=2048 reproducer).
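
In pseudocode, the corrected ordering looks like the sketch below; the split/reserve internals are assumed, only the synchronize/reset entry points and the ordering come from the description:

```cpp
// ggml_backend_sched_reserve_size, corrected ordering (sketch):
ggml_backend_sched_synchronize(sched);    // 1. drain in-flight work first
sched_split_graph(sched, measure_graph);  // 2. split while hv_tensor_copies
                                          //    from the last run is intact
sched_reserve_size(sched, sizes);         // 3. size-only reserve
ggml_backend_sched_reset(sched);          // 4. reset last, not first
```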

ggml-alloc.c:
  Shadow vbuffer (`ggml_vbuffer_alloc_shadow`) — same shape as
  `ggml_vbuffer_alloc` but with no backing memory. Each chunk created
  via `ggml_backend_buffer_init` directly, bypassing the backend's
  `alloc_buffer`. Chunks have correct declared sizes via the new
  `get_alloc_size_for_buffer` API; iface methods all NULL (free_buffer
  is gracefully no-op'd in `ggml_backend_buffer_free`; get_base returns
  NULL via the size>0 + null-iface fallback).
  `_reserve_n_impl` uses shadow vbuffer in the `no_alloc=true` branch
  instead of `buffers[i]=NULL`. Without this, no_alloc=true loses all
  per-chunk state across calls — every call would re-realloc and the
  realloc-or-keep monotonic-max logic at lines 904-944 would not
  accumulate properly.
  `ggml_gallocr_reserve_n_size` simplified to read sizes via
  `ggml_gallocr_get_buffer_size` (which reads the shadow vbuffer with
  the dedup logic that already existed for real-alloc).
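
A sketch of one shadow chunk under those rules; apart from the two function names quoted above, everything here (the chunk helper's name and signature in particular) is assumed:

```c
// Shadow chunk: correct declared size, all-NULL iface, no backing memory.
static ggml_backend_buffer_t ggml_vbuffer_alloc_shadow_chunk(
        ggml_backend_buffer_type_t buft, size_t size) {
    // route the request through the new overhead-aware size API
    size_t alloc_size = ggml_backend_buft_get_alloc_size_for_buffer(buft, size);
    // all-NULL iface: free_buffer is gracefully no-op'd in
    // ggml_backend_buffer_free, and get_base returns NULL for size > 0
    struct ggml_backend_buffer_i null_iface = {0};
    return ggml_backend_buffer_init(buft, null_iface, /*context*/ NULL, alloc_size);
}
```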

ggml-vulkan.cpp:
  `ggml_backend_vk_host_buffer_type_get_alloc_size_for_buffer` returns
  size + 32 to mirror `alloc_buffer`'s safety pad.
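
Per that description, the function is just the size-only twin of the pad in `alloc_buffer`; a plausible body (assumed, apart from the name and the +32):

```cpp
static size_t ggml_backend_vk_host_buffer_type_get_alloc_size_for_buffer(
        ggml_backend_buffer_type_t buft, size_t size) {
    GGML_UNUSED(buft);
    return size + 32; // mirror alloc_buffer's safety pad exactly
}
```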

llama-memory-recurrent.cpp:
  Adds a `no_alloc=true` dummy-buffer path mirroring the existing
  pattern in `llama-kv-cache.cpp:255-262` and the matching
  `memory_breakdown` path that uses
  `ggml_backend_alloc_ctx_tensors_from_buft_size` instead of querying
  the dummy buffer. Without this, hybrid models (mamba+attention)
  always real-allocated their recurrent state buffers during
  prediction.

llama.h, llama-context.cpp:
  `llama_pipeline_parallel_type` tristate enum (AUTO/DISABLED/ENABLED)
  + `pipeline_parallel_type` field on `llama_context_params`, mirroring
  `llama_flash_attn_type`'s pattern. Default AUTO preserves existing
  auto-detection behavior; DISABLED skips the prerequisite check;
  ENABLED warns if prerequisites force it off. Lets memory-prediction
  tooling explicitly request the post-fallback configuration without
  relying on real-alloc's compute-alloc-fail fallback at
  llama-context.cpp:563-568.
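
A sketch of the tristate as described; the exact enumerator spellings and values are assumptions, modeled on `llama_flash_attn_type`:

```cpp
enum llama_pipeline_parallel_type {
    LLAMA_PIPELINE_PARALLEL_TYPE_AUTO     = -1, // keep existing auto-detection
    LLAMA_PIPELINE_PARALLEL_TYPE_DISABLED =  0, // skip the prerequisite check
    LLAMA_PIPELINE_PARALLEL_TYPE_ENABLED  =  1, // warn if prerequisites force it off
};

struct llama_context_params {
    // ... existing fields ...
    enum llama_pipeline_parallel_type pipeline_parallel_type; // default: AUTO
};
```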

llama-context.cpp init:
  Pass tmp_sizes ptr to all 3 graph_reserve calls (PP, TG, final-PP)
  in `no_alloc=true` mode so each invokes `_reserve_n_impl` and the
  shadow vbuffer accumulates per-chunk maxes correctly. Drop the
  `if (!model.hparams.no_alloc)` guard at the post-init read — value
  is correct in both modes now.

llama-context.h:
  Public `get_backend_buf_exp_size()` accessor for memory-prediction
  tooling.

Validation: 328-cell regression suite (25 models × 2 ngl modes × 8
ctx sizes) running both real-load and dry-run breakdowns and asserting
per-cell, per-field, per-device byte-exact match. 100% pass rate on
Linux/Vulkan with these patches; cow_fit_predict (tooling that
exercises this path) no longer allocates any compute, KV, or recurrent
state buffers.

No behavior change for `no_alloc=false` (real-load + inference) paths.