upstream by visorcraft · Pull Request #20 · visorcraft/llama.cpp

visorcraft · 2026-03-09T13:02:58Z

* vulkan: Fix data races in coopmat1 mul_mat(_id)
  Add barriers between coopmat store and regular loads. We sort of got away with this because it was the same subgroup accessing the values, but it's still a race and may not work.
* switch to subgroup control barriers

* common : handle incomplete UTF-8 at end of input in PEG parser
* cont : if reached end prematurely, emit needs_more_input to propagate partial output
* cont: refactor peg parse context to add lenient flag
* cont : remove partial flag, keep lenient flag
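The first item above concerns input that ends partway through a multi-byte UTF-8 character. As a rough illustration of that situation only (not the code from this commit), the sketch below counts how many trailing bytes belong to an unfinished UTF-8 sequence. The helper name `incomplete_utf8_tail` is hypothetical and does not come from llama.cpp; a parser could plausibly use such a count to hold back the partial bytes and wait for more input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not part of llama.cpp.
// Returns the number of trailing bytes of `s` that start a multi-byte UTF-8
// sequence whose remaining bytes are missing. Returns 0 when the input ends
// on a complete character, or on bytes that more input could not complete.
static size_t incomplete_utf8_tail(const std::string & s) {
    const size_t n = s.size();
    // A UTF-8 sequence is at most 4 bytes, so a truncated one is at most 3.
    for (size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte (10xxxxxx): keep scanning backwards
        }
        size_t need = 0; // total length announced by the lead byte
        if ((c & 0xE0) == 0xC0) {
            need = 2;
        } else if ((c & 0xF0) == 0xE0) {
            need = 3;
        } else if ((c & 0xF8) == 0xF0) {
            need = 4;
        } else {
            return 0; // ASCII or an invalid lead byte
        }
        return back < need ? back : 0;
    }
    return 0; // empty input, or three continuation bytes closing a 4-byte char
}
```

With this sketch, a buffer holding only the first two bytes of a three-byte character reports a tail of 2, while the same buffer with all three bytes reports 0.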

…better shader parameter handling (ggml-org#20173)

* K quant speedup (#20)
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* no gibberish, all k quants added, merged
* vec memory fix
* q6_k matching metal on my machine, tests passing
* Set tile size for q6_k separately
* Separate out fast shaders

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

* Move towards writeBuffer for params
* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups
* Remove extra file
* Formatting

---------

Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>

arthw and others added 22 commits March 8, 2026 12:00

[SYCL] support Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190)

213c4a0

* support flash-attention for fp32/fp16/Q4/Q5/Q8
* rm warning
* update for JIT

server : correct index on finish in OAI completion streams (#20226)

ff52ee9

Revert to OAI-compatible args (#20213)

b283f6d

* Revert to OAI-compatible args
* Apply workaround::func_args_not_string

readme : update infra list (#20212)

a950479

llama: end-to-end tests (#19802)

a976ff0

* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments

ggml-vulkan: Add ELU op support (#20183)

d088d5b

* ggml-Vulkan: add ELU support
* ggml-Vulkan: remove extra spaces and variables
* ggml-Vulkan: fix format issue
* ggml-Vulkan: fix format issue
* fix whitespace issue
* Update Vulkan.csv and ops.md
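For readers unfamiliar with the operator, ELU (exponential linear unit) passes positive inputs through unchanged and maps non-positive inputs to alpha * (exp(x) - 1). The scalar reference below is only a sketch for checking values by hand. It assumes alpha = 1, and the name `elu_ref` is hypothetical; it is not the Vulkan shader added in this commit.

```cpp
#include <cmath>

// Hypothetical scalar reference for ELU; alpha is assumed to be 1 here.
// For x > 0 the input is returned as-is; otherwise exp(x) - 1 is returned.
// std::expm1 keeps precision for inputs close to zero.
static float elu_ref(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```

A reference like this can be compared element by element against a backend's output when validating a new op implementation.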

Fix structured outputs (#20223)

62b8143

* Fix structured outputs
* Update common/chat-auto-parser-generator.cpp

Co-authored-by: Aldehir Rojas <hello@alde.dev>

Fix compile bug (#20203)

9b24886

* Fix compile bug
* Update common/chat-auto-parser-helpers.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

graph : remove redundant scale_w parameter (#20235)

35bee03

server : do not create checkpoints right after mtmd chunks (#20232)

d417bc4

PEG parser for LFM2 (#20251)

97c64fb

* PEG parser for LFM2
* Simplify using python_value()

llama-bench: introduce -hf and -hff flags & use --mmap 1 by default (#20211)

ae87863

cuda : display total and free VRAM capacity during device initialization (#20185)

5f4cdac

vulkan: skip zero size tensors in backend copies (#20233)

b2f460b

ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (#20219)

0beb8db

contributing: limit open PRs for new contributors to 1 (#20036)

e2763a6

llama-quant : left-align tensor names in output (#20117)

b518195

ggml-cuda: disable gdn for musa (#20278)

e8bbc73

server : add kill switch when server is stuck (#20277)

107d599

models : fix assert in mamba2 graph (#20270)

43e1cbd

visorcraft merged commit 536c449 into visorcraft:fix/hybrid-cache-reuse Mar 9, 2026

visorcraft added a commit that referenced this pull request Mar 9, 2026

Merge pull request #21 from visorcraft/fix/hybrid-cache-reuse

8dc6cbf

Merge pull request #20 from ggml-org/master

visorcraft merged 22 commits into visorcraft:fix/hybrid-cache-reuse from ggml-org:master

17 participants