merge from upstream#95
l3utterfly merged 77 commits into layla-build on Apr 16, 2026
Conversation
* fix: enable reasoning budget sampler for gemma4

  Add thinking_start_tag and thinking_end_tag to common_chat_params_init_gemma4(). Without these, the reasoning budget sampler never activates for gemma4. Make the newline after "thought" optional in the PEG parser to handle budget=0, where the sampler forces the end tag before the newline. Add a test case for an empty thinking block.

  Fixes ggml-org#21487

* use p.space() instead of p.optional(p.literal("\n")) in the gemma4 thought parser
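The budget=0 case can be illustrated with a toy parser sketch (hypothetical names and a regex stand-in, not the actual common_chat PEG combinators): when the sampler forces the end tag immediately after "thought", no newline is emitted, so the grammar must treat the newline as optional.

```python
# Sketch: a "thought" block whose trailing newline is optional, so an
# empty thinking block (budget=0) still parses.
import re

# Hypothetical grammar: "thought", optional whitespace, body, end tag.
THOUGHT_RE = re.compile(r"thought\s*(?P<body>.*?)</thought>", re.S)

def parse_thinking(text: str):
    """Return the thinking body, or None if the block doesn't match."""
    m = THOUGHT_RE.match(text)
    return None if m is None else m.group("body")

# Normal case: newline after "thought".
assert parse_thinking("thought\nsome reasoning</thought>") == "some reasoning"
# budget=0: the end tag follows immediately -- an empty thinking block.
assert parse_thinking("thought</thought>") == ""
```

With a mandatory newline, the second input would fail to parse, which is exactly the bug the commit describes.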
* refactor: Build improvements
* chore: Formatting + package lock update
…ml-org#21670)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
I'm not sure what the purpose of keeping `--alias` was when using
`--models-preset`, but the result is really weird, as shown in the
following logs:
```
$ build/bin/llama-server --models-preset preset.ini --alias "Gemma 4 E4B UD Q8_K_XL"
...
init: using 31 threads for HTTP server
srv load_models: Loaded 2 cached model presets
srv load_models: Loaded 1 custom model presets from preset.ini
main: failed to initialize router models: alias 'Gemma 4 E4B UD Q8_K_XL' for model 'angt/test-split-model-stories260K:F32' conflicts with existing model name
```
So I propose to simply ignore `--alias` too in this case. With this
commit, the server starts in routing mode correctly.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
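The proposed precedence can be sketched as follows (`effective_alias` is a made-up helper, not the server's actual option handling): when a models preset is supplied, per-model aliases come from the preset file, so a global `--alias` is simply dropped.

```python
def effective_alias(cli_alias, models_preset):
    """Ignore a global --alias in routing mode: preset entries define
    their own aliases, so applying one alias to every model conflicts."""
    if models_preset is not None:
        return None
    return cli_alias

# With a preset, the CLI alias is ignored and the server can start.
assert effective_alias("Gemma 4 E4B UD Q8_K_XL", "preset.ini") is None
# Single-model mode keeps the alias as before.
assert effective_alias("my-model", None) == "my-model"
```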
…agment (ggml-org#21521)

* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after ggml-org#20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization
* Fix numerical precision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unnecessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidental removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops

---------
Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
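The NaN canonicalization for f16 lanes packed into integers can be illustrated in Python via the `struct` module's half-precision format (the bit constants are standard IEEE 754 binary16; the actual WGSL shader code differs, and `pack2_f16` is a made-up helper):

```python
import struct

F16_CANONICAL_NAN = 0x7E00  # quiet NaN with a clean payload

def f16_bits(x: float) -> int:
    """Round a Python float to f16 and return its 16 raw bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def canonicalize_f16(bits: int) -> int:
    """Replace any NaN encoding with one canonical pattern, so packed
    integer operations on f16 lanes behave deterministically."""
    exp_all_ones = (bits & 0x7C00) == 0x7C00
    mantissa_nonzero = (bits & 0x03FF) != 0
    return F16_CANONICAL_NAN if (exp_all_ones and mantissa_nonzero) else bits

def pack2_f16(a: float, b: float) -> int:
    """Pack two canonicalized f16 values into one u32 (low/high halves)."""
    return canonicalize_f16(f16_bits(a)) | (canonicalize_f16(f16_bits(b)) << 16)

assert canonicalize_f16(f16_bits(float("nan"))) == F16_CANONICAL_NAN
assert canonicalize_f16(f16_bits(1.0)) == 0x3C00   # 1.0 in binary16
assert canonicalize_f16(0x7C00) == 0x7C00          # infinity is preserved
assert pack2_f16(1.0, 1.0) == 0x3C003C00
```

Without canonicalization, different NaN payloads packed into the same integer word compare unequal, which is the kind of nondeterminism the fixes above address.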
…1705)

* hexagon: introduce op request batching and rewrite buffer management

  The host now prepares batches of requests and dispatches them via a single dspqueue message. Buffers are mapped explicitly by the NPU while processing batches.

* hex-dma: disable l2 bypass to work around a new issue due to no flushes between Ops
* hex-utils: add explicit l2flush and l2clear helpers
* hex-opreq: use fine-grained per-tensor l2 management
* hex-opreq: avoid redundant invalidates for tensors we already flushed
* hex-opreq: update debug messages
* htp-opreq: reuse ops_context
* hex-opreq: do not flush or invalidate cache lines beyond buffer boundary
* hex-opreq: fix errors in log message
* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundary"

  This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.

* hexagon: limit l2 flushes to 1MB, which covers the l2 cache
* hex-opreq: limit cache flush to 4MB

  It looks like 4MB of contiguous virtual space should cover the 1MB cache.

* hexagon: drop cache flush size to 2MB
* hex-opreq: start reworking opreq packing
* hex-opreq: introduce a new way of packing opbatch where tensors are stored separately
* hex-opreq: add a simple fastrpc call to force unmap of all buffers
* hex-l2flush: somehow 2MB does not seem robust; also clean up the step size to use line-size
* hex-opreq: bump opreq batch size to 256
* hex-mm: place src1 spad at the top of vtcm for easy reuse
* hex-ops: introduce internal types and disable src1 reuse for now

  Nothing new, just formalizing the repack / qyn.quant types we've been using.

* htp-opreq: use tensor pointers instead of copies
* hex-opreq: introduce a more robust way of tracking vtcm/spad reuse

  This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.

* hex-cumsum: fix error post opreq merge
* hex-opreq: move request batch handling into the session

  Prepping everything for using dspqueue buffers, and doing that inside the session is much cleaner.

* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
* hex-bufs: introduce pinned mmappings and use non-pinned ones for model buffers
* hex-buf: add support for allocating a shared/pinned buffer for opreqs
* hex-opbatch: make opbatches configurable
* hex-naming: better name for ggml_hexagon_shared_buffer
* hex-naming: add session->c_name() helper
* hex-opbatch: start using shm but still copy for now
* hex-opbatch: use shared buffer for packing opbatch
* hex-opbatch: better naming for opbatch-related classes and code
* hex-opbatch: reuse batched tensors with the same data/dims/strides
* hex-opbatch: update logging
* hex-opbatch: add support for a vmem limit for op batching
* hex-opbatch: update the htp side to properly support dynamic mmap/unmap
* hex-opbatch: add OB and OQ params to the run-completion script and fix the asserts in batch processing
* hex-opbatch: fix src1 handling in act ops
* hex-act: fix empty src1 handling in swiglu and friends

  Simplify the preamble macro while at it.

* hex-mm: minor fix for vtcm and dma handling in matmul

  Cleaning up some leftovers from merges.

* hex-opbatch: allocate an extra 1KB for dspqueue overhead
* hexagon: fix softmax for non-aligned tensors and clean up vtcm alloc
* hex-mm: properly handle the hmx_disabled flag
* hex-ops: update comments
* hex-ops: add debug output for get/set-rows
* hex-mmap: optimize un/mapping of buffers
* hex-opreq: global cache flush and invalidate beyond a 128KB threshold
* hex-ops: add a super simple opfilter regex for debugging

  If an Op matches the regex, the hex backend will reject it.

* hex-opbatch: wire up newer ops missed in the merge, and update the main switch to detect this in future
* hexagon: improved vtcm acquisition to remove inter-op overhead

  Fully compatible with QNN-HTP coex.

* hex-mm: fixed hvx fallback path
* hex-mm: lower the vmem threshold a bit further to ~3GB
* hexagon: update debug & error logs

  This also fixes an issue with newer llvm merging repack and non-repack functions. We use those pointers to distinguish between buffer types.

* hexagon: move ops context into main context

  Just a cleanup. We don't need separate contexts at this point.

* hex-opbatch: clean up naming and headers for opbatch and related descriptors
* hex-fa: it's now better to enable FA during TG to reduce graph splits
* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var

  It's no longer useful. Please use the more flexible GGML_HEXAGON_OPFILTER to disable Ops if needed for debugging or validation.

* hexagon: fix editorconfig check
* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
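The flush-size tuning in these commits (1MB, then 4MB, then 2MB, then a global flush beyond a 128KB threshold) amounts to one policy: flush line by line for small buffers, and fall back to a whole-cache flush once per-line flushing would cost more. A sketch with made-up constants (not the actual hex-utils helpers):

```python
L2_LINE_SIZE = 128                   # assumed bytes per L2 cache line
GLOBAL_FLUSH_THRESHOLD = 128 * 1024  # beyond this, one full flush is cheaper

def plan_flush(nbytes: int):
    """Return ('global', 0) for large buffers, else ('lines', n) where n
    is the number of cache lines to flush individually."""
    if nbytes > GLOBAL_FLUSH_THRESHOLD:
        return ("global", 0)
    n_lines = (nbytes + L2_LINE_SIZE - 1) // L2_LINE_SIZE  # round up
    return ("lines", n_lines)

assert plan_flush(4096) == ("lines", 32)
assert plan_flush(100) == ("lines", 1)
assert plan_flush(4 * 1024 * 1024) == ("global", 0)
```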
* hexagon: add support for debian on ex2
* hexagon: add -fvectorize to c/c++ cmake flags
* hexagon: remove trailing whitespace
* update onboarding steps
* hexagon: update linux setup documentation
* hexagon: update installation scripts
* hexagon: update docs
* hexagon: update onboarding scripts

---------
Co-authored-by: Zack Li <zackli@qti.qualcomm.com>
* opencl: add general q5_k mv
* opencl: add flattened Q5_K mv and general Q5_K mm
* opencl: fix Q5_K unit tests
* mtmd : add MERaLiON-2 multimodal audio support

  Adds support for A*STAR's MERaLiON-2 audio-language model (3B and 10B) to the multimodal framework.

  Architecture:
  - Whisper large-v2 encoder for audio feature extraction
  - Gated MLP adaptor: ln_speech -> frame stack (x15) -> Linear+SiLU -> GLU -> out_proj
  - Gemma2 3B / 27B decoder

  The mmproj GGUF is generated via convert_hf_to_gguf.py --mmproj on the full MERaLiON-2 model directory (architecture: MERaLiON2ForConditionalGeneration). The decoder is converted separately as a standard Gemma2 model after stripping the text_decoder. weight prefix.

  New projector type: PROJECTOR_TYPE_MERALION

  Supported tasks: speech transcription (EN/ZH/MS/TA), translation, spoken QA.

  Models:
  https://huggingface.co/MERaLiON/MERaLiON-2-3B
  https://huggingface.co/MERaLiON/MERaLiON-2-10B

* simplify comments in meralion adaptor
* meralion: use format_tensor_name, ascii arrows in comments
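The frame stack (x15) step in the adaptor can be sketched in plain Python (`stack_frames` is a made-up stand-in for the tensor reshape): concatenating 15 consecutive encoder frames shortens the sequence 15x while widening the feature dimension 15x before the Linear+SiLU and GLU.

```python
def stack_frames(frames, factor=15):
    """Concatenate `factor` consecutive feature frames into one wider
    frame, trimming any remainder: [T, D] -> [T // factor, D * factor]."""
    usable = (len(frames) // factor) * factor
    return [
        [x for frame in frames[i:i + factor] for x in frame]
        for i in range(0, usable, factor)
    ]

# 30 frames of 4 features -> 2 stacked frames of 60 features.
frames = [[float(t)] * 4 for t in range(30)]
stacked = stack_frames(frames)
assert len(stacked) == 2
assert len(stacked[0]) == 60
assert stacked[1][0] == 15.0  # second stacked frame starts at frame 15
```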
* docs: add guide on how to add multimodal support
* nits
* mtmd: add Gemma 4 audio conformer encoder support

  Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

  Architecture:
  - 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
  - Subsampling conv projection: 2x Conv2D(stride=2) with LayerNorm
  - Full self-attention with sinusoidal RPE and sliding window mask (24)
  - Logit softcapping at 50.0, ClippableLinear clamping
  - Output: 1024 → 1536 → RMSNorm → multimodal embedder

  Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
  - HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
  - Standard periodic Hann window (320 samples), zero-padded to FFT size
  - Semicausal left-padding (frame_length/2 samples)
  - Frame count matched to PyTorch (unfold formula)
  - No pre-emphasis, no Whisper-style normalization
  - Mel cosine similarity vs PyTorch: 0.9998

  Key fixes:
  - Tensor loading dedup: prevent get_tensor() from creating duplicate entries in ctx_data; fixed with a std::set guard.
  - ClippableLinear clamp_info loading moved after per-layer tensors.
  - Sliding window mask (24 positions) matching PyTorch context_size.
  - Skip Whisper normalization for Gemma4 mel output.

  Tested on E2B and E4B with CPU and Vulkan backends. Transcribes: "Glad to see things are going well and business is starting to pick up" (matching ground truth).

  Ref: ggml-org#21325
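Ignoring the semicausal left-padding, the "unfold formula" frame count follows torch.Tensor.unfold semantics: only full windows count and there is no implicit padding. A sketch, assuming 16 kHz audio with the 320-sample window mentioned above and a hypothetical 160-sample hop:

```python
def num_frames(n_samples: int, frame_length: int, hop_length: int) -> int:
    """Frame count per torch unfold semantics: full windows only,
    no implicit padding -- 0 when the signal is shorter than one frame."""
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // hop_length

# 1 s of 16 kHz audio, 320-sample (20 ms) window, 160-sample (10 ms) hop.
assert num_frames(16000, 320, 160) == 99
assert num_frames(320, 320, 160) == 1
assert num_frames(319, 320, 160) == 0
```

Matching this count exactly is what keeps the mel output aligned frame-for-frame with the PyTorch reference.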
* mtmd: add gemma 4 test (vision + audio)
* add to docs
* add qwen3a
* wip
* vision ok
* no more deepstack for audio
* convert ASR model ok
* qwen3 asr working
* Apply suggestions from code review

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits
* Apply suggestions from code review

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix bad merge
* fix multi inheritance

---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
…org#20627)

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
…gml-org#20633)

* ggml-cpu: add 128-bit impls for i-quants, ternary quants
* ggml-cpu: add 128-bit impls for iq2_xs, iq3_s, iq3_xxs, tq2_0

  Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

* ggml-cpu: refactor; add rvv checks

---------
Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai>
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
* nix: support unified apple-sdk
* Impl roll op for Metal
* Revert "nix: support unified apple-sdk"

  This reverts commit abfa473.

* update ops.md
* update op docs
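For reference, the roll op computes a circular shift: elements pushed off one end of an axis reappear at the other. A minimal 1-D sketch (plain Python, not the Metal kernel):

```python
def roll(xs, shift):
    """Circularly shift a 1-D sequence by `shift` positions; negative
    shifts rotate the other way, matching numpy/torch roll semantics."""
    n = len(xs)
    if n == 0:
        return xs
    shift %= n
    return xs[-shift:] + xs[:-shift] if shift else xs[:]

assert roll([1, 2, 3, 4, 5], 2) == [4, 5, 1, 2, 3]
assert roll([1, 2, 3, 4, 5], -1) == [2, 3, 4, 5, 1]
assert roll([1, 2, 3], 0) == [1, 2, 3]
```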
* ggml: add graph_reused
* use versioning instead of reuse flag
* increment version with atomic
* use top bits for split numbering
* add assert
* move counter to ggml.c
* set uid in split_graph only
* fix windows
* address further review comments
* get next_uid rather than doing bit manipulation
* rename + add comment about uid
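The versioning scheme these commits describe can be sketched as follows (hypothetical bit widths; the real counter lives in ggml.c and uses C atomics). Tagging each graph with a uid that combines the split number (top bits) and a monotonically increasing version means a stale cached graph never matches, replacing a fragile boolean reuse flag.

```python
import itertools
import threading

SPLIT_BITS = 8                       # assumed: top bits hold the split number
UID_BITS = 32
VERSION_BITS = UID_BITS - SPLIT_BITS
COUNTER_MASK = (1 << VERSION_BITS) - 1

_counter = itertools.count(1)        # next_uid source
_lock = threading.Lock()             # stands in for an atomic increment

def next_uid(split_no: int) -> int:
    """Return a fresh uid: split number in the top bits, atomically
    incremented version in the rest."""
    with _lock:
        version = next(_counter) & COUNTER_MASK
    return (split_no << VERSION_BITS) | version

a = next_uid(0)
b = next_uid(0)
assert a != b                              # a fresh version on every call
assert next_uid(3) >> VERSION_BITS == 3    # split number is recoverable
```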
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns
* fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code
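The trust_remote_code fallback can be sketched as a generic wrapper (the loader, exception type, and model id below are stand-ins, not the convert script's actual code): try a strict load first, and only defer to the repo's own code when the config pattern is unsupported.

```python
def load_with_fallback(load, model_id):
    """Try a strict load first; retry with trust_remote_code=True only
    when the config/tokenizer pattern is unsupported."""
    try:
        return load(model_id, trust_remote_code=False)
    except ValueError:
        # Unrecognized config pattern: defer to the repo's own code.
        return load(model_id, trust_remote_code=True)

# Tiny stand-in loader to demonstrate the control flow without downloads.
def fake_load(model_id, trust_remote_code=False):
    if model_id == "some/NemotronH-like-model" and not trust_remote_code:
        raise ValueError("unsupported config pattern")
    return (model_id, trust_remote_code)

assert load_with_fallback(fake_load, "gpt2") == ("gpt2", False)
assert load_with_fallback(fake_load, "some/NemotronH-like-model") == (
    "some/NemotronH-like-model", True)
```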
No description provided.