Conversation
* server: wrap headers for mcp proxy
* Update tools/server/server-cors-proxy.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix build
* chore: update webui build output
* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix incorrect type ignore comments
* bump ty to 0.0.26
…l-org#20978)

* llama-model-loader: use pinned memory for tensor overrides
* change to warning
* fix: Branching logic + small refactor
* chore: update webui build output
When RPC is running with a remote backend that doesn't have an init_tensor function (like CPU and Metal), the server log fills with error messages saying that init_tensor is being called with a null buffer, which is incorrect. This patch fixes this.
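For illustration, the kind of guard such a fix amounts to, written with hypothetical types (the real ggml/RPC interfaces and signatures differ): if the remote backend's buffer interface does not provide init_tensor, simply skip the call instead of reporting an error for every tensor.

```
// Hypothetical sketch, not the actual rpc-server code.
struct buffer_iface {
    void (*init_tensor)(void * buffer, void * tensor);  // optional, may be null
};

static void server_init_tensor(buffer_iface * iface, void * buffer, void * tensor) {
    if (iface == nullptr || iface->init_tensor == nullptr) {
        return;  // backend (e.g. CPU, Metal) has no init_tensor: nothing to do, not an error
    }
    iface->init_tensor(buffer, tensor);
}
```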
…l-org#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0, CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`, while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we had uninitialized values in `offset_iterator[nrows]` for the case when `nrows % block_size == 0`.

Fixes ggml-org#21162

* Reduce nrows in test case to 256, don't need 768
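A tiny standalone sketch of the off-by-one (illustrative arithmetic only, not the actual CUDA launch code): the segmented sort consumes nrows + 1 offsets, so a grid sized with `ceildiv(nrows, block_size)` misses `offsets[nrows]` exactly when nrows is a multiple of the block size.

```
#include <cstdio>

static int ceildiv(int a, int b) { return (a + b - 1) / b; }

int main() {
    const int block_size = 256;
    const int cases[] = {255, 256, 257, 768};
    for (int nrows : cases) {
        // elements covered by the init grid before and after the fix
        const int covered_before = ceildiv(nrows,     block_size) * block_size;
        const int covered_after  = ceildiv(nrows + 1, block_size) * block_size;
        // offsets[nrows] is left uninitialized exactly when nrows % block_size == 0
        std::printf("nrows=%4d  before: offsets[nrows] %s  after: offsets[nrows] %s\n",
                    nrows,
                    covered_before > nrows ? "initialized" : "UNINITIALIZED",
                    covered_after  > nrows ? "initialized" : "UNINITIALIZED");
    }
    return 0;
}
```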
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().
* Treat empty computed member expressions with Jinja2 undefined semantics
Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.
- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`
* Handle undefined computed member properties
Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.
* Use default undefined value in member access
Initialize val and then return it when property is undefined.
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* empty statement parses to blank_expression instead of noop_statement
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
* Obtain source tag name from git tag
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* opencl: add q4_K gemm and gemv kernels for Adreno
* opencl: fix whitespace
* opencl: add workarounds for compiler bugs on older devices
* opencl: handle fp16 denorm on X Elite
* opencl: fix kernel build error
* opencl: fix whitespace
* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…i-compat (ggml-org#21090)

* server/webui: cleanup dual representation approach, simplify to openai-compat
* feat: Fix regression for Agentic Loop UI
* chore: update webui build output
* refactor: Post-review code improvements
* chore: update webui build output
* refactor: Cleanup
* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
…-org#21193)

* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy, the /cors-proxy endpoint requires authentication. The WebUI was not including the Authorization header in proxy requests, causing MCP connections to fail with 401. Inject getAuthHeaders() into requestInit when useProxy is true so the proxy request carries the Bearer token alongside the forwarded target headers.

Fixes ggml-org#21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers directly to the transport otherwise.
…gfault on failed model load (ggml-org#21082)

* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load
* Revert a308e58
* Add regression test
* Remove regression test for init-fail sampler check
…1176)

The build info is now only for debug, so we avoid the duplicate with `--version`. The UTF-8 setup at the beginning is needed to avoid logging garbage on Windows.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp, gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)
* webui: no more gzip
* try changing a small line
* Revert "try changing a small line"
This reverts commit 0d7a353.
* fix lint
* fix test
* rebuild
* split into html/css/js
* lint
* chore: update webui build output
* chore: Update git hooks script
* server: update webui build output
* chore: Update pre-commit hook
* refactor: Cleanup

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* CANN: fix multi-thread set_tensor race conditions

When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues:

1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data.
2. ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data.
3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes. Add per-device mutex to g_nz_workspaces to prevent data races.

* CANN: fix L2_NORM ignoring eps parameter

The L2_NORM implementation was not using the eps parameter from op_params, causing incorrect results when eps is large (e.g. 10.0). The CPU reference computes scale = 1/fmaxf(norm, eps), so add a Clamp step to clamp the norm to at least eps before dividing.

* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

When ACL graph mode is enabled, the graph LRU cache checks whether a cached graph matches the current computation graph. Previously, GGML_OP_POOL_2D was not included in the op_params comparison, so two POOL_2D nodes with different pooling parameters (kernel size, stride, padding) but identical tensor shapes and addresses could incorrectly reuse a cached graph, leading to wrong results or aclnn errors. Add GGML_OP_POOL_2D to the list of ops that require op_params matching in ggml_graph_node_properties::has_matching_properties().

* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

The ACL graph LRU cache was incorrectly reusing cached graphs for operations with different tensor types or op_params, causing test failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD, RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties so the cache can distinguish tensors with different types but identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for SCALE/UNARY/GLU/ROPE/POOL_2D
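A rough sketch of the deferred post-processing idea from the first commit, using hypothetical names and types (the real TensorSetTracker in the CANN backend will differ): chunks written from multiple threads are staged per tensor, and the full-tensor transform runs once, after the last byte arrives.

```
#include <cstddef>
#include <cstring>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical per-tensor write tracker (illustrative only).
struct tensor_set_tracker {
    std::vector<char> staging;   // host staging buffer sized to the full tensor
    size_t            received = 0;
};

class tensor_set_registry {
public:
    // stage one chunk; returns true when the tensor is complete and ready for post-processing
    bool add_chunk(const void * tensor_id, size_t tensor_size,
                   size_t offset, const void * data, size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        tensor_set_tracker & t = trackers_[tensor_id];
        if (t.staging.empty()) {
            t.staging.resize(tensor_size);
        }
        std::memcpy(t.staging.data() + offset, data, size);
        t.received += size;
        return t.received == tensor_size;
    }

    // hand the full tensor data to the caller and drop the tracker so the
    // staging buffer is released right after post-processing completes
    std::vector<char> take(const void * tensor_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<char> full = std::move(trackers_[tensor_id].staging);
        trackers_.erase(tensor_id);
        return full;
    }

private:
    std::mutex mutex_;
    std::unordered_map<const void *, tensor_set_tracker> trackers_;
};
```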
```
$ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100%
Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100%
...
```

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for old AOT template expansion
…gml-org#21046)

* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
…face (ggml-org#20346)

* Refactor llama_model_quantize_params to expose a pure C interface
* Restore comment and cleanup struct def
* Code review refactoring
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Code review refactoring

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
)

* flash attention support for head dimension 512 added
* FA D=512 - match 576 configs, limit ncols2, revert vec cap
* fix HIP tile kernel build for D=512
* fix HIP tile kernel occupancy for D=512 on AMD
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix tile FA compilation

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml-cpu: refactor sgemm; fix rvv checks
* ggml-cpu: refactor rvv kernels; set zvfbfwma default to off
…#21159)

* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst. Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs
* Address review comments
* Address review comments
* Revert variable names to original
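A heavily hedged sketch of what that selection logic might look like (assumed, not the actual CUDA host code); the thresholds 2 and 4 are the ones quoted in the commit messages above.

```
// Assumed logic, for illustration only: snap nblocks_stream_k down to a
// multiple of ntiles_dst when there is enough parallelism, and dispatch the
// specialized fixup kernel only when GPU concurrency is high enough.
static int choose_nblocks_stream_k(int nblocks_stream_k_raw, int ntiles_dst, bool * use_optimized_fixup) {
    int nblocks = nblocks_stream_k_raw;
    if (nblocks_stream_k_raw > 2 * ntiles_dst) {
        nblocks = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;  // exact multiple -> simpler fixup path
    }
    *use_optimized_fixup = nblocks_stream_k_raw > 4 * ntiles_dst;    // keep enough concurrency on the GPU
    return nblocks;
}
```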
* llama-cli: fix stripping of \n in multiline input
* Change & string to string_view
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)
* add generic fallback for x86
* remove Q1_0 (group size 32)
* rename Q1_0_g128 => Q1_0
* fix Q1_0 LlamaFileType Enum
* Fix trailing spaces; add generic fallback for other backends
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix /r/n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add mul_mat_id support to WebGPU
* Apply suggestion from @reeselevine

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
…gml-org#21527)

Extend the existing reorder optimization to Q8_0. The reorder separates scale factors from weight data for coalesced memory access; it was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing.

On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x) on Qwen3.5-27B. BW utilization: 21% -> 66%.

The key fix beyond the kernels: Q8_0 was missing from the type check in ggml_backend_sycl_buffer_init_tensor() that allocates the extra struct carrying the reorder flag, so the optimization was silently skipped.

AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware.

Fixes: ggml-org#21517
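A simplified sketch of what such a reorder means for Q8_0 data: the interleaved scale-plus-weights blocks are split into one contiguous run of weights followed by one contiguous run of scales, so neighbouring work-items read neighbouring bytes. The block layout below follows ggml's one-fp16-scale-plus-32-weights scheme; the destination layout and function name are illustrative, not the SYCL kernel code.

```
#include <cstdint>
#include <cstring>

#define QK8_0 32

typedef uint16_t ggml_half;

// ggml's interleaved Q8_0 block: one fp16 scale followed by 32 int8 weights.
typedef struct {
    ggml_half d;          // scale
    int8_t    qs[QK8_0];  // quantized weights
} block_q8_0;

// Illustrative reorder: all weights first, then all scales, for coalesced access.
static void reorder_q8_0(const block_q8_0 * src, uint8_t * dst, int64_t nblocks) {
    int8_t *    qs_out = reinterpret_cast<int8_t *>(dst);
    ggml_half * d_out  = reinterpret_cast<ggml_half *>(dst + nblocks * QK8_0);
    for (int64_t i = 0; i < nblocks; i++) {
        std::memcpy(qs_out + i * QK8_0, src[i].qs, QK8_0);
        d_out[i] = src[i].d;
    }
}
```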
Code Review
This pull request introduces support for the Gemma 4 model architecture, including its specialized BPE tokenization, per-layer embeddings, and tool-calling syntax. It adds a new Q1_0 quantization type with optimized kernels for ARM and AVX2, and expands hardware acceleration support across multiple backends: ZenDNN now supports MUL_MAT_ID for MoE models, the CUDA backend adds support for 512-head Flash Attention and NVFP4, and the SYCL backend improves performance for Q8_0 and NVFP4. Additionally, the PR includes updates to Dockerfiles for newer Ubuntu and CUDA versions, enhances the PEG-based chat parser, and adds templates for LFM 2.5 and IBM Granite 4.0. I have no feedback to provide.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c569b9cef
for (uint32_t i = 0; i < desc->n_layer; i++) {
    model->hparams.n_head_arr[i]    = desc->n_head;
    model->hparams.n_head_kv_arr[i] = desc->n_head_kv;
    model->hparams.n_ff_arr[i]      = desc->n_ff;
Validate layer count before writing quant model arrays
llama_quant_model_from_metadata() writes desc->n_layer entries into n_head_arr, n_head_kv_arr, and n_ff_arr without bounding n_layer. Those arrays are fixed at LLAMA_MAX_LAYERS (512 in src/llama-hparams.h), so any descriptor with n_layer > 512 will write out of bounds and can corrupt heap state or crash the process.
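A guard along the lines the comment suggests might look like this (sketch only; the error-reporting and return conventions of llama_quant_model_from_metadata() are assumed here):

```
// Sketch of the suggested bounds check before filling the per-layer arrays.
if (desc->n_layer > LLAMA_MAX_LAYERS) {
    LLAMA_LOG_ERROR("%s: n_layer (%u) exceeds LLAMA_MAX_LAYERS (%d)\n",
                    __func__, desc->n_layer, (int) LLAMA_MAX_LAYERS);
    return nullptr; // assumed failure path instead of writing out of bounds
}
for (uint32_t i = 0; i < desc->n_layer; i++) {
    model->hparams.n_head_arr[i]    = desc->n_head;
    model->hparams.n_head_kv_arr[i] = desc->n_head_kv;
    model->hparams.n_ff_arr[i]      = desc->n_ff;
}
```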
ggml_type default_type = llama_ftype_get_default_type(ftype);

// compute types
for (size_t i = 0; i < n_tensors; i++) {
    result_types[i] = llama_tensor_get_type(*qs, &local_params, tensors[i], default_type, metadata[i]);
Reject unsupported ftype in quant type computation
llama_quant_compute_types() uses llama_ftype_get_default_type(ftype) but never validates the result before passing it to llama_tensor_get_type(). Since llama_ftype_get_default_type() now returns GGML_TYPE_COUNT for unknown values, callers can trigger invalid type handling (e.g., downstream type-trait indexing paths) instead of a clean error, unlike llama_model_quantize_impl() which already guards this case.
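One way to add the missing guard, mirroring the check that llama_model_quantize_impl() already performs (sketch; the surrounding error convention of llama_quant_compute_types() is assumed):

```
// Sketch: reject unknown ftypes before computing per-tensor types.
const ggml_type default_type = llama_ftype_get_default_type(ftype);
if (default_type == GGML_TYPE_COUNT) {
    LLAMA_LOG_ERROR("%s: invalid output file type %d\n", __func__, (int) ftype);
    return false; // fail cleanly instead of indexing type traits with an invalid type
}
for (size_t i = 0; i < n_tensors; i++) {
    result_types[i] = llama_tensor_get_type(*qs, &local_params, tensors[i], default_type, metadata[i]);
}
```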
Overview
Adds an AVX2 SIMD fast path for `ggml_vec_dot_q1_0_q8_0()` in `ggml/src/ggml-cpu/quants.c`. Q1_0 was missing an x86 kernel and fell back to a scalar loop. This patch implements the fast path using the existing `bytes_from_bits_32()` and `mul_sum_i8_pairs_float()` helpers, keeping it minimal and consistent with the q4/q5 kernel style. The scalar fallback remains intact for non-AVX2 builds.

Benchmark (AMD Ryzen 7 5800X, Bonsai-8B Q1_0, 16 threads):
- `test-quantize-perf --type q1_0 --op vec_dot_q -4`:
- `llama-server --threads 16 --ctx-size 512`:
Follow-up to the existing ARM NEON Q1_0 implementation. The x86 AVX2 path uses the same algorithm adapted for x86 intrinsics.
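For illustration, the general shape of such an AVX2 dot-product loop, heavily simplified and self-contained: the block layout (32 sign bits plus a float scale), the struct names, the sign convention, and the bit-expansion helper below are all assumptions made for the sketch, not the actual Q1_0 definitions or helpers in quants.c. Compile with -mavx2 -mfma.

```
#include <immintrin.h>
#include <stdint.h>

// Hypothetical 1-bit block: 32 sign bits + one float scale (simplified).
typedef struct { uint32_t bits; float d; } block_q1_0_sketch;
typedef struct { float d; int8_t qs[32]; } block_q8_0_sketch;

// Expand 32 packed bits into 32 bytes: 0xFF where the bit is set, 0x00 otherwise.
static inline __m256i bytes_from_bits32_sketch(uint32_t bits) {
    const __m256i shuf = _mm256_set_epi8(
        3,3,3,3,3,3,3,3, 2,2,2,2,2,2,2,2,
        1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0);
    const __m256i bit = _mm256_set1_epi64x(0x8040201008040201ULL);
    __m256i x = _mm256_set1_epi32((int) bits);
    x = _mm256_shuffle_epi8(x, shuf);   // broadcast each byte of 'bits' to 8 lanes
    x = _mm256_and_si256(x, bit);       // isolate one bit per lane
    return _mm256_cmpeq_epi8(x, bit);   // full byte mask where the bit is set
}

// Sketch of the dot product: weights map to {+1, -1}, multiply with the int8
// activations, and scale the per-block integer sum by d_x * d_y.
static float vec_dot_q1_0_q8_0_sketch(int nblocks, const block_q1_0_sketch * x, const block_q8_0_sketch * y) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < nblocks; i++) {
        const __m256i mask = bytes_from_bits32_sketch(x[i].bits);
        const __m256i w    = _mm256_or_si256(mask, _mm256_set1_epi8(1)); // 0x01 -> +1, 0xFF -> -1
        const __m256i q    = _mm256_loadu_si256((const __m256i *) y[i].qs);
        // |w| == 1, so maddubs(|w|, q * sign(w)) sums w[j]*q[j] into 16-bit pairs
        const __m256i aw  = _mm256_sign_epi8(w, w);
        const __m256i sq  = _mm256_sign_epi8(q, w);
        const __m256i p16 = _mm256_maddubs_epi16(aw, sq);
        const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));
        acc = _mm256_fmadd_ps(_mm256_set1_ps(x[i].d * y[i].d), _mm256_cvtepi32_ps(p32), acc);
    }
    // horizontal sum of the 8 float lanes
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```

The `_mm256_sign_epi8` step keeps one `maddubs` operand non-negative, which is the same kind of trick the existing AVX2 helpers in ggml-cpu use to make the unsigned-by-signed multiply-add applicable to signed weights.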
Requirements