ggml-cpu: add Q1_0 AVX2 fast path #2

Closed
elusznik wants to merge 106 commits into master from q1_0-x86-avx2

Conversation

@elusznik elusznik commented Apr 7, 2026

ggml-cpu: add Q1_0 AVX2 path

Commit message

ggml-cpu: add Q1_0 AVX2 dot product

Add an AVX2 SIMD fast path for ggml_vec_dot_q1_0_q8_0() in the
CPU backend quants.c. The Q1_0 quantization format stores weights
as packed bits, so each group of 32 bits must be expanded to 32
bytes before the int8 dot product can run. This patch reuses the
existing helpers (bytes_from_bits_32, mul_sum_i8_pairs_float) to
keep the implementation minimal and consistent with the q4/q5
kernel style.

The scalar fallback remains intact for non-AVX2 builds.
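
For context, this is roughly how a `bytes_from_bits_32`-style helper performs the
bits-to-bytes expansion mentioned above. It is paraphrased from the AVX2 helper
style used by the existing q4/q5 kernels; the version used by this PR may differ
in detail.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Expand 32 packed bits into a 32-byte 0x00/0xFF mask: output byte i is 0xFF
// when input bit i is set, which is the form the int8 dot product needs.
static inline __m256i bytes_from_bits_32(const uint8_t * x) {
    uint32_t x32;
    memcpy(&x32, x, sizeof(uint32_t));
    // broadcast the four source bytes so byte i of the vector holds source byte i/8
    const __m256i shuf_mask = _mm256_set_epi64x(
            0x0303030303030303, 0x0202020202020202,
            0x0101010101010101, 0x0000000000000000);
    __m256i bytes = _mm256_shuffle_epi8(_mm256_set1_epi32(x32), shuf_mask);
    // set every bit except the one this byte is responsible for ...
    const __m256i bit_mask = _mm256_set1_epi64x(0x7fbfdfeff7fbfdfe);
    bytes = _mm256_or_si256(bytes, bit_mask);
    // ... so a byte is all-ones exactly when its bit was set in the input
    return _mm256_cmpeq_epi8(bytes, _mm256_set1_epi64x(-1));
}
```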

PR title

ggml-cpu: add Q1_0 AVX2 fast path

PR body

## Summary

Adds an AVX2 SIMD implementation of `ggml_vec_dot_q1_0_q8_0()` to
`ggml/src/ggml-cpu/quants.c`. This is the core dot-product kernel
used during inference with Q1_0 quantized weights (1-bit models
like Bonsai).

## Motivation

Q1_0 currently falls back to a scalar loop on x86 targets because
there is no AVX2 kernel; ARM NEON, by comparison, already has a
native Q1_0 implementation. The scalar fallback is particularly
slow for Q1_0 because the inner loop decodes 32 packed bits per
iteration.

## Changes

- `ggml/src/ggml-cpu/quants.c`: added three static inline AVX2
  helpers (`hsum_float_8`, `bytes_from_bits_32`,
  `mul_sum_i8_pairs_float`) and the AVX2 fast path for
  `ggml_vec_dot_q1_0_q8_0()`. Scalar fallback kept for non-AVX2
  builds.
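
A rough sketch of how these pieces fit together in a kernel of this shape.
Illustrative only: the `block_q1_0` layout, the field names, the `{-1, +1}` sign
mapping, and the 32-weight block size are assumptions rather than details taken
from this PR, and `bytes_from_bits_32`, `mul_sum_i8_pairs_float`, `hsum_float_8`,
and `fp16_to_fp32` stand in for the helpers named above.

```c
// Hypothetical block layouts, for illustration only.
typedef uint16_t ggml_half;                                    // fp16 storage, as in ggml
typedef struct { ggml_half d; uint8_t qs[ 4]; } block_q1_0_s;  // assumed: 32 one-bit weights + scale
typedef struct { ggml_half d; int8_t  qs[32]; } block_q8_0_s;  // mirrors ggml's block_q8_0

static void vec_dot_q1_0_q8_0_sketch(int n, float * s,
                                     const block_q1_0_s * x, const block_q8_0_s * y) {
    const int nb = n / 32;                          // 32 weights per block in this sketch
    __m256 acc = _mm256_setzero_ps();
    for (int ib = 0; ib < nb; ++ib) {
        // 32 packed bits -> 32-byte 0x00/0xFF mask
        const __m256i mask = bytes_from_bits_32(x[ib].qs);
        // map mask bytes to signed weights, here {-1, +1}
        const __m256i q1 = _mm256_sub_epi8(_mm256_and_si256(mask, _mm256_set1_epi8(2)),
                                           _mm256_set1_epi8(1));
        // int8 activations for the same 32 positions
        const __m256i q8 = _mm256_loadu_si256((const __m256i *) y[ib].qs);
        // combined per-block scale
        const __m256  d  = _mm256_set1_ps(fp16_to_fp32(x[ib].d) * fp16_to_fp32(y[ib].d));
        // per-lane int8 dot products, accumulated in float
        acc = _mm256_fmadd_ps(d, mul_sum_i8_pairs_float(q1, q8), acc);
    }
    *s = hsum_float_8(acc);                         // horizontal sum of the 8 float lanes
}
```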

## Benchmark

Micro-benchmark (`test-quantize-perf`, synthetic data, vec_dot_q):

| Size    | Baseline cycles/32 | AVX2 cycles/32 | Speedup |
|---------|-------------------:|---------------:|--------:|
| 4 KB    | 104.38             | 6.32           | ~16.5x  |
| 64 KB   | 103.62             | 5.56           | ~18.6x  |
| 2.5 MB  | 104.76             | 5.76           | ~18.2x  |
| 250 MB  | 105.56             | 5.98           | ~17.6x  |

End-to-end server inference (`llama-server`, Bonsai-8B Q1_0, 16
threads, ctx-size 512):

| Metric       | Baseline (t/s) | AVX2 (t/s) | Speedup |
|--------------|---------------:|-----------:|--------:|
| Prompt eval  | 1.24           | 18.64      | ~15x    |
| Generation   | 1.13           | 18.01      | ~16x    |

All llama-bench results on the modified build confirm no regressions
on other quantization formats.

## Testing

- `test-quantize-perf --type q1_0 --op vec_dot_q -4` on both builds
- `llama-bench` on Bonsai-8B Q1_0 (16 threads) shows no regressions
  for other quants
- `llama-server` inference on Bonsai-8B Q1_0 produces correct output

Xuan-Son Nguyen and others added 30 commits March 30, 2026 08:59
* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix incorrect type ignore comments

* bump ty to 0.0.26
…l-org#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning
* fix: Branching logic + small refactor

* chore: update webui build output
When RPC is running with a remote backend that doesn't have an
init_tensor function (like CPU and Metal), the server log fills up
with incorrect error messages saying that init_tensor is being
called with a null buffer. This patch fixes that.
…l-org#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162

* Reduce nrows in test case to 256, don't need 768
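
A quick worked example of the off-by-one, assuming `offset_iterator` holds
`nrows + 1` prefix entries (which the uninitialized `offset_iterator[nrows]`
implies):

```c
static inline int ceildiv(int a, int b) { return (a + b - 1) / b; }

// nrows = 256, block_size = 256
//   ceildiv(256, 256) = 1 block  -> only entries 0..255 are written; entry 256 stays uninitialized
//   ceildiv(257, 256) = 2 blocks -> entries 0..256 are all covered, as required
```
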
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>

* Obtain source tag name from git tag

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* opencl: add q4_K gemm and gemv kernels for Adreno

* opencl: fix whitespace

* opencl: add workarounds for compiler bugs on older devices

* opencl: handle fp16 denorm on X Elite

* opencl: fix kernel build error

* opencl: fix whitespace

* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…i-compat (ggml-org#21090)

* server/webui: cleanup dual representation approach, simplify to openai-compat

* feat: Fix regression for Agentic Loop UI

* chore: update webui build output

* refactor: Post-review code improvements

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
…-org#21193)

* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.

Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.

Fixes ggml-org#21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
…gfault on failed model load (ggml-org#21082)

* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load

* Revert a308e58

* Add regression test

* Remove regression test for init-fail sampler check
…1176)

The build info is now only for debug, so we avoid the duplicate
with `--version`.

The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp,
gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)
* webui: no more gzip

* try changing a small line

* Revert "try changing a small line"

This reverts commit 0d7a353.

* fix lint

* fix test

* rebuild

* split into html/css/js

* lint

* chore: update webui build output

* chore: Update git hooks script

* server: update webui build output

* chore: Update pre-commit hook

* refactor: Cleanup

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* CANN: fix multi-thread set_tensor race conditions

When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:

1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
   before uploading to device. Per-chunk transforms produced corrupt data.

2. ND-to-NZ weight conversion requires complete tensor data on device.
   Per-chunk conversion operated on incomplete data.

3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.
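
A minimal sketch of the deferred-upload idea described above, assuming nothing
about the actual CANN code; the names (tensor_set_tracker, add_chunk, ...) are
illustrative. The point is to accumulate per-tensor write progress under a mutex
and run the full-tensor transform only after the last chunk arrives.

```cpp
#include <cstddef>
#include <cstring>
#include <mutex>
#include <unordered_map>
#include <vector>

struct tensor_set_tracker {
    std::vector<char> staging;   // host staging buffer for the raw chunks
    size_t bytes_written = 0;    // how much of the tensor has been set so far
};

class tensor_set_registry {
public:
    // Returns true when the chunk just written completes the tensor; the caller
    // then performs the full-tensor transform + device upload and calls release().
    bool add_chunk(const void * tensor_id, size_t total_size,
                   size_t offset, const void * data, size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        tensor_set_tracker & t = trackers_[tensor_id];
        if (t.staging.empty()) {
            t.staging.resize(total_size);
        }
        std::memcpy(t.staging.data() + offset, data, size);
        t.bytes_written += size;
        return t.bytes_written == total_size;
    }

    // Drop the tracker (and its staging buffer) once post-processing is done.
    void release(const void * tensor_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        trackers_.erase(tensor_id);
    }

private:
    std::mutex mutex_;
    std::unordered_map<const void *, tensor_set_tracker> trackers_;
};
```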

* CANN: fix L2_NORM ignoring eps parameter

The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.

* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.

Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().

* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
  so the cache can distinguish tensors with different types but
  identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
  SCALE/UNARY/GLU/ROPE/POOL_2D
```
$ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100%
Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100%
...
```

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* port cpy pipeline to shader lib with JIT compilation
 * port glu pipeline to shader lib with JIT compilation
 * port rope pipeline to shader lib with JIT compilation
 * port soft_max pipeline to shader lib with JIT compilation
 * removed unused functions from embed_wgsl.py which were used for
old AOT template expansion
…gml-org#21046)

* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs
…face (ggml-org#20346)

* Refactor llama_model_quantize_params to expose a pure C interface

* Restore comment and cleanup struct def

* Code review refactoring

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review refactoring

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
)

* flash attention support for head dimension 512 added

* FA D=512 - match 576 configs, limit ncols2, revert vec cap

* fix HIP tile kernel build for D=512

* fix HIP tile kernel occupancy for D=512 on AMD

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix tile FA compilation

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml-cpu: refactor sgemm; fix rvv checks

* ggml-cpu: refactor rvv kernels; set zvfbfwma default to off
am17an and others added 8 commits April 6, 2026 22:26
* llama-bench: add `-fitc` and `-fitt` to arguments

* update README.md

* address review comments

* update compare-llama-bench.py
…#21159)

* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst.
Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

* Address review comments

* Address review comments

* Revert variable names to original
* llama-cli: fix stripping of \n in multiline input

* Change & string to string_view

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)

* add generic fallback for x86

* remove Q1_0 (group size 32)

* rename Q1_0_g128 => Q1_0

* fix Q1_0 LlamaFileType Enum

* Fix trailing spaces; add generic fallback for other backends

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix \r\n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add mul_mat_id support to WebGPU

* Apply suggestion from @reeselevine

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
…gml-org#21527)

Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access; it was implemented for Q4_0/Q4_K/Q6_K but was missing
for Q8_0.

On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.

The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag, so the optimization
was silently skipped.

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.

Fixes: ggml-org#21517
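
Illustrative only (names and call site are assumptions, not the SYCL backend's
actual code): the fix amounts to including Q8_0 in whatever predicate gates
allocation of the reorder metadata, e.g.:

```cpp
#include "ggml.h"

// Hypothetical helper; the real check lives in ggml_backend_sycl_buffer_init_tensor().
static bool type_supports_reorder(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_K:
        case GGML_TYPE_Q6_K:
        case GGML_TYPE_Q8_0:   // previously missing, so the reorder was silently skipped
            return true;
        default:
            return false;
    }
}
```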

@gemini-code-assist (bot) left a comment

Code Review

This extensive pull request introduces support for the Gemma 4 model architecture, including its per-layer embeddings and specialized chat templates. It implements the new Q1_0 quantization type and adds NVFP4 support for CUDA and SYCL backends. Significant performance enhancements are provided for the WebGPU backend through optimized shaders for MUL_MAT_ID and flash attention, while the CPU backend receives RISC-V vector optimizations. Other improvements include new CUMSUM support for Hexagon, Q4_K support for Adreno GPUs, and revised contribution guidelines for AI agents. Feedback suggests improving maintainability by refactoring repetitive requirement checks into a loop, breaking down complex parser logic, and consolidating duplicated FP8 conversion code. Additionally, the reviewer recommends using a dispatch table for device detection, vectorizing WebGPU output logic, and deriving hardcoded layer counts from existing parameters.

Comment thread ci/run.sh
Comment on lines +635 to 669
if ! command -v git &> /dev/null; then
gg_printf 'git not found, please install'
fi

if ! command -v git-lfs &> /dev/null; then
gg_printf 'git-lfs not found, please install'
fi

if ! command -v wget &> /dev/null; then
gg_printf 'wget not found, please install'
fi

if ! command -v python3 &> /dev/null; then
gg_printf 'python3 not found, please install'
fi

if ! command -v pip3 &> /dev/null; then
gg_printf 'pip3 not found, please install'
fi

if ! python3 -m ensurepip --help &> /dev/null; then
gg_printf 'ensurepip not found, please install python3-venv package'
fi

if ! command -v cmake &> /dev/null; then
gg_printf 'cmake not found, please install'
fi

if ! command -v ccache &> /dev/null; then
gg_printf 'ccache not found, please consider installing for faster builds'
fi

if ! command -v ctest &> /dev/null; then
gg_printf 'ctest not found, please install'
fi

medium

This function is becoming quite large and repetitive. Consider refactoring to use a loop over an array of required binaries to improve maintainability.

Comment on lines +219 to +255
common_peg_parser analyze_tools::build_func_parser(common_chat_peg_builder & p, const std::string & name,
const common_peg_parser & call_id_section, bool have_call_id,
const common_peg_parser & args,
std::optional<common_peg_parser> atomic_peek) const {
auto open = p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix);
bool matched_atomic = false;
common_peg_parser func_parser = p.eps();

if (!function.name_suffix.empty()) {
func_parser = open + call_id_section + p.space() + args;
matched_atomic = true;
} else if (have_call_id) {
func_parser = p.atomic(open + call_id_section) + p.space() + args;
matched_atomic = true;
} else if (atomic_peek.has_value()) {
func_parser = p.atomic(open + call_id_section + p.space() + *atomic_peek) + args;
matched_atomic = true;
} else {
func_parser = open + call_id_section + p.space() + args;
}

if (!function.close.empty()) {
func_parser = func_parser + p.space() + p.tool_close(p.literal(function.close));
} else if (!format.per_call_end.empty()) {
// When there's no func_close but there is a per_call_end marker, use peek() to ensure
// we only emit tool_close when we can actually see the closing marker. This prevents
// premature closing during partial parsing when we've seen e.g. "</" which could be
// either "</tool_call>" (end) or "<arg_key>" prefix that failed to match.
func_parser = func_parser + p.tool_close(p.peek(p.literal(format.per_call_end)));
} else {
func_parser = func_parser + p.tool_close(p.space()); // force this to process tool closing callbacks in mapper
}
if (!matched_atomic) {
func_parser = p.atomic(func_parser);
}
return func_parser;
}

medium

The function build_func_parser is quite complex. Consider breaking it down into smaller, more focused helper functions to improve readability and maintainability.
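
One possible split along the lines the reviewer suggests, reusing only names
visible in the snippet above (a sketch, not the actual refactor): the three
closing cases move into a dedicated helper so build_func_parser only composes
the open/args/close parts.

```cpp
// Sketch: factor the closing-marker handling out of build_func_parser.
static common_peg_parser append_close(common_chat_peg_builder & p,
                                      const common_peg_parser & func_parser,
                                      const std::string & func_close,
                                      const std::string & per_call_end) {
    if (!func_close.empty()) {
        return func_parser + p.space() + p.tool_close(p.literal(func_close));
    }
    if (!per_call_end.empty()) {
        // peek() so tool_close only fires once the closing marker is actually visible
        return func_parser + p.tool_close(p.peek(p.literal(per_call_end)));
    }
    return func_parser + p.tool_close(p.space()); // force the tool-closing callbacks
}
```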

Comment on lines +815 to +826
if (x == 0 || (x == 0x7F && x != 0xFF)) { // Convert NaN to 0.0f
return 0.0f;
}
const int exp = (x >> 3) & 0xF;
const int man = x & 0x7;
float raw;
if (exp == 0) {
raw = ldexpf((float) man, -9);
} else {
raw = ldexpf(1.0f + (float) man / 8.0f, exp - 7);
}
return static_cast<float>(raw / 2);

medium

The logic for converting FP8 to FP32 is duplicated across different architectures. Consider consolidating this into a single, robust implementation if possible to reduce maintenance overhead.
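
One way to consolidate, as a sketch: wrap the conversion shown above in a single
shared inline helper that each backend calls. The helper name is hypothetical;
the body reproduces the logic in the snippet.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical shared helper reproducing the conversion above, so each backend
// can call one implementation instead of carrying its own copy.
static inline float fp8_to_fp32_shared(uint8_t x) {
    if (x == 0 || (x == 0x7F && x != 0xFF)) { // zero, or the NaN pattern -> 0.0f
        return 0.0f;
    }
    const int exp = (x >> 3) & 0xF;
    const int man = x & 0x7;
    float raw;
    if (exp == 0) {
        raw = ldexpf((float) man, -9);                    // subnormal
    } else {
        raw = ldexpf(1.0f + (float) man / 8.0f, exp - 7); // normal
    }
    return raw / 2.0f;
}
```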

Comment on lines 238 to 266
if (GGML_CUDA_CC_IS_NVIDIA(cc)) {
if (cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE) {
return MMVQ_MAX_BATCH_SIZE;
}
if (cc >= GGML_CUDA_CC_TURING) {
return get_mmvq_mmid_max_batch_turing_plus(type);
}
return get_mmvq_mmid_max_batch_pascal_older(type);
}

// AMD
if (GGML_CUDA_CC_IS_RDNA4(cc)) {
return get_mmvq_mmid_max_batch_rdna4(type);
}
if (GGML_CUDA_CC_IS_RDNA3(cc)) {
return get_mmvq_mmid_max_batch_rdna3(type);
}
if (GGML_CUDA_CC_IS_RDNA1(cc) || GGML_CUDA_CC_IS_RDNA2(cc)) {
return get_mmvq_mmid_max_batch_rdna1_rdna2(type);
}
if (GGML_CUDA_CC_IS_CDNA(cc)) {
return get_mmvq_mmid_max_batch_cdna(type);
}
if (GGML_CUDA_CC_IS_GCN(cc)) {
return get_mmvq_mmid_max_batch_gcn(type);
if (GGML_CUDA_CC_IS_AMD(cc)) {
if (GGML_CUDA_CC_IS_RDNA4(cc)) {
return get_mmvq_mmid_max_batch_rdna4(type);
}
if (GGML_CUDA_CC_IS_RDNA3(cc)) {
return get_mmvq_mmid_max_batch_rdna3(type);
}
if (GGML_CUDA_CC_IS_RDNA1(cc) || GGML_CUDA_CC_IS_RDNA2(cc)) {
return get_mmvq_mmid_max_batch_rdna1_rdna2(type);
}
if (GGML_CUDA_CC_IS_CDNA(cc)) {
return get_mmvq_mmid_max_batch_cdna(type);
}
if (GGML_CUDA_CC_IS_GCN(cc)) {
return get_mmvq_mmid_max_batch_gcn(type);
}
}
return MMVQ_MAX_BATCH_SIZE;

medium

The nested if statements for device architecture detection are becoming difficult to follow. Consider using a more structured approach, such as a dispatch table or a cleaner helper function, to improve readability.
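
A sketch of the flatter lookup the reviewer is asking for. Illustrative only: it
assumes the GGML_CUDA_CC_IS_* predicates and per-arch helpers shown in the diff,
and that those helpers are cheap and side-effect free, since every candidate
value is computed up front.

```cpp
// Hypothetical consolidation of the AMD branch into a table scan.
static int get_mmvq_mmid_max_batch_amd(const int cc, const ggml_type type) {
    const struct { bool match; int value; } rules[] = {
        { GGML_CUDA_CC_IS_RDNA4(cc),                              get_mmvq_mmid_max_batch_rdna4(type)       },
        { GGML_CUDA_CC_IS_RDNA3(cc),                              get_mmvq_mmid_max_batch_rdna3(type)       },
        { GGML_CUDA_CC_IS_RDNA1(cc) || GGML_CUDA_CC_IS_RDNA2(cc), get_mmvq_mmid_max_batch_rdna1_rdna2(type) },
        { GGML_CUDA_CC_IS_CDNA(cc),                               get_mmvq_mmid_max_batch_cdna(type)        },
        { GGML_CUDA_CC_IS_GCN(cc),                                get_mmvq_mmid_max_batch_gcn(type)         },
    };
    for (const auto & r : rules) {
        if (r.match) {
            return r.value;
        }
    }
    return MMVQ_MAX_BATCH_SIZE;
}
```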

Comment on lines +141 to +171
var acc: array<array<f16, TILE_M>, TILE_N>;

for (var k_outer = 0u; k_outer < params.k; k_outer += TILE_K) {

if (is_valid) {
init_shmem_src0(thread_id, src0_batch_offset, offset_wg_m, k_outer);
init_shmem_id_src1(thread_id, params.offset_src1, rest_token_n, k_outer);
}

workgroupBarrier();

if (is_valid) {
let k_end = min(TILE_K, params.k - k_outer);

for (var k_inner = 0u; k_inner < k_end; k_inner++) {
var src0_tile: array<f16, TILE_M>;
for (var tm = 0u; tm < TILE_M; tm++) {
let src0_m = local_m * TILE_M + tm;
let src0_idx = k_inner + src0_m * TILE_K;
src0_tile[tm] = shmem[src0_idx];
}
for (var tn = 0u; tn < TILE_N; tn++) {
let src1_n = local_n * TILE_N + tn;
let src1_idx = src1_n * TILE_K + k_inner;
let src1_val = shmem[TILE_SRC0_SHMEM + src1_idx];
for (var tm = 0u; tm < TILE_M; tm++) {
acc[tn][tm] += src0_tile[tm] * src1_val;
}
}
}
}

medium

The output storage logic uses multiple if checks for bounds. Consider if this can be simplified or vectorized to improve performance and reduce code duplication.

Comment thread src/llama-model.cpp
Comment on lines +1284 to +1288
switch (hparams.n_layer) {
case 35: type = LLM_TYPE_E2B; break;
case 42: type = LLM_TYPE_E4B; break; // to confirm: E4B or E5B?
default: type = LLM_TYPE_UNKNOWN;
}

medium

The switch-case for layer counts is hardcoded. Consider if this can be derived from other hparams or if a more flexible mapping is needed for future model variants.
