ggml-cpu: add Q1_0 AVX2 fast path #2

Closed
elusznik wants to merge 106 commits into master from q1_0-x86-avx2

Conversation

@elusznik elusznik commented Apr 7, 2026

ggml-cpu: add Q1_0 AVX2 path

Commit message

ggml-cpu: add Q1_0 AVX2 dot product

Add an AVX2 SIMD fast path for ggml_vec_dot_q1_0_q8_0() in the
CPU backend quants.c. The Q1_0 quantization format stores weights
as packed bits, so each group of 32 bits must be expanded to 32
bytes before the int8 dot product can run. This patch reuses the
existing helpers (bytes_from_bits_32, mul_sum_i8_pairs_float) to
keep the implementation minimal and consistent with the q4/q5
kernel style.

The scalar fallback remains intact for non-AVX2 builds.
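
For context, this is roughly how a `bytes_from_bits_32`-style helper performs the
bits-to-bytes expansion mentioned above. It is paraphrased from the AVX2 helper
style used by the existing q4/q5 kernels; the version used by this PR may differ
in detail.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Expand 32 packed bits into a 32-byte 0x00/0xFF mask: output byte i is 0xFF
// when input bit i is set, which is the form the int8 dot product needs.
static inline __m256i bytes_from_bits_32(const uint8_t * x) {
    uint32_t x32;
    memcpy(&x32, x, sizeof(uint32_t));
    // broadcast the four source bytes so byte i of the vector holds source byte i/8
    const __m256i shuf_mask = _mm256_set_epi64x(
            0x0303030303030303, 0x0202020202020202,
            0x0101010101010101, 0x0000000000000000);
    __m256i bytes = _mm256_shuffle_epi8(_mm256_set1_epi32(x32), shuf_mask);
    // set every bit except the one this byte is responsible for ...
    const __m256i bit_mask = _mm256_set1_epi64x(0x7fbfdfeff7fbfdfe);
    bytes = _mm256_or_si256(bytes, bit_mask);
    // ... so a byte is all-ones exactly when its bit was set in the input
    return _mm256_cmpeq_epi8(bytes, _mm256_set1_epi64x(-1));
}
```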

PR title

ggml-cpu: add Q1_0 AVX2 fast path

PR body

## Summary

Adds an AVX2 SIMD implementation of `ggml_vec_dot_q1_0_q8_0()` to
`ggml/src/ggml-cpu/quants.c`. This is the core dot-product kernel
used during inference with Q1_0 quantized weights (1-bit models
like Bonsai).

## Motivation

Q1_0 currently falls back to a scalar loop on x86 targets because
there is no AVX2 kernel; ARM NEON, by comparison, already has a
native Q1_0 implementation. The scalar fallback is particularly
slow for Q1_0 because the inner loop decodes 32 packed bits per
iteration.

## Changes

- `ggml/src/ggml-cpu/quants.c`: added three static inline AVX2
  helpers (`hsum_float_8`, `bytes_from_bits_32`,
  `mul_sum_i8_pairs_float`) and the AVX2 fast path for
  `ggml_vec_dot_q1_0_q8_0()`. Scalar fallback kept for non-AVX2
  builds.
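
A rough sketch of how these pieces fit together in a kernel of this shape.
Illustrative only: the `block_q1_0` layout, the field names, the `{-1, +1}` sign
mapping, and the 32-weight block size are assumptions rather than details taken
from this PR, and `bytes_from_bits_32`, `mul_sum_i8_pairs_float`, `hsum_float_8`,
and `fp16_to_fp32` stand in for the helpers named above.

```c
// Hypothetical block layouts, for illustration only.
typedef uint16_t ggml_half;                                    // fp16 storage, as in ggml
typedef struct { ggml_half d; uint8_t qs[ 4]; } block_q1_0_s;  // assumed: 32 one-bit weights + scale
typedef struct { ggml_half d; int8_t  qs[32]; } block_q8_0_s;  // mirrors ggml's block_q8_0

static void vec_dot_q1_0_q8_0_sketch(int n, float * s,
                                     const block_q1_0_s * x, const block_q8_0_s * y) {
    const int nb = n / 32;                          // 32 weights per block in this sketch
    __m256 acc = _mm256_setzero_ps();
    for (int ib = 0; ib < nb; ++ib) {
        // 32 packed bits -> 32-byte 0x00/0xFF mask
        const __m256i mask = bytes_from_bits_32(x[ib].qs);
        // map mask bytes to signed weights, here {-1, +1}
        const __m256i q1 = _mm256_sub_epi8(_mm256_and_si256(mask, _mm256_set1_epi8(2)),
                                           _mm256_set1_epi8(1));
        // int8 activations for the same 32 positions
        const __m256i q8 = _mm256_loadu_si256((const __m256i *) y[ib].qs);
        // combined per-block scale
        const __m256  d  = _mm256_set1_ps(fp16_to_fp32(x[ib].d) * fp16_to_fp32(y[ib].d));
        // per-lane int8 dot products, accumulated in float
        acc = _mm256_fmadd_ps(d, mul_sum_i8_pairs_float(q1, q8), acc);
    }
    *s = hsum_float_8(acc);                         // horizontal sum of the 8 float lanes
}
```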

## Benchmark

Micro-benchmark (`test-quantize-perf`, synthetic data, vec_dot_q):

| Size    | Baseline cycles/32 | AVX2 cycles/32 | Speedup |
|---------|-------------------:|---------------:|--------:|
| 4 KB    | 104.38             | 6.32           | ~16.5x  |
| 64 KB   | 103.62             | 5.56           | ~18.6x  |
| 2.5 MB  | 104.76             | 5.76           | ~18.2x  |
| 250 MB  | 105.56             | 5.98           | ~17.6x  |

End-to-end server inference (`llama-server`, Bonsai-8B Q1_0, 16
threads, ctx-size 512):

| Metric       | Baseline (t/s) | AVX2 (t/s) | Speedup |
|--------------|---------------:|-----------:|--------:|
| Prompt eval  | 1.24           | 18.64      | ~15x    |
| Generation   | 1.13           | 18.01      | ~16x    |

All llama-bench results on the modified build confirm no regressions
on other quantization formats.

## Testing

- `test-quantize-perf --type q1_0 --op vec_dot_q -4` on both builds
- `llama-bench` on Bonsai-8B Q1_0 (16 threads) shows no regressions
  for other quants
- `llama-server` inference on Bonsai-8B Q1_0 produces correct output

Xuan-Son Nguyen and others added 30 commits March 30, 2026 08:59
* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix incorrect type ignore comments

* bump ty to 0.0.26
…l-org#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning
* fix: Branching logic + small refactor

* chore: update webui build output
When RPC is running with a remote backend that doesn't have an
init_tensor function (like CPU and Metal), the server log fills up
with incorrect error messages saying that init_tensor is being
called with a null buffer. This patch fixes that.
…l-org#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162

* Reduce nrows in test case to 256, don't need 768
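
A quick worked example of the off-by-one, assuming `offset_iterator` holds
`nrows + 1` prefix entries (which the uninitialized `offset_iterator[nrows]`
implies):

```c
static inline int ceildiv(int a, int b) { return (a + b - 1) / b; }

// nrows = 256, block_size = 256
//   ceildiv(256, 256) = 1 block  -> only entries 0..255 are written; entry 256 stays uninitialized
//   ceildiv(257, 256) = 2 blocks -> entries 0..256 are all covered, as required
```
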
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>

* Obtain source tag name from git tag

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* opencl: add q4_K gemm and gemv kernels for Adreno

* opencl: fix whitespace

* opencl: add workarounds for compiler bugs on older devices

* opencl: handle fp16 denorm on X Elite

* opencl: fix kernel build error

* opencl: fix whitespace

* opencl: make q4_K cvt kernels signature consistent

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…i-compat (ggml-org#21090)

* server/webui: cleanup dual representation approach, simplify to openai-compat

* feat: Fix regression for Agentic Loop UI

* chore: update webui build output

* refactor: Post-review code improvements

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
…-org#21193)

* fix: include API key in CORS proxy requests for MCP connections

When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.

Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.

Fixes ggml-org#21167

* fix: simplify headers assignment based on reviewer suggestion

Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
…gfault on failed model load (ggml-org#21082)

* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load

* Revert a308e58

* Add regression test

* Remove regression test for init-fail sampler check
…1176)

The build info is now only for debug, so we avoid the duplicate
with `--version`.

The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp,
gemma-embedding.cpp)
- imlpemented → implemented (llama-adapter.cpp)
- interere → interfere (llama-graph.cpp)
- overridde → overridden (chat.cpp)
- stastistics → statistics (ngram-map.h)
- layed → laid (llama-kv-cache.h)
- worster → worst (llama-context.cpp)
- sequantial → sequential (llama-batch.h)
* webui: no more gzip

* try changing a small line

* Revert "try changing a small line"

This reverts commit 0d7a353.

* fix lint

* fix test

* rebuild

* split into html/css/js

* lint

* chore: update webui build output

* chore: Update git hooks script

* server: update webui build output

* chore: Update pre-commit hook

* refactor: Cleanup

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* CANN: fix multi-thread set_tensor race conditions

When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:

1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
   before uploading to device. Per-chunk transforms produced corrupt data.

2. ND-to-NZ weight conversion requires complete tensor data on device.
   Per-chunk conversion operated on incomplete data.

3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.
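
A minimal sketch of the deferred-upload idea described above, assuming nothing
about the actual CANN code; the names (tensor_set_tracker, add_chunk, ...) are
illustrative. The point is to accumulate per-tensor write progress under a mutex
and run the full-tensor transform only after the last chunk arrives.

```cpp
#include <cstddef>
#include <cstring>
#include <mutex>
#include <unordered_map>
#include <vector>

struct tensor_set_tracker {
    std::vector<char> staging;   // host staging buffer for the raw chunks
    size_t bytes_written = 0;    // how much of the tensor has been set so far
};

class tensor_set_registry {
public:
    // Returns true when the chunk just written completes the tensor; the caller
    // then performs the full-tensor transform + device upload and calls release().
    bool add_chunk(const void * tensor_id, size_t total_size,
                   size_t offset, const void * data, size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        tensor_set_tracker & t = trackers_[tensor_id];
        if (t.staging.empty()) {
            t.staging.resize(total_size);
        }
        std::memcpy(t.staging.data() + offset, data, size);
        t.bytes_written += size;
        return t.bytes_written == total_size;
    }

    // Drop the tracker (and its staging buffer) once post-processing is done.
    void release(const void * tensor_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        trackers_.erase(tensor_id);
    }

private:
    std::mutex mutex_;
    std::unordered_map<const void *, tensor_set_tracker> trackers_;
};
```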

* CANN: fix L2_NORM ignoring eps parameter

The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.

* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.

Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().

* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
  so the cache can distinguish tensors with different types but
  identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
  SCALE/UNARY/GLU/ROPE/POOL_2D
```
$ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100%
Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100%
...
```

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* port cpy pipeline to shader lib with JIT compilation
 * port glu pipeline to shader lib with JIT compilation
 * port rope pipeline to shader lib with JIT compilation
 * port soft_max pipeline to shader lib with JIT compilation
 * removed unused functions from embed_wgsl.py which were used for
old AOT template expansion
…gml-org#21046)

* Work towards removing bitcast

* Move rest of existing types over

* Add timeout back to wait and remove synchronous set_tensor/memset_tensor

* move to unpackf16 for wider compatibility

* cleanup

* Remove deadlock condition in free_bufs
…face (ggml-org#20346)

* Refactor llama_model_quantize_params to expose a pure C interface

* Restore comment and cleanup struct def

* Code review refactoring

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Code review refactoring

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
)

* flash attention support for head dimension 512 added

* FA D=512 - match 576 configs, limit ncols2, revert vec cap

* fix HIP tile kernel build for D=512

* fix HIP tile kernel occupancy for D=512 on AMD

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix tile FA compilation

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml-cpu: refactor sgemm; fix rvv checks

* ggml-cpu: refactor rvv kernels; set zvfbfwma default to off
am17an and others added 8 commits April 6, 2026 22:26
* llama-bench: add `-fitc` and `-fitt` to arguments

* update README.md

* address review comments

* update compare-llama-bench.py
…#21159)

* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst.
Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

* Address review comments

* Address review comments

* Revert variable names to original
* llama-cli: fix stripping of \n in multiline input

* Change & string to string_view

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)

* add generic fallback for x86

* remove Q1_0 (group size 32)

* rename Q1_0_g128 => Q1_0

* fix Q1_0 LlamaFileType Enum

* Fix trailing spaces; add generic fallback for other backends

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix \r\n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add mul_mat_id support to WebGPU

* Apply suggestion from @reeselevine

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
…gml-org#21527)

Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access; it was implemented for Q4_0/Q4_K/Q6_K but was missing
for Q8_0.

On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.

The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag, so the optimization
was silently skipped.

AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.

Fixes: ggml-org#21517
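
Illustrative only (names and call site are assumptions, not the SYCL backend's
actual code): the fix amounts to including Q8_0 in whatever predicate gates
allocation of the reorder metadata, e.g.:

```cpp
#include "ggml.h"

// Hypothetical helper; the real check lives in ggml_backend_sycl_buffer_init_tensor().
static bool type_supports_reorder(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_K:
        case GGML_TYPE_Q6_K:
        case GGML_TYPE_Q8_0:   // previously missing, so the reorder was silently skipped
            return true;
        default:
            return false;
    }
}
```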

@gemini-code-assist (bot) left a comment

Code Review

This extensive pull request introduces support for the Gemma 4 model architecture, including its per-layer embeddings and specialized chat templates. It implements the new Q1_0 quantization type and adds NVFP4 support for CUDA and SYCL backends. Significant performance enhancements are provided for the WebGPU backend through optimized shaders for MUL_MAT_ID and flash attention, while the CPU backend receives RISC-V vector optimizations. Other improvements include new CUMSUM support for Hexagon, Q4_K support for Adreno GPUs, and revised contribution guidelines for AI agents. Feedback suggests improving maintainability by refactoring repetitive requirement checks into a loop, breaking down complex parser logic, and consolidating duplicated FP8 conversion code. Additionally, the reviewer recommends using a dispatch table for device detection, vectorizing WebGPU output logic, and deriving hardcoded layer counts from existing parameters.

Comment thread ci/run.sh
Comment on lines +635 to 669
if ! command -v git &> /dev/null; then
gg_printf 'git not found, please install'
fi

if ! command -v git-lfs &> /dev/null; then
gg_printf 'git-lfs not found, please install'
fi

if ! command -v wget &> /dev/null; then
gg_printf 'wget not found, please install'
fi

if ! command -v python3 &> /dev/null; then
gg_printf 'python3 not found, please install'
fi

if ! command -v pip3 &> /dev/null; then
gg_printf 'pip3 not found, please install'
fi

if ! python3 -m ensurepip --help &> /dev/null; then
gg_printf 'ensurepip not found, please install python3-venv package'
fi

if ! command -v cmake &> /dev/null; then
gg_printf 'cmake not found, please install'
fi

if ! command -v ccache &> /dev/null; then
gg_printf 'ccache not found, please consider installing for faster builds'
fi

if ! command -v ctest &> /dev/null; then
gg_printf 'ctest not found, please install'
fi

medium

This function is becoming quite large and repetitive. Consider refactoring to use a loop over an array of required binaries to improve maintainability.

Comment on lines +219 to +255
common_peg_parser analyze_tools::build_func_parser(common_chat_peg_builder & p, const std::string & name,
const common_peg_parser & call_id_section, bool have_call_id,
const common_peg_parser & args,
std::optional<common_peg_parser> atomic_peek) const {
auto open = p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix);
bool matched_atomic = false;
common_peg_parser func_parser = p.eps();

if (!function.name_suffix.empty()) {
func_parser = open + call_id_section + p.space() + args;
matched_atomic = true;
} else if (have_call_id) {
func_parser = p.atomic(open + call_id_section) + p.space() + args;
matched_atomic = true;
} else if (atomic_peek.has_value()) {
func_parser = p.atomic(open + call_id_section + p.space() + *atomic_peek) + args;
matched_atomic = true;
} else {
func_parser = open + call_id_section + p.space() + args;
}

if (!function.close.empty()) {
func_parser = func_parser + p.space() + p.tool_close(p.literal(function.close));
} else if (!format.per_call_end.empty()) {
// When there's no func_close but there is a per_call_end marker, use peek() to ensure
// we only emit tool_close when we can actually see the closing marker. This prevents
// premature closing during partial parsing when we've seen e.g. "</" which could be
// either "</tool_call>" (end) or "<arg_key>" prefix that failed to match.
func_parser = func_parser + p.tool_close(p.peek(p.literal(format.per_call_end)));
} else {
func_parser = func_parser + p.tool_close(p.space()); // force this to process tool closing callbacks in mapper
}
if (!matched_atomic) {
func_parser = p.atomic(func_parser);
}
return func_parser;
}

medium

The function build_func_parser is quite complex. Consider breaking it down into smaller, more focused helper functions to improve readability and maintainability.
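
One possible split along the lines the reviewer suggests, reusing only names
visible in the snippet above (a sketch, not the actual refactor): the three
closing cases move into a dedicated helper so build_func_parser only composes
the open/args/close parts.

```cpp
// Sketch: factor the closing-marker handling out of build_func_parser.
static common_peg_parser append_close(common_chat_peg_builder & p,
                                      const common_peg_parser & func_parser,
                                      const std::string & func_close,
                                      const std::string & per_call_end) {
    if (!func_close.empty()) {
        return func_parser + p.space() + p.tool_close(p.literal(func_close));
    }
    if (!per_call_end.empty()) {
        // peek() so tool_close only fires once the closing marker is actually visible
        return func_parser + p.tool_close(p.peek(p.literal(per_call_end)));
    }
    return func_parser + p.tool_close(p.space()); // force the tool-closing callbacks
}
```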

Comment on lines +815 to +826
if (x == 0 || (x == 0x7F && x != 0xFF)) { // Convert NaN to 0.0f
return 0.0f;
}
const int exp = (x >> 3) & 0xF;
const int man = x & 0x7;
float raw;
if (exp == 0) {
raw = ldexpf((float) man, -9);
} else {
raw = ldexpf(1.0f + (float) man / 8.0f, exp - 7);
}
return static_cast<float>(raw / 2);

medium

The logic for converting FP8 to FP32 is duplicated across different architectures. Consider consolidating this into a single, robust implementation if possible to reduce maintenance overhead.
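
One way to consolidate, as a sketch: wrap the conversion shown above in a single
shared inline helper that each backend calls. The helper name is hypothetical;
the body reproduces the logic in the snippet.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical shared helper reproducing the conversion above, so each backend
// can call one implementation instead of carrying its own copy.
static inline float fp8_to_fp32_shared(uint8_t x) {
    if (x == 0 || (x == 0x7F && x != 0xFF)) { // zero, or the NaN pattern -> 0.0f
        return 0.0f;
    }
    const int exp = (x >> 3) & 0xF;
    const int man = x & 0x7;
    float raw;
    if (exp == 0) {
        raw = ldexpf((float) man, -9);                    // subnormal
    } else {
        raw = ldexpf(1.0f + (float) man / 8.0f, exp - 7); // normal
    }
    return raw / 2.0f;
}
```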

Comment on lines 238 to 266
if (GGML_CUDA_CC_IS_NVIDIA(cc)) {
if (cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE) {
return MMVQ_MAX_BATCH_SIZE;
}
if (cc >= GGML_CUDA_CC_TURING) {
return get_mmvq_mmid_max_batch_turing_plus(type);
}
return get_mmvq_mmid_max_batch_pascal_older(type);
}

// AMD
if (GGML_CUDA_CC_IS_RDNA4(cc)) {
return get_mmvq_mmid_max_batch_rdna4(type);
}
if (GGML_CUDA_CC_IS_RDNA3(cc)) {
return get_mmvq_mmid_max_batch_rdna3(type);
}
if (GGML_CUDA_CC_IS_RDNA1(cc) || GGML_CUDA_CC_IS_RDNA2(cc)) {
return get_mmvq_mmid_max_batch_rdna1_rdna2(type);
}
if (GGML_CUDA_CC_IS_CDNA(cc)) {
return get_mmvq_mmid_max_batch_cdna(type);
}
if (GGML_CUDA_CC_IS_GCN(cc)) {
return get_mmvq_mmid_max_batch_gcn(type);
if (GGML_CUDA_CC_IS_AMD(cc)) {
if (GGML_CUDA_CC_IS_RDNA4(cc)) {
return get_mmvq_mmid_max_batch_rdna4(type);
}
if (GGML_CUDA_CC_IS_RDNA3(cc)) {
return get_mmvq_mmid_max_batch_rdna3(type);
}
if (GGML_CUDA_CC_IS_RDNA1(cc) || GGML_CUDA_CC_IS_RDNA2(cc)) {
return get_mmvq_mmid_max_batch_rdna1_rdna2(type);
}
if (GGML_CUDA_CC_IS_CDNA(cc)) {
return get_mmvq_mmid_max_batch_cdna(type);
}
if (GGML_CUDA_CC_IS_GCN(cc)) {
return get_mmvq_mmid_max_batch_gcn(type);
}
}
return MMVQ_MAX_BATCH_SIZE;

medium

The nested if statements for device architecture detection are becoming difficult to follow. Consider using a more structured approach, such as a dispatch table or a cleaner helper function, to improve readability.
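
A sketch of the flatter lookup the reviewer is asking for. Illustrative only: it
assumes the GGML_CUDA_CC_IS_* predicates and per-arch helpers shown in the diff,
and that those helpers are cheap and side-effect free, since every candidate
value is computed up front.

```cpp
// Hypothetical consolidation of the AMD branch into a table scan.
static int get_mmvq_mmid_max_batch_amd(const int cc, const ggml_type type) {
    const struct { bool match; int value; } rules[] = {
        { GGML_CUDA_CC_IS_RDNA4(cc),                              get_mmvq_mmid_max_batch_rdna4(type)       },
        { GGML_CUDA_CC_IS_RDNA3(cc),                              get_mmvq_mmid_max_batch_rdna3(type)       },
        { GGML_CUDA_CC_IS_RDNA1(cc) || GGML_CUDA_CC_IS_RDNA2(cc), get_mmvq_mmid_max_batch_rdna1_rdna2(type) },
        { GGML_CUDA_CC_IS_CDNA(cc),                               get_mmvq_mmid_max_batch_cdna(type)        },
        { GGML_CUDA_CC_IS_GCN(cc),                                get_mmvq_mmid_max_batch_gcn(type)         },
    };
    for (const auto & r : rules) {
        if (r.match) {
            return r.value;
        }
    }
    return MMVQ_MAX_BATCH_SIZE;
}
```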

Comment on lines +141 to +171
var acc: array<array<f16, TILE_M>, TILE_N>;

for (var k_outer = 0u; k_outer < params.k; k_outer += TILE_K) {

if (is_valid) {
init_shmem_src0(thread_id, src0_batch_offset, offset_wg_m, k_outer);
init_shmem_id_src1(thread_id, params.offset_src1, rest_token_n, k_outer);
}

workgroupBarrier();

if (is_valid) {
let k_end = min(TILE_K, params.k - k_outer);

for (var k_inner = 0u; k_inner < k_end; k_inner++) {
var src0_tile: array<f16, TILE_M>;
for (var tm = 0u; tm < TILE_M; tm++) {
let src0_m = local_m * TILE_M + tm;
let src0_idx = k_inner + src0_m * TILE_K;
src0_tile[tm] = shmem[src0_idx];
}
for (var tn = 0u; tn < TILE_N; tn++) {
let src1_n = local_n * TILE_N + tn;
let src1_idx = src1_n * TILE_K + k_inner;
let src1_val = shmem[TILE_SRC0_SHMEM + src1_idx];
for (var tm = 0u; tm < TILE_M; tm++) {
acc[tn][tm] += src0_tile[tm] * src1_val;
}
}
}
}

medium

The output storage logic uses multiple if checks for bounds. Consider if this can be simplified or vectorized to improve performance and reduce code duplication.

Comment thread src/llama-model.cpp
Comment on lines +1284 to +1288
switch (hparams.n_layer) {
case 35: type = LLM_TYPE_E2B; break;
case 42: type = LLM_TYPE_E4B; break; // to confirm: E4B or E5B?
default: type = LLM_TYPE_UNKNOWN;
}

medium

The switch-case for layer counts is hardcoded. Consider if this can be derived from other hparams or if a more flexible mapping is needed for future model variants.
