ggml-webgpu: fast matrix-vector multiplication for i-quants #22344
Merged
reeselevine merged 1 commit into ggml-org:master on Apr 27, 2026
Conversation
CISC (Contributor) approved these changes on Apr 25, 2026:
Looks good! In terms of future work on i-quants, for you or anyone else who is interested in collaborating:
reeselevine approved these changes on Apr 27, 2026
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request on Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request on May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request on May 1, 2026:
Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Overview
Adds fast WebGPU mat-vec implementations for all nine i-quant types (IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS). The kernels are added to mul_mat_vec.wgsl and selected through the existing use_fast dispatcher in ggml_webgpu_mul_mat.

Additional information
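To make the shape of these kernels concrete, here is a minimal NumPy sketch of what a mat-vec over one of the i-quant formats computes, using the IQ4_NL case: each 32-value block stores a scale plus 16 bytes of 4-bit indices into a fixed nonlinear codebook. The codebook values below follow ggml's kvalues_iq4nl table; the helper names and the plain-Python loop structure are illustrative, not the actual WGSL kernel.

```python
import numpy as np

# Nonlinear 4-bit codebook used by IQ4_NL in ggml (kvalues_iq4nl).
IQ4NL_VALUES = np.array(
    [-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113],
    dtype=np.float32,
)

def dequant_iq4nl_block(scale: float, packed: np.ndarray) -> np.ndarray:
    """Dequantize one 32-value IQ4_NL block from 16 packed bytes.

    Values 0..15 come from the low nibbles, 16..31 from the high nibbles.
    """
    lo = IQ4NL_VALUES[packed & 0x0F]
    hi = IQ4NL_VALUES[packed >> 4]
    return scale * np.concatenate([lo, hi])

def matvec(scales, packed_rows, x):
    """y[i] = dot(dequant(row i), x).

    `scales[i]` and `packed_rows[i]` hold one (scale, 16-byte) pair per
    32-value block of row i. The GPU kernel does the same reduction, but
    spread across workgroup invocations with a shared-memory partial sum.
    """
    y = np.zeros(len(packed_rows), dtype=np.float32)
    for i, (row_scales, row_blocks) in enumerate(zip(scales, packed_rows)):
        for b, (d, q) in enumerate(zip(row_scales, row_blocks)):
            y[i] += dequant_iq4nl_block(d, q) @ x[b * 32:(b + 1) * 32]
    return y
```

The other i-quant formats differ mainly in how the block bytes encode codebook indices, signs, and sub-block scales; the dequantize-then-accumulate structure is the same.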
Numbers below are from test-backend-ops perf, comparing this branch vs. current master across the nine i-quant types.
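The comparison above exercises the new fast path against the generic fallback that master uses for these types. The selection logic can be sketched as a lookup from quantization type to a specialized shader entry point (a hypothetical sketch; the real decision lives in ggml_webgpu_mul_mat and its use_fast flag, and the kernel names here are illustrative):

```python
# Hypothetical table: one fast mat-vec shader entry point per supported
# i-quant type, mirroring the kernels added to mul_mat_vec.wgsl.
FAST_MATVEC_KERNELS = {
    "iq1_s": "mul_mat_vec_iq1_s",
    "iq4_nl": "mul_mat_vec_iq4_nl",
    # ... one entry per supported quant type
}

def select_kernel(quant_type: str, use_fast: bool) -> str:
    """Pick the fast specialized kernel when available, else the generic path."""
    if use_fast and quant_type in FAST_MATVEC_KERNELS:
        return FAST_MATVEC_KERNELS[quant_type]
    return "mul_mat_generic"  # fallback path, used for i-quants before this PR
```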
Intel Arc B580 (Mesa 25.2.8, Dawn 4654ba883e): (perf table)

NVIDIA RTX 5080 (Dawn 4654ba883e): (perf table)

AMD Radeon RX 7900 XT (Mesa 25.2.8, Dawn 4654ba883e): (perf table)

Apple M2 (Dawn 4654ba883e): (perf table)