ggml-webgpu: fast matrix-vector multiplication for i-quants #22344
Merged
reeselevine merged 1 commit into ggml-org:master on Apr 27, 2026
Conversation
CISC (Contributor) approved these changes on Apr 25, 2026:
Looks good! In terms of future work on i-quants, for you or anyone else who is interested in collaborating:
reeselevine approved these changes on Apr 27, 2026
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request on Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request on May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request on May 1, 2026:
Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Overview
Adds fast WebGPU mat-vec implementations for all nine i-quant types (IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS). The kernels are added to mul_mat_vec.wgsl and selected through the existing use_fast dispatcher in ggml_webgpu_mul_mat.

Additional information
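To make the shape of these kernels concrete, here is a minimal NumPy sketch of what a mat-vec over one of the i-quant formats computes, using the IQ4_NL case: each 32-value block stores a scale plus 16 bytes of 4-bit indices into a fixed nonlinear codebook. The codebook values below follow ggml's kvalues_iq4nl table; the helper names and the plain-Python loop structure are illustrative, not the actual WGSL kernel.

```python
import numpy as np

# Nonlinear 4-bit codebook used by IQ4_NL in ggml (kvalues_iq4nl).
IQ4NL_VALUES = np.array(
    [-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113],
    dtype=np.float32,
)

def dequant_iq4nl_block(scale: float, packed: np.ndarray) -> np.ndarray:
    """Dequantize one 32-value IQ4_NL block from 16 packed bytes.

    Values 0..15 come from the low nibbles, 16..31 from the high nibbles.
    """
    lo = IQ4NL_VALUES[packed & 0x0F]
    hi = IQ4NL_VALUES[packed >> 4]
    return scale * np.concatenate([lo, hi])

def matvec(scales, packed_rows, x):
    """y[i] = dot(dequant(row i), x).

    `scales[i]` and `packed_rows[i]` hold one (scale, 16-byte) pair per
    32-value block of row i. The GPU kernel does the same reduction, but
    spread across workgroup invocations with a shared-memory partial sum.
    """
    y = np.zeros(len(packed_rows), dtype=np.float32)
    for i, (row_scales, row_blocks) in enumerate(zip(scales, packed_rows)):
        for b, (d, q) in enumerate(zip(row_scales, row_blocks)):
            y[i] += dequant_iq4nl_block(d, q) @ x[b * 32:(b + 1) * 32]
    return y
```

The other i-quant formats differ mainly in how the block bytes encode codebook indices, signs, and sub-block scales; the dequantize-then-accumulate structure is the same.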
Numbers below are from test-backend-ops perf, comparing this branch vs. current master across the nine i-quant types.
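The comparison above exercises the new fast path against the generic fallback that master uses for these types. The selection logic can be sketched as a lookup from quantization type to a specialized shader entry point (a hypothetical sketch; the real decision lives in ggml_webgpu_mul_mat and its use_fast flag, and the kernel names here are illustrative):

```python
# Hypothetical table: one fast mat-vec shader entry point per supported
# i-quant type, mirroring the kernels added to mul_mat_vec.wgsl.
FAST_MATVEC_KERNELS = {
    "iq1_s": "mul_mat_vec_iq1_s",
    "iq4_nl": "mul_mat_vec_iq4_nl",
    # ... one entry per supported quant type
}

def select_kernel(quant_type: str, use_fast: bool) -> str:
    """Pick the fast specialized kernel when available, else the generic path."""
    if use_fast and quant_type in FAST_MATVEC_KERNELS:
        return FAST_MATVEC_KERNELS[quant_type]
    return "mul_mat_generic"  # fallback path, used for i-quants before this PR
```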
Intel Arc B580 (Mesa 25.2.8, Dawn 4654ba883e): (perf table)

NVIDIA RTX 5080 (Dawn 4654ba883e): (perf table)

AMD Radeon RX 7900 XT (Mesa 25.2.8, Dawn 4654ba883e): (perf table)

Apple M2 (Dawn 4654ba883e): (perf table)