ggml-webgpu: fast matrix-vector multiplication for i-quants#22344

Merged
reeselevine merged 1 commit into ggml-org:master from SharmaRithik:webgpu-matvec-iq-fast
Apr 27, 2026

Conversation

@SharmaRithik
Contributor

Overview

Adds fast WebGPU mat-vec implementations for all nine i-quant types (IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS). The kernels are added to mul_mat_vec.wgsl and selected through the existing use_fast dispatcher in ggml_webgpu_mul_mat.
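To illustrate the dispatch described above, here is a minimal sketch (in Python, for readability; the real dispatcher is C++ in ggml_webgpu_mul_mat) of how a use_fast-style selection might key off the quant type and the src1 column count. All names here are hypothetical stand-ins, not the actual ggml symbols.

```python
# Hypothetical sketch of fast-path selection; the real logic lives in
# ggml_webgpu_mul_mat (C++) and keys off more than shown here.

FAST_IQUANT_TYPES = {
    "IQ1_S", "IQ1_M", "IQ2_XXS", "IQ2_XS", "IQ2_S",
    "IQ3_XXS", "IQ3_S", "IQ4_NL", "IQ4_XS",
}

def select_kernel(type_a: str, n: int) -> str:
    """Pick a shader variant: the fast mat-vec path when src1 has a
    single column (n == 1) and the quant type has a fast kernel,
    otherwise the generic matrix-multiplication path."""
    if n == 1 and type_a in FAST_IQUANT_TYPES:
        return "mul_mat_vec_fast"
    return "mul_mat_generic"
```

In this sketch only the nine i-quant types route to the fast mat-vec shader; everything else (and any n > 1 call) falls through to the generic path.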

Additional information

Numbers below are from test-backend-ops perf, comparing this branch vs. current master for the variant

MUL_MAT(type_a=<TYPE>,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1)

across the nine i-quant types.
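For anyone wanting to reproduce these numbers, the benchmark tool ships with llama.cpp; it supports a perf mode and an `-o` filter by op name (exact flags and build paths may vary between versions):

```shell
# Build llama.cpp with the WebGPU backend enabled, then benchmark MUL_MAT
# against the WebGPU device. Output path assumes a standard CMake build.
./build/bin/test-backend-ops perf -o MUL_MAT
```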

Intel Arc B580 (Mesa 25.2.8, Dawn 4654ba883e): [performance chart]

NVIDIA RTX 5080 (Dawn 4654ba883e): [performance chart]

AMD Radeon RX 7900 XT (Mesa 25.2.8, Dawn 4654ba883e): [performance chart]

Apple M2 (Dawn 4654ba883e): [performance chart]

@SharmaRithik SharmaRithik requested a review from a team as a code owner April 25, 2026 01:57
@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels Apr 25, 2026
@reeselevine
Contributor

Looks good! In terms of future work on i-quants, for you or anyone else who is interested in collaborating:

  • We should add i-quant support to the shared-memory loading in the matrix multiplication shaders, at which point we could fully remove the legacy mat-mul path.
  • One optimization used by some of the other backends is to load the i-quant tables into shared memory collaboratively, so that no single thread has to maintain the giant array locally; keeping it per-thread almost certainly leads to register pressure and spilling.
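The collaborative-load idea in the second bullet can be sketched as a strided copy: each of the W threads in a workgroup writes table entries tid, tid+W, tid+2W, ..., so the whole table lands in shared memory exactly once. The simulation below (plain Python, not the WGSL shader; workgroup size and table size are hypothetical) just demonstrates the indexing pattern; a real shader would need a workgroup barrier between the writes and any reads.

```python
WORKGROUP_SIZE = 64        # hypothetical; the real size is shader-specific
TABLE = list(range(512))   # stand-in for an i-quant codebook/grid

def cooperative_load(table, workgroup_size):
    """Simulate a workgroup cooperatively copying `table` into shared
    memory: thread `tid` handles indices tid, tid+W, tid+2W, ..."""
    shared = [None] * len(table)
    for tid in range(workgroup_size):          # "threads" run the same loop
        for i in range(tid, len(table), workgroup_size):
            shared[i] = table[i]
    # A real shader would issue workgroupBarrier() here before any reads.
    return shared
```

The strided pattern also gives coalesced-style access (adjacent threads touch adjacent entries on each iteration), which is why backends tend to prefer it over splitting the table into contiguous per-thread chunks.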

@reeselevine reeselevine merged commit 665abc6 into ggml-org:master Apr 27, 2026
44 of 46 checks passed
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request May 1, 2026
