
ggml-webgpu: add fast mat-mat path for i-quants#22504

Merged
reeselevine merged 1 commit into ggml-org:master from SharmaRithik:webgpu-iq-mul-mat-fast-path on Apr 30, 2026

Conversation

@SharmaRithik
Contributor

Overview

Adds i-quant support to the WebGPU fast mat-mat path. Previously the i-quants (IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS) only had a fast mat-vec kernel; mat-mat (prefill) fell back to the legacy non-tiled mul_mat.wgsl path. This PR adds the missing INIT_SRC0_SHMEM_IQ* blocks to mul_mat_decls.tmpl, so the same shared-memory dequantization now feeds both fast paths.
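For context, a minimal CPU-side sketch of what the shared-memory dequant step does for one of these formats. This is illustrative Python, not the PR's WGSL code: IQ4_NL packs 32 weights as an fp16 scale plus 32 4-bit indices into a fixed nonlinear codebook (values as defined in ggml's reference tables), and the INIT_SRC0_SHMEM blocks perform the equivalent expansion on the GPU so the tiled mat-mat kernel can consume plain floats.

```python
# Illustrative sketch (not the PR's WGSL): dequantizing one IQ4_NL block.
# The codebook below is ggml's kvalues_iq4nl nonlinear lookup table.
KVALUES_IQ4NL = [-127, -104, -83, -65, -49, -35, -22, -10,
                 1, 13, 25, 38, 53, 69, 89, 113]

def dequant_iq4_nl_block(d: float, qs: bytes) -> list[float]:
    """Expand one 32-weight IQ4_NL block: scale d, 16 packed bytes."""
    assert len(qs) == 16
    out = []
    # Matching ggml's layout: the 16 low nibbles give weights 0..15,
    # the 16 high nibbles give weights 16..31.
    for b in qs:
        out.append(d * KVALUES_IQ4NL[b & 0x0F])
    for b in qs:
        out.append(d * KVALUES_IQ4NL[b >> 4])
    return out
```

The point of moving this into shared memory is that the expansion is done once per tile and then reused by every thread computing output elements from that tile, for both the register-tile and subgroup-matrix paths.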

Additional information

The numbers below are kernel-level throughput (GFLOPS) from `test-backend-ops perf -o MUL_MAT` at m=4096, n=512, k=14336. The register-tile column was measured with the subgroup_matrix capability disabled, so the fallback fast path runs directly.

Intel Arc B580

| Quant   | master (GFLOPS) | register-tile (GFLOPS) | subgroup-matrix (GFLOPS) |
|---------|----------------:|-----------------------:|-------------------------:|
| IQ1_S   |             411 |                   2060 |                     7280 |
| IQ1_M   |             446 |                   1850 |                     6330 |
| IQ2_XXS |             490 |                   2010 |                     7180 |
| IQ2_XS  |             335 |                   1830 |                     6700 |
| IQ2_S   |             387 |                   1860 |                     6550 |
| IQ3_XXS |             497 |                   1920 |                     6720 |
| IQ3_S   |             474 |                   1800 |                     6330 |
| IQ4_NL  |             608 |                   2090 |                     8160 |
| IQ4_XS  |             577 |                   1900 |                     7020 |

Apple M2

| Quant   | master (GFLOPS) | register-tile (GFLOPS) | subgroup-matrix (GFLOPS) |
|---------|----------------:|-----------------------:|-------------------------:|
| IQ1_S   |             138 |                    362 |                      914 |
| IQ1_M   |             143 |                    365 |                      897 |
| IQ2_XXS |             188 |                    344 |                      812 |
| IQ2_XS  |             160 |                    340 |                      813 |
| IQ2_S   |             163 |                    305 |                      738 |
| IQ3_XXS |              85 |                    345 |                      864 |
| IQ3_S   |             106 |                    297 |                      734 |
| IQ4_NL  |             139 |                    395 |                     1080 |
| IQ4_XS  |             181 |                    394 |                     1090 |


@SharmaRithik SharmaRithik requested a review from a team as a code owner April 29, 2026 07:13
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels Apr 29, 2026
@reeselevine reeselevine merged commit 4515559 into ggml-org:master Apr 30, 2026
44 of 46 checks passed
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request May 1, 2026
