
ggml-webgpu: add fast mat-mat path for i-quants#22504

Merged
reeselevine merged 1 commit into ggml-org:master from SharmaRithik:webgpu-iq-mul-mat-fast-path on Apr 30, 2026

Conversation

@SharmaRithik
Contributor

Overview

Adds i-quant support to the WebGPU fast mat-mat path. Previously the i-quants (IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S, IQ4_NL, IQ4_XS) only had a fast mat-vec kernel; mat-mat (prefill) fell back to the legacy non-tiled mul_mat.wgsl path. This PR adds the missing INIT_SRC0_SHMEM_IQ* blocks to mul_mat_decls.tmpl, so the same shared-memory dequantization now feeds both fast paths.
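For context, a minimal CPU-side sketch of what the shared-memory dequant step does for one of these formats. This is illustrative Python, not the PR's WGSL code: IQ4_NL packs 32 weights as an fp16 scale plus 32 4-bit indices into a fixed nonlinear codebook (values as defined in ggml's reference tables), and the INIT_SRC0_SHMEM blocks perform the equivalent expansion on the GPU so the tiled mat-mat kernel can consume plain floats.

```python
# Illustrative sketch (not the PR's WGSL): dequantizing one IQ4_NL block.
# The codebook below is ggml's kvalues_iq4nl nonlinear lookup table.
KVALUES_IQ4NL = [-127, -104, -83, -65, -49, -35, -22, -10,
                 1, 13, 25, 38, 53, 69, 89, 113]

def dequant_iq4_nl_block(d: float, qs: bytes) -> list[float]:
    """Expand one 32-weight IQ4_NL block: scale d, 16 packed bytes."""
    assert len(qs) == 16
    out = []
    # Matching ggml's layout: the 16 low nibbles give weights 0..15,
    # the 16 high nibbles give weights 16..31.
    for b in qs:
        out.append(d * KVALUES_IQ4NL[b & 0x0F])
    for b in qs:
        out.append(d * KVALUES_IQ4NL[b >> 4])
    return out
```

The point of moving this into shared memory is that the expansion is done once per tile and then reused by every thread computing output elements from that tile, for both the register-tile and subgroup-matrix paths.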

Additional information

The numbers below are kernel-level throughput (GFLOPS) from `test-backend-ops perf -o MUL_MAT` at m=4096, n=512, k=14336. The register-tile column was measured with the subgroup_matrix capability disabled, so the fallback fast path runs directly.

Intel Arc B580

| Quant   | master (GFLOPS) | register-tile (GFLOPS) | subgroup-matrix (GFLOPS) |
|---------|----------------:|-----------------------:|-------------------------:|
| IQ1_S   |             411 |                   2060 |                     7280 |
| IQ1_M   |             446 |                   1850 |                     6330 |
| IQ2_XXS |             490 |                   2010 |                     7180 |
| IQ2_XS  |             335 |                   1830 |                     6700 |
| IQ2_S   |             387 |                   1860 |                     6550 |
| IQ3_XXS |             497 |                   1920 |                     6720 |
| IQ3_S   |             474 |                   1800 |                     6330 |
| IQ4_NL  |             608 |                   2090 |                     8160 |
| IQ4_XS  |             577 |                   1900 |                     7020 |

Apple M2

| Quant   | master (GFLOPS) | register-tile (GFLOPS) | subgroup-matrix (GFLOPS) |
|---------|----------------:|-----------------------:|-------------------------:|
| IQ1_S   |             138 |                    362 |                      914 |
| IQ1_M   |             143 |                    365 |                      897 |
| IQ2_XXS |             188 |                    344 |                      812 |
| IQ2_XS  |             160 |                    340 |                      813 |
| IQ2_S   |             163 |                    305 |                      738 |
| IQ3_XXS |              85 |                    345 |                      864 |
| IQ3_S   |             106 |                    297 |                      734 |
| IQ4_NL  |             139 |                    395 |                     1080 |
| IQ4_XS  |             181 |                    394 |                     1090 |


@SharmaRithik SharmaRithik requested a review from a team as a code owner April 29, 2026 07:13
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels Apr 29, 2026
@reeselevine reeselevine merged commit 4515559 into ggml-org:master Apr 30, 2026
44 of 46 checks passed
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request May 1, 2026
