ggml-cpu: add 128-bit RVV implementation for Quantization Vector Dot by taimur-10x · Pull Request #9 · riseproject-dev/llama.cpp

taimur-10x · 2026-02-13T15:47:57Z

Summary

This PR adds RVV 128-bit implementations for quantized vector dot kernels.

Key Changes

Added the following RVV kernels:

| Kernel | VLEN |
| --- | --- |
| ggml_vec_dot_iq1_s_q8_K | 128 |
| ggml_vec_dot_iq1_m_q8_K | 128 |
| ggml_vec_dot_iq2_xs_q8_K | 128 |
| ggml_vec_dot_iq3_s_q8_K | 128 |
| ggml_vec_dot_iq3_xxs_q8_K | 128 |
| ggml_vec_dot_iq4_xs_q8_K | 128 |
| ggml_vec_dot_tq1_0_q8_K | 128 |
| ggml_vec_dot_tq2_0_q8_K | 128 |
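
For readers new to these kernels, the following is a minimal scalar sketch in C of the general shape of a block-quantized vector dot product: integer products are accumulated per block, then scaled by the two block scales. The block layout `blk_q8`, the block size `BLK`, and the function name `vec_dot_blocks` are hypothetical and far simpler than ggml's actual iq*/tq* formats (which pack codebook indices or ternary digits); this is not the code added in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLK 32  /* hypothetical block size, not ggml's actual block size */

/* Hypothetical simplified block: one float scale plus BLK signed 8-bit values. */
typedef struct {
    float  d;        /* per-block scale */
    int8_t qs[BLK];  /* quantized values */
} blk_q8;

/* Dot product of two block-quantized vectors holding n elements each.
 * n must be a multiple of BLK. */
float vec_dot_blocks(size_t n, const blk_q8 *x, const blk_q8 *y) {
    assert(n % BLK == 0);
    float sum = 0.0f;
    for (size_t b = 0; b < n / BLK; ++b) {
        /* Accumulate the integer products of one block in 32 bits. */
        int32_t isum = 0;
        for (int i = 0; i < BLK; ++i) {
            isum += (int32_t)x[b].qs[i] * (int32_t)y[b].qs[i];
        }
        /* Apply both block scales once per block. */
        sum += x[b].d * y[b].d * (float)isum;
    }
    return sum;
}
```

A vectorized version would typically replace the inner loop with widening multiply-accumulate operations over as many elements as the vector length allows, which is why the table above lists a VLEN per kernel.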

Testing

The kernels were functionally tested with test-quantize-fns on QEMU at VLEN=128.
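
The PR does not describe how a kernel result is judged. As a rough sketch of the kind of check a functional test for a dot kernel might apply, the C snippet below compares a kernel's result against a plain float reference using an absolute error normalized by vector length. The metric, the threshold, and the names `dot_ref`, `dot_error`, and `dot_passes` are assumptions for illustration, not taken from test-quantize-fns.

```c
#include <assert.h>
#include <stddef.h>

/* Plain float dot product used as the reference result. */
float dot_ref(size_t n, const float *a, const float *b) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        s += a[i] * b[i];
    }
    return s;
}

/* Absolute error normalized by vector length (assumed metric). */
float dot_error(float got, float ref, size_t n) {
    assert(n > 0);
    float diff = got - ref;
    if (diff < 0.0f) {
        diff = -diff;
    }
    return diff / (float)n;
}

/* Returns 1 when the kernel result is within max_err of the reference. */
int dot_passes(float got, float ref, size_t n, float max_err) {
    return dot_error(got, ref, n) <= max_err;
}
```

In such a setup, `got` would come from the quantized kernel under test and `ref` from `dot_ref` on the unquantized inputs, with `max_err` chosen per quantization type.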

Future Work

Subsequent PRs are planned to extend the existing RVV kernels for quantization types to higher VLENs (512-bit and 1024-bit).

rehan-10xengineer · 2026-03-16T10:58:34Z

Opened a PR upstream.

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

)
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0. Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9). KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (ggml-org#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (ggml-org#17)
* meta : formatting, naming, indentation (ggml-org#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

taimur-10x marked this pull request as draft February 13, 2026 15:48

github-actions Bot added the ggml label Feb 13, 2026

taimur-10x force-pushed the 10x/riscv-quant-vec-dot-128b branch from 51b400b to 8bd5cbe Compare February 14, 2026 16:51

rehan-10xengineer force-pushed the 10x/riscv-quant-vec-dot-128b branch from 21b6845 to a22b149 Compare February 24, 2026 12:15

taimur-10x changed the base branch from master to 10x/riscv-quant March 4, 2026 11:26

taimur-10x force-pushed the 10x/riscv-quant-vec-dot-128b branch 2 times, most recently from f8f9384 to 2785c94 Compare March 4, 2026 11:41

taimur-10x marked this pull request as ready for review March 4, 2026 11:47

taimur-10x requested a review from david-baker-808 March 10, 2026 00:26

taimur-10x assigned taimur-10x and rehan-10xengineer Mar 10, 2026

rehan-10xengineer force-pushed the 10x/riscv-quant branch 3 times, most recently from 9ca80fc to 68e3cee Compare March 13, 2026 15:04

rehan-10xengineer force-pushed the 10x/riscv-quant-vec-dot-128b branch from 2785c94 to f83ddf7 Compare March 16, 2026 10:50

github-actions Bot added the documentation, testing, Nvidia GPU, Apple Metal, SYCL, Vulkan, examples, devops, python, script, server, model, and OpenCL labels Mar 16, 2026

rehan-10xengineer force-pushed the 10x/riscv-quant-vec-dot-128b branch from f83ddf7 to c7c6abc Compare March 16, 2026 10:55

rehan-10xengineer changed the base branch from 10x/riscv-quant to master March 16, 2026 11:17

taimur-10x force-pushed the 10x/riscv-quant-vec-dot-128b branch from c7c6abc to d618925 Compare March 16, 2026 12:15

taimur-10x removed the documentation, testing, Nvidia GPU, Apple Metal, SYCL, Vulkan, examples, devops, python, script, server, model, and OpenCL labels Mar 16, 2026

taimur-10x and others added 2 commits March 18, 2026 16:59

ggml-cpu: add 128-bit impls for i-quants, ternary quants

2fe760f

ggml-cpu: add 128-bit impls for iq2_xs, iq3_s, iq3_xxs, tq2_0

4b12d40

Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

taimur-10x force-pushed the 10x/riscv-quant-vec-dot-128b branch from d618925 to cf95828 Compare March 18, 2026 12:19

ggml-cpu: refactor; add rvv checks

05a5425

taimur-10x force-pushed the 10x/riscv-quant-vec-dot-128b branch from cf95828 to 05a5425 Compare March 18, 2026 12:47

taimur-10x merged commit 92dc6b1 into master Mar 18, 2026
32 of 51 checks passed

taimur-10x merged 3 commits into master from 10x/riscv-quant-vec-dot-128b

3 participants
