
ggml-cpu: simd_gemm implementation for riscv vector extension #11

Closed

rehan-10xengineer wants to merge 3 commits into master from rvv_flash_attn

Conversation

@rehan-10xengineer (Collaborator) commented Mar 9, 2026

Summary

Implemented a simd_gemm kernel for the RISC-V Vector Extension (RVV); the kernel is used by
ggml_compute_forward_flash_attn_ext_tiled.

The implementation was verified with the backend-ops tests.
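
For reference, the sketch below shows the general shape of an RVV f32 GEMM microkernel built on the standard `riscv_vector.h` intrinsics. The function name `rvv_gemm_f32`, the LMUL=4 choice, and the row-major layout are illustrative assumptions for this sketch, not the PR's actual kernel.

```c
// Minimal sketch (assumptions noted above): C[MxN] += A[MxK] * B[KxN], f32, row-major.
// Vectorizes along the N dimension; vsetvl handles the tail strip automatically.
#include <riscv_vector.h>

static void rvv_gemm_f32(int M, int N, int K,
                         const float *A, int lda,
                         const float *B, int ldb,
                         float       *C, int ldc) {
    for (int i = 0; i < M; i++) {
        int j = 0;
        while (j < N) {
            // vl = min(VLMAX, N - j): one strip of the output row
            size_t vl = __riscv_vsetvl_e32m4((size_t)(N - j));
            // load the current strip of C row i as the accumulator
            vfloat32m4_t acc = __riscv_vle32_v_f32m4(&C[i*ldc + j], vl);
            for (int k = 0; k < K; k++) {
                // broadcast A[i][k] and fused-multiply-accumulate a strip of B row k
                vfloat32m4_t b = __riscv_vle32_v_f32m4(&B[k*ldb + j], vl);
                acc = __riscv_vfmacc_vf_f32m4(acc, A[i*lda + k], b, vl);
            }
            __riscv_vse32_v_f32m4(&C[i*ldc + j], acc, vl);
            j += (int)vl;
        }
    }
}
```

Since type_KV = f16 in the benchmark below, the real kernel presumably also has an f16 path (e.g. widening loads into f32 accumulators), which this sketch omits.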

Performance Evaluation

Flash-attn op performance was measured with backend-ops on a Banana Pi BPI-F3.

TinyLlama-1B architecture parameters:

  • hsk = 64
  • hsv = 64
  • nh = 4
  • nr23 = [8,1]
  • mask = 1
  • prec = f32
  • type_KV = f16

| # of Tokens | Upstream GFLOPS | Vectorized GFLOPS | Speedup |
| ---: | ---: | ---: | ---: |
| 128 | 2.69 | 22.75 | 8.46× |
| 256 | 2.49 | 23.70 | 9.52× |
| 512 | 2.69 | 24.07 | 8.95× |
| 1024 | 2.60 | 24.43 | 9.40× |
| 2048 | 2.64 | 23.48 | 8.90× |
| 4096 | 2.72 | 23.16 | 8.51× |
| 8192 | 2.73 | 22.18 | 8.13× |
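
These figures come from the backend-ops perf mode. Assuming the upstream llama.cpp harness, an invocation along the lines of `test-backend-ops perf -o FLASH_ATTN_EXT` on the CPU backend should reproduce comparable per-op GFLOPS numbers (mode and flag spelling assumed from upstream, not taken from this PR).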

@github-actions github-actions Bot added the ggml label Mar 9, 2026
@rehan-10xengineer rehan-10xengineer self-assigned this Mar 9, 2026
@rehan-10xengineer rehan-10xengineer requested review from taimur-10x and removed request for taimur-10x March 9, 2026 06:43
@rehan-10xengineer rehan-10xengineer changed the title from "implemented simd_gemm kernel for riscv vector extension" to "ggml-cpu: simd_gemm implementation for riscv vector extension" Mar 16, 2026
@rehan-10xengineer (Collaborator, Author) commented

opened upstream here

rehan-10xengineer pushed a commit that referenced this pull request Apr 14, 2026

* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type

* Fix crashes due to KV cache serialization (#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.

* Enable the previous allreduce implementation. It is better in both perf and stability (#12)

* delay AllReduce for MoE for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (ggml-org#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVINO, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (ggml-org#17)

* meta : formatting, naming, indentation (ggml-org#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
