
[Tensor Parallel] Fix recurrent state serialization for partial reads and writes#22362

Merged
JohannesGaessler merged 1 commit into ggml-org:master from gaugarg-nv:fix_cache_serialization
Apr 26, 2026
Merged

[Tensor Parallel] Fix recurrent state serialization for partial reads and writes#22362
JohannesGaessler merged 1 commit intoggml-org:masterfrom
gaugarg-nv:fix_cache_serialization

Conversation

@gaugarg-nv
Contributor

The previous code only handled full tensor reads and writes; partial reads and writes hit the `GGML_ASSERT(size == ggml_nbytes(tensor));` assertion when tested with llama-server.

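The general pattern behind such a fix can be sketched as follows. This is not the actual llama.cpp implementation; `fake_tensor`, `tensor_set_partial`, and `tensor_get_partial` are hypothetical names used to illustrate replacing a full-size assertion with a bounds check on the requested `[offset, offset + size)` window:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor's backing buffer.
struct fake_tensor {
    std::vector<unsigned char> data;
    size_t nbytes() const { return data.size(); }
};

// Partial write: instead of asserting that `size` equals the full tensor
// size (which rejects partial writes), validate that the requested window
// fits inside the tensor.
void tensor_set_partial(fake_tensor & t, size_t offset, const void * src, size_t size) {
    assert(offset + size <= t.nbytes()); // replaces size == ggml_nbytes(tensor)
    std::memcpy(t.data.data() + offset, src, size);
}

// Matching partial read with the same bounds check.
void tensor_get_partial(const fake_tensor & t, size_t offset, void * dst, size_t size) {
    assert(offset + size <= t.nbytes());
    std::memcpy(dst, t.data.data() + offset, size);
}
```

With this check, a full read or write (`offset == 0`, `size == nbytes()`) still passes, while partial state serialization no longer trips the assertion.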
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 25, 2026
@gaugarg-nv gaugarg-nv changed the title Fix recurrent state serialization for partial reads and writes [Tensor Parallel] Fix recurrent state serialization for partial reads and writes Apr 25, 2026
@JohannesGaessler
Contributor

@ggml-org/maintainers can I get a second review, please?

@JohannesGaessler JohannesGaessler merged commit 78433f6 into ggml-org:master Apr 26, 2026
45 of 46 checks passed
@gaugarg-nv gaugarg-nv deleted the fix_cache_serialization branch April 26, 2026 11:41
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
…org#22362)

The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request May 1, 2026
Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
