
[Tensor Parallel] Fix recurrent state serialization for partial reads and writes#22362

Merged
JohannesGaessler merged 1 commit into ggml-org:master from gaugarg-nv:fix_cache_serialization
Apr 26, 2026
Merged

[Tensor Parallel] Fix recurrent state serialization for partial reads and writes#22362
JohannesGaessler merged 1 commit intoggml-org:masterfrom
gaugarg-nv:fix_cache_serialization

Conversation

@gaugarg-nv
Contributor

The previous code only handled full tensor reads and writes; partial reads and writes hit the `GGML_ASSERT(size == ggml_nbytes(tensor));` assertion when tested with llama-server.

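The general pattern behind such a fix can be sketched as follows. This is not the actual llama.cpp implementation; `fake_tensor`, `tensor_set_partial`, and `tensor_get_partial` are hypothetical names used to illustrate replacing a full-size assertion with a bounds check on the requested `[offset, offset + size)` window:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor's backing buffer.
struct fake_tensor {
    std::vector<unsigned char> data;
    size_t nbytes() const { return data.size(); }
};

// Partial write: instead of asserting that `size` equals the full tensor
// size (which rejects partial writes), validate that the requested window
// fits inside the tensor.
void tensor_set_partial(fake_tensor & t, size_t offset, const void * src, size_t size) {
    assert(offset + size <= t.nbytes()); // replaces size == ggml_nbytes(tensor)
    std::memcpy(t.data.data() + offset, src, size);
}

// Matching partial read with the same bounds check.
void tensor_get_partial(const fake_tensor & t, size_t offset, void * dst, size_t size) {
    assert(offset + size <= t.nbytes());
    std::memcpy(dst, t.data.data() + offset, size);
}
```

With this check, a full read or write (`offset == 0`, `size == nbytes()`) still passes, while partial state serialization no longer trips the assertion.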
@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 25, 2026
@gaugarg-nv gaugarg-nv changed the title Fix recurrent state serialization for partial reads and writes [Tensor Parallel] Fix recurrent state serialization for partial reads and writes Apr 25, 2026
@JohannesGaessler
Contributor

@ggml-org/maintainers can I get a second review, please?

@JohannesGaessler JohannesGaessler merged commit 78433f6 into ggml-org:master Apr 26, 2026
45 of 46 checks passed
@gaugarg-nv gaugarg-nv deleted the fix_cache_serialization branch April 26, 2026 11:41
IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
…org#22362)

The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
Crssz pushed a commit to Crssz/buun-llama-cpp that referenced this pull request May 1, 2026
Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
