[Tensor Parallel] Fix recurrent state serialization for partial reads and writes#22362
Merged
JohannesGaessler merged 1 commit into ggml-org:master on Apr 26, 2026
Conversation
The previous code worked only for full tensor reads and writes and was hitting the `GGML_ASSERT(size == ggml_nbytes(tensor));` assert when tested with llama-server.
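The failure mode described above can be sketched in isolation. The snippet below is a minimal illustration, not the actual patch: `fake_tensor`, `write_full_only`, and `write_partial_ok` are hypothetical stand-ins for ggml's buffer machinery. It contrasts an assert that demands a full-tensor write with the range check that a partial (offset, size) write needs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor with flat backing storage
// (illustrative only; not the real ggml tensor type).
struct fake_tensor {
    std::vector<uint8_t> data;                  // backing bytes
    size_t nbytes() const { return data.size(); }
};

// Old behavior (sketch): only a full-tensor write passes the assert,
// so any partial write of the recurrent state trips it.
void write_full_only(fake_tensor & t, const void * src, size_t offset, size_t size) {
    assert(size == t.nbytes());                 // fires for partial writes
    std::memcpy(t.data.data() + offset, src, size);
}

// Fixed behavior (sketch): only require the written range to fit
// inside the tensor, so sub-range reads/writes are accepted.
void write_partial_ok(fake_tensor & t, const void * src, size_t offset, size_t size) {
    assert(offset + size <= t.nbytes());        // range check instead of full-size check
    std::memcpy(t.data.data() + offset, src, size);
}
```

With an 8-byte tensor, `write_partial_ok(t, chunk, 2, 4)` copies 4 bytes at offset 2 and leaves the rest untouched, whereas `write_full_only` would abort on the same call.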
JohannesGaessler
approved these changes
Apr 26, 2026
JohannesGaessler (Contributor)
@ggml-org/maintainers can I get a second review, please?
CISC
approved these changes
Apr 26, 2026
IntelNav
pushed a commit
to IntelNav/llama.cpp
that referenced
this pull request
Apr 29, 2026
…org#22362) The previous code worked only for full tensor reads and writes and was hitting the `GGML_ASSERT(size == ggml_nbytes(tensor));` assert when tested with llama-server.
rsenthilkumar6
pushed a commit
to rsenthilkumar6/llama.cpp
that referenced
this pull request
May 1, 2026
…org#22362) The previous code worked only for full tensor reads and writes and was hitting the `GGML_ASSERT(size == ggml_nbytes(tensor));` assert when tested with llama-server.
Crssz
pushed a commit
to Crssz/buun-llama-cpp
that referenced
this pull request
May 1, 2026
Major upstream additions:
- CUDA graph improvements: LRU eviction, node property tracking, uid-based reuse
- Flash attention: stream-k fixup kernel, DKQ=320/DV=256 support, Pascal fix
- SSM_CONV + ADD + SILU 3-node fusion (ggml-org#22478)
- Blackwell native NVFP4 support (ggml-org#22196)
- Q1_0 1-bit quantization (CPU, CUDA, Metal, Vulkan, WebGPU)
- Backend-agnostic tensor parallelism (ggml-org#19378)
- Speculative decoding: checkpointing, param refactoring, low-prob discard
- libcommon renamed to libllama-common (ggml-org#21936)
- Server: /api endpoints removed, checkpoint support, CVE-2026-21869 fix
- Model refactors: build_qkv/create_tensor_qkv helpers, cmake glob for models
- Recurrent state serialization fix for partial reads/writes (ggml-org#22362)
- Fast mat-vec kernels for i-quants (ggml-org#22344, ggml-org#22504)

Conflict resolution (22 files):
- Turbo quant type IDs shifted +1 (42-46) to accommodate Q1_0 (41)
- SSM_CONV tree kernels preserved alongside new fusion
- DFlash spec decode coexists with upstream checkpointing
- Server slot fields renamed: drafted→spec_draft, i_batch_dft→spec_i_batch
- Qwen3.5/DeltaNet model registration uses new create_tensor_qkv helper
- Gemma4 BF16 precision fix preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>