Context Overflow Fix #1373
Conversation
update to LLama.Web.Common.InferenceOptions
I really like this idea, it's much better than hardcoding a specific behaviour. I haven't had a chance to review this in depth yet - but my first thought on reading is: do you think it might be feasible to factor the code handling the overflow out to a separate class? Something like a dedicated overflow-strategy class? (Note: If you don't think that's feasible or you simply don't want to work on it that's fine! I'm happy to go ahead with reviewing this as-is if you prefer.)
It seems redundant to me. Alternatively, you can make the method virtual.
I usually calculate the size of the context as the sum of the prompt plus a reserve for the response (2-4K tokens). TokensKeep = -1, so that the prompt is preserved when the context is shifted. In this case, the 50% shift works well. A lower value will result in more frequent shifts.
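For illustration, a minimal sketch of that sizing in LLamaSharp terms (a sketch only; `modelPath` and `promptTokenCount` are placeholder values, not part of this PR):

```csharp
using LLama.Common;

// Hypothetical inputs for illustration (not part of this PR):
var modelPath = "Models/model.gguf";
var promptTokenCount = 1200; // e.g., measured by tokenizing the prompt

// Context = prompt + a 2-4K reserve for the response, as described above.
var modelParams = new ModelParams(modelPath)
{
    ContextSize = (uint)(promptTokenCount + 4096),
};

var inferenceParams = new InferenceParams
{
    TokensKeep = -1, // keep the whole prompt when the context is shifted
};
```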
Glad that you like the approach! Concerning your question: yes, we can change that to a `Context.NativeHandle.MemoryCanShift` test instead of catching the exception. @martindevans, @aropb, I will wait until you have done a full review and test and given all of your remarks before adjusting the PR.
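A sketch of the guard described above (executor-internal pseudocode, not the PR's final shape):

```csharp
// Probe shift support up front instead of catching the native exception.
if (Context.NativeHandle.MemoryCanShift)
{
    // Native KV-cache shift path: discard the oldest tokens in place.
}
else
{
    // Fallback per the configured strategy: truncate + re-prefill, or throw.
}
```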
There are three main things I want to address with the idea of overflow strategies:
Agreed 👍
@martindevans, this is a fantastic idea Martin, and a textbook use case for the Strategy pattern. It cleanly solves the issue of overloading the inference parameters. The main piece of work here will be getting the internal "plumbing" right. P.S.: I have updated the code above!
Pull request overview
This PR introduces configurable handling for context-window overflow in LLamaSharp executors, aiming to prevent native crashes on models that cannot shift the KV cache (i.e., `MemoryCanShift == false`) by either fast-failing or truncating and re-prefilling.
Changes:
- Added `ContextOverflowStrategy` and `ContextOverflowException`, and extended `IInferenceParams`/`InferenceParams`/web `InferenceOptions` with overflow controls.
- Updated stateless and stateful executors to honor the overflow strategy, including a truncate+re-prefill fallback when native shifting isn't supported.
- Added unit tests to validate the default (throw) strategy behavior and retained a (skipped) truncation test (a rough sketch of this kind of test follows below).
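A rough sketch of what such a default-strategy test can look like (illustrative only: the model path, prompt, and exact test body are assumptions, not the PR's actual test code):

```csharp
using LLama;
using LLama.Common;
using LLama.Exceptions;
using Xunit;

public class StatelessExecutorOverflowTests
{
    [Fact]
    public async Task DefaultStrategy_Throws_WhenContextOverflows()
    {
        // Tiny context so the overflow is reached quickly (values illustrative).
        var parameters = new ModelParams("Models/model.gguf") { ContextSize = 128 };
        using var weights = await LLamaWeights.LoadFromFileAsync(parameters);
        var executor = new StatelessExecutor(weights, parameters);

        // The default strategy is ThrowException, so a long generation should
        // fail fast with ContextOverflowException instead of crashing natively.
        var inferenceParams = new InferenceParams { MaxTokens = 1024 };

        await Assert.ThrowsAsync<ContextOverflowException>(async () =>
        {
            await foreach (var _ in executor.InferAsync("Write a long story.", inferenceParams))
            {
            }
        });
    }
}
```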
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `LLama/LLamaStatelessExecutor.cs` | Tracks token history and implements overflow behavior (throw vs truncate+shift/re-prefill). |
| `LLama/LLamaInteractExecutor.cs` | Awaits the updated async overflow handler with inference params. |
| `LLama/LLamaInstructExecutor.cs` | Awaits the updated async overflow handler with inference params. |
| `LLama/LLamaExecutorBase.cs` | Converts the overflow handler to async and adds truncate+re-prefill logic for stateful executors. |
| `LLama/Exceptions/ContextOverflowException.cs` | Adds a dedicated exception type for context overflow. |
| `LLama/Common/InferenceParams.cs` | Adds `OverflowStrategy` and `ContextTruncationPercentage` configuration with defaults. |
| `LLama/Common/ContextOverflowStrategy.cs` | Adds the strategy enum (throw vs truncate+re-prefill). |
| `LLama/Abstractions/IInferenceParams.cs` | Extends the public inference-params contract with overflow controls. |
| `LLama.Web/Common/InferenceOptions.cs` | Exposes overflow controls in the web-layer implementation of `IInferenceParams`. |
| `LLama.Unittest/StatelessExecutorTest.cs` | Adds tests for default overflow behavior (throws) and configures the truncate test. |
I have corrected the bugs found by Copilot. Unfortunately, I had to disable the `MemoryCanShift = false` path in `HandleRunOutOfContext` in the `StatefulExecutorBase`, because, as Copilot pointed out, we need to track `_history_tokens`... This will be a future PR.

Explanation: I am holding off on implementing the manual history-tracking fix for `StatefulExecutorBase` in this PR. Rebuilding the KV cache manually for models that don't support native memory shifting is highly complex and risks silent state corruption, especially when dealing with session caching and multimodal inputs. Keeping the `ContextOverflowException` guard as-is forces the calling application to handle the overflow gracefully (e.g., by starting a new chat) rather than risking a desynced model state. We can revisit this if we see high user impact, but stability comes first.
Thanks for fixing that feedback, it all looks OK to me 👍 Could you create an issue for that TODO, linking this PR and explaining the details of the work required? I'll mark it as do-not-close.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
```csharp
// when run out of context
// based on this logic: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L497
if (n_past + tokens.Count >= Context.ContextSize)
{
    if (inferenceParams.OverflowStrategy == ContextOverflowStrategy.ThrowException)
    {
        throw new ContextOverflowException();
    }
```
Context overflow handling occurs after a token has already been sampled (and even yielded). When truncation/re-prefill happens, the sampled token was produced from logits for the pre-truncation KV state, so the emitted token may be inconsistent with the post-truncation context. To keep generation correct/deterministic, overflow handling should happen before sampling the next token (and after any re-prefill, re-sample from the new logits) rather than sampling/yielding first and truncating afterwards.
This is standard continuous batching behavior and closely mirrors how llama.cpp handles it. Moving the overflow check before sampling is unnecessary and would likely complicate the generation loop without providing any tangible benefit.
```diff
 _pastTokensCount -= n_discard;
-// stop saving session if we run out of context
-_pathSession = string.Empty;
```
HandleRunOutOfContext now shifts the KV cache but no longer disables session saving (previously _pathSession was cleared on overflow). Since _session_tokens is still appended to later, SaveSessionFile can produce a session token list that no longer matches the shifted KV cache. Either keep the old behavior (stop session saving after a shift) or update _session_tokens (and related counters) to reflect discarded tokens so saved sessions remain loadable/correct.
```csharp
// Keep session tracking aligned with the shifted KV cache so future session saves/reuse
// operate on the same logical token sequence as the current context.
var sessionDiscardStart = Math.Min(tokensToKeep, _session_tokens.Count);
var sessionDiscardEnd = Math.Min(tokensToKeep + n_discard, _session_tokens.Count);
var removedSessionTokens = sessionDiscardEnd - sessionDiscardStart;
if (removedSessionTokens > 0)
{
    _session_tokens.RemoveRange(sessionDiscardStart, removedSessionTokens);
    if (_n_session_consumed > sessionDiscardStart)
    {
        _n_session_consumed = _n_session_consumed >= sessionDiscardEnd
            ? _n_session_consumed - removedSessionTokens
            : sessionDiscardStart;
    }
}

if (_n_session_consumed > _session_tokens.Count)
{
    _n_session_consumed = _session_tokens.Count;
}
```
OK! We will stop saving the session if we run out of context.
```csharp
/// Defines the strategy the executor should use when the context window is full
/// and the model architecture (e.g., models with 2D RoPE embeddings) does not
/// support native memory shifting.
/// </summary>
ContextOverflowStrategy OverflowStrategy { get; set; }

/// <summary>
/// The percentage of past tokens to discard when <see cref="OverflowStrategy"/>
/// is set to <see cref="ContextOverflowStrategy.TruncateAndReprefill"/>.
/// For example, 0.1f represents dropping the oldest 10% of the conversational context.
```
The docs for OverflowStrategy describe it as applying only when the model “does not support native memory shifting”, but executors also use it to force a fast-fail even when MemoryCanShift is true. Consider rewording to “when the context window is full” and documenting that ThrowException disables any automatic shifting/truncation regardless of model support.
Suggested change:

```csharp
/// Defines the strategy the executor should use when the context window is full.
/// </summary>
/// <remarks>
/// This setting applies even for models that support native memory shifting.
/// Setting <see cref="ContextOverflowStrategy.ThrowException"/> disables automatic
/// shifting or truncation and causes the executor to fail immediately on overflow.
/// </remarks>
ContextOverflowStrategy OverflowStrategy { get; set; }

/// <summary>
/// The percentage of past tokens to discard when <see cref="OverflowStrategy"/>
/// is set to <see cref="ContextOverflowStrategy.TruncateAndReprefill"/> to recover
/// from a full context window. For example, 0.1f represents dropping the oldest
/// 10% of the conversational context.
```
Not doing this now.
```diff
 // number of tokens to keep when resetting context
 // Ported from https://github.com/ggerganov/llama.cpp/blob/60325fa56f61c228464c9f065db3aa6a61f2156e/examples/main/main.cpp#L334
-if (tokensKeep < 0 || tokensKeep > tokens.Count)
+if (tokensKeep < 0 || tokensKeep > all_tokens.Count)
 {
-    tokensKeep = tokens.Count;
+    tokensKeep = all_tokens.Count;
 }
```
tokensKeep is being clamped against all_tokens.Count, but all_tokens grows over generation. For values like TokensKeep = -1 (used elsewhere to mean “keep the prompt”), this will effectively become “keep the entire current history”, making n_left become 0 and truncation impossible. Consider capturing the initial prompt token count (before generation) and clamping tokensKeep against that constant prompt length instead of the evolving history list.
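A minimal sketch of that suggestion (executor-internal pseudocode; `promptTokenCount` is an illustrative name, not existing code):

```csharp
// Capture the prompt length once, before generation starts appending to all_tokens.
var promptTokenCount = all_tokens.Count;

// Later, inside the overflow handling, clamp against the fixed prompt length so
// TokensKeep = -1 still means "keep the prompt" rather than "keep everything".
if (tokensKeep < 0 || tokensKeep > promptTokenCount)
{
    tokensKeep = promptTokenCount;
}
```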
OK! I will correct this.
Copilot has raised a couple of issues, but I can't assess them since I don't fully understand the internals of the higher-level executors. Feel free to dismiss them if you think they're invalid.
I will look at them and apply the necessary changes if needed! This is a specialized model trained on similar code, so the comments may be relevant. OK! I've added the corrections.
Thanks for investigating those review items.
Hi @martindevans and @aropb,
Thank you for digging into this issue. The proposed solution (`MemoryClear` + re-prefilling) in #1336 correctly identifies how to prevent the engine from crashing on models that return `MemoryCanShift == false` (like Qwen 3.5 with 2D RoPE embeddings). However, after reviewing how the upstream `llama.cpp` ecosystem handles this, I'd like to propose an architectural modification in this PR.

The Core Issue
The truncation and shifting logic (like `--no-context-shift`) does not actually exist in the core `llama.h` C API; it is application-level logic built into `llama-cli` (`main.cpp`). Because LLamaSharp's `StatelessExecutor` and `InteractiveExecutor` act as our application-layer equivalents, implementing the fix here is structurally correct.

However, hardcoding `n_discard = n_left / 2;` forces an extremely opinionated behavior. Silently amputating 50% of the KV cache when it fills up introduces two major risks for developers: silent loss of conversational context, and an unexplained latency spike while `DecodeAsync` re-evaluates the remaining thousands of tokens, with no explicit warning as to why the delay happened.

The Proposed Compromise
Instead of making silent truncation the default behavior, we should implement a `ContextOverflowStrategy` on `IInferenceParams`, sketched below:

1. `ThrowException`. If the model's math blocks shifting, we fast-fail. This ensures enterprise developers know they hit a limit and must implement proper RAG/summarization. (This mimics the `llama-cli` `--no-context-shift` flag.)
2. `TruncateAndReprefill`. We run the PR's `MemoryClear` logic, but we allow the developer to configure the `ContextTruncationPercentage` (defaulting to 10% instead of 50% to prevent excessive context loss).
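To make the proposed surface concrete, here is a minimal sketch (the member names follow this description, but the merged code may differ in detail):

```csharp
namespace LLama.Common
{
    /// <summary>How an executor reacts when the context window is full.</summary>
    public enum ContextOverflowStrategy
    {
        /// <summary>Fast-fail with a ContextOverflowException; no automatic shifting or truncation.</summary>
        ThrowException,

        /// <summary>Drop the oldest ContextTruncationPercentage of past tokens, then re-prefill.</summary>
        TruncateAndReprefill,
    }
}
```

Opting in would then look like:

```csharp
var inferenceParams = new InferenceParams
{
    OverflowStrategy = ContextOverflowStrategy.TruncateAndReprefill,
    ContextTruncationPercentage = 0.1f, // drop the oldest 10% on overflow
};
```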
Explanation of Implementation Choices

- `TokensToKeep` indexing: In the original PR, `RemoveRange` used `tokensToKeep - 1`. This was structurally unsafe: if `tokensToKeep` was `0` (e.g., no system prompt), it would result in a `-1` index and instantly crash the application. The new implementation safely calculates the `startIndex`.
- `Math.Clamp` on the percentage: A user might accidentally pass `1.5f` (150%) or `0.0f` as the truncation percentage. The implementation guarantees the math will always drop some tokens but never all tokens, preventing infinite loops.
- `ContextOverflowException`: By subclassing a specific exception, developers can wrap their generation loops in a `try/catch (ContextOverflowException)` block to specifically catch and handle this event with custom semantic summarization (see the sketch after this list).
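For example, a sketch of the consuming pattern (assuming `executor`, `prompt`, and `inferenceParams` are already in scope; the recovery step is a placeholder):

```csharp
try
{
    await foreach (var token in executor.InferAsync(prompt, inferenceParams))
    {
        Console.Write(token);
    }
}
catch (ContextOverflowException)
{
    // The context window is full under ThrowException: e.g., summarize the
    // oldest turns (RAG/summarization) and retry with a shorter prompt.
}
```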
What's Included in this PR

- The `ContextOverflowStrategy` enum and `ContextOverflowException`.
- `IInferenceParams` and `InferenceParams` extended with the new configuration properties.
- `StatefulExecutorBase.cs` updated to respect the new strategy.
- `LLamaStatelessExecutor.cs` updated to try native memory shifting first, and to safely catch the `MemoryCanShift` exception to trigger the fallback logic.

Test Output
To ensure this is robust, I tested the exact context boundaries dynamically, based on what `llama.cpp` allocates at runtime. Both executors now successfully intercept the overflow and throw the correct exception before native crashes occur.

Looking forward to hearing your thoughts on this approach!