
Context Overflow Fix #1373

Merged: martindevans merged 5 commits into SciSharp:master from zsogitbe:ContextOverflowFix on Apr 26, 2026

Conversation

@zsogitbe
Contributor

Hi @martindevans and @aropb,

Thank you for digging into this issue. The proposed solution (MemoryClear + Re-prefilling) in #1336 correctly identifies how to prevent the engine from crashing on models that return MemoryCanShift == false (like Qwen 3.5 with 2D RoPE embeddings).

However, after reviewing how the upstream llama.cpp ecosystem handles this, I’d like to propose an architectural modification in this PR.

The Core Issue

The truncation and shifting logic (like --no-context-shift) does not actually exist in the core llama.h C API; it is application-level logic built into llama-cli (main.cpp). Because LLamaSharp's StatelessExecutor and InteractiveExecutor act as our application-layer equivalents, implementing the fix here is structurally correct.

However, hardcoding n_discard = n_left / 2; forces an extremely opinionated behavior. Silently amputating 50% of the KV cache when it fills up introduces two major risks for developers:

  1. Semantic Destruction: Blindly dropping exactly half of the tokens can cut code blocks, JSON payloads, or sentences in half, causing instant hallucinations.
  2. Silent Latency Spikes: The developer will experience a massive, multi-second TTFT (Time To First Token) delay while DecodeAsync re-evaluates the remaining thousands of tokens, with no explicit warning as to why the delay happened.

The Proposed Compromise

Instead of making silent truncation the default behavior, we should implement a ContextOverflowStrategy on IInferenceParams.

  • Default: ThrowException. If the model's math blocks shifting, we fast-fail. This ensures enterprise developers know they hit a limit and must implement proper RAG/Summarization. (This mimics the llama-cli --no-context-shift flag).
  • Opt-In: TruncateAndReprefill. We run the PR's MemoryClear logic, but we allow the developer to configure the ContextTruncationPercentage (defaulting to 10% instead of 50% to prevent excessive context loss).
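
To make the proposed surface concrete, here is a minimal sketch of the intended shape (the placeholder class `OverflowSettingsSketch` is for illustration only; in the PR these properties live on IInferenceParams/InferenceParams):

```csharp
/// <summary>How an executor reacts when the context window is full.</summary>
public enum ContextOverflowStrategy
{
    /// <summary>Fail fast with a ContextOverflowException (default).</summary>
    ThrowException,

    /// <summary>Drop the oldest tokens and re-prefill the remaining history.</summary>
    TruncateAndReprefill
}

/// <summary>Illustrative holder for the two overflow settings proposed above.</summary>
public class OverflowSettingsSketch
{
    /// <summary>Strategy applied when the context window overflows.</summary>
    public ContextOverflowStrategy OverflowStrategy { get; set; } = ContextOverflowStrategy.ThrowException;

    /// <summary>Fraction of past tokens to discard on truncation; 0.1f drops the oldest 10%.</summary>
    public float ContextTruncationPercentage { get; set; } = 0.1f;
}
```

Opting in would then be a one-liner on the inference parameters, e.g. `params.OverflowStrategy = ContextOverflowStrategy.TruncateAndReprefill;`.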

Explanation of Implementation Choices

  1. TokensToKeep Indexing: In the original PR, RemoveRange used tokensToKeep - 1. This was structurally unsafe. If tokensToKeep was 0 (e.g., no system prompt), it would result in a -1 index and instantly crash the application. The new implementation safely calculates the startIndex.
  2. Math.Clamp on Percentage: A user might accidentally pass 1.5f (150%) or 0.0f to the truncation percentage. The implementation guarantees the math will always drop some tokens but never all tokens, preventing infinite loops.
  3. ContextOverflowException: By subclassing a specific exception, developers can wrap their generation loops in a try/catch (ContextOverflowException) block to specifically catch and handle this event with custom semantic summarization.
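
As a rough sketch of the discard math described in points 1 and 2 (illustrative names and bounds; the executor code in the PR may differ slightly):

```csharp
using System;

static class TruncationMathSketch
{
    /// <summary>Illustrative only: how many past tokens to drop on overflow.</summary>
    public static int ComputeDiscardCount(int totalTokens, int tokensToKeep, float truncationPercentage)
    {
        // Guard the percentage so 0.0f or 1.5f can never cause an infinite loop or wipe the
        // whole history: always drop at least one token, never drop all of them.
        var clamped = Math.Clamp(truncationPercentage, 0.01f, 0.9f);

        // Only tokens after the protected prefix (e.g. the system prompt) are eligible.
        var droppable = Math.Max(1, totalTokens - tokensToKeep);

        return Math.Clamp((int)(droppable * clamped), 1, droppable);
    }
}

// Removal then starts at index tokensToKeep (never tokensToKeep - 1), so a history with no
// system prompt (tokensToKeep == 0) cannot produce a negative start index:
//     history.RemoveRange(tokensToKeep, nDiscard);
```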

What's Included in this PR

  • Added ContextOverflowStrategy enum and ContextOverflowException.
  • Updated IInferenceParams and InferenceParams with the new configuration properties.
  • Updated StatefulExecutorBase.cs to respect the new strategy.
  • Updated LLamaStatelessExecutor.cs to try native memory shifting first, and safely catch the exception raised when the model cannot shift memory (MemoryCanShift == false) to trigger the fallback logic.
  • Added/Updated Unit Tests to verify the exact overflow boundaries for both Stateless and Stateful executors.

Test Output

To ensure this is robust, I tested the exact context boundaries dynamically based on what llama.cpp allocates at runtime. Both executors now successfully intercept the overflow and throw the correct exception before native crashes occur:

[DEBUG] Actual ContextSize: 256, Prompt length: 254
Successfully caught expected exception: The context window is full and the current strategy 
is set to ThrowException. To automatically truncate and manage context, set 
InferenceParams.OverflowStrategy to ContextOverflowStrategy.TruncateAndReprefill.

Successfully caught expected exception in InstructExecutor: The context window is full and 
the current model architecture does not support native memory shifting. To automatically 
truncate and re-prefill the context, set InferenceParams.OverflowStrategy to 
ContextOverflowStrategy.TruncateAndReprefill.

Looking forward to hearing your thoughts on this approach!

@martindevans
Member

martindevans commented Apr 19, 2026

I really like this idea, it's much better than hardcoding a specific behaviour.

I haven't had a chance to review this in depth yet - but my first thought on reading is: do you think it might be feasible to factor the code handling the overflow out to a separate class? Something like an IOverflowHandler interface, with OverflowThrows and OverflowShifts implementations built into LLamaSharp. That way end users can add completely custom context overflow handling.

(Note: If you don't think that's feasible or you simply don't want to work on it that's fine! I'm happy to go ahead with reviewing this as-is if you prefer).

Comment thread LLama/LLamaStatelessExecutor.cs Outdated
@aropb

aropb commented Apr 20, 2026

I haven't had a chance to review this in depth yet - but my first thought on reading is: do you think it might be feasible to factor the code handling the overflow out to a separate class? Something like an IOverflowHandler interface, with OverflowThrows and OverflowShifts implementations built into LLamaSharp. That way end users can add completely custom context overflow handling.

It seems redundant to me. Alternatively, you can make the method virtual.

@aropb

aropb commented Apr 20, 2026

I usually calculate the size of the context as the sum of the prompt plus a reserve for the response (2-4K tokens). TokensKeep = -1, so that the prompt is preserved when the context is shifted. In this case, the 50% shift works well; a lower value will result in more frequent shifts.
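
For illustration, that sizing works out to something like the following (the numbers and parameter values here are assumptions, not a recommendation from this PR):

```csharp
// Illustrative sizing only: context = prompt token count + a fixed reserve for the response.
const int promptTokenCount = 3000;  // however many tokens the prompt actually uses
const int responseReserve  = 4096;  // the 2-4K reserve mentioned above

var modelParams = new ModelParams("model.gguf")
{
    ContextSize = (uint)(promptTokenCount + responseReserve)
};

var inferenceParams = new InferenceParams
{
    TokensKeep = -1  // preserve the full prompt whenever the context is shifted
};
```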

@zsogitbe
Contributor Author

I really like this idea, it's much better than hardcoding a specific behaviour.

I haven't had a chance to review this in depth yet - but my first thought on reading is: do you think it might be feasible to factor the code handling the overflow out to a separate class? Something like an IOverflowHandler interface, with OverflowThrows and OverflowShifts implementations built into LLamaSharp. That way end users can add completely custom context overflow handling.

(Note: If you don't think that's feasible or you simply don't want to work on it that's fine! I'm happy to go ahead with reviewing this as-is if you prefer).

Glad that you like the approach! Concerning your IOverflowHandler idea, Martin, I tend to agree with aropb: the reason for handling the overflow would be, for example, to create a summary and then rerun the prompt, and this handler would not be the right place for that logic, I think, but correct me if I am wrong. In any case, even if it is decided to add the IOverflowHandler interface, I would suggest doing it in a follow-up PR. Let us focus now on making this logic perfect first.

I think Context.NativeHandle.MemoryCanShift should be a better way to detect this case.

Yes, we can change that to a Context.NativeHandle.MemoryCanShift test instead of catching the exception.

@martindevans, @aropb, I will wait until you have done a full review and testing and given all of your remarks before adjusting the PR.

Comment thread LLama/Exceptions/ContextOverflowException.cs Outdated
Comment thread LLama/Exceptions/ContextOverflowException.cs Outdated
Comment thread LLama/LLamaExecutorBase.cs Outdated
Comment thread LLama/LLamaExecutorBase.cs
Comment thread LLama/LLamaExecutorBase.cs Outdated
@martindevans
Member

Concerning your IOverflowHandler idea

There are three main things I want to address with the idea of overflow strategies:

  1. User extensibility - New overflow handling strategies can be implemented without extending the library or creating a new executor type.
  2. Configuration - At the moment there's no place to store per-strategy configuration, so we end up storing it on the InferenceParams and simply documenting that some properties don't do anything except in certain cases.
  3. Code duplication - At the moment the complex shifting behaviour is implemented twice (BaseExecutor and StatelessExecutor) and is not compatible at all with the BatchedExecutor. If it were moved out to overflow strategy classes that code wouldn't be duplicated. There's an added bonus here that the executor classes have become a complex mess - factoring any code out of them is a win!

In any case, even if it is decided to do this IOverflowHandler interface, I would like to suggest to do this in a follow-up PR. let us focus now on making this logic perfect first.

Agreed 👍

@zsogitbe
Contributor Author

Concerning your IOverflowHandler idea

There are three main things I want to address with the idea of overflow strategies:

  1. User extensibility - New overflow handling strategies can be implemented without extending the library or creating a new executor type.
  2. Configuration - At the moment there's no place to store per-strategy configuration, so we end up storing it on the InferenceParams and simply documenting that some properties don't do anything except in certain cases.
  3. Code duplication - At the moment the complex shifting behaviour is implemented twice (BaseExecutor and StatelessExecutor) and is not compatible at all with the BatchedExecutor. If it were moved out to overflow strategy classes that code wouldn't be duplicated. There's an added bonus here that the executor classes have become a complex mess - factoring any code out of them is a win!

In any case, even if it is decided to do this IOverflowHandler interface, I would like to suggest to do this in a follow-up PR. let us focus now on making this logic perfect first.

Agreed 👍

@martindevans, this is a fantastic idea Martin, and a textbook use case for the Strategy pattern. It cleanly solves the issue of overloading InferenceParams, removes the duplicated fallback logic between StatelessExecutor and StatefulExecutorBase, and makes the API much more extensible for users who want to build custom overflow handlers.

The main piece of work here will be getting the internal "plumbing" right. Because StatelessExecutor (which uses List<LLamaToken> all_tokens) and StatefulExecutorBase (which tracks _session_tokens and _embed_inps) manage their token histories differently, the IOverflowHandler interface will need to be designed carefully. It has to safely mutate those different history states and update the n_past count without breaking the internal tracking of either executor.
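
For discussion, the interface might end up with a shape along these lines (purely hypothetical; none of these names or signatures are part of this PR):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using LLama;
using LLama.Native;

// Hypothetical sketch of the follow-up idea, not an implementation.
public interface IOverflowHandler
{
    /// <summary>
    /// Called when the context window is full. Implementations may mutate the token history
    /// (all_tokens / _session_tokens, depending on the executor) and must return the new
    /// n_past so the executor can keep its internal tracking in sync.
    /// </summary>
    Task<int> HandleOverflowAsync(
        LLamaContext context,
        List<LLamaToken> history,
        int nPast,
        int tokensToKeep,
        CancellationToken cancellationToken = default);
}

// Built-in implementations could then be OverflowThrows (throws ContextOverflowException)
// and OverflowShifts (native shift when MemoryCanShift, truncate + re-prefill otherwise).
```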

p.s.: I have updated the code above!

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces configurable handling for context-window overflow in LLamaSharp executors, aiming to prevent native crashes on models that cannot shift KV cache (e.g., MemoryCanShift == false) by either fast-failing or truncating and re-prefilling.

Changes:

  • Added ContextOverflowStrategy and ContextOverflowException, and extended IInferenceParams, InferenceParams, and the web-layer InferenceOptions with overflow controls.
  • Updated stateless and stateful executors to honor the overflow strategy, including a truncate+re-prefill fallback when native shifting isn’t supported.
  • Added unit tests to validate the default (throw) strategy behavior and retained a (skipped) truncation test.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Summary per file:

  • LLama/LLamaStatelessExecutor.cs: Tracks token history and implements overflow behavior (throw vs truncate+shift/reprefill).
  • LLama/LLamaInteractExecutor.cs: Awaits updated async overflow handler with inference params.
  • LLama/LLamaInstructExecutor.cs: Awaits updated async overflow handler with inference params.
  • LLama/LLamaExecutorBase.cs: Converts overflow handler to async and adds truncate+reprefill logic for stateful executors.
  • LLama/Exceptions/ContextOverflowException.cs: Adds a dedicated exception type for context overflow.
  • LLama/Common/InferenceParams.cs: Adds OverflowStrategy and ContextTruncationPercentage configuration with defaults.
  • LLama/Common/ContextOverflowStrategy.cs: Adds strategy enum (throw vs truncate+reprefill).
  • LLama/Abstractions/IInferenceParams.cs: Extends the public inference-params contract with overflow controls.
  • LLama.Web/Common/InferenceOptions.cs: Exposes overflow controls for the web-layer implementation of IInferenceParams.
  • LLama.Unittest/StatelessExecutorTest.cs: Adds tests for default overflow behavior (throws) and configures the truncate test.


Comment thread LLama/LLamaStatelessExecutor.cs
Comment thread LLama/LLamaExecutorBase.cs Outdated
Comment thread LLama.Unittest/StatelessExecutorTest.cs
Comment thread LLama.Unittest/StatelessExecutorTest.cs
@zsogitbe
Contributor Author

zsogitbe commented Apr 23, 2026

I have corrected the bugs found by Copilot. Unfortunately, I had to disable the MemoryCanShift = false path in HandleRunOutOfContext in the StatefulExecutorBase because, as Copilot pointed out, we need to track _history_tokens... This will be a future PR.

Explanation: I am holding off on implementing the manual history tracking fix in this PR for StatefulExecutorBase. Rebuilding the KV cache manually for models that don't support native memory shifting is highly complex and risks silent state corruption, especially when dealing with session caching and multimodal inputs. Keeping the ContextOverflowException guard as-is forces the calling application to handle the overflow gracefully (e.g., starting a new chat) rather than risking a desynced model state. We can revisit this if we see high user impact, but stability comes first.

@martindevans
Member

Thanks for fixing that feedback, it all looks ok to me 👍

Could you create an issue for that todo linking this PR and explaining the details of the work required? I'll mark it as do-not-close.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.



Comment on lines 143 to +149
// when run out of context
// based on this logic: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L497
if (n_past + tokens.Count >= Context.ContextSize)
{
    if (inferenceParams.OverflowStrategy == ContextOverflowStrategy.ThrowException)
    {
        throw new ContextOverflowException();
    }

Copilot AI Apr 25, 2026


Context overflow handling occurs after a token has already been sampled (and even yielded). When truncation/re-prefill happens, the sampled token was produced from logits for the pre-truncation KV state, so the emitted token may be inconsistent with the post-truncation context. To keep generation correct/deterministic, overflow handling should happen before sampling the next token (and after any re-prefill, re-sample from the new logits) rather than sampling/yielding first and truncating afterwards.

Contributor Author


This is standard continuous batching behavior and closely mirrors how llama.cpp handles it. Moving the overflow check before sampling is unnecessary and would likely complicate your generation loop without providing any tangible benefit.

_pastTokensCount -= n_discard;
// stop saving session if we run out of context
_pathSession = string.Empty;


Copilot AI Apr 25, 2026


HandleRunOutOfContext now shifts the KV cache but no longer disables session saving (previously _pathSession was cleared on overflow). Since _session_tokens is still appended to later, SaveSessionFile can produce a session token list that no longer matches the shifted KV cache. Either keep the old behavior (stop session saving after a shift) or update _session_tokens (and related counters) to reflect discarded tokens so saved sessions remain loadable/correct.

Suggested change
// Keep session tracking aligned with the shifted KV cache so future session saves/reuse
// operate on the same logical token sequence as the current context.
var sessionDiscardStart = Math.Min(tokensToKeep, _session_tokens.Count);
var sessionDiscardEnd = Math.Min(tokensToKeep + n_discard, _session_tokens.Count);
var removedSessionTokens = sessionDiscardEnd - sessionDiscardStart;
if (removedSessionTokens > 0)
{
    _session_tokens.RemoveRange(sessionDiscardStart, removedSessionTokens);
    if (_n_session_consumed > sessionDiscardStart)
    {
        _n_session_consumed = _n_session_consumed >= sessionDiscardEnd
            ? _n_session_consumed - removedSessionTokens
            : sessionDiscardStart;
    }
}
if (_n_session_consumed > _session_tokens.Count)
{
    _n_session_consumed = _session_tokens.Count;
}

Contributor Author


OK! We will stop saving the session if we run out of context.

Comment thread LLama/LLamaExecutorBase.cs
Comment on lines +42 to +51
/// Defines the strategy the executor should use when the context window is full
/// and the model architecture (e.g., models with 2D RoPE embeddings) does not
/// support native memory shifting.
/// </summary>
ContextOverflowStrategy OverflowStrategy { get; set; }

/// <summary>
/// The percentage of past tokens to discard when <see cref="OverflowStrategy"/>
/// is set to <see cref="ContextOverflowStrategy.TruncateAndReprefill"/>.
/// For example, 0.1f represents dropping the oldest 10% of the conversational context.

Copilot AI Apr 25, 2026


The docs for OverflowStrategy describe it as applying only when the model “does not support native memory shifting”, but executors also use it to force a fast-fail even when MemoryCanShift is true. Consider rewording to “when the context window is full” and documenting that ThrowException disables any automatic shifting/truncation regardless of model support.

Suggested change
-/// Defines the strategy the executor should use when the context window is full
-/// and the model architecture (e.g., models with 2D RoPE embeddings) does not
-/// support native memory shifting.
-/// </summary>
-ContextOverflowStrategy OverflowStrategy { get; set; }
-/// <summary>
-/// The percentage of past tokens to discard when <see cref="OverflowStrategy"/>
-/// is set to <see cref="ContextOverflowStrategy.TruncateAndReprefill"/>.
-/// For example, 0.1f represents dropping the oldest 10% of the conversational context.
+/// Defines the strategy the executor should use when the context window is full.
+/// </summary>
+/// <remarks>
+/// This setting applies even for models that support native memory shifting.
+/// Setting <see cref="ContextOverflowStrategy.ThrowException"/> disables automatic
+/// shifting or truncation and causes the executor to fail immediately on overflow.
+/// </remarks>
+ContextOverflowStrategy OverflowStrategy { get; set; }
+/// <summary>
+/// The percentage of past tokens to discard when <see cref="OverflowStrategy"/>
+/// is set to <see cref="ContextOverflowStrategy.TruncateAndReprefill"/> to recover
+/// from a full context window. For example, 0.1f represents dropping the oldest
+/// 10% of the conversational context.

Contributor Author

@zsogitbe Apr 26, 2026


Not doing this now.

Comment thread LLama.Unittest/StatelessExecutorTest.cs
Comment thread LLama.Unittest/StatelessExecutorTest.cs
Comment on lines 154 to 158
 // number of tokens to keep when resetting context
 // Ported from https://github.com/ggerganov/llama.cpp/blob/60325fa56f61c228464c9f065db3aa6a61f2156e/examples/main/main.cpp#L334
-if (tokensKeep < 0 || tokensKeep > tokens.Count)
+if (tokensKeep < 0 || tokensKeep > all_tokens.Count)
 {
-    tokensKeep = tokens.Count;
+    tokensKeep = all_tokens.Count;
 }

Copilot AI Apr 25, 2026


tokensKeep is being clamped against all_tokens.Count, but all_tokens grows over generation. For values like TokensKeep = -1 (used elsewhere to mean “keep the prompt”), this will effectively become “keep the entire current history”, making n_left become 0 and truncation impossible. Consider capturing the initial prompt token count (before generation) and clamping tokensKeep against that constant prompt length instead of the evolving history list.

Contributor Author


OK! I will correct this.
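
Something along these lines should do it (a sketch of the intended fix; the final code may differ):

```csharp
// Capture the prompt length once, before generation starts, so TokensKeep is clamped
// against the original prompt rather than the ever-growing all_tokens history.
var promptTokenCount = all_tokens.Count;

// ... later, inside the overflow handling ...
if (tokensKeep < 0 || tokensKeep > promptTokenCount)
{
    tokensKeep = promptTokenCount;
}
```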

@martindevans
Member

copilot has raised a couple of issues, but I can't assess them since I don't fully understand the internals of the higher level executors. Feel free to dismiss them if you think they're invalid.

@zsogitbe
Contributor Author

zsogitbe commented Apr 26, 2026

copilot has raised a couple of issues, but I can't assess them since I don't fully understand the internals of the higher level executors. Feel free to dismiss them if you think they're invalid.

I will look at them and apply the necessary changes if needed! This is a specialized model trained on similar code, so the comments may be relevant.

OK! Added the corrections.

@martindevans
Member

Thanks for investigating those review items.

@martindevans merged commit 190344f into SciSharp:master on Apr 26, 2026
8 checks passed