Context Overflow Fix #1373
Conversation
update to LLama.Web.Common.InferenceOptions
I really like this idea, it's much better than hardcoding a specific behaviour. I haven't had a chance to review this in depth yet - but my first thought on reading is: do you think it might be feasible to factor the code handling the overflow out to a separate class? Something like a dedicated overflow-strategy class? (Note: If you don't think that's feasible or you simply don't want to work on it that's fine! I'm happy to go ahead with reviewing this as-is if you prefer.)
It seems redundant to me. Alternatively, you can make the method virtual.
I usually calculate the size of the context as the sum of the prompt plus a reserve for the response (2-4K tokens). TokensKeep = -1, so that the prompt is preserved when the context is shifted. In this case, the 50% shift works well. A lower value will result in more frequent shifts.
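For illustration, a minimal sketch of that sizing in LLamaSharp terms (a sketch only; `modelPath` and `promptTokenCount` are placeholder values, not part of this PR):

```csharp
using LLama.Common;

// Hypothetical inputs for illustration (not part of this PR):
var modelPath = "Models/model.gguf";
var promptTokenCount = 1200; // e.g., measured by tokenizing the prompt

// Context = prompt + a 2-4K reserve for the response, as described above.
var modelParams = new ModelParams(modelPath)
{
    ContextSize = (uint)(promptTokenCount + 4096),
};

var inferenceParams = new InferenceParams
{
    TokensKeep = -1, // keep the whole prompt when the context is shifted
};
```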
Glad that you like the approach! Concerning your question: yes, we can change that to a `Context.NativeHandle.MemoryCanShift` test instead of catching the exception. @martindevans, @aropb, I will wait until you have done a full review and test and given all of your remarks before adjusting the PR.
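A sketch of the guard described above (executor-internal pseudocode, not the PR's final shape):

```csharp
// Probe shift support up front instead of catching the native exception.
if (Context.NativeHandle.MemoryCanShift)
{
    // Native KV-cache shift path: discard the oldest tokens in place.
}
else
{
    // Fallback per the configured strategy: truncate + re-prefill, or throw.
}
```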
There are three main things I want to address with the idea of overflow strategies:
Agreed 👍
@martindevans, this is a fantastic idea Martin, and a textbook use case for the Strategy pattern. It cleanly solves the issue of overloading the inference parameters. The main piece of work here will be getting the internal "plumbing" right. P.S.: I have updated the code above!
Pull request overview
This PR introduces configurable handling for context-window overflow in LLamaSharp executors, aiming to prevent native crashes on models that cannot shift the KV cache (i.e., `MemoryCanShift == false`) by either fast-failing or truncating and re-prefilling.
Changes:
- Added `ContextOverflowStrategy` and `ContextOverflowException`, and extended `IInferenceParams`/`InferenceParams`/web `InferenceOptions` with overflow controls.
- Updated stateless and stateful executors to honor the overflow strategy, including a truncate+re-prefill fallback when native shifting isn't supported.
- Added unit tests to validate the default (throw) strategy behavior and retained a (skipped) truncation test (a rough sketch of this kind of test follows below).
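A rough sketch of what such a default-strategy test can look like (illustrative only: the model path, prompt, and exact test body are assumptions, not the PR's actual test code):

```csharp
using LLama;
using LLama.Common;
using LLama.Exceptions;
using Xunit;

public class StatelessExecutorOverflowTests
{
    [Fact]
    public async Task DefaultStrategy_Throws_WhenContextOverflows()
    {
        // Tiny context so the overflow is reached quickly (values illustrative).
        var parameters = new ModelParams("Models/model.gguf") { ContextSize = 128 };
        using var weights = await LLamaWeights.LoadFromFileAsync(parameters);
        var executor = new StatelessExecutor(weights, parameters);

        // The default strategy is ThrowException, so a long generation should
        // fail fast with ContextOverflowException instead of crashing natively.
        var inferenceParams = new InferenceParams { MaxTokens = 1024 };

        await Assert.ThrowsAsync<ContextOverflowException>(async () =>
        {
            await foreach (var _ in executor.InferAsync("Write a long story.", inferenceParams))
            {
            }
        });
    }
}
```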
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `LLama/LLamaStatelessExecutor.cs` | Tracks token history and implements overflow behavior (throw vs truncate+shift/re-prefill). |
| `LLama/LLamaInteractExecutor.cs` | Awaits the updated async overflow handler with inference params. |
| `LLama/LLamaInstructExecutor.cs` | Awaits the updated async overflow handler with inference params. |
| `LLama/LLamaExecutorBase.cs` | Converts the overflow handler to async and adds truncate+re-prefill logic for stateful executors. |
| `LLama/Exceptions/ContextOverflowException.cs` | Adds a dedicated exception type for context overflow. |
| `LLama/Common/InferenceParams.cs` | Adds `OverflowStrategy` and `ContextTruncationPercentage` configuration with defaults. |
| `LLama/Common/ContextOverflowStrategy.cs` | Adds the strategy enum (throw vs truncate+re-prefill). |
| `LLama/Abstractions/IInferenceParams.cs` | Extends the public inference-params contract with overflow controls. |
| `LLama.Web/Common/InferenceOptions.cs` | Exposes overflow controls in the web-layer implementation of `IInferenceParams`. |
| `LLama.Unittest/StatelessExecutorTest.cs` | Adds tests for default overflow behavior (throws) and configures the truncate test. |
I have corrected the bugs found by Copilot. Unfortunately, I had to disable the `MemoryCanShift = false` path in `HandleRunOutOfContext` in the `StatefulExecutorBase`, because, as Copilot pointed out, we need to track `_history_tokens`... This will be a future PR.

Explanation: I am holding off on implementing the manual history-tracking fix for `StatefulExecutorBase` in this PR. Rebuilding the KV cache manually for models that don't support native memory shifting is highly complex and risks silent state corruption, especially when dealing with session caching and multimodal inputs. Keeping the `ContextOverflowException` guard as-is forces the calling application to handle the overflow gracefully (e.g., by starting a new chat) rather than risking a desynced model state. We can revisit this if we see high user impact, but stability comes first.
Thanks for fixing that feedback, it all looks OK to me 👍 Could you create an issue for that TODO, linking this PR and explaining the details of the work required? I'll mark it as do-not-close.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
```csharp
// when run out of context
// based on this logic: https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L497
if (n_past + tokens.Count >= Context.ContextSize)
{
    if (inferenceParams.OverflowStrategy == ContextOverflowStrategy.ThrowException)
    {
        throw new ContextOverflowException();
    }
```
Context overflow handling occurs after a token has already been sampled (and even yielded). When truncation/re-prefill happens, the sampled token was produced from logits for the pre-truncation KV state, so the emitted token may be inconsistent with the post-truncation context. To keep generation correct/deterministic, overflow handling should happen before sampling the next token (and after any re-prefill, re-sample from the new logits) rather than sampling/yielding first and truncating afterwards.
This is standard continuous batching behavior and closely mirrors how llama.cpp handles it. Moving the overflow check before sampling is unnecessary and would likely complicate the generation loop without providing any tangible benefit.
```diff
 _pastTokensCount -= n_discard;
-// stop saving session if we run out of context
-_pathSession = string.Empty;
```
HandleRunOutOfContext now shifts the KV cache but no longer disables session saving (previously _pathSession was cleared on overflow). Since _session_tokens is still appended to later, SaveSessionFile can produce a session token list that no longer matches the shifted KV cache. Either keep the old behavior (stop session saving after a shift) or update _session_tokens (and related counters) to reflect discarded tokens so saved sessions remain loadable/correct.
```csharp
// Keep session tracking aligned with the shifted KV cache so future session saves/reuse
// operate on the same logical token sequence as the current context.
var sessionDiscardStart = Math.Min(tokensToKeep, _session_tokens.Count);
var sessionDiscardEnd = Math.Min(tokensToKeep + n_discard, _session_tokens.Count);
var removedSessionTokens = sessionDiscardEnd - sessionDiscardStart;
if (removedSessionTokens > 0)
{
    _session_tokens.RemoveRange(sessionDiscardStart, removedSessionTokens);
    if (_n_session_consumed > sessionDiscardStart)
    {
        _n_session_consumed = _n_session_consumed >= sessionDiscardEnd
            ? _n_session_consumed - removedSessionTokens
            : sessionDiscardStart;
    }
}

if (_n_session_consumed > _session_tokens.Count)
{
    _n_session_consumed = _session_tokens.Count;
}
```
OK! We will stop saving the session if we run out of context.
```csharp
/// Defines the strategy the executor should use when the context window is full
/// and the model architecture (e.g., models with 2D RoPE embeddings) does not
/// support native memory shifting.
/// </summary>
ContextOverflowStrategy OverflowStrategy { get; set; }

/// <summary>
/// The percentage of past tokens to discard when <see cref="OverflowStrategy"/>
/// is set to <see cref="ContextOverflowStrategy.TruncateAndReprefill"/>.
/// For example, 0.1f represents dropping the oldest 10% of the conversational context.
```
The docs for OverflowStrategy describe it as applying only when the model “does not support native memory shifting”, but executors also use it to force a fast-fail even when MemoryCanShift is true. Consider rewording to “when the context window is full” and documenting that ThrowException disables any automatic shifting/truncation regardless of model support.
Suggested change:

```csharp
/// Defines the strategy the executor should use when the context window is full.
/// </summary>
/// <remarks>
/// This setting applies even for models that support native memory shifting.
/// Setting <see cref="ContextOverflowStrategy.ThrowException"/> disables automatic
/// shifting or truncation and causes the executor to fail immediately on overflow.
/// </remarks>
ContextOverflowStrategy OverflowStrategy { get; set; }

/// <summary>
/// The percentage of past tokens to discard when <see cref="OverflowStrategy"/>
/// is set to <see cref="ContextOverflowStrategy.TruncateAndReprefill"/> to recover
/// from a full context window. For example, 0.1f represents dropping the oldest
/// 10% of the conversational context.
```
Not doing this now.
```diff
 // number of tokens to keep when resetting context
 // Ported from https://github.com/ggerganov/llama.cpp/blob/60325fa56f61c228464c9f065db3aa6a61f2156e/examples/main/main.cpp#L334
-if (tokensKeep < 0 || tokensKeep > tokens.Count)
+if (tokensKeep < 0 || tokensKeep > all_tokens.Count)
 {
-    tokensKeep = tokens.Count;
+    tokensKeep = all_tokens.Count;
 }
```
tokensKeep is being clamped against all_tokens.Count, but all_tokens grows over generation. For values like TokensKeep = -1 (used elsewhere to mean “keep the prompt”), this will effectively become “keep the entire current history”, making n_left become 0 and truncation impossible. Consider capturing the initial prompt token count (before generation) and clamping tokensKeep against that constant prompt length instead of the evolving history list.
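A minimal sketch of that suggestion (executor-internal pseudocode; `promptTokenCount` is an illustrative name, not existing code):

```csharp
// Capture the prompt length once, before generation starts appending to all_tokens.
var promptTokenCount = all_tokens.Count;

// Later, inside the overflow handling, clamp against the fixed prompt length so
// TokensKeep = -1 still means "keep the prompt" rather than "keep everything".
if (tokensKeep < 0 || tokensKeep > promptTokenCount)
{
    tokensKeep = promptTokenCount;
}
```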
OK! I will correct this.
Copilot has raised a couple of issues, but I can't assess them since I don't fully understand the internals of the higher-level executors. Feel free to dismiss them if you think they're invalid.
I will look at them and apply the necessary changes if needed! This is a specialized model trained on similar code, so the comments may be relevant. OK! I've added the corrections.
Thanks for investigating those review items.
Hi @martindevans and @aropb,
Thank you for digging into this issue. The proposed solution (`MemoryClear` + re-prefilling) in #1336 correctly identifies how to prevent the engine from crashing on models that return `MemoryCanShift == false` (like Qwen 3.5 with 2D RoPE embeddings). However, after reviewing how the upstream `llama.cpp` ecosystem handles this, I'd like to propose an architectural modification in this PR.

The Core Issue
The truncation and shifting logic (like `--no-context-shift`) does not actually exist in the core `llama.h` C API; it is application-level logic built into `llama-cli` (`main.cpp`). Because LLamaSharp's `StatelessExecutor` and `InteractiveExecutor` act as our application-layer equivalents, implementing the fix here is structurally correct.

However, hardcoding `n_discard = n_left / 2;` forces an extremely opinionated behavior. Silently amputating 50% of the KV cache when it fills up introduces two major risks for developers: silent loss of conversational context, and an unexplained latency spike while `DecodeAsync` re-evaluates the remaining thousands of tokens, with no explicit warning as to why the delay happened.

The Proposed Compromise
Instead of making silent truncation the default behavior, we should implement a `ContextOverflowStrategy` on `IInferenceParams`, sketched below:

1. `ThrowException`. If the model's math blocks shifting, we fast-fail. This ensures enterprise developers know they hit a limit and must implement proper RAG/summarization. (This mimics the `llama-cli` `--no-context-shift` flag.)
2. `TruncateAndReprefill`. We run the PR's `MemoryClear` logic, but we allow the developer to configure the `ContextTruncationPercentage` (defaulting to 10% instead of 50% to prevent excessive context loss).
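To make the proposed surface concrete, here is a minimal sketch (the member names follow this description, but the merged code may differ in detail):

```csharp
namespace LLama.Common
{
    /// <summary>How an executor reacts when the context window is full.</summary>
    public enum ContextOverflowStrategy
    {
        /// <summary>Fast-fail with a ContextOverflowException; no automatic shifting or truncation.</summary>
        ThrowException,

        /// <summary>Drop the oldest ContextTruncationPercentage of past tokens, then re-prefill.</summary>
        TruncateAndReprefill,
    }
}
```

Opting in would then look like:

```csharp
var inferenceParams = new InferenceParams
{
    OverflowStrategy = ContextOverflowStrategy.TruncateAndReprefill,
    ContextTruncationPercentage = 0.1f, // drop the oldest 10% on overflow
};
```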
Explanation of Implementation Choices

- `TokensToKeep` indexing: In the original PR, `RemoveRange` used `tokensToKeep - 1`. This was structurally unsafe: if `tokensToKeep` was `0` (e.g., no system prompt), it would result in a `-1` index and instantly crash the application. The new implementation safely calculates the `startIndex`.
- `Math.Clamp` on the percentage: A user might accidentally pass `1.5f` (150%) or `0.0f` as the truncation percentage. The implementation guarantees the math will always drop some tokens but never all tokens, preventing infinite loops.
- `ContextOverflowException`: By subclassing a specific exception, developers can wrap their generation loops in a `try/catch (ContextOverflowException)` block to specifically catch and handle this event with custom semantic summarization (see the sketch after this list).
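For example, a sketch of the consuming pattern (assuming `executor`, `prompt`, and `inferenceParams` are already in scope; the recovery step is a placeholder):

```csharp
try
{
    await foreach (var token in executor.InferAsync(prompt, inferenceParams))
    {
        Console.Write(token);
    }
}
catch (ContextOverflowException)
{
    // The context window is full under ThrowException: e.g., summarize the
    // oldest turns (RAG/summarization) and retry with a shorter prompt.
}
```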
What's Included in this PR

- The `ContextOverflowStrategy` enum and `ContextOverflowException`.
- `IInferenceParams` and `InferenceParams` extended with the new configuration properties.
- `StatefulExecutorBase.cs` updated to respect the new strategy.
- `LLamaStatelessExecutor.cs` updated to try native memory shifting first, and to safely catch the `MemoryCanShift` exception to trigger the fallback logic.

Test Output
To ensure this is robust, I tested the exact context boundaries dynamically, based on what `llama.cpp` allocates at runtime. Both executors now successfully intercept the overflow and throw the correct exception before native crashes occur.

Looking forward to hearing your thoughts on this approach!