Epic: Session Continuity and Coherence in Gemini CLI
1. Overview
As the complexity of tasks handled by the Gemini CLI grows, standard context window limitations and linear chat histories become bottlenecks. Long-running sessions often suffer from context degradation, forgotten constraints, and "foggy" memory.
This epic tracks a multi-part plan to evolve the Gemini CLI from relying solely on linear history and basic compression to adopting robust, agentic, and model-directed context engineering.
2. Current State
- Thread Compression: Automatic and manual context summarization (e.g., via `/compress`) to avoid exceeding token limits.
- Memory Tool: A `save_memory` tool to persist global facts/preferences to `~/.gemini/GEMINI.md`.
- Context Loading: Hierarchical loading of `.gemini/GEMINI.md` files (Global, Extension, Project).
- Gap: State management, task tracking, and mid-session contextual anchoring are entirely manual or rely on ad-hoc agent behaviors.
3. Scope of Work
Phase 1: Short-Term (Incremental Improvements)
Focus: Immediate, localized improvements to the existing single-thread context lifecycle, minimizing noise and optimizing the current compression mechanics.
- Reduce Auto-Compression Threshold: Tune the default `model.compressionThreshold` (currently 0.5) to be more aggressive. Triggering compression earlier prevents the model from getting "lost in the weeds" of a massive context window.
- Auto-Distillation for Tool Calls: Automatically summarize or strictly filter high-volume tool outputs (like `run_shell_command` compiler errors or `grep_search` results) before they enter the main context window. This could involve using a fast, lightweight model (like Gemini Flash Lite) to extract just the "signal" from the "noise."
- Fix & Enhance Existing Compression Logic: Address known gaps in the current compression token counting. For example, ensure system instructions and tool definitions are correctly factored into the "beneficial compression" math, and tweak the summarization prompt to strictly preserve the user's active intent.
- Stale Output Elision (History Pruning): Introduce a mechanism to retroactively collapse tool outputs that are no longer relevant. For instance, if `read_file` outputs 500 lines, and the agent subsequently uses `replace` to rewrite that file, the original `read_file` output in the history can be safely replaced with `[Content elided - File subsequently modified]`.
- Guided Compression: Enhance the `/compress` command to accept user prompts (e.g., `/compress Retain the specific SQL query we just built`), ensuring critical details aren't lost during manual summarization.
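The stale-output elision described above can be sketched as a single pruning pass over the session history. This is an illustrative sketch only: `HistoryEntry`, `elideStaleReads`, and the `tool`/`path` fields are hypothetical shapes assumed for the example, not the CLI's actual history types.

```typescript
// Hypothetical history-entry shape; the real Gemini CLI types differ.
interface HistoryEntry {
  tool: string;    // e.g. "read_file", "replace"
  path?: string;   // file the tool touched, if any
  output: string;
}

const ELIDED = "[Content elided - File subsequently modified]";

// Walk the history backwards, tracking which files are rewritten later
// in the session; earlier read_file outputs for those files are
// collapsed to a short placeholder.
function elideStaleReads(history: HistoryEntry[]): HistoryEntry[] {
  const modifiedLater = new Set<string>();
  const result: HistoryEntry[] = [];
  for (let i = history.length - 1; i >= 0; i--) {
    const entry = history[i];
    if (entry.tool === "read_file" && entry.path && modifiedLater.has(entry.path)) {
      result.unshift({ ...entry, output: ELIDED });
    } else {
      result.unshift(entry);
    }
    // Record the modification *after* checking the current entry, so a
    // read that happens after the last write is kept verbatim.
    if (entry.tool === "replace" && entry.path) {
      modifiedLater.add(entry.path);
    }
  }
  return result;
}
```

Because only the `output` field is rewritten, the turn structure of the history is preserved and the elision is safe to apply repeatedly.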
Phase 2: Medium-Term (Active Context Management & Incremental Compression)
Focus: Shifting from reactive compression to proactive, agent-driven state management and continuous, granular context optimization within a single thread.
- Session Scratchpad Tool: Implement a lightweight `scratchpad` tool. This allows the agent to explicitly maintain an active, automatically injected markdown checklist/notes section for the duration of the session, separate from the chat stream. It acts as the agent's "working memory" that survives compression.
- State Checkpointing (`checkpoint_state`): Introduce a tool for the agent to explicitly declare "State Snapshots" (Goals, Key Knowledge, Recent Actions). The compression engine treats these snapshots as immutable anchors that must survive compression untouched, guaranteeing the agent never loses the thread of a long task.
- Background Context Monitor (The "Observer Agent"): Deploy a lightweight, parallel agent (e.g., Gemini Flash Lite) that asynchronously monitors the active thread's context window. Instead of waiting for a rigid token threshold, the Monitor Agent evaluates the semantic density of the conversation and can trigger an early, targeted compression if the thread is drifting or becoming overwhelmed by noise (e.g., after a series of failed debugging attempts).
- "Ship of Theseus" Incremental Compression: Move away from disruptive, "stop-the-world" summarization of the entire chat history. Implement a rolling compression strategy that continuously swaps out the oldest, most granular conversational turns (e.g., early research tool calls) with dense, localized summaries. The context window gradually transforms into a high-fidelity summary of the past and a granular view of the present, without ever halting the user's workflow for a massive compression pass.
- Semantic Elision of Redundant Tool Calls: A smarter version of stale output pruning. If the agent repeatedly calls `run_shell_command("npm run build")` and it fails 4 times with similar stack traces before succeeding, the system automatically collapses the 4 failures into a single semantic block (`[4 failed build attempts due to TS2304; resolved in subsequent edit]`), preserving the narrative while eliminating the redundant token cost.
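The redundant-call elision above can be sketched as one pass that collapses consecutive failed runs of the same command into a single placeholder. `ToolCall` and `collapseFailures` are hypothetical names for illustration; a production version would also summarize the shared error signature (e.g. the TS2304 diagnostic) rather than just counting attempts.

```typescript
// Hypothetical tool-call record; the real history types differ.
interface ToolCall {
  command: string;
  success: boolean;
  output: string;
}

// Collapse runs of two or more consecutive failures of the *same*
// command into a single semantic block; successes are kept verbatim.
function collapseFailures(calls: ToolCall[]): ToolCall[] {
  const out: ToolCall[] = [];
  let i = 0;
  while (i < calls.length) {
    // Extend j over consecutive failures of the command at position i.
    let j = i;
    while (j < calls.length && !calls[j].success && calls[j].command === calls[i].command) {
      j++;
    }
    const runLength = j - i;
    if (runLength >= 2) {
      out.push({
        command: calls[i].command,
        success: false,
        output: `[${runLength} failed attempts collapsed; resolved in subsequent call]`,
      });
      i = j;
    } else {
      out.push(calls[i]);
      i++;
    }
  }
  return out;
}
```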
Phase 3: Long-Term (Agentic & Model-Directed Context Engineering)
Focus: Shifting the paradigm from reactive system-level compression to proactive, model-directed context curation, where the agent is a first-class participant in managing its own cognitive load.
- Active Context Shaping ("The Oxygen Mask Protocol"): Empower the model with direct control over its context stream. Introduce tools like `distill_history` (to collapse the last N turns into a dense summary) or `elide_noise` (to actively drop specific verbose tool outputs). The agent can proactively prune its context during a complex debugging loop, rather than waiting for system-level auto-compression.
- Model-Driven Targeted Compression/Truncation: Enhance the background compression engine to be fully model-directed. Instead of a hard `compressionThreshold` and a generic "summarize this" prompt, the system queries the model itself: "You are approaching your context limit. Which previous conversational branches or tool outputs can be safely truncated, summarized, or dropped to preserve your focus on the current task?"
- Semantic History Rewriting: Shift from an append-only linear chat log to an evolving, Document-Driven State. The agent dynamically updates an overarching `Architecture_State.md` in the background. The actual "chat history" passed to the model becomes just the recent diffs or actions applied to this state, fundamentally decoupling the length of the session from the size of the context window.
- Structured Context Widget System (Exploratory): Investigate moving away from a flat text history entirely. The context window is composed of structured, model-controlled "widgets" (e.g., a "Current Code Snippets" widget, a "Recent Test Failures" widget, an "Active Constraints" widget). The agent explicitly pushes and pops data from these widgets. Note: This represents a radical architectural shift and may require specific fine-tuning or training data support to be fully effective.
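The widget idea in the last bullet can be sketched as a small store of named, agent-controlled sections that are rendered into the context in place of a flat history. This is an exploratory sketch under the stated assumptions; `ContextWidgets` and its push/pop/render API are hypothetical, and the real design would need model-side support.

```typescript
// Exploratory sketch of a structured context "widget" store. The agent
// pushes snippets into named widgets (e.g. "Active Constraints") and
// pops them when they go stale; render() builds the structured context
// passed to the model instead of a flat linear chat history.
class ContextWidgets {
  private widgets = new Map<string, string[]>();

  push(widget: string, item: string): void {
    const items = this.widgets.get(widget) ?? [];
    items.push(item);
    this.widgets.set(widget, items);
  }

  pop(widget: string): string | undefined {
    return this.widgets.get(widget)?.pop();
  }

  render(): string {
    return [...this.widgets.entries()]
      .map(([name, items]) => `## ${name}\n${items.map((i) => `- ${i}`).join("\n")}`)
      .join("\n\n");
  }
}
```

Because the model controls what each widget contains, context size is bounded by what is currently relevant rather than by session length, which is the same decoupling the Semantic History Rewriting bullet aims for.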
4. Success Criteria
- Users can run sessions spanning multiple days and hundreds of turns without the agent losing the initial goal.
- Token usage is actively managed and reduced via intelligent pruning rather than just brute-force summarization.
- The agent proactively tracks its own progress and constraints without relying on the user to remind it.