A local-first coding sidekick for Antigravity, VS Code and Cursor. Stream live model output, chat with your repo, edit files, run terminal commands, and keep everything on-device without shipping your soul to the cloud.
This extension was built because I was tired of being ghosted by AI every time I hit 30,000 feet. No Wi-Fi? No problem. LLeM was made specifically for those long-haul flights where you just want to vibe and code without needing an internet connection.
Note
Credits & Origin: Huge shoutout to the OG inspiration, connect-ai. We took that foundation, gave it a massive refactor, and cranked everything up to eleven. We're talking boosted performance, fresh features, and a serious security audit to keep your local workflow locked down. We didn't just download it; we leveled it up.
Special Thanks: Seriously, LLeM wouldn't even exist if it weren't for connect-ai sharing their code with the world. Major respect for that open-source energy; you made this happen.
Fair Play: If you're planning to build on top of this or create something new based on this code, please keep the good karma flowing and shout out the original creators, connect-ai, and their contributions. Respect the hustle!
Important
LLeM is 100% local. Your code never leaves your machine. No cloud, no drama, just pure local intelligence.
LLeM now treats common context-mode utility commands as first-class MCP actions instead of sending them through the normal chat-completion path. That makes commands like ctx stats and /ctx_stats faster, cleaner, and less likely to confuse local models.
You can now run context-mode utilities directly from chat:
- `ctx stats`
- `ctx doctor`
- `ctx upgrade`
- `ctx purge`
- `ctx insight`
LLeM recognizes these as MCP utility commands, routes them to the matching context-mode tool, and displays the result back in the chat.
LLeM also understands slash-style aliases for the same tools:
- `/ctx_stats`
- `/ctx-stats`
- `/ctx_doctor`
- `/ctx-doctor`
- `/context-mode:ctx-stats`
- `/context-mode:ctx-doctor`
Hyphenated aliases are normalized to the MCP tool names internally, so /ctx-stats resolves to ctx_stats, and /context-mode:ctx-doctor resolves to ctx_doctor.
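The exact routing code isn't shown here, but the alias handling can be pictured as a small normalization step. A minimal TypeScript sketch, assuming a hypothetical `normalizeCtxAlias` helper (not LLeM's actual API):

```typescript
// Hypothetical helper: maps slash-style aliases onto MCP tool names.
function normalizeCtxAlias(input: string): string | null {
  const withoutSlash = input.trim().replace(/^\//, "");
  // Drop an optional "context-mode:" server prefix.
  const withoutServer = withoutSlash.replace(/^context-mode:/, "");
  // Hyphenated aliases become underscore-style MCP tool names.
  const toolName = withoutServer.replace(/-/g, "_");
  const knownTools = new Set(["ctx_stats", "ctx_doctor", "ctx_upgrade", "ctx_purge", "ctx_insight"]);
  return knownTools.has(toolName) ? toolName : null;
}

// normalizeCtxAlias("/ctx-stats")               -> "ctx_stats"
// normalizeCtxAlias("/context-mode:ctx-doctor") -> "ctx_doctor"
```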
MCP responses shaped like { content: [{ type: "text", text: "..." }] } are now unwrapped and displayed as plain text. Raw JSON is only shown as a fallback when LLeM cannot extract text content.
This fixes the context-mode stats display issue where the result could appear as a JSON code block instead of readable output.
The previous JSON rendering path could expose long separator lines from context-mode stats output. Some local models would echo those separators during follow-up turns, which could trip LLeM's repetition watchdog. With text-first MCP rendering, the stats output stays readable and avoids feeding noisy JSON wrappers back into the conversation.
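As a rough illustration of that text-first rendering rule, here is a minimal TypeScript sketch; the interfaces and the `renderMcpResult` name are assumptions, not the extension's real types:

```typescript
interface McpContentPart { type: string; text?: string; }
interface McpToolResult { content?: McpContentPart[]; }

// Prefer plain text parts; fall back to raw JSON only when no text is found.
function renderMcpResult(result: McpToolResult): string {
  const textParts = (result.content ?? [])
    .filter((part) => part.type === "text" && typeof part.text === "string")
    .map((part) => part.text as string);
  return textParts.length > 0 ? textParts.join("\n") : JSON.stringify(result, null, 2);
}
```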
This update also adds regression tests for the new context-mode command parsing behavior, including slash aliases and natural utility commands.
Since LLeM is currently in early flight, we distribute it via .vsix files. Follow these steps to get airborne:
- Go to the LLeM GitHub Repository.
- Look at the Releases section on the right sidebar.
- Click on the latest release tag (e.g., `v3.5.12`).
- Under the Assets section, click on the `.vsix` file (e.g., `llem-3.5.12.vsix`) to download it to your machine.
- Open VS Code or Cursor.
- Open the Extensions view by clicking the square icon in the left sidebar or pressing `Cmd+Shift+X` (macOS) or `Ctrl+Shift+X` (Windows/Linux).
- Click the `...` (More Actions) menu icon in the top right corner of the Extensions view title bar.
- Select Install from VSIX... from the dropdown menu.
- Locate and select the `.vsix` file you just downloaded.
- Once installed, you might need to click Reload or restart your editor.
- Local-First Workflow: Connects directly to local engines like Rapid-MLX, LM Studio, or Ollama. No cloud, no API costs.
- Live Streaming: Real-time output rendered inside a custom VS Code chat panel with full Markdown and code block support.
- Agentic Actions: Trigger file creations, non-destructive edits, and terminal commands directly from the AI's response.
- Persistent History: Conversations are automatically saved to `~/.llem-history`, supporting session recovery, renaming, and bulk deletion.
- Workspace Awareness: Real-time monitoring of your project files. Drop files/folders into chat for instant, high-fidelity context injection.
- The Brain (Markdown Vault): Sync your notes with an Obsidian-compatible vault. Supports visual network maps and local Git synchronization.
- Performance First: Multi-layered caching, request throttling, and token-usage monitoring to keep your dev environment snappy.
- Model-Aware Prompt Budgeting: Automatically trims prompt weight for big local models so 24B+ and 26B-class runs stay responsive instead of drowning in context.
- Built-In Diagnostics: Inspect prompt size, first-token delay, section-by-section context weight, and streaming throughput directly from the LLeM diagnostics panel.
To get started, you'll need a local model runtime running on your machine.
Typical URL: http://127.0.0.1:8000
Rapid-MLX is supported as an OpenAI-compatible local engine. LLeM talks to its /v1/chat/completions streaming endpoint and reads installed models from /v1/models.
```bash
# Install and serve a model on the default Rapid-MLX port
pip install vllm-mlx
rapid-mlx serve qwen3.5-4b --port 8000
```

Then point `llem.engineUrl` at `http://127.0.0.1:8000` or pick Rapid-MLX from Settings -> Swap model engine.
Typical URL: http://127.0.0.1:11434
```bash
# Pull a model and serve
ollama pull gemma4:e4b
ollama serve
```

For larger local runs, a 24B+ Gemma-family model is a better fit for the new performance profile flow:

```bash
# Example 26B-class local setup
ollama pull gemma6:26b
ollama serve
```

Typical URL: http://127.0.0.1:1234
- Download and load your favorite model.
- Enable the Local Server.
- Confirm the server is active.
- Point `llem.engineUrl` at `http://127.0.0.1:1234` or pick LM Studio from Settings -> Swap model engine.
LLeM treats LM Studio as an OpenAI-compatible local engine and automatically normalizes the request path to /v1/chat/completions.
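That normalization boils down to appending the standard OpenAI-compatible path when it is missing. A minimal sketch, assuming a hypothetical `chatCompletionsUrl` helper rather than LLeM's actual code:

```typescript
// Hypothetical helper: resolve the chat endpoint for an OpenAI-compatible engine.
function chatCompletionsUrl(engineUrl: string): string {
  const base = engineUrl.replace(/\/+$/, ""); // drop trailing slashes
  if (base.endsWith("/v1/chat/completions")) return base;
  if (base.endsWith("/v1")) return `${base}/chat/completions`;
  return `${base}/v1/chat/completions`;
}

// chatCompletionsUrl("http://127.0.0.1:1234")    -> "http://127.0.0.1:1234/v1/chat/completions"
// chatCompletionsUrl("http://127.0.0.1:8000/v1") -> "http://127.0.0.1:8000/v1/chat/completions"
```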
Open your VS Code settings.json to customize the experience.
| Setting | Description | Default |
|---|---|---|
| `llem.engineUrl` | Local/remote model endpoint URL. Supports Rapid-MLX (`:8000`), LM Studio (`:1234`), OpenAI-compatible `/v1` servers, and Ollama (`:11434`). | `http://127.0.0.1:11434` |
| `llem.defaultModel` | The default model slug used for requests. | `gemma4:e4b` |
| `llem.performancePreset` | Prompt and generation budget profile. Use `auto`, `balanced`, or `large-local-26b`. | `auto` |
| `llem.requestTimeout` | Request timeout in seconds. | `300` |
| `llem.vaultPath` | Path to your markdown vault. | `~/.llem-vault` |
| `llem.bridgeEnabled` | Enable the local HTTP bridge on port 4825. | `false` |
| `llem.bridgeToken` | Security token for authenticated bridge callers. | (empty) |
| `llem.mcpEnabled` | Enable MCP server discovery and tool calls. | `true` |
| `llem.mcpServers` | MCP servers registered directly in LLeM. | `{}` |
| `llem.mcpConfigSources` | MCP sources to resolve. | `["llem", "workspace", "codex-global", "codex-project"]` |
| `llem.mcpConfigPaths` | Extra MCP JSON/TOML config paths to import. | `[]` |
| `llem.mcpToolTimeoutSeconds` | Timeout for MCP startup, listing, and calls. | `60` |
| `llem.maxHistoryItems` | Maximum number of sessions to keep in history. | `100` |
Tip
If you're using a slower model or long prompts, try bumping up `llem.requestTimeout`.
LLeM can register and run MCP servers with Codex-style action tags:
```
<list_mcp_tools/>
<call_mcp_tool server="context7" tool="resolve-library-id">{"libraryName":"react"}</call_mcp_tool>
```

Direct LLeM config lives in `llem.mcpServers`:
"llem.mcpServers": {
"context7": {
"command": "npx",
"args": ["-y", "@upstash/context7-mcp"],
"env": {},
"enabled": true
}
}LLeM also syncs Codex MCP settings from $CODEX_HOME/config.toml, ~/.codex/config.toml, and <workspace>/.codex/config.toml. Before applying a sync it shows a diff with Added, Removed, and Changed servers; environment values are masked and only changed keys are shown. Synced snapshots are stored in ~/.llem/llem-mcp-synced.json instead of VS Code settings, and user-owned llem.mcpServers are never deleted or modified by Codex sync.
Use Settings -> MCP servers -> Import MCP from GitHub URL to paste an MCP repository URL. LLeM reads README/package/config examples, previews the inferred server command, and imports it only after approval.
v1 runs stdio MCP servers only. HTTP/SSE/remote entries can be imported and listed, but tool calls report them as unsupported.
LLeM reads markdown files from ~/.llem/prolog before every prompt. Files are loaded in 0-9, A-Z filename order and injected directly after the built-in system prompt as mandatory prolog instructions. Use this for durable local routing rules, house style, or workflow constraints that should apply to every request.
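Conceptually, the prolog step just reads the markdown files in filename order and concatenates them behind the system prompt. A minimal Node/TypeScript sketch of that idea (the function name is illustrative, not the extension's actual code):

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Illustrative only: gather ~/.llem/prolog/*.md in 0-9, A-Z filename order.
function loadPrologInstructions(): string {
  const dir = path.join(os.homedir(), ".llem", "prolog");
  if (!fs.existsSync(dir)) return "";
  return fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith(".md"))
    .sort() // lexicographic: digits sort before letters
    .map((name) => fs.readFileSync(path.join(dir, name), "utf8"))
    .join("\n\n");
}
```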
- If terminal commands fail on Windows, confirm `node`, `npm`, and `npx` are available in the VS Code process environment.
- If a model response suggests an edit that has no effect, use `read_file` first and retry with the current file content.
For bigger local models such as `gemma6:26b` or other 24B+ Gemma-family builds:

- prefer Rapid-MLX or LM Studio when you want an OpenAI-compatible local server,
- prefer Ollama when you want the native Ollama `/api/chat` path and local manifest-based capability hints,
- switch `llem.performancePreset` to `large-local-26b` if you want tighter prompt budgets immediately,
- keep `llem.performancePreset` on `auto` if you want LLeM to detect 26B-class models by name or metadata,
- raise `llem.requestTimeout` to around `600` seconds on slower or memory-constrained machines,
- pair a 26B default with a smaller fallback model if you want fast iteration for simple edits.

Current-machine guidance:

- on Apple Silicon systems around the 34 GB class, `large-local-26b` is the recommended preset for 26B local models,
- on other machines, start with the same preset and only widen timeout or context if your hardware can comfortably handle it.
LLeM now exposes a model-sensitive prompt and generation budget setting through llem.performancePreset.
- `auto`: Recommended default. LLeM checks the selected model name and, when available, Ollama metadata such as `parameter_size`. If the model looks like a 24B+ local run, it automatically switches into the 26B profile.
- `balanced`: Keeps the wider default context and generation budget. This is the better fit for smaller local models when raw responsiveness is already good.
- `large-local-26b`: Uses a tighter prompt budget and smaller Ollama generation window so big local models spend less time chewing through workspace context before the first token lands.
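The `auto` heuristic can be thought of as a simple size check on the model name or metadata. A hedged TypeScript sketch; the threshold and function name are assumptions, not the shipped implementation:

```typescript
type PerformanceProfile = "balanced" | "large-local-26b";

// Illustrative heuristic: read a parameter count like "26b" from the model name
// or from Ollama's parameter_size metadata, then pick the profile.
function resolveProfile(modelName: string, parameterSize?: string): PerformanceProfile {
  const source = parameterSize ?? modelName;
  const match = source.match(/(\d+(?:\.\d+)?)\s*b\b/i);
  const billions = match ? parseFloat(match[1]) : 0;
  return billions >= 24 ? "large-local-26b" : "balanced";
}

// resolveProfile("gemma6:26b")         -> "large-local-26b"
// resolveProfile("gemma4:e4b", "4.5B") -> "balanced"
```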
When large-local-26b is active, LLeM intentionally becomes more selective about context:
- active editor context gets first priority,
- attached text files are budgeted per file and across the whole turn,
- workspace tree and vault index are clipped more aggressively,
- and older low-relevance chat history is pruned before the current request is allowed to grow out of control.
This is designed to improve real-world latency, not benchmark token counts in isolation. The point is to keep the answer useful while reducing the hidden prompt tax that large local models pay.
Use LLeM: Show Diagnostics when tuning a larger model. The diagnostics channel now surfaces the key numbers you need:
- selected model and resolved performance profile,
- estimated prompt size before send,
- final request size after pruning,
- history, attachment, active-editor, workspace, and vault character breakdowns,
- pruned message count and attachment trim amount,
- first-token latency,
- total stream duration,
- and token throughput.
If a 26B-class model still feels sluggish, the fastest knobs to check are:
- `llem.performancePreset`
- `llem.requestTimeout`
- total attachment size in the current turn
- whether the active file or vault index is unusually large
In practice, this makes it much easier to see whether the bottleneck is model load time, prompt size, or generation speed.
- Node.js (v18+)
- npm
- Compile: `npm run compile`
- Build VSIX: `npm run package:vsix`
- Local Test VSIX: `npm run package:vsix:local`
- Context Limits: Large file attachments might hit the context window limit of your local model.
- Large-Model Warmup: The first request to a 24B+ local model can still feel slow even after prompt trimming, especially right after loading the model into memory.
- Server Check: Make sure your local engine (Rapid-MLX/LM Studio/Ollama) is actually running before you start chatting.
v3.1.1 builds on the v3.1.0 chat UX refresh and adds the missing piece: editing earlier user messages in a Gemini Web-style flow.
You can now go back to a previous user message, click Edit, revise the prompt, and continue from there.
- the old thread stays intact,
- LLeM creates a new branch from the point before that message,
- the edited prompt is resubmitted into that branch,
- and any reusable attached files from the original message can travel with the edit flow.
This keeps the conversation history safe while making prompt iteration much faster and less destructive.
With Copy, Branch, Edit, Like, and Dislike, each finished exchange can now be reused in multiple ways:
- Copy a strong answer,
- Branch an assistant response into a new direction,
- Edit a user message to retry from an earlier point,
- Like a response style you want repeated,
- Dislike a response style you want avoided later.
That makes LLeM feel much closer to modern consumer chat tools while staying inside VS Code and staying local-first.
Preference memory continues to apply across:
- normal follow-up turns,
- new chats,
- chat branches,
- and edited-message branches.
So if you teach LLeM what kind of answers you like, that preference signal survives even when you fork or revise the conversation path.
Clickable file references in chat are now more accurate:
- only editable file types can be opened from chat,
- basename-only references like `extension.ts` can resolve to a real workspace file when the match is unambiguous,
- and chat attachments preserve enough metadata to reopen the right source more reliably.
The webview renderer now handles leftover inline Markdown markers more gracefully in normal prose.
- inline bold and emphasis markers render more reliably in mixed-language text,
- bullet items that mix bold Korean phrases with Latin terms such as `(Local-first)` now display with the intended emphasis,
- and the fallback logic avoids touching fenced code blocks while cleaning up visible chat output.
- added editable earlier-message branching from the webview action bar,
- preserved reusable attachment payloads in display history for edit/retry flows,
- added branch generation from the point before a selected user message,
- improved workspace filename resolution for clickable chat file references,
- added a safe inline-Markdown fallback for webview chat rendering,
- kept reply-style preference memory persistent across branch variants.
The big shift here is that LLeM is no longer just good at answering or branching. It is now better at revising. That means less copy-paste, less losing context, and much smoother iteration when you're tightening prompts or trying alternate implementation directions.
v3.1.0 is the first release that makes each completed reply feel more like a modern chat product, while still keeping the whole workflow local-first.
Once an assistant reply finishes streaming, LLeM now shows a compact action row directly under that message.
- Copy: Copies just that specific assistant response to your clipboard.
- Branch: Creates a brand-new chat branch from that response so you can explore a different direction without losing the original thread.
- Like: Marks that answer style as something the user wants more of.
- Dislike: Marks that answer style as something the user wants less of.
This interaction model is intentionally inspired by the post-reply controls you see in Gemini Web, but adapted to LLeM's local VS Code workflow.
Branching is now a first-class concept inside the chat experience.
- You can branch from any completed assistant response.
- The new branch becomes its own saved chat session.
- The original conversation remains untouched in history.
- The branch inherits the visible conversation context up to the selected reply, making it easy to explore alternate plans, implementations, or follow-up prompts.
This is especially useful when you want to:
- compare two implementation strategies,
- keep one thread focused on debugging while another explores a refactor,
- or preserve a "good state" before taking the conversation in a different direction.
Likes and dislikes are not cosmetic. They now update a persistent memory layer that survives:
- new chats,
- branched chats,
- and extension restarts.
When you give feedback on a reply, LLeM stores a compact memory of that preference and uses it to steer future responses. In practice, that means:
- replies you like help reinforce the kind of tone, structure, and answer shape you want,
- replies you dislike tell the assistant to avoid similar response patterns later unless you explicitly ask for them.
This preference memory is injected into the system context for future requests, so LLeM can adapt over time instead of acting like every conversation starts from zero.
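One way to picture that injection is a small prompt-assembly step that appends liked and disliked style notes to the base system prompt. A minimal sketch with assumed shapes; `ReplyPreference` and `buildSystemContext` are illustrative names only:

```typescript
interface ReplyPreference { kind: "like" | "dislike"; summary: string; }

// Illustrative only: fold stored feedback into the system context for the next request.
function buildSystemContext(basePrompt: string, prefs: ReplyPreference[]): string {
  const liked = prefs.filter((p) => p.kind === "like").map((p) => `- ${p.summary}`);
  const disliked = prefs.filter((p) => p.kind === "dislike").map((p) => `- ${p.summary}`);
  const sections = [basePrompt];
  if (liked.length > 0) sections.push(`Preferred reply styles:\n${liked.join("\n")}`);
  if (disliked.length > 0) sections.push(`Reply styles to avoid:\n${disliked.join("\n")}`);
  return sections.join("\n\n");
}
```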
This release also tightens the file interaction model inside chat:
- only editable file types are shown as clickable in message content,
- only editable attachments can be opened from chat,
- and dropped file attachments preserve enough metadata to open the correct source more reliably.
That keeps chat interactions cleaner and avoids misleading "clickable" affordances on files that are not actually editable in the intended way.
Under the hood, this release adds several important building blocks:
- a shared editable-file classifier used by both the webview and extension host,
- per-message feedback state in persisted chat history,
- a new response preference manager backed by extension global state,
- message-level UI actions for copy, branching, and feedback,
- and branch session generation from the currently visible conversation timeline.
LLeM has always focused on local execution, real file edits, and practical repo-aware assistance. With v3.1.0, the chat UX becomes much more iterative:
- you can fork thought paths without losing your place,
- quickly reuse or share strong replies,
- and gradually teach the assistant how you want it to respond.
Still local. Still yours. Just much more adaptable.
Sup world! v3.0.5 is officially out in the wild and it's our first public release.
- Branding on Point: We ditched the boring stuff for a fresh icon and a UI that actually looks good.
- Gemma Optimization: We tweaked the engine to hunt down Ollama's or LM Studio's default model automatically.
- Chat History 2.0: Full persistence layer implemented. Your conversations now survive VS Code restarts.
- Workspace Sync: Instant UI updates when you rename, delete, or add files to your project.
- Security Audit: Completed a deep-dive security pass on the Bridge Server, adding rate limiting and token-based auth.
- Better Vibes: Smoother logging and descriptive errors so you're never left guessing.
- Public Launch: This is it. The first time we're letting this thing out of the hangar for everyone to use.
Local-first, offline-always. Let's cook.
- Support context-mode utility command aliases and render MCP text results directly.
- Packaged `release/llem-3.5.12.vsix`.
- Show MCP slash command results directly in chat responses.
- Packaged `release/llem-3.5.11.vsix`.
- MCP slash commands such as /ctx_stats now execute directly and placeholder MCP server names are resolved from available tools.
- Packaged `release/llem-3.5.10.vsix`.
- Added execution modes and Windows HOME prolog loading
- removed profile fallback from prolog discovery
- Packaged `release/llem-3.5.9.vsix`.
- Moved the active neon animation from the header tagline to the Running now queue card
- Packaged `release/llem-3.5.8.vsix`.
- Moved MCP runtime modules into src/mcp for clearer source organization
- Packaged `release/llem-3.5.7.vsix`.
- Prevent partial file paths from expanding into nested MCP folders
- Packaged `release/llem-3.5.6.vsix`.
- Added ~/.llem/prolog markdown prolog loading before every prompt
- Applied prolog files in numeric and alphabetical filename order
- Documented prompt prolog behavior in README
- Packaged `release/llem-3.5.5.vsix`.
- Recognized codex as an MCP config source alias so synced Codex servers appear in LLeM server lists
- Packaged `release/llem-3.5.4.vsix`.
- Moved synced MCP snapshots from VS Code settings into ~/.llem/llem-mcp-synced.json
- Documented the home-profile MCP sync storage path in README
- Added neon highlighting for the active running queue prompt
- Removed the header neon underline below the LLeM tagline
- Packaged `release/llem-3.5.3.vsix`.
- Added MCP server discovery and stdio tool-call support
- Added Codex MCP config sync with diff preview and masked environment changes
- Added GitHub MCP server import flow
- Packaged `release/llem-3.5.2.vsix`.
v3.5.0 expands LLeM's local engine support beyond the original Ollama/LM Studio flow and makes OpenAI-compatible MLX runtimes a first-class path.
- Bumped the extension version from `3.4.3` to `3.5.0`.
- Added Rapid-MLX support as a local OpenAI-compatible backend.
- Added automatic first-run discovery for Rapid-MLX at `http://127.0.0.1:8000`.
- Added `/v1/models` model discovery for Rapid-MLX and other OpenAI-compatible local engines.
- Normalized Rapid-MLX requests to `/v1/chat/completions`, matching the same streaming chat shape used by LM Studio.
- Kept LM Studio support explicit at `http://127.0.0.1:1234`, including `/v1/chat/completions` and `/v1/models`.
- Kept Ollama support on the native `http://127.0.0.1:11434` path with `/api/chat`, `/api/tags`, and Ollama metadata/capability checks.
- Updated the Settings menu so Rapid-MLX, LM Studio, and Ollama can all be selected from Swap model engine.
- Updated active runtime labels so prompts and diagnostics report `Rapid-MLX`, `LM Studio`, `Ollama`, or a generic OpenAI-compatible local engine instead of mislabeling every `/v1` endpoint as LM Studio.
- Improved connection and model-not-found guidance so errors point users toward the selected engine's expected local port and startup flow.
- Fixed the first action icon under chat results so it copies the message to the clipboard without pasting it into the composer or opening the edit-branch banner.
- Refreshed README setup guidance with Rapid-MLX install/serve examples, LM Studio selection notes, and the complete local engine compatibility list.
- Renamed the VS Code tab and view labels from Assistant to LLeM.
- Refreshed the compiled extension and webview bundles for the release.
- Packaged `release/llem-3.4.3.vsix`.
- Removed MCP and context-mode integration from runtime, prompts, and docs
- cleaned vault handling guidance and saved context-mode rules into the local vault
- refreshed package contents after the MCP removal
- Packaged `release/llem-3.4.2.vsix`.
- Removed LLeM branding from visible UI strings
- removed context-mode integration
- restored local TypeScript tooling so typecheck works again
- Packaged `release/llem-3.4.1.vsix`.
This release focuses on making agentic file edits visible, debuggable, and easier to trust when running local models such as Ollama Gemma-family models.
- Codex-style file change summaries in chat: When LLeM creates, edits, or deletes files, the chat now shows a compact change card with one row per file. Each row includes the action, file name, and line-level `+`/`-` counts so you can immediately see what changed without opening the filesystem first.
- Whole-turn change totals: Multi-file edits now include a footer such as `2 files changed +75 -20`, giving a clear overview of the total edit impact for the current agent action.
- Clickable changed files: File rows in the change summary can be clicked to open the affected file directly from the chat UI.
- Review Changes shortcut: The change summary includes a `Review changes` button that opens VS Code's Source Control view, making it faster to inspect the workspace diff after an agent run.
- Stronger edit failure visibility: If the model emits an `<edit_file>` action but none of the `<find>` blocks match the current file, LLeM now reports it as a clear failure: `Edit failed ... replacement 0/N`. This makes silent no-op edits much harder to miss.
- Immediate Action Report streaming: External action results are now posted into the live chat stream as soon as they happen. File edits, failed replacements, safety blocks, and terminal actions no longer wait until later continuation logic to become visible.
- Action Report preserved in the final answer: The final assistant message keeps the action report attached, so the user can scroll back later and still see exactly what LLeM tried, what succeeded, and what failed.
- Cleaner regenerate behavior: `Regenerate reply` now removes the previous assistant response from the chat UI before streaming the replacement, so regeneration feels like a true retry instead of an extra appended answer.
- Follow-up recovery guidance for local models: When an edit fails because the `<find>` text does not match, LLeM now gives the follow-up model turn a stronger system observation telling it to retry with exact current file content instead of explaining the failure away.
- Post-mortem logging for file actions: File create/edit/delete paths now write structured diagnostics for validation blocks, missing files, invalid edit bodies, zero-replacement edits, successful writes, and exceptions. These logs include trace IDs, parsed action counts, file paths, replacement metadata, and previews to help reconstruct what happened after a failed run.
- Safer testable logging outside VS Code: The logger now lazily loads the VS Code API and falls back to diagnostics-file logging during Node-based tests, so action logging can be covered without requiring an extension host.
- Regression coverage for edit metadata: Tests now verify that file action results include structured change metadata for created, edited, and deleted files.
- Bumped version and upgraded axios before VSIX build.
- Packaged `release/llem-3.3.35.vsix`.
- Fixed image lightbox close behavior so the top-right close button and backdrop dismiss reliably. Reduced action-history bloat by keeping only the most recent file context per turn and trimming file/web observation payloads.
- Packaged `release/llem-3.3.34.vsix`.
- Reduced action-history bloat by keeping only the most recent file context per turn and trimming file/web observation payloads. Improved live output masking, offline vision detection, image lightbox preview, and request startup logging.
- Packaged `release/llem-3.3.33.vsix`.
- Masked create_file and edit_file code from live output and now show progress-only streaming states. Improved offline vision-model detection using local Ollama manifests and added vision decision logging.
- Packaged `release/llem-3.3.32.vsix`.
- Improved offline vision-model detection using local Ollama manifests. Fixed capability checks to use the active engine endpoint and added vision decision logging.
- Packaged `release/llem-3.3.31.vsix`.
- Implemented Intelligent Repetition Guard with tiered backoff (3s, 10s, 30s), non-blocking queue scheduling, and automated retry orchestration with UI cooldown feedback.
- Packaged `release/llem-3.3.30.vsix`.
- Implemented File System Access Transparency with user-approved out-of-workspace operations and high-fidelity UI feedback.
- Packaged `release/llem-3.3.29.vsix`.
- Action transparency and loop prevention improvements
- Packaged `release/llem-3.3.28.vsix`.
- Added live stream metadata (duration, chunks, chars) to action progress UI
- Packaged `release/llem-3.3.27.vsix`.
- Implemented AI self-correction loop and Codex-style action progress visualization
- Packaged `release/llem-3.3.24.vsix`.
- B-1 fix: Repeated/watchdog-aborted responses are no longer pushed to the chat history. Previously, the aborted assistant message would linger in history and seed the next turn with a contaminated context, causing cascading repetition loops. Now the pipeline returns immediately without writing the bad response to history.
- B-2 fix: Consecutive `assistant → assistant` or `user → user` message pushes during agentic action loops are now de-duplicated. If a `continuation` user message arrives when the last history entry is already a `user` entry, the content is merged rather than creating a second entry.
- B-3 fix: Images are no longer forwarded to text-only models (gemma, llama, mistral, etc.). The model name is inspected for known vision indicators (`llava`, `vision`, `:vl`, `bakllava`, `moondream`, etc.) and a clear in-chat notice is shown when an image is skipped.
- B-4 fix: `RequestRetryGuard` fingerprints now use a normalized, punctuation-stripped 300-character prompt core instead of the raw prompt string. Rephrased retries of the same request are blocked even when the exact wording changes.
- FileStateGuard: New `src/fileStateGuard.ts` computes SHA-256 hashes before and after every `edit_file` action. A `no-effect` warning is surfaced when the file is unchanged (typically a `<find>` mismatch). After 3 consecutive no-effect edits on the same file, `loop-detected` is returned and further edits on that path are blocked via `ActionLoopGuard` (see the sketch after this entry).
- Packaged `release/llem-3.3.22.vsix`.
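For readers curious how a no-effect check like FileStateGuard can work, here is a minimal TypeScript sketch of the hash-compare idea (function names are illustrative, not the actual `src/fileStateGuard.ts` API):

```typescript
import { createHash } from "crypto";
import * as fs from "fs";

// Illustrative only: hash a file before and after an edit_file action.
function sha256OfFile(filePath: string): string {
  return createHash("sha256").update(fs.readFileSync(filePath)).digest("hex");
}

function editTookEffect(filePath: string, applyEdit: () => void): boolean {
  const before = sha256OfFile(filePath);
  applyEdit();
  const after = sha256OfFile(filePath);
  return before !== after; // false means a "no-effect" edit, typically a <find> mismatch
}
```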
- Live stream output now shows raw AI text without any HTML/Markdown parsing during generation, so `<edit_file>`, `<find>`, and `<replace>` action tags are visible as-is while streaming.
- Final reply (after stream completes) continues to render as full Markdown with code highlighting, file badges, and action summaries.
- Removed the `sanitizeAssistantDisplayText()` call from the live `renderStreamNow()` path so the raw model output is never silently stripped mid-stream.
- Packaged `release/llem-3.3.21.vsix`.
- Hardened assistant output sanitization to prevent leaked action tags and scratchpad text in streamed replies
- Packaged `release/llem-3.3.20.vsix`.
- Fixed RepetitionWatchdog false positives that could truncate edit-file streams during repeated action-tag/code sequences. Added regression coverage for repeated closing-tag action streams.
- Packaged `release/llem-3.3.19.vsix`.
- Repackaged the current workspace state through the formal VSIX release flow.
- Packaged `release/llem-3.3.18.vsix`.
- Fixed RepetitionWatchdog false positives on markdown structure tokens so tables, fences, headers, list markers, blockquotes, and task markers no longer abort valid replies
- added regression tests for markdown-safe watchdog behavior
- Packaged `release/llem-3.3.17.vsix`.
- Added structured repetition abort handling, retry and action loop guards, safer file mutation validation, restored clickable editable files, and added default-browser opening for chat URL links
- Packaged `release/llem-3.3.16.vsix`.
- Bumped the extension version from `3.3.15` to `3.3.16`.
- Fixed Korean IME Enter handling so composing Hangul no longer sends a duplicated trailing message.
- Added composition-aware Enter submission logic with regression coverage for `isComposing` and IME confirm keycode `229`.
- Hardened stream loop handling so repetition detection is promoted into structured pipeline state instead of being treated like a normal completion.
- Stopped follow-up execution after repetition aborts, including watchdog-triggered stops and turn-to-turn repeated continuation loops.
- Added request fingerprinting and retry fencing so the same request cannot immediately restart after a repetition stop.
- Added action loop guarding so repeated `create_file` and `edit_file` patterns are blocked before they spin in place.
- Added file mutation guarding so the same file cannot be mutated twice at the same time during model-driven actions.
- Rejected incomplete `<find>`/`<replace>` edit bodies before disk write, preventing truncated edit actions from corrupting files.
- Rejected obviously truncated `create_file` output such as unbalanced fenced code blocks before writing files.
- Generalized plan-first enforcement for implementation requests, not just special design-guideline file names.
- Added implementation planning mode so code-generation requests are guided toward a compact file split and smaller Next.js/TypeScript steps first.
- Added a stronger post-processing guard that blocks action-tag execution if the model disobeys the initial plan-only response.
- Restored clickable editable-file behavior in chat by improving local file link validation, workspace-path resolution, and message rerendering after workspace file sync.
- Added default-browser opening for URL links in chat by routing external links through the extension host with `vscode.env.openExternal(...)`.
- Expanded tests for stream outcome handling, retry guards, action loop guards, file mutation guards, design planning mode, editable file resolution, external link routing, and file-safety edge cases.
- Fixed Korean IME Enter handling to prevent duplicate trailing messages
- added regression tests for composition-safe prompt submission
- Packaged `release/llem-3.3.15.vsix`.
- Added queued request pause/resume and reordering
- Added direct editing for queued items
- Expanded queue tests and stabilized package test suite
- Packaged `release/llem-3.3.14.vsix`.
- Fix stop button UI and edit banner visibility
- Packaged `release/llem-3.3.12.vsix`.
- Fix main-view layout causing input to overflow
- Packaged `release/llem-3.3.11.vsix`.
- Fix terminal executing logged messages
- Packaged `release/llem-3.3.10.vsix`.
- Fix immediate deletion of history items in UI
- Packaged `release/llem-3.3.9.vsix`.
- Fix edit banner visibility on initial chat load
- Packaged `release/llem-3.3.8.vsix`.
- Fix edit banner visibility on initial chat load
- Packaged `release/llem-3.3.7.vsix`.
- Fix terminal rendering, layout stability, and improve hardware summary quality
- Packaged `release/llem-3.3.7.vsix`.
- Implemented sequence-aware RepetitionWatchdog and improved action parsing to prevent infinite loops.
- Packaged `release/llem-3.3.6.vsix`.
- Fixed model output streaming issues with buffering and enhanced token extraction for reasoning fields.
- Packaged `release/llem-3.2.9.vsix`.
- Fixed AI response truncation, improved action tag stripping with smart quote support, and tuned model performance profiles for 26B models.
- Packaged `release/llem-3.2.7.vsix`.
- Enabled unlimited response length by setting predict token limits to -1. Added handling for unlimited output in both Ollama and LM Studio engines.
- Packaged `release/llem-3.2.5.vsix`.
- Increased token prediction limits to 4096+ to prevent response truncation. Fixed LM Studio max_tokens mapping.
- Packaged `release/llem-3.2.3.vsix`.
- Implemented repetition penalty for large models to prevent hallucination loops, fixed model selection persistence in settings.json, and added overwrite protection for user settings.
- Packaged `release/llem-3.2.1.vsix`.
- Revert to standard settings.json persistence and fix model selection overwrite issue
- Packaged `release/llem-3.1.9.vsix`.
- Added Codex-style message actions for user and assistant replies
- restored copy and edit flows for existing user messages
- added edit-in-new-branch composer state
- Packaged `release/llem-3.1.8.vsix`.
- Made the model dropdown persist the real active default model and pass runtime engine/model metadata into each request
- Removed the earlier-message editing banner and edit entrypoint from the chat UI so message composer stays in normal send mode
- Packaged `release/llem-3.1.7.vsix`.
- Added file-based diagnostics for stream debugging with per-request raw chunk capture and parsed token traces
- Logged final assistant text cleanup so empty replies can be traced from transport through final rendering
- Packaged `release/llem-3.1.6.vsix`.
- Improved stream parsing for object-shaped output chunks
- Fixed empty reply state when the model returned text in newer OpenAI-compatible stream formats
- Packaged `release/llem-3.1.5.vsix`.
- Fixed recurring empty replies by broadening stream parsing for additional LM Studio and Ollama response shapes
- Added raw stream preview logging when parsed output ends up empty so future payload mismatches are diagnosable instantly
- Packaged `release/llem-3.1.4.vsix`.
- Added model-aware performance presets for 26B-class local Ollama runs
- Added prompt budgeting and richer diagnostics for large local Gemma-family models
- Expanded the README with detailed performance profile guidance, 26B tuning notes, and diagnostics tips
- Packaged `release/llem-3.1.3.vsix`.
- Fixed empty-reply turns by hardening stream parsing for Ollama and LM Studio
- Flushed trailing stream buffers so the final token is not lost when a stream ends without a newline
- Saved assistant replies consistently into chat history so follow-up turns keep the right conversation context
- Updated the chat UI to distinguish truly empty replies from successful completed output
- Packaged `release/llem-3.1.2.vsix`.