Problem
Mainstream DeepSeek models (e.g., deepseek-v4-pro) do not natively support multimodal image input. Offloading image recognition to a dedicated, fast vision-capable LLM allows the main agent to stay focused on reasoning and coding, without being distracted by visual parsing work.
Proposed solution
1. Clipboard image paste (Ctrl+V)
Pressing Ctrl+V in the composer reads the system clipboard. If it contains an image (RGBA bitmap), the TUI encodes it as PNG, persists it to ~/.deepseek/clipboard-images/clipboard-{timestamp}.png, and inserts a text reference into the input buffer:
[image:/home/user/.deepseek/clipboard-images/clipboard-1715030400123456789.png]
A status hint is displayed, e.g. Attached image: 1024x768 PNG (235KB).
This works independently of vision_model — the image is saved to disk and referenced by path so the main model or any sub-agent can read it via standard file tools.
2. Dedicated vision model ([vision_model])
Users can configure a standalone [vision_model] in config.toml. Once enabled via the vision_model feature flag, two built-in vision tools become available:
vision_analyze — reads an image file from disk, base64-encodes it, sends it to the configured vision model for analysis
vision_ocr — delegates to vision_analyze with an OCR-specific prompt to extract text from images
The vision model runs with fully independent session state, isolated from the main model's context window.
Configure in config.toml
[features]
vision_model = true
[vision_model]
model = ""gemini-3.1-flash-lite-preview" # vision_model
provider = "openai" # optional
api_key = "..."
base_url = "https://generativelanguage.googleapis.com/v1beta/openai/"
End-to-end flow
Ctrl+V (clipboard has image)
→ PNG saved to ~/.deepseek/clipboard-images/
→ [image:path] inserted into composer
→ user types prompt: "what does this screenshot show?"
→ DeepSeek main model decides to call vision_analyze
→ tool reads image file, base64-encodes it
→ independent vision session sends OpenAI-compatible request to Gemini
→ analysis result flows back → main model → user
Supported image formats
- Clipboard paste: any RGBA bitmap on the system clipboard
- File analysis (vision_analyze / vision_ocr): png, jpg/jpeg, gif, webp, bmp, svg
Additional context
Input Image
Input Prompt
Output Result

Problem
Mainstream DeepSeek models (e.g.,
deepseek-v4-pro) do not natively support multimodal image input. Offloading image recognition to a dedicated, fast vision-capable LLM allows the main agent to stay focused on reasoning and coding, without being distracted by visual parsing work.Proposed solution
1. Clipboard image paste (
Ctrl+V)Pressing
Ctrl+Vin the composer reads the system clipboard. If it contains an image (RGBA bitmap), the TUI encodes it as PNG, persists it to~/.deepseek/clipboard-images/clipboard-{timestamp}.png, and inserts a text reference into the input buffer:[image:/home/user/.deepseek/clipboard-images/clipboard-1715030400123456789.png]A status hint is displayed, e.g.
Attached image: 1024x768 PNG (235KB).This works independently of
vision_model— the image is saved to disk and referenced by path so the main model or any sub-agent can read it via standard file tools.2. Dedicated vision model (
[vision_model])Users can configure a standalone [vision_model] in config.toml. Once enabled via the vision_model feature flag, two built-in vision tools become available:
vision_analyze— reads an image file from disk, base64-encodes it, sends it to the configured vision model for analysisvision_ocr— delegates tovision_analyzewith an OCR-specific prompt to extract text from imagesThe vision model runs with fully independent session state, isolated from the main model's context window.
Configure in config.toml
End-to-end flow
Ctrl+V (clipboard has image)
→ PNG saved to ~/.deepseek/clipboard-images/
→ [image:path] inserted into composer
→ user types prompt: "what does this screenshot show?"
→ DeepSeek main model decides to call vision_analyze
→ tool reads image file, base64-encodes it
→ independent vision session sends OpenAI-compatible request to Gemini
→ analysis result flows back → main model → user
Supported image formats
Additional context
Input Image
Input Prompt
Output Result