Skip to content

Add vision_model registration & vision tools for image input support #868

@MMMarcinho

Description

@MMMarcinho

Problem

Mainstream DeepSeek models (e.g., deepseek-v4-pro) do not natively support multimodal image input. Offloading image recognition to a dedicated, fast vision-capable LLM allows the main agent to stay focused on reasoning and coding, without being distracted by visual parsing work.

Proposed solution

1. Clipboard image paste (Ctrl+V)

Pressing Ctrl+V in the composer reads the system clipboard. If it contains an image (RGBA bitmap), the TUI encodes it as PNG, persists it to ~/.deepseek/clipboard-images/clipboard-{timestamp}.png, and inserts a text reference into the input buffer:

[image:/home/user/.deepseek/clipboard-images/clipboard-1715030400123456789.png]

A status hint is displayed, e.g. Attached image: 1024x768 PNG (235KB).

This works independently of vision_model — the image is saved to disk and referenced by path so the main model or any sub-agent can read it via standard file tools.

2. Dedicated vision model ([vision_model])

Users can configure a standalone [vision_model] in config.toml. Once enabled via the vision_model feature flag, two built-in vision tools become available:

  • vision_analyze — reads an image file from disk, base64-encodes it, sends it to the configured vision model for analysis
  • vision_ocr — delegates to vision_analyze with an OCR-specific prompt to extract text from images

The vision model runs with fully independent session state, isolated from the main model's context window.

Configure in config.toml

[features]
  vision_model = true 

[vision_model]
  model = ""gemini-3.1-flash-lite-preview"          # vision_model
  provider = "openai"        # optional
  api_key = "..."
  base_url = "https://generativelanguage.googleapis.com/v1beta/openai/"

End-to-end flow

Ctrl+V (clipboard has image)
→ PNG saved to ~/.deepseek/clipboard-images/
→ [image:path] inserted into composer
→ user types prompt: "what does this screenshot show?"
→ DeepSeek main model decides to call vision_analyze
→ tool reads image file, base64-encodes it
→ independent vision session sends OpenAI-compatible request to Gemini
→ analysis result flows back → main model → user

Supported image formats

  • Clipboard paste: any RGBA bitmap on the system clipboard
  • File analysis (vision_analyze / vision_ocr): png, jpg/jpeg, gif, webp, bmp, svg

Additional context

Input Image

Image

Input Prompt

Image

Output Result

Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestquestionFurther information is requested

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions