feat: Vision Language Model (VLM) Support #56

@pescn

Summary

Add support for Vision Language Models (VLMs): users can send images as input for the model to analyze.

Scope

Supported

  • Image Input (Vision): Users send images for model analysis (GPT-4V, Claude Vision, etc.)

Not Supported (Future)

  • ❌ Image Generation (DALL-E)
  • ❌ Audio Input/Output
  • ❌ Video
  • ❌ Realtime/Omni

Requirements

Internal Format Extension

// New ImageContentBlock
interface ImageContentBlock {
  type: "image"
  source: {
    type: "base64" | "url"
    mediaType?: string  // "image/jpeg", "image/png", etc.
    data?: string       // base64 data
    url?: string        // image URL
  }
  detail?: "auto" | "low" | "high"  // OpenAI vision detail
}

// Update InternalContentBlock union
type InternalContentBlock =
  | TextContentBlock
  | ThinkingContentBlock
  | ToolUseContentBlock
  | ToolResultContentBlock
  | ImageContentBlock  // New
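Because `data`, `url`, and `mediaType` are all optional in the proposed interface, a block can type-check while still being unusable (for example, a `base64` source with no `data`). A minimal sketch of a runtime check follows; the helper name `validateImageBlock` and its error strings are hypothetical and not part of the proposal.

```typescript
// Same shape as the ImageContentBlock proposed in this issue.
interface ImageContentBlock {
  type: "image";
  source: {
    type: "base64" | "url";
    mediaType?: string;
    data?: string;
    url?: string;
  };
  detail?: "auto" | "low" | "high";
}

// Hypothetical helper (an assumption, not from the codebase):
// returns a list of problems, which is empty when the block is usable.
function validateImageBlock(block: ImageContentBlock): string[] {
  const problems: string[] = [];
  const { source } = block;
  if (source.type === "base64") {
    // Inline images need both the payload and its media type.
    if (!source.data) problems.push("base64 source requires data");
    if (!source.mediaType) problems.push("base64 source requires mediaType");
  } else if (!source.url) {
    // Remote images only need a URL.
    problems.push("url source requires url");
  }
  return problems;
}
```

An alternative design would model `source` as a discriminated union so the compiler enforces these rules; the check above keeps the interface exactly as proposed.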

Adapter Modifications

Request Adapters (Parse image input):

  • openai-chat.ts: Parse image_url content part
  • anthropic.ts: Parse image content block
  • openai-responses.ts: Parse input_image part
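As an illustration of the request-adapter side, the sketch below converts an OpenAI Chat Completions `image_url` content part into the proposed internal block. In that format, inline images arrive as `data:` URLs, so the parser splits them into `mediaType` and `data`. The function name `parseOpenAIChatImagePart` is hypothetical, and the real `openai-chat.ts` adapter may be structured differently.

```typescript
// Same shape as the ImageContentBlock proposed in this issue.
interface ImageContentBlock {
  type: "image";
  source: {
    type: "base64" | "url";
    mediaType?: string;
    data?: string;
    url?: string;
  };
  detail?: "auto" | "low" | "high";
}

// Image content part as sent to the OpenAI Chat Completions API.
interface OpenAIChatImagePart {
  type: "image_url";
  image_url: { url: string; detail?: "auto" | "low" | "high" };
}

// Hypothetical parser name (an assumption, not from the codebase).
function parseOpenAIChatImagePart(part: OpenAIChatImagePart): ImageContentBlock {
  const { url, detail } = part.image_url;
  // Inline images use the form data:<mediaType>;base64,<data>.
  const match = /^data:([^;,]+);base64,([\s\S]*)$/.exec(url);
  const block: ImageContentBlock = match
    ? { type: "image", source: { type: "base64", mediaType: match[1], data: match[2] } }
    : { type: "image", source: { type: "url", url } };
  // Only carry the detail hint over when the client supplied one.
  if (detail !== undefined) block.detail = detail;
  return block;
}
```

The Anthropic and Responses adapters would follow the same pattern with their own part shapes (`image` blocks with a `source` object, and `input_image` parts, respectively).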

Upstream Adapters (Send to Provider):

  • openai.ts: Build OpenAI Vision format
  • anthropic.ts: Build Anthropic Vision format
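The upstream direction can be sketched the same way: the internal block is turned back into each provider's wire format. The function names below are hypothetical. Note that the Anthropic image block has no counterpart to `detail`, so this sketch simply drops it; whether that is the desired behaviour is an open design choice.

```typescript
// Same shape as the ImageContentBlock proposed in this issue.
interface ImageContentBlock {
  type: "image";
  source: { type: "base64" | "url"; mediaType?: string; data?: string; url?: string };
  detail?: "auto" | "low" | "high";
}

interface OpenAIImagePart {
  type: "image_url";
  image_url: { url: string; detail?: "auto" | "low" | "high" };
}

type AnthropicImageBlock = {
  type: "image";
  source:
    | { type: "base64"; media_type: string; data: string }
    | { type: "url"; url: string };
};

// Hypothetical builder: OpenAI expects inline images as data URLs.
function toOpenAIImagePart(block: ImageContentBlock): OpenAIImagePart {
  const { source } = block;
  let url: string;
  if (source.type === "base64") {
    if (!source.data || !source.mediaType) throw new Error("base64 image needs data and mediaType");
    url = `data:${source.mediaType};base64,${source.data}`;
  } else {
    if (!source.url) throw new Error("url image needs url");
    url = source.url;
  }
  const part: OpenAIImagePart = { type: "image_url", image_url: { url } };
  if (block.detail) part.image_url.detail = block.detail;
  return part;
}

// Hypothetical builder: Anthropic uses snake_case media_type and has no detail field.
function toAnthropicImageBlock(block: ImageContentBlock): AnthropicImageBlock {
  const { source } = block;
  if (source.type === "base64") {
    if (!source.data || !source.mediaType) throw new Error("base64 image needs data and mediaType");
    return { type: "image", source: { type: "base64", media_type: source.mediaType, data: source.data } };
  }
  if (!source.url) throw new Error("url image needs url");
  return { type: "image", source: { type: "url", url: source.url } };
}
```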

Database Changes

ALTER TABLE models ADD COLUMN supports_vision BOOLEAN DEFAULT false;
ALTER TABLE models ADD COLUMN max_image_size INTEGER;  -- bytes
ALTER TABLE models ADD COLUMN max_images_per_request INTEGER;
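One way these columns might be used is a pre-flight check before a request is forwarded upstream. The sketch below is an assumption about how the limits would be enforced: the `ModelVisionLimits` shape, the camelCase field names, the treatment of `NULL` as "no limit", and the function name are all hypothetical.

```typescript
// Row shape mirroring the three proposed columns.
// camelCase names and null-as-unlimited are assumptions for this sketch.
interface ModelVisionLimits {
  supportsVision: boolean;
  maxImageSize: number | null;        // bytes, from max_image_size
  maxImagesPerRequest: number | null; // from max_images_per_request
}

// Hypothetical check: returns an error message, or null when the request is acceptable.
// imageSizes holds the decoded size in bytes of each image in the request.
function checkImageLimits(model: ModelVisionLimits, imageSizes: number[]): string | null {
  if (imageSizes.length === 0) return null; // text-only request, nothing to check
  if (!model.supportsVision) return "model does not support image input";
  if (model.maxImagesPerRequest !== null && imageSizes.length > model.maxImagesPerRequest) {
    return `too many images: ${imageSizes.length} > ${model.maxImagesPerRequest}`;
  }
  if (model.maxImageSize !== null) {
    const limit = model.maxImageSize;
    const index = imageSizes.findIndex((size) => size > limit);
    if (index !== -1) return `image ${index} exceeds ${limit} bytes`;
  }
  return null;
}
```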

Image Handling Considerations

  • Base64 encoding adds ~33% size overhead (4 encoded characters for every 3 bytes)
  • Large images: consider server-side compression or rejection
  • JSONB storage: decide whether to store the original image data
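The overhead figure follows from base64 mapping every 3 input bytes to 4 output characters. The sketch below shows the arithmetic in both directions, which is handy for comparing a base64 payload against a byte limit without decoding it. It assumes standard padded base64 with no whitespace or line breaks; the function names are illustrative.

```typescript
// Length in characters of the padded base64 encoding of byteCount bytes.
function base64EncodedLength(byteCount: number): number {
  return Math.ceil(byteCount / 3) * 4;
}

// Decoded size in bytes of a padded base64 string, computed without decoding.
// Assumes the input length is a multiple of 4 and contains no whitespace.
function base64DecodedSize(b64: string): number {
  let padding = 0;
  if (b64.slice(-2) === "==") padding = 2;
  else if (b64.slice(-1) === "=") padding = 1;
  return (b64.length / 4) * 3 - padding;
}
```

For example, a 3,000,000-byte image becomes 4,000,000 base64 characters, before any JSON framing is added.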

Metadata

Labels: enhancement (New feature or request)