
feat: add DeepSeek-OCR support, C++ API updates, and dual-model loadi… #534

Merged
chenghuaWang merged 3 commits into UbiquitousLearning:v2 from yuerqiqi:feat/deepseek-ocr-support
Nov 23, 2025

Conversation


@yuerqiqi yuerqiqi commented Nov 19, 2025

Key Changes

1. C++ Core & API

  • Added DeepseekOCR model implementation (modeling_deepseek_ocr.hpp, modeling_deepseek_ocr_service.hpp).
  • Added createDeepseekOCRSession to the C API (Runtime.h, Runtime.cpp) to handle visual inputs specifically.

2. Go Service (mllm-server)

  • Dual Model Support: The server can now load an LLM and an OCR model simultaneously (or individually) using new flags:
    • --model-path: For the standard LLM (e.g., Qwen3).
    • --ocr-model-path: For DeepSeek-OCR.
  • OCR Preprocessing: Implemented preprocessRequestForOCR in handlers.go to detect Base64 encoded images in the request, decode them to temporary files, and pass file paths to the C++ backend.

3. Documentation

  • Updated docs/service/mllm_cli.rst with architectural changes, new API descriptions, and usage guides for the OCR feature.

Summary by CodeRabbit

  • New Features

    • Added DeepSeek-OCR model support with automatic image preprocessing
    • Server now supports simultaneous deployment of multiple models
    • Enhanced image input handling with Base64 decoding and file management
  • Documentation

    • Updated build and configuration instructions for OCR deployment
    • Added multi-model setup and usage guidelines
    • Clarified model-name matching requirements and configuration options


coderabbitai Bot commented Nov 19, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes introduce end-to-end DeepSeek-OCR model support, adding a C API function for session creation, Go bindings, server-side dual-model initialization with OCR flag, request preprocessing for Base64-encoded image decoding, and a full DeepseekOCRSession implementation with model loading and streaming generation.

Changes

Cohort / File(s) Summary
C API Layer
mllm/c_api/Runtime.h, mllm/c_api/Runtime.cpp
Adds createDeepseekOCRSession(const char* model_path) declaration and implementation; mirrors Qwen3 session creation with null validation, exception handling, and MllmSessionWrapper wrapping.
Go Bindings
mllm-cli/mllm/c.go
Implements NewDeepseekOCRSession(modelPath string) wrapper that converts Go strings to C, calls C API, and configures resource cleanup via finalizer.
Server Initialization
mllm-cli/cmd/mllm-server/main.go
Adds --ocr-model-path flag; conditionally creates Qwen3 and/or DeepSeek-OCR sessions based on provided flags; updates per-session registration and logging.
Request Handlers
mllm-cli/pkg/server/handlers.go
Introduces decodeBase64Image() and preprocessRequestForOCR() to handle image data URLs, write temporary files, and inject image paths into requests; detects OCR models by name; enforces image presence in OCR requests; adds type-asserted streaming flush handling.
Model Service Implementation
mllm/models/deepseek_ocr/modeling_deepseek_ocr_service.hpp
Implements DeepseekOCRSession with fromPreTrain() model loading and streamGenerate() that handles image loading, tokenization, spatial cropping, and streaming token generation.
Model Accessors
mllm/models/deepseek_ocr/modeling_deepseek_ocr.hpp
Adds inline accessor methods kvCache() and eosTokenId() to DeepseekOCRForCausalLM.
Documentation
docs/service/mllm_cli.rst
Updates usage instructions to cover dual-model startup, OCR request handling, client/server configuration alignment, build script references, and OCR feature limitations.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server
    participant Handler
    participant OCRService
    participant Model

    Client->>Server: POST /chat/completions<br/>(Base64 image + text)
    Server->>Handler: Route to chatCompletionsHandler
    
    rect rgb(200, 220, 255)
    Note over Handler: OCR Detection & Preprocessing
    Handler->>Handler: Detect model name contains "OCR"
    Handler->>Handler: preprocessRequestForOCR()<br/>- Decode Base64 image<br/>- Save to temp file<br/>- Inject file path into request
    end
    
    Handler->>OCRService: streamGenerate(preprocessed payload)
    
    rect rgb(220, 200, 255)
    Note over OCRService,Model: Image Processing & Generation
    OCRService->>Model: Load image from path
    OCRService->>Model: Extract text from request
    OCRService->>Model: Build tokenized input + image tensor
    OCRService->>Model: Generate with streaming callback
    end
    
    Model->>OCRService: Token callback (per token)
    OCRService->>Handler: Output tokens
    Handler->>Server: Stream response chunks
    Server->>Client: SSE response
    
    Note over Handler: Cleanup temp image file

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Image preprocessing logic (handlers.go): decodeBase64Image and preprocessRequestForOCR introduce non-trivial data transformation with multiple payload shape variants; requires validation of error handling and file I/O patterns.
  • Service implementation (modeling_deepseek_ocr_service.hpp): fromPreTrain model initialization and streamGenerate generation loop contain significant image handling, tokenization coordination, and tensor manipulation; needs careful review of image transformation setup and EOS token handling.
  • Cross-layer integration (Runtime.cpp, c.go, main.go): New C API, Go bindings, and server initialization require tracing session lifecycle and error propagation across language boundaries.
  • Conditional logic (main.go, handlers.go): Dual-model startup and OCR detection branches add complexity; verify flag validation, session isolation, and request routing correctness.
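
The flag-driven dual-model startup under review can be sketched as follows. Session creation is stubbed with log lines (the real server calls the C API through the Go bindings), and the at-least-one-flag guard reflects the relaxation suggested elsewhere in this review rather than the merged code:

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// validateModelFlags requires at least one model path, so LLM-only,
// OCR-only, and dual-model startups are all valid.
func validateModelFlags(modelPath, ocrModelPath string) error {
	if modelPath == "" && ocrModelPath == "" {
		return errors.New("at least one of --model-path or --ocr-model-path is required")
	}
	return nil
}

func main() {
	modelPath := flag.String("model-path", "", "path to the standard LLM (e.g. Qwen3)")
	ocrModelPath := flag.String("ocr-model-path", "", "path to the DeepSeek-OCR model")
	flag.Parse()

	if err := validateModelFlags(*modelPath, *ocrModelPath); err != nil {
		fmt.Fprintln(os.Stderr, "FATAL:", err)
		os.Exit(1)
	}
	// Each provided flag gets its own session (stubbed here).
	if *modelPath != "" {
		fmt.Println("registering LLM session:", *modelPath)
	}
	if *ocrModelPath != "" {
		fmt.Println("registering OCR session:", *ocrModelPath)
	}
}
```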

Suggested reviewers

  • yirongjie
  • chenghuaWang

Poem

🐰 A rabbit hops through images vast,
Decoding Base64 lines so fast!
From vision streams to tokens bright,
DeepSeek-OCR reads the light—
Preprocessed paths through files we bound,
Where text and sight together sound! 🎨✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main changes: adding DeepSeek-OCR support, updating the C++ API, and implementing dual-model loading capabilities across the codebase.


@yuerqiqi
Contributor Author

@coderabbitai review


coderabbitai Bot commented Nov 19, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mllm-cli/cmd/mllm-server/main.go (1)

21-70: Current argument check blocks OCR‑only mode; relax --model-path requirement

Right now the server exits if --model-path is empty, even when --ocr-model-path is provided. That prevents running with only DeepSeek‑OCR, which the PR description claims should be supported.

Consider changing the guard to require at least one of the two flags:

-    if *modelPath == "" {
-        log.Fatal("FATAL: --model-path argument is required.")
-    }
+    if *modelPath == "" && *ocrModelPath == "" {
+        log.Fatal("FATAL: at least one of --model-path or --ocr-model-path is required.")
+    }

The rest of the per‑model initialization blocks (if *modelPath != "" { ... }, if *ocrModelPath != "" { ... }) can stay as is and will then support LLM‑only, OCR‑only, or dual‑model setups.

🧹 Nitpick comments (6)
mllm/models/deepseek_ocr/modeling_deepseek_ocr.hpp (1)

658-659: Accessors are fine; consider making eosTokenId const

The new accessors cleanly expose the cache and EOS id for the service layer and look correct. As a small polish, eosTokenId can safely be marked const:

-  inline int64_t eosTokenId() { return eos_token_id_; }
+  inline int64_t eosTokenId() const { return eos_token_id_; }
mllm/models/deepseek_ocr/modeling_deepseek_ocr_service.hpp (2)

121-137: Clarify integer math for num_queries_base / num_queries

num_queries_base and num_queries are derived via std::ceil on results of integer division:

auto num_queries_base = std::ceil((base_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO);
...
auto num_queries = std::ceil((image_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO);

Because base_size, PATCH_SIZE, and DOWN_SAMPLE_RATIO are int, the divisions are already truncated before ceil, and the result is stored in a double but used in integer contexts (loop bounds, reserve).

This works for the current constants but is harder to reason about if the values change. You can make the intent explicit and keep everything integral:

-        auto num_queries_base = std::ceil((base_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO);
+        const int num_queries_base =
+            static_cast<int>(std::ceil(static_cast<float>(base_size) / PATCH_SIZE / DOWN_SAMPLE_RATIO));
...
-            auto num_queries = std::ceil((image_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO);
+            const int num_queries =
+                static_cast<int>(std::ceil(static_cast<float>(image_size) / PATCH_SIZE / DOWN_SAMPLE_RATIO));
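
The truncation pitfall generalizes beyond C++; a minimal Go illustration (with made-up sizes) shows how integer division discards the fraction before the ceil ever runs:

```go
package main

import (
	"fmt"
	"math"
)

// ceilTruncated mimics the original expression: both divisions happen in
// integer arithmetic, so the fractional part is gone before math.Ceil runs.
func ceilTruncated(baseSize, patchSize, ratio int) int {
	return int(math.Ceil(float64((baseSize / patchSize) / ratio)))
}

// ceilExact converts to floating point first, matching the suggested fix.
func ceilExact(baseSize, patchSize, ratio int) int {
	return int(math.Ceil(float64(baseSize) / float64(patchSize) / float64(ratio)))
}

func main() {
	// 1000/16 = 62.5, truncated to 62; 62/4 = 15.5, truncated to 15.
	// The exact value is 15.625, which should round up to 16.
	fmt.Println(ceilTruncated(1000, 16, 4), ceilExact(1000, 16, 4)) // 15 16
}
```

For the current constants the two agree, which is why the bug is latent rather than live.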

53-61: Hard‑coded image constants duplicated from model‑side logic

The service uses hard‑coded values for base_size, image_size, PATCH_SIZE, DOWN_SAMPLE_RATIO, IMAGE_TOKEN, and IMAGE_TOKEN_ID, mirroring the logic in DeepseekOCRForCausalLM::infer. That keeps behavior aligned today but creates a maintenance hazard if the model’s config or tokenizer changes.

Consider centralizing these parameters (e.g., deriving them from config_ or a shared helper) so the service doesn’t need to be updated separately when model constants change.

Also applies to: 121-139

mllm-cli/mllm/c.go (1)

76-91: Go binding for DeepSeek‑OCR session is consistent with existing NewSession

The NewDeepseekOCRSession wrapper correctly mirrors NewSession: it manages the C string, checks isOk, returns a Session on success, and attaches a finalizer that calls freeSession. This looks good.

If you want to reduce duplication later, you could factor the common “create session + finalizer wiring” into a small helper used by both NewSession and NewDeepseekOCRSession, but that’s optional.
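
That optional refactor could look like the sketch below. The create/free callbacks stand in for the C API calls, so this compiles and runs without cgo; names are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// Session wraps an opaque native handle (uintptr stands in for the C pointer).
type Session struct{ handle uintptr }

// wrapSession centralizes the validate-wrap-finalize wiring that NewSession
// and NewDeepseekOCRSession would share: check the handle, wrap it, and
// attach a finalizer that frees the native resource when the Go wrapper
// becomes unreachable.
func wrapSession(create func() uintptr, free func(uintptr)) (*Session, error) {
	h := create()
	if h == 0 {
		return nil, errors.New("failed to create session")
	}
	s := &Session{handle: h}
	runtime.SetFinalizer(s, func(s *Session) { free(s.handle) })
	return s, nil
}

func main() {
	s, err := wrapSession(func() uintptr { return 42 }, func(uintptr) {})
	if err != nil {
		panic(err)
	}
	fmt.Println("wrapped handle:", s.handle)
}
```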

mllm-cli/pkg/server/handlers.go (2)

18-35: Enhance validation and error handling for image format detection.

The function has a few areas for improvement:

  1. Line 24: Defaults to .png for unknown MIME types, which could be misleading.
  2. Lines 19-22: Doesn't validate that the URI starts with data:image/.
  3. Lines 25-29: Limited to three formats (JPEG, WebP, PNG).

Consider these improvements:

 func decodeBase64Image(uri string) ([]byte, string, error) {
+	if !strings.HasPrefix(uri, "data:image/") {
+		return nil, "", fmt.Errorf("invalid data URI: must start with 'data:image/'")
+	}
 	parts := strings.SplitN(uri, ",", 2)
 	if len(parts) != 2 {
 		return nil, "", fmt.Errorf("invalid base64 image data")
 	}
 	meta := parts[0]
-	ext := ".png"
+	ext := ""
 	if strings.Contains(meta, "image/jpeg") {
 		ext = ".jpg"
 	} else if strings.Contains(meta, "image/webp") {
 		ext = ".webp"
+	} else if strings.Contains(meta, "image/png") {
+		ext = ".png"
+	} else {
+		return nil, "", fmt.Errorf("unsupported image format in data URI")
 	}
 	data, err := base64.StdEncoding.DecodeString(parts[1])
 	if err != nil {
 		return nil, "", fmt.Errorf("failed to decode base64: %v", err)
 	}
 	return data, ext, nil
 }

164-184: Consider more precise OCR model detection.

Line 164 uses strings.Contains(strings.ToUpper(modelName), "OCR") which could match unintended model names (e.g., "MACRO", "MICROCOSM").

For more precise detection, consider:

-		if strings.Contains(strings.ToUpper(modelName), "OCR") {
+		if strings.Contains(strings.ToLower(modelName), "deepseek-ocr") || 
+		   strings.HasSuffix(strings.ToLower(modelName), "-ocr") {
 			log.Printf("[Handler] OCR model detected ('%s'). Checking for image data...", modelName)

Alternatively, you could maintain a registry of OCR-capable models in the service layer and query that.
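
That registry alternative might look like this sketch; the map entries are illustrative, not an actual list maintained by the project:

```go
package main

import (
	"fmt"
	"strings"
)

// ocrModels is an explicit set of OCR-capable model names, replacing the
// fragile substring match on "OCR".
var ocrModels = map[string]bool{
	"deepseek-ocr": true,
}

// isOCRModel normalizes the name and checks the registry, so unrelated
// names containing "ocr" (e.g. "MACRO") no longer match.
func isOCRModel(name string) bool {
	return ocrModels[strings.ToLower(name)]
}

func main() {
	for _, name := range []string{"DeepSeek-OCR", "MACRO", "Qwen3"} {
		fmt.Printf("%s -> OCR: %v\n", name, isOCRModel(name))
	}
}
```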

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8fe667 and 3271f28.

📒 Files selected for processing (8)
  • docs/service/mllm_cli.rst (9 hunks)
  • mllm-cli/cmd/mllm-server/main.go (2 hunks)
  • mllm-cli/mllm/c.go (1 hunks)
  • mllm-cli/pkg/server/handlers.go (5 hunks)
  • mllm/c_api/Runtime.cpp (2 hunks)
  • mllm/c_api/Runtime.h (1 hunks)
  • mllm/models/deepseek_ocr/modeling_deepseek_ocr.hpp (1 hunks)
  • mllm/models/deepseek_ocr/modeling_deepseek_ocr_service.hpp (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/c_api/Runtime.cpp

[error] 108-108: initializing non-owner 'MllmSessionWrapper *' with a newly created 'gsl::owner<>'

(cppcoreguidelines-owning-memory,-warnings-as-errors)

mllm/models/deepseek_ocr/modeling_deepseek_ocr_service.hpp

[error] 3-3: 'filesystem' file not found

(clang-diagnostic-error)


[error] 20-20: constructor does not initialize these fields: model_, tokenizer_, config_

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 24-24: method 'fromPreTrain' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)


[error] 26-26: variable 'root' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 27-27: variable 'config_file' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 28-28: variable 'model_file' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 29-29: variable 'tokenizer_file' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 46-46: method 'streamGenerate' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)


[error] 46-46: 2 adjacent parameters of 'streamGenerate' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 57-57: invalid case style for variable 'PATCH_SIZE'

(readability-identifier-naming,-warnings-as-errors)


[error] 58-58: invalid case style for variable 'DOWN_SAMPLE_RATIO'

(readability-identifier-naming,-warnings-as-errors)


[error] 59-59: variable 'IMAGE_TOKEN' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 59-59: invalid case style for variable 'IMAGE_TOKEN'

(readability-identifier-naming,-warnings-as-errors)


[error] 60-60: variable 'IMAGE_TOKEN_ID' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 60-60: invalid case style for variable 'IMAGE_TOKEN_ID'

(readability-identifier-naming,-warnings-as-errors)


[error] 71-71: variable 'user_text' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 73-73: variable 'i' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 102-102: variable name 'p' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 141-141: variable 'final_prompt_text' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 155-155: variable '_ptr' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 155-155: invalid case style for variable '_ptr'

(readability-identifier-naming,-warnings-as-errors)


[error] 162-162: variable 'images_crop_tensor' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 163-163: if with identical then and else branches

(bugprone-branch-clone,-warnings-as-errors)


[error] 169-169: variable 'input' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 176-176: variable 'args' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 207-207: namespace 'mllm::models::deepseek_ocr' not terminated with a closing comment

(google-readability-namespace-comments,-warnings-as-errors)

🔇 Additional comments (10)
mllm/c_api/Runtime.h (1)

43-46: Implementation verified and properly exported; ready to merge

Verification confirms the implementation is complete:

  • createDeepseekOCRSession is implemented at Runtime.cpp:99 with proper error handling (line 101) and exception handling (line 113)
  • ✅ Symbol is successfully exported and visible—cgo code at mllm-cli/mllm/c.go:80 calls it without linkage issues
  • ✅ Error conventions match the existing pattern (invalid arguments check, exception catch with message)

The code change is solid.

mllm/c_api/Runtime.cpp (2)

7-7: LGTM!

The include is necessary for the new createDeepseekOCRSession function and follows the existing pattern.


99-116: LGTM! Implementation mirrors the existing pattern.

The function correctly follows the established pattern from createQwen3Session, with appropriate null checks, exception handling, and resource wrapping.

Note on static analysis warning: The cppcoreguidelines-owning-memory warning at line 108 also applies to the existing createQwen3Session at line 89. This is a pre-existing pattern in the codebase, not introduced by this PR.

docs/service/mllm_cli.rst (4)

9-13: LGTM! Clear documentation of supported models.

The section clearly identifies the supported LLM and OCR models, helping users understand the system's capabilities.


44-44: LGTM! Accurate API documentation.

The documentation correctly describes the new C API function and its purpose.


55-69: LGTM! Comprehensive dual-model documentation.

The section clearly explains the dual-model initialization, routing logic, and OCR preprocessing workflow, aligning well with the implementation.


319-326: LGTM! Clear server startup instructions.

The documentation provides clear examples for both single-model (Qwen3 only) and dual-model (Qwen3 + OCR) server startup scenarios.

mllm-cli/pkg/server/handlers.go (3)

1-2: LGTM! Standard copyright header.


122-146: LGTM! Robust temporary file handling.

The implementation properly:

  • Creates temp files with appropriate extensions
  • Handles write errors with immediate cleanup
  • Returns absolute paths for consistency
  • Provides a cleanup function for resource management

The cleanup function is correctly deferred in the handler at line 173, ensuring temporary files are removed after request processing.


233-235: LGTM! Safer flushing implementation.

The type-asserted optional flushing is more robust than assuming the response writer always implements http.Flusher. This prevents potential panics in environments where flushing isn't supported.

Also applies to: 251-253

Comment thread: docs/service/mllm_cli.rst (outdated)
Comment on lines 202 to 204
On the build server, execute the build task. This task uses `tasks/build_android.yaml` to configure and run CMake.

Before executing this step, you also need to ensure that the hardcoded directories in build_android_debug.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier.

⚠️ Potential issue | 🟡 Minor

Fix inconsistent build script references.

Line 202 references build_android.yaml, but line 204 mentions build_android_debug.yaml. These should refer to the same file.

Apply this fix:

-Before executing this step, you also need to ensure that the hardcoded directories in build_android_debug.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier.
+Before executing this step, you also need to ensure that the hardcoded directories in build_android.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier.

Comment on lines +60 to +71
} else if images, ok := msg["images"].([]interface{}); ok && len(images) > 0 {
log.Println("[Handler] Found custom 'images' field.")
base64URI, ok := images[0].(string)
if !ok || !strings.HasPrefix(base64URI, "data:image") {
return false, func() {}, fmt.Errorf("image data in 'images' field is not a valid base64 URI")
}
contentArray = []interface{}{
map[string]interface{}{"type": "text", "text": msg["content"].(string)},
map[string]interface{}{"type": "image_url", "image_url": map[string]interface{}{"url": base64URI}},
}
userMessage = msg
break

🛠️ Refactor suggestion | 🟠 Major

Add type safety when accessing message content.

Line 67 assumes msg["content"] is a string without validating this. If the content field is missing or has a different type when using the custom images field format, this could panic.

Apply this defensive fix:

 			} else if images, ok := msg["images"].([]interface{}); ok && len(images) > 0 {
 				log.Println("[Handler] Found custom 'images' field.")
 				base64URI, ok := images[0].(string)
 				if !ok || !strings.HasPrefix(base64URI, "data:image") {
 					return false, func() {}, fmt.Errorf("image data in 'images' field is not a valid base64 URI")
 				}
+				textContent, _ := msg["content"].(string)
+				if textContent == "" {
+					textContent = "What's in this image?"
+				}
 				contentArray = []interface{}{
-					map[string]interface{}{"type": "text", "text": msg["content"].(string)},
+					map[string]interface{}{"type": "text", "text": textContent},
 					map[string]interface{}{"type": "image_url", "image_url": map[string]interface{}{"url": base64URI}},
 				}
 				userMessage = msg
 				break

Comment on lines +71 to +81
std::string user_text;
const auto& messages = request["messages"];
for (int i = messages.size() - 1; i >= 0; --i) {
const auto& msg = messages[i];
if (msg.value("role", "") == "user" || msg.value("role", "") == "<|User|>") {
if (msg.contains("content") && msg["content"].is_string()) {
user_text = msg["content"].get<std::string>();
break;
}
}
}

⚠️ Potential issue | 🟡 Minor

Guard against empty or malformed messages when extracting user_text

The reverse scan for the last user message assumes request["messages"] is a non‑empty array. If messages is missing, not an array, or empty, messages.size() - 1 can underflow when converted to int, and user_text will stay default‑constructed without any explicit handling.

To make this more robust:

-        std::string user_text;
-        const auto& messages = request["messages"];
-        for (int i = messages.size() - 1; i >= 0; --i) {
+        std::string user_text;
+        const auto& messages = request["messages"];
+        if (!messages.is_array() || messages.empty()) {
+            throw std::runtime_error("DeepSeek-OCR requires a non-empty 'messages' array.");
+        }
+        for (int i = static_cast<int>(messages.size()) - 1; i >= 0; --i) {
             const auto& msg = messages[i];
             if (msg.value("role", "") == "user" || msg.value("role", "") == "<|User|>") {
                 if (msg.contains("content") && msg["content"].is_string()) {
                     user_text = msg["content"].get<std::string>();
                     break; 
                 }
             }
         }
+        // Optionally: handle the case where no user_text is found (log or throw).

@chenghuaWang chenghuaWang merged commit 9d081db into UbiquitousLearning:v2 Nov 23, 2025
4 checks passed
@yuerqiqi yuerqiqi deleted the feat/deepseek-ocr-support branch January 14, 2026 13:43