feat: add DeepSeek-OCR support, C++ API updates, and dual-model loadi…#534
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe changes introduce end-to-end DeepSeek-OCR model support, adding a C API function for session creation, Go bindings, server-side dual-model initialization with OCR flag, request preprocessing for Base64-encoded image decoding, and a full DeepseekOCRSession implementation with model loading and streaming generation. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Server
participant Handler
participant OCRService
participant Model
Client->>Server: POST /chat/completions<br/>(Base64 image + text)
Server->>Handler: Route to chatCompletionsHandler
rect rgb(200, 220, 255)
Note over Handler: OCR Detection & Preprocessing
Handler->>Handler: Detect model name contains "OCR"
Handler->>Handler: preprocessRequestForOCR()<br/>- Decode Base64 image<br/>- Save to temp file<br/>- Inject file path into request
end
Handler->>OCRService: streamGenerate(preprocessed payload)
rect rgb(220, 200, 255)
Note over OCRService,Model: Image Processing & Generation
OCRService->>Model: Load image from path
OCRService->>Model: Extract text from request
OCRService->>Model: Build tokenized input + image tensor
OCRService->>Model: Generate with streaming callback
end
Model->>OCRService: Token callback (per token)
OCRService->>Handler: Output tokens
Handler->>Server: Stream response chunks
Server->>Client: SSE response
Note over Handler: Cleanup temp image file
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm-cli/cmd/mllm-server/main.go (1)
21-70: Current argument check blocks OCR‑only mode; relax--model-pathrequirementRight now the server exits if
--model-pathis empty, even when--ocr-model-pathis provided. That prevents running with only DeepSeek‑OCR, which the PR description claims should be supported.Consider changing the guard to require at least one of the two flags:
- if *modelPath == "" { - log.Fatal("FATAL: --model-path argument is required.") - } + if *modelPath == "" && *ocrModelPath == "" { + log.Fatal("FATAL: at least one of --model-path or --ocr-model-path is required.") + }The rest of the per‑model initialization blocks (
if *modelPath != "" { ... },if *ocrModelPath != "" { ... }) can stay as is and will then support LLM‑only, OCR‑only, or dual‑model setups.
🧹 Nitpick comments (6)
mllm/models/deepseek_ocr/modeling_deepseek_ocr.hpp (1)
658-659: Accessors are fine; consider makingeosTokenIdconstThe new accessors cleanly expose the cache and EOS id for the service layer and look correct. As a small polish,
eosTokenIdcan safely be markedconst:- inline int64_t eosTokenId() { return eos_token_id_; } + inline int64_t eosTokenId() const { return eos_token_id_; }mllm/models/deepseek_ocr/modeling_deepseek_ocr_service.hpp (2)
121-137: Clarify integer math fornum_queries_base/num_queries
num_queries_baseandnum_queriesare derived viastd::ceilon results of integer division:auto num_queries_base = std::ceil((base_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO); ... auto num_queries = std::ceil((image_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO);Because
base_size,PATCH_SIZE, andDOWN_SAMPLE_RATIOareint, the divisions are already truncated beforeceil, and the result is stored in adoublebut used in integer contexts (loop bounds,reserve).This works for the current constants but is harder to reason about if the values change. You can make the intent explicit and keep everything integral:
- auto num_queries_base = std::ceil((base_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO); + const int num_queries_base = + static_cast<int>(std::ceil(static_cast<float>(base_size) / PATCH_SIZE / DOWN_SAMPLE_RATIO)); ... - auto num_queries = std::ceil((image_size / PATCH_SIZE) / DOWN_SAMPLE_RATIO); + const int num_queries = + static_cast<int>(std::ceil(static_cast<float>(image_size) / PATCH_SIZE / DOWN_SAMPLE_RATIO));
53-61: Hard‑coded image constants duplicated from model‑side logicThe service uses hard‑coded values for
base_size,image_size,PATCH_SIZE,DOWN_SAMPLE_RATIO,IMAGE_TOKEN, andIMAGE_TOKEN_ID, mirroring the logic inDeepseekOCRForCausalLM::infer. That keeps behavior aligned today but creates a maintenance hazard if the model’s config or tokenizer changes.Consider centralizing these parameters (e.g., deriving them from
config_or a shared helper) so the service doesn’t need to be updated separately when model constants change.Also applies to: 121-139
mllm-cli/mllm/c.go (1)
76-91: Go binding for DeepSeek‑OCR session is consistent with existingNewSessionThe
NewDeepseekOCRSessionwrapper correctly mirrorsNewSession: it manages the C string, checksisOk, returns aSessionon success, and attaches a finalizer that callsfreeSession. This looks good.If you want to reduce duplication later, you could factor the common “create session + finalizer wiring” into a small helper used by both
NewSessionandNewDeepseekOCRSession, but that’s optional.mllm-cli/pkg/server/handlers.go (2)
18-35: Enhance validation and error handling for image format detection.The function has a few areas for improvement:
- Line 24: Defaults to
.pngfor unknown MIME types, which could be misleading.- Line 19-22: Doesn't validate that the URI starts with
data:image/.- Lines 25-29: Limited to three formats (JPEG, WebP, PNG).
Consider these improvements:
func decodeBase64Image(uri string) ([]byte, string, error) { + if !strings.HasPrefix(uri, "data:image/") { + return nil, "", fmt.Errorf("invalid data URI: must start with 'data:image/'") + } parts := strings.SplitN(uri, ",", 2) if len(parts) != 2 { return nil, "", fmt.Errorf("invalid base64 image data") } meta := parts[0] - ext := ".png" + ext := "" if strings.Contains(meta, "image/jpeg") { ext = ".jpg" } else if strings.Contains(meta, "image/webp") { ext = ".webp" + } else if strings.Contains(meta, "image/png") { + ext = ".png" + } else { + return nil, "", fmt.Errorf("unsupported image format in data URI") } data, err := base64.StdEncoding.DecodeString(parts[1]) if err != nil { return nil, "", fmt.Errorf("failed to decode base64: %v", err) } return data, ext, nil }
164-184: Consider more precise OCR model detection.Line 164 uses
strings.Contains(strings.ToUpper(modelName), "OCR")which could match unintended model names (e.g., "MACRO", "MICROCOSM").For more precise detection, consider:
- if strings.Contains(strings.ToUpper(modelName), "OCR") { + if strings.Contains(strings.ToLower(modelName), "deepseek-ocr") || + strings.HasSuffix(strings.ToLower(modelName), "-ocr") { log.Printf("[Handler] OCR model detected ('%s'). Checking for image data...", modelName)Alternatively, you could maintain a registry of OCR-capable models in the service layer and query that.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
docs/service/mllm_cli.rst(9 hunks)mllm-cli/cmd/mllm-server/main.go(2 hunks)mllm-cli/mllm/c.go(1 hunks)mllm-cli/pkg/server/handlers.go(5 hunks)mllm/c_api/Runtime.cpp(2 hunks)mllm/c_api/Runtime.h(1 hunks)mllm/models/deepseek_ocr/modeling_deepseek_ocr.hpp(1 hunks)mllm/models/deepseek_ocr/modeling_deepseek_ocr_service.hpp(1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/c_api/Runtime.cpp
[error] 108-108: initializing non-owner 'MllmSessionWrapper *' with a newly created 'gsl::owner<>'
(cppcoreguidelines-owning-memory,-warnings-as-errors)
mllm/models/deepseek_ocr/modeling_deepseek_ocr_service.hpp
[error] 3-3: 'filesystem' file not found
(clang-diagnostic-error)
[error] 20-20: constructor does not initialize these fields: model_, tokenizer_, config_
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 24-24: method 'fromPreTrain' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 26-26: variable 'root' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 27-27: variable 'config_file' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 28-28: variable 'model_file' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 29-29: variable 'tokenizer_file' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 46-46: method 'streamGenerate' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 46-46: 2 adjacent parameters of 'streamGenerate' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 57-57: invalid case style for variable 'PATCH_SIZE'
(readability-identifier-naming,-warnings-as-errors)
[error] 58-58: invalid case style for variable 'DOWN_SAMPLE_RATIO'
(readability-identifier-naming,-warnings-as-errors)
[error] 59-59: variable 'IMAGE_TOKEN' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 59-59: invalid case style for variable 'IMAGE_TOKEN'
(readability-identifier-naming,-warnings-as-errors)
[error] 60-60: variable 'IMAGE_TOKEN_ID' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 60-60: invalid case style for variable 'IMAGE_TOKEN_ID'
(readability-identifier-naming,-warnings-as-errors)
[error] 71-71: variable 'user_text' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 73-73: variable 'i' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 102-102: variable name 'p' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 141-141: variable 'final_prompt_text' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 155-155: variable '_ptr' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 155-155: invalid case style for variable '_ptr'
(readability-identifier-naming,-warnings-as-errors)
[error] 162-162: variable 'images_crop_tensor' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 163-163: if with identical then and else branches
(bugprone-branch-clone,-warnings-as-errors)
[error] 169-169: variable 'input' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 176-176: variable 'args' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 207-207: namespace 'mllm::models::deepseek_ocr' not terminated with a closing comment
(google-readability-namespace-comments,-warnings-as-errors)
🔇 Additional comments (10)
mllm/c_api/Runtime.h (1)
43-46: Implementation verified and properly exported; ready to mergeVerification confirms the implementation is complete:
- ✅
createDeepseekOCRSessionis implemented atRuntime.cpp:99with proper error handling (line 101) and exception handling (line 113)- ✅ Symbol is successfully exported and visible—cgo code at
mllm-cli/mllm/c.go:80calls it without linkage issues- ✅ Error conventions match the existing pattern (invalid arguments check, exception catch with message)
The code change is solid.
mllm/c_api/Runtime.cpp (2)
7-7: LGTM!The include is necessary for the new
createDeepseekOCRSessionfunction and follows the existing pattern.
99-116: LGTM! Implementation mirrors the existing pattern.The function correctly follows the established pattern from
createQwen3Session, with appropriate null checks, exception handling, and resource wrapping.Note on static analysis warning: The cppcoreguidelines-owning-memory warning at line 108 also applies to the existing
createQwen3Sessionat line 89. This is a pre-existing pattern in the codebase, not introduced by this PR.docs/service/mllm_cli.rst (4)
9-13: LGTM! Clear documentation of supported models.The section clearly identifies the supported LLM and OCR models, helping users understand the system's capabilities.
44-44: LGTM! Accurate API documentation.The documentation correctly describes the new C API function and its purpose.
55-69: LGTM! Comprehensive dual-model documentation.The section clearly explains the dual-model initialization, routing logic, and OCR preprocessing workflow, aligning well with the implementation.
319-326: LGTM! Clear server startup instructions.The documentation provides clear examples for both single-model (Qwen3 only) and dual-model (Qwen3 + OCR) server startup scenarios.
mllm-cli/pkg/server/handlers.go (3)
1-2: LGTM! Standard copyright header.
122-146: LGTM! Robust temporary file handling.The implementation properly:
- Creates temp files with appropriate extensions
- Handles write errors with immediate cleanup
- Returns absolute paths for consistency
- Provides a cleanup function for resource management
The cleanup function is correctly deferred in the handler at line 173, ensuring temporary files are removed after request processing.
233-235: LGTM! Safer flushing implementation.The type-asserted optional flushing is more robust than assuming the response writer always implements
http.Flusher. This prevents potential panics in environments where flushing isn't supported.Also applies to: 251-253
| On the build server, execute the build task. This task uses `tasks/build_android.yaml` to configure and run CMake. | ||
|
|
||
| Before executing this step, you also need to ensure that the hardcoded directories in build_android_debug.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier. |
There was a problem hiding this comment.
Fix inconsistent build script references.
Line 202 references build_android.yaml, but line 204 mentions build_android_debug.yaml. These should refer to the same file.
Apply this fix:
-Before executing this step, you also need to ensure that the hardcoded directories in build_android_debug.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier.
+Before executing this step, you also need to ensure that the hardcoded directories in build_android.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| On the build server, execute the build task. This task uses `tasks/build_android.yaml` to configure and run CMake. | |
| Before executing this step, you also need to ensure that the hardcoded directories in build_android_debug.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier. | |
| On the build server, execute the build task. This task uses `tasks/build_android.yaml` to configure and run CMake. | |
| Before executing this step, you also need to ensure that the hardcoded directories in build_android.yaml have been modified to match your requirements. The modification method is the same as for the Go compilation file mentioned earlier. |
🤖 Prompt for AI Agents
In docs/service/mllm_cli.rst around lines 202 to 204, the build script names are
inconsistent: line 202 mentions `build_android.yaml` while line 204 refers to
`build_android_debug.yaml`; update the text so both lines reference the same
file (preferably `build_android_debug.yaml` to match the hardcoded-directory
note) and ensure the phrasing consistently instructs the user to modify that
single file before running the build task.
| } else if images, ok := msg["images"].([]interface{}); ok && len(images) > 0 { | ||
| log.Println("[Handler] Found custom 'images' field.") | ||
| base64URI, ok := images[0].(string) | ||
| if !ok || !strings.HasPrefix(base64URI, "data:image") { | ||
| return false, func() {}, fmt.Errorf("image data in 'images' field is not a valid base64 URI") | ||
| } | ||
| contentArray = []interface{}{ | ||
| map[string]interface{}{"type": "text", "text": msg["content"].(string)}, | ||
| map[string]interface{}{"type": "image_url", "image_url": map[string]interface{}{"url": base64URI}}, | ||
| } | ||
| userMessage = msg | ||
| break |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add type safety when accessing message content.
Line 67 assumes msg["content"] is a string without validating this. If the content field is missing or has a different type when using the custom images field format, this could panic.
Apply this defensive fix:
} else if images, ok := msg["images"].([]interface{}); ok && len(images) > 0 {
log.Println("[Handler] Found custom 'images' field.")
base64URI, ok := images[0].(string)
if !ok || !strings.HasPrefix(base64URI, "data:image") {
return false, func() {}, fmt.Errorf("image data in 'images' field is not a valid base64 URI")
}
+ textContent, _ := msg["content"].(string)
+ if textContent == "" {
+ textContent = "What's in this image?"
+ }
contentArray = []interface{}{
- map[string]interface{}{"type": "text", "text": msg["content"].(string)},
+ map[string]interface{}{"type": "text", "text": textContent},
map[string]interface{}{"type": "image_url", "image_url": map[string]interface{}{"url": base64URI}},
}
userMessage = msg
break📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else if images, ok := msg["images"].([]interface{}); ok && len(images) > 0 { | |
| log.Println("[Handler] Found custom 'images' field.") | |
| base64URI, ok := images[0].(string) | |
| if !ok || !strings.HasPrefix(base64URI, "data:image") { | |
| return false, func() {}, fmt.Errorf("image data in 'images' field is not a valid base64 URI") | |
| } | |
| contentArray = []interface{}{ | |
| map[string]interface{}{"type": "text", "text": msg["content"].(string)}, | |
| map[string]interface{}{"type": "image_url", "image_url": map[string]interface{}{"url": base64URI}}, | |
| } | |
| userMessage = msg | |
| break | |
| } else if images, ok := msg["images"].([]interface{}); ok && len(images) > 0 { | |
| log.Println("[Handler] Found custom 'images' field.") | |
| base64URI, ok := images[0].(string) | |
| if !ok || !strings.HasPrefix(base64URI, "data:image") { | |
| return false, func() {}, fmt.Errorf("image data in 'images' field is not a valid base64 URI") | |
| } | |
| textContent, _ := msg["content"].(string) | |
| if textContent == "" { | |
| textContent = "What's in this image?" | |
| } | |
| contentArray = []interface{}{ | |
| map[string]interface{}{"type": "text", "text": textContent}, | |
| map[string]interface{}{"type": "image_url", "image_url": map[string]interface{}{"url": base64URI}}, | |
| } | |
| userMessage = msg | |
| break |
🤖 Prompt for AI Agents
mllm-cli/pkg/server/handlers.go around lines 60 to 71: the code assumes
msg["content"] is a string when building contentArray which can panic if the key
is missing or of a different type; validate msg["content"] with a type assertion
(e.g., s, ok := msg["content"].(string)), and if not ok return an error (or
handle gracefully) before constructing contentArray; replace the direct
msg["content"].(string) usage with the validated string variable so contentArray
is only built when content is present and a proper string.
| std::string user_text; | ||
| const auto& messages = request["messages"]; | ||
| for (int i = messages.size() - 1; i >= 0; --i) { | ||
| const auto& msg = messages[i]; | ||
| if (msg.value("role", "") == "user" || msg.value("role", "") == "<|User|>") { | ||
| if (msg.contains("content") && msg["content"].is_string()) { | ||
| user_text = msg["content"].get<std::string>(); | ||
| break; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Guard against empty or malformed messages when extracting user_text
The reverse scan for the last user message assumes request["messages"] is a non‑empty array. If messages is missing, not an array, or empty, messages.size() - 1 can underflow when converted to int, and user_text will stay default‑constructed without any explicit handling.
To make this more robust:
- std::string user_text;
- const auto& messages = request["messages"];
- for (int i = messages.size() - 1; i >= 0; --i) {
+ std::string user_text;
+ const auto& messages = request["messages"];
+ if (!messages.is_array() || messages.empty()) {
+ throw std::runtime_error("DeepSeek-OCR requires a non-empty 'messages' array.");
+ }
+ for (int i = static_cast<int>(messages.size()) - 1; i >= 0; --i) {
const auto& msg = messages[i];
if (msg.value("role", "") == "user" || msg.value("role", "") == "<|User|>") {
if (msg.contains("content") && msg["content"].is_string()) {
user_text = msg["content"].get<std::string>();
break;
}
}
}
+ // Optionally: handle the case where no user_text is found (log or throw).📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| std::string user_text; | |
| const auto& messages = request["messages"]; | |
| for (int i = messages.size() - 1; i >= 0; --i) { | |
| const auto& msg = messages[i]; | |
| if (msg.value("role", "") == "user" || msg.value("role", "") == "<|User|>") { | |
| if (msg.contains("content") && msg["content"].is_string()) { | |
| user_text = msg["content"].get<std::string>(); | |
| break; | |
| } | |
| } | |
| } | |
| std::string user_text; | |
| const auto& messages = request["messages"]; | |
| if (!messages.is_array() || messages.empty()) { | |
| throw std::runtime_error("DeepSeek-OCR requires a non-empty 'messages' array."); | |
| } | |
| for (int i = static_cast<int>(messages.size()) - 1; i >= 0; --i) { | |
| const auto& msg = messages[i]; | |
| if (msg.value("role", "") == "user" || msg.value("role", "") == "<|User|>") { | |
| if (msg.contains("content") && msg["content"].is_string()) { | |
| user_text = msg["content"].get<std::string>(); | |
| break; | |
| } | |
| } | |
| } | |
| // Optionally: handle the case where no user_text is found (log or throw). |
🧰 Tools
🪛 Clang (14.0.6)
[error] 71-71: variable 'user_text' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 73-73: variable 'i' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
Key Changes
1. C++ Core & API
DeepseekOCRmodel implementation (modeling_deepseek_ocr.hpp,modeling_deepseek_ocr_service.hpp).createDeepseekOCRSessionto the C API (Runtime.h,Runtime.cpp) to handle visual inputs specifically.2. Go Service (mllm-server)
--model-path: For standard LLM (e.g., Qwen3).--ocr-model-path: For DeepSeek-OCR.preprocessRequestForOCRinhandlers.goto detect Base64 encoded images in the request, decode them to temporary files, and pass file paths to the C++ backend.3. Documentation
docs/service/mllm_cli.rstwith architectural changes, new API descriptions, and usage guides for the OCR feature.Summary by CodeRabbit
New Features
Documentation