feat: Add Support of Qwen2.5Omni model (without talker) #610

Closed

KKkai0315 wants to merge 14 commits into UbiquitousLearning:main from KKkai0315:main

Conversation

@KKkai0315 (Contributor) commented Jan 23, 2026

  1. Add support for the Qwen2.5-Omni model (without the talker).
  2. It currently accepts text, audio, and image inputs and outputs the text modality.
  3. The talker, which will add audio output, is still under development.

Summary by CodeRabbit

  • New Features

    • Full Qwen2.5-Omni multimodal support (text, vision, audio) with model config, tokenizer, and audio preprocessing.
    • Interactive example CLI apps for text, image, and audio inference.
    • Python/API: new quantized embedding, concat observer/fixed QDQ, and expanded dtype/device constructors.
  • Chores

    • Installer/export and build updates for multiple backends and examples; added Android build workflows.
  • Performance/Infrastructure

    • New CPU fill utilities and dispatch paths to improve low-level data ops and efficiency.



coderabbitai Bot commented Jan 23, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Adds full Qwen2.5-Omni multimodal support: build entries and CLI examples, tokenizer with vision/audio message handling, audio preprocessing, comprehensive multimodal model headers and configuration, many backend/kernel fill utilities, QNN backend hardening, PyMllm FFI/type/device exports, and several PyTorch quantization/observer utilities.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Examples / Build**<br>`examples/CMakeLists.txt`, `examples/qwen2_5omni/CMakeLists.txt` | Registers qwen2_5omni and adds three example executables (text/image/audio runners). |
| **Example CLIs**<br>`examples/qwen2_5omni/text_infer.cpp`, `.../image_infer.cpp`, `.../audio_infer.cpp` | New interactive CLI apps for text/image/audio inference (argument parsing, model loading, streaming output). |
| **Model Config**<br>`src/mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp`, `examples/qwen2_5omni/config_qwen2_5omni_7B.json` | New Qwen2_5OmniConfig struct and full JSON config describing multimodal model/thinker/talker settings. |
| **Tokenizer / Messages**<br>`src/mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp` | New Qwen2_5OmniTokenizer, message structs (text/vision/audio), BPE and regex tokenization, vision/audio sequence composition. |
| **Audio Preprocessing**<br>`src/mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp` | Mel-spectrogram utilities, Hann/window/filterbank, MelSpectrogramFeatures module, Qwen2_5OmniAudioPreprocessor and helpers. |
| **Modeling / Multimodal**<br>`src/mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` | Large multimodal model declarations: RoPE helpers, visual/audio rotary embeddings, vision/audio encoders, decoder/thinker, and top-level ForCausalLM interfaces. |
| **Build / Packaging**<br>`CMakeLists.txt`, `mllm/backends/qnn/CMakeLists.txt` | flatbuffers target set POSITION_INDEPENDENT_CODE and installed/exported; QNN backend target installation added. |
| **CMake / Warnings**<br>`mllm/CMakeLists.txt` | Added compiler warning suppression (-Wno-comma-subscript) handling for Clang/GCC. |
| **CPU Kernels: fill utilities**<br>`mllm/backends/cpu/kernels/common/fill-inl.hpp`, `.../kernel_dispatch.cpp`, `.../kernel_dispatch.hpp`, `mllm/backends/cpu/ops/FillOp.cpp` | New vectorized fill zeros/ones/value/arange/random implementations and dynamic-dispatch wrappers; x86 fallback handling for fp16; kernel dispatch/headers updated. |
| **QNN Backend hardening**<br>`mllm/backends/qnn/QNNBackend.cpp`, `.../QNNBackend.hpp`, `.../Register.cpp`, `.../QnnWrappersAPI.cpp` | More defensive symbol/context loading, logging, error checks; default QNN log level changed to VERBOSE; context reuse guard. |
| **QNN AOT / PTQ changes**<br>`mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp`, `.../PTQPass.cpp`, `.../visitor/RMSNorm.cpp` | Quantization recipe/embedding weight handling updated, AsymPerTensor/SymPerTensor checks, added recursive validation helpers and bias quant spec changes. |
| **Embedding / Ops**<br>`mllm/core/aops/EmbeddingOp.cpp` | EmbeddingOp::reshape now derives output dtype from weight dtype (handles uint16). |
| **FFI / Python exports**<br>`mllm/ffi/Extension.cc`, `pymllm/ffi/__init__.py`, `pymllm/__init__.py` | New DType factory bindings (int*/uint*/bool), global DType instances, device globals (cpu/cuda/qnn), and expanded pymllm exports. |
| **PyTorch quant & qualcomm modules**<br>`pymllm/backends/qualcomm/transformers/core/embedding.py`, `.../observer.py`, `.../qdq.py`, `.../rms_norm.py`, `.../qlinear.py` | Added QEmbedding quantized embedding class, ConcatObserver, FixedActivationQDQ, observer eps refinements, small formatting tweaks. |
| **Qwen3 quant wiring**<br>`pymllm/backends/qualcomm/transformers/qwen3/*` | Integrates FixedActivationQDQ/ConcatObserver/QEmbedding into qwen3 modeling and runner, updates rotate_half/RoPE integration and quantizer toggles. |
| **Model file writer**<br>`pymllm/convertor/model_file_v2.py` | Added helper to extract torch.Tensor bytes (_torch_tensor_bytes) and switched write paths to it. |
| **Misc / Tasks / Gitignore**<br>`.gitignore`, `tasks/*.yaml`, small cpp cleanup `PromptProcessor.cpp` | Added mllm-install-* ignore, new Android QNN AOT build task, build-type tweak for x86 QNN AOT, and trivial file cleanup. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as Example CLI
    participant AudioPrep as AudioPreprocessor
    participant Tokenizer
    participant Model as Qwen2_5OmniForCausalLM
    participant Output

    User->>CLI: supply audio file + prompt
    CLI->>AudioPrep: load WAV, extract Mel features
    AudioPrep-->>CLI: audio features + lengths
    CLI->>Tokenizer: convert prompt + audio features -> token seq + modality metadata
    Tokenizer-->>CLI: token ids + audio/image tensors
    CLI->>Model: forward(tokens, audio_features, modality metadata)
    Model->>Model: encode audio/vision, decode text (kv cache)
    Model-->>CLI: streamed token logits
    CLI->>Tokenizer: detokenize logits -> text
    CLI-->>Output: stream text to stdout
```
```mermaid
sequenceDiagram
    participant WAV as Audio File
    participant Reader as WAV Reader
    participant STFT as STFT/Hann
    participant Mel as Mel Filterbank
    participant Norm as Log/Normalize
    participant Features

    WAV->>Reader: read samples (resample if needed)
    Reader->>STFT: frame & window (Hann), compute FFT
    STFT->>Mel: apply power, project via Mel filterbank
    Mel->>Norm: log-compress, clip, normalize
    Norm-->>Features: final Mel spectrogram tensor
```
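
For readers who want the pipeline above in concrete terms, here is a minimal NumPy sketch of the same steps. The helper names (`mel_filterbank`, `log_mel_spectrogram`), the FFT size/hop/mel count, and the Whisper-style log/clip/normalize constants are illustrative assumptions, not necessarily the values this PR uses:

```python
import numpy as np

def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters spaced evenly on the (HTK) mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(samples: np.ndarray, sr: int = 16000,
                        n_fft: int = 400, hop: int = 160, n_mels: int = 128) -> np.ndarray:
    """Frame -> Hann window -> FFT -> power -> mel projection -> log/clip/normalize."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop  # assumes len(samples) >= n_fft
    frames = np.stack([samples[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    log_mel = np.log10(np.maximum(mel, 1e-10))
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)  # clip dynamic range
    return (log_mel + 4.0) / 4.0                        # Whisper-convention normalization
```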

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • yirongjie
  • chenghuaWang
  • oreomaker

Poem

🐰 I hopped through headers, tokens, and sound,
I stitched Mel spectrums where signals abound.
With tokens and images all snug in a stream,
Qwen2.5-Omni now dances — a multimodal dream! 🥕🎶🖼️

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete and does not follow the provided template. It lacks structured sections like motivation, changes made, testing, and references to contribution guidelines. | Expand the description to follow the template structure, including detailed motivation, comprehensive list of changes, testing approach, and any breaking changes or dependencies. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.75%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding support for the Qwen2.5Omni model without the talker component. |




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 51-59: The CLI currently ignores stdin by hardcoding audio_path
and prompt_text; revert to reading user input using std::getline for audio_path
and prompt_text (restore the commented lines) instead of the absolute path and
hardcoded prompt, and re-add checks to exit on "exit"/"quit" for audio_path and
to set a default prompt (e.g., "Please describe the audio.") when prompt_text is
empty; update the variables audio_path and prompt_text in
examples/qwen2_5omni/audio_infer.cpp accordingly and remove the local absolute
assignment.
- Around line 24-29: The code currently treats any model_version other than "v2"
as kV1 leading to silent misloads; update the validation around the
model_version.get() check (where file_version and mllm::ModelFileVersion are
set) to explicitly accept only "v1" or "v2" and otherwise print a clear error
(including the invalid value) and terminate (e.g., std::cerr + std::exit(1) or
throw) so unsupported values do not fall back silently to kV1.

In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 49-59: The code contains debug artifacts: hard-coded assignments
to image_path and prompt_text and commented-out interactive input; restore
interactive CLI by removing the hard-coded assignments and re-enabling the
std::getline calls for image_path and prompt_text, keep the existing prompt
prints and the exit/quit check, and ensure you trim or validate the read lines
if necessary (look for variables image_path, prompt_text and the std::getline
usage in this block).
- Around line 31-35: Move the help handling so -h/--help bypasses
required-argument validation: call Argparse::printHelp() and
mllm::shutdownContext() when help.isSet() before running the code path that
enforces required args (i.e., before or immediately after Argparse::parse() but
before any required-argument validation), or adjust Argparse to treat help as a
special case; locate the current help check (help.isSet()), the parser
invocation (Argparse::parse()), and the help output (Argparse::printHelp()) to
implement this change.

In `@examples/qwen2_5omni/text_infer.cpp`:
- Around line 24-29: The current logic silently maps any model_version other
than "v2" to kV1; update the branch that sets file_version (variable
file_version of type mllm::ModelFileVersion) to explicitly validate
model_version.get(): accept "v1" -> mllm::ModelFileVersion::kV1, "v2" ->
mllm::ModelFileVersion::kV2, and for any other value print a clear error
(including the invalid value) and exit/throw instead of falling back, so callers
of text_infer (where model_version is used) cannot accidentally load wrong
weights.

In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 197-206: Add a validation inside the loop where spatial merging is
computed (the block that reads dims and computes t,h,w, spatial_merge_size,
num_h_blocks, num_w_blocks, total_blocks, block_area) to ensure h %
spatial_merge_size == 0 and w % spatial_merge_size == 0; if not, raise/abort
with a clear error (or use an assert) describing invalid spatial_merge_size vs
H/W so rotary position alignment cannot silently miscompute. Ensure the check
references the same symbols (h, w, spatial_merge_size,
num_h_blocks/num_w_blocks) and runs before computing
num_h_blocks/num_w_blocks/total_blocks.
- Around line 568-575: The code assumes seq_len is divisible by
spatial_merge_unit_ before calling hidden_states.view({seq_len /
spatial_merge_unit_, spatial_merge_unit_, -1}), so add a guard that checks
seq_len % spatial_merge_unit_ == 0 and handle the non-divisible case (e.g.,
trim/pad hidden_states or raise a clear error) before the view call; update the
block around seq_len, spatial_merge_unit_, and hidden_states.view to perform the
check and decide whether to reshape, pad, or throw, ensuring any error message
references seq_len and spatial_merge_unit_ for easier debugging.
- Around line 95-105: The function makeWindowIndex must validate window_size,
spatial_merge_size, and patch_size before computing vit_merger_window_size and
spatial_merge_unit: check spatial_merge_size and patch_size are non-zero and
that window_size is divisible by (spatial_merge_size * patch_size) (or that
window_size / spatial_merge_size / patch_size produces the intended integer) and
fail-fast (MLLM_RT_ASSERT or throw) with a clear message referencing grid_thw
and the offending parameter(s); additionally validate spatial_merge_size > 0
before computing spatial_merge_unit to avoid division/multiplication errors and
ensure window_index construction only proceeds when these guard conditions pass.
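
Taken together, the three checks requested above reduce to one fail-fast pattern; a minimal sketch of the idea (Python pseudocode for the C++ asserts, with illustrative names):

```python
def check_divisible(value: int, divisor: int, what: str) -> None:
    """Fail fast with a descriptive message instead of silently truncating."""
    if divisor <= 0:
        raise ValueError(f"{what}: divisor must be positive, got {divisor}")
    if value % divisor != 0:
        raise ValueError(f"{what}: {value} is not divisible by {divisor}")

# Before computing merge blocks / reshaping (names are illustrative):
# check_divisible(h, spatial_merge_size, "vision grid height")
# check_divisible(w, spatial_merge_size, "vision grid width")
# check_divisible(seq_len, spatial_merge_unit, "hidden_states seq_len")
# check_divisible(window_size, spatial_merge_size * patch_size, "window size")
```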

In `@mllm/models/qwen2_5omni/python_src_code/audio_process.py`:
- Around line 71-73: Replace the assert check that calls
_check_if_video_has_audio(path) with an explicit runtime exception: if
use_audio_in_video is True and _check_if_video_has_audio(path) returns False,
raise a ValueError with the same message ("Video must has audio track when
use_audio_in_video=True"); this mirrors the existing validation style used
around lines 65 and 81 and ensures the check isn't skipped under Python -O.
Reference the existing helper _check_if_video_has_audio and the local variable
path (and the use_audio_in_video flag) when making the change.
- Around line 11-16: The _check_if_video_has_audio function currently opens a
PyAV container with av.open(video_path) and never closes it, leaking file
descriptors; update it to use a context manager (with av.open(video_path) as
container:) or ensure container.close() in a finally block so the container is
always closed even on exceptions, keeping the existing logic that inspects
container.streams for audio streams.
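
A minimal sketch of the fixed helper, assuming PyAV's standard container API (`av.open` supports the context-manager protocol and exposes `container.streams.audio`):

```python
import av

def _check_if_video_has_audio(video_path: str) -> bool:
    # The context manager closes the container even if an exception is raised.
    with av.open(video_path) as container:
        return len(container.streams.audio) > 0
```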

In `@mllm/models/qwen2_5omni/python_src_code/LICENSE`:
- Around line 1-201: The LICENSE file name violates the project's allowed
extensions; rename the file named "LICENSE" to include a permitted extension
(e.g., "LICENSE.txt" or "LICENSE.md") so it matches coding guidelines; update
any references to "LICENSE" in packaging or docs to point to the new filename to
avoid broken links.

In `@mllm/models/qwen2_5omni/python_src_code/name.py`:
- Around line 27-32: The loop currently calls f.get_tensor(key) and computes
unused variables i, shape, and dtype, which loads full tensors into memory;
replace this by using safetensors metadata access (e.g., f.get_metadata or
f.metadata[key]) to read shape and dtype without materializing the tensor, or if
you only need the key, remove the get_tensor call and the unused variables
entirely; update the loop that iterates over keys to either log the metadata via
the safetensors metadata API or simply print the key without calling
f.get_tensor(key).

In `@mllm/models/qwen2_5omni/python_src_code/README.md`:
- Around line 740-749: The README shows audio output code for model.generate and
uses processor.batch_decode and sf.write, but the PR is text-only; update the
examples to explicitly disable audio (call model.generate(...,
return_audio=False or remove use_audio_in_video) and only process text via
processor.batch_decode) and remove or guard any sf.write/audio handling; also
add a clear "audio/talker not supported yet" note near the generation examples
(including the other audio example blocks) so users aren’t misled.

In `@mllm/models/qwen2_5omni/python_src_code/vision_process.py`:
- Around line 106-111: Add a timeout to the HTTP fetch to avoid hangs: update
the requests.get call in the image URL branch (the block that checks
image.startswith("http://") or "https://") to pass a timeout (e.g.,
timeout=REQUEST_TIMEOUT or timeout=10) — preferably use a module-level constant
like REQUEST_TIMEOUT so it’s configurable — and keep using the response context
manager, response.raise_for_status(), and the BytesIO/Image.open flow unchanged;
also ensure any callers/handlers will surface
requests.exceptions.ReadTimeout/ConnectTimeout if needed.
🟡 Minor comments (13)
mllm/models/qwen2_5omni/python_src_code/name.py-1-5 (1)

1-5: Filename does not match docstring.

The file is named name.py but the docstring references print_safetensors_keys.py. Consider renaming the file to print_safetensors_keys.py for consistency and clarity.

mllm/models/qwen2_5omni/python_src_code/README.md-760-763 (1)

760-763: Use descriptive link text instead of “here.”
markdownlint MD059 prefers descriptive anchor text for accessibility.

mllm/models/qwen2_5omni/python_src_code/README.md-672-681 (1)

672-681: Specify languages on fenced code blocks.
markdownlint MD040 flags missing languages on the install/error and system-prompt snippets.

✏️ Proposed update
-```
+```bash
 pip uninstall transformers
 pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
 pip install accelerate
-```
+```

-```
+```text
 KeyError: 'qwen2_5_omni'
-```
+```

-```
+```json
 {
     "role": "system",
     "content": [
         {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
     ],
 }
-```
+```

Also applies to: 875-882

mllm/models/qwen2_5omni/python_src_code/audio_process.py-35-37 (1)

35-37: Guard against empty input and invalid time ranges.
conversations[0] will raise on empty input, and negative durations can be produced when *_end < *_start. Consider explicit validation. As per coding guidelines, validate inputs for public APIs and critical internal functions.

🛠️ Proposed fix
-    audios = []
-    if isinstance(conversations[0], dict):
+    if not conversations:
+        return None
+    audios = []
+    if isinstance(conversations[0], dict):
         conversations = [conversations]
@@
-                        audio_start = ele.get("audio_start", 0.0)
-                        audio_end = ele.get("audio_end", None)
+                        audio_start = ele.get("audio_start", 0.0)
+                        audio_end = ele.get("audio_end", None)
+                        if audio_end is not None and audio_end < audio_start:
+                            raise ValueError("audio_end must be >= audio_start")
@@
-                        audio_start = ele.get("video_start", 0.0)
-                        audio_end = ele.get("video_end", None)
+                        audio_start = ele.get("video_start", 0.0)
+                        audio_end = ele.get("video_end", None)
+                        if audio_end is not None and audio_end < audio_start:
+                            raise ValueError("video_end must be >= video_start")

Also applies to: 46-47, 69-70, 88-90

mllm/models/qwen2_5omni/python_src_code/README.md-656-665 (1)

656-665: Replace the hard tab in the Text → Text table.
markdownlint MD010 flags a hard tab here; replace it with spaces to avoid rendering quirks.

mllm/models/qwen2_5omni/python_src_code/README.md-23-25 (1)

23-25: Add alt text to images for accessibility.
The HTML image tags here don’t include alt text (MD045). Please add descriptive alt attributes.

Also applies to: 41-43, 49-50

mllm/models/qwen2_5omni/python_src_code/README.md-492-529 (1)

492-529: Add blank lines around Markdown tables.
markdownlint MD058 requires blank lines before/after tables in these sections.

Also applies to: 532-541, 653-668, 766-775

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp-119-132 (1)

119-132: Default constructor leaves critical members uninitialized.

The default constructor MelSpectrogramFeatures() = default leaves stft_, window_, and melscale_fbanks_ in an indeterminate state. If forward() is called on a default-constructed instance, it will cause undefined behavior. Consider either removing the default constructor or initializing members to safe defaults.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp-942-952 (1)

942-952: Handle unsupported attention dtypes explicitly.
If key_states isn’t float32/float16, attn stays unset. Add an error branch to avoid undefined behavior.

⚠️ Add explicit dtype handling
     if (key_states.dtype() == kFloat32) {
       attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
       attn = mask_(attn);
       attn = softmax_(attn);
     } else if (key_states.dtype() == kFloat16) {
       attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
       attn = mask_(attn);
       attn = softmax_(attn);
       attn = attn.to(kFloat16);
+    } else {
+      MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported attention dtype for Qwen2.5-Omni.");
     }
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-290-292 (1)

290-292: Docstring parameter name mismatch.

The docstring references t_ntoken_per_chunk but the actual parameter is tokens_per_chunk.

📝 Fix docstring
         Parameters:
             token_indices (`np.ndarray`): A monotonically increasing list of token index values.
-            t_ntoken_per_chunk (`int`): Number of tokens per chunk (used as the chunk size threshold).
+            tokens_per_chunk (`int`): Number of tokens per chunk (used as the chunk size threshold).
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-369-385 (1)

369-385: Typos in docstring and error message.

  • Line 370: Inion should be Union
  • Line 384-385: Error message missing closing backtick after "audio"
📝 Fix typos
         Returns:
-            `list[Inion[str, np.ndarray]]`: The decoded text or generated audio.
+            `list[Union[str, np.ndarray]]`: The decoded text or generated audio.
         """
         if generation_mode is None or generation_mode == "text":
             return self.post_process_image_text_to_text(
                 generated_outputs, skip_special_tokens=skip_special_tokens, **kwargs
             )

         elif generation_mode == "audio":
             # model supports only bs=1, so we will never get several audio outputs
             audio = generated_outputs[1].reshape(-1).detach().cpu().numpy()
             return [audio]

         else:
             raise ValueError(
-                f"{self.__class__.__name__} got an unexpected generation_mode={generation_mode}. Supported options are only `text` and `audio"
+                f"{self.__class__.__name__} got an unexpected generation_mode={generation_mode}. Supported options are only `text` and `audio`."
             )
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-387-402 (1)

387-402: Missing null checks for optional processors.

The property accesses model_input_names on feature_extractor, image_processor, and video_processor, all of which can be None per the constructor signature. This will raise AttributeError if any processor wasn't provided.

🐛 Suggested defensive approach
     `@property`
     def model_input_names(self):
         tokenizer_input_names = self.tokenizer.model_input_names
-        feature_extractor_input_names = self.feature_extractor.model_input_names
-        image_processor_input_names = self.image_processor.model_input_names
-        video_processor_input_names = self.video_processor.model_input_names
+        feature_extractor_input_names = self.feature_extractor.model_input_names if self.feature_extractor else []
+        image_processor_input_names = self.image_processor.model_input_names if self.image_processor else []
+        video_processor_input_names = self.video_processor.model_input_names if self.video_processor else []
         return list(
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-92-102 (1)

92-102: Potential AttributeError if tokenizer is None.

The constructor accesses self.tokenizer.image_token and other attributes unconditionally, but tokenizer=None is the default parameter. If instantiated without a tokenizer, this will raise AttributeError.

🐛 Suggested defensive check
     def __init__(
         self, image_processor=None, video_processor=None, feature_extractor=None, tokenizer=None, chat_template=None
     ):
         super().__init__(image_processor, video_processor, feature_extractor, tokenizer, chat_template=chat_template)
+        if tokenizer is None:
+            raise ValueError("tokenizer is required for Qwen2_5OmniProcessor")
         self.image_token = self.tokenizer.image_token
🧹 Nitpick comments (17)
mllm/models/qwen2_5omni/python_src_code/name.py (3)

10-15: Missing PyTorch dependency in error message.

The script uses framework="pt" which requires PyTorch. If PyTorch is missing, users will get a confusing error later. Consider mentioning both dependencies.

💡 Suggested improvement
 try:
     from safetensors import safe_open
 except ImportError:
-    print("请先安装 safetensors 库:")
-    print("  pip install safetensors")
+    print("请先安装 safetensors 和 torch 库:")
+    print("  pip install safetensors torch")
     sys.exit(1)

61-65: Redundant file opening; refactor to avoid opening each file twice.

Each file is opened in print_keys_from_file() and then again to count keys. This doubles I/O operations. Have print_keys_from_file() return the key count instead.

♻️ Proposed refactor

Update print_keys_from_file to return the key count:

-def print_keys_from_file(filepath: Path):
+def print_keys_from_file(filepath: Path) -> int:
     """打印单个 safetensors 文件的所有键名"""
     ...
-            print(f"{key}")
+            print(f"{key}")
+    return len(keys)

Then simplify the loop in main:

         total_keys = 0
         for filepath in files:
-            print_keys_from_file(filepath)
-            with safe_open(filepath, framework="pt") as f:
-                total_keys += len(f.keys())
+            total_keys += print_keys_from_file(filepath)

53-53: Ambiguous fullwidth comma in comment.

The comment contains a fullwidth comma (U+FF0C) instead of ASCII comma ,. Per static analysis (RUF003).

💡 Suggested fix
-        # 目录,查找所有 safetensors 文件
+        # 目录, 查找所有 safetensors 文件
mllm/models/qwen2_5omni/python_src_code/configuration.json (1)

1-1: Consider improving readability and documentation.

The single-line JSON format is valid but could be improved for maintainability:

  • Format with proper indentation for better readability
  • Add a separate schema file or documentation explaining each configuration option
  • Document what "all-to-all" task type means in this context
📝 Suggested formatting improvement
-{"framework": "pytorch", "task": "all-to-all", "allow_remote": true}
+{
+  "framework": "pytorch",
+  "task": "all-to-all",
+  "allow_remote": true
+}

Additionally, consider adding a README.md or inline comments (if the format supports it) documenting the purpose of each field.

mllm/models/qwen2_5omni/python_src_code/preprocessor_config.json (1)

2-29: Add validation for derived audio fields to avoid config drift.

n_samples and nb_max_frames appear derivable from chunk_length, sampling_rate, and hop_length. Consider validating these relationships in the loader (or deriving them) to prevent silent inconsistencies if one value changes.
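
A loader-side check could look like this minimal sketch, assuming the usual Whisper-style relationships n_samples = chunk_length × sampling_rate and nb_max_frames = n_samples // hop_length (verify against the actual preprocessor before adopting):

```python
def validate_derived_audio_fields(cfg: dict) -> None:
    """Raise if derived audio fields drift from their source values (assumed relationships)."""
    expected_samples = cfg["chunk_length"] * cfg["sampling_rate"]
    if cfg["n_samples"] != expected_samples:
        raise ValueError(
            f"n_samples={cfg['n_samples']} but chunk_length*sampling_rate={expected_samples}"
        )
    expected_frames = cfg["n_samples"] // cfg["hop_length"]
    if cfg["nb_max_frames"] != expected_frames:
        raise ValueError(
            f"nb_max_frames={cfg['nb_max_frames']} but n_samples//hop_length={expected_frames}"
        )
```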

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp (1)

43-55: Potential portability issue with M_PI constant.

M_PI is not part of the C++ standard and may not be defined on all platforms (particularly MSVC without _USE_MATH_DEFINES). Consider using a locally defined constant for cross-platform compatibility.

Suggested fix for portability
 inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
+  constexpr float kPi = 3.14159265358979323846f;
   int32_t length = periodic ? window_length + 1 : window_length;
   auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
   float* window_ptr = window.ptr<float>();

   for (int32_t i = 0; i < window_length; ++i) {
     float n = static_cast<float>(i);
     float denominator = periodic ? static_cast<float>(length) : static_cast<float>(length - 1);
-    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denominator);
+    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * kPi * n / denominator);
   }

   return window;
 }
mllm/models/qwen2_5omni/python_src_code/vision_process.py (1)

401-405: Consider more specific exception handling for video reader fallback.

Catching bare Exception can mask unexpected errors beyond video reading issues. While the fallback behavior is reasonable, consider catching more specific exceptions (e.g., IOError, RuntimeError) to avoid hiding unrelated bugs.

Suggested improvement
         try:
             video, sample_fps = VIDEO_READER_BACKENDS[video_reader_backend](ele)
-        except Exception as e:
+        except (IOError, RuntimeError, ImportError, KeyError) as e:
             logger.warning(f"video_reader_backend {video_reader_backend} error, use torchvision as default, msg: {e}")
             video, sample_fps = VIDEO_READER_BACKENDS["torchvision"](ele)
mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp (2)

20-117: Regex tokenization pattern is comprehensive but complex.

The qwen2_5OmniTokenizerMatchPattern function implements a multi-stage pattern matcher handling contractions, letters, digits, punctuation, and whitespace. The logic follows the Qwen2/Qwen2-VL tokenizer patterns. Consider adding inline comments explaining each matching stage for maintainability.


189-193: Consider removing or formalizing the debug comment.

The comment on line 190 (//interestingly, the answer went bad when setting max_pixels higher...) appears to be a development note. If this is a known limitation, consider documenting it properly in the class documentation or as a TODO with a tracking issue.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (3)

24-33: Add brief API docs for new public helpers/classes.
This header exposes many new public helpers/modules without doc comments. Please add concise Doxygen-style docs for purpose/params/returns/errors so the public surface is self-explanatory.

As per coding guidelines, please add brief public API documentation.


584-595: Consider avoiding full seq_len² mask for window attention.
This allocates O(n²) memory/time; for large images it can be heavy. A block-sparse mask or per-window attention would reduce memory pressure.

As per coding guidelines, consider reducing O(n²) memory/time where feasible.


1233-1247: Avoid quadratic scans for next vision/audio tokens.
Each loop rescans the tail to find next markers, making this O(S²) on long sequences. Precompute marker indices or scan once with moving pointers.

As per coding guidelines, consider a linear-time scan to keep long sequences efficient.
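
For illustration, a single-pass index precompute (a Python sketch of the idea; the actual code is C++):

```python
from collections import defaultdict

def marker_positions(token_ids, marker_ids):
    """One O(S) pass collecting the indices of each marker token."""
    markers = set(marker_ids)
    positions = defaultdict(list)
    for i, tok in enumerate(token_ids):
        if tok in markers:
            positions[tok].append(i)
    return positions  # consume with per-marker cursors instead of rescanning the tail
```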

examples/qwen2_5omni/image_infer.cpp (1)

24-29: Missing validation for invalid model version.

If model_version is neither "v1" nor "v2", the code silently defaults to kV1. Consider warning or erroring on unrecognized versions.

♻️ Suggested improvement
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print(stderr, "Warning: Unknown model version '{}', defaulting to v1\n", model_version.get());
   }
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py (4)

48-70: Consider annotating _defaults with ClassVar.

Ruff flags mutable class attributes without ClassVar annotation. While this pattern is common in HuggingFace processing utilities, adding the annotation improves type safety.

♻️ Optional fix
+from typing import ClassVar

 class Qwen2_5OmniProcessorKwargs(ProcessingKwargs, total=False):
     videos_kwargs: Qwen2_5_OmniVideosKwargs

-    _defaults = {
+    _defaults: ClassVar[dict] = {
         "text_kwargs": {

149-162: Audio length calculation could benefit from a comment.

Lines 158-159 contain a non-obvious formula: (input_lengths - 2) // 2 + 1 and (audio_inputs["feature_attention_mask"].sum(-1) - 1) // 2 + 1. Consider adding a brief comment explaining the transformation (e.g., relating to convolution stride or feature extraction parameters).
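
As a hedged aside, both expressions match the standard Conv1d output-length rule for particular kernel/stride/padding choices (the extractor's exact parameters aren't confirmed here):

```python
def conv1d_out_len(length: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Standard Conv1d output-length rule: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

# (L - 2) // 2 + 1 matches kernel=2, stride=2, padding=0:
assert conv1d_out_len(100, kernel=2, stride=2) == (100 - 2) // 2 + 1
# (L - 1) // 2 + 1 matches kernel=3, stride=2, padding=1:
assert conv1d_out_len(100, kernel=3, stride=2, padding=1) == (100 - 1) // 2 + 1
```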


224-229: Redundant sorting of positions.

positions is sorted twice: first via sorted() in line 228, then explicitly in line 229. The key function in line 229 sorts by x[0] (start position), which is the same as the default sort behavior of tuples. Remove one of the sorts.

♻️ Proposed fix
         for sample in text:
             positions = []
             special_tokens = [re.escape(tok) for tok in [self.audio_token, self.image_token, self.video_token]]
             pattern = "|".join(special_tokens)
-            positions = sorted([(match.start(), match.group()) for match in re.finditer(pattern, sample)])
-            positions.sort(key=lambda x: x[0])
+            positions = [(match.start(), match.group()) for match in re.finditer(pattern, sample)]
+            positions.sort(key=lambda x: x[0])

312-331: Misleading variable name is_batched.

The logic is inverted: is_batched = True is set when the input is a single conversation (dict), not when it's batched. The variable name is confusing even though the code works correctly.

♻️ Suggested clarification
     def apply_chat_template(self, conversations, chat_template=None, **kwargs):
-        is_batched = False
+        was_single = False
         if isinstance(conversations[0], dict):
             conversations = [conversations]
-            is_batched = True
+            was_single = True

         for conversation in conversations:
             # ... validation ...
-        if is_batched:
+        if was_single:
             conversations = conversations[0]

         return super().apply_chat_template(conversations, chat_template, **kwargs)

Comment on lines +24 to +29
```cpp
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
```

⚠️ Potential issue | 🟠 Major

Reject unsupported model_version values.
Any value other than "v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 24 - 29, The code
currently treats any model_version other than "v2" as kV1 leading to silent
misloads; update the validation around the model_version.get() check (where
file_version and mllm::ModelFileVersion are set) to explicitly accept only "v1"
or "v2" and otherwise print a clear error (including the invalid value) and
terminate (e.g., std::cerr + std::exit(1) or throw) so unsupported values do not
fall back silently to kV1.

Comment thread examples/qwen2_5omni/audio_infer.cpp Outdated
Comment on lines +51 to +59
fmt::print("Audio path (or 'exit/quit'): ");
//std::getline(std::cin, audio_path);
//if (audio_path == "exit" || audio_path == "quit") { return 0; }
audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";

fmt::print("Prompt text: ");
//std::getline(std::cin, prompt_text);
//if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
prompt_text = "复述这段音频";

⚠️ Potential issue | 🟠 Major

Remove hardcoded audio path/prompt.
The CLI ignores user input and uses a local absolute path, which breaks usability for everyone else. Restore stdin input and keep a default prompt if empty.

🎤 Restore interactive input
     fmt::print("Audio path (or 'exit/quit'): ");
-    //std::getline(std::cin, audio_path);
-    //if (audio_path == "exit" || audio_path == "quit") { return 0; }
-    audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";
+    std::getline(std::cin, audio_path);
+    if (audio_path == "exit" || audio_path == "quit") { return 0; }
 
     fmt::print("Prompt text: ");
-    //std::getline(std::cin, prompt_text);
-    //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-    prompt_text = "复述这段音频";
+    std::getline(std::cin, prompt_text);
+    if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-    fmt::print("Audio path (or 'exit/quit'): ");
-    //std::getline(std::cin, audio_path);
-    //if (audio_path == "exit" || audio_path == "quit") { return 0; }
-    audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";
-
-    fmt::print("Prompt text: ");
-    //std::getline(std::cin, prompt_text);
-    //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-    prompt_text = "复述这段音频";
+    fmt::print("Audio path (or 'exit/quit'): ");
+    std::getline(std::cin, audio_path);
+    if (audio_path == "exit" || audio_path == "quit") { return 0; }
+
+    fmt::print("Prompt text: ");
+    std::getline(std::cin, prompt_text);
+    if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
```
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 51 - 59, The CLI currently
ignores stdin by hardcoding audio_path and prompt_text; revert to reading user
input using std::getline for audio_path and prompt_text (restore the commented
lines) instead of the absolute path and hardcoded prompt, and re-add checks to
exit on "exit"/"quit" for audio_path and to set a default prompt (e.g., "Please
describe the audio.") when prompt_text is empty; update the variables audio_path
and prompt_text in examples/qwen2_5omni/audio_infer.cpp accordingly and remove
the local absolute assignment.

Comment on lines +31 to +35
```cpp
  if (help.isSet()) {
    Argparse::printHelp();
    mllm::shutdownContext();
    return 0;
  }
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n examples/qwen2_5omni/image_infer.cpp | head -40

Repository: UbiquitousLearning/mllm

Length of output: 1898


🏁 Script executed:

rg -n "required|help.*isSet|isSet.*help" examples/qwen2_5omni/image_infer.cpp -B2 -A2

Repository: UbiquitousLearning/mllm

Length of output: 737


🏁 Script executed:

# Check how other examples handle help flags with required arguments
rg -n "help.*isSet|isSet.*help" examples/ --type cpp -B3 -A3

Repository: UbiquitousLearning/mllm

Length of output: 5873


🏁 Script executed:

# Find the Argparse implementation
fd -t f "argparse" --type f -e hpp -e cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 90


🏁 Script executed:

# Search for Argparse class definition
rg -n "class Argparse" --type cpp --type hpp

Repository: UbiquitousLearning/mllm

Length of output: 94


🏁 Script executed:

# Read the Argparse implementation
cat mllm/utils/Argparse.hpp | head -200

Repository: UbiquitousLearning/mllm

Length of output: 5404


🏁 Script executed:

# Search for parse method in Argparse
rg -n "void parse|bool parse" mllm/utils/Argparse.hpp -A 10

Repository: UbiquitousLearning/mllm

Length of output: 1508


🏁 Script executed:

# Get the rest of the parse method to see the requirements checking
cat mllm/utils/Argparse.hpp | sed -n '149,220p'

Repository: UbiquitousLearning/mllm

Length of output: 2351


Help flag check occurs after required argument validation, preventing -h from working.

The -h|--help flag is checked at line 31, but Argparse::parse() validates all required arguments (model_path, model_version, tokenizer_path, config_path) at the end of parsing. Running with just -h will fail during parse() before help.isSet() is evaluated, so help will never be displayed. Move the help check before or immediately after Argparse::parse(), or modify the Argparse class to handle help as a special case that bypasses requirement validation.

🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 31 - 35, Move the help
handling so -h/--help bypasses required-argument validation: call
Argparse::printHelp() and mllm::shutdownContext() when help.isSet() before
running the code path that enforces required args (i.e., before or immediately
after Argparse::parse() but before any required-argument validation), or adjust
Argparse to treat help as a special case; locate the current help check
(help.isSet()), the parser invocation (Argparse::parse()), and the help output
(Argparse::printHelp()) to implement this change.

Comment on lines +49 to +59
```cpp
    std::string image_path;
    std::string prompt_text;

    fmt::print("Image path (or 'exit/quit'): ");
    image_path = "../../../mllm2-former/mllm/rsc/pics.jpg";
    //std::getline(std::cin, image_path);
    if (image_path == "exit" || image_path == "quit") { return 0; }

    fmt::print("Prompt text: ");
    prompt_text = "描述图片中物体";
    //std::getline(std::cin, prompt_text);
```

⚠️ Potential issue | 🟠 Major

Remove debug artifacts: hard-coded paths and commented-out interactive code.

The image path and prompt text are hard-coded with std::getline calls commented out. This breaks the interactive CLI functionality described in the UI prompts.

🐛 Proposed fix to restore interactive input
     std::string image_path;
     std::string prompt_text;

     fmt::print("Image path (or 'exit/quit'): ");
-    image_path = "../../../mllm2-former/mllm/rsc/pics.jpg";
-    //std::getline(std::cin, image_path);
+    std::getline(std::cin, image_path);
     if (image_path == "exit" || image_path == "quit") { return 0; }

     fmt::print("Prompt text: ");
-    prompt_text = "描述图片中物体";
-    //std::getline(std::cin, prompt_text);
+    std::getline(std::cin, prompt_text);
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 49 - 59, The code contains
debug artifacts: hard-coded assignments to image_path and prompt_text and
commented-out interactive input; restore interactive CLI by removing the
hard-coded assignments and re-enabling the std::getline calls for image_path and
prompt_text, keep the existing prompt prints and the exit/quit check, and ensure
you trim or validate the read lines if necessary (look for variables image_path,
prompt_text and the std::getline usage in this block).

Comment on lines +24 to +29
```cpp
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
```

⚠️ Potential issue | 🟠 Major

Reject unsupported model_version values.
Any value other than "v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/text_infer.cpp` around lines 24 - 29, The current logic
silently maps any model_version other than "v2" to kV1; update the branch that
sets file_version (variable file_version of type mllm::ModelFileVersion) to
explicitly validate model_version.get(): accept "v1" ->
mllm::ModelFileVersion::kV1, "v2" -> mllm::ModelFileVersion::kV2, and for any
other value print a clear error (including the invalid value) and exit/throw
instead of falling back, so callers of text_infer (where model_version is used)
cannot accidentally load wrong weights.

Comment on lines +71 to +73
```python
                        assert _check_if_video_has_audio(
                            path
                        ), "Video must has audio track when use_audio_in_video=True"
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "audio_process.py" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 125


🏁 Script executed:

cat -n mllm/models/qwen2_5omni/python_src_code/audio_process.py | sed -n '60,85p'

Repository: UbiquitousLearning/mllm

Length of output: 1595


Replace assert with a runtime exception.
Assertions are stripped when Python runs with the -O optimization flag, causing this validation to silently fail in production. Since surrounding code (lines 65, 81) uses ValueError for similar input validation failures, this should be consistent.

🛠️ Proposed fix
-                        assert _check_if_video_has_audio(
-                            path
-                        ), "Video must has audio track when use_audio_in_video=True"
+                        if not _check_if_video_has_audio(path):
+                            raise ValueError("Video must have an audio track when use_audio_in_video=True")
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/audio_process.py` around lines 71 -
73, Replace the assert check that calls _check_if_video_has_audio(path) with an
explicit runtime exception: if use_audio_in_video is True and
_check_if_video_has_audio(path) returns False, raise a ValueError with the same
message ("Video must has audio track when use_audio_in_video=True"); this
mirrors the existing validation style used around lines 65 and 81 and ensures
the check isn't skipped under Python -O. Reference the existing helper
_check_if_video_has_audio and the local variable path (and the
use_audio_in_video flag) when making the change.

Comment on lines +1 to +201
[Quoted: the full text of the Apache License, Version 2.0 (with a "Copyright 2025 Alibaba Cloud" appendix notice), as contained in the LICENSE file under review; omitted here for brevity.]

⚠️ Potential issue | 🔴 Critical

File extension required by coding guidelines.

The filename LICENSE lacks an extension, which violates the coding guidelines stating: "Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml."

Rename the file to LICENSE.txt or LICENSE.md to comply with the guidelines.

Based on coding guidelines.

🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/LICENSE` around lines 1 - 201, the
LICENSE file name violates the project's allowed extensions; rename the file
named "LICENSE" to include a permitted extension (e.g., "LICENSE.txt" or
"LICENSE.md") so it matches coding guidelines; update any references to
"LICENSE" in packaging or docs to point to the new filename to avoid broken
links.

Comment thread on mllm/models/qwen2_5omni/python_src_code/name.py (outdated)
Comment thread on mllm/models/qwen2_5omni/python_src_code/README.md (outdated)
Comment on lines +106 to +111
elif image.startswith("http://") or image.startswith("https://"):
    # fix memory leak issue while using BytesIO
    with requests.get(image, stream=True) as response:
        response.raise_for_status()
        with BytesIO(response.content) as bio:
            image_obj = copy.deepcopy(Image.open(bio))

⚠️ Potential issue | 🟠 Major

Missing timeout in HTTP request can cause indefinite hangs.

The requests.get() call lacks a timeout parameter, which can cause the application to hang indefinitely if the server is unresponsive. This is particularly problematic in production environments.

Proposed fix
-    with requests.get(image, stream=True) as response:
+    with requests.get(image, stream=True, timeout=30) as response:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif image.startswith("http://") or image.startswith("https://"):
    # fix memory leak issue while using BytesIO
    with requests.get(image, stream=True) as response:
        response.raise_for_status()
        with BytesIO(response.content) as bio:
            image_obj = copy.deepcopy(Image.open(bio))
elif image.startswith("http://") or image.startswith("https://"):
    # fix memory leak issue while using BytesIO
    with requests.get(image, stream=True, timeout=30) as response:
        response.raise_for_status()
        with BytesIO(response.content) as bio:
            image_obj = copy.deepcopy(Image.open(bio))
🧰 Tools
🪛 Ruff (0.14.13)

108-108: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/vision_process.py` around lines 106 -
111, add a timeout to the HTTP fetch to avoid hangs: update the requests.get
call in the image URL branch (the block that checks image.startswith("http://")
or "https://") to pass a timeout (e.g., timeout=REQUEST_TIMEOUT or timeout=10) —
preferably use a module-level constant like REQUEST_TIMEOUT so it’s configurable
— and keep using the response context manager, response.raise_for_status(), and
the BytesIO/Image.open flow unchanged; also ensure any callers/handlers will
surface requests.exceptions.ReadTimeout/ConnectTimeout if needed.

chenghuaWang and others added 9 commits January 23, 2026 15:43
…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
… debug print statements from Qwen3DecoderLayer
…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
…d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
…es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
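
One of the commit messages above describes new CPU fill kernels (zeros, ones, specific values, arange, random) added behind a dynamic-dispatch layer in fill-inl.hpp. As a rough illustration of the pattern only — the names and signatures below are illustrative, not mllm's actual fill-inl.hpp API:

#include <cstddef>

// Illustrative scalar reference for an arange-style fill; a real fill-inl.hpp
// would pair this with SIMD specializations selected by the kernel dispatch.
template <typename T>
void fill_arange(T* dst, std::size_t n, T start, T step) {
  T value = start;
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = value;
    value += step;
  }
}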

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around lines 16-35: The help flag is never reachable because Argparse::parse()
enforces required options first; before calling Argparse::parse(argc, argv)
detect help early (e.g., scan argv for "-h" or "--help") and if present call
Argparse::printHelp() and mllm::shutdownContext() then return 0, or
alternatively use an Argparse::parse variant/option that skips required-arg
validation for help; update the code around Argparse::parse, help.isSet(),
Argparse::printHelp, and mllm::shutdownContext to ensure help is handled before
required-arg validation.

In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json`:
- Around lines 5-7: The config enables unsupported talker components despite the
PR stating "without talker": set "enable_audio_output" and "enable_talker" to
false (or add an explicit comment stating they are placeholders) in the JSON for
the qwen2_5_omni model to prevent runtime attempts to initialize talker/audio
subsystems; update any documentation or inline note near the "model_type":
"qwen2_5_omni" entry to clarify that talker is not yet supported if you choose
the placeholder route.

In `@examples/qwen2_5omni/image_infer.cpp`:
- Around lines 24-29: The code silently defaults file_version to
mllm::ModelFileVersion::kV1 which can load wrong weights; change the logic in
the block that reads model_version.get() so it validates allowed values
explicitly (e.g., "v1" -> kV1, "v2" -> kV2) and on any other value fail fast by
returning an error/throwing an exception or logging and exiting with a clear
message that the model_version is unsupported; update references to file_version
and the model-version parsing branch (where mllm::ModelFileVersion file_version
is set) to enforce this validation rather than falling back to kV1.
♻️ Duplicate comments (4)
examples/qwen2_5omni/image_infer.cpp (2)

16-34: -h/--help is still blocked by required-arg validation.

Argparse validates required args inside parse(), so the help check after parsing won’t run when -h is the only flag. Handle help before validation or make Argparse treat help as a bypass.

💡 One way to bypass required-arg validation
+#include <string_view>
 ...
   auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);

+  for (int i = 1; i < argc; ++i) {
+    const std::string_view arg{argv[i]};
+    if (arg == "-h" || arg == "--help") {
+      Argparse::printHelp();
+      mllm::shutdownContext();
+      return 0;
+    }
+  }
   Argparse::parse(argc, argv);

52-59: Restore interactive input for image_path and prompt_text.

Currently the CLI ignores stdin and uses empty strings, which makes inference unusable. Re-enable std::getline and keep the exit/quit check.

🐛 Proposed fix
     fmt::print("Image path (or 'exit/quit'): ");
-    image_path = "";
-    //std::getline(std::cin, image_path);
+    std::getline(std::cin, image_path);
     if (image_path == "exit" || image_path == "quit") { return 0; }

     fmt::print("Prompt text: ");
-    prompt_text = "";
-    //std::getline(std::cin, prompt_text);
+    std::getline(std::cin, prompt_text);
examples/qwen2_5omni/audio_infer.cpp (2)

24-29: Reject unsupported model_version values.

Any value other than "v1"/"v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }

51-59: Restore interactive stdin input for audio path and prompt.

The current code ignores user input and uses empty strings, which breaks the CLI. Re-enable std::getline, keep the exit/quit check, and restore the default prompt when empty.

🎤 Proposed fix
     fmt::print("Audio path (or 'exit/quit'): ");
-    //std::getline(std::cin, audio_path);
-    //if (audio_path == "exit" || audio_path == "quit") { return 0; }
-    audio_path = "";
+    std::getline(std::cin, audio_path);
+    if (audio_path == "exit" || audio_path == "quit") { return 0; }

     fmt::print("Prompt text: ");
-    //std::getline(std::cin, prompt_text);
-    //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-    prompt_text = "";
+    std::getline(std::cin, prompt_text);
+    if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }

Comment on lines +16 to +35
auto& help = Argparse::add<bool>("-h|--help").help("Show help message");
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true);
auto& model_version = Argparse::add<std::string>("-mv|--model_version").help("Model version").required(true);
auto& tokenizer_path = Argparse::add<std::string>("-t|--tokenizer_path").help("Tokenizer directory").required(true);
auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);

Argparse::parse(argc, argv);

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}

⚠️ Potential issue | 🟠 Major

-h/--help will never show if required args are missing.

Argparse::parse() enforces required options before help.isSet() is checked. Handle help before validation (or modify Argparse to bypass required checks for help).

💡 One way to bypass required-arg validation
+#include <string_view>
 ...
   auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);

+  for (int i = 1; i < argc; ++i) {
+    const std::string_view arg{argv[i]};
+    if (arg == "-h" || arg == "--help") {
+      Argparse::printHelp();
+      mllm::shutdownContext();
+      return 0;
+    }
+  }
   Argparse::parse(argc, argv);
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 16 - 35, the help flag is
never reachable because Argparse::parse() enforces required options first;
before calling Argparse::parse(argc, argv) detect help early (e.g., scan argv
for "-h" or "--help") and if present call Argparse::printHelp() and
mllm::shutdownContext() then return 0, or alternatively use an Argparse::parse
variant/option that skips required-arg validation for help; update the code
around Argparse::parse, help.isSet(), Argparse::printHelp, and
mllm::shutdownContext to ensure help is handled before required-arg validation.
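
Since the same pre-parse scan is needed in all three example CLIs (text, image, and audio), the workaround could live in one small shared helper. A minimal sketch, assuming a hypothetical hasHelpFlag free function that is not part of mllm's Argparse API:

#include <string_view>

// Hypothetical helper: scan raw argv for the help flag before
// Argparse::parse() runs its required-argument validation.
static bool hasHelpFlag(int argc, char** argv) {
  for (int i = 1; i < argc; ++i) {
    const std::string_view arg{argv[i]};
    if (arg == "-h" || arg == "--help") { return true; }
  }
  return false;
}

Each example's main() would then check it before parsing:

  if (hasHelpFlag(argc, argv)) {
    Argparse::printHelp();
    mllm::shutdownContext();
    return 0;
  }
  Argparse::parse(argc, argv);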

Comment on lines +5 to +7
"enable_audio_output": true,
"enable_talker": true,
"model_type": "qwen2_5_omni",

⚠️ Potential issue | 🟠 Major

Disable talker/audio-output flags until talker is actually supported.

The PR objective says “without talker,” but enable_audio_output / enable_talker are true. If runtime respects these flags, it may try to initialize unsupported components or confuse users. Suggest disabling them (or explicitly documenting that they’re placeholders) until talker is ready.

💡 Proposed change
-  "enable_audio_output": true,
-  "enable_talker": true,
+  "enable_audio_output": false,
+  "enable_talker": false,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"enable_audio_output": true,
"enable_talker": true,
"model_type": "qwen2_5_omni",
"enable_audio_output": false,
"enable_talker": false,
"model_type": "qwen2_5_omni",
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json` around lines 5 - 7, the
config enables unsupported talker components despite the PR stating "without
talker": set "enable_audio_output" and "enable_talker" to false (or add an
explicit comment stating they are placeholders) in the JSON for the qwen2_5_omni
model to prevent runtime attempts to initialize talker/audio subsystems; update
any documentation or inline note near the "model_type": "qwen2_5_omni" entry to
clarify that talker is not yet supported if you choose the placeholder route.

Comment on lines +24 to +29
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

⚠️ Potential issue | 🟠 Major

Reject unsupported model_version values instead of defaulting to v1.

Silent fallback to kV1 can load incorrect weights. Validate and fail fast on unknown values.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
} else {
  fmt::print("Unsupported model_version: {}\n", model_version.get());
  mllm::shutdownContext();
  return 1;
}
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 24 - 29, the code silently
defaults file_version to mllm::ModelFileVersion::kV1 which can load wrong
weights; change the logic in the block that reads model_version.get() so it
validates allowed values explicitly (e.g., "v1" -> kV1, "v2" -> kV2) and on any
other value fail fast by returning an error/throwing an exception or logging and
exiting with a clear message that the model_version is unsupported; update
references to file_version and the model-version parsing branch (where
mllm::ModelFileVersion file_version is set) to enforce this validation rather
than falling back to kV1.
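
Because the identical v1/v2 branch appears in both audio_infer.cpp and image_infer.cpp, the validation could also be factored into one shared helper so every example CLI fails fast consistently. A minimal sketch, assuming a hypothetical parseModelFileVersion function that is not an existing mllm API:

#include <optional>
#include <string>

// Hypothetical shared helper: map the --model_version string to an enum
// value, returning std::nullopt for anything other than "v1"/"v2" so the
// caller can report an error instead of silently loading kV1 weights.
std::optional<mllm::ModelFileVersion> parseModelFileVersion(const std::string& s) {
  if (s == "v1") { return mllm::ModelFileVersion::kV1; }
  if (s == "v2") { return mllm::ModelFileVersion::kV2; }
  return std::nullopt;
}

Usage in each example's main():

  const auto file_version = parseModelFileVersion(model_version.get());
  if (!file_version) {
    fmt::print("Unsupported model_version: {}\n", model_version.get());
    mllm::shutdownContext();
    return 1;
  }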
