feat: Add Support of Qwen2.5Omni model (without talker) #610

Closed

KKkai0315 wants to merge 14 commits into UbiquitousLearning:main from KKkai0315:main

Conversation

@KKkai0315 (Contributor) commented Jan 23, 2026

  1. Add support for the Qwen2.5-Omni model (without the talker).
  2. It currently accepts text, audio, and image inputs and outputs the text modality.
  3. The talker, which will add audio output, is still under development.

Summary by CodeRabbit

  • New Features

    • Full Qwen2.5-Omni multimodal support (text, vision, audio) with model config, tokenizer, and audio preprocessing.
    • Interactive example CLI apps for text, image, and audio inference.
    • Python/API: new quantized embedding, concat observer/fixed QDQ, and expanded dtype/device constructors.
  • Chores

    • Installer/export and build updates for multiple backends and examples; added Android build workflows.
  • Performance/Infrastructure

    • New CPU fill utilities and dispatch paths to improve low-level data ops and efficiency.



coderabbitai Bot commented Jan 23, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Adds full Qwen2.5-Omni multimodal support: build entries and CLI examples, tokenizer with vision/audio message handling, audio preprocessing, comprehensive multimodal model headers and configuration, many backend/kernel fill utilities, QNN backend hardening, PyMllm FFI/type/device exports, and several PyTorch quantization/observer utilities.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Examples / Build**<br>`examples/CMakeLists.txt`, `examples/qwen2_5omni/CMakeLists.txt` | Registers qwen2_5omni and adds three example executables (text/image/audio runners). |
| **Example CLIs**<br>`examples/qwen2_5omni/text_infer.cpp`, `.../image_infer.cpp`, `.../audio_infer.cpp` | New interactive CLI apps for text/image/audio inference (argument parsing, model loading, streaming output). |
| **Model Config**<br>`src/mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp`, `examples/qwen2_5omni/config_qwen2_5omni_7B.json` | New Qwen2_5OmniConfig struct and full JSON config describing multimodal model/thinker/talker settings. |
| **Tokenizer / Messages**<br>`src/mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp` | New Qwen2_5OmniTokenizer, message structs (text/vision/audio), BPE and regex tokenization, vision/audio sequence composition. |
| **Audio Preprocessing**<br>`src/mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp` | Mel-spectrogram utilities, Hann/window/filterbank, MelSpectrogramFeatures module, Qwen2_5OmniAudioPreprocessor and helpers. |
| **Modeling / Multimodal**<br>`src/mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` | Large multimodal model declarations: RoPE helpers, visual/audio rotary embeddings, vision/audio encoders, decoder/thinker, and top-level ForCausalLM interfaces. |
| **Build / Packaging**<br>`CMakeLists.txt`, `mllm/backends/qnn/CMakeLists.txt` | flatbuffers target set POSITION_INDEPENDENT_CODE and installed/exported; QNN backend target installation added. |
| **CMake / Warnings**<br>`mllm/CMakeLists.txt` | Added compiler warning suppression (-Wno-comma-subscript) handling for Clang/GCC. |
| **CPU Kernels: fill utilities**<br>`mllm/backends/cpu/kernels/common/fill-inl.hpp`, `.../kernel_dispatch.cpp`, `.../kernel_dispatch.hpp`, `mllm/backends/cpu/ops/FillOp.cpp` | New vectorized fill zeros/ones/value/arange/random implementations and dynamic-dispatch wrappers; x86 fallback handling for fp16; kernel dispatch/headers updated. |
| **QNN Backend hardening**<br>`mllm/backends/qnn/QNNBackend.cpp`, `.../QNNBackend.hpp`, `.../Register.cpp`, `.../QnnWrappersAPI.cpp` | More defensive symbol/context loading, logging, error checks; default QNN log level changed to VERBOSE; context reuse guard. |
| **QNN AOT / PTQ changes**<br>`mllm/backends/qnn/aot/passes/LLMQuantRecipePass.cpp`, `.../PTQPass.cpp`, `.../visitor/RMSNorm.cpp` | Quantization recipe/embedding weight handling updated, AsymPerTensor/SymPerTensor checks, added recursive validation helpers and bias quant spec changes. |
| **Embedding / Ops**<br>`mllm/core/aops/EmbeddingOp.cpp` | EmbeddingOp::reshape now derives output dtype from weight dtype (handles uint16). |
| **FFI / Python exports**<br>`mllm/ffi/Extension.cc`, `pymllm/ffi/__init__.py`, `pymllm/__init__.py` | New DType factory bindings (int*/uint*/bool), global DType instances, device globals (cpu/cuda/qnn), and expanded pymllm exports. |
| **PyTorch quant & qualcomm modules**<br>`pymllm/backends/qualcomm/transformers/core/embedding.py`, `.../observer.py`, `.../qdq.py`, `.../rms_norm.py`, `.../qlinear.py` | Added QEmbedding quantized embedding class, ConcatObserver, FixedActivationQDQ, observer eps refinements, small formatting tweaks. |
| **Qwen3 quant wiring**<br>`pymllm/backends/qualcomm/transformers/qwen3/*` | Integrates FixedActivationQDQ/ConcatObserver/QEmbedding into qwen3 modeling and runner, updates rotate_half/RoPE integration and quantizer toggles. |
| **Model file writer**<br>`pymllm/convertor/model_file_v2.py` | Added helper to extract torch.Tensor bytes (_torch_tensor_bytes) and switched write paths to it. |
| **Misc / Tasks / Gitignore**<br>`.gitignore`, `tasks/*.yaml`, small cpp cleanup `PromptProcessor.cpp` | Added mllm-install-* ignore, new Android QNN AOT build task, build-type tweak for x86 QNN AOT, and trivial file cleanup. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as Example CLI
    participant AudioPrep as AudioPreprocessor
    participant Tokenizer
    participant Model as Qwen2_5OmniForCausalLM
    participant Output

    User->>CLI: supply audio file + prompt
    CLI->>AudioPrep: load WAV, extract Mel features
    AudioPrep-->>CLI: audio features + lengths
    CLI->>Tokenizer: convert prompt + audio features -> token seq + modality metadata
    Tokenizer-->>CLI: token ids + audio/image tensors
    CLI->>Model: forward(tokens, audio_features, modality metadata)
    Model->>Model: encode audio/vision, decode text (kv cache)
    Model-->>CLI: streamed token logits
    CLI->>Tokenizer: detokenize logits -> text
    CLI-->>Output: stream text to stdout
```
```mermaid
sequenceDiagram
    participant WAV as Audio File
    participant Reader as WAV Reader
    participant STFT as STFT/Hann
    participant Mel as Mel Filterbank
    participant Norm as Log/Normalize
    participant Features

    WAV->>Reader: read samples (resample if needed)
    Reader->>STFT: frame & window (Hann), compute FFT
    STFT->>Mel: apply power, project via Mel filterbank
    Mel->>Norm: log-compress, clip, normalize
    Norm-->>Features: final Mel spectrogram tensor
```
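
For readers who want the pipeline above in concrete terms, here is a minimal NumPy sketch of the same steps. The helper names (`mel_filterbank`, `log_mel_spectrogram`), the FFT size/hop/mel count, and the Whisper-style log/clip/normalize constants are illustrative assumptions, not necessarily the values this PR uses:

```python
import numpy as np

def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular filters spaced evenly on the (HTK) mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(samples: np.ndarray, sr: int = 16000,
                        n_fft: int = 400, hop: int = 160, n_mels: int = 128) -> np.ndarray:
    """Frame -> Hann window -> FFT -> power -> mel projection -> log/clip/normalize."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop  # assumes len(samples) >= n_fft
    frames = np.stack([samples[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    log_mel = np.log10(np.maximum(mel, 1e-10))
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)  # clip dynamic range
    return (log_mel + 4.0) / 4.0                        # Whisper-convention normalization
```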

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • yirongjie
  • chenghuaWang
  • oreomaker

Poem

🐰 I hopped through headers, tokens, and sound,
I stitched Mel spectrums where signals abound.
With tokens and images all snug in a stream,
Qwen2.5-Omni now dances — a multimodal dream! 🥕🎶🖼️

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete and does not follow the provided template. It lacks structured sections like motivation, changes made, testing, and references to contribution guidelines. | Expand the description to follow the template structure, including detailed motivation, comprehensive list of changes, testing approach, and any breaking changes or dependencies. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.75%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding support for the Qwen2.5Omni model without the talker component. |




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 51-59: The CLI currently ignores stdin by hardcoding audio_path
and prompt_text; revert to reading user input using std::getline for audio_path
and prompt_text (restore the commented lines) instead of the absolute path and
hardcoded prompt, and re-add checks to exit on "exit"/"quit" for audio_path and
to set a default prompt (e.g., "Please describe the audio.") when prompt_text is
empty; update the variables audio_path and prompt_text in
examples/qwen2_5omni/audio_infer.cpp accordingly and remove the local absolute
assignment.
- Around line 24-29: The code currently treats any model_version other than "v2"
as kV1 leading to silent misloads; update the validation around the
model_version.get() check (where file_version and mllm::ModelFileVersion are
set) to explicitly accept only "v1" or "v2" and otherwise print a clear error
(including the invalid value) and terminate (e.g., std::cerr + std::exit(1) or
throw) so unsupported values do not fall back silently to kV1.

In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 49-59: The code contains debug artifacts: hard-coded assignments
to image_path and prompt_text and commented-out interactive input; restore
interactive CLI by removing the hard-coded assignments and re-enabling the
std::getline calls for image_path and prompt_text, keep the existing prompt
prints and the exit/quit check, and ensure you trim or validate the read lines
if necessary (look for variables image_path, prompt_text and the std::getline
usage in this block).
- Around line 31-35: Move the help handling so -h/--help bypasses
required-argument validation: call Argparse::printHelp() and
mllm::shutdownContext() when help.isSet() before running the code path that
enforces required args (i.e., before or immediately after Argparse::parse() but
before any required-argument validation), or adjust Argparse to treat help as a
special case; locate the current help check (help.isSet()), the parser
invocation (Argparse::parse()), and the help output (Argparse::printHelp()) to
implement this change.

In `@examples/qwen2_5omni/text_infer.cpp`:
- Around line 24-29: The current logic silently maps any model_version other
than "v2" to kV1; update the branch that sets file_version (variable
file_version of type mllm::ModelFileVersion) to explicitly validate
model_version.get(): accept "v1" -> mllm::ModelFileVersion::kV1, "v2" ->
mllm::ModelFileVersion::kV2, and for any other value print a clear error
(including the invalid value) and exit/throw instead of falling back, so callers
of text_infer (where model_version is used) cannot accidentally load wrong
weights.

In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 197-206: Add a validation inside the loop where spatial merging is
computed (the block that reads dims and computes t,h,w, spatial_merge_size,
num_h_blocks, num_w_blocks, total_blocks, block_area) to ensure h %
spatial_merge_size == 0 and w % spatial_merge_size == 0; if not, raise/abort
with a clear error (or use an assert) describing invalid spatial_merge_size vs
H/W so rotary position alignment cannot silently miscompute. Ensure the check
references the same symbols (h, w, spatial_merge_size,
num_h_blocks/num_w_blocks) and runs before computing
num_h_blocks/num_w_blocks/total_blocks.
- Around line 568-575: The code assumes seq_len is divisible by
spatial_merge_unit_ before calling hidden_states.view({seq_len /
spatial_merge_unit_, spatial_merge_unit_, -1}), so add a guard that checks
seq_len % spatial_merge_unit_ == 0 and handle the non-divisible case (e.g.,
trim/pad hidden_states or raise a clear error) before the view call; update the
block around seq_len, spatial_merge_unit_, and hidden_states.view to perform the
check and decide whether to reshape, pad, or throw, ensuring any error message
references seq_len and spatial_merge_unit_ for easier debugging.
- Around line 95-105: The function makeWindowIndex must validate window_size,
spatial_merge_size, and patch_size before computing vit_merger_window_size and
spatial_merge_unit: check spatial_merge_size and patch_size are non-zero and
that window_size is divisible by (spatial_merge_size * patch_size) (or that
window_size / spatial_merge_size / patch_size produces the intended integer) and
fail-fast (MLLM_RT_ASSERT or throw) with a clear message referencing grid_thw
and the offending parameter(s); additionally validate spatial_merge_size > 0
before computing spatial_merge_unit to avoid division/multiplication errors and
ensure window_index construction only proceeds when these guard conditions pass.
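
Taken together, the three checks requested above reduce to one fail-fast pattern; a minimal sketch of the idea (Python pseudocode for the C++ asserts, with illustrative names):

```python
def check_divisible(value: int, divisor: int, what: str) -> None:
    """Fail fast with a descriptive message instead of silently truncating."""
    if divisor <= 0:
        raise ValueError(f"{what}: divisor must be positive, got {divisor}")
    if value % divisor != 0:
        raise ValueError(f"{what}: {value} is not divisible by {divisor}")

# Before computing merge blocks / reshaping (names are illustrative):
# check_divisible(h, spatial_merge_size, "vision grid height")
# check_divisible(w, spatial_merge_size, "vision grid width")
# check_divisible(seq_len, spatial_merge_unit, "hidden_states seq_len")
# check_divisible(window_size, spatial_merge_size * patch_size, "window size")
```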

In `@mllm/models/qwen2_5omni/python_src_code/audio_process.py`:
- Around line 71-73: Replace the assert check that calls
_check_if_video_has_audio(path) with an explicit runtime exception: if
use_audio_in_video is True and _check_if_video_has_audio(path) returns False,
raise a ValueError with the same message ("Video must has audio track when
use_audio_in_video=True"); this mirrors the existing validation style used
around lines 65 and 81 and ensures the check isn't skipped under Python -O.
Reference the existing helper _check_if_video_has_audio and the local variable
path (and the use_audio_in_video flag) when making the change.
- Around line 11-16: The _check_if_video_has_audio function currently opens a
PyAV container with av.open(video_path) and never closes it, leaking file
descriptors; update it to use a context manager (with av.open(video_path) as
container:) or ensure container.close() in a finally block so the container is
always closed even on exceptions, keeping the existing logic that inspects
container.streams for audio streams.
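
A minimal sketch of the fixed helper, assuming PyAV's standard container API (`av.open` supports the context-manager protocol and exposes `container.streams.audio`):

```python
import av

def _check_if_video_has_audio(video_path: str) -> bool:
    # The context manager closes the container even if an exception is raised.
    with av.open(video_path) as container:
        return len(container.streams.audio) > 0
```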

In `@mllm/models/qwen2_5omni/python_src_code/LICENSE`:
- Around line 1-201: The LICENSE file name violates the project's allowed
extensions; rename the file named "LICENSE" to include a permitted extension
(e.g., "LICENSE.txt" or "LICENSE.md") so it matches coding guidelines; update
any references to "LICENSE" in packaging or docs to point to the new filename to
avoid broken links.

In `@mllm/models/qwen2_5omni/python_src_code/name.py`:
- Around line 27-32: The loop currently calls f.get_tensor(key) and computes
unused variables i, shape, and dtype, which loads full tensors into memory;
replace this by using safetensors metadata access (e.g., f.get_metadata or
f.metadata[key]) to read shape and dtype without materializing the tensor, or if
you only need the key, remove the get_tensor call and the unused variables
entirely; update the loop that iterates over keys to either log the metadata via
the safetensors metadata API or simply print the key without calling
f.get_tensor(key).

In `@mllm/models/qwen2_5omni/python_src_code/README.md`:
- Around line 740-749: The README shows audio output code for model.generate and
uses processor.batch_decode and sf.write, but the PR is text-only; update the
examples to explicitly disable audio (call model.generate(...,
return_audio=False or remove use_audio_in_video) and only process text via
processor.batch_decode) and remove or guard any sf.write/audio handling; also
add a clear "audio/talker not supported yet" note near the generation examples
(including the other audio example blocks) so users aren’t misled.

In `@mllm/models/qwen2_5omni/python_src_code/vision_process.py`:
- Around line 106-111: Add a timeout to the HTTP fetch to avoid hangs: update
the requests.get call in the image URL branch (the block that checks
image.startswith("http://") or "https://") to pass a timeout (e.g.,
timeout=REQUEST_TIMEOUT or timeout=10) — preferably use a module-level constant
like REQUEST_TIMEOUT so it’s configurable — and keep using the response context
manager, response.raise_for_status(), and the BytesIO/Image.open flow unchanged;
also ensure any callers/handlers will surface
requests.exceptions.ReadTimeout/ConnectTimeout if needed.
🟡 Minor comments (13)
mllm/models/qwen2_5omni/python_src_code/name.py-1-5 (1)

1-5: Filename does not match docstring.

The file is named name.py but the docstring references print_safetensors_keys.py. Consider renaming the file to print_safetensors_keys.py for consistency and clarity.

mllm/models/qwen2_5omni/python_src_code/README.md-760-763 (1)

760-763: Use descriptive link text instead of “here.”
markdownlint MD059 prefers descriptive anchor text for accessibility.

mllm/models/qwen2_5omni/python_src_code/README.md-672-681 (1)

672-681: Specify languages on fenced code blocks.
markdownlint MD040 flags missing languages on the install/error and system-prompt snippets.

✏️ Proposed update
-```
+```bash
 pip uninstall transformers
 pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
 pip install accelerate
-```
+```

-```
+```text
 KeyError: 'qwen2_5_omni'
-```
+```

-```
+```json
 {
     "role": "system",
     "content": [
         {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
     ],
 }
-```
+```

Also applies to: 875-882

mllm/models/qwen2_5omni/python_src_code/audio_process.py-35-37 (1)

35-37: Guard against empty input and invalid time ranges.
conversations[0] will raise on empty input, and negative durations can be produced when *_end < *_start. Consider explicit validation. As per coding guidelines, validate inputs for public APIs and critical internal functions.

🛠️ Proposed fix
-    audios = []
-    if isinstance(conversations[0], dict):
+    if not conversations:
+        return None
+    audios = []
+    if isinstance(conversations[0], dict):
         conversations = [conversations]
@@
-                        audio_start = ele.get("audio_start", 0.0)
-                        audio_end = ele.get("audio_end", None)
+                        audio_start = ele.get("audio_start", 0.0)
+                        audio_end = ele.get("audio_end", None)
+                        if audio_end is not None and audio_end < audio_start:
+                            raise ValueError("audio_end must be >= audio_start")
@@
-                        audio_start = ele.get("video_start", 0.0)
-                        audio_end = ele.get("video_end", None)
+                        audio_start = ele.get("video_start", 0.0)
+                        audio_end = ele.get("video_end", None)
+                        if audio_end is not None and audio_end < audio_start:
+                            raise ValueError("video_end must be >= video_start")

Also applies to: 46-47, 69-70, 88-90

mllm/models/qwen2_5omni/python_src_code/README.md-656-665 (1)

656-665: Replace the hard tab in the Text → Text table.
markdownlint MD010 flags a hard tab here; replace it with spaces to avoid rendering quirks.

mllm/models/qwen2_5omni/python_src_code/README.md-23-25 (1)

23-25: Add alt text to images for accessibility.
The HTML image tags here don’t include alt text (MD045). Please add descriptive alt attributes.

Also applies to: 41-43, 49-50

mllm/models/qwen2_5omni/python_src_code/README.md-492-529 (1)

492-529: Add blank lines around Markdown tables.
markdownlint MD058 requires blank lines before/after tables in these sections.

Also applies to: 532-541, 653-668, 766-775

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp-119-132 (1)

119-132: Default constructor leaves critical members uninitialized.

The default constructor MelSpectrogramFeatures() = default leaves stft_, window_, and melscale_fbanks_ in an indeterminate state. If forward() is called on a default-constructed instance, it will cause undefined behavior. Consider either removing the default constructor or initializing members to safe defaults.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp-942-952 (1)

942-952: Handle unsupported attention dtypes explicitly.
If key_states isn’t float32/float16, attn stays unset. Add an error branch to avoid undefined behavior.

⚠️ Add explicit dtype handling
     if (key_states.dtype() == kFloat32) {
       attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
       attn = mask_(attn);
       attn = softmax_(attn);
     } else if (key_states.dtype() == kFloat16) {
       attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
       attn = mask_(attn);
       attn = softmax_(attn);
       attn = attn.to(kFloat16);
+    } else {
+      MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported attention dtype for Qwen2.5-Omni.");
     }
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-290-292 (1)

290-292: Docstring parameter name mismatch.

The docstring references t_ntoken_per_chunk but the actual parameter is tokens_per_chunk.

📝 Fix docstring
         Parameters:
             token_indices (`np.ndarray`): A monotonically increasing list of token index values.
-            t_ntoken_per_chunk (`int`): Number of tokens per chunk (used as the chunk size threshold).
+            tokens_per_chunk (`int`): Number of tokens per chunk (used as the chunk size threshold).
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-369-385 (1)

369-385: Typos in docstring and error message.

  • Line 370: Inion should be Union
  • Line 384-385: Error message missing closing backtick after "audio"
📝 Fix typos
         Returns:
-            `list[Inion[str, np.ndarray]]`: The decoded text or generated audio.
+            `list[Union[str, np.ndarray]]`: The decoded text or generated audio.
         """
         if generation_mode is None or generation_mode == "text":
             return self.post_process_image_text_to_text(
                 generated_outputs, skip_special_tokens=skip_special_tokens, **kwargs
             )

         elif generation_mode == "audio":
             # model supports only bs=1, so we will never get several audio outputs
             audio = generated_outputs[1].reshape(-1).detach().cpu().numpy()
             return [audio]

         else:
             raise ValueError(
-                f"{self.__class__.__name__} got an unexpected generation_mode={generation_mode}. Supported options are only `text` and `audio"
+                f"{self.__class__.__name__} got an unexpected generation_mode={generation_mode}. Supported options are only `text` and `audio`."
             )
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-387-402 (1)

387-402: Missing null checks for optional processors.

The property accesses model_input_names on feature_extractor, image_processor, and video_processor, all of which can be None per the constructor signature. This will raise AttributeError if any processor wasn't provided.

🐛 Suggested defensive approach
     `@property`
     def model_input_names(self):
         tokenizer_input_names = self.tokenizer.model_input_names
-        feature_extractor_input_names = self.feature_extractor.model_input_names
-        image_processor_input_names = self.image_processor.model_input_names
-        video_processor_input_names = self.video_processor.model_input_names
+        feature_extractor_input_names = self.feature_extractor.model_input_names if self.feature_extractor else []
+        image_processor_input_names = self.image_processor.model_input_names if self.image_processor else []
+        video_processor_input_names = self.video_processor.model_input_names if self.video_processor else []
         return list(
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-92-102 (1)

92-102: Potential AttributeError if tokenizer is None.

The constructor accesses self.tokenizer.image_token and other attributes unconditionally, but tokenizer=None is the default parameter. If instantiated without a tokenizer, this will raise AttributeError.

🐛 Suggested defensive check
     def __init__(
         self, image_processor=None, video_processor=None, feature_extractor=None, tokenizer=None, chat_template=None
     ):
         super().__init__(image_processor, video_processor, feature_extractor, tokenizer, chat_template=chat_template)
+        if tokenizer is None:
+            raise ValueError("tokenizer is required for Qwen2_5OmniProcessor")
         self.image_token = self.tokenizer.image_token
🧹 Nitpick comments (17)
mllm/models/qwen2_5omni/python_src_code/name.py (3)

10-15: Missing PyTorch dependency in error message.

The script uses framework="pt" which requires PyTorch. If PyTorch is missing, users will get a confusing error later. Consider mentioning both dependencies.

💡 Suggested improvement
 try:
     from safetensors import safe_open
 except ImportError:
-    print("请先安装 safetensors 库:")
-    print("  pip install safetensors")
+    print("请先安装 safetensors 和 torch 库:")
+    print("  pip install safetensors torch")
     sys.exit(1)

61-65: Redundant file opening; refactor to avoid opening each file twice.

Each file is opened in print_keys_from_file() and then again to count keys. This doubles I/O operations. Have print_keys_from_file() return the key count instead.

♻️ Proposed refactor

Update print_keys_from_file to return the key count:

-def print_keys_from_file(filepath: Path):
+def print_keys_from_file(filepath: Path) -> int:
     """打印单个 safetensors 文件的所有键名"""
     ...
-            print(f"{key}")
+            print(f"{key}")
+    return len(keys)

Then simplify the loop in main:

         total_keys = 0
         for filepath in files:
-            print_keys_from_file(filepath)
-            with safe_open(filepath, framework="pt") as f:
-                total_keys += len(f.keys())
+            total_keys += print_keys_from_file(filepath)

53-53: Ambiguous fullwidth comma in comment.

The comment contains a fullwidth comma (U+FF0C) instead of ASCII comma ,. Per static analysis (RUF003).

💡 Suggested fix
-        # 目录,查找所有 safetensors 文件
+        # 目录, 查找所有 safetensors 文件
mllm/models/qwen2_5omni/python_src_code/configuration.json (1)

1-1: Consider improving readability and documentation.

The single-line JSON format is valid but could be improved for maintainability:

  • Format with proper indentation for better readability
  • Add a separate schema file or documentation explaining each configuration option
  • Document what "all-to-all" task type means in this context
📝 Suggested formatting improvement
-{"framework": "pytorch", "task": "all-to-all", "allow_remote": true}
+{
+  "framework": "pytorch",
+  "task": "all-to-all",
+  "allow_remote": true
+}

Additionally, consider adding a README.md or inline comments (if the format supports it) documenting the purpose of each field.

mllm/models/qwen2_5omni/python_src_code/preprocessor_config.json (1)

2-29: Add validation for derived audio fields to avoid config drift.

n_samples and nb_max_frames appear derivable from chunk_length, sampling_rate, and hop_length. Consider validating these relationships in the loader (or deriving them) to prevent silent inconsistencies if one value changes.
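
A loader-side check could look like this minimal sketch, assuming the usual Whisper-style relationships n_samples = chunk_length × sampling_rate and nb_max_frames = n_samples // hop_length (verify against the actual preprocessor before adopting):

```python
def validate_derived_audio_fields(cfg: dict) -> None:
    """Raise if derived audio fields drift from their source values (assumed relationships)."""
    expected_samples = cfg["chunk_length"] * cfg["sampling_rate"]
    if cfg["n_samples"] != expected_samples:
        raise ValueError(
            f"n_samples={cfg['n_samples']} but chunk_length*sampling_rate={expected_samples}"
        )
    expected_frames = cfg["n_samples"] // cfg["hop_length"]
    if cfg["nb_max_frames"] != expected_frames:
        raise ValueError(
            f"nb_max_frames={cfg['nb_max_frames']} but n_samples//hop_length={expected_frames}"
        )
```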

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp (1)

43-55: Potential portability issue with M_PI constant.

M_PI is not part of the C++ standard and may not be defined on all platforms (particularly MSVC without _USE_MATH_DEFINES). Consider using a locally defined constant for cross-platform compatibility.

Suggested fix for portability
 inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
+  constexpr float kPi = 3.14159265358979323846f;
   int32_t length = periodic ? window_length + 1 : window_length;
   auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
   float* window_ptr = window.ptr<float>();

   for (int32_t i = 0; i < window_length; ++i) {
     float n = static_cast<float>(i);
     float denominator = periodic ? static_cast<float>(length) : static_cast<float>(length - 1);
-    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denominator);
+    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * kPi * n / denominator);
   }

   return window;
 }
mllm/models/qwen2_5omni/python_src_code/vision_process.py (1)

401-405: Consider more specific exception handling for video reader fallback.

Catching bare Exception can mask unexpected errors beyond video reading issues. While the fallback behavior is reasonable, consider catching more specific exceptions (e.g., IOError, RuntimeError) to avoid hiding unrelated bugs.

Suggested improvement
         try:
             video, sample_fps = VIDEO_READER_BACKENDS[video_reader_backend](ele)
-        except Exception as e:
+        except (IOError, RuntimeError, ImportError, KeyError) as e:
             logger.warning(f"video_reader_backend {video_reader_backend} error, use torchvision as default, msg: {e}")
             video, sample_fps = VIDEO_READER_BACKENDS["torchvision"](ele)
mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp (2)

20-117: Regex tokenization pattern is comprehensive but complex.

The qwen2_5OmniTokenizerMatchPattern function implements a multi-stage pattern matcher handling contractions, letters, digits, punctuation, and whitespace. The logic follows the Qwen2/Qwen2-VL tokenizer patterns. Consider adding inline comments explaining each matching stage for maintainability.


189-193: Consider removing or formalizing the debug comment.

The comment on line 190 (//interestingly, the answer went bad when setting max_pixels higher...) appears to be a development note. If this is a known limitation, consider documenting it properly in the class documentation or as a TODO with a tracking issue.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (3)

24-33: Add brief API docs for new public helpers/classes.
This header exposes many new public helpers/modules without doc comments. Please add concise Doxygen-style docs for purpose/params/returns/errors so the public surface is self-explanatory.

As per coding guidelines, please add brief public API documentation.


584-595: Consider avoiding full seq_len² mask for window attention.
This allocates O(n²) memory/time; for large images it can be heavy. A block-sparse mask or per-window attention would reduce memory pressure.

As per coding guidelines, consider reducing O(n²) memory/time where feasible.


1233-1247: Avoid quadratic scans for next vision/audio tokens.
Each loop rescans the tail to find next markers, making this O(S²) on long sequences. Precompute marker indices or scan once with moving pointers.

As per coding guidelines, consider a linear-time scan to keep long sequences efficient.
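
For illustration, a single-pass index precompute (a Python sketch of the idea; the actual code is C++):

```python
from collections import defaultdict

def marker_positions(token_ids, marker_ids):
    """One O(S) pass collecting the indices of each marker token."""
    markers = set(marker_ids)
    positions = defaultdict(list)
    for i, tok in enumerate(token_ids):
        if tok in markers:
            positions[tok].append(i)
    return positions  # consume with per-marker cursors instead of rescanning the tail
```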

examples/qwen2_5omni/image_infer.cpp (1)

24-29: Missing validation for invalid model version.

If model_version is neither "v1" nor "v2", the code silently defaults to kV1. Consider warning or erroring on unrecognized versions.

♻️ Suggested improvement
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print(stderr, "Warning: Unknown model version '{}', defaulting to v1\n", model_version.get());
   }
mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py (4)

48-70: Consider annotating _defaults with ClassVar.

Ruff flags mutable class attributes without ClassVar annotation. While this pattern is common in HuggingFace processing utilities, adding the annotation improves type safety.

♻️ Optional fix
+from typing import ClassVar

 class Qwen2_5OmniProcessorKwargs(ProcessingKwargs, total=False):
     videos_kwargs: Qwen2_5_OmniVideosKwargs

-    _defaults = {
+    _defaults: ClassVar[dict] = {
         "text_kwargs": {

149-162: Audio length calculation could benefit from a comment.

Lines 158-159 contain a non-obvious formula: (input_lengths - 2) // 2 + 1 and (audio_inputs["feature_attention_mask"].sum(-1) - 1) // 2 + 1. Consider adding a brief comment explaining the transformation (e.g., relating to convolution stride or feature extraction parameters).
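
As a hedged aside, both expressions match the standard Conv1d output-length rule for particular kernel/stride/padding choices (the extractor's exact parameters aren't confirmed here):

```python
def conv1d_out_len(length: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Standard Conv1d output-length rule: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

# (L - 2) // 2 + 1 matches kernel=2, stride=2, padding=0:
assert conv1d_out_len(100, kernel=2, stride=2) == (100 - 2) // 2 + 1
# (L - 1) // 2 + 1 matches kernel=3, stride=2, padding=1:
assert conv1d_out_len(100, kernel=3, stride=2, padding=1) == (100 - 1) // 2 + 1
```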


224-229: Redundant sorting of positions.

positions is sorted twice: first via sorted() in line 228, then explicitly in line 229. The key function in line 229 sorts by x[0] (start position), which is the same as the default sort behavior of tuples. Remove one of the sorts.

♻️ Proposed fix
         for sample in text:
             positions = []
             special_tokens = [re.escape(tok) for tok in [self.audio_token, self.image_token, self.video_token]]
             pattern = "|".join(special_tokens)
-            positions = sorted([(match.start(), match.group()) for match in re.finditer(pattern, sample)])
-            positions.sort(key=lambda x: x[0])
+            positions = [(match.start(), match.group()) for match in re.finditer(pattern, sample)]
+            positions.sort(key=lambda x: x[0])

312-331: Misleading variable name is_batched.

The logic is inverted: is_batched = True is set when the input is a single conversation (dict), not when it's batched. The variable name is confusing even though the code works correctly.

♻️ Suggested clarification
     def apply_chat_template(self, conversations, chat_template=None, **kwargs):
-        is_batched = False
+        was_single = False
         if isinstance(conversations[0], dict):
             conversations = [conversations]
-            is_batched = True
+            was_single = True

         for conversation in conversations:
             # ... validation ...
-        if is_batched:
+        if was_single:
             conversations = conversations[0]

         return super().apply_chat_template(conversations, chat_template, **kwargs)

Comment on lines +24 to +29
```cpp
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
```

⚠️ Potential issue | 🟠 Major

Reject unsupported model_version values.
Any value other than "v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 24 - 29, The code
currently treats any model_version other than "v2" as kV1 leading to silent
misloads; update the validation around the model_version.get() check (where
file_version and mllm::ModelFileVersion are set) to explicitly accept only "v1"
or "v2" and otherwise print a clear error (including the invalid value) and
terminate (e.g., std::cerr + std::exit(1) or throw) so unsupported values do not
fall back silently to kV1.

Comment thread examples/qwen2_5omni/audio_infer.cpp Outdated
Comment on lines +51 to +59
fmt::print("Audio path (or 'exit/quit'): ");
//std::getline(std::cin, audio_path);
//if (audio_path == "exit" || audio_path == "quit") { return 0; }
audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";

fmt::print("Prompt text: ");
//std::getline(std::cin, prompt_text);
//if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
prompt_text = "复述这段音频";

⚠️ Potential issue | 🟠 Major

Remove hardcoded audio path/prompt.
The CLI ignores user input and uses a local absolute path, which breaks usability for everyone else. Restore stdin input and keep a default prompt if empty.

🎤 Restore interactive input
     fmt::print("Audio path (or 'exit/quit'): ");
-    //std::getline(std::cin, audio_path);
-    //if (audio_path == "exit" || audio_path == "quit") { return 0; }
-    audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";
+    std::getline(std::cin, audio_path);
+    if (audio_path == "exit" || audio_path == "quit") { return 0; }
 
     fmt::print("Prompt text: ");
-    //std::getline(std::cin, prompt_text);
-    //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-    prompt_text = "复述这段音频";
+    std::getline(std::cin, prompt_text);
+    if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-    fmt::print("Audio path (or 'exit/quit'): ");
-    //std::getline(std::cin, audio_path);
-    //if (audio_path == "exit" || audio_path == "quit") { return 0; }
-    audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";
-
-    fmt::print("Prompt text: ");
-    //std::getline(std::cin, prompt_text);
-    //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-    prompt_text = "复述这段音频";
+    fmt::print("Audio path (or 'exit/quit'): ");
+    std::getline(std::cin, audio_path);
+    if (audio_path == "exit" || audio_path == "quit") { return 0; }
+
+    fmt::print("Prompt text: ");
+    std::getline(std::cin, prompt_text);
+    if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
```
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 51 - 59, The CLI currently
ignores stdin by hardcoding audio_path and prompt_text; revert to reading user
input using std::getline for audio_path and prompt_text (restore the commented
lines) instead of the absolute path and hardcoded prompt, and re-add checks to
exit on "exit"/"quit" for audio_path and to set a default prompt (e.g., "Please
describe the audio.") when prompt_text is empty; update the variables audio_path
and prompt_text in examples/qwen2_5omni/audio_infer.cpp accordingly and remove
the local absolute assignment.

Comment on lines +31 to +35
```cpp
  if (help.isSet()) {
    Argparse::printHelp();
    mllm::shutdownContext();
    return 0;
  }
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n examples/qwen2_5omni/image_infer.cpp | head -40

Repository: UbiquitousLearning/mllm

Length of output: 1898


🏁 Script executed:

rg -n "required|help.*isSet|isSet.*help" examples/qwen2_5omni/image_infer.cpp -B2 -A2

Repository: UbiquitousLearning/mllm

Length of output: 737


🏁 Script executed:

# Check how other examples handle help flags with required arguments
rg -n "help.*isSet|isSet.*help" examples/ --type cpp -B3 -A3

Repository: UbiquitousLearning/mllm

Length of output: 5873


🏁 Script executed:

# Find the Argparse implementation
fd -t f "argparse" --type f -e hpp -e cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 90


🏁 Script executed:

# Search for Argparse class definition
rg -n "class Argparse" --type cpp --type hpp

Repository: UbiquitousLearning/mllm

Length of output: 94


🏁 Script executed:

# Read the Argparse implementation
cat mllm/utils/Argparse.hpp | head -200

Repository: UbiquitousLearning/mllm

Length of output: 5404


🏁 Script executed:

# Search for parse method in Argparse
rg -n "void parse|bool parse" mllm/utils/Argparse.hpp -A 10

Repository: UbiquitousLearning/mllm

Length of output: 1508


🏁 Script executed:

# Get the rest of the parse method to see the requirements checking
cat mllm/utils/Argparse.hpp | sed -n '149,220p'

Repository: UbiquitousLearning/mllm

Length of output: 2351


Help flag check occurs after required argument validation, preventing -h from working.

The -h|--help flag is checked at line 31, but Argparse::parse() validates all required arguments (model_path, model_version, tokenizer_path, config_path) at the end of parsing. Running with just -h will fail during parse() before help.isSet() is evaluated, so help will never be displayed. Move the help check before or immediately after Argparse::parse(), or modify the Argparse class to handle help as a special case that bypasses requirement validation.

🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 31 - 35, Move the help
handling so -h/--help bypasses required-argument validation: call
Argparse::printHelp() and mllm::shutdownContext() when help.isSet() before
running the code path that enforces required args (i.e., before or immediately
after Argparse::parse() but before any required-argument validation), or adjust
Argparse to treat help as a special case; locate the current help check
(help.isSet()), the parser invocation (Argparse::parse()), and the help output
(Argparse::printHelp()) to implement this change.

Comment on lines +49 to +59
```cpp
    std::string image_path;
    std::string prompt_text;

    fmt::print("Image path (or 'exit/quit'): ");
    image_path = "../../../mllm2-former/mllm/rsc/pics.jpg";
    //std::getline(std::cin, image_path);
    if (image_path == "exit" || image_path == "quit") { return 0; }

    fmt::print("Prompt text: ");
    prompt_text = "描述图片中物体";
    //std::getline(std::cin, prompt_text);
```

⚠️ Potential issue | 🟠 Major

Remove debug artifacts: hard-coded paths and commented-out interactive code.

The image path and prompt text are hard-coded with std::getline calls commented out. This breaks the interactive CLI functionality described in the UI prompts.

🐛 Proposed fix to restore interactive input
     std::string image_path;
     std::string prompt_text;

     fmt::print("Image path (or 'exit/quit'): ");
-    image_path = "../../../mllm2-former/mllm/rsc/pics.jpg";
-    //std::getline(std::cin, image_path);
+    std::getline(std::cin, image_path);
     if (image_path == "exit" || image_path == "quit") { return 0; }

     fmt::print("Prompt text: ");
-    prompt_text = "描述图片中物体";
-    //std::getline(std::cin, prompt_text);
+    std::getline(std::cin, prompt_text);
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 49 - 59, The code contains
debug artifacts: hard-coded assignments to image_path and prompt_text and
commented-out interactive input; restore interactive CLI by removing the
hard-coded assignments and re-enabling the std::getline calls for image_path and
prompt_text, keep the existing prompt prints and the exit/quit check, and ensure
you trim or validate the read lines if necessary (look for variables image_path,
prompt_text and the std::getline usage in this block).

Comment on lines +24 to +29
```cpp
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
```

⚠️ Potential issue | 🟠 Major

Reject unsupported model_version values.
Any value other than "v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/text_infer.cpp` around lines 24 - 29, The current logic
silently maps any model_version other than "v2" to kV1; update the branch that
sets file_version (variable file_version of type mllm::ModelFileVersion) to
explicitly validate model_version.get(): accept "v1" ->
mllm::ModelFileVersion::kV1, "v2" -> mllm::ModelFileVersion::kV2, and for any
other value print a clear error (including the invalid value) and exit/throw
instead of falling back, so callers of text_infer (where model_version is used)
cannot accidentally load wrong weights.

Comment on lines +71 to +73
```python
                        assert _check_if_video_has_audio(
                            path
                        ), "Video must has audio track when use_audio_in_video=True"
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "audio_process.py" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 125


🏁 Script executed:

cat -n mllm/models/qwen2_5omni/python_src_code/audio_process.py | sed -n '60,85p'

Repository: UbiquitousLearning/mllm

Length of output: 1595


Replace assert with a runtime exception.
Assertions are stripped when Python runs with the -O optimization flag, causing this validation to silently fail in production. Since surrounding code (lines 65, 81) uses ValueError for similar input validation failures, this should be consistent.

🛠️ Proposed fix
-                        assert _check_if_video_has_audio(
-                            path
-                        ), "Video must has audio track when use_audio_in_video=True"
+                        if not _check_if_video_has_audio(path):
+                            raise ValueError("Video must have an audio track when use_audio_in_video=True")
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/audio_process.py` around lines 71 -
73, Replace the assert check that calls _check_if_video_has_audio(path) with an
explicit runtime exception: if use_audio_in_video is True and
_check_if_video_has_audio(path) returns False, raise a ValueError with the same
message ("Video must has audio track when use_audio_in_video=True"); this
mirrors the existing validation style used around lines 65 and 81 and ensures
the check isn't skipped under Python -O. Reference the existing helper
_check_if_video_has_audio and the local variable path (and the
use_audio_in_video flag) when making the change.

Comment on lines +1 to +201
[Quoted: the full text of the Apache License, Version 2.0 (with a "Copyright 2025 Alibaba Cloud" appendix notice), as contained in the LICENSE file under review; omitted here for brevity.]

⚠️ Potential issue | 🔴 Critical

File extension required by coding guidelines.

The filename LICENSE lacks an extension, which violates the coding guidelines stating: "Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml."

Rename the file to LICENSE.txt or LICENSE.md to comply with the guidelines.

Based on coding guidelines.

🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/LICENSE` around lines 1 - 201, the
LICENSE file name violates the project's allowed extensions; rename the file
named "LICENSE" to include a permitted extension (e.g., "LICENSE.txt" or
"LICENSE.md") so it matches coding guidelines; update any references to
"LICENSE" in packaging or docs to point to the new filename to avoid broken
links.

Comment thread on mllm/models/qwen2_5omni/python_src_code/name.py (outdated)
Comment thread on mllm/models/qwen2_5omni/python_src_code/README.md (outdated)
Comment on lines +106 to +111
elif image.startswith("http://") or image.startswith("https://"):
    # fix memory leak issue while using BytesIO
    with requests.get(image, stream=True) as response:
        response.raise_for_status()
        with BytesIO(response.content) as bio:
            image_obj = copy.deepcopy(Image.open(bio))

⚠️ Potential issue | 🟠 Major

Missing timeout in HTTP request can cause indefinite hangs.

The requests.get() call lacks a timeout parameter, which can cause the application to hang indefinitely if the server is unresponsive. This is particularly problematic in production environments.

Proposed fix
-    with requests.get(image, stream=True) as response:
+    with requests.get(image, stream=True, timeout=30) as response:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif image.startswith("http://") or image.startswith("https://"):
    # fix memory leak issue while using BytesIO
    with requests.get(image, stream=True) as response:
        response.raise_for_status()
        with BytesIO(response.content) as bio:
            image_obj = copy.deepcopy(Image.open(bio))
elif image.startswith("http://") or image.startswith("https://"):
    # fix memory leak issue while using BytesIO
    with requests.get(image, stream=True, timeout=30) as response:
        response.raise_for_status()
        with BytesIO(response.content) as bio:
            image_obj = copy.deepcopy(Image.open(bio))
🧰 Tools
🪛 Ruff (0.14.13)

108-108: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/vision_process.py` around lines 106 -
111, add a timeout to the HTTP fetch to avoid hangs: update the requests.get
call in the image URL branch (the block that checks image.startswith("http://")
or "https://") to pass a timeout (e.g., timeout=REQUEST_TIMEOUT or timeout=10) —
preferably use a module-level constant like REQUEST_TIMEOUT so it’s configurable
— and keep using the response context manager, response.raise_for_status(), and
the BytesIO/Image.open flow unchanged; also ensure any callers/handlers will
surface requests.exceptions.ReadTimeout/ConnectTimeout if needed.

chenghuaWang and others added 9 commits January 23, 2026 15:43
…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
… debug print statements from Qwen3DecoderLayer
…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
…d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
…es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
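
One of the commit messages above describes new CPU fill kernels (zeros, ones, specific values, arange, random) added behind a dynamic-dispatch layer in fill-inl.hpp. As a rough illustration of the pattern only — the names and signatures below are illustrative, not mllm's actual fill-inl.hpp API:

#include <cstddef>

// Illustrative scalar reference for an arange-style fill; a real fill-inl.hpp
// would pair this with SIMD specializations selected by the kernel dispatch.
template <typename T>
void fill_arange(T* dst, std::size_t n, T start, T step) {
  T value = start;
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = value;
    value += step;
  }
}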

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around lines 16-35: The help flag is never reachable because Argparse::parse()
enforces required options first; before calling Argparse::parse(argc, argv)
detect help early (e.g., scan argv for "-h" or "--help") and if present call
Argparse::printHelp() and mllm::shutdownContext() then return 0, or
alternatively use an Argparse::parse variant/option that skips required-arg
validation for help; update the code around Argparse::parse, help.isSet(),
Argparse::printHelp, and mllm::shutdownContext to ensure help is handled before
required-arg validation.

In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json`:
- Around lines 5-7: The config enables unsupported talker components despite the
PR stating "without talker": set "enable_audio_output" and "enable_talker" to
false (or add an explicit comment stating they are placeholders) in the JSON for
the qwen2_5_omni model to prevent runtime attempts to initialize talker/audio
subsystems; update any documentation or inline note near the "model_type":
"qwen2_5_omni" entry to clarify that talker is not yet supported if you choose
the placeholder route.

In `@examples/qwen2_5omni/image_infer.cpp`:
- Around lines 24-29: The code silently defaults file_version to
mllm::ModelFileVersion::kV1 which can load wrong weights; change the logic in
the block that reads model_version.get() so it validates allowed values
explicitly (e.g., "v1" -> kV1, "v2" -> kV2) and on any other value fail fast by
returning an error/throwing an exception or logging and exiting with a clear
message that the model_version is unsupported; update references to file_version
and the model-version parsing branch (where mllm::ModelFileVersion file_version
is set) to enforce this validation rather than falling back to kV1.
♻️ Duplicate comments (4)
examples/qwen2_5omni/image_infer.cpp (2)

16-34: -h/--help is still blocked by required-arg validation.

Argparse validates required args inside parse(), so the help check after parsing won’t run when -h is the only flag. Handle help before validation or make Argparse treat help as a bypass.

💡 One way to bypass required-arg validation
+#include <string_view>
 ...
   auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);

+  for (int i = 1; i < argc; ++i) {
+    const std::string_view arg{argv[i]};
+    if (arg == "-h" || arg == "--help") {
+      Argparse::printHelp();
+      mllm::shutdownContext();
+      return 0;
+    }
+  }
   Argparse::parse(argc, argv);

52-59: Restore interactive input for image_path and prompt_text.

Currently the CLI ignores stdin and uses empty strings, which makes inference unusable. Re-enable std::getline and keep the exit/quit check.

🐛 Proposed fix
     fmt::print("Image path (or 'exit/quit'): ");
-    image_path = "";
-    //std::getline(std::cin, image_path);
+    std::getline(std::cin, image_path);
     if (image_path == "exit" || image_path == "quit") { return 0; }

     fmt::print("Prompt text: ");
-    prompt_text = "";
-    //std::getline(std::cin, prompt_text);
+    std::getline(std::cin, prompt_text);
examples/qwen2_5omni/audio_infer.cpp (2)

24-29: Reject unsupported model_version values.

Any value other than "v1"/"v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }

51-59: Restore interactive stdin input for audio path and prompt.

The current code ignores user input and uses empty strings, which breaks the CLI. Re-enable std::getline, keep the exit/quit check, and restore the default prompt when empty.

🎤 Proposed fix
     fmt::print("Audio path (or 'exit/quit'): ");
-    //std::getline(std::cin, audio_path);
-    //if (audio_path == "exit" || audio_path == "quit") { return 0; }
-    audio_path = "";
+    std::getline(std::cin, audio_path);
+    if (audio_path == "exit" || audio_path == "quit") { return 0; }

     fmt::print("Prompt text: ");
-    //std::getline(std::cin, prompt_text);
-    //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-    prompt_text = "";
+    std::getline(std::cin, prompt_text);
+    if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }

Comment on lines +16 to +35
auto& help = Argparse::add<bool>("-h|--help").help("Show help message");
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true);
auto& model_version = Argparse::add<std::string>("-mv|--model_version").help("Model version").required(true);
auto& tokenizer_path = Argparse::add<std::string>("-t|--tokenizer_path").help("Tokenizer directory").required(true);
auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);

Argparse::parse(argc, argv);

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}

⚠️ Potential issue | 🟠 Major

-h/--help will never show if required args are missing.

Argparse::parse() enforces required options before help.isSet() is checked. Handle help before validation (or modify Argparse to bypass required checks for help).

💡 One way to bypass required-arg validation
+#include <string_view>
 ...
   auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);

+  for (int i = 1; i < argc; ++i) {
+    const std::string_view arg{argv[i]};
+    if (arg == "-h" || arg == "--help") {
+      Argparse::printHelp();
+      mllm::shutdownContext();
+      return 0;
+    }
+  }
   Argparse::parse(argc, argv);
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 16 - 35, the help flag is
never reachable because Argparse::parse() enforces required options first;
before calling Argparse::parse(argc, argv) detect help early (e.g., scan argv
for "-h" or "--help") and if present call Argparse::printHelp() and
mllm::shutdownContext() then return 0, or alternatively use an Argparse::parse
variant/option that skips required-arg validation for help; update the code
around Argparse::parse, help.isSet(), Argparse::printHelp, and
mllm::shutdownContext to ensure help is handled before required-arg validation.
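
Since the same pre-parse scan is needed in all three example CLIs (text, image, and audio), the workaround could live in one small shared helper. A minimal sketch, assuming a hypothetical hasHelpFlag free function that is not part of mllm's Argparse API:

#include <string_view>

// Hypothetical helper: scan raw argv for the help flag before
// Argparse::parse() runs its required-argument validation.
static bool hasHelpFlag(int argc, char** argv) {
  for (int i = 1; i < argc; ++i) {
    const std::string_view arg{argv[i]};
    if (arg == "-h" || arg == "--help") { return true; }
  }
  return false;
}

Each example's main() would then check it before parsing:

  if (hasHelpFlag(argc, argv)) {
    Argparse::printHelp();
    mllm::shutdownContext();
    return 0;
  }
  Argparse::parse(argc, argv);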

Comment on lines +5 to +7
"enable_audio_output": true,
"enable_talker": true,
"model_type": "qwen2_5_omni",

⚠️ Potential issue | 🟠 Major

Disable talker/audio-output flags until talker is actually supported.

The PR objective says “without talker,” but enable_audio_output / enable_talker are true. If runtime respects these flags, it may try to initialize unsupported components or confuse users. Suggest disabling them (or explicitly documenting that they’re placeholders) until talker is ready.

💡 Proposed change
-  "enable_audio_output": true,
-  "enable_talker": true,
+  "enable_audio_output": false,
+  "enable_talker": false,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"enable_audio_output": true,
"enable_talker": true,
"model_type": "qwen2_5_omni",
"enable_audio_output": false,
"enable_talker": false,
"model_type": "qwen2_5_omni",
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json` around lines 5 - 7, the
config enables unsupported talker components despite the PR stating "without
talker": set "enable_audio_output" and "enable_talker" to false (or add an
explicit comment stating they are placeholders) in the JSON for the qwen2_5_omni
model to prevent runtime attempts to initialize talker/audio subsystems; update
any documentation or inline note near the "model_type": "qwen2_5_omni" entry to
clarify that talker is not yet supported if you choose the placeholder route.

Comment on lines +24 to +29
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

⚠️ Potential issue | 🟠 Major

Reject unsupported model_version values instead of defaulting to v1.

Silent fallback to kV1 can load incorrect weights. Validate and fail fast on unknown values.

🛡️ Suggested validation
   mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
   if (model_version.get() == "v1") {
     file_version = mllm::ModelFileVersion::kV1;
   } else if (model_version.get() == "v2") {
     file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported model_version: {}\n", model_version.get());
+    mllm::shutdownContext();
+    return 1;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
} else {
  fmt::print("Unsupported model_version: {}\n", model_version.get());
  mllm::shutdownContext();
  return 1;
}
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 24 - 29, the code silently
defaults file_version to mllm::ModelFileVersion::kV1 which can load wrong
weights; change the logic in the block that reads model_version.get() so it
validates allowed values explicitly (e.g., "v1" -> kV1, "v2" -> kV2) and on any
other value fail fast by returning an error/throwing an exception or logging and
exiting with a clear message that the model_version is unsupported; update
references to file_version and the model-version parsing branch (where
mllm::ModelFileVersion file_version is set) to enforce this validation rather
than falling back to kV1.
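
Because the identical v1/v2 branch appears in both audio_infer.cpp and image_infer.cpp, the validation could also be factored into one shared helper so every example CLI fails fast consistently. A minimal sketch, assuming a hypothetical parseModelFileVersion function that is not an existing mllm API:

#include <optional>
#include <string>

// Hypothetical shared helper: map the --model_version string to an enum
// value, returning std::nullopt for anything other than "v1"/"v2" so the
// caller can report an error instead of silently loading kV1 weights.
std::optional<mllm::ModelFileVersion> parseModelFileVersion(const std::string& s) {
  if (s == "v1") { return mllm::ModelFileVersion::kV1; }
  if (s == "v2") { return mllm::ModelFileVersion::kV2; }
  return std::nullopt;
}

Usage in each example's main():

  const auto file_version = parseModelFileVersion(model_version.get());
  if (!file_version) {
    fmt::print("Unsupported model_version: {}\n", model_version.get());
    mllm::shutdownContext();
    return 1;
  }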
