feat: Add support for Qwen2.5-Omni model (without talker) #610

KKkai0315 wants to merge 14 commits into UbiquitousLearning:main
Conversation
Caution: Review failed. The pull request is closed.

📝 Walkthrough

Adds full Qwen2.5-Omni multimodal support: build entries and CLI examples, tokenizer with vision/audio message handling, audio preprocessing, comprehensive multimodal model headers and configuration, many backend/kernel fill utilities, QNN backend hardening, PyMllm FFI/type/device exports, and several PyTorch quantization/observer utilities.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as Example CLI
    participant AudioPrep as AudioPreprocessor
    participant Tokenizer
    participant Model as Qwen2_5OmniForCausalLM
    participant Output
    User->>CLI: supply audio file + prompt
    CLI->>AudioPrep: load WAV, extract Mel features
    AudioPrep-->>CLI: audio features + lengths
    CLI->>Tokenizer: convert prompt + audio features -> token seq + modality metadata
    Tokenizer-->>CLI: token ids + audio/image tensors
    CLI->>Model: forward(tokens, audio_features, modality metadata)
    Model->>Model: encode audio/vision, decode text (kv cache)
    Model-->>CLI: streamed token logits
    CLI->>Tokenizer: detokenize logits -> text
    CLI-->>Output: stream text to stdout
```
```mermaid
sequenceDiagram
    participant WAV as Audio File
    participant Reader as WAV Reader
    participant STFT as STFT/Hann
    participant Mel as Mel Filterbank
    participant Norm as Log/Normalize
    participant Features
    WAV->>Reader: read samples (resample if needed)
    Reader->>STFT: frame & window (Hann), compute FFT
    STFT->>Mel: apply power, project via Mel filterbank
    Mel->>Norm: log-compress, clip, normalize
    Norm-->>Features: final Mel spectrogram tensor
```
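The stages in the second diagram can be sketched end-to-end in NumPy. This is a hedged illustration of the same pipeline (framing with a periodic Hann window, FFT power, Mel projection, log-compression, clipping, normalization), not the mllm implementation; the parameters `n_fft=400`, `hop=160`, `n_mels=128` and the Whisper-style normalization are assumptions:

```python
import numpy as np

def hann_periodic(n_fft: int) -> np.ndarray:
    # Periodic Hann window: 0.5 - 0.5*cos(2*pi*k/N)
    k = np.arange(n_fft)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * k / n_fft)

def mel_filterbank(sr: int, n_fft: int, n_mels: int) -> np.ndarray:
    # Triangular filters evenly spaced on the HTK mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    freqs = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * freqs / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, center, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, center):
            fb[m - 1, k] = (k - lo) / max(center - lo, 1)
        for k in range(center, hi):
            fb[m - 1, k] = (hi - k) / max(hi - center, 1)
    return fb

def log_mel(samples: np.ndarray, sr=16000, n_fft=400, hop=160, n_mels=128) -> np.ndarray:
    window = hann_periodic(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    frames = np.stack([samples[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2       # STFT -> power spectrum
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T       # project onto mel bands
    log_spec = np.log10(np.maximum(mel, 1e-10))             # log-compress
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)   # clip dynamic range
    return (log_spec + 4.0) / 4.0                           # normalize (Whisper-style)
```

On one second of 16 kHz audio this yields a `(98, 128)` feature matrix; the real preprocessor additionally handles resampling, padding, and batching.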
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 14
Note: Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 51-59: The CLI currently ignores stdin by hardcoding audio_path
and prompt_text; revert to reading user input using std::getline for audio_path
and prompt_text (restore the commented lines) instead of the absolute path and
hardcoded prompt, and re-add checks to exit on "exit"/"quit" for audio_path and
to set a default prompt (e.g., "Please describe the audio.") when prompt_text is
empty; update the variables audio_path and prompt_text in
examples/qwen2_5omni/audio_infer.cpp accordingly and remove the local absolute
assignment.
- Around line 24-29: The code currently treats any model_version other than "v2"
as kV1 leading to silent misloads; update the validation around the
model_version.get() check (where file_version and mllm::ModelFileVersion are
set) to explicitly accept only "v1" or "v2" and otherwise print a clear error
(including the invalid value) and terminate (e.g., std::cerr + std::exit(1) or
throw) so unsupported values do not fall back silently to kV1.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 49-59: The code contains debug artifacts: hard-coded assignments
to image_path and prompt_text and commented-out interactive input; restore
interactive CLI by removing the hard-coded assignments and re-enabling the
std::getline calls for image_path and prompt_text, keep the existing prompt
prints and the exit/quit check, and ensure you trim or validate the read lines
if necessary (look for variables image_path, prompt_text and the std::getline
usage in this block).
- Around line 31-35: Move the help handling so -h/--help bypasses
required-argument validation: call Argparse::printHelp() and
mllm::shutdownContext() when help.isSet() before running the code path that
enforces required args (i.e., before or immediately after Argparse::parse() but
before any required-argument validation), or adjust Argparse to treat help as a
special case; locate the current help check (help.isSet()), the parser
invocation (Argparse::parse()), and the help output (Argparse::printHelp()) to
implement this change.
In `@examples/qwen2_5omni/text_infer.cpp`:
- Around line 24-29: The current logic silently maps any model_version other
than "v2" to kV1; update the branch that sets file_version (variable
file_version of type mllm::ModelFileVersion) to explicitly validate
model_version.get(): accept "v1" -> mllm::ModelFileVersion::kV1, "v2" ->
mllm::ModelFileVersion::kV2, and for any other value print a clear error
(including the invalid value) and exit/throw instead of falling back, so callers
of text_infer (where model_version is used) cannot accidentally load wrong
weights.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 197-206: Add a validation inside the loop where spatial merging is
computed (the block that reads dims and computes t,h,w, spatial_merge_size,
num_h_blocks, num_w_blocks, total_blocks, block_area) to ensure h %
spatial_merge_size == 0 and w % spatial_merge_size == 0; if not, raise/abort
with a clear error (or use an assert) describing invalid spatial_merge_size vs
H/W so rotary position alignment cannot silently miscompute. Ensure the check
references the same symbols (h, w, spatial_merge_size,
num_h_blocks/num_w_blocks) and runs before computing
num_h_blocks/num_w_blocks/total_blocks.
- Around line 568-575: The code assumes seq_len is divisible by
spatial_merge_unit_ before calling hidden_states.view({seq_len /
spatial_merge_unit_, spatial_merge_unit_, -1}), so add a guard that checks
seq_len % spatial_merge_unit_ == 0 and handle the non-divisible case (e.g.,
trim/pad hidden_states or raise a clear error) before the view call; update the
block around seq_len, spatial_merge_unit_, and hidden_states.view to perform the
check and decide whether to reshape, pad, or throw, ensuring any error message
references seq_len and spatial_merge_unit_ for easier debugging.
- Around line 95-105: The function makeWindowIndex must validate window_size,
spatial_merge_size, and patch_size before computing vit_merger_window_size and
spatial_merge_unit: check spatial_merge_size and patch_size are non-zero and
that window_size is divisible by (spatial_merge_size * patch_size) (or that
window_size / spatial_merge_size / patch_size produces the intended integer) and
fail-fast (MLLM_RT_ASSERT or throw) with a clear message referencing grid_thw
and the offending parameter(s); additionally validate spatial_merge_size > 0
before computing spatial_merge_unit to avoid division/multiplication errors and
ensure window_index construction only proceeds when these guard conditions pass.
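The three comments above all ask for the same family of guard: dimensions must divide evenly before any merge/window arithmetic proceeds. A minimal Python sketch of such a check (the names `h`, `w`, `spatial_merge_size` mirror the C++ symbols, but the function itself is hypothetical, not mllm code):

```python
def validate_spatial_merge(h: int, w: int, spatial_merge_size: int) -> tuple[int, int]:
    """Return (num_h_blocks, num_w_blocks), failing fast on misaligned grids."""
    if spatial_merge_size <= 0:
        raise ValueError(f"spatial_merge_size must be positive, got {spatial_merge_size}")
    if h % spatial_merge_size != 0 or w % spatial_merge_size != 0:
        # Fail fast instead of letting rotary position alignment silently miscompute.
        raise ValueError(
            f"H/W ({h}x{w}) must be divisible by spatial_merge_size={spatial_merge_size}"
        )
    return h // spatial_merge_size, w // spatial_merge_size
```

The same shape of guard applies to `seq_len % spatial_merge_unit_` and to `window_size` versus `spatial_merge_size * patch_size`.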
In `@mllm/models/qwen2_5omni/python_src_code/audio_process.py`:
- Around line 71-73: Replace the assert check that calls
_check_if_video_has_audio(path) with an explicit runtime exception: if
use_audio_in_video is True and _check_if_video_has_audio(path) returns False,
raise a ValueError with the same message ("Video must has audio track when
use_audio_in_video=True"); this mirrors the existing validation style used
around lines 65 and 81 and ensures the check isn't skipped under Python -O.
Reference the existing helper _check_if_video_has_audio and the local variable
path (and the use_audio_in_video flag) when making the change.
- Around line 11-16: The _check_if_video_has_audio function currently opens a
PyAV container with av.open(video_path) and never closes it, leaking file
descriptors; update it to use a context manager (with av.open(video_path) as
container:) or ensure container.close() in a finally block so the container is
always closed even on exceptions, keeping the existing logic that inspects
container.streams for audio streams.
In `@mllm/models/qwen2_5omni/python_src_code/LICENSE`:
- Around line 1-201: The LICENSE file name violates the project's allowed
extensions; rename the file named "LICENSE" to include a permitted extension
(e.g., "LICENSE.txt" or "LICENSE.md") so it matches coding guidelines; update
any references to "LICENSE" in packaging or docs to point to the new filename to
avoid broken links.
In `@mllm/models/qwen2_5omni/python_src_code/name.py`:
- Around line 27-32: The loop currently calls f.get_tensor(key) and computes
unused variables i, shape, and dtype, which loads full tensors into memory;
replace this by using safetensors metadata access (e.g., f.get_metadata or
f.metadata[key]) to read shape and dtype without materializing the tensor, or if
you only need the key, remove the get_tensor call and the unused variables
entirely; update the loop that iterates over keys to either log the metadata via
the safetensors metadata API or simply print the key without calling
f.get_tensor(key).
In `@mllm/models/qwen2_5omni/python_src_code/README.md`:
- Around line 740-749: The README shows audio output code for model.generate and
uses processor.batch_decode and sf.write, but the PR is text-only; update the
examples to explicitly disable audio (call model.generate(...,
return_audio=False or remove use_audio_in_video) and only process text via
processor.batch_decode) and remove or guard any sf.write/audio handling; also
add a clear "audio/talker not supported yet" note near the generation examples
(including the other audio example blocks) so users aren’t misled.
In `@mllm/models/qwen2_5omni/python_src_code/vision_process.py`:
- Around line 106-111: Add a timeout to the HTTP fetch to avoid hangs: update
the requests.get call in the image URL branch (the block that checks
image.startswith("http://") or "https://") to pass a timeout (e.g.,
timeout=REQUEST_TIMEOUT or timeout=10) — preferably use a module-level constant
like REQUEST_TIMEOUT so it’s configurable — and keep using the response context
manager, response.raise_for_status(), and the BytesIO/Image.open flow unchanged;
also ensure any callers/handlers will surface
requests.exceptions.ReadTimeout/ConnectTimeout if needed.
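A sketch of the suggested shape (the module-level `REQUEST_TIMEOUT` constant and the `fetch_image` wrapper are illustrative names, not the actual vision_process.py code; the third-party imports are deferred so the snippet stands alone):

```python
from io import BytesIO

REQUEST_TIMEOUT = 10  # seconds; module-level so callers can tune it

def fetch_image(image: str):
    """Load a PIL image from an http(s) URL with a bounded wait."""
    import requests        # third-party; deferred so absence is localized to this call
    from PIL import Image  # third-party
    with requests.get(image, stream=True, timeout=REQUEST_TIMEOUT) as response:
        response.raise_for_status()
        return Image.open(BytesIO(response.content))
```

Callers then see `requests.exceptions.ConnectTimeout`/`ReadTimeout` instead of hanging indefinitely.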
🟡 Minor comments (13)
mllm/models/qwen2_5omni/python_src_code/name.py-1-5 (1)

1-5: Filename does not match docstring.
The file is named `name.py` but the docstring references `print_safetensors_keys.py`. Consider renaming the file to `print_safetensors_keys.py` for consistency and clarity.

mllm/models/qwen2_5omni/python_src_code/README.md-760-763 (1)

760-763: Use descriptive link text instead of "here."
markdownlint MD059 prefers descriptive anchor text for accessibility.

mllm/models/qwen2_5omni/python_src_code/README.md-672-681 (1)

672-681: Specify languages on fenced code blocks.
markdownlint MD040 flags missing languages on the install/error and system-prompt snippets.

✏️ Proposed update

````diff
-```
+```bash
 pip uninstall transformers
 pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
 pip install accelerate
 ```

-```
+```text
 KeyError: 'qwen2_5_omni'
 ```

-```
+```json
 {
   "role": "system",
   "content": [
     {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
   ],
 }
 ```
````

Also applies to: 875-882

mllm/models/qwen2_5omni/python_src_code/audio_process.py-35-37 (1)

35-37: Guard against empty input and invalid time ranges.
`conversations[0]` will raise on empty input, and negative durations can be produced when `*_end < *_start`. Consider explicit validation. As per coding guidelines, validate inputs for public APIs and critical internal functions.

🛠️ Proposed fix

```diff
-    audios = []
-    if isinstance(conversations[0], dict):
+    if not conversations:
+        return None
+    audios = []
+    if isinstance(conversations[0], dict):
         conversations = [conversations]
@@
-                    audio_start = ele.get("audio_start", 0.0)
-                    audio_end = ele.get("audio_end", None)
+                    audio_start = ele.get("audio_start", 0.0)
+                    audio_end = ele.get("audio_end", None)
+                    if audio_end is not None and audio_end < audio_start:
+                        raise ValueError("audio_end must be >= audio_start")
@@
-                    audio_start = ele.get("video_start", 0.0)
-                    audio_end = ele.get("video_end", None)
+                    audio_start = ele.get("video_start", 0.0)
+                    audio_end = ele.get("video_end", None)
+                    if audio_end is not None and audio_end < audio_start:
+                        raise ValueError("video_end must be >= video_start")
```

Also applies to: 46-47, 69-70, 88-90
mllm/models/qwen2_5omni/python_src_code/README.md-656-665 (1)

656-665: Replace the hard tab in the Text → Text table.
markdownlint MD010 flags a hard tab here; replace it with spaces to avoid rendering quirks.

mllm/models/qwen2_5omni/python_src_code/README.md-23-25 (1)

23-25: Add alt text to images for accessibility.
The HTML image tags here don't include `alt` text (MD045). Please add descriptive alt attributes.

Also applies to: 41-43, 49-50

mllm/models/qwen2_5omni/python_src_code/README.md-492-529 (1)

492-529: Add blank lines around Markdown tables.
markdownlint MD058 requires blank lines before/after tables in these sections.

Also applies to: 532-541, 653-668, 766-775
mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp-119-132 (1)

119-132: Default constructor leaves critical members uninitialized.
The default constructor `MelSpectrogramFeatures() = default` leaves `stft_`, `window_`, and `melscale_fbanks_` in an indeterminate state. If `forward()` is called on a default-constructed instance, it will cause undefined behavior. Consider either removing the default constructor or initializing members to safe defaults.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp-942-952 (1)

942-952: Handle unsupported attention dtypes explicitly.
If `key_states` isn't float32/float16, `attn` stays unset. Add an error branch to avoid undefined behavior.

⚠️ Add explicit dtype handling

```diff
 if (key_states.dtype() == kFloat32) {
   attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
   attn = mask_(attn);
   attn = softmax_(attn);
 } else if (key_states.dtype() == kFloat16) {
   attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
   attn = mask_(attn);
   attn = softmax_(attn);
   attn = attn.to(kFloat16);
+} else {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported attention dtype for Qwen2.5-Omni.");
 }
```

mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-290-292 (1)

290-292: Docstring parameter name mismatch.
The docstring references `t_ntoken_per_chunk` but the actual parameter is `tokens_per_chunk`.

📝 Fix docstring

```diff
 Parameters:
     token_indices (`np.ndarray`):
         A monotonically increasing list of token index values.
-    t_ntoken_per_chunk (`int`): Number of tokens per chunk (used as the chunk size threshold).
+    tokens_per_chunk (`int`): Number of tokens per chunk (used as the chunk size threshold).
```

mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-369-385 (1)

369-385: Typos in docstring and error message.

- Line 370: `Inion` should be `Union`
- Lines 384-385: the error message is missing a closing backtick after "audio"

📝 Fix typos

```diff
 Returns:
-    `list[Inion[str, np.ndarray]]`: The decoded text or generated audio.
+    `list[Union[str, np.ndarray]]`: The decoded text or generated audio.
 """
 if generation_mode is None or generation_mode == "text":
     return self.post_process_image_text_to_text(
         generated_outputs, skip_special_tokens=skip_special_tokens, **kwargs
     )
 elif generation_mode == "audio":
     # model supports only bs=1, so we will never get several audio outputs
     audio = generated_outputs[1].reshape(-1).detach().cpu().numpy()
     return [audio]
 else:
     raise ValueError(
-        f"{self.__class__.__name__} got an unexpected generation_mode={generation_mode}. Supported options are only `text` and `audio"
+        f"{self.__class__.__name__} got an unexpected generation_mode={generation_mode}. Supported options are only `text` and `audio`."
     )
```

mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-387-402 (1)

387-402: Missing null checks for optional processors.
The `model_input_names` property accesses `feature_extractor`, `image_processor`, and `video_processor`, all of which can be `None` per the constructor signature. This will raise `AttributeError` if any processor wasn't provided.

🐛 Suggested defensive approach

```diff
 @property
 def model_input_names(self):
     tokenizer_input_names = self.tokenizer.model_input_names
-    feature_extractor_input_names = self.feature_extractor.model_input_names
-    image_processor_input_names = self.image_processor.model_input_names
-    video_processor_input_names = self.video_processor.model_input_names
+    feature_extractor_input_names = self.feature_extractor.model_input_names if self.feature_extractor else []
+    image_processor_input_names = self.image_processor.model_input_names if self.image_processor else []
+    video_processor_input_names = self.video_processor.model_input_names if self.video_processor else []
     return list(
```

mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py-92-102 (1)

92-102: Potential `AttributeError` if `tokenizer` is `None`.
The constructor accesses `self.tokenizer.image_token` and other attributes unconditionally, but `tokenizer=None` is the default parameter. If instantiated without a tokenizer, this will raise `AttributeError`.

🐛 Suggested defensive check

```diff
 def __init__(
     self, image_processor=None, video_processor=None, feature_extractor=None, tokenizer=None, chat_template=None
 ):
     super().__init__(image_processor, video_processor, feature_extractor, tokenizer, chat_template=chat_template)
+    if tokenizer is None:
+        raise ValueError("tokenizer is required for Qwen2_5OmniProcessor")
     self.image_token = self.tokenizer.image_token
🧹 Nitpick comments (17)
mllm/models/qwen2_5omni/python_src_code/name.py (3)

10-15: Missing PyTorch dependency in error message.
The script uses `framework="pt"`, which requires PyTorch. If PyTorch is missing, users will get a confusing error later. Consider mentioning both dependencies.

💡 Suggested improvement

```diff
 try:
     from safetensors import safe_open
 except ImportError:
-    print("请先安装 safetensors 库:")
-    print("  pip install safetensors")
+    print("请先安装 safetensors 和 torch 库:")
+    print("  pip install safetensors torch")
     sys.exit(1)
```

61-65: Redundant file opening; refactor to avoid opening each file twice.
Each file is opened in `print_keys_from_file()` and then again to count keys. This doubles I/O operations. Have `print_keys_from_file()` return the key count instead.

♻️ Proposed refactor

Update `print_keys_from_file` to return the key count:

```diff
-def print_keys_from_file(filepath: Path):
+def print_keys_from_file(filepath: Path) -> int:
     """打印单个 safetensors 文件的所有键名"""
     ...
-    print(f"{key}")
+    print(f"{key}")
+    return len(keys)
```

Then simplify the loop in `main`:

```diff
 total_keys = 0
 for filepath in files:
-    print_keys_from_file(filepath)
-    with safe_open(filepath, framework="pt") as f:
-        total_keys += len(f.keys())
+    total_keys += print_keys_from_file(filepath)
```

53-53: Ambiguous fullwidth comma in comment.
The comment contains a fullwidth comma `,` (U+FF0C) instead of an ASCII comma `,`. Per static analysis (RUF003).

💡 Suggested fix

```diff
-    # 目录,查找所有 safetensors 文件
+    # 目录, 查找所有 safetensors 文件
```

mllm/models/qwen2_5omni/python_src_code/configuration.json (1)
1-1: Consider improving readability and documentation.
The single-line JSON format is valid but could be improved for maintainability:

- Format with proper indentation for better readability
- Add a separate schema file or documentation explaining each configuration option
- Document what "all-to-all" task type means in this context

📝 Suggested formatting improvement

```diff
-{"framework": "pytorch", "task": "all-to-all", "allow_remote": true}
+{
+  "framework": "pytorch",
+  "task": "all-to-all",
+  "allow_remote": true
+}
```

Additionally, consider adding a `README.md` or inline comments (if the format supports it) documenting the purpose of each field.

mllm/models/qwen2_5omni/python_src_code/preprocessor_config.json (1)

2-29: Add validation for derived audio fields to avoid config drift.
`n_samples` and `nb_max_frames` appear derivable from `chunk_length`, `sampling_rate`, and `hop_length`. Consider validating these relationships in the loader (or deriving them) to prevent silent inconsistencies if one value changes.

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp (1)
43-55: Potential portability issue with `M_PI` constant.
`M_PI` is not part of the C++ standard and may not be defined on all platforms (particularly MSVC without `_USE_MATH_DEFINES`). Consider using a locally defined constant for cross-platform compatibility.

Suggested fix for portability

```diff
 inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
+  constexpr float kPi = 3.14159265358979323846f;
   int32_t length = periodic ? window_length + 1 : window_length;
   auto window = Tensor::empty({1, window_length}, kFloat32, kCPU).alloc();
   float* window_ptr = window.ptr<float>();
   for (int32_t i = 0; i < window_length; ++i) {
     float n = static_cast<float>(i);
     float denominator = periodic ? static_cast<float>(length) : static_cast<float>(length - 1);
-    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * M_PI * n / denominator);
+    window_ptr[i] = 0.5f - 0.5f * std::cos(2.0f * kPi * n / denominator);
   }
   return window;
 }
```

mllm/models/qwen2_5omni/python_src_code/vision_process.py (1)
401-405: Consider more specific exception handling for video reader fallback.
Catching bare `Exception` can mask unexpected errors beyond video reading issues. While the fallback behavior is reasonable, consider catching more specific exceptions (e.g., `IOError`, `RuntimeError`) to avoid hiding unrelated bugs.

Suggested improvement

```diff
 try:
     video, sample_fps = VIDEO_READER_BACKENDS[video_reader_backend](ele)
-except Exception as e:
+except (IOError, RuntimeError, ImportError, KeyError) as e:
     logger.warning(f"video_reader_backend {video_reader_backend} error, use torchvision as default, msg: {e}")
     video, sample_fps = VIDEO_READER_BACKENDS["torchvision"](ele)
```

mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp (2)
20-117: Regex tokenization pattern is comprehensive but complex.
The `qwen2_5OmniTokenizerMatchPattern` function implements a multi-stage pattern matcher handling contractions, letters, digits, punctuation, and whitespace. The logic follows the Qwen2/Qwen2-VL tokenizer patterns. Consider adding inline comments explaining each matching stage for maintainability.

189-193: Consider removing or formalizing the debug comment.
The comment on line 190 (`//interestingly, the answer went bad when setting max_pixels higher...`) appears to be a development note. If this is a known limitation, consider documenting it properly in the class documentation or as a TODO with a tracking issue.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (3)

24-33: Add brief API docs for new public helpers/classes.
This header exposes many new public helpers/modules without doc comments. Please add concise Doxygen-style docs for purpose/params/returns/errors so the public surface is self-explanatory. As per coding guidelines, please add brief public API documentation.

584-595: Consider avoiding a full seq_len² mask for window attention.
This allocates O(n²) memory/time; for large images it can be heavy. A block-sparse mask or per-window attention would reduce memory pressure. As per coding guidelines, consider reducing O(n²) memory/time where feasible.

1233-1247: Avoid quadratic scans for next vision/audio tokens.
Each loop rescans the tail to find the next markers, making this O(S²) on long sequences. Precompute marker indices or scan once with moving pointers. As per coding guidelines, consider a linear-time scan to keep long sequences efficient.
examples/qwen2_5omni/image_infer.cpp (1)

24-29: Missing validation for invalid model version.
If `model_version` is neither "v1" nor "v2", the code silently defaults to `kV1`. Consider warning or erroring on unrecognized versions.

♻️ Suggested improvement

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  fmt::print(stderr, "Warning: Unknown model version '{}', defaulting to v1\n", model_version.get());
 }
```

mllm/models/qwen2_5omni/python_src_code/processing_qwen2_5_omni.py (4)
48-70: Consider annotating `_defaults` with `ClassVar`.
Ruff flags mutable class attributes without a `ClassVar` annotation. While this pattern is common in HuggingFace processing utilities, adding the annotation improves type safety.

♻️ Optional fix

```diff
+from typing import ClassVar

 class Qwen2_5OmniProcessorKwargs(ProcessingKwargs, total=False):
     videos_kwargs: Qwen2_5_OmniVideosKwargs
-    _defaults = {
+    _defaults: ClassVar[dict] = {
         "text_kwargs": {
```

149-162: Audio length calculation could benefit from a comment.
Lines 158-159 contain a non-obvious formula: `(input_lengths - 2) // 2 + 1` and `(audio_inputs["feature_attention_mask"].sum(-1) - 1) // 2 + 1`. Consider adding a brief comment explaining the transformation (e.g., relating to convolution stride or feature extraction parameters).
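For the brief comment the nitpick requests, the underlying fact is the standard 1-D convolution output-length rule; which kernel/stride/padding each Qwen formula corresponds to is a config-dependent assumption to verify against the model, but the rule itself is:

```python
def conv1d_out_len(length: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Output length of a 1-D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel_size) // stride + 1
```

For example, a Whisper-style downsampling conv (kernel 3, stride 2, padding 1) reduces a length `L` to `(L - 1) // 2 + 1`, the same shape of expression as the formulas above.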
224-229: Redundant sorting of positions.
`positions` is sorted twice: first via `sorted()` on line 228, then explicitly on line 229. The `key` function on line 229 sorts by `x[0]` (start position), which is the same as the default sort behavior of tuples. Remove one of the sorts.

♻️ Proposed fix

```diff
 for sample in text:
     positions = []
     special_tokens = [re.escape(tok) for tok in [self.audio_token, self.image_token, self.video_token]]
     pattern = "|".join(special_tokens)
-    positions = sorted([(match.start(), match.group()) for match in re.finditer(pattern, sample)])
-    positions.sort(key=lambda x: x[0])
+    positions = [(match.start(), match.group()) for match in re.finditer(pattern, sample)]
+    positions.sort(key=lambda x: x[0])
```

312-331: Misleading variable name `is_batched`.
The logic is inverted: `is_batched = True` is set when the input is a single conversation (dict), not when it's batched. The variable name is confusing even though the code works correctly.

♻️ Suggested clarification

```diff
 def apply_chat_template(self, conversations, chat_template=None, **kwargs):
-    is_batched = False
+    was_single = False
     if isinstance(conversations[0], dict):
         conversations = [conversations]
-        is_batched = True
+        was_single = True
     for conversation in conversations:
         # ... validation ...
-    if is_batched:
+    if was_single:
         conversations = conversations[0]
     return super().apply_chat_template(conversations, chat_template, **kwargs)
```
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Reject unsupported model_version values.
Any value other than "v1" or "v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.
🛡️ Suggested validation
```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  fmt::print("Unsupported model_version: {}\n", model_version.get());
+  mllm::shutdownContext();
+  return 1;
 }
```

🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 24 - 29, The code
currently treats any model_version other than "v2" as kV1 leading to silent
misloads; update the validation around the model_version.get() check (where
file_version and mllm::ModelFileVersion are set) to explicitly accept only "v1"
or "v2" and otherwise print a clear error (including the invalid value) and
terminate (e.g., std::cerr + std::exit(1) or throw) so unsupported values do not
fall back silently to kV1.
```cpp
fmt::print("Audio path (or 'exit/quit'): ");
//std::getline(std::cin, audio_path);
//if (audio_path == "exit" || audio_path == "quit") { return 0; }
audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";

fmt::print("Prompt text: ");
//std::getline(std::cin, prompt_text);
//if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
prompt_text = "复述这段音频";
```
Remove hardcoded audio path/prompt.
The CLI ignores user input and uses a local absolute path, which breaks usability for everyone else. Restore stdin input and keep a default prompt if empty.
🎤 Restore interactive input
fmt::print("Audio path (or 'exit/quit'): ");
- //std::getline(std::cin, audio_path);
- //if (audio_path == "exit" || audio_path == "quit") { return 0; }
- audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";
+ std::getline(std::cin, audio_path);
+ if (audio_path == "exit" || audio_path == "quit") { return 0; }
fmt::print("Prompt text: ");
- //std::getline(std::cin, prompt_text);
- //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
- prompt_text = "复述这段音频";
+ std::getline(std::cin, prompt_text);
+ if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 fmt::print("Audio path (or 'exit/quit'): ");
-//std::getline(std::cin, audio_path);
-//if (audio_path == "exit" || audio_path == "quit") { return 0; }
-audio_path = "/Users/kkkai/Desktop/mllm2-former/mllm/rsc/recognize.wav";
+std::getline(std::cin, audio_path);
+if (audio_path == "exit" || audio_path == "quit") { return 0; }
 fmt::print("Prompt text: ");
-//std::getline(std::cin, prompt_text);
-//if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-prompt_text = "复述这段音频";
+std::getline(std::cin, prompt_text);
+if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
```
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 51 - 59, The CLI currently
ignores stdin by hardcoding audio_path and prompt_text; revert to reading user
input using std::getline for audio_path and prompt_text (restore the commented
lines) instead of the absolute path and hardcoded prompt, and re-add checks to
exit on "exit"/"quit" for audio_path and to set a default prompt (e.g., "Please
describe the audio.") when prompt_text is empty; update the variables audio_path
and prompt_text in examples/qwen2_5omni/audio_infer.cpp accordingly and remove
the local absolute assignment.
```cpp
if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}
```
🧩 Analysis chain

🏁 Script executed:

```shell
cat -n examples/qwen2_5omni/image_infer.cpp | head -40
```

Repository: UbiquitousLearning/mllm
Length of output: 1898

🏁 Script executed:

```shell
rg -n "required|help.*isSet|isSet.*help" examples/qwen2_5omni/image_infer.cpp -B2 -A2
```

Repository: UbiquitousLearning/mllm
Length of output: 737

🏁 Script executed:

```shell
# Check how other examples handle help flags with required arguments
rg -n "help.*isSet|isSet.*help" examples/ --type cpp -B3 -A3
```

Repository: UbiquitousLearning/mllm
Length of output: 5873

🏁 Script executed:

```shell
# Find the Argparse implementation
fd -t f "argparse" --type f -e hpp -e cpp | head -20
```

Repository: UbiquitousLearning/mllm
Length of output: 90

🏁 Script executed:

```shell
# Search for Argparse class definition
rg -n "class Argparse" --type cpp --type hpp
```

Repository: UbiquitousLearning/mllm
Length of output: 94

🏁 Script executed:

```shell
# Read the Argparse implementation
cat mllm/utils/Argparse.hpp | head -200
```

Repository: UbiquitousLearning/mllm
Length of output: 5404

🏁 Script executed:

```shell
# Search for parse method in Argparse
rg -n "void parse|bool parse" mllm/utils/Argparse.hpp -A 10
```

Repository: UbiquitousLearning/mllm
Length of output: 1508

🏁 Script executed:

```shell
# Get the rest of the parse method to see the requirements checking
cat mllm/utils/Argparse.hpp | sed -n '149,220p'
```

Repository: UbiquitousLearning/mllm
Length of output: 2351
Help flag check occurs after required argument validation, preventing -h from working.
The -h|--help flag is checked at line 31, but Argparse::parse() validates all required arguments (model_path, model_version, tokenizer_path, config_path) at the end of parsing. Running with just -h will fail during parse() before help.isSet() is evaluated, so help will never be displayed. Move the help check before or immediately after Argparse::parse(), or modify the Argparse class to handle help as a special case that bypasses requirement validation.
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 31 - 35, Move the help
handling so -h/--help bypasses required-argument validation: call
Argparse::printHelp() and mllm::shutdownContext() when help.isSet() before
running the code path that enforces required args (i.e., before or immediately
after Argparse::parse() but before any required-argument validation), or adjust
Argparse to treat help as a special case; locate the current help check
(help.isSet()), the parser invocation (Argparse::parse()), and the help output
(Argparse::printHelp()) to implement this change.
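The pre-scan idea behind this fix can be sketched language-agnostically; here is a minimal Python illustration (the helper name `help_requested` is hypothetical and not part of mllm — the actual fix lives in the C++ examples):

```python
def help_requested(argv: list[str]) -> bool:
    # Scan the raw arguments before any strict parsing runs, mirroring the
    # suggested C++ fix: -h/--help must be honored even when required
    # options such as --model_path are absent.
    return any(arg in ("-h", "--help") for arg in argv[1:])
```

With this check done first, the strict parser (which enforces required options) only runs when help was not requested.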
```cpp
std::string image_path;
std::string prompt_text;

fmt::print("Image path (or 'exit/quit'): ");
image_path = "../../../mllm2-former/mllm/rsc/pics.jpg";
//std::getline(std::cin, image_path);
if (image_path == "exit" || image_path == "quit") { return 0; }

fmt::print("Prompt text: ");
prompt_text = "描述图片中物体";
//std::getline(std::cin, prompt_text);
```
Remove debug artifacts: hard-coded paths and commented-out interactive code.
The image path and prompt text are hard-coded with std::getline calls commented out. This breaks the interactive CLI functionality described in the UI prompts.
🐛 Proposed fix to restore interactive input

```diff
 std::string image_path;
 std::string prompt_text;
 fmt::print("Image path (or 'exit/quit'): ");
-image_path = "../../../mllm2-former/mllm/rsc/pics.jpg";
-//std::getline(std::cin, image_path);
+std::getline(std::cin, image_path);
 if (image_path == "exit" || image_path == "quit") { return 0; }
 fmt::print("Prompt text: ");
-prompt_text = "描述图片中物体";
-//std::getline(std::cin, prompt_text);
+std::getline(std::cin, prompt_text);
```

🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 49 - 59, The code contains
debug artifacts: hard-coded assignments to image_path and prompt_text and
commented-out interactive input; restore interactive CLI by removing the
hard-coded assignments and re-enabling the std::getline calls for image_path and
prompt_text, keep the existing prompt prints and the exit/quit check, and ensure
you trim or validate the read lines if necessary (look for variables image_path,
prompt_text and the std::getline usage in this block).
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Reject unsupported model_version values.
Any value other than "v2" silently falls back to kV1, which can load incorrect weights. Validate and exit with a clear error.
🛡️ Suggested validation

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  fmt::print("Unsupported model_version: {}\n", model_version.get());
+  mllm::shutdownContext();
+  return 1;
 }
```

🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/text_infer.cpp` around lines 24 - 29, The current logic
silently maps any model_version other than "v2" to kV1; update the branch that
sets file_version (variable file_version of type mllm::ModelFileVersion) to
explicitly validate model_version.get(): accept "v1" ->
mllm::ModelFileVersion::kV1, "v2" -> mllm::ModelFileVersion::kV2, and for any
other value print a clear error (including the invalid value) and exit/throw
instead of falling back, so callers of text_infer (where model_version is used)
cannot accidentally load wrong weights.
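The fail-fast mapping the review asks for can be sketched as follows (Python for illustration only; `SUPPORTED_VERSIONS` and `resolve_file_version` are hypothetical names standing in for the C++ `mllm::ModelFileVersion` logic):

```python
# Hypothetical stand-ins for the mllm::ModelFileVersion enum values.
SUPPORTED_VERSIONS = {"v1": "kV1", "v2": "kV2"}

def resolve_file_version(model_version: str) -> str:
    # Explicit lookup: an unknown value raises immediately instead of
    # silently falling back to v1 and loading the wrong weights.
    try:
        return SUPPORTED_VERSIONS[model_version]
    except KeyError:
        raise ValueError(f"Unsupported model_version: {model_version!r}") from None
```

The key property is that every accepted value is enumerated, so a typo like `"v3"` surfaces as an error at startup rather than as a corrupted model load later.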
```python
assert _check_if_video_has_audio(
    path
), "Video must has audio track when use_audio_in_video=True"
```
🧩 Analysis chain
🏁 Script executed:

```shell
find . -type f -name "audio_process.py" | head -20
```

Repository: UbiquitousLearning/mllm
Length of output: 125

🏁 Script executed:

```shell
cat -n mllm/models/qwen2_5omni/python_src_code/audio_process.py | sed -n '60,85p'
```

Repository: UbiquitousLearning/mllm
Length of output: 1595
Replace assert with a runtime exception.
Assertions are stripped when Python runs with the -O optimization flag, causing this validation to silently fail in production. Since surrounding code (lines 65, 81) uses ValueError for similar input validation failures, this should be consistent.
🛠️ Proposed fix

```diff
-assert _check_if_video_has_audio(
-    path
-), "Video must has audio track when use_audio_in_video=True"
+if not _check_if_video_has_audio(path):
+    raise ValueError("Video must have an audio track when use_audio_in_video=True")
```

🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/audio_process.py` around lines 71 -
73, Replace the assert check that calls _check_if_video_has_audio(path) with an
explicit runtime exception: if use_audio_in_video is True and
_check_if_video_has_audio(path) returns False, raise a ValueError with the same
message ("Video must has audio track when use_audio_in_video=True"); this
mirrors the existing validation style used around lines 65 and 81 and ensures
the check isn't skipped under Python -O. Reference the existing helper
_check_if_video_has_audio and the local variable path (and the
use_audio_in_video flag) when making the change.
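A minimal sketch of why the explicit check is safer than `assert` (the function name and boolean parameters are hypothetical stand-ins for the real helper):

```python
def require_audio_track(has_audio: bool, use_audio_in_video: bool) -> None:
    # Unlike `assert`, an explicit if/raise is never stripped by `python -O`,
    # so this validation also runs in optimized production builds.
    if use_audio_in_video and not has_audio:
        raise ValueError(
            "Video must have an audio track when use_audio_in_video=True"
        )
```

Raising `ValueError` also matches the validation style already used nearby in `audio_process.py`, so callers can handle one exception type for all input errors.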
[Quoted file content: the full text of the Apache License, Version 2.0 (January 2004, http://www.apache.org/licenses/), Sections 1–9 plus the Appendix application boilerplate, carrying a "Copyright 2025 Alibaba Cloud" notice, as added in `mllm/models/qwen2_5omni/python_src_code/LICENSE`.]
File extension required by coding guidelines.
The filename LICENSE lacks an extension, which violates the coding guidelines stating: "Only use acceptable file extensions: .c, .cc, .cpp, .cxx, .h, .hh, .hpp, .py, .pyi, .sh, .txt, .md, .yml, .yaml, .json, .toml."
Rename the file to LICENSE.txt or LICENSE.md to comply with the guidelines.
Based on coding guidelines.
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/LICENSE` around lines 1 - 201, The
LICENSE file name violates the project's allowed extensions; rename the file
named "LICENSE" to include a permitted extension (e.g., "LICENSE.txt" or
"LICENSE.md") so it matches coding guidelines; update any references to
"LICENSE" in packaging or docs to point to the new filename to avoid broken
links.
```python
elif image.startswith("http://") or image.startswith("https://"):
    # fix memory leak issue while using BytesIO
    with requests.get(image, stream=True) as response:
        response.raise_for_status()
        with BytesIO(response.content) as bio:
            image_obj = copy.deepcopy(Image.open(bio))
```
Missing timeout in HTTP request can cause indefinite hangs.
The requests.get() call lacks a timeout parameter, which can cause the application to hang indefinitely if the server is unresponsive. This is particularly problematic in production environments.
Proposed fix

```diff
-        with requests.get(image, stream=True) as response:
+        with requests.get(image, stream=True, timeout=30) as response:
```
🧰 Tools
🪛 Ruff (0.14.13)
108-108: Probable use of `requests` call without timeout (S113)
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/python_src_code/vision_process.py` around lines 106 -
111, Add a timeout to the HTTP fetch to avoid hangs: update the requests.get
call in the image URL branch (the block that checks image.startswith("http://")
or "https://") to pass a timeout (e.g., timeout=REQUEST_TIMEOUT or timeout=10) —
preferably use a module-level constant like REQUEST_TIMEOUT so it’s configurable
— and keep using the response context manager, response.raise_for_status(), and
the BytesIO/Image.open flow unchanged; also ensure any callers/handlers will
surface requests.exceptions.ReadTimeout/ConnectTimeout if needed.
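The module-level constant pattern the prompt suggests could look like the sketch below (`REQUEST_TIMEOUT` and `fetch_image_bytes` are hypothetical names for illustration, not part of `vision_process.py`):

```python
import requests

# Module-level so the timeout is easy to tune in one place.
REQUEST_TIMEOUT = 30  # seconds

def fetch_image_bytes(url: str) -> bytes:
    # Every network call carries a timeout, so an unresponsive server raises
    # requests.exceptions.Timeout instead of hanging the pipeline forever.
    with requests.get(url, stream=True, timeout=REQUEST_TIMEOUT) as response:
        response.raise_for_status()
        return response.content
```

Callers that care can catch `requests.exceptions.RequestException` (which covers `ConnectTimeout`, `ReadTimeout`, and connection errors) around the fetch.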
…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
… debug print statements from Qwen3DecoderLayer
…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
…d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
…es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 16-35: The help flag is never reachable because Argparse::parse()
enforces required options first; before calling Argparse::parse(argc, argv)
detect help early (e.g., scan argv for "-h" or "--help") and if present call
Argparse::printHelp() and mllm::shutdownContext() then return 0, or
alternatively use an Argparse::parse variant/option that skips required-arg
validation for help; update the code around Argparse::parse, help.isSet(),
Argparse::printHelp, and mllm::shutdownContext to ensure help is handled before
required-arg validation.
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json`:
- Around line 5-7: The config enables unsupported talker components despite the
PR stating "without talker": set "enable_audio_output" and "enable_talker" to
false (or add an explicit comment stating they are placeholders) in the JSON for
the qwen2_5_omni model to prevent runtime attempts to initialize talker/audio
subsystems; update any documentation or inline note near the "model_type":
"qwen2_5_omni" entry to clarify that talker is not yet supported if you choose
the placeholder route.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 24-29: The code silently defaults file_version to
mllm::ModelFileVersion::kV1 which can load wrong weights; change the logic in
the block that reads model_version.get() so it validates allowed values
explicitly (e.g., "v1" -> kV1, "v2" -> kV2) and on any other value fail fast by
returning an error/throwing an exception or logging and exiting with a clear
message that the model_version is unsupported; update references to file_version
and the model-version parsing branch (where mllm::ModelFileVersion file_version
is set) to enforce this validation rather than falling back to kV1.
♻️ Duplicate comments (4)

examples/qwen2_5omni/image_infer.cpp (2)

16-34: `-h`/`--help` is still blocked by required-arg validation. Argparse validates required args inside `parse()`, so the help check after parsing won’t run when `-h` is the only flag. Handle help before validation or make Argparse treat help as a bypass.

💡 One way to bypass required-arg validation

```diff
+#include <string_view>
 ...
 auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);
+ for (int i = 1; i < argc; ++i) {
+   const std::string_view arg{argv[i]};
+   if (arg == "-h" || arg == "--help") {
+     Argparse::printHelp();
+     mllm::shutdownContext();
+     return 0;
+   }
+ }
 Argparse::parse(argc, argv);
```

52-59: Restore interactive input for `image_path` and `prompt_text`. Currently the CLI ignores stdin and uses empty strings, which makes inference unusable. Re-enable `std::getline` and keep the exit/quit check.

🐛 Proposed fix

```diff
 fmt::print("Image path (or 'exit/quit'): ");
-image_path = "";
-//std::getline(std::cin, image_path);
+std::getline(std::cin, image_path);
 if (image_path == "exit" || image_path == "quit") { return 0; }
 fmt::print("Prompt text: ");
-prompt_text = "";
-//std::getline(std::cin, prompt_text);
+std::getline(std::cin, prompt_text);
```

examples/qwen2_5omni/audio_infer.cpp (2)

24-29: Reject unsupported `model_version` values. Any value other than `"v1"`/`"v2"` silently falls back to `kV1`, which can load incorrect weights. Validate and exit with a clear error.

🛡️ Suggested validation

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  fmt::print("Unsupported model_version: {}\n", model_version.get());
+  mllm::shutdownContext();
+  return 1;
 }
```

51-59: Restore interactive stdin input for audio path and prompt. The current code ignores user input and uses empty strings, which breaks the CLI. Re-enable `std::getline`, keep the exit/quit check, and restore the default prompt when empty.

🎤 Proposed fix

```diff
 fmt::print("Audio path (or 'exit/quit'): ");
-//std::getline(std::cin, audio_path);
-//if (audio_path == "exit" || audio_path == "quit") { return 0; }
-audio_path = "";
+std::getline(std::cin, audio_path);
+if (audio_path == "exit" || audio_path == "quit") { return 0; }
 fmt::print("Prompt text: ");
-//std::getline(std::cin, prompt_text);
-//if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
-prompt_text = "";
+std::getline(std::cin, prompt_text);
+if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
```
```cpp
auto& help = Argparse::add<bool>("-h|--help").help("Show help message");
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true);
auto& model_version = Argparse::add<std::string>("-mv|--model_version").help("Model version").required(true);
auto& tokenizer_path = Argparse::add<std::string>("-t|--tokenizer_path").help("Tokenizer directory").required(true);
auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);

Argparse::parse(argc, argv);

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}
```
-h/--help will never show if required args are missing.
Argparse::parse() enforces required options before help.isSet() is checked. Handle help before validation (or modify Argparse to bypass required checks for help).
💡 One way to bypass required-arg validation

```diff
+#include <string_view>
 ...
 auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);
+ for (int i = 1; i < argc; ++i) {
+   const std::string_view arg{argv[i]};
+   if (arg == "-h" || arg == "--help") {
+     Argparse::printHelp();
+     mllm::shutdownContext();
+     return 0;
+   }
+ }
 Argparse::parse(argc, argv);
```

🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/audio_infer.cpp` around lines 16 - 35, The help flag is
never reachable because Argparse::parse() enforces required options first;
before calling Argparse::parse(argc, argv) detect help early (e.g., scan argv
for "-h" or "--help") and if present call Argparse::printHelp() and
mllm::shutdownContext() then return 0, or alternatively use an Argparse::parse
variant/option that skips required-arg validation for help; update the code
around Argparse::parse, help.isSet(), Argparse::printHelp, and
mllm::shutdownContext to ensure help is handled before required-arg validation.
```json
"enable_audio_output": true,
"enable_talker": true,
"model_type": "qwen2_5_omni",
```
Disable talker/audio-output flags until talker is actually supported.
The PR objective says “without talker,” but enable_audio_output / enable_talker are true. If runtime respects these flags, it may try to initialize unsupported components or confuse users. Suggest disabling them (or explicitly documenting that they’re placeholders) until talker is ready.
💡 Proposed change

```diff
-"enable_audio_output": true,
-"enable_talker": true,
+"enable_audio_output": false,
+"enable_talker": false,
```
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json` around lines 5 - 7, The
config enables unsupported talker components despite the PR stating "without
talker": set "enable_audio_output" and "enable_talker" to false (or add an
explicit comment stating they are placeholders) in the JSON for the qwen2_5_omni
model to prevent runtime attempts to initialize talker/audio subsystems; update
any documentation or inline note near the "model_type": "qwen2_5_omni" entry to
clarify that talker is not yet supported if you choose the placeholder route.
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Reject unsupported model_version values instead of defaulting to v1.
Silent fallback to kV1 can load incorrect weights. Validate and fail fast on unknown values.
🛡️ Suggested validation

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  fmt::print("Unsupported model_version: {}\n", model_version.get());
+  mllm::shutdownContext();
+  return 1;
 }
```
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/image_infer.cpp` around lines 24 - 29, The code silently
defaults file_version to mllm::ModelFileVersion::kV1 which can load wrong
weights; change the logic in the block that reads model_version.get() so it
validates allowed values explicitly (e.g., "v1" -> kV1, "v2" -> kV2) and on any
other value fail fast by returning an error/throwing an exception or logging and
exiting with a clear message that the model_version is unsupported; update
references to file_version and the model-version parsing branch (where
mllm::ModelFileVersion file_version is set) to enforce this validation rather
than falling back to kV1.
Summary by CodeRabbit
New Features
Chores
Performance/Infrastructure
✏️ Tip: You can customize this high-level summary in your review settings.