
[Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT#43068

Merged
zucchini-nlp merged 7 commits into huggingface:main from
sniper35:fix/qwen2_5_omni-max-mel-frames-arg
Jan 8, 2026

Conversation

@sniper35
Contributor

@sniper35 sniper35 commented Dec 30, 2025

What does this PR do?

This PR resolves the following issue, adds tests, and provides a test script to validate the behavior before and after the fix.
Issue: the number of mel frames used for noise_initialization is hardcoded to 30000 in the current implementation, which causes a shape-mismatch error whenever the time length of hidden_states does not match that of the condition_vector/code_embed tensors. For example, running the attached script produces the error below.

  File "/root/transformers/test_before.py", line 20, in <module>
    m.sample(cond, ref_mel, code, num_steps=2)
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3718, in sample
    solution_trajectory = ode_solver.integrate(time_embedding)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3550, in integrate
    delta_value, _ = self._compute_step(self.function, time_start, time_step, time_end, current_value)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3524, in _compute_step
    function_value_start = function(time_start, value_start)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3698, in ode_function
    model_output = self(
                   ^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3636, in forward
    hidden_states = self.input_embed(
                    ^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2975, in forward
    hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 30000 but got size 32000 for tensor number 2 in the list.
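
For context (not part of the PR), the failure is simply torch.cat requiring every non-concatenated dimension to agree. A minimal standalone illustration with made-up shapes, unrelated to the actual model tensors:

import torch

# Made-up shapes for illustration only: the noise is initialized with 30000 frames
# while the code/condition tensor carries 32000 frames.
hidden_states = torch.randn(1, 30000, 80)
code_embed = torch.randn(1, 32000, 80)

try:
    torch.cat((hidden_states, code_embed), dim=-1)
except RuntimeError as e:
    print(e)  # "Sizes of tensors must match except in dimension 2 ..."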

The bug was found when I used vllm-omni to run the same Qwen 2.5 Omni model; a similar PR has already been merged there to resolve the issue.

To validate that the issue exists before the fix and is resolved after it, run the following script:

test_validate_max_position_embeddings_fix.py

All reproduction and validation was run on a B300 GPU.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sniper35 sniper35 changed the title qwen2_5_omni: make max_mel_frames an inference-time knob [model] qwen2_5_omni: cap and align hardcoded initial max_mel_frames Jan 1, 2026
@sniper35 sniper35 changed the title [model] qwen2_5_omni: cap and align hardcoded initial max_mel_frames [Bug] qwen2_5_omni: cap and align hardcoded initial max_mel_frames Jan 1, 2026
@sniper35
Contributor Author

sniper35 commented Jan 1, 2026

CC: @zucchini-nlp @eustlb @ebezzam @vasqu

Comment on lines +3682 to +3694
max_target_duration = min(max_mel_frames, self.config.max_position_embeddings)
target_duration = min(maximum_duration, max_target_duration)
align_to = math.lcm(self.repeats, self.block_size)
target_duration = target_duration // align_to * align_to
if target_duration == 0:
    target_duration = min(maximum_duration, max_target_duration) // self.repeats * self.repeats
if target_duration == 0:
    raise ValueError(
        f"Aligned mel length is 0 (got `max_mel_frames`={max_mel_frames}, "
        f"`dit_config.max_position_embeddings`={self.config.max_position_embeddings})."
    )

if target_duration != maximum_duration:
Member

I think the main idea is to make sure that the codes are not too long, i.e. > max_position_embeddings. Not clear why we want to add an arg for max_mel_frames and make it configurable, if the provided max_mel_frames is used only to cap codes when they are too long.

We could raise error indicating that input length is higher than max_position_embeddings and let user decide what they want to do with it, no?

Contributor Author

@sniper35 sniper35 Jan 6, 2026

Makes sense to me. I updated noise_initialization to be capped by max_position_embeddings instead of setting it to 30000 or exposing a parameter for it. Updated the tests as well. Let me know if it looks good. Thanks!
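
For reference, the cap-and-align arithmetic from the diff above can be reproduced standalone; the numbers below are purely illustrative, not the actual config values:

import math

# Illustrative values only; in the model they come from the DiT config and the
# length of the generated codec sequence.
max_position_embeddings = 30000   # positional limit of the DiT
maximum_duration = 32000          # mel frames implied by the generated codes
repeats, block_size = 4, 9        # hypothetical upsampling factor and block size

target_duration = min(maximum_duration, max_position_embeddings)
align_to = math.lcm(repeats, block_size)           # 36
target_duration = target_duration // align_to * align_to
print(target_duration)  # 29988: capped below 30000 and divisible by 36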

@sniper35 sniper35 force-pushed the fix/qwen2_5_omni-max-mel-frames-arg branch from ecb4c9e to 2830ba4 Compare January 5, 2026 23:53
Signed-off-by: Dong Wang <dongw2019@gmail.com>
@github-actions
Contributor

github-actions Bot commented Jan 6, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_5_omni

@sniper35 sniper35 changed the title [Bug] qwen2_5_omni: cap and align hardcoded initial max_mel_frames [Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT Jan 6, 2026
@sniper35 sniper35 requested a review from zucchini-nlp January 6, 2026 07:41
@sniper35
Contributor Author

sniper35 commented Jan 8, 2026

@zucchini-nlp I updated the error handling logic and updated the test script. Could you review it again? Thanks!

Member

@zucchini-nlp zucchini-nlp left a comment


Thanks for iterating

Comment on lines +865 to +870
@require_torch
class Qwen2_5OmniToken2WavMaxPositionEmbeddingsTest(unittest.TestCase):
    """
    Tests to verify that ValueError is raised when input length exceeds max_position_embeddings.
    """

Member

Oke, ig for DiT we cannot really run all tests from ModelTesterMixin because it's a bit special.
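
For readers skimming the thread, the intent of that test can be sketched with a pure-Python stand-in; the helper and numbers below are hypothetical, not the actual modeling code or the test added in this PR:

import unittest


def check_generation_length(num_frames: int, max_position_embeddings: int) -> None:
    """Hypothetical stand-in for the length guard this PR adds to the Token2Wav DiT."""
    if num_frames > max_position_embeddings:
        raise ValueError(
            f"Generation length {num_frames} exceeds max_position_embeddings={max_position_embeddings}."
        )


class MaxPositionEmbeddingsGuardTest(unittest.TestCase):
    def test_raises_when_too_long(self):
        with self.assertRaises(ValueError):
            check_generation_length(32000, 30000)

    def test_accepts_length_within_limit(self):
        check_generation_length(29000, 30000)  # should not raise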

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) January 8, 2026 11:09
@zucchini-nlp zucchini-nlp merged commit 0d8f187 into huggingface:main Jan 8, 2026
19 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
[Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT (huggingface#43068)

* qwen2_5_omni: make max_mel_frames an inference-time knob

* not fail with raising ValueError, instead make it continue to run by choosing a target_duration that's capped and aligned

* added unit tests for Token2WavShape shape mismatch

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* make fixup

* remove unit test which takes too much GPU memory

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* reduce gpu memory usage from the unit test

* addressed comments

Signed-off-by: Dong Wang <dongw2019@gmail.com>

---------

Signed-off-by: Dong Wang <dongw2019@gmail.com>
vasqu added a commit that referenced this pull request Jan 28, 2026
* add Youtu-LLM model

* add testing indicators in model test

* [Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT (#43068)

* qwen2_5_omni: make max_mel_frames an inference-time knob

* not fail with raising ValueError, instead make it continue to run by choosing a target_duration that's capped and aligned

* added unit tests for Token2WavShape shape mismatch

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* make fixup

* remove unit test which takes too much GPU memory

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* reduce gpu memory usage from the unit test

* addressed comments

Signed-off-by: Dong Wang <dongw2019@gmail.com>

---------

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* upgrade code quality according to latest main branch

* correct unnecessary tokenizer annotation

* resolve conflicts

* modify redundant codes in modules, decompose test functions

* fix typo

* adapt to latest official codes

* update dates

* modfiy prefix

* update dates

* modify model_type and test path

* update codes, as suggested by vasqu

* fix modeling inconsistency

* fix codes

* update codes with inherits of config

* fix docstring

* modular

* refactor tests

* skip incompatible tests

* rerun fix-repo

* some last fixes

---------

Signed-off-by: Dong Wang <dongw2019@gmail.com>
Co-authored-by: Dong W <89223086+sniper35@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
