
[Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT#43068

Merged
zucchini-nlp merged 7 commits into huggingface:main from
sniper35:fix/qwen2_5_omni-max-mel-frames-arg
Jan 8, 2026

Conversation

@sniper35
Contributor

@sniper35 sniper35 commented Dec 30, 2025

What does this PR do?

This PR resolves the following issue, adds tests, and provides a test script to validate the behavior before and after the fix.
Issue: the number of mel frames used for noise_initialization is hardcoded to 30000 in the current implementation, which causes a shape-mismatch error whenever the time length of hidden_states does not match that of the condition_vector/code_embed tensors. For example, running the attached script produces the error below.

  File "/root/transformers/test_before.py", line 20, in <module>
    m.sample(cond, ref_mel, code, num_steps=2)
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3718, in sample
    solution_trajectory = ode_solver.integrate(time_embedding)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3550, in integrate
    delta_value, _ = self._compute_step(self.function, time_start, time_step, time_end, current_value)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3524, in _compute_step
    function_value_start = function(time_start, value_start)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3698, in ode_function
    model_output = self(
                   ^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 3636, in forward
    hidden_states = self.input_embed(
                    ^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/transformers/src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2975, in forward
    hidden_states = self.proj(torch.cat((hidden_states, condition_vector, code_embed, speaker_embedding), dim=-1))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 30000 but got size 32000 for tensor number 2 in the list.
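
For context (not part of the PR), the failure is simply torch.cat requiring every non-concatenated dimension to agree. A minimal standalone illustration with made-up shapes, unrelated to the actual model tensors:

import torch

# Made-up shapes for illustration only: the noise is initialized with 30000 frames
# while the code/condition tensor carries 32000 frames.
hidden_states = torch.randn(1, 30000, 80)
code_embed = torch.randn(1, 32000, 80)

try:
    torch.cat((hidden_states, code_embed), dim=-1)
except RuntimeError as e:
    print(e)  # "Sizes of tensors must match except in dimension 2 ..."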

The bug was found when I used vllm-omni to run the same Qwen 2.5 Omni model; a similar PR has already been merged there to resolve the issue.

To validate that the issue exists before the fix and is resolved after it, run the following script:

test_validate_max_position_embeddings_fix.py

All reproduction and validation was run on a B300 GPU.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sniper35 sniper35 changed the title qwen2_5_omni: make max_mel_frames an inference-time knob [model] qwen2_5_omni: cap and align hardcoded initial max_mel_frames Jan 1, 2026
@sniper35 sniper35 changed the title [model] qwen2_5_omni: cap and align hardcoded initial max_mel_frames [Bug] qwen2_5_omni: cap and align hardcoded initial max_mel_frames Jan 1, 2026
@sniper35
Contributor Author

sniper35 commented Jan 1, 2026

CC: @zucchini-nlp @eustlb @ebezzam @vasqu

Comment on lines +3682 to +3694
max_target_duration = min(max_mel_frames, self.config.max_position_embeddings)
target_duration = min(maximum_duration, max_target_duration)
align_to = math.lcm(self.repeats, self.block_size)
target_duration = target_duration // align_to * align_to
if target_duration == 0:
    target_duration = min(maximum_duration, max_target_duration) // self.repeats * self.repeats
if target_duration == 0:
    raise ValueError(
        f"Aligned mel length is 0 (got `max_mel_frames`={max_mel_frames}, "
        f"`dit_config.max_position_embeddings`={self.config.max_position_embeddings})."
    )

if target_duration != maximum_duration:
Member

I think the main idea is to make sure that the codes are not too long, i.e. > max_position_embeddings. Not clear why we want to add an arg for max_mel_frames and make it configurable, if the provided max_mel_frames is used only to cap codes when they are too long.

We could raise error indicating that input length is higher than max_position_embeddings and let user decide what they want to do with it, no?

Contributor Author

@sniper35 sniper35 Jan 6, 2026

Makes sense to me. I updated noise_initialization to be capped by max_position_embeddings instead of setting it to 30000 or exposing a parameter for it. Updated the tests as well. Let me know if it looks good. Thanks!
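
For reference, the cap-and-align arithmetic from the diff above can be reproduced standalone; the numbers below are purely illustrative, not the actual config values:

import math

# Illustrative values only; in the model they come from the DiT config and the
# length of the generated codec sequence.
max_position_embeddings = 30000   # positional limit of the DiT
maximum_duration = 32000          # mel frames implied by the generated codes
repeats, block_size = 4, 9        # hypothetical upsampling factor and block size

target_duration = min(maximum_duration, max_position_embeddings)
align_to = math.lcm(repeats, block_size)           # 36
target_duration = target_duration // align_to * align_to
print(target_duration)  # 29988: capped below 30000 and divisible by 36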

@sniper35 sniper35 force-pushed the fix/qwen2_5_omni-max-mel-frames-arg branch from ecb4c9e to 2830ba4 Compare January 5, 2026 23:53
Signed-off-by: Dong Wang <dongw2019@gmail.com>
@github-actions
Contributor

github-actions Bot commented Jan 6, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_5_omni

@sniper35 sniper35 changed the title [Bug] qwen2_5_omni: cap and align hardcoded initial max_mel_frames [Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT Jan 6, 2026
@sniper35 sniper35 requested a review from zucchini-nlp January 6, 2026 07:41
@sniper35
Contributor Author

sniper35 commented Jan 8, 2026

@zucchini-nlp I updated the error handling logic and updated the test script. Could you review it again? Thanks!

Member

@zucchini-nlp zucchini-nlp left a comment


Thanks for iterating

Comment on lines +865 to +870
@require_torch
class Qwen2_5OmniToken2WavMaxPositionEmbeddingsTest(unittest.TestCase):
    """
    Tests to verify that ValueError is raised when input length exceeds max_position_embeddings.
    """

Member

Oke, ig for DiT we cannot really run all tests from ModelTesterMixin because it's a bit special.
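
For readers skimming the thread, the intent of that test can be sketched with a pure-Python stand-in; the helper and numbers below are hypothetical, not the actual modeling code or the test added in this PR:

import unittest


def check_generation_length(num_frames: int, max_position_embeddings: int) -> None:
    """Hypothetical stand-in for the length guard this PR adds to the Token2Wav DiT."""
    if num_frames > max_position_embeddings:
        raise ValueError(
            f"Generation length {num_frames} exceeds max_position_embeddings={max_position_embeddings}."
        )


class MaxPositionEmbeddingsGuardTest(unittest.TestCase):
    def test_raises_when_too_long(self):
        with self.assertRaises(ValueError):
            check_generation_length(32000, 30000)

    def test_accepts_length_within_limit(self):
        check_generation_length(29000, 30000)  # should not raise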

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) January 8, 2026 11:09
@zucchini-nlp zucchini-nlp merged commit 0d8f187 into huggingface:main Jan 8, 2026
19 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
[Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT (huggingface#43068)

* qwen2_5_omni: make max_mel_frames an inference-time knob

* not fail with raising ValueError, instead make it continue to run by choosing a target_duration that's capped and aligned

* added unit tests for Token2WavShape shape mismatch

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* make fixup

* remove unit test which takes too much GPU memory

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* reduce gpu memory usage from the unit test

* addressed comments

Signed-off-by: Dong Wang <dongw2019@gmail.com>

---------

Signed-off-by: Dong Wang <dongw2019@gmail.com>
vasqu added a commit that referenced this pull request Jan 28, 2026
* add Youtu-LLM model

* add testing indicators in model test

* [Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT (#43068)

* qwen2_5_omni: make max_mel_frames an inference-time knob

* not fail with raising ValueError, instead make it continue to run by choosing a target_duration that's capped and aligned

* added unit tests for Token2WavShape shape mismatch

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* make fixup

* remove unit test which takes too much GPU memory

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* reduce gpu memory usage from the unit test

* addressed comments

Signed-off-by: Dong Wang <dongw2019@gmail.com>

---------

Signed-off-by: Dong Wang <dongw2019@gmail.com>

* upgrade code quality according to latest main branch

* correct unnecessary tokenizer annotation

* resolve conflicts

* modify redundant codes in modules, decompose test functions

* fix typo

* adapt to latest official codes

* update dates

* modfiy prefix

* update dates

* modify model_type and test path

* update codes, as suggested by vasqu

* fix modeling inconsistency

* fix codes

* update codes with inherits of config

* fix docstring

* modular

* refactor tests

* skip incompatible tests

* rerun fix-repo

* some last fixes

---------

Signed-off-by: Dong Wang <dongw2019@gmail.com>
Co-authored-by: Dong W <89223086+sniper35@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
