[Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT (#43068)
Conversation
…choosing a target_duration that's capped and aligned
Signed-off-by: Dong Wang <dongw2019@gmail.com>
```python
max_target_duration = min(max_mel_frames, self.config.max_position_embeddings)
target_duration = min(maximum_duration, max_target_duration)
align_to = math.lcm(self.repeats, self.block_size)
target_duration = target_duration // align_to * align_to
if target_duration == 0:
    target_duration = min(maximum_duration, max_target_duration) // self.repeats * self.repeats
if target_duration == 0:
    raise ValueError(
        f"Aligned mel length is 0 (got `max_mel_frames`={max_mel_frames}, "
        f"`dit_config.max_position_embeddings`={self.config.max_position_embeddings})."
    )
```
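As a standalone sketch of the capping-and-alignment logic above (outside the model class, with the config attributes passed in as plain arguments; the numeric values below are illustrative, not the model's real configuration):

```python
import math

def cap_and_align(maximum_duration, max_mel_frames, max_position_embeddings,
                  repeats, block_size):
    # Cap the requested length by both the mel-frame budget and the DiT's
    # positional capacity, then align down to a multiple of
    # lcm(repeats, block_size) so downstream reshapes stay valid.
    max_target_duration = min(max_mel_frames, max_position_embeddings)
    target_duration = min(maximum_duration, max_target_duration)
    align_to = math.lcm(repeats, block_size)
    target_duration = target_duration // align_to * align_to
    if target_duration == 0:
        # Fall back to aligning on `repeats` alone for very short inputs.
        target_duration = min(maximum_duration, max_target_duration) // repeats * repeats
    if target_duration == 0:
        raise ValueError("Aligned mel length is 0")
    return target_duration

# A request longer than max_position_embeddings (4096 here) is capped,
# then rounded down to a multiple of lcm(2, 3) = 6.
print(cap_and_align(5000, 30000, 4096, 2, 3))  # 4092
```

Note the two-step fallback: if lcm alignment truncates a short input to zero, alignment on `repeats` alone is tried before giving up with a `ValueError`.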
```python
if target_duration != maximum_duration:
```
I think the main idea is to make sure that the codes are not too long, i.e. longer than max_position_embeddings. It's not clear why we want to add an arg for max_mel_frames and make it configurable if the provided max_mel_frames is only used to cap the codes when they are too long.
We could raise an error indicating that the input length is higher than max_position_embeddings and let the user decide what they want to do with it, no?
Makes sense to me. I updated the noise_initialization to be capped by max_position_embeddings instead of setting it to 30000 or providing a parameter to set it. Updated the tests as well. Let me know if it looks good. Thanks!
Force-pushed from ecb4c9e to 2830ba4.
Signed-off-by: Dong Wang <dongw2019@gmail.com>
[For maintainers] Suggested jobs to run (before merge): run-slow: qwen2_5_omni
@zucchini-nlp I updated the error handling logic and updated the test script. Could you review it again? Thanks!
```python
@require_torch
class Qwen2_5OmniToken2WavMaxPositionEmbeddingsTest(unittest.TestCase):
    """
    Tests to verify that ValueError is raised when input length exceeds max_position_embeddings.
    """
```
Okay, I guess for DiT we cannot really run all tests from ModelTesterMixin because it's a bit special.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…ition_embedding in DiT (huggingface#43068)

* qwen2_5_omni: make max_mel_frames an inference-time knob
* not fail with raising ValueError, instead make it continue to run by choosing a target_duration that's capped and aligned
* added unit tests for Token2WavShape shape mismatch
* make fixup
* remove unit test which takes too much GPU memory
* reduce gpu memory usage from the unit test
* addressed comments

Signed-off-by: Dong Wang <dongw2019@gmail.com>
* add Youtu-LLM model
* add testing indicators in model test
* [Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT (#43068)
* upgrade code quality according to latest main branch
* correct unnecessary tokenizer annotation
* resolve conflicts
* modify redundant codes in modules, decompose test functions
* fix typo
* adapt to latest official codes
* update dates
* modify prefix
* update model_type and test path
* update codes, as suggested by vasqu
* fix modeling inconsistency
* fix codes
* update codes with inherits of config
* fix docstring
* modular
* refactor tests
* skip incompatible tests
* rerun fix-repo
* some last fixes

Signed-off-by: Dong Wang <dongw2019@gmail.com>
Co-authored-by: Dong W <89223086+sniper35@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
What does this PR do?
This PR resolves the issue below, adds unit tests, and provides a test script to validate the behavior before and after the fix.
Issue: the number of mel frames in noise_initialization is hardcoded to `30000` in the current implementation. This causes a 'shape mismatch' error when the hidden_states time length does not match that of the condition_vector/code_embed tensors, i.e. we get the error below when running the attached script. The bug was found when I used vllm-omni to run the same Qwen 2.5 Omni model, and a similar PR has been merged there to resolve the issue.
To validate that the issue exists before the fix and is resolved after it, run the following script:
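To make the failure mode concrete, here is a minimal NumPy illustration (the shapes are hypothetical, not the model's actual dimensions): when the noise tensor's time axis is fixed at 30000 frames while the conditioning tensor follows the real input length, elementwise operations across the two fail.

```python
import numpy as np

# Hypothetical shapes: (batch, time, mel_bins)
noise = np.zeros((1, 30000, 80))     # time axis hardcoded to 30000 frames
condition = np.zeros((1, 4096, 80))  # time axis follows the actual input

try:
    _ = noise + condition  # time axes differ -> broadcasting fails
except ValueError as exc:
    print("shape mismatch:", exc)
```

Capping the noise length to the actual (aligned) target duration makes the time axes agree, which is exactly what the fix does.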
test_validate_max_position_embeddings_fix.py
All reproduction and validation was run on a B300 GPU.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.