yuki-97 (Contributor) reviewed on Jul 22, 2025:
Thanks @RayenTian for verifying and removing these things! Can you also check if we can also remove …
Contributor:
One more thing, can you also search …
Contributor:
Can you please attach the token_mult_prob_error plots to the description? (We can't always see the errors when we just look at convergence plots.)
RayenTian (Contributor, Author) commented on Jul 24, 2025:
Thank you for the suggestion. I have added the plots as requested.
terrykong previously approved these changes on Aug 8, 2025.
SahilJain314 (Contributor) previously approved these changes on Aug 8, 2025, leaving a comment:
Thanks for rebasing, LGTM now.
terrykong approved these changes on Aug 8, 2025.
What does this PR do?
This pull request removes the tied-weight check in the DTensor worker, addressing a previous limitation that prevented training models that use tie_word_embeddings when the tensor parallel size (tp_size) is greater than 1.
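For context, here is a minimal sketch (not NeMo RL code) of what weight tying means for a Hugging Face checkpoint: the output projection (lm_head) shares its weight tensor with the input embedding, which is what made sharding under tensor parallelism delicate.

```python
# A minimal sketch (not NeMo RL code) of tied word embeddings in a
# Hugging Face checkpoint: lm_head reuses the input embedding tensor.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
assert model.config.tie_word_embeddings

embed = model.get_input_embeddings().weight
lm_head = model.get_output_embeddings().weight
# Same storage: sharding one of them under tensor parallelism implicitly
# shards the other, which is what the removed check used to guard against.
print(embed.data_ptr() == lm_head.data_ptr())  # True for tied models
```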
Issues
closes #684
Test Results
For removing the NRL_SKIP_TIED_WEIGHT_CHECK env variable:
Experiments were run with meta-llama/Llama-3.2-1B, Qwen/Qwen2.5-1.5B-Instruct, and google/gemma-2-2b-it. With the truncate rate set within a reasonable range, both tp_size=1 and tp_size=2 produced comparable reward results, indicating that training proceeds as expected. This demonstrates consistent model behavior across tensor parallel configurations and confirms correct handling of tied word embeddings during training.
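The removed guard itself isn't shown in this excerpt; the following is a hypothetical reconstruction of the kind of env-var-gated check this PR deletes, assuming it raised in the DTensor worker unless explicitly bypassed:

```python
# Hypothetical reconstruction of the kind of guard this PR removes; the
# exact message and call site in the DTensor worker may differ.
import os

def check_tied_weights(model_config, tp_size: int) -> None:
    """Refuse tied-embedding models under TP unless explicitly skipped."""
    if (
        getattr(model_config, "tie_word_embeddings", False)
        and tp_size > 1
        and os.environ.get("NRL_SKIP_TIED_WEIGHT_CHECK", "0") != "1"
    ):
        raise ValueError(
            "tie_word_embeddings with tp_size > 1 is unsupported; "
            "set NRL_SKIP_TIED_WEIGHT_CHECK=1 to bypass."
        )
```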
Main setup:
- Llama-3.2-1B
- Qwen2.5-1.5B-Instruct
- google/gemma-2-2b-it

Test with: (per-model convergence and token_mult_prob_error plots are attached in the original PR)
For removing model.config.tie_word_embeddings in nemo_rl/models/dtensor/parallelize.py:
Experiments were conducted with both Qwen/Qwen2.5-1.5B-Instruct and Qwen/Qwen2.5-7B-Instruct, representing the original tied and untied weight configurations, respectively. Qwen's optimized parallel plan was disabled in PARALLIZE_FUNCTIONS, forcing both models to use the HF plan for testing (see the sketch after the model list below).
- Qwen/Qwen2.5-1.5B-Instruct
- Qwen/Qwen2.5-7B-Instruct
(plots attached in the original PR)
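A hypothetical sketch of how the optimized plan can be disabled for such a test; PARALLIZE_FUNCTIONS comes from the PR text, but whether it is keyed by model class or by class name is an assumption here:

```python
# Hypothetical sketch: drop any Qwen2-specific entry so both checkpoints
# fall back to the generic HF tensor-parallel plan and exercise the same
# code path. Key type (class vs. name) is an assumption.
from nemo_rl.models.dtensor.parallelize import PARALLIZE_FUNCTIONS

for key in list(PARALLIZE_FUNCTIONS):
    if "Qwen2" in getattr(key, "__name__", str(key)):
        PARALLIZE_FUNCTIONS.pop(key)
```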
More Custom Tests
Some additional tests were also conducted here.
For models like Qwen/Qwen2.5-1.5B-Instruct, which use tied word embeddings by default, we forcibly disabled the tie_word_embeddings setting. The results showed significant anomalies in both the token_mult_prob_error and the reward. We suspect this is because vLLM still keeps the embeddings tied on its side, so update_weights can misbehave: in practice, vLLM may use either the embed or the lm_head weight rather than truly untying them as in training, which causes the token_mult_prob_error to behave abnormally.
For models like Qwen/Qwen2.5-7B-Instruct, which do not use tied embeddings by default, we forcibly enabled tie_word_embeddings. The results showed large differences in reward, but the token_mult_prob_error remained stable. This is likely because the embed and lm_head weights originally differed; forcing a tie effectively discards the original lm_head weights. When update_weights is called, vLLM's lm_head is likewise overwritten with the embed weights, so the token_mult_prob_error stays normal; however, since the original weights were replaced, the reward becomes abnormal.
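A minimal sketch of the forced-untie setup described above, assuming standard transformers behavior (not the exact experiment script): flipping the config flag gives the trainer two independent tensors, while a receiver such as vLLM may still tie them on its side.

```python
# Sketch of forcibly untying a tied checkpoint (assumes standard
# transformers behavior; not the exact experiment script).
from transformers import AutoConfig, AutoModelForCausalLM

name = "Qwen/Qwen2.5-1.5B-Instruct"
config = AutoConfig.from_pretrained(name)
config.tie_word_embeddings = False  # forcibly untie

model = AutoModelForCausalLM.from_pretrained(name, config=config)
embed = model.get_input_embeddings().weight
lm_head = model.get_output_embeddings().weight

# With the tie removed, training updates these tensors independently; a
# receiver that still ties them will effectively use only one of the two
# copies after a weight update.
print(embed.data_ptr() == lm_head.data_ptr())  # expected: False
```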
Additional Notes on Gemma Model Testing