Skip to content

[Bug]Fix init issue for layer_norm in sequence_parallel for non-CUDA device.#450

Closed
ys950902 wants to merge 1 commit intodeepspeedai:mainfrom
ys950902:layernorm_init
Closed

[Bug]Fix init issue for layer_norm in sequence_parallel for non-CUDA device.#450
ys950902 wants to merge 1 commit intodeepspeedai:mainfrom
ys950902:layernorm_init

Conversation

@ys950902
Copy link
Copy Markdown

As sequence_parallel is added in Megatron-DeepSpeed for layernorm, for current implementation, non-CUDA device is using from torch.nn import LayerNorm for layernorm, there is no attr named sequence_parallel, will cause init error for non-CUDA device.

This pr is to fix this issue.

@ys950902 ys950902 changed the title Fix init issue for layer_norm in sequence_parallel for non-CUDA device. [Bug]Fix init issue for layer_norm in sequence_parallel for non-CUDA device. Sep 29, 2024
@ys950902
Copy link
Copy Markdown
Author

ys950902 commented Sep 29, 2024

#429 for layernorm is added in this pr. Hi @polisettyvarma could you please also take a look on this pr, is it okay for you on layernorm, many thank!

@polisettyvarma
Copy link
Copy Markdown

@ys950902 will this feature work correctly ? have you done any accuracy check for this ?

else:
from .rmsnorm import RMSNorm
from torch.nn import LayerNorm
from .layernorm import LayerNorm
Copy link
Copy Markdown

@tjruwase tjruwase Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please share the failure that is being fixed? I have two concerns about this change:

  1. It is quite subtle since it does not show the connection to sequence-parallelism
  2. It is unclear to me that new LayerNorm is equivalent to torch.nn.LayerNorm for non sequence-parallel case. Maintaining parity with torch.nn.LayerNorm imposes extra development burden.

So, I would like to further understand the problem and explore alternative solutions. Thanks!

@tjruwase
Copy link
Copy Markdown

Closing for lack of response. Please re-open if needed.

@tjruwase tjruwase closed this Dec 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants