[TOPI] Add layer norm operator #12864
Conversation
MasterJH5574 left a comment
LGTM! Thanks @vinx13 for implementing layer-norm! I just have two tiny nits.
    from .bnn import *
    from .qnn import *
    from .upsampling import *
    from .layer_norm import layer_norm
What about importing * 👀? All the other imports here use import *.
Suggested change:
    - from .layer_norm import layer_norm
    + from .layer_norm import *
Wildcard importing is actually not a good idea though lol
Agreed, so I avoided using a wildcard here. Perhaps we should clean up this file in the future.
* [TOPI] Add one-pass layer norm using tuple reduction
* Add reducer pattern for LowerCrossThreadReduction
* lint
* update docs
This PR adds a tuple-sum based implementation of layer norm. It performs a one-pass reduction that computes the mean and variance at the same time.
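For reference, the identity the one-pass reduction relies on is var(x) = E[x²] − E[x]², so a single (sum, sum-of-squares) reduction is enough to recover both statistics. A quick NumPy check of this identity (hypothetical, not part of the PR):

```python
# Hypothetical NumPy check (not from the PR) of the identity behind the
# one-pass reduction: var(x) = mean(x * x) - mean(x) ** 2.
import numpy as np

x = np.random.rand(4, 128).astype("float32")
mean = x.mean(axis=-1)
var_one_pass = (x * x).mean(axis=-1) - mean * mean
np.testing.assert_allclose(var_one_pass, x.var(axis=-1), rtol=1e-4, atol=1e-6)
```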
A reducer pattern is also added to allow `LowerCrossThreadReduction` to handle this case. On CUDA, this generates two kernels: one for the reduction and one for the elementwise operations. Because of a current limitation of `compute_at`, we are not able to fuse them into a single kernel.

cc @MasterJH5574 @junrushao @AndrewZhaoLuo
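Below is a minimal sketch of the tuple-sum approach in TVM's tensor expression language. The names (`tuple_sum`, `layer_norm_sketch`), the fixed 2-D shape, and the last-axis reduction are illustrative assumptions, not the PR's actual TOPI code:

```python
# A minimal sketch of one-pass layer norm via a tuple reduction in TVM TE.
# Hypothetical names and shapes; not the PR's actual implementation.
import tvm
from tvm import te

def layer_norm_sketch(m, n, eps=1e-5):
    data = te.placeholder((m, n), name="data", dtype="float32")

    # Tuple reducer that accumulates (sum(x), sum(x * x)) simultaneously.
    def fcombine(x, y):
        return x[0] + y[0], x[1] + y[1]

    def fidentity(t0, t1):
        return tvm.tir.const(0, t0), tvm.tir.const(0, t1)

    tuple_sum = te.comm_reducer(fcombine, fidentity, name="tuple_sum")

    # Single reduction stage producing both sums: the "one-pass" part.
    k = te.reduce_axis((0, n), name="k")
    sum_x, sum_x2 = te.compute(
        (m,),
        lambda i: tuple_sum((data[i, k], data[i, k] * data[i, k]), axis=k),
        name="sums",
    )

    # Elementwise stages derive mean/variance from the sums and normalize.
    inv_n = 1.0 / n
    mean = te.compute((m,), lambda i: sum_x[i] * inv_n, name="mean")
    var = te.compute(
        (m,), lambda i: sum_x2[i] * inv_n - mean[i] * mean[i], name="var"
    )
    out = te.compute(
        (m, n),
        lambda i, j: (data[i, j] - mean[i]) / te.sqrt(var[i] + eps),
        name="layer_norm",
    )
    return data, out
```

Scheduled for CUDA, the reduction stage and the elementwise stages would map onto the two kernels described above.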