[PERFORMANCE] [master] Layer normalization code from Marian for CPU #19602
Conversation
Experiment with OMP_NUM_THREADS=4 on a c5.12xlarge instance; all times are in seconds.
| batch x channel | New code (s) | MKL (s) |
|---|---|---|
| 1x 32 | 0.0000288| 0.0000278|
| 128x 32 | 0.0000308| 0.0000311|
| 2560x 32 | 0.0000712| 0.0000672|
| 4096x 32 | 0.0000946| 0.0000910|
| 8192x 32 | 0.0001597| 0.0001523|
|16384x 32 | 0.0002905| 0.0002619|
| 1x 64 | 0.0000264| 0.0000256|
| 128x 64 | 0.0000339| 0.0000330|
| 2560x 64 | 0.0000829| 0.0000972|
| 4096x 64 | 0.0001137| 0.0001356|
| 8192x 64 | 0.0002027| 0.0002435|
|16384x 64 | 0.0003715| 0.0004639|
| 1x 128 | 0.0000262| 0.0000263|
| 128x 128 | 0.0000325| 0.0000389|
| 2560x 128 | 0.0001074| 0.0001580|
| 4096x 128 | 0.0001505| 0.0002336|
| 8192x 128 | 0.0002861| 0.0004481|
|16384x 128 | 0.0005648| 0.0008613|
| 1x 256 | 0.0000273| 0.0000276|
| 128x 256 | 0.0000390| 0.0000431|
| 2560x 256 | 0.0001533| 0.0002811|
| 4096x 256 | 0.0002258| 0.0004300|
| 8192x 256 | 0.0004300| 0.0008464|
|16384x 256 | 0.0010436| 0.0017613|
| 1x 512 | 0.0000256| 0.0000302|
| 128x 512 | 0.0000408| 0.0000551|
| 2560x 512 | 0.0002444| 0.0005225|
| 4096x 512 | 0.0003828| 0.0008147|
| 8192x 512 | 0.0008832| 0.0017192|
|16384x 512 | 0.0058463| 0.0074497|
| 1x 768 | 0.0000252| 0.0000308|
| 128x 768 | 0.0000450| 0.0000676|
| 2560x 768 | 0.0003440| 0.0007719|
| 4096x 768 | 0.0005890| 0.0013346|
| 8192x 768 | 0.0014946| 0.0026145|
|16384x 768 | 0.0089495| 0.0113557|
| 1x 1024 | 0.0000285| 0.0000308|
| 128x 1024 | 0.0000487| 0.0000786|
| 2560x 1024 | 0.0004614| 0.0010190|
| 4096x 1024 | 0.0008083| 0.0017376|
| 8192x 1024 | 0.0059020| 0.0075588|
|16384x 1024 | 0.0116553| 0.0146855|
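At 32 channels the new code is roughly on par with MKL; from 64 channels upward it pulls ahead, reaching about a 2x speedup at 256-1024 channels for large batches.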
Benchmark program
```python
import mxnet as mx
import time

def time_procedure(shape, count):
    data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
    factors = mx.nd.random_uniform(shape=(shape[-1],))
    mx.nd.waitall()
    begin = time.time()
    for i in range(0, count):
        out = mx.nd.LayerNorm(data, factors, factors)
    mx.nd.waitall()
    return (time.time() - begin) / count

count = 200
for channel in [32, 64, 128, 256, 512, 768, 1024]:
    for batch in [1, 128, 2560, 4096, 8192, 16384]:
        s = (batch, channel)
        timing = time_procedure(s, count)
        print("{:5d}x{:5d} | {:.7f}".format(s[0], s[1], timing))
```
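To reproduce the table above, run the script with the same thread count set in the environment (OMP_NUM_THREADS=4); absolute timings will of course vary by machine.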
Hey @kpuatamazon, thanks for submitting the PR.
CI supported jobs: [centos-cpu, clang, windows-gpu, sanity, centos-gpu, miscellaneous, unix-cpu, windows-cpu, website, unix-gpu, edge]

@mxnet-bot run ci [all] Sigh, everything is broken on some Python HTTP thing.

Jenkins CI successfully triggered: [edge, sanity, windows-gpu, unix-gpu, clang, centos-gpu, unix-cpu, miscellaneous, website, centos-cpu, windows-cpu]

@mxnet-bot run ci [unix-cpu, website, windows-cpu, windows-gpu] Playing more CI docker daemon lottery.

Jenkins CI successfully triggered: [website, windows-gpu, unix-cpu, windows-cpu]

@mxnet-bot run ci [unix-cpu] Memory gambling is annoying.

Jenkins CI successfully triggered: [unix-cpu]

@mxnet-bot run ci [unix-cpu] Still just running out of RAM compiling NumPy kernels. https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19602/19/pipeline/

Jenkins CI successfully triggered: [unix-cpu]
Referenced diff excerpt:

    std::conditional<std::is_same<mshadow::half::half_t, Data>::value,
                     float,
                     Data>::type>
    void LayerNormCPUKernel(size_t width,
I would recommend renaming it to LayerNormContiguousCPUKernel or LayerNormLastAxisCPUKernel.
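For context, a rough NumPy sketch of what such a last-axis (contiguous) kernel computes per row; the function name and eps value here are illustrative, not the actual C++ signature:

```python
import numpy as np

def layer_norm_last_axis(data, gamma, beta, eps=1e-5):
    # Normalize each row over the last, contiguous axis,
    # then apply the learned scale (gamma) and shift (beta).
    mean = data.mean(axis=-1, keepdims=True)
    var = data.var(axis=-1, keepdims=True)
    return (data - mean) / np.sqrt(var + eps) * gamma + beta
```

The contiguity of the last axis is what lets the kernel stream over each row with vectorized loads, which the generic (any-axis) path cannot assume.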
One naming issue. Looks good to me.
sxjscience left a comment:
Minor issue (can be addressed later actually).
What are the next steps for this PR? Is this ready to be merged?

@fhieber I've just merged. Feel free to try it out.
Description
This is the master version of #19601. There isn't much difference in the LayerNorm implementation between v1.x and master.
Checklist
Essentials
Changes
Comments
See #19601 for benchmarks.
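As an illustrative sanity check (not part of the PR), the operator output can be compared against a NumPy reference; the shapes and eps below are arbitrary:

```python
import mxnet as mx
import numpy as np

data = mx.nd.random_uniform(shape=(128, 256), low=-1.0, high=1.0)
gamma = mx.nd.random_uniform(shape=(256,))
beta = mx.nd.random_uniform(shape=(256,))

out = mx.nd.LayerNorm(data, gamma, beta, eps=1e-5).asnumpy()

# NumPy reference: normalize over the last axis, then scale and shift.
x = data.asnumpy()
mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
ref = (x - mean) / np.sqrt(var + 1e-5) * gamma.asnumpy() + beta.asnumpy()

print(np.abs(out - ref).max())  # expect on the order of 1e-6 for float32
```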