
Conversation

@xziya
Contributor

@xziya xziya commented Apr 9, 2020

Description

In this PR, we add support for the quantization flow of the RNN operator. Currently, only the LSTM mode supports INT8 inference. A rough sketch of the user-facing flow is given below.
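For context, MXNet's post-training quantization is typically driven through `mx.contrib.quantization.quantize_model`. The snippet below is only an illustrative sketch on a toy FP32 graph, not this PR's test code: the graph, shapes, and parameter values are placeholders, and whether the RNN/LSTM path needs extra options is determined by this PR rather than by this sketch.

```python
import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# A toy FP32 graph standing in for the real LSTM language model.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data, num_hidden=16, name='fc')
sym = mx.sym.SoftmaxOutput(fc, name='softmax')

# Randomly initialized parameters, just to keep the sketch self-contained.
arg_params = {
    'fc_weight': mx.nd.random.uniform(-1, 1, shape=(16, 32)),
    'fc_bias': mx.nd.zeros((16,)),
}
aux_params = {}

# calib_mode='none' skips calibration entirely; a real deployment would pass a
# representative calib_data iterator and use calib_mode='naive' or 'entropy'.
qsym, qarg_params, qaux_params = quantize_model(
    sym=sym, arg_params=arg_params, aux_params=aux_params,
    ctx=mx.cpu(), calib_mode='none', quantized_dtype='int8')
print(qsym.tojson()[:200])  # quantized symbol with int8 ops inserted
```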

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add _contrib_quantized_rnn op.
  • Add asymmetric quantization via the _contrib_quantized_asym op, which quantizes FP32 data to U8 data with a scale and a shift (see the sketch after this list).
  • Add MXNET_USE_WEIGHT_CACHE to control RNN initialization behavior.
  • Support data layout in NDArrayIter. Previously, NDArrayIter supported only the NCHW layout by default, with no way to handle other layouts such as the sequential TNC layout. This PR changes NDArrayIter so that such layouts can be used as well (assuming that N denotes the batch dimension); a usage sketch follows below.
  • Move MKLDNNRnnMemMgr to the individual layer.
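For illustration, asymmetric quantization maps an FP32 tensor onto the unsigned 8-bit range using a scale and a shift (zero point). The numpy sketch below shows the idea only; it is not the operator's actual kernel, and details such as rounding and how the scale/shift are chosen may differ in `_contrib_quantize_asym`.

```python
import numpy as np

def quantize_asym(x):
    """Illustrative FP32 -> uint8 asymmetric quantization via scale and shift."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = 255.0 / (x_max - x_min) if x_max > x_min else 1.0
    shift = -x_min * scale  # offset so that x_min maps to 0
    q = np.clip(np.round(x * scale + shift), 0, 255).astype(np.uint8)
    return q, scale, shift

def dequantize_asym(q, scale, shift):
    """Recover an FP32 approximation of the original values."""
    return (q.astype(np.float32) - shift) / scale

x = np.random.uniform(-3.0, 2.0, size=(4, 8)).astype(np.float32)
q, scale, shift = quantize_asym(x)
print(np.abs(dequantize_asym(q, scale, shift) - x).max())  # small reconstruction error
```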

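And for the NDArrayIter change, the intended usage is roughly as below. This is a hypothetical sketch: the argument name `layout` and the value `'TNC'` are assumptions for illustration, and the actual interface is whatever this PR defines.

```python
import mxnet as mx
import numpy as np

# Sequential data laid out as TNC: (seq_len=35, batch=32, feature=200).
data = np.random.uniform(size=(35, 32, 200)).astype(np.float32)

# By default, NDArrayIter batches along axis 0 (NCHW-style layouts), so
# time-major TNC data cannot be sliced into batches correctly. With this PR,
# the iterator can be told where the batch (N) axis lives.
it = mx.io.NDArrayIter(data=data, batch_size=8, layout='TNC')  # 'layout' is hypothetical
for batch in it:
    print(batch.data[0].shape)  # expected (35, 8, 200): batching along the N axis
    break
```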
@ciyongch @TaoLv @pengzhao-intel

@mxnet-bot

Hey @zixuanweeei, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-cpu, website, sanity, miscellaneous, unix-gpu, centos-gpu, clang, unix-cpu, edge, centos-cpu, windows-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

Member

@eric-haibin-lin eric-haibin-lin left a comment


what's the performance?

@xziya
Contributor Author

xziya commented Apr 13, 2020

what's the performance?

We have verified the accuracy and performance using a pre-trained language model provided by gluon-nlp (a link).

Accuracy (PPL, lower is better)

|  | FP32 | INT8 |
| --- | --- | --- |
| Validation dataset | 68.80 | 69.24 |
| Test dataset | 65.72 | 66.14 |

The INT8 accuracy is very close to that of FP32.

Performance

Profiler Dumps of FP32 End-to-End

| Name | Total Count | Time (ms) | Min Time (ms) | Max Time (ms) | Avg Time (ms) |
| --- | --- | --- | --- | --- | --- |
| `log_softmax` | 350 | 10968.93 | 31.09 | 31.54 | 31.34 |
| `RNN` | 1050 | 5664.45 | 3.13 | 7.37 | 5.39 |
| `_sg_mkldnn_fully_connected` | 350 | 2630.26 | 7.40 | 7.78 | 7.52 |
| `_rnn_param_concat` | 1050 | 2392.41 | 0.94 | 3.73 | 2.28 |
| `Reshape` | 4200 | 775.83 | 0.01 | 0.64 | 0.18 |
| `DeleteVariable` | 3856 | 185.39 | 0.00 | 0.53 | 0.05 |
| `CopyCPU2CPU` | 2450 | 48.89 | 0.01 | 0.05 | 0.02 |
| `Embedding` | 350 | 21.29 | 0.06 | 0.08 | 0.06 |
| `WaitForVar` | 2800 | 12.85 | 0.00 | 0.02 | 0.00 |
| `mean` | 350 | 9.26 | 0.02 | 0.05 | 0.03 |
| `Dropout` | 1400 | 8.38 | 0.00 | 0.01 | 0.01 |
| `sum` | 350 | 6.85 | 0.02 | 0.04 | 0.02 |
| `pick` | 350 | 6.55 | 0.02 | 0.03 | 0.02 |
| `_mul_scalar` | 350 | 3.56 | 0.01 | 0.02 | 0.01 |
| `_zeros` | 6 | 0.16 | 0.01 | 0.07 | 0.03 |
| **Total** |  | 22735.04 |  |  |  |

Profiler Dumps of INT8 End-to-End

| Name | Total Count | Time (ms) | Min Time (ms) | Max Time (ms) | Avg Time (ms) |
| --- | --- | --- | --- | --- | --- |
| `log_softmax` | 350 | 10805.84 | 30.72 | 35.89 | 30.87 |
| `_contrib_quantized_rnn` | 1050 | 2857.42 | 1.52 | 3.81 | 2.72 |
| `_rnn_param_concat` | 1050 | 2375.36 | 0.83 | 5.93 | 2.26 |
| `_contrib_quantize_asym` | 1050 | 1580.61 | 0.55 | 4.87 | 1.51 |
| `_sg_mkldnn_fully_connected` | 350 | 1559.83 | 4.42 | 4.65 | 4.46 |
| `Reshape` | 4200 | 762.71 | 0.01 | 0.66 | 0.18 |
| `DeleteVariable` | 3856 | 131.79 | 0.00 | 0.44 | 0.03 |
| `CopyCPU2CPU` | 2450 | 48.68 | 0.01 | 0.06 | 0.02 |
| `Embedding` | 350 | 21.03 | 0.06 | 0.08 | 0.06 |
| `WaitForVar` | 2796 | 12.34 | 0.00 | 0.02 | 0.00 |
| `_contrib_quantize_v2` | 350 | 11.29 | 0.03 | 0.06 | 0.03 |
| `mean` | 350 | 9.17 | 0.02 | 0.15 | 0.03 |
| `Dropout` | 1400 | 8.31 | 0.00 | 0.01 | 0.01 |
| `sum` | 350 | 6.63 | 0.02 | 0.04 | 0.02 |
| `pick` | 350 | 6.22 | 0.02 | 0.03 | 0.02 |
| `_mul_scalar` | 350 | 3.67 | 0.01 | 0.03 | 0.01 |
| `_zeros` | 6 | 0.11 | 0.01 | 0.07 | 0.02 |
| **Total** |  | 20201.01 |  |  |  |

End-to-end latency shows only a ~1.1x speedup (22735.04 ms vs. 20201.01 ms), which is not that impressive. However, `_contrib_quantized_rnn` achieves a ~2.0x speedup over `RNN`. Since `RNN` accounts for only ~25% of the total time while `log_softmax` takes ~48%, the speedup from `_contrib_quantized_rnn` is diluted at the end-to-end level. In addition, `_contrib_quantize_asym` currently performs poorly and needs further optimization (WIP).

Besides, the LSTM quantization flow only moves the GEMM operations to INT8. The remaining computations, such as gate additions, bias additions, and element-wise activations, stay in FP32, so the speedup of `_contrib_quantized_rnn` cannot reach the expected 3~4x.
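As a quick cross-check, the ratios quoted above follow directly from the profiler totals in the two dumps:

```python
fp32_total, int8_total = 22735.04, 20201.01  # end-to-end time (ms)
fp32_rnn, int8_rnn = 5664.45, 2857.42        # RNN vs. _contrib_quantized_rnn (ms)
fp32_log_softmax = 10968.93                  # log_softmax time in the FP32 run (ms)

print(fp32_total / int8_total)        # ~1.13x end-to-end speedup
print(fp32_rnn / int8_rnn)            # ~1.98x speedup for the quantized RNN op alone
print(fp32_rnn / fp32_total)          # RNN is only ~25% of the FP32 end-to-end time,
print(fp32_log_softmax / fp32_total)  # while log_softmax takes ~48%, capping the overall gain
```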

Member

@eric-haibin-lin eric-haibin-lin left a comment


thanks for sharing!

@eric-haibin-lin
Member

Is there a plan to improve log_softmax on cpu?

@ciyongch
Contributor

@eric-haibin-lin We'll enable the DNNL primitive for log_softmax to improve its performance on CPU, but not in this PR :)

@pengzhao-intel
Contributor

@zixuanweeei could you rebase and resolve the conflict?

@xziya
Contributor Author

xziya commented Apr 16, 2020

@zixuanweeei could you rebase and resolve the conflict?

Currently, we are focusing on adding this feature to the v1.6.x branch, as well as the quantized LSTMP operator. I will port the changes from there to this PR soon. Thanks for the reminder.

@xziya xziya force-pushed the rnn/quantization branch from 90dfcd6 to 76b85f4 on April 22, 2020 01:44
@xziya
Contributor Author

xziya commented Apr 22, 2020

@mxnet-bot run ci [all]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, windows-gpu, centos-cpu, sanity, miscellaneous, website, clang, windows-cpu, centos-gpu, unix-gpu, edge]

@xziya
Contributor Author

xziya commented Apr 23, 2020

@mxnet-bot run ci [windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

@xziya
Contributor Author

xziya commented Apr 23, 2020

@mxnet-bot run ci [windows-gpu]

1 similar comment
@xziya
Contributor Author

xziya commented Apr 25, 2020

@mxnet-bot run ci [windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

xziya added 2 commits April 25, 2020 12:07
* Add _contrib_quantized_rnn op

* Add asymmetric quantization - _contrib_quantized_asym op

* Add MXNET_USE_WEIGHT_CACHE to control rnn init behavior

* Support data layout in NDArrayIter

* Move MKLDNNRnnMgr to individual layer
@xziya xziya force-pushed the rnn/quantization branch from 76b85f4 to 9d61032 on April 25, 2020 04:07
@pengzhao-intel
Contributor

Closing since we need to refactor the quantization flow in master.
