
Conversation

@xziya
Contributor

@xziya xziya commented Apr 9, 2020

Description

In this PR, we add support for the quantization flow of the RNN operator. Currently, only the LSTM mode supports INT8 inference. A rough sketch of the user-facing flow is given below.
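For context, MXNet's post-training quantization is typically driven through `mx.contrib.quantization.quantize_model`. The snippet below is only an illustrative sketch on a toy FP32 graph, not this PR's test code: the graph, shapes, and parameter values are placeholders, and whether the RNN/LSTM path needs extra options is determined by this PR rather than by this sketch.

```python
import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# A toy FP32 graph standing in for the real LSTM language model.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data, num_hidden=16, name='fc')
sym = mx.sym.SoftmaxOutput(fc, name='softmax')

# Randomly initialized parameters, just to keep the sketch self-contained.
arg_params = {
    'fc_weight': mx.nd.random.uniform(-1, 1, shape=(16, 32)),
    'fc_bias': mx.nd.zeros((16,)),
}
aux_params = {}

# calib_mode='none' skips calibration entirely; a real deployment would pass a
# representative calib_data iterator and use calib_mode='naive' or 'entropy'.
qsym, qarg_params, qaux_params = quantize_model(
    sym=sym, arg_params=arg_params, aux_params=aux_params,
    ctx=mx.cpu(), calib_mode='none', quantized_dtype='int8')
print(qsym.tojson()[:200])  # quantized symbol with int8 ops inserted
```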

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add _contrib_quantized_rnn op.
  • Add asymmetric quantization via the _contrib_quantized_asym op, which quantizes FP32 data to U8 data with a scale and a shift (see the sketch after this list).
  • Add MXNET_USE_WEIGHT_CACHE to control RNN initialization behavior.
  • Support data layout in NDArrayIter. Previously, NDArrayIter supported only the NCHW layout by default, with no way to handle other layouts such as the sequential TNC layout. This PR changes NDArrayIter so that such layouts can be used as well (assuming that N denotes the batch dimension); a usage sketch follows below.
  • Move MKLDNNRnnMemMgr to the individual layer.
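For illustration, asymmetric quantization maps an FP32 tensor onto the unsigned 8-bit range using a scale and a shift (zero point). The numpy sketch below shows the idea only; it is not the operator's actual kernel, and details such as rounding and how the scale/shift are chosen may differ in `_contrib_quantize_asym`.

```python
import numpy as np

def quantize_asym(x):
    """Illustrative FP32 -> uint8 asymmetric quantization via scale and shift."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = 255.0 / (x_max - x_min) if x_max > x_min else 1.0
    shift = -x_min * scale  # offset so that x_min maps to 0
    q = np.clip(np.round(x * scale + shift), 0, 255).astype(np.uint8)
    return q, scale, shift

def dequantize_asym(q, scale, shift):
    """Recover an FP32 approximation of the original values."""
    return (q.astype(np.float32) - shift) / scale

x = np.random.uniform(-3.0, 2.0, size=(4, 8)).astype(np.float32)
q, scale, shift = quantize_asym(x)
print(np.abs(dequantize_asym(q, scale, shift) - x).max())  # small reconstruction error
```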

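And for the NDArrayIter change, the intended usage is roughly as below. This is a hypothetical sketch: the argument name `layout` and the value `'TNC'` are assumptions for illustration, and the actual interface is whatever this PR defines.

```python
import mxnet as mx
import numpy as np

# Sequential data laid out as TNC: (seq_len=35, batch=32, feature=200).
data = np.random.uniform(size=(35, 32, 200)).astype(np.float32)

# By default, NDArrayIter batches along axis 0 (NCHW-style layouts), so
# time-major TNC data cannot be sliced into batches correctly. With this PR,
# the iterator can be told where the batch (N) axis lives.
it = mx.io.NDArrayIter(data=data, batch_size=8, layout='TNC')  # 'layout' is hypothetical
for batch in it:
    print(batch.data[0].shape)  # expected (35, 8, 200): batching along the N axis
    break
```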
@ciyongch @TaoLv @pengzhao-intel

@mxnet-bot

Hey @zixuanweeei, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-cpu, website, sanity, miscellaneous, unix-gpu, centos-gpu, clang, unix-cpu, edge, centos-cpu, windows-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

Member

@eric-haibin-lin eric-haibin-lin left a comment


what's the performance?

@xziya
Contributor Author

xziya commented Apr 13, 2020

what's the performance?

We have verified the accuracy and performance using a pre-trained language model provided by gluon-nlp (a link).

Accuracy (PPL, lower is better)

|  | FP32 | INT8 |
| --- | --- | --- |
| Validation dataset | 68.80 | 69.24 |
| Test dataset | 65.72 | 66.14 |

The INT8 accuracy is very close to that of FP32.

Performance

Profiler Dumps of FP32 End-to-End

| Name | Total Count | Time (ms) | Min Time (ms) | Max Time (ms) | Avg Time (ms) |
| --- | --- | --- | --- | --- | --- |
| `log_softmax` | 350 | 10968.93 | 31.09 | 31.54 | 31.34 |
| `RNN` | 1050 | 5664.45 | 3.13 | 7.37 | 5.39 |
| `_sg_mkldnn_fully_connected` | 350 | 2630.26 | 7.40 | 7.78 | 7.52 |
| `_rnn_param_concat` | 1050 | 2392.41 | 0.94 | 3.73 | 2.28 |
| `Reshape` | 4200 | 775.83 | 0.01 | 0.64 | 0.18 |
| `DeleteVariable` | 3856 | 185.39 | 0.00 | 0.53 | 0.05 |
| `CopyCPU2CPU` | 2450 | 48.89 | 0.01 | 0.05 | 0.02 |
| `Embedding` | 350 | 21.29 | 0.06 | 0.08 | 0.06 |
| `WaitForVar` | 2800 | 12.85 | 0.00 | 0.02 | 0.00 |
| `mean` | 350 | 9.26 | 0.02 | 0.05 | 0.03 |
| `Dropout` | 1400 | 8.38 | 0.00 | 0.01 | 0.01 |
| `sum` | 350 | 6.85 | 0.02 | 0.04 | 0.02 |
| `pick` | 350 | 6.55 | 0.02 | 0.03 | 0.02 |
| `_mul_scalar` | 350 | 3.56 | 0.01 | 0.02 | 0.01 |
| `_zeros` | 6 | 0.16 | 0.01 | 0.07 | 0.03 |
| **Total** |  | 22735.04 |  |  |  |

Profiler Dumps of INT8 End-to-End

| Name | Total Count | Time (ms) | Min Time (ms) | Max Time (ms) | Avg Time (ms) |
| --- | --- | --- | --- | --- | --- |
| `log_softmax` | 350 | 10805.84 | 30.72 | 35.89 | 30.87 |
| `_contrib_quantized_rnn` | 1050 | 2857.42 | 1.52 | 3.81 | 2.72 |
| `_rnn_param_concat` | 1050 | 2375.36 | 0.83 | 5.93 | 2.26 |
| `_contrib_quantize_asym` | 1050 | 1580.61 | 0.55 | 4.87 | 1.51 |
| `_sg_mkldnn_fully_connected` | 350 | 1559.83 | 4.42 | 4.65 | 4.46 |
| `Reshape` | 4200 | 762.71 | 0.01 | 0.66 | 0.18 |
| `DeleteVariable` | 3856 | 131.79 | 0.00 | 0.44 | 0.03 |
| `CopyCPU2CPU` | 2450 | 48.68 | 0.01 | 0.06 | 0.02 |
| `Embedding` | 350 | 21.03 | 0.06 | 0.08 | 0.06 |
| `WaitForVar` | 2796 | 12.34 | 0.00 | 0.02 | 0.00 |
| `_contrib_quantize_v2` | 350 | 11.29 | 0.03 | 0.06 | 0.03 |
| `mean` | 350 | 9.17 | 0.02 | 0.15 | 0.03 |
| `Dropout` | 1400 | 8.31 | 0.00 | 0.01 | 0.01 |
| `sum` | 350 | 6.63 | 0.02 | 0.04 | 0.02 |
| `pick` | 350 | 6.22 | 0.02 | 0.03 | 0.02 |
| `_mul_scalar` | 350 | 3.67 | 0.01 | 0.03 | 0.01 |
| `_zeros` | 6 | 0.11 | 0.01 | 0.07 | 0.02 |
| **Total** |  | 20201.01 |  |  |  |

End-to-end latency shows only a ~1.1x speedup (22735.04 ms vs. 20201.01 ms), which is not that impressive. However, `_contrib_quantized_rnn` achieves a ~2.0x speedup over `RNN`. Since `RNN` accounts for only ~25% of the total time while `log_softmax` takes ~48%, the speedup from `_contrib_quantized_rnn` is diluted at the end-to-end level. In addition, `_contrib_quantize_asym` currently performs poorly and needs further optimization (WIP).

Besides, the LSTM quantization flow only moves the GEMM operations to INT8. The remaining computations, such as gate additions, bias additions, and element-wise activations, stay in FP32, so the speedup of `_contrib_quantized_rnn` cannot reach the expected 3~4x.
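As a quick cross-check, the ratios quoted above follow directly from the profiler totals in the two dumps:

```python
fp32_total, int8_total = 22735.04, 20201.01  # end-to-end time (ms)
fp32_rnn, int8_rnn = 5664.45, 2857.42        # RNN vs. _contrib_quantized_rnn (ms)
fp32_log_softmax = 10968.93                  # log_softmax time in the FP32 run (ms)

print(fp32_total / int8_total)        # ~1.13x end-to-end speedup
print(fp32_rnn / int8_rnn)            # ~1.98x speedup for the quantized RNN op alone
print(fp32_rnn / fp32_total)          # RNN is only ~25% of the FP32 end-to-end time,
print(fp32_log_softmax / fp32_total)  # while log_softmax takes ~48%, capping the overall gain
```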

Member

@eric-haibin-lin eric-haibin-lin left a comment


thanks for sharing!

@eric-haibin-lin
Member

Is there a plan to improve log_softmax on cpu?

@ciyongch
Contributor

@eric-haibin-lin We'll enable the DNNL primitive for log_softmax to improve its performance on CPU, but not in this PR :)

@pengzhao-intel
Contributor

@zixuanweeei could you rebase and resolve the conflict?

@xziya
Contributor Author

xziya commented Apr 16, 2020

@zixuanweeei could you rebase and resolve the conflict?

Currently, we are focusing on adding this feature to the v1.6.x branch, as well as the quantized LSTMP operator. I will port the changes from there to this PR soon. Thanks for the reminder.

@xziya xziya force-pushed the rnn/quantization branch from 90dfcd6 to 76b85f4 on April 22, 2020 01:44
@xziya
Contributor Author

xziya commented Apr 22, 2020

@mxnet-bot run ci [all]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, windows-gpu, centos-cpu, sanity, miscellaneous, website, clang, windows-cpu, centos-gpu, unix-gpu, edge]

@xziya
Contributor Author

xziya commented Apr 23, 2020

@mxnet-bot run ci [windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

@xziya
Contributor Author

xziya commented Apr 23, 2020

@mxnet-bot run ci [windows-gpu]

1 similar comment
@xziya
Contributor Author

xziya commented Apr 25, 2020

@mxnet-bot run ci [windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu]

xziya added 2 commits April 25, 2020 12:07
* Add _contrib_quantized_rnn op

* Add asymmetric quantization - _contrib_quantized_asym op

* Add MXNET_USE_WEIGHT_CACHE to control rnn init behavior

* Support data layout in NDArrayIter

* Move MKLDNNRnnMgr to individual layer
@xziya xziya force-pushed the rnn/quantization branch from 76b85f4 to 9d61032 on April 25, 2020 04:07
@pengzhao-intel
Contributor

Closing since we need to refactor the quantization flow in master.
