
Conversation

@xziya (Contributor) commented on Jul 22, 2019

Description

We integrated the MKL-DNN Linear-Before-Reset (LBR) GRU into MXNet. Currently, it supports FP32 inference. @pengzhao-intel @ciyongch @TaoLv

Checklist

Essentials

  • Changes are complete. FP32 inference for all RNN variants supported by MXNet is covered by this PR.

Changes

  • LBR-GRU inference dispatches directly to the MKL-DNN RNN forward primitive by default.
  • Move the mkldnn::memory objects into a struct (see the sketch below).
  • Drop unnecessary mkldnn::memory objects from the multi-layer implementation.
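
For context, a minimal sketch of the struct-grouping change. Only `concat_weight_memory` is taken from the review diff further down this thread; the other member names are illustrative, not the PR's actual fields:

```cpp
#include <vector>
#include <mkldnn.hpp>

// Sketch only: group the per-layer mkldnn::memory buffers that were
// previously passed around as separate vectors into a single struct.
// Member names other than concat_weight_memory are invented.
struct MKLDNNRNNMemory {
  std::vector<mkldnn::memory> concat_weight_memory;  // fused wx/wh weights
  std::vector<mkldnn::memory> bias_memory;           // per-layer bias
  std::vector<mkldnn::memory> x_memory;              // layer inputs
  std::vector<mkldnn::memory> y_memory;              // layer outputs
};
```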

Performance

We tested the performance of FusedRNN with mode='gru' using the same dimensions as in PR #14713, i.e. seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800.

| mode | Layer | Direction | Throughput (samples/sec), MXNET_USE_MKLDNN_RNN=0 | Latency (ms), MXNET_USE_MKLDNN_RNN=0 | Throughput (samples/sec), MXNET_USE_MKLDNN_RNN=1 | Latency (ms), MXNET_USE_MKLDNN_RNN=1 | SpeedUp (Throughput) | SpeedUp (Latency) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gru | 1 | 1 | 430.03 | 20.43 | 806.27 | 4.28 | 1.87 | 4.78 |
| gru | 1 | 2 | 218.58 | 119.50 | 416.55 | 8.58 | 1.91 | 13.93 |
| gru | 5 | 1 | 89.47 | 100.07 | 177.52 | 21.20 | 1.98 | 4.72 |
| gru | 5 | 2 | 39.68 | 611.38 | 71.15 | 46.45 | 1.79 | 13.16 |

We also compared the performance of this PR against the previously integrated LSTM, vRNN-tanh, and vRNN-ReLU on the master branch. There is a noticeable regression with mode='lstm'.

| mode | Layer | Direction | Throughput (samples/sec), eec0fb4 | Latency (ms), eec0fb4 | Throughput (samples/sec), this PR (c186863) | Latency (ms), this PR (c186863) | Gap (Throughput) | Gap (Latency) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lstm | 1 | 1 | 675.24 | 4.98 | 654.61 | 5.71 | 0.97 | 0.87 |
| lstm | 1 | 2 | 343.99 | 9.86 | 333.13 | 11.65 | 0.97 | 0.85 |
| lstm | 5 | 1 | 141.30 | 24.03 | 138.59 | 28.39 | 0.98 | 0.85 |
| lstm | 5 | 2 | 55.67 | 53.16 | 54.11 | 61.29 | 0.97 | 0.87 |
| rnn_tanh | 1 | 1 | 1617.27 | 2.46 | 1541.13 | 2.60 | 0.95 | 0.94 |
| rnn_tanh | 1 | 2 | 851.16 | 4.82 | 828.10 | 5.01 | 0.97 | 0.96 |
| rnn_tanh | 5 | 1 | 390.48 | 11.66 | 376.38 | 12.27 | 0.96 | 0.95 |
| rnn_tanh | 5 | 2 | 164.11 | 25.72 | 156.64 | 26.74 | 0.95 | 0.96 |
| rnn_relu | 1 | 1 | 1582.22 | 2.65 | 1508.54 | 2.59 | 0.95 | 1.02 |
| rnn_relu | 1 | 2 | 824.18 | 5.20 | 803.40 | 5.04 | 0.97 | 1.03 |
| rnn_relu | 5 | 1 | 381.53 | 12.58 | 366.71 | 12.09 | 0.96 | 1.04 |
| rnn_relu | 5 | 2 | 153.11 | 27.57 | 153.67 | 27.06 | 1.00 | 1.02 |

Comments

  • The gate orders of GRU differ between MXNet and MKL-DNN, so there is extra overhead when preparing the mkldnn::memory objects with mode='gru'; the sketch below illustrates the kind of reordering involved.
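
To make the overhead concrete, here is a standalone, hypothetical sketch of copying contiguous per-gate weight blocks from one gate order into another. The permutation used below is only an example; the actual MXNet and MKL-DNN gate orders should be checked against their respective documentation:

```cpp
#include <array>
#include <cstring>
#include <vector>

// Sketch only: src holds 3 gate blocks of rows*cols floats back to back in
// the framework's gate order; dst receives them in the library's order.
// perm[g] names the source gate that lands in output slot g.
void ReorderGateWeights(const float* src, float* dst, size_t rows, size_t cols,
                        const std::array<int, 3>& perm) {
  const size_t gate_size = rows * cols;
  for (int g = 0; g < 3; ++g) {
    std::memcpy(dst + g * gate_size, src + perm[g] * gate_size,
                gate_size * sizeof(float));
  }
}

int main() {
  const size_t rows = 800, cols = 800;  // e.g. hidden_size = 800 as benchmarked
  std::vector<float> src(3 * rows * cols, 1.f), dst(3 * rows * cols, 0.f);
  // Example permutation only; not the verified MXNet-to-MKL-DNN mapping.
  ReorderGateWeights(src.data(), dst.data(), rows, cols, {1, 0, 2});
  return 0;
}
```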

```diff
 case rnn_enum::kGru:
   size = 2 * (D * (I + H) * 3 * H + (L - 1) * D * (D * H + H) * 3 * H +
-       L * D * 2 * N * H) + T * N * D * H + L * 2 * D * 3 * H + (L + 2) * D * 2 * N * H +
+       L * D * 2 * N * H) + T * N * D * H + L * 2 * D * 4 * H + (L + 2) * D * 2 * N * H +
```
Contributor

Nit: I know it's out of the scope of this PR, but could we rename the variables to something more self-explanatory?

Contributor Author

No problem, I will give it a try. I attempted this before, but there is a large amount of code involved. I think we can factor the common parts out.

Contributor Author

@marcoabreu Sorry for the late update. I have renamed this part. Would you mind checking it again? Thanks.

@marcoabreu (Contributor)

Could you elaborate on where "MXNET_USE_MKLDNN_RNN" comes from? Out of scope for this PR, but why did we introduce that switch instead of just going with it?

@pengzhao-intel (Contributor)

> Could you elaborate on where "MXNET_USE_MKLDNN_RNN" comes from? Out of scope for this PR, but why did we introduce that switch instead of just going with it?

@marcoabreu Some background on this env variable: the MKL-DNN RNN integration for RNN/LSTM was pulled into 1.5 by the application team, so we set an env variable to let users roll back in case of unexpected functionality or performance regressions. Now we are refactoring the code and adding support for the whole RNN API. This variable will be removed in the near future.

What's your opinion?
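
For readers following the thread, a self-contained sketch of the rollback switch being described, assuming default-on semantics; MXNet's backend typically reads such variables via dmlc::GetEnv, but std::getenv keeps the example standalone:

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only: read an on/off environment switch with a default of "on",
// mirroring how a rollback variable like MXNET_USE_MKLDNN_RNN can gate a
// fused code path.
bool UseMKLDNNRNN() {
  const char* val = std::getenv("MXNET_USE_MKLDNN_RNN");
  if (val == nullptr) return true;     // default: MKL-DNN path enabled
  return std::strcmp(val, "0") != 0;   // "0" rolls back to the old path
}

int main() {
  if (UseMKLDNNRNN()) {
    // dispatch to the MKL-DNN fused RNN primitive
  } else {
    // dispatch to the reference RNN implementation
  }
  return 0;
}
```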

@marcoabreu (Contributor)

That approach sounds excellent and very clean, great idea! Thanks for elaborating.

Near future is before 1.6, correct?

@pengzhao-intel (Contributor)

> That approach sounds excellent and very clean, great idea! Thanks for elaborating.
>
> Near future is before 1.6, correct?

Yes, before 1.6 :)

@abhinavs95 (Contributor)

@mxnet-label-bot add [mkldnn, pr-work-in-progress]

@marcoabreu added the MKLDNN and pr-work-in-progress (PR is still work in progress) labels on Jul 26, 2019
@xziya (Contributor, Author) commented on Jul 30, 2019

Performance of the latest commit: @ciyongch I have checked the performance again. The numbers are similar, as we discussed last time.

| mode | Layer | Direction | Throughput (samples/sec), a26af2b | Latency (ms), a26af2b | Throughput (samples/sec), this PR (cfc6910) | Latency (ms), this PR (cfc6910) | Gap (Throughput) | Gap (Latency) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lstm | 1 | 1 | 630.78 | 4.82 | 670.23 | 4.87 | 1.06 | 0.99 |
| lstm | 1 | 2 | 313.71 | 9.68 | 338.51 | 9.72 | 1.08 | 1.00 |
| lstm | 5 | 1 | 139.85 | 23.59 | 138.22 | 23.48 | 0.99 | 1.00 |
| lstm | 5 | 2 | 54.63 | 51.19 | 54.27 | 51.28 | 0.99 | 1.00 |
| rnn_tanh | 1 | 1 | 1573.45 | 2.44 | 1576.23 | 2.51 | 1.00 | 0.97 |
| rnn_tanh | 1 | 2 | 836.43 | 4.63 | 830.33 | 4.67 | 0.99 | 0.99 |
| rnn_tanh | 5 | 1 | 381.32 | 11.44 | 379.88 | 11.50 | 1.00 | 1.00 |
| rnn_tanh | 5 | 2 | 159.76 | 24.92 | 149.86 | 24.90 | 0.94 | 1.00 |
| rnn_relu | 1 | 1 | 1536.55 | 2.65 | 1540.29 | 2.75 | 1.00 | 0.96 |
| rnn_relu | 1 | 2 | 805.00 | 5.09 | 807.68 | 5.06 | 1.00 | 1.01 |
| rnn_relu | 5 | 1 | 373.27 | 12.41 | 377.79 | 12.32 | 1.01 | 1.01 |
| rnn_relu | 5 | 2 | 154.21 | 26.93 | 153.80 | 26.61 | 1.00 | 1.01 |

@ciyongch (Contributor)

@zixuanweeei Thanks for fixing the LSTM perf drop. It's OK to remove "WIP" from the title now.

@xziya changed the title from "[WIP] MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU)" to "MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU)" on Jul 30, 2019
@xziya (Contributor, Author) commented on Jul 31, 2019

@TaoLv Please review this PR. Thanks.

@pengzhao-intel (Contributor) left a comment

LGTM.

Minor suggestion: update the MKLDNN wiki for the RNN parts in this PR, since all basic RNN variants are supported now.

@TaoLv @ciyongch, any other comments? I plan to merge this PR soon.

```cpp
auto cpu_engine = CpuEngine::Get()->get_engine();
std::vector<mkldnn::memory::primitive_desc> srcs_pd;
std::vector<mkldnn::memory> srcs;
bool initialized = tmp_src_mems->size() > 0;
```
Member

const bool initialized?

```cpp
GetMKLDNNRNNAlgo(mode, &n_gates, &n_states);
int n_bias = mode == rnn_enum::kGru ? n_gates + 1 : n_gates;
// sizes of single gates from a single cell
const size_t weights_size_0 = direction * (input_size + hidden_size) * hidden_size;
```
Member

assign int to size_t?

Contributor Author

The input params of GetMKLDNNRNNCacheMemorySize are declared as const size_t. Intermediate results of int type may overflow; see the example below.
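
A standalone illustration of the overflow the const size_t parameters avoid. The values are made up; 32-bit unsigned operands are used so the wrap-around is well-defined rather than undefined behavior, and size_t is 64-bit on typical LP64 platforms:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
  // Illustrative values only. With 32-bit intermediates the product wraps
  // around; with size_t (64-bit on LP64) intermediates it does not.
  const std::uint32_t d32 = 2, i32 = 40000, h32 = 40000;
  const std::size_t   d64 = 2, i64 = 40000, h64 = 40000;

  const std::size_t wrapped = d32 * (i32 + h32) * h32;  // computed in 32 bits
  const std::size_t correct = d64 * (i64 + h64) * h64;  // computed in 64 bits

  std::printf("32-bit intermediates: %zu\n64-bit intermediates: %zu\n",
              wrapped, correct);
  return 0;
}
```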

```diff
-auto src_wx = (*concat_weight_memory)[2 * layer_index];
-auto src_wh = (*concat_weight_memory)[2 * layer_index + 1];
+auto src_wx = mkldnn_mems->concat_weight_memory[2 * layer_index];
+auto src_wh = mkldnn_mems->concat_weight_memory[2 * layer_index + 1];
```
Member

reference or copy?
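
For readers unfamiliar with the distinction being raised: plain auto deduces a value type and therefore copies, while auto& binds to the stored element. A generic illustration, with Handle standing in for a handle type such as mkldnn::memory:

```cpp
#include <vector>

struct Handle { int id; };  // stand-in for a handle type such as mkldnn::memory

int main() {
  std::vector<Handle> mems = {{1}, {2}};

  auto  by_value = mems[0];  // deduced as Handle: a copy is made
  auto& by_ref   = mems[0];  // deduced as Handle&: refers to the element

  by_value.id = 99;          // does not affect mems[0]
  by_ref.id   = 42;          // mems[0].id is now 42
  return 0;
}
```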

```cpp
  user_bias[single_b_sz + j] = back_bx[j + H] + back_bh[j + H];
}
#pragma omp parallel for num_threads(omp_threads)
for (int j = 2 * H; j < 3 * H; j++) {
```
Member

Is it possible to merge this for loop into the one above? I can see they have the same steps, but I am not sure whether there is any dependency.

Contributor Author

Yep, we can merge these two into one loop; a sketch follows below. Both variants have the same performance: they cost about 18 us with hidden_size=4096 on one socket of a Skylake 8180.
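
A minimal sketch of the fusion being discussed, with invented, simplified names (compile with -fopenmp). The two adjacent index ranges are combined into a single parallel region, which saves one fork/join:

```cpp
#include <vector>

int main() {
  const int H = 4096;  // hidden_size used in the timing quoted above
  std::vector<float> user_bias(3 * H, 0.f), bx(3 * H, 1.f), bh(3 * H, 1.f);

  // Instead of one parallel region over [H, 2H) and another over [2H, 3H),
  // a single region covers [H, 3H). The iterations are independent, so the
  // fusion is safe and only removes one fork/join.
  #pragma omp parallel for
  for (int j = H; j < 3 * H; ++j) {
    user_bias[j] = bx[j] + bh[j];
  }
  return 0;
}
```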

ZhennanQin and others added 11 commits August 3, 2019 07:52
* enhance api and new tutorial

* Update MKLDNN_QUANTIZATION.md

update

* fix lint

* modify pics

* skip test

* add quantize layer in graph

* update

* remove center css flag

* change requantize color

* fix markdown pics

* change to use png

* Update MKLDNN_QUANTIZATION.md

update

* enable ipython script

* fix png

* fix lint

* Update MKLDNN_QUANTIZATION.md

* change title

* trigger

* use lower case

* some typo

* some typo

* use dmlc web data

* trigger

* trigger
* make TransposeShape infer shape form both sides

* small fixes

* remove redundant lines

* unit tests
* Added tutorial for FIT API

* Added tests for Fit API tutorial

* Updated index.md for the new tutorial to show up

* Addressed PR feedback

* Addressed PR feedback

* Removed spurious comment for Py2 and Py3 compatibility

* Address PR feedback

* Addressed PR feedback

* Fixed typo

* Added example to showcase custom event handler

* Fixed imports as estimator moved to contrib package

* Added a side note to inform about estimator reference being updated by the handlers

* Corrected typo

* update tutorial

* address comments

* new line

* fix import

* fix cached graph

* fix import

* address comments

* fix doc gen

* add softmax

* add to website index

* fix doc string

* Fix doc gen (#12)

* fix warining

* fix test

* fix

* fix

* fix print

* fix test (#13)

* fix warning (#14)

* fix href (#15)
* Update test_profiler.py

* retrigger tests
* add magic method abs to ndarray

* add relevant tests

* add magic method abs to symbol

* add relevant tests

* retrigger CI

* retrigger CI
* prevent TRT_Logger to be destroyed before TRT engine

* use unique_ptr for trt_logger/parser/engine/executor ownership

* reduce line length for lint
@xziya (Contributor, Author) commented on Aug 3, 2019

Sorry for the inconvenience. This PR has been moved to #15741.
