
Conversation

@xziya (Contributor) commented on Jul 22, 2019

Description

We integrated the MKL-DNN Linear-Before-Reset (LBR) GRU into MXNet. Currently, it supports FP32 inference. @pengzhao-intel @ciyongch @TaoLv

Checklist

Essentials

  • Changes are complete. FP32 inference for all RNN variants supported by MXNet is covered by this PR.

Changes

  • LBR-GRU inference dispatches directly to the MKL-DNN RNN forward primitive by default.
  • Move the mkldnn::memory objects into a struct (see the sketch below).
  • Drop unnecessary mkldnn::memory objects from the multi-layer implementation.
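
For context, a minimal sketch of the struct-grouping change. Only `concat_weight_memory` is taken from the review diff further down this thread; the other member names are illustrative, not the PR's actual fields:

```cpp
#include <vector>
#include <mkldnn.hpp>

// Sketch only: group the per-layer mkldnn::memory buffers that were
// previously passed around as separate vectors into a single struct.
// Member names other than concat_weight_memory are invented.
struct MKLDNNRNNMemory {
  std::vector<mkldnn::memory> concat_weight_memory;  // fused wx/wh weights
  std::vector<mkldnn::memory> bias_memory;           // per-layer bias
  std::vector<mkldnn::memory> x_memory;              // layer inputs
  std::vector<mkldnn::memory> y_memory;              // layer outputs
};
```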

Performance

We tested the performance of FusedRNN with mode='gru' using the same dimensions as in PR #14713, i.e. seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800.

| mode | Layer | Direction | Throughput (samples/sec), MXNET_USE_MKLDNN_RNN=0 | Latency (ms), MXNET_USE_MKLDNN_RNN=0 | Throughput (samples/sec), MXNET_USE_MKLDNN_RNN=1 | Latency (ms), MXNET_USE_MKLDNN_RNN=1 | SpeedUp (Throughput) | SpeedUp (Latency) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gru | 1 | 1 | 430.03 | 20.43 | 806.27 | 4.28 | 1.87 | 4.78 |
| gru | 1 | 2 | 218.58 | 119.50 | 416.55 | 8.58 | 1.91 | 13.93 |
| gru | 5 | 1 | 89.47 | 100.07 | 177.52 | 21.20 | 1.98 | 4.72 |
| gru | 5 | 2 | 39.68 | 611.38 | 71.15 | 46.45 | 1.79 | 13.16 |

We also compared the performance of this PR against the previously integrated LSTM, vRNN-tanh, and vRNN-ReLU on the master branch. There is a noticeable regression with mode='lstm'.

| mode | Layer | Direction | Throughput (samples/sec), eec0fb4 | Latency (ms), eec0fb4 | Throughput (samples/sec), this PR (c186863) | Latency (ms), this PR (c186863) | Gap (Throughput) | Gap (Latency) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lstm | 1 | 1 | 675.24 | 4.98 | 654.61 | 5.71 | 0.97 | 0.87 |
| lstm | 1 | 2 | 343.99 | 9.86 | 333.13 | 11.65 | 0.97 | 0.85 |
| lstm | 5 | 1 | 141.30 | 24.03 | 138.59 | 28.39 | 0.98 | 0.85 |
| lstm | 5 | 2 | 55.67 | 53.16 | 54.11 | 61.29 | 0.97 | 0.87 |
| rnn_tanh | 1 | 1 | 1617.27 | 2.46 | 1541.13 | 2.60 | 0.95 | 0.94 |
| rnn_tanh | 1 | 2 | 851.16 | 4.82 | 828.10 | 5.01 | 0.97 | 0.96 |
| rnn_tanh | 5 | 1 | 390.48 | 11.66 | 376.38 | 12.27 | 0.96 | 0.95 |
| rnn_tanh | 5 | 2 | 164.11 | 25.72 | 156.64 | 26.74 | 0.95 | 0.96 |
| rnn_relu | 1 | 1 | 1582.22 | 2.65 | 1508.54 | 2.59 | 0.95 | 1.02 |
| rnn_relu | 1 | 2 | 824.18 | 5.20 | 803.40 | 5.04 | 0.97 | 1.03 |
| rnn_relu | 5 | 1 | 381.53 | 12.58 | 366.71 | 12.09 | 0.96 | 1.04 |
| rnn_relu | 5 | 2 | 153.11 | 27.57 | 153.67 | 27.06 | 1.00 | 1.02 |

Comments

  • The gate orders of GRU differ between MXNet and MKL-DNN, so there is extra overhead when preparing the mkldnn::memory objects with mode='gru'; the sketch below illustrates the kind of reordering involved.
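
To make the overhead concrete, here is a standalone, hypothetical sketch of copying contiguous per-gate weight blocks from one gate order into another. The permutation used below is only an example; the actual MXNet and MKL-DNN gate orders should be checked against their respective documentation:

```cpp
#include <array>
#include <cstring>
#include <vector>

// Sketch only: src holds 3 gate blocks of rows*cols floats back to back in
// the framework's gate order; dst receives them in the library's order.
// perm[g] names the source gate that lands in output slot g.
void ReorderGateWeights(const float* src, float* dst, size_t rows, size_t cols,
                        const std::array<int, 3>& perm) {
  const size_t gate_size = rows * cols;
  for (int g = 0; g < 3; ++g) {
    std::memcpy(dst + g * gate_size, src + perm[g] * gate_size,
                gate_size * sizeof(float));
  }
}

int main() {
  const size_t rows = 800, cols = 800;  // e.g. hidden_size = 800 as benchmarked
  std::vector<float> src(3 * rows * cols, 1.f), dst(3 * rows * cols, 0.f);
  // Example permutation only; not the verified MXNet-to-MKL-DNN mapping.
  ReorderGateWeights(src.data(), dst.data(), rows, cols, {1, 0, 2});
  return 0;
}
```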

```diff
 case rnn_enum::kGru:
   size = 2 * (D * (I + H) * 3 * H + (L - 1) * D * (D * H + H) * 3 * H +
-       L * D * 2 * N * H) + T * N * D * H + L * 2 * D * 3 * H + (L + 2) * D * 2 * N * H +
+       L * D * 2 * N * H) + T * N * D * H + L * 2 * D * 4 * H + (L + 2) * D * 2 * N * H +
```
Contributor

Nit: I know it's out of the scope of this PR, but could we rename the variables to something more self-explanatory?

Contributor Author

No problem, I will give it a try. I attempted this before, but there is a large amount of code involved. I think we can factor the common parts out.

Contributor Author

@marcoabreu Sorry for the late update. I have renamed this part. Would you mind checking it again? Thanks.

@marcoabreu (Contributor)

Could you elaborate on where "MXNET_USE_MKLDNN_RNN" comes from? Out of scope for this PR, but why did we introduce that switch instead of just going with it?

@pengzhao-intel (Contributor)

> Could you elaborate on where "MXNET_USE_MKLDNN_RNN" comes from? Out of scope for this PR, but why did we introduce that switch instead of just going with it?

@marcoabreu Some background on this env variable: the MKL-DNN RNN integration for RNN/LSTM was pulled into 1.5 by the application team, so we set an env variable to let users roll back in case of unexpected functionality or performance regressions. Now we are refactoring the code and adding support for the whole RNN API. This variable will be removed in the near future.

What's your opinion?
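
For readers following the thread, a self-contained sketch of the rollback switch being described, assuming default-on semantics; MXNet's backend typically reads such variables via dmlc::GetEnv, but std::getenv keeps the example standalone:

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only: read an on/off environment switch with a default of "on",
// mirroring how a rollback variable like MXNET_USE_MKLDNN_RNN can gate a
// fused code path.
bool UseMKLDNNRNN() {
  const char* val = std::getenv("MXNET_USE_MKLDNN_RNN");
  if (val == nullptr) return true;     // default: MKL-DNN path enabled
  return std::strcmp(val, "0") != 0;   // "0" rolls back to the old path
}

int main() {
  if (UseMKLDNNRNN()) {
    // dispatch to the MKL-DNN fused RNN primitive
  } else {
    // dispatch to the reference RNN implementation
  }
  return 0;
}
```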

@marcoabreu (Contributor)

That approach sounds excellent and very clean, great idea! Thanks for elaborating.

Near future is before 1.6, correct?

@pengzhao-intel (Contributor)

> That approach sounds excellent and very clean, great idea! Thanks for elaborating.
>
> Near future is before 1.6, correct?

Yes, before 1.6 :)

@abhinavs95 (Contributor)

@mxnet-label-bot add [mkldnn, pr-work-in-progress]

@marcoabreu added the MKLDNN and pr-work-in-progress (PR is still work in progress) labels on Jul 26, 2019
@xziya (Contributor, Author) commented on Jul 30, 2019

Performance of the latest commit: @ciyongch I have checked the performance again. The numbers are similar, as we discussed last time.

| mode | Layer | Direction | Throughput (samples/sec), a26af2b | Latency (ms), a26af2b | Throughput (samples/sec), this PR (cfc6910) | Latency (ms), this PR (cfc6910) | Gap (Throughput) | Gap (Latency) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lstm | 1 | 1 | 630.78 | 4.82 | 670.23 | 4.87 | 1.06 | 0.99 |
| lstm | 1 | 2 | 313.71 | 9.68 | 338.51 | 9.72 | 1.08 | 1.00 |
| lstm | 5 | 1 | 139.85 | 23.59 | 138.22 | 23.48 | 0.99 | 1.00 |
| lstm | 5 | 2 | 54.63 | 51.19 | 54.27 | 51.28 | 0.99 | 1.00 |
| rnn_tanh | 1 | 1 | 1573.45 | 2.44 | 1576.23 | 2.51 | 1.00 | 0.97 |
| rnn_tanh | 1 | 2 | 836.43 | 4.63 | 830.33 | 4.67 | 0.99 | 0.99 |
| rnn_tanh | 5 | 1 | 381.32 | 11.44 | 379.88 | 11.50 | 1.00 | 1.00 |
| rnn_tanh | 5 | 2 | 159.76 | 24.92 | 149.86 | 24.90 | 0.94 | 1.00 |
| rnn_relu | 1 | 1 | 1536.55 | 2.65 | 1540.29 | 2.75 | 1.00 | 0.96 |
| rnn_relu | 1 | 2 | 805.00 | 5.09 | 807.68 | 5.06 | 1.00 | 1.01 |
| rnn_relu | 5 | 1 | 373.27 | 12.41 | 377.79 | 12.32 | 1.01 | 1.01 |
| rnn_relu | 5 | 2 | 154.21 | 26.93 | 153.80 | 26.61 | 1.00 | 1.01 |

@ciyongch (Contributor)

@zixuanweeei Thanks for fixing the LSTM perf drop. It's OK to remove "WIP" from the title now.

@xziya changed the title from "[WIP] MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU)" to "MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU)" on Jul 30, 2019
@xziya (Contributor, Author) commented on Jul 31, 2019

@TaoLv Please review this PR. Thanks.

@pengzhao-intel (Contributor) left a comment

LGTM.

Minor suggestion: update the MKLDNN wiki for the RNN parts in this PR, since all basic RNN variants are supported now.

@TaoLv @ciyongch, any other comments? I plan to merge this PR soon.

```cpp
auto cpu_engine = CpuEngine::Get()->get_engine();
std::vector<mkldnn::memory::primitive_desc> srcs_pd;
std::vector<mkldnn::memory> srcs;
bool initialized = tmp_src_mems->size() > 0;
```
Member

const bool initialized?

```cpp
GetMKLDNNRNNAlgo(mode, &n_gates, &n_states);
int n_bias = mode == rnn_enum::kGru ? n_gates + 1 : n_gates;
// sizes of single gates from a single cell
const size_t weights_size_0 = direction * (input_size + hidden_size) * hidden_size;
```
Member

assign int to size_t?

Contributor Author

The input params of GetMKLDNNRNNCacheMemorySize are declared as const size_t. Intermediate results of int type may overflow; see the example below.
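
A standalone illustration of the overflow the const size_t parameters avoid. The values are made up; 32-bit unsigned operands are used so the wrap-around is well-defined rather than undefined behavior, and size_t is 64-bit on typical LP64 platforms:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
  // Illustrative values only. With 32-bit intermediates the product wraps
  // around; with size_t (64-bit on LP64) intermediates it does not.
  const std::uint32_t d32 = 2, i32 = 40000, h32 = 40000;
  const std::size_t   d64 = 2, i64 = 40000, h64 = 40000;

  const std::size_t wrapped = d32 * (i32 + h32) * h32;  // computed in 32 bits
  const std::size_t correct = d64 * (i64 + h64) * h64;  // computed in 64 bits

  std::printf("32-bit intermediates: %zu\n64-bit intermediates: %zu\n",
              wrapped, correct);
  return 0;
}
```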

```diff
-auto src_wx = (*concat_weight_memory)[2 * layer_index];
-auto src_wh = (*concat_weight_memory)[2 * layer_index + 1];
+auto src_wx = mkldnn_mems->concat_weight_memory[2 * layer_index];
+auto src_wh = mkldnn_mems->concat_weight_memory[2 * layer_index + 1];
```
Member

reference or copy?
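
For readers unfamiliar with the distinction being raised: plain auto deduces a value type and therefore copies, while auto& binds to the stored element. A generic illustration, with Handle standing in for a handle type such as mkldnn::memory:

```cpp
#include <vector>

struct Handle { int id; };  // stand-in for a handle type such as mkldnn::memory

int main() {
  std::vector<Handle> mems = {{1}, {2}};

  auto  by_value = mems[0];  // deduced as Handle: a copy is made
  auto& by_ref   = mems[0];  // deduced as Handle&: refers to the element

  by_value.id = 99;          // does not affect mems[0]
  by_ref.id   = 42;          // mems[0].id is now 42
  return 0;
}
```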

```cpp
  user_bias[single_b_sz + j] = back_bx[j + H] + back_bh[j + H];
}
#pragma omp parallel for num_threads(omp_threads)
for (int j = 2 * H; j < 3 * H; j++) {
```
Member

Is it possible to merge this for loop into the one above? I can see they have the same steps, but I am not sure whether there is any dependency.

Contributor Author

Yep, we can merge these two into one loop; a sketch follows below. Both variants have the same performance: they cost about 18 us with hidden_size=4096 on one socket of a Skylake 8180.
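
A minimal sketch of the fusion being discussed, with invented, simplified names (compile with -fopenmp). The two adjacent index ranges are combined into a single parallel region, which saves one fork/join:

```cpp
#include <vector>

int main() {
  const int H = 4096;  // hidden_size used in the timing quoted above
  std::vector<float> user_bias(3 * H, 0.f), bx(3 * H, 1.f), bh(3 * H, 1.f);

  // Instead of one parallel region over [H, 2H) and another over [2H, 3H),
  // a single region covers [H, 3H). The iterations are independent, so the
  // fusion is safe and only removes one fork/join.
  #pragma omp parallel for
  for (int j = H; j < 3 * H; ++j) {
    user_bias[j] = bx[j] + bh[j];
  }
  return 0;
}
```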

ZhennanQin and others added 11 commits August 3, 2019 07:52
* enhance api and new tutorial

* Update MKLDNN_QUANTIZATION.md

update

* fix lint

* modify pics

* skip test

* add quantize layer in graph

* update

* remove center css flag

* change requantize color

* fix markdown pics

* change to use png

* Update MKLDNN_QUANTIZATION.md

update

* enable ipython script

* fix png

* fix lint

* Update MKLDNN_QUANTIZATION.md

* change title

* trigger

* use lower case

* some typo

* some typo

* use dmlc web data

* trigger

* trigger
* make TransposeShape infer shape form both sides

* small fixes

* remove redundant lines

* unit tests
* Added tutorial for FIT API

* Added tests for Fit API tutorial

* Updated index.md for the new tutorial to show up

* Addressed PR feedback

* Addressed PR feedback

* Removed spurious comment for Py2 and Py3 compatibility

* Address PR feedback

* Addressed PR feedback

* Fixed typo

* Added example to showcase custom event handler

* Fixed imports as estimator moved to contrib package

* Added a side note to inform about estimator reference being updated by the handlers

* Corrected typo

* update tutorial

* address comments

* new line

* fix import

* fix cached graph

* fix import

* address comments

* fix doc gen

* add softmax

* add to website index

* fix doc string

* Fix doc gen (#12)

* fix warining

* fix test

* fix

* fix

* fix print

* fix test (#13)

* fix warning (#14)

* fix href (#15)
* Update test_profiler.py

* retrigger tests
* add magic method abs to ndarray

* add relevant tests

* add magic method abs to symbol

* add relevant tests

* retrigger CI

* retrigger CI
* prevent TRT_Logger to be destroyed before TRT engine

* use unique_ptr for trt_logger/parser/engine/executor ownership

* reduce line length for lint
@xziya (Contributor, Author) commented on Aug 3, 2019

Sorry for the inconvenience. This PR has been moved to #15741.
