MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU) #15621
Conversation
case rnn_enum::kGru:
  size = 2 * (D * (I + H) * 3 * H + (L - 1) * D * (D * H + H) * 3 * H +
-        L * D * 2 * N * H) + T * N * D * H + L * 2 * D * 3 * H + (L + 2) * D * 2 * N * H +
+        L * D * 2 * N * H) + T * N * D * H + L * 2 * D * 4 * H + (L + 2) * D * 2 * N * H +
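Aside, for readers outside the review thread: the single changed factor (3 → 4) in the bias term lines up with the extra bias gate that the linear-before-reset GRU carries compared with a vanilla GRU (see the n_bias = n_gates + 1 line quoted further down in this conversation). A self-contained sketch of that sizing with hypothetical toy dimensions; the names merely mirror the diff:

```c++
#include <cstddef>
#include <iostream>

int main() {
  // Hypothetical dimensions: L layers, D directions, H hidden units.
  const std::size_t L = 1, D = 1, H = 800;
  const std::size_t n_gates = 3;           // vanilla GRU gates: update, reset, new
  const std::size_t n_bias = n_gates + 1;  // LBR-GRU keeps one extra bias for the
                                           // recurrent part of the "new" gate, hence 4
  // Mirrors the "L * 2 * D * 4 * H" term in the diff; the factor 2 is carried
  // over from the original expression unchanged.
  const std::size_t bias_elems = L * 2 * D * n_bias * H;
  std::cout << bias_elems << std::endl;    // 6400 for these toy dimensions
  return 0;
}
```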
Nit: I know it's out of the scope of this PR, but could we rename the variables to something more self-explanatory?
No problem, I will give it a try. I attempted this before, but there is a large amount of code involved. I think we can factor the common parts out of it.
@marcoabreu Sorry for the late update. I have renamed this part. Would you mind checking it again? Thanks.
Could you elaborate on where "MXNET_USE_MKLDNN_RNN" comes from? Out of scope for this PR, but why did we introduce that switch instead of just going with it?
@marcoabreu Some background on this env variable: the MKL-DNN RNN/LSTM integration was pulled into 1.5 by the application team, so we added an environment variable that lets users roll back in case of unexpected functional or performance regressions. Now we are refactoring the code and adding full support for the RNN API, and this variable will be removed in the near future. What's your opinion?
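For context, such a rollback switch is typically just an environment lookup consulted at dispatch time. A minimal sketch of the pattern, assuming dmlc::GetEnv and a default of enabled; the exact call site and default used by MXNet are assumptions here, not taken from this PR:

```c++
#include <dmlc/parameter.h>  // dmlc::GetEnv

// Hypothetical helper: returns true unless the user exports
// MXNET_USE_MKLDNN_RNN=0 to roll back to the original RNN implementation.
inline bool UseMKLDNNRNN() {
  static const bool use_mkldnn_rnn =
      dmlc::GetEnv("MXNET_USE_MKLDNN_RNN", 1) != 0;
  return use_mkldnn_rnn;
}
```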
That approach sounds excellent and very clean, great idea! Thanks for elaborating. Near future is before 1.6, correct?
Yes, before 1.6 :)
@mxnet-label-bot add [mkldnn, pr-work-in-progress]
Performance of the latest commit: @ciyongch I have checked the performance again. The numbers are similar, the same as we discussed last time.
@zixuanweeei Thanks for fixing the perf drop of LSTM. It's OK to remove "WIP" from the title now.
@TaoLv Please review this PR. Thanks.
pengzhao-intel left a comment
auto cpu_engine = CpuEngine::Get()->get_engine();
std::vector<mkldnn::memory::primitive_desc> srcs_pd;
std::vector<mkldnn::memory> srcs;
bool initialized = tmp_src_mems->size() > 0;
const bool initialized?
GetMKLDNNRNNAlgo(mode, &n_gates, &n_states);
int n_bias = mode == rnn_enum::kGru ? n_gates + 1 : n_gates;
// sizes of single gates from a single cell
const size_t weights_size_0 = direction * (input_size + hidden_size) * hidden_size;
assign int to size_t?
The input params of GetMKLDNNRNNCacheMemorySize are declared as const size_t. Intermediate results of int type may overflow.
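A small illustration of the concern with hypothetical dimensions: if the factors were plain int, the product would wrap around before being stored, whereas keeping the parameters size_t keeps the whole expression in 64-bit arithmetic on common LP64/LLP64 platforms.

```c++
#include <cstddef>
#include <iostream>

int main() {
  // Hypothetical sizes chosen so the product exceeds INT32_MAX (~2.1e9).
  const std::size_t direction = 2, input_size = 40000, hidden_size = 40000;
  // Every operand is size_t, so no 32-bit intermediate is formed.
  const std::size_t weights_size =
      direction * (input_size + hidden_size) * hidden_size;
  std::cout << weights_size << std::endl;  // 6400000000, not representable in int
  return 0;
}
```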
- auto src_wx = (*concat_weight_memory)[2 * layer_index];
- auto src_wh = (*concat_weight_memory)[2 * layer_index + 1];
+ auto src_wx = mkldnn_mems->concat_weight_memory[2 * layer_index];
+ auto src_wh = mkldnn_mems->concat_weight_memory[2 * layer_index + 1];
reference or copy?
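For readers following along, the question is whether auto here copies the mkldnn::memory element or binds a reference to it; plain auto always copies. A minimal, generic illustration using ordinary C++ types rather than the MKL-DNN classes:

```c++
#include <cassert>
#include <vector>

struct Handle { int id; };

int main() {
  std::vector<Handle> mems = {{0}, {1}};
  auto by_value = mems[0];  // copies the element
  auto& by_ref = mems[1];   // binds a reference into the vector
  by_value.id = 42;         // leaves mems[0] untouched
  by_ref.id = 42;           // modifies mems[1] in place
  assert(mems[0].id == 0 && mems[1].id == 42);
  return 0;
}
```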
  user_bias[single_b_sz + j] = back_bx[j + H] + back_bh[j + H];
}
#pragma omp parallel for num_threads(omp_threads)
for (int j = 2 * H; j < 3 * H; j++) {
Is it possible to merge this for loop into the one above? I can see they have the same steps, but I'm not sure whether there is any dependency.
Yep, we can merge these two into one loop. Both variants have the same performance: about ~18 us with hidden_size=4096 on one socket of a Skylake 8180.
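For illustration only, a generic sketch of such a merge: two adjacent parallel loops over consecutive index ranges with the same body can be fused into one, provided neither range depends on results produced by the other. The names, ranges, and indexing below are placeholders rather than the PR's actual buffers:

```c++
#include <vector>

// Hypothetical fused version: one parallel loop over [0, 3*H) replaces the
// two back-to-back loops that shared the same statement.
void FuseBiasLoops(std::vector<float>* user_bias,
                   const std::vector<float>& back_bx,
                   const std::vector<float>& back_bh,
                   const int H, const int offset) {
  #pragma omp parallel for
  for (int j = 0; j < 3 * H; ++j) {
    (*user_bias)[offset + j] = back_bx[j] + back_bh[j];
  }
}
```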
Sorry for the inconvenience. This PR has been moved to #15741.
Description
We integrated the MKL-DNN Linear-Before-Reset (LBR) GRU into MXNet. Currently, it supports FP32 inference. @pengzhao-intel @ciyongch @TaoLv
Checklist
Essentials
Changes
- Pack the mkldnn::memorys into a struct.
- Separate the mkldnn::memorys from the multi-layer implementation.
Performance
We tested the performance of FusedRNN with mode='gru' using the same dimensions as in PR #14713, i.e. seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800. We also compared the performance of this PR with that of the previously integrated LSTM, vRNN-tanh, and vRNN-ReLU on the master branch. It seems that there is a distinct regression with mode='lstm'.
Comments
mkldnn::memorys with mode='gru'.