[WIP][MXNET-107] Fused LSTM implementation for CPU #10104
piiswrong merged 39 commits into apache:master from
Conversation
|
@szha @Jerryzcn @eric-haibin-lin @pengzhao-intel Could you help to review this PR? Need cooperation to refactor cudnn registration. |
| size_t size = 0; | ||
| switch (mode) { | ||
| case rnn_enum::kRnnRelu: | ||
| break; |
Need error message for unimplemented modes.
| size = (seq_length + 1) * batch_size * hidden_size * 4 + batch_size * hidden_size; | ||
| break; | ||
| case rnn_enum::kGru: | ||
| break; |
Add a default statement for code robustness.
| w_ptr, y_ptr, hy_ptr, cy_ptr); | ||
| break; | ||
| case rnn_enum::kGru: | ||
| break; |
Also need error message for unimplemented modes and default statement for switch-case.
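A minimal sketch of the requested hardening, assuming hypothetical mode tags that mirror the PR's rnn_enum. The LSTM branch reuses the workspace-size formula from the diff above; unimplemented modes and unknown values fail loudly instead of silently returning size 0.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical mode tags mirroring rnn_enum in the PR.
enum RnnMode { kRnnRelu, kRnnTanh, kLstm, kGru };

// Workspace-size switch following the review requests: unimplemented
// modes raise an explicit error, and a default case catches unknown values.
std::size_t GetLstmStyleWorkspaceSize(RnnMode mode, std::size_t seq_length,
                                      std::size_t batch_size,
                                      std::size_t hidden_size) {
  switch (mode) {
    case kLstm:
      return (seq_length + 1) * batch_size * hidden_size * 4 +
             batch_size * hidden_size;
    case kRnnRelu:
    case kRnnTanh:
    case kGru:
      throw std::runtime_error("only LSTM mode is implemented for now");
    default:
      throw std::runtime_error("unknown RNN mode");
  }
}
```

In the actual operator the error path would use the project's logging macros rather than exceptions; the switch structure is the point.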
| /*! | ||
| * Copyright (c) 2015 by Contributors | ||
| * \file rnn.cc | ||
| * \brief | ||
| * \author Sebastian Bodenstein | ||
| * \author Sebastian Bodenstein, Shu Zhang(shu.zhang@intel.com) |
| } | ||
| } | ||
| static inline int NumVisibleOutputs(const NodeAttrs& attrs) { | ||
| const RNNParam& params = nnvm::get<RNNParam>(attrs.parsed); |
| MXNET_REGISTER_OP_PROPERTY(RNN, RNNProp) | ||
| .describe("Applies a recurrent layer to input.") | ||
| inline static bool RNNStorageType(const nnvm::NodeAttrs& attrs, | ||
| const int dev_mask, |
fix indent. Align with the first parameter.
| inline static bool BackwardRNNStorageType(const nnvm::NodeAttrs& attrs, | ||
| const int dev_mask, | ||
| DispatchMode* dispatch_mode, |
fix indent. Align with the first parameter.
Seems
| wh = mx.random.uniform(-1, 1, (4 * H, H), ctx=xpu,dtype=type1) | ||
| bx = mx.nd.zeros((4 * H,), ctx=xpu, dtype=type1) | ||
| bh = mx.nd.zeros((4 * H,), ctx=xpu, dtype=type1) | ||
| x1.attach_grad() |
why do you need to manually attach grad??
attach_grad is used to create gradient buffers for these NDArrays here. Do you mean this can be implemented in other ways, or do you have any suggestion about this piece of code?
In case we use a stateful OP, what's your opinion @eric-haibin-lin ?
| for g, c in zip(gs, cs): | ||
| assert_almost_equal(g.asnumpy(), c.asnumpy(), rtol=1e-2, atol=1e-6) | ||
| @unittest.skip("test fails intermittently. temporarily disabled till it gets fixed.") |
If USE_CUDNN=1, I think this test will run into the cudnn implementation, which has been disabled temporarily. We will reopen this test case after we add cudnn back. In fact, building on GPU currently fails. We are working on the failure.
| }; | ||
| NNVM_REGISTER_OP(RNN) | ||
| .describe(R"code(Applies a recurrent layer to input |
Please provide more detailed descriptions
| DType* reserve_space_ptr = out_data[out_expected - 1].dptr<DType>(); | ||
| // allocate temp space | ||
| size_t workspace_size = GetRNNWorkspaceSize(param_.seq_length_, param_.batch_size_, |
| Tensor<cpu, 1, DType> workspace = ctx.requested[rnn_enum::kTempSpace] | ||
| .get_space_typed<cpu, 1, DType>(Shape1(workspace_size), s); | ||
| int direction = param_.bidirectional ? 2 : 1; |
|
I used to replace collapse |
|
@Jerryzcn Good suggestion. I will take a try. |
|
BTW, is there any existing jira issue for RNN implementation? Do I need to create a jira issue for this PR? @eric-haibin-lin @Jerryzcn @szha |
|
pls create one |
|
should we create a separate branch for cpu rnn. Once all the changes are checked in, we merge the rnn branch with the master. This way the master won't break people's code. |
|
@Jerryzcn good idea. @eric-haibin-lin please open a branch
|
Happen to be around on github. I created the branch cpu_fused_rnn and updated PR base. |
|
Thanks @szha, will keep working on this. |
|
@sherry-zhang Good Job! |
Force-pushed 1ee9ee3 to dde8e23
|
@marcoabreu I am working on branch cpu_fused_rnn, but CI fails in the sanity check. I suspect the CI environment has been adjusted for the master branch, so this branch cannot work properly. Could you help take a look? Thanks. The pylint check passes on my local server but fails in the sanity check:
|
Looks like this is not going to make it into 1.2 |
|
@piiswrong the merged RNN feature supports inference-only LSTM that is compatible with cudnn implementation. Gluon LSTM layer now supports inference-only forwarding with this feature, and the rest of the use cases are still on old code paths, thanks to @Jerryzcn. The merged PR does what it sets out to do better than what previously exists, so it's more than half baked. |
|
cc @zhiheng-huang as his team will likely be impacted by the decision of reverting Jerry's PR. |
|
For now, please fork the master branch in your own repository and let collaborators make PRs towards your repository. At the same time, create a PR from your fork towards the master branch to have constant feedback every time a commit to your fork is being made. |
|
@szha @piiswrong May I have your opinions? I don't have permission to create/delete branches and redirect this PR to the master branch. I can rebase the code onto the master branch if needed.
|
I have changed the base branch as requested. We currently have an internal discussion about whether we support feature-branches in the official repository, until then, it would be better to work towards the master to ensure your PR is always receiving the latest updates. I have also retriggered CI. |
|
Thanks, @marcoabreu! Really understand your concern. Will keep working on this PR. |
|
Thanks a lot! Please excuse the inconvenience - in case of further problems, feel free to ping me again. |
|
Rebased code onto the master branch and retriggered CI.
|
I feel it difficult to change the existing gluon LSTM layer from normal |
|
@TaoLv I'll look into this later. I think you can do it similar to RNNCell. |
| out = exe.forward(is_train=False) | ||
| out[0].wait_to_read() | ||
| assert False # should not reach here | ||
| except mx.base.MXNetError as err: |
Excellent approach! This will ensure we don't forget to re-enable the test when we introduce dropout. Great job
Yes. Also to ensure the failure happens at the proper position and the correct error message is presented. Following @reminisce 's idea in PR 10844.
|
@piiswrong I added a test for dropout. I think this lstm operator is good to merge. Dropout support and hybrid rnn layer are WIP and will be submitted in another PR. I will also rebase #10311 accordingly. |
| @@ -0,0 +1,454 @@ | |||
| /* | |||
we don't use hpp. Please rename to .h
| for (int i = 0; i < T; ++i) { | ||
| int t = bid ? T - 1 - i : i; | ||
| linalg_gemm(i ? h : hx, wh, yh_flat, alpha, beta, false, true); | ||
| #pragma omp parallel for |
use https://github.com/apache/incubator-mxnet/blob/master/src/operator/mxnet_op.h#L435 to get recommended number of threads for openmp
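A self-contained sketch of the suggestion. MXNet's mxnet_op.h exposes a helper for the recommended OpenMP thread count; here omp_get_max_threads() stands in so the snippet does not depend on MXNet headers, and the loop body is an illustrative element-wise gate activation.

```cpp
#include <cmath>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Bound the thread count for an element-wise loop instead of using a
// bare `#pragma omp parallel for`. omp_get_max_threads() is only a
// stand-in for MXNet's recommended-thread-count helper.
void SigmoidInPlace(std::vector<float>* data) {
  const int n = static_cast<int>(data->size());
#ifdef _OPENMP
  const int nthreads = omp_get_max_threads();  // stand-in for MXNet's helper
  #pragma omp parallel for num_threads(nthreads)
#endif
  for (int i = 0; i < n; ++i) {
    (*data)[i] = 1.0f / (1.0f + std::exp(-(*data)[i]));
  }
}
```

Without OpenMP enabled the pragma is ignored and the loop runs serially, so the snippet compiles either way.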
| } | ||
| } | ||
| } | ||
| memcpy(y_ptr, rs + y_offset, T * N * H * D * sizeof(DType)); |
One copy is for forward output and the other copy is for the reuse in backward computation.
| for (int i = 0; i < T; ++i) { | ||
| int t = bid ? T - 1 - i : i; | ||
| linalg_gemm(i ? h : hx, wh, yh_flat, alpha, beta, false, true); | ||
| #pragma omp parallel for |
| const Tensor<cpu, 2, DType>& dcnext = i ? dc : dcx; | ||
| const Tensor<cpu, 2, DType>& hnext = i ? htmp : hx; | ||
| const Tensor<cpu, 2, DType>& cnext = i ? c[i - 1] : cx; | ||
| #pragma omp parallel for |
| const int row = T * N; | ||
| const int col = H * 4; | ||
| for (int i = 0; i < row; ++i) { | ||
| #pragma omp parallel for |
omp usage may not be efficient here. The operations in this loop are very simple, while col is usually less than a few thousand.
You are right. I will remove this omp temporarily and look for better optimization for this piece of code.
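A sketch of one common direction for this kind of loop: move the pragma from the short inner loop (col = H * 4) to the long outer loop (row = T * N), so each thread gets a meaningful chunk of work. The bias-add body is a toy stand-in for the PR's loop, not its actual code.

```cpp
#include <vector>

// Parallelize over rows (the long dimension) rather than columns.
// y is row-major with `row` rows of `col` elements; bias has `col` entries.
void AddBiasRows(std::vector<float>* y, const std::vector<float>& bias,
                 int row, int col) {
#ifdef _OPENMP
  #pragma omp parallel for
#endif
  for (int i = 0; i < row; ++i) {
    for (int j = 0; j < col; ++j) {
      (*y)[i * col + j] += bias[j];
    }
  }
}
```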
| const DType beta1 = 1.0; | ||
| const int cell_size = N * H; | ||
| if (dhy_ptr != NULL) { | ||
| memcpy(dh.dptr_, dhy_ptr, cell_size * sizeof(DType)); |
| data = mx.sym.Variable('data') | ||
| Y1, _ = cell1.unroll(T, data, layout='NTC', merge_outputs=True) | ||
| mod1 = mx.mod.Module(Y1, label_names=None, context=mx.cpu()) |
use default_context() here and remove the corresponding tests in test_operator_gpu. These tests will automatically be run again in test_operator_gpu with default_context() = gpu()
Fixed. Also I changed the name of test_lstm to test_lstm_sym since it would conflict with the one in /unittest/test_gluon_rnn.py after being imported into test_operator_gpu.py.
|
@eric-haibin-lin @piiswrong @szha @Jerryzcn the comments are addressed. Please help take another look.
| DType ft = ifgo[i][j][k][1]; | ||
| DType gt = ifgo[i][j][k][2]; | ||
| DType ot = ifgo[i][j][k][3]; | ||
| dh[j][k] += dy[t][j][k + offset]; |
dh and dc are never read before they are overwritten. Why do you need the copy at line 341?
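For context, the i/f/g/o gate order that the ifgo buffer above is indexed with corresponds to the standard LSTM cell step. A hypothetical single-step sketch (the names are illustrative, not the PR's kernel); `pre` holds the four pre-activation gate values (x*Wx + h*Wh + b):

```cpp
#include <cmath>

inline float Sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// One LSTM cell step in i/f/g/o order; updates cell state c and hidden h.
void LstmCellStep(const float pre[4], float* c, float* h) {
  const float it = Sigmoid(pre[0]);      // input gate
  const float ft = Sigmoid(pre[1]);      // forget gate
  const float gt = std::tanh(pre[2]);    // candidate state
  const float ot = Sigmoid(pre[3]);      // output gate
  *c = ft * (*c) + it * gt;              // new cell state
  *h = ot * std::tanh(*c);               // new hidden state
}
```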
| } | ||
| } | ||
| } | ||
| memcpy(y_ptr, rs + y_offset, T * N * H * D * sizeof(DType)); |
why not write to y_ptr directly for the last layer?
| } | ||
| Tensor<cpu, 2, DType> dyh(difgo[t].dptr_, Shape2(N, H * 4)); | ||
| linalg_gemm(dyh, wh, dhnext, alpha, beta0, false, false); | ||
| linalg_gemm(dyh, hnext, dwh, alpha, beta1, true, false); |
dwh is overwritten. why do you need to set it to 0 with memset at 328?
|
I can merge this first. But I think the memset and memcpy statements are superfluous. We should get rid of them later.
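One way to drop the memset before an accumulated weight gradient, sketched with a toy accumulator standing in for linalg_gemm: pass beta = 0 on the first time step so stale memory is overwritten, and beta = 1 afterwards to accumulate.

```cpp
#include <cstddef>
#include <vector>

// acc = beta * acc + x, element-wise (toy stand-in for a GEMM with beta).
void AccumulateScaled(std::vector<float>* acc, const std::vector<float>& x,
                      float beta) {
  for (std::size_t i = 0; i < acc->size(); ++i) {
    (*acc)[i] = beta * (*acc)[i] + x[i];
  }
}

// Sum per-step contributions without zeroing acc first: beta = 0 on the
// first step overwrites whatever garbage the buffer held.
void SumOverSteps(const std::vector<std::vector<float>>& steps,
                  std::vector<float>* acc) {
  for (std::size_t t = 0; t < steps.size(); ++t) {
    AccumulateScaled(acc, steps[t], t == 0 ? 0.0f : 1.0f);
  }
}
```

Whether this pays off in the real kernel depends on how the GEMM's beta parameter is threaded through the loop.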
* register RNN fused-API with nnvm, finish single-layer && undirection LSTM forward function
* fix coding style and lint complains
* add single-layer && undirectional LSTM backward function
* make interface universal for other RNN mode
* share intermediate result between forward and backward in a trick way
* add comments for important parameters
* modify testcase
* Fix coding style and error message
* fix openmp collapse error
* fix const
* remove rnn.cu and skip related testcases temporarily for building on GPU
* support multi-layer and bidirectional for lstm inference
* remove some testcases in test_gluon_rnn.py to build on GPU
* remove testcase between fp32 and fp64 temporarily
* retrigger ci
* fix some logs
* use a better way to share memory
* fix cudnn registration
* fix invariant calculations and enable some gpu testcases
* add thread local cache for cudnn rnn op
* add thread local cache for rnn op
* fix bugs
* remove some testcases to check segmentfault
* remove cudnn registration to check segmentfault
* support multi-layer for LSTM Training
* modify lstm testcase
* add bidirectional support for lstm
* fix gluon and coding style
* fix bugs
* remove nnvm registration
* enable gpu testcases
* add detailed descriptions
* add dropout check
* fix workspace size
* dropout is not supported, add unit test for it
* fix review comments
Description
In this PR, a fused LSTM operator for CPU is implemented. More support for other RNN variants is WIP and will be submitted in other PRs.
Feature changes
New features
- Fused LSTM implementation for CPU in the sym.RNN operator
- Refactor the sym.RNN operator code and register it with NNVM interfaces

Unit-test changes
- Add test_lstm and test_lstm_bidirectional in tests/python/unittests/test_operator.py
We have tested the performance of sym.RNN and rnn.LSTMCell on our local Skylake-8180 with 2 sockets and 56 cores, using MKL as the BLAS library in this performance test. The test input size is taken from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800).
Single-layer and multi-layer (num_layers = 5) measurements compare #10104 against rnn.LSTMCell, rnn.LSTMCell (cuda), and sym.RNN (cudnn). (Benchmark tables omitted.)
Opens
Fix cudnn registration in this PR
Essentials
Passed code style checking (make lint)