Add force_deterministic option for sparse embedding#9882
piiswrong merged 18 commits into apache:master
Conversation
Considering the big performance impact, would it make sense to print a prominent warning message making the user aware of the speed reduction?
```
const DType* ograd,
const nnvm::dim_t row_length,
const nnvm::dim_t num_threads_per_row,
const int SZ) {
```
I think SZ should be used as a template argument, combined with this kind of loop: https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L662-L668
```
const dim_t ograd_offset = idx * row_length;
const dim_t out_offset = row_id * row_length;
for (int i = feature_start; i < feature_end; i++) {
  out[out_offset + i] += ograd[ograd_offset + i];
```
Would it be faster to use local storage to hold the values of `out[out_offset + i]` and write them back after finishing the loop?
```
if (tid == 0 || sorted_data[tid - 1] != sorted_data[tid]) {
  out_local[...] = out[...]
  do {
    UPDATE_LOCAL(out_local, ograd)
  } while (...)
  out[...] = out_local[...]
}
```

```
using nnvm::dim_t;
if (req == kNullOp) return;
CHECK_EQ(req, kWriteTo) << "SparseEmbedding layer doesn't support "
                        << "weight gradient calculation with req != write";
```
For the Embedding layer, enabling kAddTo in the backward pass is essential to the training speed of RNNs (because we use the same embedding for all the timesteps). I think we need to support kAddTo in the sparse embedding layer (maybe in another PR).
Thanks for bringing this up. For embedding:
- Usually the inputs are concatenated before passing to Embedding so it is only calculated once and no "addto" req is required.
- "addto" req is usually not supported for sparse grad because it requires re-allocation of memory which is expensive
Maybe we can revisit supporting "addto" req later.
```
int input_dim;
int output_dim;
int dtype;
bool force_deterministic;
```
force_deterministic -> deterministic
```
.add_enum("int32", mshadow::kInt32)
.describe("Data type of weight.");
DMLC_DECLARE_FIELD(force_deterministic).set_default(false)
.describe("Force the gradient computation to be executed according to a deterministic order.");
```
Explain that this is slower?
```
MSHADOW_TYPE_SWITCH(ograd.type_flag_, DType, {
  MSHADOW_IDX_TYPE_SWITCH(output.aux_type(kIdx), RType, {
    // temp resource declarations
    dim_t* lookup_table = NULL;
```
Can this huge chunk of code be pulled out into a template function so that it's steppable in the debugger?
Sure. I'll update it.
```
  }
}

inline void SparseEmbeddingOpBackwardDeterministicRspImpl(const OpContext& ctx,
```
Are the deterministic/nondeterministic versions divergent for more than 50% of their code or can they be combined somewhat? Looks kind of hard to maintain.
Unfortunately they use totally different kernels:
non-deterministic:
- mark row idx
- prefix sum
- add_grad_atomic_add
deterministic:
- copy
- range
- sort
- unique
- add_grad_deterministic
```
Kernel<mark_lookup_table, gpu>::Launch(s, nnr, lookup_table, grad_row_idx);

// accumulate gradients
DType* grad_data = output.data().dptr<DType>();
```
Yes. I should not have removed it. Will update
```
    tid++;
  } while (tid < data_size && sorted_data[tid - 1] == sorted_data[tid]);
  for (int i = 0; i < num_features; i++) {
    out[out_offset + i] = acc[i];
```
Which one should be correct, out[out_offset + i] = acc[i]; or out[out_offset + i] += acc[i];?
It should be `+=` instead.
@marcoabreu added warning message.
Force-pushed from 2c97d35 to 86f3833.
* refactor embed backward kernelcallker
* pass unit test
* refactor
* fix dim bug
* add unique impl
* remove old op
* remove unused kernel
* Revert "remove unused kernel" (This reverts commit 948c5a3.)
* Revert "remove old op" (This reverts commit 5d1cd64.)
* fix kernellaucnher
* add force_determ option
* add doc
* fix lint
* update test
* CR comments
* lint
* set grad to be 0s initially
* add warning
Description
(reopen of #9846)
Add a `force_deterministic` option for `contrib.SparseEmbedding`. The option guarantees a deterministic gradient during the backward pass. The backward performance of `force_deterministic=True` is 50% slower on a p2 instance / 80% slower on a p3 instance compared to `force_deterministic=False`. `indexing_op-inl.cuh` is simply a refactoring of the original code.

Checklist
Essentials
- Passed code style checking (`make lint`)

Changes
Comments