Add force_deterministic option for sparse embedding#9882
piiswrong merged 18 commits into apache:master
Conversation
Considering the big performance impact, would it make sense to print a prominent warning message making the user aware of the speed reduction?
```
const DType* ograd,
const nnvm::dim_t row_length,
const nnvm::dim_t num_threads_per_row,
const int SZ) {
```
I think SZ should be used as a template argument, combined with this kind of loop: https://github.com/dmlc/mshadow/blob/master/mshadow/cuda/tensor_gpu-inl.cuh#L662-L668
```
const dim_t ograd_offset = idx * row_length;
const dim_t out_offset = row_id * row_length;
for (int i = feature_start; i < feature_end; i++) {
  out[out_offset + i] += ograd[ograd_offset + i];
```
Would it be faster to use local storage to hold the values of `out[out_offset + i]` and write them back after finishing the loop?
```
if (tid == 0 || sorted_data[tid - 1] != sorted_data[tid]) {
  out_local[...] = out[...]
  do {
    UPDATE_LOCAL(out_local, ograd)
  } while (...)
  out[...] = out_local[...]
}
```

```
using nnvm::dim_t;
if (req == kNullOp) return;
CHECK_EQ(req, kWriteTo) << "SparseEmbedding layer doesn't support "
                        << "weight gradient calculation with req != write";
```
For the Embedding layer, enabling kAddTo in the backward pass is essential to the training speed of RNNs (because we use the same embedding for all the timesteps). I think we need to support kAddTo in the sparse embedding layer (maybe in another PR).
Thanks for bringing this up. For embedding:
- Usually the inputs are concatenated before passing to Embedding so it is only calculated once and no "addto" req is required.
- "addto" req is usually not supported for sparse grad because it requires re-allocation of memory which is expensive
Maybe we can revisit supporting "addto" req later.
```
int input_dim;
int output_dim;
int dtype;
bool force_deterministic;
```
force_deterministic -> deterministic
```
.add_enum("int32", mshadow::kInt32)
.describe("Data type of weight.");
DMLC_DECLARE_FIELD(force_deterministic).set_default(false)
.describe("Force the gradient computation to be executed according to a deterministic order.");
```
Explain that this is slower?
```
MSHADOW_TYPE_SWITCH(ograd.type_flag_, DType, {
  MSHADOW_IDX_TYPE_SWITCH(output.aux_type(kIdx), RType, {
    // temp resource declarations
    dim_t* lookup_table = NULL;
```
Can this huge chunk of code be pulled out into a template function so that it's steppable in the debugger?
Sure. I'll update it.
```
  }
}

inline void SparseEmbeddingOpBackwardDeterministicRspImpl(const OpContext& ctx,
```
Are the deterministic/nondeterministic versions divergent for more than 50% of their code or can they be combined somewhat? Looks kind of hard to maintain.
Unfortunately they use totally different kernels:
non-deterministic:
- mark row idx
- prefix sum
- add_grad_atomic_add
deterministic:
- copy
- range
- sort
- unique
- add_grad_deterministic
```
Kernel<mark_lookup_table, gpu>::Launch(s, nnr, lookup_table, grad_row_idx);

// accumulate gradients
DType* grad_data = output.data().dptr<DType>();
```
Yes. I should not have removed it. Will update
```
    tid++;
  } while (tid < data_size && sorted_data[tid - 1] == sorted_data[tid]);
  for (int i = 0; i < num_features; i++) {
    out[out_offset + i] = acc[i];
```
Which one should be correct, out[out_offset + i] = acc[i]; or out[out_offset + i] += acc[i];?
It should be `+=` instead.
@marcoabreu added warning message.
Force-pushed from 2c97d35 to 86f3833.
* refactor embed backward kernelcallker
* pass unit test
* refactor
* fix dim bug
* add unique impl
* remove old op
* remove unused kernel
* Revert "remove unused kernel" (This reverts commit 948c5a3.)
* Revert "remove old op" (This reverts commit 5d1cd64.)
* fix kernellaucnher
* add force_determ option
* add doc
* fix lint
* update test
* CR comments
* lint
* set grad to be 0s initially
* add warning
Description
(reopen of #9846)
Add a `force_deterministic` option for `contrib.SparseEmbedding`. The option guarantees a deterministic gradient during the backward pass. The backward performance of `force_deterministic=True` is 50% slower on a p2 instance / 80% slower on a p3 instance compared to `force_deterministic=False`. `indexing_op-inl.cuh` is simply a refactoring of the original code.

Checklist
Essentials
- Passed code style checking (`make lint`)

Changes
Comments