This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-331] NVLink communication pattern updated #8915

Closed
Laurawly wants to merge 8 commits into apache:master from Laurawly:master

Conversation


@Laurawly Laurawly commented Dec 1, 2017

Description

Optimized the kvstore communication pattern to make full use of NVLink.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated. For new C++ functions in header files, their functionalities and arguments are well-documented.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Device reduce
  • Device broadcast

Comments

  • The changes make kvstore use more NVLink/PCIe links and avoid QPI when both are present; the handled configurations include 4 and 8 GPUs.
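The grouped communication idea described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: `SplitIntoGroups` and the fixed threshold of 4 are hypothetical names mirroring the `dev_id < 4` check in src/kvstore/comm.h, shown only to make the two-group reduce concrete.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch: split GPU device ids into two NVLink-connected
// groups, as the PR does for 4- and 8-GPU machines. Devices 0..3 land
// in group 1 and devices 4..7 in group 2, mirroring the `dev_id < 4`
// check in the PR.
std::pair<std::vector<int>, std::vector<int>> SplitIntoGroups(int num_gpus) {
  std::vector<int> g1, g2;
  for (int d = 0; d < num_gpus; ++d) {
    (d < 4 ? g1 : g2).push_back(d);
  }
  return {g1, g2};
}
```

Reduce then proceeds in two stages: each group first reduces internally over NVLink/PCIe, and only the two partial results cross between groups, so the bulk of the traffic never touches QPI.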

Contributor

@weixingzhang weixingzhang left a comment

nit: indentation

Comment thread src/kvstore/comm.h Outdated
CopyFromTo(src[0], &buf.merged, priority);
return buf.merged;
CopyFromTo(src[0], &buf.merged, priority);
return buf.merged;
Contributor

The indentation needs to be corrected.

Comment thread src/kvstore/comm.h Outdated
ElementwiseSum(reduce, &buf.merged);
CopyFromTo(stage.merged, &(buf.copy_buf[buf.copy_buf.size()-1]), priority);
reduce[reduce.size()-1] = buf.copy_buf[buf.copy_buf.size()-1];
ElementwiseSum(reduce, &buf.merged);
Member

Missing priority? Should this line be ElementwiseSum(reduce, &buf.merged, priority);?

Comment thread src/kvstore/comm.h Outdated
for (size_t i = 0; i < dst.size(); ++i) {
if (i != static_cast<size_t>(dev_id)) {
CopyFromTo(*dst[dev_id], dst[i], priority);
CopyFromTo(*dst[dev_id], (dst[i]), priority);
Member

nit: why is there a bracket around (dst[i])?

Comment thread src/kvstore/comm.h Outdated
// copy to a random device first
int dev_id = key % dst.size();
CopyFromTo(src, dst[dev_id], priority);
CopyFromTo(src, (dst[dev_id]), priority);
Member

nit: why is there a bracket around dst[dev_id]?

Comment thread src/kvstore/comm.h Outdated

std::vector<Context> g1, g2;
for (auto& d : devs) {
if (d.dev_id < 4) g1.push_back(d);
Contributor

Can we decide this by querying CUDA instead of using magic numbers?

@piiswrong
Contributor

Looks like this is not turned off when not using nvlink?

Comment thread src/kvstore/comm.h Outdated
buf.copy_buf[i] = NDArray(
buf.merged.shape(), buf.merged.ctx(), false, buf.merged.dtype());
}
if (buf.merged.is_none()&& stage.copy_buf.empty()) {
Member

nit: space before &&

Member

I don't think buf.merged.is_none() will ever be true, since InitBuffersAndComm has initialized buf.merged?

Comment thread src/kvstore/comm.h
std::vector<NDArray> compressed_recv_buf;
};
std::unordered_map<int, BufferEntry> merge_buf_;
std::unordered_map<int, BufferEntry> stage_buf_;
Member

Please add a brief description of what this is for.

Comment thread src/kvstore/comm.h
stage.merged = NDArray(s, ctx, false, type);
ctx_info[ctx.dev_id].second += s.Size();
}
} else {
Member

What's the impact of this update on older devices/architectures?

Author

It will avoid using QPI.

@eric-haibin-lin
Member

@rahul003 please help review

@Laurawly
Author

Laurawly commented Dec 5, 2017

@piiswrong When not using NVLink, this method uses more PCIe bandwidth instead of QPI, which also accelerates the original communication.

@eric-haibin-lin eric-haibin-lin self-assigned this Dec 6, 2017
@rahul003
Member

rahul003 commented Dec 8, 2017

@Laurawly Can we also make use of this feature for the ReduceCompressed function?

Comment thread src/kvstore/comm.h Outdated
return std::get<1>(a).Size() > std::get<1>(b).Size();
});

std::vector<Context> g1, g2;
Member

Are there more readable variable names for g1 and g2 that would explain their purpose?

Author

So g represents "group" here; I separate the GPU cards into two communication groups.

Author

I've added some description for g1 and g2, and moved them to class members.

@Laurawly
Author

Laurawly commented Dec 13, 2017

@rahul003 Yeah, I'll update the ReduceCompressed function as well. Thanks for the reminder.

@Laurawly
Author

@rahul003 Could you review my updates in ReduceCompressed? Thanks in advance!

@szha
Member

szha commented Dec 22, 2017

@rahul003 ping

@Laurawly Laurawly force-pushed the master branch 4 times, most recently from 067b612 to e10997d Compare January 5, 2018 19:28
@eric-haibin-lin
Member

Any idea why test_rsp_pull failed?

Comment thread src/kvstore/comm.h
pinned_ctx_ = Context::CPUPinned(0);
}
virtual ~Comm() { }
Comm() { pinned_ctx_ = Context::CPUPinned(0); }
Contributor

Why don't we initialize this in the constructor initializer list? It's more efficient.
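The reviewer's suggestion can be sketched as follows. `PinnedHolder` and the string context are illustrative stand-ins (the real code uses mxnet::Context::CPUPinned(0)); the point is that an initializer list constructs the member directly instead of default-constructing it and then assigning.

```cpp
#include <cassert>
#include <string>

// Illustrative stand-in type: the real member is an mxnet::Context.
struct PinnedHolder {
  // Assignment in the body would default-construct `ctx` first, then
  // assign to it:
  //   PinnedHolder() { ctx = "cpu_pinned(0)"; }
  // The initializer list constructs it in place, skipping the extra step:
  PinnedHolder() : ctx("cpu_pinned(0)") {}
  std::string ctx;
};
```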

Comment thread src/kvstore/comm.h
int key, const NDArray& src,
const std::vector<NDArray*> dst, int priority) = 0;
virtual void Broadcast(int key, const NDArray& src,
const std::vector<NDArray*> dst, int priority) = 0;
Contributor

The vector is passed by value; shouldn't it be passed by reference?
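A minimal sketch of the pass-by-reference point, with a hypothetical `CountDst` helper standing in for Broadcast's parameter: taking the vector by const reference avoids copying the whole vector of pointers on every call, while leaving the call site unchanged.

```cpp
#include <cassert>
#include <vector>

// Taking `dst` by const reference avoids copying the vector (and its
// pointer elements) on each call; by-value would copy it every time.
int CountDst(const std::vector<int*>& dst) {
  return static_cast<int>(dst.size());
}
```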

Comment thread src/kvstore/comm.h Outdated
auto& buf = merge_buf_[key];
std::vector<NDArray> reduce(src.size());
if (buf.copy_buf.empty()) {
auto& stage = stage_buf_[key];
Contributor

I would write out the full type here for readability.

Comment thread src/kvstore/comm.h

void Broadcast(int key, const NDArray& src,
const std::vector<NDArray*> dst, int priority) override {
void Broadcast(int key, const NDArray& src, const std::vector<NDArray*> dst,
Contributor

Shouldn't this be passed by reference?

@larroy
Contributor

larroy commented Jan 11, 2018

Why is this file such a big header with no implementation file?

@eric-haibin-lin
Member

@rahul003 can you review the changes made for grad compression? Thanks!

Member

@rahul003 rahul003 left a comment

Thanks for modifying reduceCompressed too.
Could you please add a few comments to reduce or reduceCompressed explaining the flow of data? That would make this easier for others to maintain or develop further.

Comment thread src/kvstore/comm.h Outdated
}

/// \brief the NVLinked connected gpu groups
std::vector<Context> g1, g2;
Member

It might be more readable to expand the names of these variables?

Comment thread src/kvstore/comm.h
}
} else {
// QPI connections are included: use spanning tree
size_t gpu0, gpu1;
Member

Could you add some comments on what the computation below is doing? What do gpu0 and gpu1 hold?

Comment thread src/kvstore/comm.h Outdated
int id = src[i].ctx().dev_id;
if ((!buf.merged.is_none() && id == stage.merged.ctx().dev_id) ||
(buf.merged.is_none() && i == 0)) {
CopyFromTo(src[i], &(stage.merged), priority);
Member

Why do we have to copy src[i] onto the same ctx? Can we use src[i] directly?

Comment thread src/kvstore/comm.h Outdated
buf.copy_buf.resize(g1.size() + 1);
buf.compressed_recv_buf.resize(g1.size() + 1);
buf.compressed_send_buf.resize(g1.size() + 1);
buf.residual.resize(g1.size() + 1);
Member

We are declaring g1.size()+1 as the size of the residual array, but residuals are not sent to other GPUs, so we don't need to allocate the extra residual array.

Author

That extra array is for copying the reduced value back from the stage buffer.

Member

But the residual array only remains on the original GPU. It is never sent anywhere; it is only updated in place. Or are you just declaring an extra array for the residual so that you can index this array like the other arrays (copy_buf or compressed_recv_buf)?

Either way, we can avoid creating an extra residual array, right? That would be a significant memory saving (equal to the parameters of the model).

Author

Oh, I see what you mean. Yeah, that's right. I'll correct it accordingly.

Author

Tested tests/nightly/test_kvstore.py and it passes.

@rahul003
Member

rahul003 commented Jan 25, 2018

Could you please do these three things:

  1. Add some comments on the flow of data in the reduce function? The flow is not easy to follow from the code alone.
  2. Ensure that tests/nightly/test_kvstore.py passes.
  3. Fix the extra residual array issue.

@Laurawly
Author

@eric-haibin-lin Could you check whether test_rsp_pull passes now?

@rahul003
Member

rahul003 commented Jan 26, 2018

No, it is still failing.

test_kvstore_gpu.test_rsp_push_pull ... terminate called after throwing an instance of 'dmlc::Error'

  what():  [23:44:37] src/engine/./threaded_engine.h:359: [23:44:37] src/ndarray/ndarray_function.cc:181: ElementwiseSum<cpu> has not been implemented for storage_type = << 0


Stack trace returned 10 entries:

[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f9f39b7334a]

[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f9f39b73ee8]

[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(void mxnet::ndarray::ElementwiseSum<mshadow::cpu>(mshadow::Stream<mshadow::cpu>*, mxnet::Resource const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::NDArray*)+0x6a) [0x7f9f3c16f8aa]

[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(+0x2f0ad9d) [0x7f9f3c1a5d9d]

[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(+0x330848b) [0x7f9f3c5a348b]

[bt] (5) /workspace/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x100) [0x7f9f3c5afb50]

[bt] (6) /workspace/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>&&)+0xe2) [0x7f9f3c5b7b42]

[bt] (7) /workspace/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x4a) [0x7f9f3c5b210a]

[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f9f45601c80]

[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f9f4cf136ba]



A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.


Stack trace returned 8 entries:

[bt] (0) /workspace/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f9f39b7334a]

[bt] (1) /workspace/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f9f39b73ee8]

[bt] (2) /workspace/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x39a) [0x7f9f3c5afdea]

[bt] (3) /workspace/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>&&)+0xe2) [0x7f9f3c5b7b42]

[bt] (4) /workspace/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x4a) [0x7f9f3c5b210a]

[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f9f45601c80]

[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f9f4cf136ba]

[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f9f4cc493dd]


MKL Build:20171227

@Laurawly Laurawly force-pushed the master branch 2 times, most recently from 7332be1 to d3aeed5 Compare January 29, 2018 19:47
@Laurawly
Author

Laurawly commented Jan 29, 2018

@rahul003 should be solved by commit 683653e

@Laurawly
Author

@piiswrong ping.

Comment thread src/kvstore/comm.h
int mask = src.ctx().dev_mask();
if (mask == Context::kCPU) {
for (auto d : dst) CopyFromTo(src, d, priority);
for (auto& d : dst) CopyFromTo(src, d, priority);
Member

Is broadcast not leveraging the new comm pattern? The same logic could be applied to fully utilize the bandwidth during the copy, right? Or is the plan to do that in the next PR?
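The grouped pattern the reviewer asks about could be applied to broadcast roughly as follows. `GroupedBroadcast` is a hypothetical sketch, not the PR's code: it copies once across the inter-group link and then fans out within each group, returning the copy counts so the link usage is visible. The `src_dev < 4` split mirrors the PR's grouping assumption.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical two-stage broadcast over two GPU groups g1 (ids 0..3)
// and g2 (ids 4..7). Returns {inter-group copies, intra-group copies}.
std::pair<int, int> GroupedBroadcast(int src_dev, const std::vector<int>& g1,
                                     const std::vector<int>& g2) {
  int inter = 0, intra = 0;
  const std::vector<int>& own = (src_dev < 4) ? g1 : g2;
  const std::vector<int>& other = (src_dev < 4) ? g2 : g1;
  if (!other.empty()) ++inter;                 // one copy to the other group's root
  intra += static_cast<int>(own.size()) - 1;   // fan out within the source group
  if (!other.empty()) {
    intra += static_cast<int>(other.size()) - 1;  // fan out within the other group
  }
  return {inter, intra};
}
```

Only a single transfer crosses the slow inter-group link; everything else stays on NVLink/PCIe within a group.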

Comment thread src/kvstore/comm.h
rctx.get_stream<gpu>()->Wait();
break;
}
case gpu::kDevMask: {
Contributor

Wrong indentation?

Contributor

@piiswrong piiswrong left a comment

Two general points:

  1. Too many irrelevant cosmetic changes.
  2. The magic number 4 appears many times. Are you assuming there are 4 GPUs? This should be queried dynamically instead of being a constant.

Comment thread src/kvstore/comm.h
*/
#ifndef MXNET_KVSTORE_COMM_H_
#define MXNET_KVSTORE_COMM_H_
#define NVLINK_SUPPORT 4
Contributor

What's this? Can we avoid magic numbers?

Comment thread src/kvstore/comm.h
on_complete();
}, Context::CPU(), {src.var(), row_id.var()}, {out_cpu.var()},
FnProperty::kNormal, priority, PROFILER_MESSAGE("KVStoreSparseRetain"));
[=](RunContext rctx, Engine::CallbackOnComplete on_complete) {
Contributor

This sort of cosmetic change is really distracting during code review. Try not to do it next time.

Comment thread src/kvstore/comm.h
reduce[0] = buf.merged;

if (buf.copy_buf.empty()) {
// TODO(mli) this results in large device memory usage for huge ndarray,
Contributor

Is this TODO handled by this PR?

Comment thread src/kvstore/comm.h
inline static void ReduceSumCPU(
const std::vector<DType*> &dptr, size_t offset, index_t size) {
template <typename DType>
inline static void ReduceSumCPU(const std::vector<DType*>& dptr,
Contributor

These changes are really annoying.

Comment thread src/kvstore/comm.h
reduce_s.resize(stage.copy_buf.size());
for (size_t i = 0, j = 0; i < src.size(); ++i) {
int id = src[i].ctx().dev_id;
if (id >= 4 || buf.merged.is_none()) {
Contributor

Why 4? Can we avoid magic numbers?

@CodingCat
Contributor

Hi, the community vote on associating code changes with JIRA has passed (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E).

We have updated the guidelines for contributors at https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA issue at https://issues.apache.org/jira/projects/MXNET/issues/ describing the work in this pull request, and include the JIRA title in your PR as [MXNET-xxxx] Your title, where MXNET-xxxx is the JIRA id.

Thanks!

@Jerryzcn
Contributor

When can we expect this to be merged?

@marcoabreu
Contributor

@Jerryzcn we're waiting for @Laurawly to address the review comments

@Laurawly Laurawly changed the title NVLink communication pattern updated [MXNET-331]NVLink communication pattern updated Apr 18, 2018
@Laurawly Laurawly changed the title [MXNET-331]NVLink communication pattern updated [MXNET-331] NVLink communication pattern updated Apr 18, 2018
@eric-haibin-lin
Member

Closing this for now due to inactivity.

@eric-haibin-lin
Member

Moved to #11357
