Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Conversation

@rahul003
Copy link
Member

@rahul003 rahul003 commented Oct 10, 2017

Description

Fixes segfaults in kvstore_dist for row sparse, and fixes tests in dist_sync_kvstore
@eric-haibin-lin and I worked on this.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage

Changes

  1. Move copying of indices to happen before ZPull, because after ZPull we release the locks and the later function don't see actual variable
  2. nullptr was not handled for offsets variable
  3. fixes tests which were broken when WaitToWrite was removed in Removing WaitToWrite in dist kvstore #8116 . Also added comments describing how they are no guarantees for (immediate) consecutive pushes
  4. Removed the usage of reference in BroadcastRowSparse to keep it consistent with style of Broadcast

Comments

When pushes happen one after the other, without a pull in between, there is no ordering guarantee we provide now that the WaitToWrite was removed in #8116 for performance reasons. The second push may start and finish before first push finishes

Please comment if we missed something, or if you have any suggestions, or if any clarifications are needed.

@eric-haibin-lin eric-haibin-lin self-requested a review October 10, 2017 17:06
@szha
Copy link
Member

szha commented Oct 10, 2017

PR checklist is for reflecting the progress and important aspects of the change. Please maintain.


void BroadcastRowSparse(int key, const NDArray& src,
const std::vector<std::pair<NDArray*, NDArray>>& dst,
const std::vector<std::pair<NDArray*, NDArray>> dst,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted after verifying that passing by reference doesn't cause scope issues

for (size_t i = 0; i < num_vals; i++) {
auto &row_id = target_val_rowids[i].second;
NDArray indices = row_id.Copy(pinned_ctx_);
NDArray indices(row_id.shape(), pinned_ctx_, false, mshadow::kInt64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sure rsp_pull(val, rowid) can accept rowid with dtype other than int64.

// This shouldn't affect training of networks though because training involves
// a sequence of push, pull, then push. This imposes ordering that the
// second push happens after the first pull, and the pull happens after first push.
send_buf = merged; // avoid memory copy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix indent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// at this point, later functions may access the indices variable while copy happens
mshadow::Copy(recv_buf->aux_data(kIdx).FlatTo1D<cpu, int64_t>(),
indices_data.FlatTo1D<cpu, int64_t>());
CHECK_NOTNULL(ps_worker_)->ZPull(pskv.keys, vals, &pskv.lens, kRowSparsePushPull,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to make sure Copy(indices) is done before ZPull() is completed, so that we don't broadcast with garbage indices

pskv.keys.push_back(ps_key);
pskv.lens.push_back(unit_len);
pskv.size += unit_len;
if (offsets) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

offset might be nullptr when gradients are complete zeros

@rahul003
Copy link
Member Author

Addressed comments and made a few changes

Update docs
Update comment
@rahul003 rahul003 changed the title Fix segfaults in and tests for distributed kvstore Fix segfaults and tests for distributed kvstore Oct 10, 2017
@piiswrong piiswrong merged commit 4286780 into apache:master Oct 11, 2017
mbaijal pushed a commit to mbaijal/incubator-mxnet that referenced this pull request Oct 12, 2017
* fix segfault in kvstore_dist for row sparse by: moving copy before zpull, and handling the case of a nullptr

* fix indent, and bring back references

* Update kvstore.py

Update docs

* Update kvstore_dist.h

Update comment

* Update kvstore.py

lint issues

* warning updated
crazy-cat pushed a commit to crazy-cat/incubator-mxnet that referenced this pull request Oct 26, 2017
* fix segfault in kvstore_dist for row sparse by: moving copy before zpull, and handling the case of a nullptr

* fix indent, and bring back references

* Update kvstore.py

Update docs

* Update kvstore_dist.h

Update comment

* Update kvstore.py

lint issues

* warning updated
@rahul003 rahul003 deleted the kvstore-push-test branch October 31, 2017 21:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants