Fix segfaults and tests for distributed kvstore #8207
…ull, and handling the case of a nullptr
src/kvstore/comm.h
Outdated
void BroadcastRowSparse(int key, const NDArray& src,
const std::vector<std::pair<NDArray*, NDArray>>& dst,
const std::vector<std::pair<NDArray*, NDArray>> dst,
?
Reverted after verifying that passing by reference doesn't cause scope issues.
for (size_t i = 0; i < num_vals; i++) {
auto &row_id = target_val_rowids[i].second;
NDArray indices = row_id.Copy(pinned_ctx_);
NDArray indices(row_id.shape(), pinned_ctx_, false, mshadow::kInt64);
?
This makes sure rsp_pull(val, rowid) can accept rowid with dtype other than int64.
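To illustrate the dtype point above: the diff allocates `indices` as an int64 array rather than copying `row_id` with whatever dtype it arrived in. A minimal standalone sketch of the same idea follows, assuming the row ids are then copied into that buffer with an element-wise cast. `ToInt64RowIds` is a hypothetical helper written for this example and is not part of MXNet.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not MXNet code): widen row ids of any integer type
// into an int64 buffer, so downstream code can always assume int64 indices.
template <typename T>
std::vector<int64_t> ToInt64RowIds(const std::vector<T>& row_ids) {
  // The destination is allocated as int64 regardless of the source type.
  std::vector<int64_t> indices(row_ids.size());
  for (std::size_t i = 0; i < row_ids.size(); ++i) {
    indices[i] = static_cast<int64_t>(row_ids[i]);  // element-wise cast
  }
  return indices;
}
```

With this shape, a caller holding int32 row ids and a caller holding int64 row ids both end up with the same int64 index buffer.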
src/kvstore/kvstore_dist.h
Outdated
// This shouldn't affect training of networks though because training involves
// a sequence of push, pull, then push. This imposes ordering that the
// second push happens after the first pull, and the pull happens after first push.
send_buf = merged; // avoid memory copy
fix indent
Fixed
// at this point, later functions may access the indices variable while copy happens
mshadow::Copy(recv_buf->aux_data(kIdx).FlatTo1D<cpu, int64_t>(),
indices_data.FlatTo1D<cpu, int64_t>());
CHECK_NOTNULL(ps_worker_)->ZPull(pskv.keys, vals, &pskv.lens, kRowSparsePushPull,
?
We want to make sure Copy(indices) is done before ZPull() completes, so that we don't broadcast with garbage indices.
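The ordering concern above can be shown with a small standalone sketch. It assumes a pull whose completion callback may run as soon as the pull is issued; `RecvBuffer`, `FakePull`, `PullWithCopyFirst`, and `PullWithCopyLast` are hypothetical names for this example and do not correspond to the ps-lite or MXNet APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the receive buffer: row indices plus row values.
struct RecvBuffer {
  std::vector<int64_t> indices;
  std::vector<float> values;
};

// Hypothetical stand-in for a pull with a completion callback. Here the
// callback fires immediately; in a real system it may fire at any time
// after the pull is issued.
void FakePull(RecvBuffer* buf, std::size_t num_rows,
              const std::function<void()>& on_done) {
  buf->values.assign(num_rows, 1.0f);
  on_done();
}

// Copy the indices first, then pull: the callback observes valid indices.
std::vector<int64_t> PullWithCopyFirst(const std::vector<int64_t>& row_ids) {
  RecvBuffer buf;
  std::vector<int64_t> seen;  // what the "broadcast" step would read
  buf.indices = row_ids;
  FakePull(&buf, row_ids.size(), [&] { seen = buf.indices; });
  return seen;
}

// Pull first, then copy: the callback may run before the indices exist.
std::vector<int64_t> PullWithCopyLast(const std::vector<int64_t>& row_ids) {
  RecvBuffer buf;
  std::vector<int64_t> seen;
  FakePull(&buf, row_ids.size(), [&] { seen = buf.indices; });
  buf.indices = row_ids;  // too late for the callback above
  return seen;
}
```

In this sketch the late copy leaves the callback with an empty index list; in the real code the buffer would instead hold uninitialized data, which matches the "garbage indices" described in the reply.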
pskv.keys.push_back(ps_key);
pskv.lens.push_back(unit_len);
pskv.size += unit_len;
if (offsets) {
?
`offsets` might be nullptr when the gradients are all zeros.
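A minimal sketch of the guard described above, assuming the row offsets arrive as a raw pointer that is null when there are no non-zero rows. `EncodedKeys` and `EncodeRows` are hypothetical and only loosely modeled on the `pskv` fields visible in the diff; the actual key encoding in `kvstore_dist.h` may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container loosely modeled on the keys/lens/size fields in the diff.
struct EncodedKeys {
  std::vector<int> keys;
  std::vector<int> lens;
  std::size_t size = 0;
};

// Hypothetical encoder: one entry per row offset. `offsets` may be nullptr
// when the gradient has no non-zero rows, so check it before dereferencing.
EncodedKeys EncodeRows(int base_key, int unit_len,
                       const int64_t* offsets, std::size_t num_rows) {
  EncodedKeys out;
  if (offsets) {  // skip the loop entirely for an all-zero gradient
    for (std::size_t i = 0; i < num_rows; ++i) {
      out.keys.push_back(base_key + static_cast<int>(offsets[i]));
      out.lens.push_back(unit_len);
      out.size += static_cast<std::size_t>(unit_len);
    }
  }
  return out;
}
```

Without the check, the loop body would dereference a null pointer for an all-zero gradient, which is one way a segfault of this kind can arise.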
Addressed comments and made a few changes
Update docs
Update comment
* fix segfault in kvstore_dist for row sparse by: moving copy before zpull, and handling the case of a nullptr
* fix indent, and bring back references
* Update kvstore.py Update docs
* Update kvstore_dist.h Update comment
* Update kvstore.py lint issues
* warning updated
Description
Fixes segfaults in kvstore_dist for row sparse, and fixes the tests in dist_sync_kvstore.
@eric-haibin-lin and I worked on this.
Changes
Comments
When pushes happen one after the other, without a pull in between, we no longer provide an ordering guarantee, because WaitToWrite was removed in #8116 for performance reasons. The second push may start and finish before the first push finishes.
Please comment if we missed something, if you have any suggestions, or if any clarifications are needed.