DID loop split for allgather for non-outermost sharded axis (#4170)
Conversation
!test
Review updated until commit 1c8204c
!test
LGTM otherwise. Thanks for the change!

!test
csrc/multidevice/communication.cpp (outdated):
  // Presegmentation pass `makeReshardingContiguous` ensures that the tvs are contiguous
  // and HostIrExecutor validates the tensor against the tv allocation domain.
  ...
  auto flattened_output_tensor = output_tensor.as_strided({output_tensor.numel()}, {1});
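The `as_strided` flatten above is only safe for a contiguous tensor. A minimal sketch of why, using plain Python stride arithmetic (an illustration, not the ATen implementation):

```python
# Element (i, j) of a row-major-contiguous (rows, cols) tensor lives at flat
# offset i*cols + j, so viewing the buffer as shape (rows*cols,) with stride 1
# visits the same elements in the same order. For a non-contiguous layout
# (e.g. a transposed view), the offsets are out of order, and a stride-1
# flatten would reinterpret the data rather than just reshape it.

def flat_offset(i, j, strides):
    return i * strides[0] + j * strides[1]

rows, cols = 2, 3
contig = (cols, 1)      # row-major strides of a (2, 3) tensor
transposed = (1, rows)  # strides of a (2, 3) view made by transposing a (3, 2) tensor

# Contiguous: offsets enumerate 0..5 in order, so a stride-1 flatten is a free view.
assert [flat_offset(i, j, contig) for i in range(rows) for j in range(cols)] == list(range(6))
# Transposed: offsets are a permutation of 0..5, so flattening would need a copy first.
assert [flat_offset(i, j, transposed) for i in range(rows) for j in range(cols)] == [0, 2, 4, 1, 3, 5]
```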
Also check contiguity of communication->in() and out()?
It is already enforced by the makeReshardingContiguous pass, so I am not duplicating it here.
Nit: makeReshardingContiguous is a bit too far away, and many things could change in between. For example, makeReshardingContiguous runs before segmentation, while postSingleCommunication runs at runtime; makeReshardingContiguous works on fusion IR (e.g. set and reduce), while postSingleCommunication works on host IR (e.g. Communication).
I am running into some test failures with a contiguity check here for the manual tests in test_multidevice_host_ir.cpp. Since these tests do not set an allocation domain, contiguity is set to false. How cumbersome would it be to require the manual tests to also set the allocation domain correctly?
CC: @samnordmann
> How cumbersome is it to require manual tests also to have the allocation domain set correctly?

IIUC, it should be set correctly, so let's set it correctly. The change can't be too large: test_multidevice_host_ir.cpp is under 500 lines, and that file has fewer than 20 calls to make*Tensor, many of which are already Contig.
Hey Priya, I am not sure how cumbersome that is, but if needed, feel free to do it, and please let me know what it looks like.
Let me know if you need help.
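For illustration, a minimal row-major contiguity predicate of the kind debated in this thread, in plain Python (the helper name and signature are assumptions, not the Fuser API):

```python
def is_row_major_contiguous(sizes, strides):
    # Walk dimensions innermost-first; each stride must equal the product of
    # the sizes of all faster-varying dimensions. Size-1 dimensions are
    # exempt, since their stride never affects addressing.
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True

assert is_row_major_contiguous([2, 3], [3, 1])      # dense row-major layout
assert not is_row_major_contiguous([2, 3], [1, 2])  # transposed view
```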
!test
csrc/host_ir/executor.cpp (outdated):
  auto out_tensor = output_args[out_idx].as<at::Tensor>();
  ...
- c10::intrusive_ptr<c10d::Work> work = postSingleCommunication(
+ c10::intrusive_ptr<c10d::Work> work = validateAndPostSingleCommunication(
I'll skip validation for HostIrExecutor. It's done elsewhere already.
Input: Fuser/csrc/tensor_metadata.cpp, line 364 (commit 8a1028f)
Output: Fuser/csrc/runtime/allocations.cpp, line 692 (commit 8a1028f)
Outputs are validated only when they are not provided, i.e. when output_args is empty. Otherwise, there is no validation. Inputs can be skipped, like you said.
I am looking at the call graph to see in which cases output_args come pre-allocated.
> what is the case, where output_args come pre-allocated

I looked for this before and found only unit tests that call KernelExecutor explicitly rather than via FusionExecutorCache.
Right. FusionExecutorCache provides an empty output_args (Fuser/csrc/runtime/fusion_kernel_runtime.cpp, line 714, commit 802f042). I'll leave the validation in for just the output tensor and skip it for inputs.
csrc/host_ir/executor.cpp (outdated):
  for (const auto i : c10::irange(tensors.size())) {
    const auto& tensor = tensors.at(i);
    const auto& tv = tvs.at(i);
  }
  // getBackendForTeam throws an error if the requested backend type isn't
  // available. Therefore, we call it after the isBackendAvailable check.
  communicator_->setDefaultBackend(backend_type);
I discovered that while this allows me to set the backend when using FusionExecutorCache, communication->backend() is still different from the backend parameter passed to postSingleCommunication. When not given a backend, getBackendForTeam returns the default backend set for the communicator. For executions running through HostIrEvaluator, this approach will not work.
We should make the backend uniform between the communicator and the Communication object for easier verification. I'll attempt this in a future PR.
!test
…utor (#4470) Prep PR for Issue #3900. I am modifying the `reorderShardedAxisPass` to set the allocation domain consistent with the memory-layout requirements of ProcessGroup NCCL and UCC, without changing the logical shape (see PR #4170 for an example). MultiDeviceExecutor does not respect the allocation domain; hence, I am removing these tests. Issue #4453.
Adds support for allgather when the sharded axis is not outermost.
ProcessGroup NCCL and UCC require the allocation of the sharded axis to be outermost. We do not change the logical shape; instead, we permute the tensors to meet the requirements of NCCL and UCC within postAllgather. This will be added within the reorderShardedAxis preseg pass to correctly set the loop and allocation domains for allgather communication. Additionally, a set operation is needed to change the allocation of the input if it does not have the sharded axis as the outermost allocated axis.
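The permute-then-gather idea described above can be sketched in plain Python (a toy simulation, not the Fuser/NCCL code; the 2x4 matrix and shard values are made up):

```python
# NCCL/UCC allgather assumes the sharded axis is outermost in memory. When a
# tensor is sharded on a non-outermost axis, permute so that axis comes first,
# concatenate the per-rank shards along it, then permute back -- the logical
# shape of the result is unchanged.

def transpose(m):
    return [list(row) for row in zip(*m)]

# Two ranks each hold a column shard of a 2x4 matrix (sharded on axis 1).
shards = [
    [[0, 1], [4, 5]],  # rank 0: columns 0-1
    [[2, 3], [6, 7]],  # rank 1: columns 2-3
]

# Permute (transpose) each shard so the sharded axis is outermost, gather by
# concatenating along the new outermost axis, then permute back.
gathered = transpose([row for shard in shards for row in transpose(shard)])
assert gathered == [[0, 1, 2, 3], [4, 5, 6, 7]]
```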