Conversation
Compare 8b782a3 to 937d050
Remove `is_root_in_mesh` and `team` from `CommParams`.
!build --dist
```cpp
std::unordered_set<DeviceIdxType> team_without_duplicates(
    params.team.begin(), params.team.end());
NVF_ERROR(
    team_without_duplicates.size() == params.team.size(),
    "the communication must not involve the same device more than once");
```
I could move this check to Communication::team() but it looks unnecessary because
- DeviceMesh already checks against duplicates, and
- Communication::team() adds the root only when it's not in the mesh.
This sounds ok to me. I think only the CommunicationTests directly set the team, but the normal use case creates a team from the DeviceMesh, which checks against duplicates.
This PR changed the CommunicationTests as well to set mesh instead of team. The bottom line is that CommParams contains all parameters (e.g. mesh) needed for constructing a Communication and other fields (e.g. team) are all computed.
```cpp
NVF_ERROR(
    team_without_duplicates.size() == params.team.size(),
    "the communication must not involve the same device more than once");
NVF_ERROR(!params.team.empty(), "the team size must be greater than 0");
```
This is moved to Communication's constructor.
```cpp
NVF_ERROR(!params.team.empty(), "the team size must be greater than 0");
if (hasRoot(params.type)) {
  auto it = std::find(params.team.begin(), params.team.end(), params.root);
  NVF_ERROR(
```
This check could be moved to Communication::team(), but it would look a bit silly:

```cpp
if (root isn't in team) {
  team.push_back(root);
}
assert(root is in team);  // obviously true
```
```cpp
namespace {

inline void assertBufferCount(
```
Unless proven to be important, I tend to leave the inlining decision to the compiler. Otherwise, the function could organically grow into a giant without people realizing the overhead of inlining.
Thanks @wujingyue! I am glad to see this kind of clean-up, because it significantly simplifies the code! However, I am not sure about the design decision and the choice of parameters to encode the communication; I am talking more precisely about keeping the `mesh`.

Actually, the mesh shouldn't be needed at all to describe the collective. I understand that for now it is used for implementing the trick of adding an empty tensor when the root is not in the sender/receiver mesh, but that is a special case.
Anyway, my point is that this special case should not influence our design decisions. Given the current stage of development, storing the `team` explicitly looks necessary to me. For example: to produce tv1, we need two allgathers, Allgather 1 among Team (0,1,2,3) and Allgather 2 among Team (4,5,6,7). We see that, with no extra info, there is no way to deduce the team from the mesh.

Another thing that I don't like in the current code are lines like https://github.com/NVIDIA/Fuser/blob/e7fdb714bec23ef51d8f72b81c42046fb204273b/csrc/multidevice/lower_communication.cpp#L184C11-L184C22

The design I have been thinking about and that I would suggest is to have:
In my opinion, this design is cleaner and more sustainable for the future. @cowanmeg @wujingyue @naoyam, please let me know what you think; I'd be happy to have your opinion and to discuss before deciding what to do.
I'm almost on the same page as you. I think the second item will be needed to conveniently traverse through Exprs in host IR, e.g., by reusing the existing graph traversal utilities in the code base. That's why I put a TODO here: https://github.com/NVIDIA/Fuser/pull/2256/files#diff-8a31ccb86eb9f8d9b346f35addcbd61ba465a418dd23124e61634f1fd5ecc9e7R51-R52. So, in the end state, Communication will contain neither `team` nor `mesh`. If we agree on that, I'll send out a PR by Monday to add input and output to the op.
As discussed orally, I think we still need to pass the Team -- not the mesh.
I may have misunderstood what you mean the first time I read this, so let me try again. For this example, are you proposing to lower/compile the fusion into two? That would make some sense; otherwise, I fail to see how.
There will only be one allgather in each program, but each device will look up its team (in this case the devices along its column (axis y) of the DeviceMesh). Globally there will be multiple allgathers running in parallel over disjoint teams. If a device is in the mesh of the consumer/producer tv, it's guaranteed to take part in the communication, but it will only involve devices in its team (which will be an axis of the device mesh).
That was my first understanding as well. But, if we choose to implement it that way,
Sorry for the long comment... I have many things to say :) Let's discuss orally if things are not clear.

About the PR status only: as far as I'm concerned, the other open PRs authored by @wujingyue pose no problem and could be merged independently from this one. The only problematic point, imo, in the current PR is that I want to be able to pass
Imo, we should represent all the communications in the DAG, even the ones not run by the current device. In other words, each device should be able to see the complete DAG of operations that is run on the network. If a device is not involved in a communication, we can either (symbolically) predicate out this communication (or, better, a whole segment) or at runtime simply skip posting any collective at the executor level.
that's exactly what I mean! Thanks for rephrasing.
I think I see what you mean, and I am on the same page. However, it is probably a not-so-important detail, but imo
Let me add one more example to illustrate that there is not always a canonical way to deduce the Team from the sender/receiver meshes. This situation is a type of pipeline parallelism where tv0 is duplicated on {0,1}, and we want to send the whole data to {2,3}. There are many ways to lower this resharding op to MPI communications:
This dummy example illustrates that there are lowering decisions to make when mapping a resharding tv op to an MPI collective. We want our `Communication` class to encode those decisions.

Let me now go back to what I wrote before:
Now that I think about it, we shouldn't pass the I/O tvs as the communication's I/O at construction. Instead, the communication should optionally hold a pointer to the Expr* from which it was lowered.
This is not a problem, because a
Thanks for the verbosity -- it clarified lots of your thought process, which is critical given our working hours rarely overlap :) I'm summarizing the discussion with you today: I'll add input/output TensorViews to Communication in future PRs. While I understand how you can live without them using PostOnStream, I plan to try out, in parallel, using them as top-level host IR instructions. I haven't convinced myself whether PostOnStream is needed eventually (there are many ways to represent stream assignments), so it makes sense at the moment to keep the flexibility to use Communication without having to encapsulate it in a PostOnStream.
We are on the same page already -- I'm writing this down for posterity: there's a difference between fusion IR, which requires SSA, and kernel IR, which doesn't and is, I think, more analogous to host IR. The concern you raised is definitely a challenge to SSA (although there's a solution called Phi nodes to address that), but it shouldn't be a problem for host IR.
changed the title: is_root_in_mesh and team from CommParams → is_root_in_mesh from CommParams
changed the title: is_root_in_mesh from CommParams → is_root_in_mesh from CommParams and add mesh
```diff
 }

-  if (params.team.size() == 1) {
+  if (communication->team().size() == 1) {
```
This was an intermediate change that turns out to be unnecessary for this PR. However, #2256 will remove CommParams, so I chose to keep this change which makes the code closer to the end state.
I added `mesh`.
!build

!build

!build
As a follow-up to #2172. `mesh` will eventually go away after the input and output TensorViews of a communication are embedded in the base Expr. Anyhow, `is_root_in_mesh` can be computed from other parameters, so it is removed.
…OnStream` (#2328)

Bootstrapping for simplifying the design. With this patch, the `Communication` class is used to post a collective without the need of wrapping it inside a `PostOnStream` IR. There is no functional change; this is only a minor cleanup I would like to upstream before introducing the "collective wait" IR.

To achieve this, we need to add I/O TVs to the `Communication`'s constructor and use those I/O directly. This patch aligns with the discussion from #2250 and should lead to further simplification in `MultiDeviceExecutor`, which I would like to leave to another PR to avoid conflicts.

Note that the `Communication` class is used in two places: Host IRs and `MultiDeviceExecutor`. Since the former should eventually replace the latter, we maintain both usages for now. In `MultiDeviceExecutor`, we don't set a `Communication`'s I/O; that's why those arguments are made optional in the constructor.

---------

Co-authored-by: Jingyue Wu <wujingyue@gmail.com>