
Add another constructor of Communication that takes I/O TensorViews instead of the mesh. #2411

Merged
wujingyue merged 14 commits into main from wjy/mesh
Jun 25, 2024
Conversation

@wujingyue
Collaborator

@wujingyue wujingyue commented Jun 13, 2024

The other constructor is still used in some tests.

Communications created by the new constructor must be put in a non-SSA IrContainer. This is because multiple Communications can have the same input and output TensorViews, e.g., Gather with a receiver mesh of size 2.

@wujingyue wujingyue force-pushed the wjy/mesh branch 2 times, most recently from 37db3ee to 87de3fa Compare June 15, 2024 23:45
@wujingyue wujingyue marked this pull request as draft June 15, 2024 23:47
@wujingyue wujingyue force-pushed the wjy/mesh branch 6 times, most recently from d3e326c to 8007ade Compare June 17, 2024 04:57
@wujingyue
Collaborator Author

!build

1 similar comment
@wujingyue
Collaborator Author

!build

Comment on lines +147 to +152
NVF_ERROR(
in->getDeviceMesh().size() > 0,
"The input mesh size must be greater than 0.");
NVF_ERROR(
out->getDeviceMesh().size() > 0,
"The output mesh size must be greater than 0.");
Collaborator Author

This sounds like status quo, but correct me if it's the wrong contract to enforce. We could allow one of the meshes to be empty and inferred from the other, but it feels less explicit and adds complexity to properly infer meshes.

Collaborator

This sounds like status quo, but correct me if it's the wrong contract to enforce.

I think it's ok

We could allow one of the meshes to be empty and inferred from the other, but it feels less explicit and adds complexity to properly infer meshes.

I wouldn't do that. IMO, inferring meshes could be implemented in a scheduling propagation pass, upstream of instantiating communications, but not buried inside the communication's constructor.

Collaborator

We could allow one of the meshes to be empty and inferred from the other, but it feels less explicit and adds complexity to properly infer meshes.

There might be a few edge cases where this could be nice... For example, if a scheduler adds new exprs, it should re-run the sharding propagation pass, but it's easy to forget and we haven't updated all these passes everywhere. I would leave this as an error for now, but something to consider.

Collaborator Author

if a scheduler adds new exprs, it should re-run the sharding propagation pass

Or it should remember to set the device mesh of the new TensorViews.

sender_mesh.at(i),
DeviceMesh({receiver_mesh.at(i)}),
comms);
const DeviceIdxType sender = sender_mesh.at(i);
Collaborator Author
@wujingyue wujingyue Jun 17, 2024

I think this change is beneficial but I'm open to ideas.

Consider a collective send/recv from mesh{0,1,2,3} to mesh{4,5,6,7} (for pipeline parallelism) or from mesh{0,1,2,3} to mesh{1,2,3,0} (for p2p ring), that's decomposed into four SendRecvs.

Currently, these four SendRecvs differ in mesh and therefore team. With this PR, they will differ in team but have the same input and output meshes, allowing them to reuse the same input and output TensorViews.

(These examples also make me think about letting SendRecv take a vector of send/recv pairs. This way, one SendRecv does it all and therefore shortens the host IR. But that's out of the scope of this PR.)

Collaborator
@samnordmann samnordmann Jun 17, 2024

allowing them to reuse the same input and output TensorViews.

That's IMO a bit of an artificial benefit, since the Team is normally enough to fully describe the Communication. In principle, executing a communication shouldn't rely on the I/O TVs.

However, the I/O TVs are indeed used currently, but only to detect whether the root belongs to the send & recv meshes in bcast, reduce, gather, and scatter. I think this dependency should be removed. I can see two alternative ways to avoid relying on the send/recv meshes:

  1. Having a bool root_in_meshes in the communication's params. Probably not ideal but still cheaper than relying on the send/recv meshes.
  2. Not supporting the case where the root of a collective doesn't belong to the send & recv meshes.

I personally would advocate for solution 2., for the following reasons:

  • Cases where the root doesn't belong to the send & recv mesh are unusual corner cases. I don't know of any application that features them.
  • Cases where the root doesn't belong to send&recv mesh do not respect MPI requirements and are thus not supported by the backends. Therefore, supporting these cases adds a lot of complexity in the code.
  • Not supporting this case implies no loss in generality, since we can always add a Send/Recv during lowering to fall back on a case where the root belongs to the send & recv mesh. E.g.:
    • To pass from tv0 sharded on {0,1} to tv1=set(tv0) on {2}, one could do Gather(team={0,1}, root=0) + SendRecv(dst=2, src=1)
    • To pass from tv0 sharded on {0} to tv1=set(tv0) sharded on {1, 2}, one could do SendRecv(dst=1, src=0) + Scatter(team={1,2}, root=2)
    • To reduce from tv0 sharded on {0,1} to tv1=reduce(tv0) on {2}, one could do Reduce(team={0,1}, root=0) + SendRecv(dst=2, src=0)
    • etc.
  • What we actually do to support this case is IMO unnatural/ugly, and even problematic for gather/scatter because 1) of the allocation and 2) bandwidth is used for communicating garbage data.

Other incidental remark: using an ATen local copy inside the collective post is not ideal... When possible, we should use bcast_oop, reduce_oop, etc. Or insert the copy op in the compute DAG.
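The fallback lowering in the bullets above can be written out as a toy two-step plan. This is not nvFuser code: `lowerGatherTo` and the string form of the steps are invented for illustration, and this sketch sends from the gather's root in the follow-up SendRecv, which is one of several valid choices.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Lower a gather whose destination device `dst` is outside the sender team
// into an in-team Gather followed by a SendRecv from the team's root to dst.
std::vector<std::string> lowerGatherTo(
    const std::vector<int>& team, int root, int dst) {
  std::ostringstream gather;
  gather << "Gather(team={";
  for (size_t i = 0; i < team.size(); ++i) {
    gather << (i ? "," : "") << team[i];
  }
  gather << "}, root=" << root << ")";

  std::ostringstream sendrecv;
  sendrecv << "SendRecv(src=" << root << ", dst=" << dst << ")";
  return {gather.str(), sendrecv.str()};
}
```

For the first bullet's example, `lowerGatherTo({0, 1}, 0, 2)` produces the plan Gather(team={0,1}, root=0) followed by SendRecv(src=0, dst=2), where every collective's root belongs to its team.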

Collaborator

(These examples also make me think about letting SendRecv take a vector of send/recv pairs. This way, one SendRev does all and therefore shortens the host IR. But that's out of the scope of this PR.)

Also out of the scope of the PR :)
Since these programs are SPMD-like, why not just use an expression to calculate each device's send/recv pair instead of expanding everything out? For the pipeline parallelism example, 0 sends to 4, 1 to 5, etc. Once a device has been determined to be in a pipeline stage, it's guaranteed to be either a sender or a receiver for one of many parallel communications.
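This suggestion amounts to replacing the enumerated SendRecvs with per-rank arithmetic. A sketch under assumed names (`myPeer`, `Peer`, and `stage_size` are made up here, not nvFuser API) for the mesh{0,1,2,3} to mesh{4,5,6,7} pipeline example:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct Peer {
  bool is_sender;
  int64_t peer;
};

// Each device computes its single peer from its own rank instead of the host
// IR enumerating one SendRecv per pair. Ranks [0, stage_size) send to the
// next stage; ranks [stage_size, 2 * stage_size) receive from the previous.
std::optional<Peer> myPeer(int64_t my_rank, int64_t stage_size) {
  if (my_rank < stage_size) {
    return Peer{/*is_sender=*/true, my_rank + stage_size};
  }
  if (my_rank < 2 * stage_size) {
    return Peer{/*is_sender=*/false, my_rank - stage_size};
  }
  return std::nullopt;  // not a participant in this communication
}
```

With stage_size = 4, rank 0 sends to 4 and rank 5 receives from 1, matching the expanded enumeration while keeping the host IR a single expression.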

Collaborator Author

@cowanmeg I like that and IIUC you are proposing something similar to this comment.

@wujingyue wujingyue marked this pull request as ready for review June 17, 2024 06:14
Collaborator
@samnordmann samnordmann left a comment

Thanks for this patch!

While having the ability to provide I/O Tensors to Communication may sound natural, I'm also concerned because, in principle, the Team should be enough to fully describe a Communication, and therefore we shouldn't rely on the I/O meshes. Currently, we do rely on those meshes, but only in an incidental way, and we should remove this dependency IMO -- I develop this idea in more detail in one of the comments.

I think adding the I/O is nice to express data dependency in the DAG, however, those arguments should be optional -- or, alternatively, Communications should be wrapped inside an op PostOnStream that manages the I/O, i.e., we should revert #2328.
This would save a lot of code complexity in the current code base, and also in the present patch (c.f. the tests that need to artificially define I/O TVs).

I am making those points because I'm afraid that the Communication's apparent dependency on the I/O TVs is misleading when it comes to making design decisions. I think that the IR Communication should be as close as possible to the backend API so it matches the generated code better -- keeping I/O TVs and supporting corner cases (aka when the root is not in the send/recv meshes) may add unnecessary complexity, both in usage (artificially creating I/O TVs) and in execution/implementation (burying ATen copies and allocations inside the collective post).

@wujingyue
Collaborator Author

!build

@wujingyue wujingyue changed the base branch from main to wjy/relative June 17, 2024 23:20
Base automatically changed from wjy/relative to main June 18, 2024 18:18
Co-authored-by: samnordmann <snordmann@nvidia.com>
@wujingyue wujingyue changed the title Remove mesh from Communication. Add another constructor of Communication that takes I/O TensorViews without mesh. Jun 19, 2024
@wujingyue
Collaborator Author

!build

@wujingyue
Collaborator Author

I think adding the I/O is nice to express data dependency in the DAG, however, those arguments should be optional

I doubt that. Nonetheless, I updated the PR to keep both constructors, which should address your concern now. PTAL!

@wujingyue wujingyue changed the title Add another constructor of Communication that takes I/O TensorViews without mesh. Add another constructor of Communication that takes I/O TensorViews instead of the mesh. Jun 19, 2024
@wujingyue wujingyue requested a review from samnordmann June 19, 2024 05:40
@samnordmann
Collaborator

I think adding the I/O is nice to express data dependency in the DAG, however, those arguments should be optional

I doubt that. Nonetheless, I updated the PR to keep both constructors, which should address your concern now. PTAL!

To sum up our discussion, we agreed that a communication's execution should rely only on the Team, not on I/O TensorViews or meshes. So we should remove that dependency (by not supporting "unorthodox MPI collectives" where the root is not in the mesh, see here), and there should be a Communication constructor taking only a team, with no mesh or TVs.

In addition to that, you would like to have another Communication constructor taking not only a Team but also I/O TVs. Those I/O TVs are used only to express data dependency in the complete fusion. (This constructor should/could eventually be a thin wrapper around the first constructor mentioned above, taking a Team only.) Even though there is no usage of the I/O TVs right now, we add this for the future -- I can't really see how it will be useful, but I trust you. My only remaining concern is this comment: if the I/O of the communication are TVs belonging to the HostIrContainer and not the full fusion, this will not embed the desired data dependency. Wdyt?

Collaborator
@samnordmann samnordmann left a comment

I think there is a slight misunderstanding -- sorry if I have been unclear, but let me try to clarify. I have been advocating (e.g. here, here, there, and orally) for having a constructor taking only the Team, no DeviceMesh or TVs. However, this requires removing the dependency on the device mesh, which needs additional work, so we decided to do this in a future PR.

Besides the aforementioned constructor, you would like another constructor taking additional I/O TVs, used to express data dependency in the complete fusion. This makes sense.

But I don't understand why in this PR we have two constructors

  1. one constructor taking I/O TVs, but no DeviceMesh, and
  2. another taking a DeviceMesh but no I/O TVs.

I don't see the benefit of having 2), and I find 1) always better and more general. So I would be in favor of removing 2) completely and keeping only 1), as you originally did in the first version of the PR.

I really don't want to slow things down, so I'm approving; please do as you think is best. I'm happy anyway that we had those discussions, which I feel helped us understand the north star and be on the same page.


addInput(in);
addOutput(out);
addDataAttribute(type);
Collaborator

Why not call the other constructor here instead of duplicating?

Collaborator Author

Good point. I tried constructor delegation but couldn't figure out a good way to call validate and also check the mesh -- I want both constructors to call the same validate but check meshes differently. I'll leave it as is for now, because we'll eventually (or soon?) remove the mesh.


const DeviceMesh& mesh() const {
return attribute<DeviceMesh>(1);
TensorView* out() const {
Collaborator

out() and in() will throw on instances created with the constructor without I/O TVs, right?

@wujingyue
Collaborator Author

I don't see the benefit of having 2), and I find 1) always better and more general. So I would be in favor of removing 2) completely and keeping only 1), as you originally did in the first version of the PR.

I bifurcated Communication's constructor because

  1. According to Add another constructor of Communication that takes I/O TensorViews instead of the mesh.  #2411 (review), you want the I/O TVs to be optional, so the first version wouldn't work for you.
  2. At this moment, the mesh is still needed for execution, so the constructor needs to take it when the I/O TVs are null.
  3. While I could make one constructor with the I/O TVs and the mesh all optional, C++ isn't as good as Python at handling many optional arguments, e.g., all arguments before a non-default argument must be explicitly specified.

That being said, I do agree with you on the end state of removing mesh from Communication's constructor, so the above is about order of actions.
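Point 3 above can be seen with a toy example (the types and the `chosen_overload` member are invented for illustration; this is not the real Communication class): two overloads let each call site pass exactly what it has, whereas a single constructor with trailing defaults would force a caller that only has a mesh to spell out null TV arguments first.

```cpp
#include <cassert>

struct TensorView {};
struct DeviceMesh {};

struct Communication {
  int chosen_overload;  // records which overload ran, for illustration only

  // Overload 1: I/O TVs, no mesh (the meshes come from the TVs).
  Communication(TensorView* in, TensorView* out) : chosen_overload(1) {}

  // Overload 2: a mesh, no I/O TVs (still used by some tests).
  explicit Communication(const DeviceMesh& mesh) : chosen_overload(2) {}

  // A hypothetical all-in-one alternative,
  //   Communication(TensorView* in = nullptr, TensorView* out = nullptr,
  //                 const DeviceMesh* mesh = nullptr);
  // would require mesh-only callers to write
  //   Communication c(nullptr, nullptr, &mesh);
  // because defaulted arguments can only be omitted from the right.
};
```

Overload resolution then picks the right constructor from the arguments alone, with no placeholder nulls at call sites.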

@wujingyue
Collaborator Author

!build

@wujingyue wujingyue merged commit a47489d into main Jun 25, 2024
@wujingyue wujingyue deleted the wjy/mesh branch June 25, 2024 00:52
protonu pushed a commit that referenced this pull request Jun 25, 2024
… instead of the mesh. (#2411)

Communications created by the new constructor must be put in a non-SSA
IrContainer. This is because multiple Communications can have the same
input and output TensorViews, e.g., Gather with a receiver mesh of size
2.

---------

Co-authored-by: samnordmann <snordmann@nvidia.com>
