
Add another constructor of Communication that takes I/O TensorViews instead of the mesh. #2411

Merged
wujingyue merged 14 commits into main from wjy/mesh
Jun 25, 2024
Conversation

@wujingyue
Collaborator

@wujingyue wujingyue commented Jun 13, 2024

The other constructor is still used in some tests.

Communications created by the new constructor must be put in a non-SSA IrContainer. This is because multiple Communications can have the same input and output TensorViews, e.g., Gather with a receiver mesh of size 2.

@wujingyue wujingyue force-pushed the wjy/mesh branch 2 times, most recently from 37db3ee to 87de3fa Compare June 15, 2024 23:45
@wujingyue wujingyue marked this pull request as draft June 15, 2024 23:47
@wujingyue wujingyue force-pushed the wjy/mesh branch 6 times, most recently from d3e326c to 8007ade Compare June 17, 2024 04:57
@wujingyue
Collaborator Author

!build

1 similar comment
@wujingyue
Collaborator Author

!build

Comment on lines +147 to +152
NVF_ERROR(
in->getDeviceMesh().size() > 0,
"The input mesh size must be greater than 0.");
NVF_ERROR(
out->getDeviceMesh().size() > 0,
"The output mesh size must be greater than 0.");
Collaborator Author

This sounds like status quo, but correct me if it's the wrong contract to enforce. We could allow one of the meshes to be empty and inferred from the other, but it feels less explicit and adds complexity to properly infer meshes.

Collaborator

This sounds like status quo, but correct me if it's the wrong contract to enforce.

I think it's ok

We could allow one of the meshes to be empty and inferred from the other, but it feels less explicit and adds complexity to properly infer meshes.

I wouldn't do that. IMO, inferring meshes could be implemented in a scheduling propagation pass, upstream of instantiating communications, but not buried inside the communication's constructor.

Collaborator

We could allow one of the meshes to be empty and inferred from the other, but it feels less explicit and adds complexity to properly infer meshes.

There might be a few edge cases where this could be nice... For example, if a scheduler adds new exprs, it should re-run the sharding propagation pass, but it's easy to forget and we haven't updated all these passes everywhere. I would leave this as an error for now, but something to consider.

Collaborator Author

if a scheduler adds new exprs, it should re-run the sharding propagation pass

Or it should remember to set the device mesh of the new TensorViews.

sender_mesh.at(i),
DeviceMesh({receiver_mesh.at(i)}),
comms);
const DeviceIdxType sender = sender_mesh.at(i);
Collaborator Author
@wujingyue wujingyue Jun 17, 2024

I think this change is beneficial but I'm open to ideas.

Consider a collective send/recv from mesh{0,1,2,3} to mesh{4,5,6,7} (for pipeline parallelism) or from mesh{0,1,2,3} to mesh{1,2,3,0} (for p2p ring), that's decomposed into four SendRecvs.

Currently, these four SendRecvs differ in mesh and therefore team. With this PR, they will differ in team but have the same input and output meshes, allowing them to reuse the same input and output TensorViews.

(These examples also make me think about letting SendRecv take a vector of send/recv pairs. This way, one SendRecv does it all and therefore shortens the host IR. But that's out of the scope of this PR.)

Collaborator
@samnordmann samnordmann Jun 17, 2024

allowing them to reuse the same input and output TensorViews.

That's IMO a bit of an artificial benefit, since the Team is normally enough to fully describe the Communication. In principle, executing a communication shouldn't rely on the I/O TVs.

However, the I/O TVs are indeed used currently, but only to detect whether the root belongs to the send & recv meshes in bcast, reduce, gather, and scatter. I think this dependency should be removed. I can see two alternative ways to avoid relying on the send/recv meshes:

  1. Having a bool root_in_meshes in the communication's params. Probably not ideal but still cheaper than relying on the send/recv meshes.
  2. Not supporting the case where the root of a collective doesn't belong to the send & recv meshes.

I personally would advocate for solution 2., for the following reasons:

  • Cases where the root doesn't belong to the send & recv mesh are unusual corner cases. I don't know of any application that features them.
  • Cases where the root doesn't belong to send&recv mesh do not respect MPI requirements and are thus not supported by the backends. Therefore, supporting these cases adds a lot of complexity in the code.
  • Not supporting this case implies no loss in generality, since we can always add a Send/Recv during lowering to fall back on a case where the root belongs to the send & recv mesh. E.g.:
    • To pass from tv0 sharded on {0,1} to tv1=set(tv0) on {2}, one could do Gather(team={0,1}, root=0) + SendRecv(dst=2, src=1)
    • To pass from tv0 sharded on {0} to tv1=set(tv0) sharded on {1, 2}, one could do SendRecv(dst=1, src=0) + Scatter(team={1,2}, root=2)
    • To reduce from tv0 sharded on {0,1} to tv1=reduce(tv0) on {2}, one could do Reduce(team={0,1}, root=0) + SendRecv(dst=2, src=0)
    • etc.
  • What we actually do to support this case is IMO unnatural/ugly, and even problematic for gather/scatter because 1) of the allocation and 2) bandwidth is used for communicating garbage data.

Other incidental remark: using an ATen local copy inside the collective post is not ideal... When possible, we should use bcast_oop, reduce_oop, etc. Or insert the copy op in the compute DAG.
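The fallback lowering in the bullets above can be written out as a toy two-step plan. This is not nvFuser code: `lowerGatherTo` and the string form of the steps are invented for illustration, and this sketch sends from the gather's root in the follow-up SendRecv, which is one of several valid choices.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Lower a gather whose destination device `dst` is outside the sender team
// into an in-team Gather followed by a SendRecv from the team's root to dst.
std::vector<std::string> lowerGatherTo(
    const std::vector<int>& team, int root, int dst) {
  std::ostringstream gather;
  gather << "Gather(team={";
  for (size_t i = 0; i < team.size(); ++i) {
    gather << (i ? "," : "") << team[i];
  }
  gather << "}, root=" << root << ")";

  std::ostringstream sendrecv;
  sendrecv << "SendRecv(src=" << root << ", dst=" << dst << ")";
  return {gather.str(), sendrecv.str()};
}
```

For the first bullet's example, `lowerGatherTo({0, 1}, 0, 2)` produces the plan Gather(team={0,1}, root=0) followed by SendRecv(src=0, dst=2), where every collective's root belongs to its team.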

Collaborator

(These examples also make me think about letting SendRecv take a vector of send/recv pairs. This way, one SendRev does all and therefore shortens the host IR. But that's out of the scope of this PR.)

Also out of the scope of the PR :)
Since these programs are SPMD-like, why not just use an expression to calculate each device's send/recv pair instead of expanding everything out? For the pipeline parallelism example, 0 sends to 4, 1 to 5, etc. Once a device has been determined to be in a pipeline stage, it's guaranteed to be either a sender or a receiver for one of many parallel communications.
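This suggestion amounts to replacing the enumerated SendRecvs with per-rank arithmetic. A sketch under assumed names (`myPeer`, `Peer`, and `stage_size` are made up here, not nvFuser API) for the mesh{0,1,2,3} to mesh{4,5,6,7} pipeline example:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct Peer {
  bool is_sender;
  int64_t peer;
};

// Each device computes its single peer from its own rank instead of the host
// IR enumerating one SendRecv per pair. Ranks [0, stage_size) send to the
// next stage; ranks [stage_size, 2 * stage_size) receive from the previous.
std::optional<Peer> myPeer(int64_t my_rank, int64_t stage_size) {
  if (my_rank < stage_size) {
    return Peer{/*is_sender=*/true, my_rank + stage_size};
  }
  if (my_rank < 2 * stage_size) {
    return Peer{/*is_sender=*/false, my_rank - stage_size};
  }
  return std::nullopt;  // not a participant in this communication
}
```

With stage_size = 4, rank 0 sends to 4 and rank 5 receives from 1, matching the expanded enumeration while keeping the host IR a single expression.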

Collaborator Author

@cowanmeg I like that and IIUC you are proposing something similar to this comment.

@wujingyue wujingyue marked this pull request as ready for review June 17, 2024 06:14
Collaborator
@samnordmann samnordmann left a comment

Thanks for this patch!

While having the ability to provide I/O Tensors to Communication may sound natural, I'm also concerned because, in principle, the Team should be enough to fully describe a Communication, and therefore we shouldn't rely on the I/O meshes. Currently, we do rely on those meshes, but only in an incidental way, and we should remove this dependency IMO -- I develop this idea in more detail in one of the comments.

I think adding the I/O is nice to express data dependency in the DAG, however, those arguments should be optional -- or, alternatively, Communications should be wrapped inside an op PostOnStream that manages the I/O, i.e., we should revert #2328.
This would save a lot of code complexity in the current code base, and also in the present patch (c.f. the tests that need to artificially define I/O TVs).

I am making those points because I'm afraid that the Communication's apparent dependency on the I/O TVs is misleading when it comes to making design decisions. I think that the IR Communication should be as close as possible to the backend API so it matches the generated code better -- keeping I/O TVs and supporting corner cases (aka when the root is not in the send/recv meshes) may add unnecessary complexity, both in usage (artificially creating I/O TVs) and in execution/implementation (burying ATen copies and allocations inside the collective post).

@wujingyue
Collaborator Author

!build

@wujingyue wujingyue changed the base branch from main to wjy/relative June 17, 2024 23:20
Base automatically changed from wjy/relative to main June 18, 2024 18:18
Co-authored-by: samnordmann <snordmann@nvidia.com>
@wujingyue wujingyue changed the title Remove mesh from Communication. Add another constructor of Communication that takes I/O TensorViews without mesh. Jun 19, 2024
@wujingyue
Collaborator Author

!build

@wujingyue
Collaborator Author

I think adding the I/O is nice to express data dependency in the DAG, however, those arguments should be optional

I doubt that. Nonetheless, I updated the PR to keep both constructors, which should address your concern now. PTAL!

@wujingyue wujingyue changed the title Add another constructor of Communication that takes I/O TensorViews without mesh. Add another constructor of Communication that takes I/O TensorViews instead of the mesh. Jun 19, 2024
@wujingyue wujingyue requested a review from samnordmann June 19, 2024 05:40
@samnordmann
Collaborator

I think adding the I/O is nice to express data dependency in the DAG, however, those arguments should be optional

I doubt that. Nonetheless, I updated the PR to keep both constructors, which should address your concern now. PTAL!

To sum up our discussion, we agreed that a communication's execution should rely only on the Team, not on I/O TensorViews or meshes. So we should remove that dependency (by not supporting "unorthodox MPI collectives" where the root is not in the mesh, see here), and there should be a Communication constructor taking only a team, with no mesh or TVs.

In addition to that, you would like to have another Communication constructor taking not only a Team but also I/O TVs. Those I/O TVs are used only to express data dependency in the complete fusion. (This constructor should/could eventually be a thin wrapper around the first constructor mentioned above, taking a Team only.) Even though there is no usage of the I/O TVs right now, we add this for the future -- I can't really see how it will be useful, but I trust you. My only remaining concern is this comment: if the I/O of the communication are TVs belonging to the HostIrContainer and not the full fusion, this will not embed the desired data dependency. Wdyt?

Collaborator
@samnordmann samnordmann left a comment

I think there is a slight misunderstanding -- sorry if I have been unclear, but let me try to clarify. I have been advocating (e.g. here, here, there, and orally) for having a constructor taking only the Team, no DeviceMesh or TVs. However, this requires removing the dependency on the device mesh, which needs additional work, so we decided to do this in a future PR.

Besides the aforementioned constructor, you would like another constructor taking additional I/O TVs, used to express data dependency in the complete fusion. This makes sense.

But I don't understand why in this PR we have two constructors

  1. one constructor taking I/O TVs, but no DeviceMesh, and
  2. another taking a DeviceMesh but no I/O TVs.

I don't see the benefit of having 2), and I find 1) always better and more general. So I would be in favor of removing 2) completely and keeping only 1), as you originally did in the first version of the PR.

I really don't want to slow things down, so I'm approving; please do as you think is best. I'm happy anyway that we had those discussions, which I feel helped us understand the north star and be on the same page.


addInput(in);
addOutput(out);
addDataAttribute(type);
Collaborator

Why not call the other constructor here instead of duplicating?

Collaborator Author

Good point. I tried constructor delegation but couldn't figure out a good way to call validate and also check the mesh -- I want both constructors to call the same validate but check meshes differently. I'll leave it as is for now, because we'll eventually (or soon?) remove the mesh.


const DeviceMesh& mesh() const {
return attribute<DeviceMesh>(1);
TensorView* out() const {
Collaborator

out() and in() will throw on instances created with the constructor without I/O TVs, right?

@wujingyue
Collaborator Author

I don't see the benefit of having 2), and I find 1) always better and more general. So I would be in favor of removing 2) completely and keeping only 1), as you originally did in the first version of the PR.

I bifurcated Communication's constructor because

  1. According to Add another constructor of Communication that takes I/O TensorViews instead of the mesh.  #2411 (review), you want the I/O TVs to be optional, so the first version wouldn't work for you.
  2. At this moment, the mesh is still needed for execution, so the constructor needs to take it when the I/O TVs are null.
  3. While I could make one constructor with the I/O TVs and the mesh all optional, C++ isn't as good as Python at handling many optional arguments, e.g., all arguments before a non-default argument must be explicitly specified.

That being said, I do agree with you on the end state of removing mesh from Communication's constructor, so the above is about order of actions.
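Point 3 above can be seen with a toy example (the types and the `chosen_overload` member are invented for illustration; this is not the real Communication class): two overloads let each call site pass exactly what it has, whereas a single constructor with trailing defaults would force a caller that only has a mesh to spell out null TV arguments first.

```cpp
#include <cassert>

struct TensorView {};
struct DeviceMesh {};

struct Communication {
  int chosen_overload;  // records which overload ran, for illustration only

  // Overload 1: I/O TVs, no mesh (the meshes come from the TVs).
  Communication(TensorView* in, TensorView* out) : chosen_overload(1) {}

  // Overload 2: a mesh, no I/O TVs (still used by some tests).
  explicit Communication(const DeviceMesh& mesh) : chosen_overload(2) {}

  // A hypothetical all-in-one alternative,
  //   Communication(TensorView* in = nullptr, TensorView* out = nullptr,
  //                 const DeviceMesh* mesh = nullptr);
  // would require mesh-only callers to write
  //   Communication c(nullptr, nullptr, &mesh);
  // because defaulted arguments can only be omitted from the right.
};
```

Overload resolution then picks the right constructor from the arguments alone, with no placeholder nulls at call sites.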

@wujingyue
Collaborator Author

!build

@wujingyue wujingyue merged commit a47489d into main Jun 25, 2024
@wujingyue wujingyue deleted the wjy/mesh branch June 25, 2024 00:52
protonu pushed a commit that referenced this pull request Jun 25, 2024
… instead of the mesh. (#2411)

Communications created by the new constructor must be put in a non-SSA
IrContainer. This is because multiple Communications can have the same
input and output TensorViews, e.g., Gather with a receiver mesh of size
2.

---------

Co-authored-by: samnordmann <snordmann@nvidia.com>
