Make Communications IRs inheriting from Expr. #2185

Merged
wujingyue merged 6 commits into NVIDIA:main from samnordmann:make_communications_IRs
May 11, 2024

Collaborator

@samnordmann samnordmann commented May 2, 2024

This PR makes the "Communication" class a proper IR inheriting from Expr. This patch is needed for implementing Host Irs. It is also one step towards making the Communications (and more generally the multidevice module) fully symbolic.

Along the way, we do a couple of refactorings and remove the Communicator::sendRecv method.

Remarks:

  1. This IR will be used for now in the context of Host Irs. Later, it could also serve as a kernel IR (backed by device-side communication APIs at runtime).
  2. Before this patch, we had a base class Communication and one derived class per collective type (Allgather, Allreduce, Broadcast, etc.). Now, there is only the class Communication, and the collective type is encoded through an enum class CollectiveType added to the parameter member CommParams.
  3. The Communication::post method was replaced by a standalone function postCollective. The motivation is to separate the runtime execution from the symbolic representation of the collective.
  4. Note that step 3) is only implemented halfway here, since a Collective is instantiated with concrete device indices and concrete at::Tensor buffers, whereas it should be instantiated with symbolic representations and bound to actual device indices and ATen buffers at runtime. This will be added in a future PR.
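
For readers skimming the remarks, here is a minimal sketch of the design in remarks 2 and 3: one Communication class whose collective type is an enum stored in its parameters, plus a free function that handles execution. Only the names Communication, CommParams, CollectiveType and postCollective come from this PR. The fields, signatures and the string return value are assumptions made for illustration and do not match the actual nvFuser declarations.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not the actual nvFuser declarations.
enum class CollectiveType { Allgather, Allreduce, Broadcast };

struct CommParams {
  CollectiveType type = CollectiveType::Broadcast;
  int root = 0;           // assumed field: root device index
  std::vector<int> team;  // assumed field: participating device indices
};

// Description of the collective only; no runtime behavior lives here.
class Communication {
 public:
  explicit Communication(CommParams params) : params_(std::move(params)) {}
  const CommParams& params() const { return params_; }

 private:
  CommParams params_;
};

// Standalone entry point that replaces a per-class post() method.
// In this sketch it only reports which backend call it would issue.
std::string postCollective(const Communication& comm) {
  switch (comm.params().type) {
    case CollectiveType::Allgather:
      return "allgather";
    case CollectiveType::Allreduce:
      return "allreduce";
    case CollectiveType::Broadcast:
      return "broadcast";
  }
  throw std::logic_error("unhandled CollectiveType");
}
```

With this shape, adding a collective means adding an enum value and a switch case rather than a new subclass, and the description object can be built and inspected without touching any runtime.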

CI: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/pipelines/14923305

@samnordmann force-pushed the make_communications_IRs branch 2 times, most recently from 3d79609 to 3e6fddf (May 2, 2024 14:21)
@samnordmann force-pushed the make_communications_IRs branch from 3e6fddf to 4560e21 (May 2, 2024 15:29)
Collaborator

@wujingyue wujingyue left a comment

LGTM in general! I made a few comments to address before I merge this.

} else {
assertBufferCount(params_.dst_bufs, 0);
// TODO add checking symbolic representation of src and dst buffers
bool Communication::sameAs(const Statement* other) const {
Collaborator

I failed to see why we couldn't reuse the "default" implementation:

bool Expr::sameAs(const Statement* other) const {

Can you clarify?

Collaborator Author

Two Communications could be "Expr::sameAs" without being the same, for example if one is Allgather and the other Allreduce. Right?
My goal here is to mimic what's done for other IRs, such as

bool IterDomain::sameAs(const Statement* other) const {

But I don't actually use sameAs -- I simply thought implementing it was necessary for matching the IR specs.

So I'm open to any suggestion about this implementation, including removing it.

Collaborator

Expr::sameAs calls Expr::sameOp, which checks equality of all attributes. I believe the communication type is one of the attributes.

Collaborator Author

Correct me if I'm wrong, but in the current implementation it is not an attribute. The reason is that, IIUC, all attributes need to be Statement*s, and CommunicationParams is not.

Collaborator

There is addDataAttribute as used in

Fuser/csrc/ir/nodes.cpp

Lines 2423 to 2424 in 730abb5

addDataAttribute(op_type);
addDataAttribute(cache_op);

But the PR as is is already hard to merge, so I'll clean that up in a separate PR.
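
To make the point of this thread concrete, here is a small self-contained sketch of why a value that is not registered as an attribute is invisible to an attribute-based sameAs, and how registering it changes the result. MiniExpr, CommWithoutAttribute and CommWithAttribute are hypothetical toy classes. The addDataAttribute method here is a stand-in named after the call quoted above and does not reproduce nvFuser's actual signature or storage.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of an expression node; not nvFuser's Expr.
class MiniExpr {
 public:
  // Record a plain value as an attribute so that equality checks see it.
  void addDataAttribute(const std::string& value) {
    attributes_.push_back(value);
  }

  // Default equality: two nodes match when all their attributes match.
  bool sameAs(const MiniExpr& other) const {
    return attributes_ == other.attributes_;
  }

 private:
  std::vector<std::string> attributes_;
};

// Collective type kept in a plain member: the default sameAs cannot see it.
struct CommWithoutAttribute : MiniExpr {
  explicit CommWithoutAttribute(std::string type) : type_(std::move(type)) {}
  std::string type_;
};

// Collective type registered as a data attribute: the default sameAs suffices.
struct CommWithAttribute : MiniExpr {
  explicit CommWithAttribute(const std::string& type) {
    addDataAttribute(type);
  }
};
```

In this toy model, an "allgather" and an "allreduce" built with CommWithoutAttribute compare equal under the default sameAs, which is the concern raised earlier in the thread. Built with CommWithAttribute they compare unequal, so no override is needed.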

@samnordmann samnordmann requested a review from wujingyue May 6, 2024 15:43
Collaborator

@cowanmeg cowanmeg left a comment

LGTM assuming that the post and validate logic is just moved around! If there are changes in those, let me know so I can look more closely.

Also it is great to see Send/Recv finally removed from communicator!

@samnordmann
Collaborator Author

!build --dist

Collaborator

@wujingyue wujingyue left a comment

LGTM! I'll try to merge this. I may need to split this up into multiple PRs, but will let you know.

@wujingyue
Collaborator

I'm in the process of resolving conflicts. I should be able to merge this tomorrow or Monday.

@samnordmann
Collaborator Author

> I'm in the process of resolving conflicts. I should be able to merge this tomorrow or Monday.

Thank you! I'm waiting for this one to be merged before going on with host IR dev. Let me know if I can help!

@wujingyue force-pushed the make_communications_IRs branch from 3ebd7d4 to 0f7fa53 (May 10, 2024 23:02)
@wujingyue
Collaborator

!build --dist

@wujingyue changed the title from "make Communications IRs inheriting from Expr" to "Make Communications IRs inheriting from Expr." May 11, 2024
@wujingyue wujingyue merged commit b020415 into NVIDIA:main May 11, 2024