Separate the lowering and the execution of a communication.#2172

Merged
wujingyue merged 14 commits into main from wjy/comm on May 9, 2024

Conversation

@wujingyue
Collaborator

@wujingyue wujingyue commented May 1, 2024

This PR prepares for integrating Communication/Communicator into FusionExecutor. The lowering of a communication will be done by FusionExecutor::compileFusion, which runs at most once per shape and where sometimes only meta tensors are available. The execution will be done by ::runFusion, which is on the critical path.
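The split described above can be sketched roughly as follows. `CommPlan`, `lower`, and `post` are illustrative stand-ins, not nvFuser's actual API; the point is only that all analysis happens once at lowering time, while execution merely consumes the precomputed plan.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a lowered communication plan: everything the execution
// path needs, precomputed so the runFusion side stays cheap.
struct CommPlan {
  std::vector<int64_t> ranks;  // which devices participate
  int64_t nbytes = 0;          // buffer size, derivable from meta tensors
};

class Communication {
 public:
  // Lowering: called from compileFusion, at most once per shape. Only
  // shape/rank information is needed, so meta tensors suffice.
  void lower(const std::vector<int64_t>& ranks, int64_t nbytes) {
    plan_ = {ranks, nbytes};
    lowered_ = true;
  }

  // Execution: called from runFusion on the critical path. Consumes the
  // precomputed plan and does no analysis of its own. Returns the total
  // bytes moved, just to have something observable in this sketch.
  int64_t post() const {
    assert(lowered_ && "post() requires lower() to have run");
    return plan_.nbytes * static_cast<int64_t>(plan_.ranks.size());
  }

 private:
  CommPlan plan_;
  bool lowered_ = false;
};
```

Note that `post()` can be called repeatedly without re-lowering, which is the property the PR is after.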

@wujingyue
Collaborator Author

!build --dist

@wujingyue wujingyue force-pushed the wjy/comm branch 2 times, most recently from bb876a8 to 0fec8fc on May 1, 2024 18:11
@wujingyue wujingyue requested review from cowanmeg and samnordmann May 1, 2024 18:12
@wujingyue
Collaborator Author

!build --dist

4 similar comments

doLocalCopy(params_.dst_bufs.at(0), params_.src_bufs.at(0));
if (params_.is_root_in_mesh) {
  // Do a local copy and the subsequent broadcast will be in place.
  doLocalCopy(output_tensor, input_tensor);
Collaborator

Not for this PR, but in the future all these local copies should be inserted into the fusion as IR during lowering, so they can potentially be optimized.

std::vector<at::Tensor> src_bufs;
std::vector<at::Tensor> dst_bufs;
Team team; // should not have duplicate
bool is_root_in_mesh = true;
Collaborator

This shouldn't be a settable field in params; it is entirely deduced from the root and the team members. There shouldn't be such a member in the struct.
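The deduction the reviewer has in mind could look like this one-liner; plain `int64_t` ranks and a `std::vector` stand in for nvFuser's device-index and team/mesh types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Deduce whether the root rank is a member of the mesh instead of
// storing the answer as a settable field that could go stale.
bool isRootInMesh(int64_t root, const std::vector<int64_t>& mesh) {
  return std::find(mesh.begin(), mesh.end(), root) != mesh.end();
}
```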

Collaborator Author

> this field is entirely deduced from root and team members

team always contains root IIUC. Do you mean to let params store root and mesh instead?

Collaborator

Sorry my comment was unclear/confused.

> team always contains root IIUC

Correct, my mistake.

> Do you mean to let params store root and mesh instead?

Yes, that would be great! Or even better: params could store a pointer to the resharding Expr and/or to the I/O TVs.
Wdyt?

Collaborator Author

@wujingyue commented May 6, 2024

> Or even better: params could store a pointer to the resharding Expr and/or to the I/O TVs.

I'll think about it. One benefit of separating compilation from execution (and thus of this PR) is to avoid doing too much at runFusion time, e.g., having to analyze TVs and extract information needed for execution.

Collaborator

This is a good point. While it doesn't look nice to store these pre-computed values like scattered_axis and is_root_in_mesh, it will cut down on compute during execution. Personally, I'm ok with having them stored as parameters even if it's not the cleanest approach.

Collaborator

> I'll think about it. One benefit of separating compilation from execution (and thus PR) is to avoid doing too much at runFusion time, e.g., having to analyze TVs and extract information needed for execution.

Agreed, and let's stick with that for this PR!

On the other hand (for the future), I also see motivation for storing the TV and Expr pointers:

  • they contain the full info, which is 1) useful for printing, e.g., I/O tensors, and 2) possibly needed in the future when the lowering becomes more fleshed out
  • they are a cleaner/more concise/structured way to encode the symbolic Communication

To get the best of both worlds, we could store (non-mutable) pointers to the I/O TensorViews and the Expr, and precompute as private class data what's needed at instantiation.
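A sketch of that "best of both worlds" idea: store non-mutable pointers to the symbolic IR and cache the derived scalars at construction time. `TensorView` and `Expr` here are minimal hypothetical stand-ins, not nvFuser's real IR classes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal stand-ins for nvFuser IR nodes.
struct TensorView {
  std::vector<int64_t> mesh;  // device mesh the tensor is sharded over
};
struct Expr {
  const TensorView* in;
  const TensorView* out;
};

class Communication {
 public:
  // Keep the full symbolic info (useful for printing and for richer
  // lowering later) and precompute what execution needs, once.
  Communication(const Expr* expr, int64_t root) : expr_(expr), root_(root) {
    for (int64_t d : expr_->in->mesh) {
      if (d == root_) is_root_in_mesh_ = true;
    }
  }

  const Expr* expr() const { return expr_; }              // full info
  bool isRootInMesh() const { return is_root_in_mesh_; }  // precomputed

 private:
  const Expr* expr_;
  int64_t root_;
  bool is_root_in_mesh_ = false;
};
```

The precomputation keeps runFusion-time work constant, while the stored `Expr*` keeps the symbolic description available for debugging or re-lowering.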

Collaborator Author

With #2185, Communication will be an Expr, which automatically contains the input and output TensorViews. Would that address your concern?

Collaborator Author

I gave up on the team=>mesh change for this PR. It would cause even more divergence from #2185, so we'll do that in a separate PR if needed.

Collaborator

> With #2185, Communication will be an Expr which automatically contains input and output TensorView. That should address your concern?

Yes! That would be a good way of doing it. If we follow this path, I think we can completely get rid of the struct CommunicationParams (we then only need to pass the ReductionOpType, or a pointer to the set/reduction Expr*, to the constructor).

@wujingyue wujingyue requested a review from samnordmann May 6, 2024 06:53
Collaborator

@samnordmann left a comment

Looks good, thx! I only left minor comments


Team team; // should not have duplicates and should contain both the root and
           // the mesh
c10d::ReduceOp::RedOpType redOp = c10d::ReduceOp::RedOpType::UNUSED;
int64_t scattered_axis = -1;
Collaborator

> I had the same question, but I chose not to pull too many changes into one PR :)

I understand, np.

If we decide to store a pointer to the TensorViews in params, then the info stored in scattered_axis can be inferred later when needed. Imo it's a cleaner solution. However, we can leave it to a future PR if you prefer.
Anyway, this will probably have to be fixed when we go to 2D.

Collaborator

@cowanmeg left a comment

Overall, I think the PR moves in the right direction, separating compile-time and run-time logic. We should discuss offline allocation optimizations, especially whether allocation can be pre-computed and how it relates to NCCL user buffers, etc., but that is well beyond the scope of the PR!

I think besides cleaning up the fixme and adding the buffer size checks to post, I am good with these changes!


@cowanmeg
Collaborator

cowanmeg commented May 6, 2024

I'm also unsure which PR is easier to merge first, this one or #2185.

@wujingyue
Collaborator Author

!build --dist

@wujingyue wujingyue merged commit 66b2a0a into main May 9, 2024
@wujingyue wujingyue deleted the wjy/comm branch May 9, 2024 05:18
@samnordmann samnordmann mentioned this pull request May 14, 2024
wujingyue added a commit that referenced this pull request Jun 3, 2024
As a follow-up to #2172. `mesh` will eventually go away after the input
and output TensorViews of a communication are embedded in the base Expr.
Anyhow, `is_root_in_mesh` can be computed from other parameters, so it is
removed.
zasdfgbnm pushed a commit that referenced this pull request Jun 5, 2024