Patch allocation logic to produce outputs with correct logical size #5170
jjsjann123 wants to merge 15 commits into main
Conversation
!test
Review updated until commit db304bd
if (!only_valid_device_split) {
  break;
}
@Priya2698 would you mind double-checking the changes here, as well as the two files under csrc/multidevice/utils.h[.cpp]?
I'm trying to relax it to allow using the allocation domain to represent padding.
for (int i = static_cast<int>(tensor_new_shape.size()) - 1; i >= 0; --i) {
  prod *= tensor_new_shape[i];
  tensor_new_strides[i] = prod;
  prod *= tensor_new_shape[i];
@protonu this seems to be a bug in the old code. We are not computing the correct stride.
The existing swizzle test doesn't check correctness, which is why it wasn't caught earlier. (I hit an error in the added LogicalAndAllocationSizes test, because the as_strided call with the allocation sizes/strides was out of bounds.)
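For reference, a minimal sketch (not the nvFuser code; the function name is illustrative) of the row-major stride rule the fix restores. The key ordering: record the stride before folding the current extent into the running product. Multiplying first would make the innermost stride equal to its own extent instead of 1.

```cpp
#include <cstdint>
#include <vector>

// Row-major (C-contiguous) strides: stride[i] is the product of all
// extents to the RIGHT of dimension i.
std::vector<int64_t> contiguousStrides(const std::vector<int64_t>& shape) {
  std::vector<int64_t> strides(shape.size());
  int64_t prod = 1;
  for (int i = static_cast<int>(shape.size()) - 1; i >= 0; --i) {
    strides[i] = prod;  // assign BEFORE multiplying in this dim's extent
    prod *= shape[i];
  }
  return strides;
}
```

For shape [2, 3, 4] this yields strides [12, 4, 1]; the buggy ordering would produce [24, 12, 4], indexing out of bounds.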
Thanks for catching this!
!test
std::set<IterDomain*> logical_set(logical.begin(), logical.end());
if (frontier_set != logical_set) {
  return tensor;
std::vector<int64_t> logical_sizes(logical.size(), 0);
Looks like the comment above needs to be updated.
Can you please remind me when this condition hits and what we were doing previously? Was it just returning an incorrect tensor?
The comment above still holds.
This is the case we have for the PreprocessGroupedMatmulInputSf output, where the logical domain and allocation domain are not connected and not mapped.
logical domain [ i0, i1 ]
allocation domain [ i0 + (i_group - 1) * 127, (i1 + 4 - 1) / 4 * 4 ]
Note that the allocation domain extents were operated on directly, instead of using a resize op.
Lines 333 to 364 in f722efc
The implication here is that frontier_set and the projected logical_set are not equivalent.
The behavior before the change was to return the allocated buffer directly; because we allocate using the allocation domain, we were returning a tensor with the padded size.
The updated behavior is to return a slice into the allocated buffer, so that the output aten tensor has a shape consistent with its logical domain.
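A hypothetical sketch of that updated behavior (names are illustrative, not the allocation.cpp API): the buffer is allocated with the allocation-domain sizes, and the returned output is a contiguous slice holding exactly the logical number of elements, viewed with the logical sizes.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Total element count of a shape.
int64_t numElements(const std::vector<int64_t>& sizes) {
  return std::accumulate(
      sizes.begin(), sizes.end(), int64_t{1}, std::multiplies<int64_t>());
}

// The contiguous slice is valid as long as the padded allocation holds
// at least as many elements as the logical tensor.
bool sliceFits(
    const std::vector<int64_t>& alloc_sizes,
    const std::vector<int64_t>& logical_sizes) {
  return numElements(alloc_sizes) >= numElements(logical_sizes);
}
```

This is safe here precisely because, as noted above, the consumer will not index this tensor through its strides.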
The comment above still holds.
I meant this part: give up on producing right shape/stride. I thought that's now fixed by this PR, no?
I'm still trying to understand when this happens.
Looks like the frontier is always updated with the split input ID, regardless of whether the split is divisible.
// update frontier
if (inner_dim < outer_dim) {
*inner_it = in;
frontier_.erase(outer_it);
} else {
*outer_it = in;
frontier_.erase(inner_it);
}
Wouldn't that mean the frontier should eventually get to the same IDs as the logical IDs?
I meant this part: give up on producing right shape/stride. I thought that's now fixed by this PR, no?
Ah, you are right. We are producing the right shape, but the stride is still wrong. And there's simply no right stride to index that tensor.
Wouldn't that mean the frontier should eventually get to the same IDs as the logical IDs?
If there are ID operations between logical and allocation, then yes.
The case with PreprocessGroupedMatmulInputSf is that there are no ID ops between the allocation and logical domain at all. We are operating directly on ID extents.
So the two sets are not related at all.
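The extent arithmetic quoted earlier can be sketched directly (i_group, the group count, is an assumption taken from the quoted expression; this is illustrative, not the nvFuser code). The allocation extents are computed from the logical extents by plain arithmetic, with no ID ops connecting the two domains:

```cpp
#include <cstdint>
#include <utility>

// allocation domain [ i0 + (i_group - 1) * 127, (i1 + 4 - 1) / 4 * 4 ]
// for a logical domain [ i0, i1 ].
std::pair<int64_t, int64_t> allocExtents(
    int64_t i0, int64_t i1, int64_t i_group) {
  int64_t rows = i0 + (i_group - 1) * 127;
  int64_t cols = (i1 + 4 - 1) / 4 * 4;  // round i1 up to a multiple of 4
  return {rows, cols};
}
```

Because these extents are computed rather than derived through splits/merges, projecting the allocation domain back to the logical domain is impossible.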
For some reason I totally forgot to mention the second case with the unconnected logical & allocation domain, and only mentioned the non-divisible split case. 🥹 Apologies for that.
Hmm, so the change here is actually not related to the above change about the non-divisible split case?
Yep, I'm hearing the feedback as: split this into a separate PR. I'll do that.
{alloc_iid});
IterDomain* logical_id = alloc_iid;
Val* num_devices = of_tv->container()->oneVal();
bool only_valid_device_split = true;
Why does device split have anything to do with the padding?
No, device split doesn't have anything to do with padding.
This is the part I was tagging Priya on. The earlier fix we had on vectorization analysis assumed every Split between the logical and allocation domain has to be a device split.
We are using that for padding now, so I'm extending the analysis to allow non-device splits. I'm staying on the conservative side and stopping vectorization on split IDs.
int64_t in_extent = ee_.evaluate(in->extent()).as<int64_t>();
if (areDimsToBeMergedContiguous(tensor_, new_shape)) {
  tensor_ = tensor_.view(new_shape);
if (in_extent != tensor_.size(left)) {
I'd like to have more extensive unit tests for this case, since this seems to depend on rather intricate behavior. For example, allocation domains with not just one split, but with multiple (both inner and outer) splits and merges.
Indexing and predication with non-divisible splits is tricky, and I had a lot of issues. Some of the issues may just be hidden by implicit assumptions, and this case is certainly breaking some of the common assumptions, like global tensors always being allocated based off logical IDs.
I totally agree with you that this is a delicate topic.
I was trying to avoid pulling on that thread right now because I don't know how deep a rabbit hole this is going to be. 😉
Let me try adding some more interesting examples to see if we can prefetch some issues.
assumptions like global tensors are always allocated based off logical IDs.
That's the part that doesn't sound right to me.
I understand the old assumption that logical and allocation domains are consistent (exact coverage).
But if we break that assumption, then conceptually the logical domain is used for predication and the allocation domain is used for indexing. So the actual buffer should be allocated based on the allocation domain instead; otherwise, there could be out-of-bounds access.
That's why I'm suggesting adding more tests. I'm not aware of any logic that is based on some implicit assumptions, but this pattern is not common and not well tested, which isn't ideal.
!split->innerSplit(),
"Inner split by device dimension is not supported: ",
expr->toString());
bool isValidateDeviceSplit(Expr* expr) {
This is used by vectorization analysis.
The old behavior only asserts. I didn't want to wrap a try/catch around that, so I decided to change the function to return a boolean for the check result and have the call site assert on the return value when necessary.
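The refactor pattern described here can be sketched as follows (names and signatures are illustrative, not the actual nvFuser API): the check returns a bool, and a call site that wants the old hard-failure behavior asserts on the result.

```cpp
#include <cassert>

// Predicate form: report whether the split is a valid device split
// instead of asserting internally. Inner split by a device dimension
// is the unsupported case.
bool isValidDeviceSplit(bool is_inner_split) {
  return !is_inner_split;
}

// Call site preserving the old assert-on-failure behavior.
void requireValidDeviceSplit(bool is_inner_split) {
  assert(isValidDeviceSplit(is_inner_split));
}
```

Other call sites (like the relaxed vectorization analysis) can instead branch on the boolean without aborting.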
I guess my confusion is why we would need to change this for fixing the allocation?
This is an orthogonal issue.
We don't need to change this to fix the computed allocation sizes. But the cpp example added in this PR would trigger an assert on ToT.
auto inp = makeSymbolicTensor(2);
fusion.addInput(inp);
auto out = set(inp);
fusion.addOutput(out);
// padding output to multiple of 16
out->split(1, 16);
out->setAllocationDomain(out->getLoopDomain(), true);
// restore loop domain
out->merge(1);
So the related changes under multi-device and vectorization analysis are just to patch that.
I'd say the allocation fix itself is a significant change and deserves its own PR, which would be easier to review.
Motivation:
We want to use the allocation domain to represent padding logic. This is/will be used by the swizzle layout for block scaling factors in quantization, as well as by the more complex padding used in grouped matmul.
For a simple example, see the added cpp test LogicalAndAllocationSizes.
Context:
There are two cases where padding is represented at allocation:
Case 1: non-divisible split:
For a given 2d TensorView with logical domain [i0, i1], if we split i1 to construct its allocation domain as [i0, ceilDiv(i1, 16), 16], the implication is that we'll be allocating "extra" space so that the inner dimension is padded to an extent that is a multiple of 16. This makes the strides of the TensorView [ceilDiv(i1, 16) * 16, 1].
However, the given TensorView still has a logical domain of [i0, i1], which means we should produce a tensor of shape [i0, i1] to avoid feeding the consumer a tensor with mismatched shapes.
Case 2: unconnected allocation & logical domain:
For PreprocessGroupedMatmulInputSf, the allocation domain and logical domain are not connected via any ID operations; instead, we rely on arithmetic operations directly on their extents to compute the proper buffer size. We cannot project the allocation domain to the logical domain to compute its sizes/strides. In this case, we simply slice a section of the allocated buffer with trivial (contiguous) strides to avoid causing any issue with shape validation. It is safe to do so, since the consumer of this TensorView will not index it through its strides.
What's in this PR:
Fixes in allocation.cpp: the inferred allocation size is [i0, ceilDiv(i1, 16) * 16] instead of [i0, i1]. allocateOutputs needs to allocate the buffer using allocation sizes and strides (so we'll get enough buffer space) and then slice out the correct extent.
Changes in vectorize_helper.cpp and csrc/multidevice/utils.cpp[.h]: relax the assert in vectorization analysis to allow non-device splits.
A cpp test verifies the updated behavior.
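Case 1 above can be sketched numerically (a minimal sketch with hypothetical names, not the allocation.cpp implementation): the output keeps the logical sizes [i0, i1] while its strides reflect the padded allocation.

```cpp
#include <cstdint>
#include <vector>

int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Sizes/strides handed back for a logical domain [i0, i1] whose
// allocation domain is [i0, ceilDiv(i1, 16), 16]: logical sizes,
// strides over the padded row.
struct SizesStrides {
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;
};

SizesStrides paddedOutputGeometry(int64_t i0, int64_t i1) {
  int64_t padded_inner = ceilDiv(i1, 16) * 16;
  return {{i0, i1}, {padded_inner, 1}};
}
```

For example, with i0 = 8 and i1 = 20 the allocation pads each row to 32 elements, so the output has sizes [8, 20] with strides [32, 1].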