Fix allocation logic: unconnected alloc/logical#5185
Review updated until commit fd8826c
```diff
 for (int i = static_cast<int>(tensor_new_shape.size()) - 1; i >= 0; --i) {
-  prod *= tensor_new_shape[i];
   tensor_new_strides[i] = prod;
+  prod *= tensor_new_shape[i];
 }
```
A drive-by bug fix not directly related to this PR.
```cpp
std::set<IterDomain*> logical_set(logical.begin(), logical.end());
if (frontier_set != logical_set) {
  return tensor;
}
std::vector<int64_t> logical_sizes(logical.size(), 0);
```
IIUC, logical_sizes is the correct shape of the output, but logical_strides isn't. Can you add a comment with a little more context, e.g., why it should be done this way and why the incorrect strides don't matter?
```cpp
    c10::nullopt,
    device,
    c10::nullopt);
at::Tensor alloc_tensor;
```
Can you explain why it's changed this way?
## Stacked PRs

A follow-up PR enabling the python API and updating test_moe.py is still being cleaned up.

- #5174 allow layout op in automatic scheduler
- #5185 Fix allocation logic: unconnected alloc/logical <- this one
## This PR

Fixes the allocation logic to ensure that the output tensor has:

1. a shape matching its logical domain;
2. a buffer size matching the allocation domain.

Without this PR, the output tensor from `PreprocessGroupedMatmulInputSf` has a shape mismatched with its logical domain, causing validation failures in downstream consumers.

### Context
The `PreprocessGroupedMatmulInputSf` op has:

1. unconnected logical and allocation domains;
2. a larger allocation size, because the extra padding is represented via arithmetic operations directly on the extents.

The existing allocation logic allocates a buffer matching the logical sizes/strides. This is not the right behavior, because the allocation domain can have a larger extent. We cannot use the allocation sizes/strides either, because consumers of the tensor expect a tensor matching the logical size.
We updated the logic to use the allocation domain for buffer allocation, then slice into the buffer using the logical domain to produce a correctly sized output.

In the case of `PreprocessGroupedMatmulInputSf`, because there is no correct way to slice into the buffer for indexing, we give up on producing correct strides and use naive contiguous strides instead. This is safe, since we do not run indexing logic on the output.
### Code change

1. Refactor buffer allocation to use the allocation domain instead of the logical domain.
2. Fix the projection from allocation to logical: on the special path where projection is not possible, we now compute the correct extents instead of returning the allocation buffer as-is. This lets the layout op return a tensor with the correct logical size while still allocating a buffer large enough to accommodate the padding requirement.