Conversation
!test
Review updated until commit 75ee3fb.
!test
cc @jacobhinkle I tried to replay logical-to-leaf transforms from the original TV to the concretized TV. It works except for `Fuser/tests/python/test_python_frontend.py` line 3369 (at 7a9dbc6). According to `Fuser/csrc/dynamic_transform.cpp` line 811 (at 7a9dbc6), I'm not sure whether reduction dimensions have to follow a certain order. If not, I could fix this by appending all unmapped reduction dimensions to the end of the loop domain.
I am not sure I understand the issue. IIUC, the purpose of this PR is to replay loop transforms when we do concretization in order to preserve DID scheduling, right? When you concretize a reshape, neither the original nor the replacement should have any reduction domains, will they?
Correct.
I thought so until I hit #1691, which I'm sure you know everything about.
Ah ok, actually I had forgotten about that entirely. In a case like this we are actually replacing the ViewOp output with its input, so we're not creating a new TensorView, are we? In that case the loop domain would still be intact. However, if that TV was also replaced during concretization, then it might no longer have a non-trivial loop domain... I assume this is the case you are worried about.
Here's a case I want to support.

Before concretization: with all TVs' first dimension being outer-split by [...]

After concretization: with all TVs' first dimension (i.e. i0) being split the same way as before.

Therefore, this PR tries to replay loop transforms in addition.
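The outer-split replay described above can be pictured with a plain-Python sketch (illustrative only; `outer_split`, the shapes, and `d` are hypothetical stand-ins, not nvFuser API):

```python
# Hypothetical sketch (not nvFuser API): model an outer split of the
# first dimension and replay it on the concretized shape.

def outer_split(shape, axis, factor):
    """Outer-split: [..., iN, ...] -> [..., factor, iN // factor, ...]."""
    assert shape[axis] % factor == 0
    return shape[:axis] + [factor, shape[axis] // factor] + shape[axis + 1:]

# Before concretization, the symbolic reshape output is scheduled with an
# outer split by the mesh size d; after concretization, we replay the same
# split on the concrete i0 so the DIDx sharding is preserved.
d = 2                      # device-mesh size (assumed)
symbolic = [8, 16]         # [i0, i1] before concretization (assumed shapes)
concrete = [8, 4, 4]       # concretized reshape output, same i0

print(outer_split(symbolic, 0, d))   # [2, 4, 16]
print(outer_split(concrete, 0, d))   # [2, 4, 4, 4] -- i0 split the same way
```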
That's right -- no new TensorViews are created. In fact, I can fix this error by

```diff
diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
index a016f0af..3c9a3919 100644
--- a/csrc/dynamic_transform.cpp
+++ b/csrc/dynamic_transform.cpp
@@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
     TensorView* incomplete_out_tv,
     const AnalyzeViewResult& view_analysis) {
   TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
+  if (concrete_reshape_out_tv == inp_tv) {
+    return inp_tv;
+  }
   // Extent expressions often change when concretizing a reshape. Here we
   // replace these in all downstream expressions so that the Fusion looks just
```

Why wasn't #1691 fixed this way? Can concrete_reshape_out_tv be different from inp_tv but still have more reduction dimensions than incomplete_out_tv? (It's currently fixed by removing reduction dimensions from concrete_reshape_out_tv before registering concretization.)
We should handle this particular case in the
I am not sure why you need to return early in this function. What is the error you mention?
The code you linked is not removing that dimension from the replacement TV. This section is just for registering concretization of extents, so that we remove the [...]
You are right. The following

```diff
diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
index a016f0af..3c9a3919 100644
--- a/csrc/dynamic_transform.cpp
+++ b/csrc/dynamic_transform.cpp
@@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
     TensorView* incomplete_out_tv,
     const AnalyzeViewResult& view_analysis) {
   TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
+  if (concrete_reshape_out_tv == inp_tv) {
+    return inp_tv;
+  }
   // Extent expressions often change when concretizing a reshape. Here we
   // replace these in all downstream expressions so that the Fusion looks just
```

is a wrong fix, because i2 wouldn't be replaced in downstream ops. I'm experimenting with putting the extra reduction dimensions in concrete_reshape_out_tv at the end of its loop domain. Will come back once I have an update...
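The "append extra reduction dimensions at the end of the loop domain" idea can be sketched in plain Python (hypothetical helper over flat lists of IterDomain names, not nvFuser code; the ID names are made up):

```python
# Hypothetical sketch (not nvFuser code): when the concretized reshape
# output carries reduction IterDomains that the incomplete output lacks,
# one option is to keep the replayed loop IDs in order and append the
# unmapped reduction IDs at the end of the loop domain.

def rebuild_loop_domain(replayed, unmapped_reductions):
    # replayed: loop IDs produced by replaying logical-to-loop transforms
    # unmapped_reductions: reduction IDs with no counterpart in the
    # incomplete output; this assumes their relative order doesn't matter
    assert not set(replayed) & set(unmapped_reductions)
    return replayed + unmapped_reductions

print(rebuild_loop_domain(["iDIDx{2}", "i0/2", "i1"], ["r2"]))
# ['iDIDx{2}', 'i0/2', 'i1', 'r2']
```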
!test

!test

!test --diff
jjsjann123 left a comment:

stamping since code change looks good.
```cpp
      TensorDomain* new_self);
  // Self replay the transformation on `self` from logical to loop and
  // allocation onto `new_self`.
  static void selfReplay(const TensorDomain* self, TensorDomain* new_self);
```
I'm wondering whether we would want to replay both all the time.
For example, for the existing use of fixing the allocation domain for aliases (and for graph mutation, where we preserve the output memory layout), we don't necessarily need to preserve the loop transform.
And what does replaying the loop transform mean for those use cases?
Granted, the current uses in those cases are prior to scheduling, so there isn't any transform from root to loop in the first place. I realized my question isn't really relevant...
> granted that the current use in those cases are prior to scheduling
That's right. I think my answer in the other comment also applies here. I should probably try to fix #3479 again after I'm done with DID loop split.
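Conceptually, selfReplay records the transforms on `self`'s logical-to-loop path and applies the same transforms to `new_self`. A minimal Python model of that idea (illustrative only: splits/merges over flat extent lists, not nvFuser's IterDomain machinery; names are made up):

```python
# Hypothetical sketch (not nvFuser's implementation): represent the
# logical-to-loop path as a list of (op, axis, factor) transforms and
# apply it to any domain with a matching logical shape.

def apply_transforms(domain, transforms):
    domain = list(domain)
    for op, axis, factor in transforms:
        if op == "split":          # outer split: extent -> [factor, extent/factor]
            extent = domain[axis]
            assert extent % factor == 0
            domain[axis:axis + 1] = [factor, extent // factor]
        elif op == "merge":        # merge axis with axis+1
            domain[axis:axis + 2] = [domain[axis] * domain[axis + 1]]
    return domain

# Transforms "recorded" from self's logical-to-loop path (assumed):
transforms = [("split", 0, 2), ("merge", 1, None)]
self_logical = [8, 3, 5]
new_self_logical = [8, 3, 5]   # same logical shape

# Replaying onto new_self reproduces self's loop domain:
assert apply_transforms(self_logical, transforms) == \
       apply_transforms(new_self_logical, transforms)
print(apply_transforms(new_self_logical, transforms))  # [2, 12, 5]
```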
```cpp
      axis_map[id] = new_id;
      i++;
    }
    for (auto&& [id, new_id] : zip(self_logical, new_self_logical))
```
```cpp
    tv->setDeviceMesh(mesh);
    tv->split(2, d, /*inner_split=*/false);
    tv->axis(2)->parallelize(ParallelType::DIDx);
    tv->setAllocationDomain(tv->getLoopDomain(), true);
```
Out of curiosity, is the reason we needed a replay on loop that somewhere else we expect to see DIDx on loop as well as on allocation?
My naive mental model sees DIDx as specified on the allocation domain; with the existing replay, it feels almost like we have some unnecessary redundancy, i.e., loop and allocation need to have some matching parallel type.
The current assumption is that, before (intra-GPU) scheduling, loop has to post-dominate allocation. In other words, loop must be the "leaf" domain before scheduling. My #3621 attempted to lift this assumption but failed (cf. #3706).
It doesn't mean that loop and allocation have to be in the same order. For example, loop can be a permutation of allocation, or a split of allocation.
Also, it doesn't mean that loop and allocation need to have the same parallel types. Loop may be more parallelized than allocation. For example, this example for overlapping parallelizes the column dimension of operand A on Stream only in loop but not in allocation. In practice, it can be implemented as a Set from a non-stream-parallelized IterDomain to a stream-parallelized IterDomain.
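The relationship described above can be pictured with a rough sketch (the ID names and the coverage map are hypothetical, not nvFuser structures):

```python
# Hypothetical sketch (not nvFuser code): before intra-GPU scheduling,
# loop must post-dominate allocation, but the two need not be identical.
# Here allocation is [i0, i1] while loop is an outer split of i0 (outer
# part parallelized on DIDx) plus a Stream-parallelized i1 that exists
# only in loop -- loop is "more parallelized" than allocation.

allocation = ["i0", "i1"]
loop = ["DIDx{d}", "i0/d", "Stream{i1}"]

# Each allocation ID is covered by (derivable from) some loop IDs,
# which is what "loop post-dominates allocation" means here:
coverage = {"i0": {"DIDx{d}", "i0/d"}, "i1": {"Stream{i1}"}}
assert set().union(*coverage.values()) == set(loop)
assert set(coverage) == set(allocation)
```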
@jacobhinkle I'm surprised by the number of codegen diffs. The numbers of kernels don't match --
!test --diff

!test --diff
I tried the codegen diff locally for bin/test_nvfuser. The result makes more sense: the diffs don't change functionality and in fact make the code better. For example, the first diff and the second diff avoid unnecessary alloc_stride accesses, because of this check that was added.
For #2563
Also extends selfAllocationReplay to replay loop so it can be reused for concretization.