Conversation
!build
jjsjann123
left a comment
LGTM.
Let's merge it since it patches the bug with the expanded dimensions.
}
if (ViewOp* view = dynamic_cast<ViewOp*>(expr)) {
  TensorView* in = dynamic_cast<TensorView*>(view->in());
  if (in == nullptr) {
nitpick: are we expecting the view op to have non-tensor I/O? Wondering if we should have an assert or throw a warning here?
Wdyt, @naoyam? If the I/O is always a TensorView, I'd like to see that reflected in ViewOp's API. E.g., its in() and out() should return a TensorView instead of a Val, and its constructor should do the assert.
^^^yes, that's what I was asking as well.
Can ViewOp take a non-TensorView argument?
In the case of ViewOp, I believe we can change the type to TensorView, but generally speaking, TensorView inputs and outputs are replaced with kir::TensorIndex during the lowering to the Kernel IR. For ViewOp, it's replaced with LoadStoreOp as part of fusion simplifications, so there should be no ViewOp with kir::TensorIndex.
Personally, I just don't remember which Expr types may have TensorIndex, so I tend to always do ->as<TensorView>() or use something like ir_utils::getTvOutput(Expr*).
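The API change discussed above can be sketched with a toy class hierarchy. This is a standalone illustration, not nvFuser's actual code: the `Val`/`TensorView`/`ViewOp` names mirror the discussion, and the `as<T>()` helper is a stand-in for nvFuser's asserting downcast.

```cpp
#include <cassert>

// Toy stand-ins for nvFuser's Val and TensorView (hypothetical sketch).
struct Val {
  virtual ~Val() = default;

  // In the spirit of an asserting downcast: fail loudly instead of
  // silently returning nullptr like dynamic_cast does.
  template <typename T>
  T* as() {
    T* downcast = dynamic_cast<T*>(this);
    assert(downcast != nullptr && "unexpected Val subtype");
    return downcast;
  }
};

struct TensorView : Val {};

// A ViewOp whose constructor enforces the TensorView-only invariant, so
// in()/out() can return TensorView* directly instead of Val*.
class ViewOp {
 public:
  ViewOp(Val* in, Val* out)
      : in_(in->as<TensorView>()), out_(out->as<TensorView>()) {}
  TensorView* in() const { return in_; }
  TensorView* out() const { return out_; }

 private:
  TensorView* in_;
  TensorView* out_;
};
```

With this shape, callers no longer need the `dynamic_cast`-then-null-check pattern shown in the quoted diff; the invariant is checked once at construction.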
    {out_root.begin(), out_root.end()},
    {out_rfactor.begin(), out_rfactor.end()});

for (const auto* transform : transforms) {
nitpick: can we put a comment here noting that we are being conservative?
I.e., splitting an expanded dimension should be fine, I guess, as long as we don't merge it with other domains.
Meanwhile, I'm wondering if this analysis could be done more safely through some replay. Tagging @jacobhinkle for comments.
split expanded dimension should be fine I guess, as long as we don't merge it with other domains.
I agree with you, but the following test fails at the moment. I think nvFuser isn't smart enough yet to figure out that the split expanded broadcast can remain stride=0. I'll file an enhancement...
TEST_F(AliasAnalysisTest, View_SplitExpandedBroadcast) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  TensorView* in = makeContigConcreteTensor({4, 5});
  fusion.addInput(in);
  TensorView* out = broadcast(in, {false, false, true});
  out = expand(
      out,
      {IrBuilder::create<Val>(4),
       IrBuilder::create<Val>(5),
       IrBuilder::create<Val>(6)});
  out = reshape(out, {4, 5, 6}, {4, 5, 2, 3});
  fusion.addOutput(out);

  FusionExecutor fe;
  at::Tensor in_tensor =
      at::randn({4, 5}, at::dtype(at::kFloat).device(at::kCUDA, 0));
  fe.compileFusion(&fusion, {in_tensor});
  at::Tensor out_tensor = fe.runFusion({in_tensor})[0];
  EXPECT_THAT(out_tensor.strides(), ElementsAre(5, 1, 0, 0));
}
FWIW, there's no immediate bug. Without this PR, we only collect aliases between fusion inputs and fusion outputs via ViewOp. Things like broadcast and expand that trigger the bug are excluded from consideration.
jacobhinkle
left a comment
Just a comment on the reshape analysis.
if (expanded_broadcast_dims.count(input)) {
  return true;
}
nit: why use an unordered_set here instead of just placing the id->isBroadcast() && id->hasExpandedExtent() check inline?
expanded_broadcast_dims collects IterDomains from out_tv's leaf domain. For some reason, the isBroadcast bit is missing in out_tv's leaf domain and only appears in in_tv's rfactor domain. See bS7 and iS14 in the following example.
Let me know if there's a better way to handle this.
%kernel {
T1_l[ iS2{4}, iS3{5}, bS4{1} ]
= broadcast( T0_g[ iS0{4}, iS1{5} ] )
i10 = (nvfuser_index_t)(6);
T2_l[ iS5{4}, iS6{5}, bS7{( (nvfuser_index_t)(6) )} ] = expand( T1_l[ iS2{4}, iS3{5}, bS4{1} ], {4, 5, i10} )
T3_g[ iS16{40}rf, iS17{( ceilDiv(( ( 4 * 5 ) * ( (nvfuser_index_t)(6) ) ), 40) )}rf ] = view( T2_l[ iS5{4}, iS6{5}, bS7{( (nvfuser_index_t)(6) )} ] )
TransformPrinter :
T0_g[ iS0{4}, iS1{5} ]
root domain : (iS0{4}, iS1{5})
contiguity: t t
leaf domain : (iS0{4}, iS1{5})
T1_l[ iS2{4}, iS3{5}, bS4{1} ]
root domain : (iS2{4}, iS3{5}, bS4{1})
contiguity: t t n
leaf domain : (iS2{4}, iS3{5}, bS4{1})
T2_l[ iS5{4}, iS6{5}, bS7{( (nvfuser_index_t)(6) )} ]
root domain : (iS5{4}, iS6{5}, bS7{( (nvfuser_index_t)(6) )})
contiguity: t t n
leaf domain : (iS5{4}, iS6{5}, bS7{( (nvfuser_index_t)(6) )})
T3_g[ iS16{40}rf, iS17{( ceilDiv(( ( 4 * 5 ) * ( (nvfuser_index_t)(6) ) ), 40) )}rf ]
root domain : (iS11{4}rf, iS12{5}rf, iS14{( (nvfuser_index_t)(6) )}rf)
Merge: iS11{4}rf and iS12{5}rf -> iS13{( 4 * 5 )}rf
Merge: iS13{( 4 * 5 )}rf and iS14{( (nvfuser_index_t)(6) )}rf -> iS15{( ( 4 * 5 ) * ( (nvfuser_index_t)(6) ) )}rf
Outer split: iS15{( ( 4 * 5 ) * ( (nvfuser_index_t)(6) ) )}rf by factor 40 -> iS16{40}rf, iS17{( ceilDiv(( ( 4 * 5 ) * ( (nvfuser_index_t)(6) ) ), 40) )}rf, start offset: 0, stop offset: 0
rfactor domain : (iS16{40}rf, iS17{( ceilDiv(( ( 4 * 5 ) * ( (nvfuser_index_t)(6) ) ), 40) )}rf)
contiguity: t t
leaf domain : (iS16{40}rf, iS17{( ceilDiv(( ( 4 * 5 ) * ( (nvfuser_index_t)(6) ) ), 40) )}rf)
}
Thanks for pointing this out. We should probably fix this by making the root domain of the reshaped tensor expanded in these cases as well. I will check for that when tackling #1126, unless @naoyam there is some reason for root IDs to be Iteration here, as if they are concretizing the expanded input.
This looks strange. I'll take a look.
// Whether a ViewOp transforms any expanded broadcast IterDomain in the input.
// This is a corner case in which we can't always turn `out` into an alias.
//
// For example, given
//
//   t0 = makeContigConcreteTensor({4, 5});
//   t1 = broadcast(t0, {false, false, true});
//   t2 = expand(t1, {4, 5, 6});
//
// `reshape(t2, {40, 3})` and `reshape(t2, {4, 30})` have to copy data.
// `reshape(t2, {40, 3})` splits the expanded broadcast IterDomain (which is 6)
// into 2 and 3, and merges the 2 with preceding IterDomains.
// `reshape(t2, {4, 30})` merges the expanded broadcast IterDomain. However,
// the output of `reshape(t2, {20, 6})` can simply be an alias because the
// expanded broadcast IterDomain is forwarded, not transformed.
//
// As a future improvement, when an expanded broadcast dimension is only split,
// the output of the reshape can be an alias. However, nvFuser currently decides
// to materialize the expansion, making the output not an alias (#1126).
//
// Obviously, this function assumes `in` and `out` are the input and output
// TensorView of the same ViewOp.
bool transformsExpandedBroadcastIterDomain(TensorView* in, TensorView* out) {
Let t and f denote contiguous and non-contiguous Iteration domains, and let b and e denote regular broadcast and expanded broadcast to some non-unit size. The output IDs follow these rules, I believe (please double check), where C denotes a required copy because the output cannot be expressed with a single stride:
Split:
  t -> (t, t)
  f -> (t, f)
  b -> (b, b)  <- split factor must be 1
  e -> (e, e), (b, e), or (e, b) depending on split factor (after #1126) -- last two are BroadcastOp

Merge (outer o, inner i):
  o\i | t f b e
  ----+--------
   t  | t f t C
   f  | C C f C
   b  | t f b e
   e  | C C e e  <-- after #1126
Statuses b and e are represented in the IR with isBroadcast() and hasExpandedExtent(). You'd have to propagate contiguity t and f yourself, since that is not currently tracked at the IterDomain level but is instead a TensorDomain property. When you encounter a C, you can stop propagating since you know you can't alias at that point.
Note that this is not quite symmetric: Merge(t, f) can be no-copy since the output just uses the inner stride, but Merge(f, t) consists of disjoint contiguous segments, so it can't be handled with one stride parameter.
It's possible that transforms generating C outputs could be later inverted, but analyzeView should not create such sequences.
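The merge table above can be transcribed into a tiny lookup for experimentation. This is just a sketch of the proposed rules, not code from the PR; the function name is made up:

```cpp
#include <cassert>
#include <map>
#include <utility>

// 't'/'f' = contiguous/non-contiguous Iteration domains,
// 'b'/'e' = regular/expanded broadcast, 'C' = merge forces a copy.
// Rows are the outer merge input, columns the inner one.
char mergeStatus(char outer, char inner) {
  static const std::map<std::pair<char, char>, char> table = {
      {{'t', 't'}, 't'}, {{'t', 'f'}, 'f'}, {{'t', 'b'}, 't'}, {{'t', 'e'}, 'C'},
      {{'f', 't'}, 'C'}, {{'f', 'f'}, 'C'}, {{'f', 'b'}, 'f'}, {{'f', 'e'}, 'C'},
      {{'b', 't'}, 't'}, {{'b', 'f'}, 'f'}, {{'b', 'b'}, 'b'}, {{'b', 'e'}, 'e'},
      {{'e', 't'}, 'C'}, {{'e', 'f'}, 'C'}, {{'e', 'b'}, 'e'}, {{'e', 'e'}, 'e'},
  };
  return table.at({outer, inner});
}
```

For instance, `reshape(t2, {4, 30})` from the earlier example merges a contiguous Iteration domain of extent 5 with the expanded broadcast of extent 6: `mergeStatus('t', 'e')` yields `'C'`, matching the observation that this reshape has to copy. The asymmetry noted above also falls out: `mergeStatus('t', 'f')` is `'f'` (no copy) while `mergeStatus('f', 't')` is `'C'`.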
@jacobhinkle Not exactly sure what you mean by:
b -> (b, b) <- split factor must be 1
We don't enforce any restriction on splitting broadcast domains. For example, it's possible to split by, say, 4, which would create a broadcast domain of size 4. It's not marked as expanded, IIRC.
We don't enforce any restriction of splitting broadcast domains.
We don't enforce it on the scheduling op split(), but in a reshape wouldn't we always be preserving numel in each SplitTransform and MergeTransform?
Yes, broadcast domains should be squeezed before reshape transformations, except for expanded domains
// t2 = expand(t1, {4, 5, 6});
//
// `reshape(t2, {40, 3})` and `reshape(t2, {4, 30})` have to copy data.
// `reshape(t2, {40, 3})` splits the expanded broadcast IterDomain (which is 6)
nit: our reshape transformation always merges domains until they are large enough to create the first output domain, so it should look like:
merge: 4, 5 -> 20
merge: 20, 6 -> 120
outer-split: 120 -> 40, 3
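The merge-until-large-enough strategy described above can be sketched as a standalone function. This is a simplified illustration (it assumes the total element counts match and extents divide evenly), not nvFuser's actual analyzeView, and the function name is made up:

```cpp
#include <string>
#include <vector>

// Derive the transform sequence for a reshape: keep merging input domains
// until the running extent covers the next output extent, then outer-split.
std::vector<std::string> reshapeTransforms(
    std::vector<long> in, const std::vector<long>& out) {
  std::vector<std::string> log;
  size_t i = 0;          // next input domain to consume
  long cur = in[i++];    // running extent being built up
  for (size_t o = 0; o < out.size(); ++o) {
    // Merge more input domains until `cur` is divisible by the next output
    // extent.
    while (cur % out[o] != 0 && i < in.size()) {
      log.push_back("merge: " + std::to_string(cur) + ", " +
                    std::to_string(in[i]) + " -> " +
                    std::to_string(cur * in[i]));
      cur *= in[i++];
    }
    if (cur == out[o]) {
      // Extent already matches: the domain is forwarded, not transformed.
      if (o + 1 < out.size() && i < in.size()) {
        cur = in[i++];
      }
      continue;
    }
    log.push_back("outer-split: " + std::to_string(cur) + " -> " +
                  std::to_string(out[o]) + ", " +
                  std::to_string(cur / out[o]));
    cur /= out[o];
  }
  return log;
}
```

For `{4, 5, 6} -> {40, 3}` this yields exactly the three steps listed above. Notably, for `{4, 5, 6} -> {20, 6}` it produces only `merge: 4, 5 -> 20`, leaving the expanded extent-6 domain untouched, which is why that reshape's output can be an alias.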
Thanks! A PR is coming your way to fix that comment...
#1124 left it there accidentally.
As @naoyam said in #1124 (comment).
This is a follow-up to #1097. See code comments and unit tests for why it's needed.
This PR also changes the way we traverse Exprs in a fusion. Previously, we traversed from only fusion inputs and collected only TensorViews that alias an input. Now, we traverse each Expr and capture all local aliases even if they are not aliases of any input. This change gives us more test coverage.
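The traversal change can be modeled with a toy example. Everything here is hypothetical (the `ToyExpr` struct and both function names are made up for illustration); it only shows why per-Expr collection finds aliases that input-rooted propagation misses:

```cpp
#include <set>
#include <string>
#include <vector>

// A toy expression: `out` may locally alias `in`.
struct ToyExpr {
  std::string in;
  std::string out;
  bool is_alias;
};

// Old scheme: propagate only from fusion inputs; an output is recorded only
// if it is transitively an alias of a fusion input.
std::set<std::string> aliasesOfInputs(
    const std::vector<ToyExpr>& exprs, const std::set<std::string>& inputs) {
  std::set<std::string> reachable = inputs;
  std::set<std::string> result;
  for (const ToyExpr& e : exprs) {
    if (e.is_alias && reachable.count(e.in)) {
      reachable.insert(e.out);
      result.insert(e.out);
    }
  }
  return result;
}

// New scheme: record every Expr-local alias, even when the chain back to a
// fusion input is broken by a non-aliasing Expr.
std::set<std::string> allLocalAliases(const std::vector<ToyExpr>& exprs) {
  std::set<std::string> result;
  for (const ToyExpr& e : exprs) {
    if (e.is_alias) {
      result.insert(e.out);
    }
  }
  return result;
}
```

Given a chain `in -> a` (alias), `a -> b` (no alias), `b -> c` (alias), the old scheme finds only `a`, while the new scheme also captures the local alias `c`, which is what gives the extra test coverage.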