Conversation
csrc/optimization/alias_analysis.cpp
Outdated
if (!out_tv->hasAllocation() && isContiguous(*out_tv)) {
  q.push(out_tv);
  alias_to_source[out_tv] = in_tv;
}
Can you comment more here on why this guarantees that this ViewOp can be accomplished without a copy? I had gone through this analysis a few months ago and I thought you needed to propagate contiguity from root to rfactor to do the check. Then when you encounter a Merge where the outer of the two merged IterDomains is discontiguous you know that you need a copy since a single stride can no longer represent that transformation.
I'm pretty sure this will show my ignorance of various domains, but let me think aloud 😄. I think the check here is sufficient (but not necessary). Every TensorView added to the queue has a contiguous leaf domain. So a ViewOp entering the if-then must have a contiguous root and leaf domain and the default major-to-minor memory layout. I think this guarantees the same storage can be reused. Can you think of a counter example?
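For readers following along, the traversal under discussion is roughly this worklist: start from contiguous fusion inputs and propagate aliasing through view-like ops whose outputs are also contiguous. This is only a Python sketch with hypothetical helper callables (`uses`, `is_contiguous`, `has_allocation`), not nvFuser's actual API:

```python
from collections import deque

def find_aliases(fusion_inputs, uses, is_contiguous, has_allocation):
    """Worklist sketch: an output aliases its input when it has no explicit
    allocation domain and a contiguous layout."""
    alias_to_source = {}
    q = deque(tv for tv in fusion_inputs
              if not has_allocation(tv) and is_contiguous(tv))
    while q:
        in_tv = q.popleft()
        for out_tv in uses(in_tv):
            if not has_allocation(out_tv) and is_contiguous(out_tv):
                alias_to_source[out_tv] = in_tv
                q.append(out_tv)
    return alias_to_source

# Toy fusion: in -> inter -> out, everything contiguous, no allocation domain.
toy_uses = {"in": ["inter"], "inter": ["out"], "out": []}
result = find_aliases(["in"], lambda tv: toy_uses[tv],
                      lambda tv: True, lambda tv: False)
assert result == {"inter": "in", "out": "inter"}
```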
I think if the input is contiguous in the normal stride order then any reshape is possible without a copy. However consider an irregular stride order like channels last, but still contiguous. Reshaping that to 1D is impossible without a copy.
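To make the channels-last point concrete, here is a small pure-Python sketch (dimensions chosen arbitrarily). A packed tensor in the normal major-to-minor order visits memory offsets 0, 1, 2, ... when flattened, so a 1-D view is free; a channels-last tensor is equally packed but visits offsets out of order, so flattening it must copy:

```python
from itertools import product

def memory_offsets(shape, strides):
    """Memory offset of each element, visited in row-major logical order."""
    return [sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(d) for d in shape))]

# 2x3x4x5 tensor (NCHW), packed in normal major-to-minor order:
nchw_strides = (60, 20, 5, 1)
assert memory_offsets((2, 3, 4, 5), nchw_strides) == list(range(120))

# Same tensor, channels-last (NHWC memory order): still fully packed --
# the offsets cover 0..119 exactly once -- but logical order no longer
# matches memory order, so reshaping to 1-D requires a copy.
nhwc_strides = (60, 1, 15, 3)
assert sorted(memory_offsets((2, 3, 4, 5), nhwc_strides)) == list(range(120))
assert memory_offsets((2, 3, 4, 5), nhwc_strides) != list(range(120))
```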
I agree that your check is sufficient but not necessary. Thanks for explaining.
Where is stride order checked?
I think if the input is contiguous in the normal stride order then any reshape is possible without a copy.
Good point -- we can always treat t as f without breaking functionality. I'll try to change my code to reflect that.
Where is stride order checked?
AFAICT, we don't keep stride order in TensorView. Instead, we convert stride order to allocation domain in TensorView:
Fuser/csrc/python_frontend/fusion_record.h
Line 1351 in cc9b438
Good point -- we can always treat t as f without breaking functionality. I'll try to change my code reflecting that.
I don't think it's true any more. The frontend can request a non-contiguous stride order for a fusion output and expect the reshape that produces the output to copy data from contiguous to non-contiguous. While nvFuser wants to choose an allocation domain that preserves aliasing when possible, it shouldn't conflict with what's requested by the frontend.
csrc/optimization/alias_analysis.cpp
Outdated
bool isContiguous(const TensorView& tv) {
  NVF_ERROR(tv.nDims() == tv.getContiguity().size());
  for (const auto i : c10::irange(tv.nDims())) {
    if (!tv.axis(static_cast<int>(i))->isBroadcast() &&
Shouldn't this inspect the allocation domain instead of the leaf domain?
Good question. Wdyt, @jjsjann123 ? In most cases, the allocation domain is empty. Do we treat empty as the same as rfactor domain? (To me, this makes more sense than "the same as leaf domain" because the leaf domain describes the schedule, which is orthogonal to the memory layout).
yes and yes.
We were mistakenly using root/rfactor/leaf domain when we should have been using alloc domain instead. Sorry about the inconsistency in our code base.
I'm also patching this in TensorDomain as well as fusion record at this time. Hopefully we'll have them cleaned up slowly....
Just to clarify, since we expect to enter this function only when hasAllocation returned false, we can assert on not having allocation here and use rfactor domain instead.
PTAL the new logic.
csrc/optimization/alias_analysis.cpp
Outdated
// that the codegen can use to generate a kernel skipping unnecessary
// computation.
std::queue<const TensorView*> q;
if (!source->hasAllocation() && isContiguous(*source)) {
nitpick: can we put a comment here mentioning the specific limitations we have intentionally put here?
i.e., not supporting allocation domain, not supporting non-packed tensors.
    std::unordered_map<const TensorView*, const TensorView*>;

// Finds aliases of the fusion inputs.
AliasAnalysisResult findAliases(const Fusion& fusion);
I'm perfectly fine with us keeping this simple for now.
-> Moving forward we might want to unify how we treat aliases inside a fusion as well as outside of it.
Hint: there's some ugly code I left in on our support to batchnorm running stats update.
Yes, I noticed that other type of aliasing, which essentially makes the input and the output share the same buffer. I need to come up with a better name -- maybe call it "buffer reuse" or something...
// Maps aliases (e.g. fusion outputs) to their sources (e.g. fusion inputs).
using AliasAnalysisResult =
    std::unordered_map<const TensorView*, const TensorView*>;
I know this is a WIP.
Should we use a disjoint-set as the value, rather than a single map from TV -> TV?
i.e., in the case below, we should be able to recognize that all TVs in the fusion point to the same buffer:
const std::vector<int64_t> in_shape({2, 3, 4});
const std::vector<int64_t> inter_shape({2, 12});
const std::vector<int64_t> out_shape({24});
TensorView* in = makeContigConcreteTensor(in_shape);
fusion.addInput(in);
TensorView* inter = reshape(in, in_shape, inter_shape);
TensorView* out = reshape(inter, inter_shape, out_shape);
fusion.addOutput(out);
I guess this could make bookkeeping tricky for cases where we are slicing through the tensor and create an alias that doesn't cover the whole buffer.
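The disjoint-set idea suggested above could be sketched like this in plain Python over tensor names (not nvFuser types); tensors in one set are recognized as sharing a buffer:

```python
class AliasGroups:
    """Disjoint-set (union-find): tensors in one set share a buffer."""

    def __init__(self):
        self.parent = {}

    def find(self, tv):
        # Lazily register unseen tensors as their own group.
        self.parent.setdefault(tv, tv)
        while self.parent[tv] != tv:
            # Path halving keeps trees shallow.
            self.parent[tv] = self.parent[self.parent[tv]]
            tv = self.parent[tv]
        return tv

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# The reshape chain from the example above: in -> inter -> out.
g = AliasGroups()
g.union("inter", "in")
g.union("out", "inter")
assert g.find("out") == g.find("in") == g.find("inter")
```

As the comment notes, a plain union-find would still need extra bookkeeping for partial aliases such as slices, where the alias covers only part of the source buffer.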
Good point -- added a comment.
if (!out_tv->hasAllocation() && isContiguous(*out_tv)) {
  q.push(out_tv);
  alias_to_source[out_tv] = in_tv;
Given that we have a tensor to tensor mapping here, should we traverse the alias relationship so we'll reach the root for each alias tree?
Yes, the user is expected to do that and we can create helpers.
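Such a helper could look like this; a sketch over a plain dict rather than the C++ `AliasAnalysisResult`, with `alias_root` as a hypothetical name:

```python
def alias_root(alias_to_source, tv):
    """Follow alias links until reaching a tensor that aliases nothing,
    e.g. a fusion input at the root of the alias tree."""
    while tv in alias_to_source:
        tv = alias_to_source[tv]
    return tv

# out aliases inter, which aliases the fusion input in.
alias_to_source = {"out": "inter", "inter": "in"}
assert alias_root(alias_to_source, "out") == "in"
assert alias_root(alias_to_source, "in") == "in"
```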
With more comments and renaming.
jjsjann123 left a comment:
LGTM. Thanks for adding extra tests there.
csrc/optimization/alias_analysis.cpp
Outdated
const std::vector<IterDomain*>& allocation_domain =
    tv.getMaybeAllocationDomain();
for (size_t i = 0; i < allocation_domain.size(); i++) {
  // Broadcast and reduction dims are always contiguous because their sizes
nitpick: IIUC, isBroadcast & isReduction should be equivalent to !tv.getContiguity()[i].has_value().
So can't we simplify the logic here as
const auto opt_contig = tv.getContiguity()[i];
if (opt_contig.has_value() && opt_contig.value() == false) {
  return false;
}
    tv.getMaybeAllocationDomain();
for (size_t i = 0; i < allocation_domain.size(); i++) {
  // We skip std::nullopt contiguity. It represents a broadcast or reduction
  // dimension, which is of size 1 and always contiguous.
Would we need to check that the size is indeed 1 to make this assumption?
I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?
Would we need to check that the size is indeed 1 to make this assumption?
The reduction dimension holds the original extent, so technically it's not one but conceptually one for reasoning about contiguity... (I believe we were in the same knowledge share by @naoyam where I wondered the reason behind this design 😄 )
I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?
AFAIU, in this case, the fusion definition will rfactor the dimension of size 16 to 4x4 and reduce one of the 4s to 1. No?
I believe we were in the same knowledge share by @naoyam where I wondered the reason behind this design 😄
Ahh, right, oops; yeah, I remember you asking that.
I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?
AFAIU, in this case, the fusion definition will rfactor the dimension of size 16 to 4x4 and reduce one of the 4s to 1. No?
Would it? I would imagine it would just leave it as a 4x4 in such a case, no?
Perhaps a better example is a 35-channel tensor that was grouped by 5 to get a 7-channel tensor. Unless the strides were forced on either tensor, as I understand it nvFuser would be free to choose a stride of 64 on the input side and 8 on the output side, instead of the densely-packed 35 and 7, respectively. For the smaller tensor this is a more reasonable choice, as we waste one measly element and in turn can get nicely-coalesced loads.
Then there's truly-dynamic tensors, where we can't even know at compile time if they're contiguous. This would be the group-by-5 example above, but with an unknown number of input channels.
As I understand it (and I could easily be wrong), nullopt in the above case would mean either "nvFuser has not decided on a stride" or "nvFuser can't decide on a stride (due to dynamism)". In such a situation we'd need to be conservative and assume non-contiguity, no?
(perhaps it would've made more sense for me to have put my question on the if from line 29, instead; oops, sorry!)
Ha, I might have misled you here @wujingyue
Conceptually, nullopt should be indicating a broadcast stride. If that comes from either a size-1 broadcast or a reduction (which gives us a diminished dimension), it's safe to skip those and consider them as contiguous here.
In the meantime, if we have expanded dimensions, where we do require a stride == 0, I believe we also mark contiguity as nullopt? So for those cases, we cannot naively consider them as contiguous any more.
i.e. at::randn({4, 5}).unsqueeze(-1).expand({4, 5, 6}).reshape({40, 3})
After the expand, the last dimension is a stride-0 dimension and we cannot split that dimension any more.
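The expand example can be checked with a small pure-Python offset computation (shapes taken from the snippet above). The expanded dimension has stride 0, so 20 real elements back 120 logical ones, and no single-strided (40, 3) view can reproduce that:

```python
from itertools import product

def memory_offsets(shape, strides):
    """Memory offset of each element, visited in row-major logical order."""
    return [sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(d) for d in shape))]

# randn(4, 5) -> unsqueeze(-1) -> expand to (4, 5, 6): strides (5, 1, 0).
offs = memory_offsets((4, 5, 6), (5, 1, 0))
assert len(offs) == 120        # 120 logical elements ...
assert len(set(offs)) == 20    # ... backed by only 20 distinct offsets.
# A zero-copy (40, 3) view would need 120 distinct, regularly-strided
# offsets, so this reshape must materialize a copy.
```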
Ouch, that hurts... Let me see what I can do. I need to detect dimensions that are an expanded broadcast.
AliasAnalysisResult alias_to_source;
for (const Val* in : fusion.inputs()) {
  if (const TensorView* in_tv = dynamic_cast<const TensorView*>(in)) {
    findAliasesOfRoot(in_tv, alias_to_source);
If we instead return the AliasAnalysisResult rather than taking a mutable alias_to_source arg, I think this might be somewhat easily parallelizable (albeit with a serialized merge step afterwards that inserts the results).
I do not think this is important today; due to some other issues the fusion groups we get are pretty small. But I could see us hitting hundreds of ops once we accept cat/slice/etc. and matmuls/convs.
Good point -- I'll add a comment: #1106
This is a follow-up to #1097. See code comments and unit tests for why it's needed. This PR also changes the way we traverse `Expr`s in a fusion. Previously, we traversed only from fusion inputs and collected only `TensorView`s that alias an input. Now, we traverse each `Expr` and capture all local aliases even if they are not aliases of any input. This change gives us more test coverage.
With this in place, I can switch to figuring out how to generate faster (or even no) kernels by leveraging aliasing. There's obviously a ton to improve. Most notably, alias analysis should recommend a non-default allocation domain to proactively make outputs aliases, and it should handle more op types.
Also, I haven't decided when/where to run it. We can run it between concretization and segmentation, or during scheduling, or both.