Conversation
csrc/optimization/alias_analysis.cpp
Outdated
if (!out_tv->hasAllocation() && isContiguous(*out_tv)) {
  q.push(out_tv);
  alias_to_source[out_tv] = in_tv;
}
Can you comment more here on why this guarantees that this ViewOp can be accomplished without a copy? I had gone through this analysis a few months ago and I thought you needed to propagate contiguity from root to rfactor to do the check. Then when you encounter a Merge where the outer of the two merged IterDomains is discontiguous you know that you need a copy since a single stride can no longer represent that transformation.
I'm pretty sure this will show my ignorance of various domains, but let me think aloud 😄. I think the check here is sufficient (but not necessary). Every TensorView added to the queue has a contiguous leaf domain. So a ViewOp entering the if-then must have a contiguous root and leaf domain and the default major-to-minor memory layout. I think this guarantees the same storage can be reused. Can you think of a counter example?
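For readers following along, the traversal under discussion is roughly this worklist: start from contiguous fusion inputs and propagate aliasing through view-like ops whose outputs are also contiguous. This is only a Python sketch with hypothetical helper callables (`uses`, `is_contiguous`, `has_allocation`), not nvFuser's actual API:

```python
from collections import deque

def find_aliases(fusion_inputs, uses, is_contiguous, has_allocation):
    """Worklist sketch: an output aliases its input when it has no explicit
    allocation domain and a contiguous layout."""
    alias_to_source = {}
    q = deque(tv for tv in fusion_inputs
              if not has_allocation(tv) and is_contiguous(tv))
    while q:
        in_tv = q.popleft()
        for out_tv in uses(in_tv):
            if not has_allocation(out_tv) and is_contiguous(out_tv):
                alias_to_source[out_tv] = in_tv
                q.append(out_tv)
    return alias_to_source

# Toy fusion: in -> inter -> out, everything contiguous, no allocation domain.
toy_uses = {"in": ["inter"], "inter": ["out"], "out": []}
result = find_aliases(["in"], lambda tv: toy_uses[tv],
                      lambda tv: True, lambda tv: False)
assert result == {"inter": "in", "out": "inter"}
```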
I think if the input is contiguous in the normal stride order then any reshape is possible without a copy. However consider an irregular stride order like channels last, but still contiguous. Reshaping that to 1D is impossible without a copy.
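To make the channels-last point concrete, here is a small pure-Python sketch (dimensions chosen arbitrarily). A packed tensor in the normal major-to-minor order visits memory offsets 0, 1, 2, ... when flattened, so a 1-D view is free; a channels-last tensor is equally packed but visits offsets out of order, so flattening it must copy:

```python
from itertools import product

def memory_offsets(shape, strides):
    """Memory offset of each element, visited in row-major logical order."""
    return [sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(d) for d in shape))]

# 2x3x4x5 tensor (NCHW), packed in normal major-to-minor order:
nchw_strides = (60, 20, 5, 1)
assert memory_offsets((2, 3, 4, 5), nchw_strides) == list(range(120))

# Same tensor, channels-last (NHWC memory order): still fully packed --
# the offsets cover 0..119 exactly once -- but logical order no longer
# matches memory order, so reshaping to 1-D requires a copy.
nhwc_strides = (60, 1, 15, 3)
assert sorted(memory_offsets((2, 3, 4, 5), nhwc_strides)) == list(range(120))
assert memory_offsets((2, 3, 4, 5), nhwc_strides) != list(range(120))
```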
I agree that your check is sufficient but not necessary. Thanks for explaining.
Where is stride order checked?
I think if the input is contiguous in the normal stride order then any reshape is possible without a copy.
Good point -- we can always treat t as f without breaking functionality. I'll try to change my code to reflect that.
Where is stride order checked?
AFAICT, we don't keep stride order in TensorView. Instead, we convert stride order to allocation domain in TensorView:
Fuser/csrc/python_frontend/fusion_record.h
Line 1351 in cc9b438
Good point -- we can always treat t as f without breaking functionality. I'll try to change my code reflecting that.
I don't think it's true any more. The frontend can request a non-contiguous stride order for a fusion output and expect the reshape that produces the output to copy data from contiguous to non-contiguous. While nvFuser wants to choose an allocation domain that preserves aliasing when possible, it shouldn't conflict with what's requested by the frontend.
csrc/optimization/alias_analysis.cpp
Outdated
bool isContiguous(const TensorView& tv) {
  NVF_ERROR(tv.nDims() == tv.getContiguity().size());
  for (const auto i : c10::irange(tv.nDims())) {
    if (!tv.axis(static_cast<int>(i))->isBroadcast() &&
Shouldn't this inspect the allocation domain instead of the leaf domain?
Good question. Wdyt, @jjsjann123 ? In most cases, the allocation domain is empty. Do we treat empty as the same as rfactor domain? (To me, this makes more sense than "the same as leaf domain" because the leaf domain describes the schedule, which is orthogonal to the memory layout).
yes and yes.
We were mistakenly using root/rfactor/leaf domain when we should have been using alloc domain instead. Sorry about the inconsistency in our code base.
I'm also patching this in TensorDomain as well as fusion record at this time. Hopefully we'll have them cleaned up slowly....
Just to clarify, since we expect to enter this function only when hasAllocation returned false, we can assert on not having allocation here and use rfactor domain instead.
PTAL the new logic.
csrc/optimization/alias_analysis.cpp
Outdated
// that the codegen can use to generate a kernel skipping unnecessary
// computation.
std::queue<const TensorView*> q;
if (!source->hasAllocation() && isContiguous(*source)) {
nitpick: can we put a comment here mentioning the specific limitations we have intentionally put here?
i.e., not supporting allocation domain, not supporting non-packed tensors.
    std::unordered_map<const TensorView*, const TensorView*>;

// Finds aliases of the fusion inputs.
AliasAnalysisResult findAliases(const Fusion& fusion);
I'm perfectly fine with us keeping this simple for now.
-> Moving forward we might want to unify how we treat aliases inside a fusion as well as outside of it.
Hint: there's some ugly code I left in on our support to batchnorm running stats update.
Yes, I noticed that other type of aliasing, which essentially makes the input and the output share the same buffer. I need to come up with a better name -- maybe call it "buffer reuse" or something...
// Maps aliases (e.g. fusion outputs) to their sources (e.g. fusion inputs).
using AliasAnalysisResult =
    std::unordered_map<const TensorView*, const TensorView*>;
I know this is a WIP.
Should we use a disjoint-set as the value, rather than a single map from TV -> TV?
i.e., in the case below, we should be able to recognize that all TVs in the fusion point to the same buffer:
const std::vector<int64_t> in_shape({2, 3, 4});
const std::vector<int64_t> inter_shape({2, 12});
const std::vector<int64_t> out_shape({24});
TensorView* in = makeContigConcreteTensor(in_shape);
fusion.addInput(in);
TensorView* inter = reshape(in, in_shape, inter_shape);
TensorView* out = reshape(inter, inter_shape, out_shape);
fusion.addOutput(out);
I guess this could make bookkeeping tricky for cases where we are slicing through the tensor and create an alias that doesn't cover the whole buffer.
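The disjoint-set idea suggested above could be sketched like this in plain Python over tensor names (not nvFuser types); tensors in one set are recognized as sharing a buffer:

```python
class AliasGroups:
    """Disjoint-set (union-find): tensors in one set share a buffer."""

    def __init__(self):
        self.parent = {}

    def find(self, tv):
        # Lazily register unseen tensors as their own group.
        self.parent.setdefault(tv, tv)
        while self.parent[tv] != tv:
            # Path halving keeps trees shallow.
            self.parent[tv] = self.parent[self.parent[tv]]
            tv = self.parent[tv]
        return tv

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# The reshape chain from the example above: in -> inter -> out.
g = AliasGroups()
g.union("inter", "in")
g.union("out", "inter")
assert g.find("out") == g.find("in") == g.find("inter")
```

As the comment notes, a plain union-find would still need extra bookkeeping for partial aliases such as slices, where the alias covers only part of the source buffer.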
Good point -- added a comment.
if (!out_tv->hasAllocation() && isContiguous(*out_tv)) {
  q.push(out_tv);
  alias_to_source[out_tv] = in_tv;
Given that we have a tensor to tensor mapping here, should we traverse the alias relationship so we'll reach the root for each alias tree?
Yes, the user is expected to do that and we can create helpers.
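Such a helper could look like this; a sketch over a plain dict rather than the C++ `AliasAnalysisResult`, with `alias_root` as a hypothetical name:

```python
def alias_root(alias_to_source, tv):
    """Follow alias links until reaching a tensor that aliases nothing,
    e.g. a fusion input at the root of the alias tree."""
    while tv in alias_to_source:
        tv = alias_to_source[tv]
    return tv

# out aliases inter, which aliases the fusion input in.
alias_to_source = {"out": "inter", "inter": "in"}
assert alias_root(alias_to_source, "out") == "in"
assert alias_root(alias_to_source, "in") == "in"
```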
With more comments and renaming.
jjsjann123 left a comment:
LGTM. Thanks for adding extra tests there.
csrc/optimization/alias_analysis.cpp
Outdated
const std::vector<IterDomain*>& allocation_domain =
    tv.getMaybeAllocationDomain();
for (size_t i = 0; i < allocation_domain.size(); i++) {
  // Broadcast and reduction dims are always contiguous because their sizes
nitpick: IIUC, isBroadcast & isReduction should be equivalent to !tv.getContiguity()[i].has_value().
So can't we simplify the logic here as
const auto opt_contig = tv.getContiguity()[i];
if (opt_contig.has_value() && opt_contig.value() == false) {
  return false;
}
    tv.getMaybeAllocationDomain();
for (size_t i = 0; i < allocation_domain.size(); i++) {
  // We skip std::nullopt contiguity. It represents a broadcast or reduction
  // dimension, which is of size 1 and always contiguous.
Would we need to check that the size is indeed 1 to make this assumption?
I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?
Would we need to check that the size is indeed 1 to make this assumption?
The reduction dimension holds the original extent, so technically it's not one but conceptually one for reasoning about contiguity... (I believe we were in the same knowledge share by @naoyam where I wondered the reason behind this design 😄 )
I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?
AFAIU, in this case, the fusion definition will rfactor the dimension of size 16 to 4x4 and reduce one of the 4s to 1. No?
I believe we were in the same knowledge share by @naoyam where I wondered the reason behind this design 😄
Ahh, right, oops; yeah, I remember you asking that.
I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?
AFAIU, in this case, the fusion definition will rfactor the dimension of size 16 to 4x4 and reduce one of the 4s to 1. No?
Would it? I would imagine it would just leave it as a 4x4 in such a case, no?
Perhaps a better example is a 35-channel tensor that was grouped by 5 to get a 7-channel tensor. Unless the strides were forced on either tensor, as I understand it nvFuser would be free to choose a stride of 64 on the input side and 8 on the output side, instead of the densely-packed 35 and 7, respectively. For the smaller tensor this is a more reasonable choice, as we waste one measly element and in turn can get nicely-coalesced loads.
Then there's truly-dynamic tensors, where we can't even know at compile time if they're contiguous. This would be the group-by-5 example above, but with an unknown number of input channels.
As I understand it (and I could easily be wrong), nullopt in the above case would mean either "nvFuser has not decided on a stride" or "nvFuser can't decide on a stride (due to dynamism)". In such a situation we'd need to be conservative and assume non-contiguity, no?
(perhaps it would've made more sense for me to have put my question on the if from line 29, instead; oops, sorry!)
Ha, I might have misled you here @wujingyue
Conceptually, nullopt should be indicating a broadcast stride. If that comes from either a size-1 broadcast or a reduction (which gives us a diminished dimension), it's safe to skip those and consider them as contiguous here.
In the meantime, if we have expanded dimensions, where we do require a stride == 0, I believe we also mark contiguity as nullopt? So for those cases, we cannot naively consider them as contiguous any more.
i.e. at::randn({4, 5}).unsqueeze(-1).expand({4, 5, 6}).reshape({40, 3})
After the expand, the last dimension is a stride-0 dimension and we cannot split that dimension any more.
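The expand example can be checked with a small pure-Python offset computation (shapes taken from the snippet above). The expanded dimension has stride 0, so 20 real elements back 120 logical ones, and no single-strided (40, 3) view can reproduce that:

```python
from itertools import product

def memory_offsets(shape, strides):
    """Memory offset of each element, visited in row-major logical order."""
    return [sum(i * s for i, s in zip(idx, strides))
            for idx in product(*(range(d) for d in shape))]

# randn(4, 5) -> unsqueeze(-1) -> expand to (4, 5, 6): strides (5, 1, 0).
offs = memory_offsets((4, 5, 6), (5, 1, 0))
assert len(offs) == 120        # 120 logical elements ...
assert len(set(offs)) == 20    # ... backed by only 20 distinct offsets.
# A zero-copy (40, 3) view would need 120 distinct, regularly-strided
# offsets, so this reshape must materialize a copy.
```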
Ouch, that hurts... Let me see what I can do. I need to detect dimensions that are an expanded broadcast.
AliasAnalysisResult alias_to_source;
for (const Val* in : fusion.inputs()) {
  if (const TensorView* in_tv = dynamic_cast<const TensorView*>(in)) {
    findAliasesOfRoot(in_tv, alias_to_source);
If we instead return the AliasAnalysisResult rather than taking a mutable alias_to_source arg, I think this might be somewhat easily parallelizable (albeit with a serialized merge step afterwards that inserts the results).
I do not think this is important today; due to some other issues the fusion groups we get are pretty small. But I could see us hitting hundreds of ops once we accept cat/slice/etc. and matmuls/convs.
Good point -- I'll add a comment: #1106
This is a follow-up to #1097. See code comments and unit tests for why it's needed. This PR also changes the way we traverse `Expr`s in a fusion. Previously, we traversed only from fusion inputs and collected only `TensorView`s that alias an input. Now, we traverse each `Expr` and capture all local aliases even if they are not aliases of any input. This change gives us more test coverage.
With this in place, I can switch to figuring out how to generate faster (or even no) kernels by leveraging aliasing. There's obviously a ton to improve. Most notably, alias analysis should recommend a non-default allocation domain to proactively make outputs aliases, and it should handle more op types.
Also, I haven't decided when/where to run it. We can run it between concretization and segmentation, or during scheduling, or both.