A bare minimum alias analysis. #1097

Merged
wujingyue merged 12 commits into main from alias
Oct 18, 2023
Conversation

@wujingyue (Collaborator) commented Oct 17, 2023

This frees me up to figure out how to generate faster (or even no) kernels by leveraging aliasing. There's obviously a ton to improve. Most notably, alias analysis should recommend a non-default allocation domain to proactively make output aliases, and it should handle more op types.

Also, I haven't decided when/where to run it. We could run it between concretization and segmentation, during scheduling, or both.

Comment on lines +59 to +63
if (!out_tv->hasAllocation() && isContiguous(*out_tv)) {
q.push(out_tv);
alias_to_source[out_tv] = in_tv;
}
Collaborator

Can you comment more here on why this guarantees that this ViewOp can be accomplished without a copy? I went through this analysis a few months ago, and I thought you needed to propagate contiguity from root to rfactor to do the check. Then, when you encounter a Merge where the outer of the two merged IterDomains is discontiguous, you know you need a copy, since a single stride can no longer represent that transformation.

Collaborator Author

I'm pretty sure this will show my ignorance of various domains, but let me think aloud 😄. I think the check here is sufficient (but not necessary). Every TensorView added to the queue has a contiguous leaf domain. So a ViewOp entering the if-then must have a contiguous root and leaf domain and the default major-to-minor memory layout. I think this guarantees the same storage can be reused. Can you think of a counter example?

Collaborator

I think if the input is contiguous in the normal stride order then any reshape is possible without a copy. However consider an irregular stride order like channels last, but still contiguous. Reshaping that to 1D is impossible without a copy.
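The channels-last observation can be checked mechanically. Below is a minimal Python sketch (not nvFuser code; the shapes and strides are hypothetical examples) that tests whether a tensor described by a (shape, strides) pair can be flattened to 1D without a copy:

```python
def zero_copy_flatten(shape, strides):
    """A reshape to 1D is a pure metadata change only if walking the
    logical index space in row-major order visits memory with stride 1."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue  # size-1 dims impose no constraint on the layout
        if stride != expected:
            return False
        expected *= size
    return True

# Row-major contiguous (2, 3, 4): strides (12, 4, 1) -> flattenable in place.
assert zero_copy_flatten((2, 3, 4), (12, 4, 1))

# Channels-last NCHW (2, 3, 4, 5): packed in memory (as NHWC), but the
# logical strides are (60, 1, 15, 3) -> flattening to 1D requires a copy.
assert not zero_copy_flatten((2, 3, 4, 5), (60, 1, 15, 3))
```

The second case is exactly the counterexample above: the tensor wastes no memory, yet no single stride vector lets its elements be read in logical order.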

Collaborator

I agree that your check is sufficient but not necessary. Thanks for explaining.

Collaborator

Where is stride order checked?

Collaborator Author

I think if the input is contiguous in the normal stride order then any reshape is possible without a copy.

Good point -- we can always treat t as f without breaking functionality. I'll try to change my code to reflect that.

Where is stride order checked?

AFAICT, we don't keep stride order in TensorView. Instead, we convert stride order to an allocation domain:

tv_output->setAllocationDomain(allocation_domain, true);

My code does check that the allocation domain is empty, meaning the default major-to-minor layout.

Collaborator Author

Good point -- we can always treat t as f without breaking functionality. I'll try to change my code reflecting that.

I don't think it's true any more. The frontend can request a non-contiguous stride order for a fusion output and expect the reshape that produces the output to copy data from contiguous to non-contiguous. While nvFuser wants to choose an allocation domain that preserves aliasing when possible, it shouldn't conflict with what's requested by the frontend.

bool isContiguous(const TensorView& tv) {
NVF_ERROR(tv.nDims() == tv.getContiguity().size());
for (const auto i : c10::irange(tv.nDims())) {
if (!tv.axis(static_cast<int>(i))->isBroadcast() &&
Collaborator

Shouldn't this inspect the allocation domain instead of the leaf domain?

Collaborator Author

Good question. Wdyt, @jjsjann123 ? In most cases, the allocation domain is empty. Do we treat empty as the same as rfactor domain? (To me, this makes more sense than "the same as leaf domain" because the leaf domain describes the schedule, which is orthogonal to the memory layout).

Collaborator

Yes and yes.
We were mistakenly using the root/rfactor/leaf domain when we should have been using the alloc domain instead. Sorry about the inconsistency in our code base.

I'm also patching this in TensorDomain as well as in the fusion record at this time. Hopefully we'll have them cleaned up slowly...

Collaborator

Just to clarify, since we expect to enter this function only when hasAllocation returned false, we can assert on not having allocation here and use rfactor domain instead.

Collaborator Author

PTAL the new logic.

// that the codegen can use to generate a kernel skipping unnecessary
// computation.
std::queue<const TensorView*> q;
if (!source->hasAllocation() && isContiguous(*source)) {
Collaborator

nitpick: can we put a comment here mentioning the specific limitations we've intentionally imposed?

i.e., not supporting an allocation domain, not supporting non-packed tensors.

Collaborator Author

Done.

std::unordered_map<const TensorView*, const TensorView*>;

// Finds aliases of the fusion inputs.
AliasAnalysisResult findAliases(const Fusion& fusion);
Collaborator

I'm perfectly fine with us keeping this simple for now.

Moving forward, we might want to unify how we treat aliases inside a fusion as well as outside of it.
Hint: there's some ugly code I left in our support for batchnorm running-stats updates.

Collaborator Author

Yes, I noticed that different type of aliasing, which essentially makes the input and the output share the same buffer. I need to come up with a better name for it -- maybe "buffer reusing" or something...


// Maps aliases (e.g. fusion outputs) to their sources (e.g. fusion inputs).
using AliasAnalysisResult =
std::unordered_map<const TensorView*, const TensorView*>;
Collaborator

I know this is a WIP.

Should we use a disjoint set as the value, rather than a single TV -> TV map?

i.e., in the case below, we should be able to recognize that all TVs in the fusion point to the same buffer:

  const std::vector<int64_t> in_shape({2, 3, 4});
  const std::vector<int64_t> inter_shape({2, 12});
  const std::vector<int64_t> out_shape({24});

  TensorView* in = makeContigConcreteTensor(in_shape);
  fusion.addInput(in);
  TensorView* inter = reshape(in, in_shape, inter_shape);
  TensorView* out = reshape(inter, inter_shape, out_shape);
  fusion.addOutput(out);

I guess this could make bookkeeping tricky for cases where we are slicing through the tensor and create an alias that doesn't cover the whole buffer.
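The disjoint-set idea can be sketched with a small union-find keyed by tensor name (plain Python pseudocode, not the nvFuser API; the names mirror the fusion above):

```python
class AliasClasses:
    """Union-find: all tensors in one class are views of the same buffer."""

    def __init__(self):
        self.parent = {}

    def find(self, tv):
        self.parent.setdefault(tv, tv)
        while self.parent[tv] != tv:
            # Path halving keeps the trees shallow.
            self.parent[tv] = self.parent[self.parent[tv]]
            tv = self.parent[tv]
        return tv

    def union(self, alias, source):
        self.parent[self.find(alias)] = self.find(source)


classes = AliasClasses()
classes.union("inter", "in")   # inter = reshape(in, in_shape, inter_shape)
classes.union("out", "inter")  # out = reshape(inter, inter_shape, out_shape)
assert classes.find("out") == classes.find("in")
```

As noted, a slice would strain this model: the alias covers only part of the source buffer, so class membership alone no longer captures the relationship.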

Collaborator Author

Good point -- added a comment.


if (!out_tv->hasAllocation() && isContiguous(*out_tv)) {
q.push(out_tv);
alias_to_source[out_tv] = in_tv;
Collaborator

Given that we have a tensor-to-tensor mapping here, should we traverse the alias relationship so we reach the root of each alias tree?

Collaborator Author

Yes, the user is expected to do that and we can create helpers.
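Such a helper amounts to chasing the map until a non-aliased tensor is reached; a hypothetical Python sketch over the TV -> TV map (string names stand in for TensorView*):

```python
def root_alias(alias_to_source, tv):
    """Follow alias links until reaching a tensor that is not itself an
    alias. Assumes the map is acyclic, which holds because aliases always
    point from outputs back toward fusion inputs."""
    while tv in alias_to_source:
        tv = alias_to_source[tv]
    return tv

alias_to_source = {"out": "inter", "inter": "in"}
assert root_alias(alias_to_source, "out") == "in"
```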

@jjsjann123 (Collaborator) left a comment

LGTM. Thanks for adding extra tests there.

const std::vector<IterDomain*>& allocation_domain =
tv.getMaybeAllocationDomain();
for (size_t i = 0; i < allocation_domain.size(); i++) {
// Broadcast and reduction dims are always contiguous because their sizes
Collaborator

nitpick: IIUC, isBroadcast & isReduction should be equivalent to !tv.getContiguity()[i].has_value().

So can't we simplify the logic here as

const auto opt_contig = tv.getContiguity()[i];
if (opt_contig.has_value() && opt_contig.value() == false) {
  return false;
}

Collaborator Author

Yes, indeed:

contiguity.emplace_back(std::nullopt);

Fixed.
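The simplified rule can be mirrored in a short Python sketch, with None standing in for std::nullopt (a sketch of the logic only, not the nvFuser implementation; the expanded-broadcast caveat discussed further down still applies):

```python
from typing import List, Optional

def is_contiguous(contiguity: List[Optional[bool]]) -> bool:
    """None marks a broadcast/reduction dim and is skipped; every
    remaining dim must be explicitly contiguous."""
    return all(c is None or c for c in contiguity)

assert is_contiguous([True, None, True])       # broadcast in the middle: OK
assert not is_contiguous([True, False, None])  # an explicitly strided dim: not OK
```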

@wujingyue wujingyue merged commit 788533b into main Oct 18, 2023
@wujingyue wujingyue deleted the alias branch October 18, 2023 19:25
tv.getMaybeAllocationDomain();
for (size_t i = 0; i < allocation_domain.size(); i++) {
// We skip std::nullopt contiguity. It represents a broadcast or reduction
// dimension, which is of size 1 and always contiguous.
Contributor

Would we need to check that the size is indeed 1 to make this assumption?

I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?

Collaborator Author

Would we need to check that the size is indeed 1 to make this assumption?

The reduction dimension holds the original extent, so technically it's not one, but it's conceptually one for reasoning about contiguity... (I believe we were in the same knowledge share by @naoyam where I wondered about the reason behind this design 😄)

I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?

AFAIU, in this case, the fusion definition will rfactor the dimension of size 16 to 4x4 and reduce one of the 4s to 1. No?

Contributor

I believe we were in the same knowledge share by @naoyam where I wondered the reason behind this design 😄

Ahh, right, oops; yeah, I remember you asking that.

I'm thinking of, say, a grouped convolution, which might reduce a 16-channel tensor to a 4-channel tensor. Might that 4-channel tensor still have an optional/currently-unresolved contiguity?

AFAIU, in this case, the fusion definition will rfactor the dimension of size 16 to 4x4 and reduce one of the 4s to 1. No?

Would it? I would imagine it would just leave it as a 4x4 in such a case, no?

Perhaps a better example is a 35-channel tensor that was grouped by 5 to get a 7-channel tensor. Unless the strides were forced on either tensor, as I understand it nvFuser would be free to choose a stride of 64 on the input side and 8 on the output side, instead of the densely-packed 35 and 7, respectively. For the smaller tensor this is a more reasonable choice, as we waste one measly element and in turn can get nicely-coalesced loads.

Then there's truly-dynamic tensors, where we can't even know at compile time if they're contiguous. This would be the group-by-5 example above, but with an unknown number of input channels.

As I understand it (and I could easily be wrong), nullopt in the above case would mean either "nvFuser has not decided on a stride" or "nvFuser can't decide on a stride (due to dynamism)". In such a situation we'd need to be conservative and assume non-contiguity, no?
(perhaps it would've made more sense for me to have put my question on the if from line 29, instead; oops, sorry!)

Collaborator

Ha, I might have misled you here, @wujingyue.

Conceptually, nullopt should indicate a broadcast stride. If that comes from either a size-1 broadcast or a reduction (which gives us a diminished dimension), it's safe to skip those and consider them contiguous here.

In the meantime, if we have expanded dimensions, where we do require a stride == 0, I believe we also mark contiguity as nullopt. So for those cases, we cannot naively consider them contiguous anymore.

i.e. at::randn(4, 5).unsqueeze(-1).expand((4, 5, 6)).reshape(40, 3)

After the expand, the last dimension is a stride-0 dimension and we cannot split that dimension any more.
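The stride-0 case can be demonstrated with the same kind of stride walk; a hypothetical Python sketch using the shape/stride pair from the example above:

```python
def dense_1d_view_possible(shape, strides):
    """True iff row-major traversal of the logical shape walks memory
    with unit stride, i.e. the tensor can be viewed as dense 1D."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue
        if stride != expected:
            return False
        expected *= size
    return True

# randn(4, 5).unsqueeze(-1).expand((4, 5, 6)) has shape (4, 5, 6) and
# element strides (5, 1, 0): the expanded dim repeats data with stride 0,
# so no stride vector can describe a dense reshape of it without a copy.
assert not dense_1d_view_possible((4, 5, 6), (5, 1, 0))
```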

Collaborator Author

Ouch, that hurts... Let me see what I can do. I need to detect dimensions that are an expanded broadcast.

AliasAnalysisResult alias_to_source;
for (const Val* in : fusion.inputs()) {
if (const TensorView* in_tv = dynamic_cast<const TensorView*>(in)) {
findAliasesOfRoot(in_tv, alias_to_source);
Contributor

If we return the AliasAnalysisResult instead of taking a mutable alias_to_source arg, I think this might be somewhat easily parallelizable (albeit with a serialized merge step afterwards that inserts the results).

I do not think this is important today; due to some other issues the fusion groups we get are pretty small. But I could see us hitting hundreds of ops once we accept cat/slice/etc. and matmuls/convs.
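The shape of that refactor might look like the following Python sketch (with a stand-in analysis function, not the actual C++ signature):

```python
from concurrent.futures import ThreadPoolExecutor

def find_aliases_of_root(root):
    """Hypothetical stand-in for the per-input analysis; the real version
    would walk the uses of `root` and return its own {alias: source} map."""
    return {root + "_out": root}

roots = ["in0", "in1", "in2"]

# Each fusion input is analyzed independently...
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(find_aliases_of_root, roots))

# ...followed by the serialized merge step that combines the maps.
alias_to_source = {}
for partial in partial_results:
    alias_to_source.update(partial)

assert alias_to_source["in1_out"] == "in1"
```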

Collaborator Author

Good point -- I'll add a comment: #1106

wujingyue added a commit that referenced this pull request Oct 23, 2023
This is a follow up to #1097. See code comments and unit tests for why
it's needed.

This PR also changed the way we traverse `Expr`s in a fusion.
Previously, we traverse from only fusion inputs and collect only
`TensorView`s that alias an input. Now, we traverse each Expr and
capture all local aliases even if they are not an alias of any input.
This change gives us more test coverage.

4 participants