
Layout propagation#1744

Closed
jjsjann123 wants to merge 42 commits into main from layout_propagation

Conversation

@jjsjann123
Collaborator

@jjsjann123 jjsjann123 commented Feb 9, 2024

Stacked PRs:
#1755 enabling layout propagation through runtime
#1744 adding layout inference pass <- this one

What's in this PR:
The pass works on a Fusion IR:
It summarizes the MemoryFormat of inputs by looking at each TensorView's allocation_domain and rfactor_domain;
It uses a predefined rule (MemoryFormatInferencer) to propagate MemoryFormat from the inputs to the entire fusion;

Note that the pass itself doesn't mutate the fusion IR. It's just a utility function that suggests ways to specify allocation domain to be used by other optimization passes.
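For illustration, the permutation bookkeeping described above can be sketched as standalone C++ (iter domains modeled as plain ints; computePermutationSketch is a hypothetical stand-in for the ir_utils::computePermutation utility, not code from this PR):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// MemoryFormat is the permutation mapping rfactor order to allocation order.
using MemoryFormat = std::vector<int64_t>;

// Returns perm such that alloc_dom[k] == rfactor_dom[perm[k]], or nullopt if
// alloc_dom is not a pure permutation of rfactor_dom.
std::optional<MemoryFormat> computePermutationSketch(
    const std::vector<int>& rfactor_dom,
    const std::vector<int>& alloc_dom) {
  if (rfactor_dom.size() != alloc_dom.size()) {
    return std::nullopt;
  }
  MemoryFormat perm;
  perm.reserve(alloc_dom.size());
  for (int id : alloc_dom) {
    auto it = std::find(rfactor_dom.begin(), rfactor_dom.end(), id);
    if (it == rfactor_dom.end()) {
      return std::nullopt; // not a pure permutation
    }
    perm.push_back(it - rfactor_dom.begin());
  }
  return perm;
}
```

e.g. an nhwc input with rfactor order [N, C, H, W] and allocation order [N, H, W, C] would be recorded as {0, 2, 3, 1}.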

  • adding simple rules to propagate layout through the Fusion IR;
  • adding a cpp test to verify the propagation rules;

Quick design doc: #1756

Future Work:

  • expanding the propagation rules to cover more operations;

@jjsjann123 jjsjann123 added the "allocation domain" label (issues related to allocation domain support) Feb 13, 2024
@jjsjann123
Collaborator Author

!build

test/utils.h Outdated

// allows overload resolution with size-1 initializer list
inline TensorView* makeSymbolicTensor(
std::initializer_list<int64_t> shape,
Collaborator Author

This API is just added to allow overload of makeSymbolicTensor({-1}, ...), which would otherwise be called into makeSymbolicTensor(size_t, ...)

Collaborator Author

note to myself. Do a quick clean up for other APIs as well!

Collaborator

C++ syntax is weird...

// TV1 has b5 -> i4 -> i3
// we see that TV0 encounters a non-broadcast iter domain first, so TV0 is the
// dominating tensor. We'll produce an output with stride order identical to
// that of TV0 in the record.
Collaborator Author

@kevinstephano This was what I was describing on the propagation rule for binary operations.

@jjsjann123
Collaborator Author

😮‍💨 I realized I went with my old implementation, and the memory order permutation is inconsistent with the stride_order in our python API.

Let me refactor that... 😱

@jjsjann123
Collaborator Author

!build

@jjsjann123
Collaborator Author

!build

@naoyam
Collaborator

naoyam commented Feb 15, 2024

@jjsjann123 I'm a bit lost with what's addressed in this PR. According to your design doc, what'll be done are:

  1. It looks up the permutation from rfactor_dom to allocation_dom on input TensorViews and records the permutation as the MemoryFormat for those tensors;
  2. The pass traverses the fusion to propagate MemoryFormat. It uses a set of propagation rules, where it computes & records the MemoryFormat of outputs from the recorded MemoryFormat of inputs;
  3. Lastly, the pass iterates through all output tensors and tries to specify their allocation domain as per the recorded MemoryFormat.

Am I correct that this PR does items 1 and 2?

Also, what are propagation rules?

It uses a set of propagation rules, where it computes & records the MemoryFormat of outputs from the recorded MemoryFormat of inputs;

private:
void handle(const UnaryOp*) override;
void handle(const BinaryOp*) override;
void handle(const BroadcastOp*) override;
Collaborator Author

@naoyam Propagation rules are specified here per operation. I'll add a note on the commit description.

@jjsjann123
Collaborator Author

jjsjann123 commented Feb 15, 2024

Am I correct that this PR does items 1 and 2?

Yes.
item 1 is done inside the inferenceMemoryFormat function, before it calls MemoryFormatInferencer to propagate it;
item 2 is done inside MemoryFormatInferencer, which propagates the memory format from the inputs to the entire fusion.
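As a rough sketch of that split (hypothetical names; tensors are modeled as ints and only a unary forwarding rule is shown, not the PR's actual classes):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

using MemoryFormat = std::vector<int64_t>;
using FormatMap = std::unordered_map<int, MemoryFormat>;

// A stand-in for a unary expr: one input tensor, one output tensor.
struct UnaryExpr {
  int in;
  int out;
};

// Step 1 (seeding the map from fusion inputs) happens before this is called;
// step 2 walks the exprs in topological order and applies the per-op rule.
FormatMap propagateSketch(
    FormatMap format_map, // seeded with the formats of fusion inputs
    const std::vector<UnaryExpr>& topo_exprs) {
  for (const auto& e : topo_exprs) {
    if (auto it = format_map.find(e.in); it != format_map.end()) {
      format_map[e.out] = it->second; // unary rule: forward the format as-is
    }
  }
  return format_map;
}
```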

@jjsjann123
Collaborator Author

Failing test seems to be coming from #1743.
cc'ing @Priya2698

@jjsjann123 jjsjann123 requested a review from wujingyue February 15, 2024 01:16
Collaborator

@wujingyue wujingyue left a comment


If it's convenient for you to git rebase and git add -p, I'd suggest separating the BinaryOp change into a different PR. That would reduce the size a lot and make review easier.

Comment on lines +50 to +52
if (auto iter = format_map_.find(in); iter != format_map_.end()) {
format_map_[out] = iter->second;
}
Collaborator

This pattern seems to appear in multiple places in this file. Consider making it a helper. Maybe something like

copyFormat(from, to);
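A minimal sketch of what that helper could look like (map keys modeled as ints instead of const TensorView*; everything here is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

using MemoryFormat = std::vector<int64_t>;
using FormatMap = std::unordered_map<int, MemoryFormat>;

// Copies the recorded format of `from` to `to`, if one exists; returns
// whether anything was copied.
bool copyFormat(FormatMap& format_map, int from, int to) {
  if (auto it = format_map.find(from); it != format_map.end()) {
    format_map[to] = it->second;
    return true;
  }
  return false;
}
```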

// e.g.
// lhs TV0 rfactor_dom [i0, i1, b2]
// 0 2 1
// rhs TV0 rfactor_dom [i3, i4, b5]
Collaborator

Suggested change
// rhs TV0 rfactor_dom [i3, i4, b5]
// rhs TV1 rfactor_dom [i3, i4, b5]

// TV0 has i1 -> b2 -> i0
// TV1 has b5 -> i4 -> i3
// we see that TV0 encounters a non-broadcast iter domain first, so TV0 is the
// dominating tensor. We'll produce an output with stride order identical to
Collaborator

so TV0 is the dominating tensor

Why are we in favor of the memory format that first hits a non-broadcast? (I suspect it's something about vectorization, but the comment wasn't clear about that)

Collaborator

I have the same question. Why is this better than just using lhs? @jjsjann123 Could you add the explanation to the code here as a comment?

Collaborator

I somehow feel that, what we should do is: if this binary op contains a broadcast concretization, then respect the one with most number of concrete IDs, otherwise, just use lhs. cc @naoyam

Collaborator

If we are propagating only in the forward direction, it seems to me that we can't really know what will be the most advantageous stride order. For example if we later do a sum on some outer dimension then it might wind up that we would have preferred that dimension to be allocated inner-most, but we would need to propagate that information backwards. If we're sticking with forward-only, why not just use the first input's stride order for the output and call it a day? If we want to chase more optimality we could consider doing an iterative optimization on the segmented fusion, allowing the schedulers to specify weighted preferences for the allocation orderings of their inputs and propagating changes to the outputs using simple rules like the one here, but that optimization is a bigger change to tackle.

Collaborator Author

Why are we in favor of the memory format that first hits a non-broadcast? (I suspect it's something about vectorization, but the comment wasn't clear about that)

This one should have been updated. I was doing this earlier when I used a different propagation rule for broadcast, so I needed this trick to propagate nhwc.

tv0 = [i0 i1 i2 i3] @ {0 2 3 1}
bias0 = [i4] @ {0} -> broadcast_bias0 [b5 i4 b6 b7] @ {0 1 2 3}

But now I feel @zasdfgbnm's suggestion makes a lot more sense instead.

I somehow feel that, what we should do is: if this binary op contains a broadcast concretization, then respect the one with most number of concrete IDs, otherwise, just use lhs.

This makes a lot more sense to me, i.e. favoring the larger tensor (hopefully more concrete IDs would lead to a larger tensor). I'll update.
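A sketch of that rule under those assumptions (broadcast flags per iter domain; names and shapes here are illustrative, not the PR's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Operand { Lhs, Rhs };

// Picks the operand with more concrete (non-broadcast) iter domains;
// defaults to lhs on a tie.
Operand pickDominatingOperand(
    const std::vector<bool>& lhs_is_broadcast,
    const std::vector<bool>& rhs_is_broadcast) {
  auto countConcrete = [](const std::vector<bool>& is_broadcast) {
    std::size_t n = 0;
    for (bool b : is_broadcast) {
      if (!b) {
        ++n;
      }
    }
    return n;
  };
  return countConcrete(rhs_is_broadcast) > countConcrete(lhs_is_broadcast)
      ? Operand::Rhs
      : Operand::Lhs;
}
```

In the nhwc bias example above, tv0 = [i0 i1 i2 i3] against broadcast_bias0 = [b5 i4 b6 b7] would pick tv0, since it has more concrete IDs.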

// we see that TV0 encounters a non-broadcast iter domain first, so TV0 is the
// dominating tensor. We'll produce an output with stride order identical to
// that of TV0 in the record.
// In the event of a tie, we'll just propagate the memory format of lhs.
Collaborator

Is

i->b->i
i->i->b

considered a tie? I.e., do you care about just the first non-broadcast or the first difference in which case the non-broadcast wins? Either case, why?

@naoyam
Collaborator

naoyam commented Feb 15, 2024

Can you please define what exactly the memory format means? Does it just mean the allocation domain?

@naoyam
Collaborator

naoyam commented Feb 15, 2024

Can you please define what exactly the memory format means? Does it just mean the allocation domain?

I found a definition for tensors with an allocation domain:

// TensorView with allocation
//   domain that's a permutation of its corresponding rfactor domain and record
//   it as the memory format of the tensor;

What about tensors with no allocation domain?

@naoyam
Collaborator

naoyam commented Feb 15, 2024

Do we just want to infer a preferred allocation domain of each output tensor?

How would you propagate an inferred format through reshape?

// unordered_map from TensorView to permutation.
//
// See details in Note [ Memory Format Propagation ]
std::unordered_map<const TensorView*, MemoryFormat> inferenceMemoryFormat(
Collaborator

nit

Suggested change
std::unordered_map<const TensorView*, MemoryFormat> inferenceMemoryFormat(
std::unordered_map<const TensorView*, MemoryFormat> inferMemoryFormat(

std::unordered_map<const TensorView*, MemoryFormat>& format_map_;
};

// UnaryOp propagation forwards the memory format from input to output
Collaborator

What if the output has an allocation domain? Shouldn't the permutation be calculated here too?

Collaborator Author

I made the decision to limit the scope of the pass to only propagate from inputs to outputs. So any intermediate tensor with an allocation domain would just be ignored.

Now I feel @zasdfgbnm's comment (is this just a pass, or an actual optimization thing?) is quite on point. A real optimization run should have considered existing allocation domains on intermediates.


namespace nvfuser {

using MemoryFormat = std::vector<int64_t>;
Collaborator

Should we just call this StrideOrder?

Collaborator Author
@jjsjann123 jjsjann123 Feb 15, 2024

This is a messy topic.

I was avoiding the term StrideOrder, because that's used in our python API. I want our python API to match integration's semantics of StrideOrder (i.e., an nhwc tensor would be written as [3, 0, 2, 1]).

Meanwhile, the format notation used in codegen would mark an nhwc tensor as [0, 2, 3, 1]. The reason we want that is that it looks more consistent with our setAllocationDomain API.

tv0->setAllocationDomain({tv0->axis(0), tv0->axis(2), tv0->axis(3), tv0->axis(1)}, true);
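For illustration, the relation between the two notations can be sketched as follows (assuming the codegen permutation lists rfactor axes in allocation order, outermost first, while stride_order assigns each rfactor axis its stride rank with larger meaning outer; this is my reading of the comment above, not code from the PR):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Converts a codegen-style permutation (rfactor axes in allocation order,
// outermost first) into a python-API-style stride_order.
std::vector<int64_t> permutationToStrideOrder(
    const std::vector<int64_t>& perm) {
  const int64_t n = static_cast<int64_t>(perm.size());
  std::vector<int64_t> stride_order(perm.size());
  for (int64_t k = 0; k < n; ++k) {
    // perm[k] is the k-th outermost axis, so n-1-k axes are inner to it.
    stride_order[perm[k]] = n - 1 - k;
  }
  return stride_order;
}
```

With this reading, the nhwc permutation {0, 2, 3, 1} maps to stride_order {3, 0, 2, 1}, matching the two notations quoted above.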

Collaborator

Interesting... Could you add this note to the code as a comment?

Collaborator

Why is this file placed inside csrc/optimization? Is the layout inference an "optimization"? Should we just call it passes or something like that?

Collaborator

One could argue this is an optimization, but I support changing the name since some other passes are not necessarily optimizing. passes might be too generic as there is already device_lower/pass. The debug dump option is fusion_ir_preseg and these are really the last thing before segmentation, so what about preseg_passes?

Collaborator

preseg_passes works for me. Or even simpler, just preseg. I have no preference over preseg_passes vs preseg.


// BinaryOp propagation tries to merge the memory format of both inputs
//
// 1. when only one operand has a recorded memory format, it forwards
Collaborator

Is this possible? I think exprs are visited in topological order. Should we just NVF_ERROR(both operands have recorded memory formats)?

Collaborator Author

For inputs without an allocation domain, we're leaving them as empty, which sounds like a bad idea.

Meanwhile, this could still happen for tensors created with factory methods. Since we are only recording the memory format of input tensors, I don't want that to affect the output memory format.

This resonates with @jacobhinkle's other comment on whether we should have backward propagation as well.


// e.g. TV0 rfactor domain [i0, i1, i2]
// alloc domain [i0, i2, i1]
// memory format 0, 2, 1
std::unordered_map<const TensorView*, MemoryFormat>& format_map_;
Collaborator

Question: If a tensor has [I1, r2, b3, I4], should the MemoryFormat be 2d, 3d, or 4d?

Collaborator Author

I haven't touched that yet.

But I think it should be 3d here, i.e. we'll want to exclude the reduction iterdomain, since it doesn't help resolve propagation with a binary op. We can probably just leave the reduction iterdomain on the left of the allocation domain... or better yet, maybe we should just remove it from the allocation domain, since it doesn't carry any real meaning.
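A sketch of the exclusion step being discussed (iter domains modeled as small structs; noReductionsSketch is a hypothetical analogue of TensorDomain::noReductions):

```cpp
#include <cassert>
#include <vector>

struct FakeIterDomain {
  int id;
  bool is_reduction;
};

// Drops reduction iter domains before any permutation is computed, so a
// [I1, r2, b3, I4] tensor contributes a 3d format.
std::vector<FakeIterDomain> noReductionsSketch(
    const std::vector<FakeIterDomain>& dom) {
  std::vector<FakeIterDomain> out;
  for (const auto& d : dom) {
    if (!d.is_reduction) {
      out.push_back(d);
    }
  }
  return out;
}
```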


namespace {

class MemoryFormatInferencer : public OptOutConstDispatch {
Collaborator

Is there any reason for not making this a subclass of IterVisitor?

Collaborator Author

Definitely should have used that one instead. Thanks 🙇

for (auto tv : ir_utils::filterByType<TensorView>(fusion->inputs())) {
std::optional<MemoryFormat> permutation = ir_utils::computePermutation(
TensorDomain::noReductions(tv->getMaybeRFactorDomain()),
tv->getMaybeAllocationDomain());
Collaborator

Should this be TensorDomain::noReductions(tv->getMaybeAllocationDomain())? IIRC allocation domains do have these reductions, although it makes no sense to do so.

Collaborator

Also, should we make sure that reductions in the allocation domain are correctly handled?


Collaborator

@jacobhinkle jacobhinkle left a comment


I am trying to understand how propagation can be more useful than the default (or arbitrary rules like using the first input's stride order) if we are only propagating in the forward direction.



@jjsjann123
Collaborator Author

Closing this PR since we are handling this in #1788 #1790 #1792.

@jjsjann123 jjsjann123 closed this Feb 19, 2024