
IdModel-based indexing: broadcast indexing #2353

Merged
naoyam merged 30 commits into main from idmodel_indexing_broadcast on Jun 8, 2024

Conversation

Collaborator

@naoyam naoyam commented Jun 6, 2024

Stacked on top of #2344. Adds support for broadcast indexing with loop promotion. The main change is simply the use of promoted domains in loop and allocation domains.

@naoyam naoyam added the idmodel label Jun 6, 2024
@naoyam naoyam marked this pull request as ready for review June 6, 2024 04:08
@naoyam naoyam requested a review from jacobhinkle June 6, 2024 04:10
Collaborator (Author)

naoyam commented Jun 6, 2024

CC: @zasdfgbnm

Base automatically changed from idmodel_indexing_part1 to main June 6, 2024 20:02
@jacobhinkle (Collaborator) left a comment:

LGTM

}

bool TensorIndexer::shouldUseZeroIndex(const ValGroup& loop_group) const {
// All loops in this set are non-parallel, non-concretized broadcast
Collaborator:

So if all the axes are broadcast then we should use 0 index, or if the promoted loop has extent 1 (and is not partial). How does "non-parallel" affect this check?

Collaborator (Author):

Ah, that comment should have been removed. I had some other code above this line. Thanks for catching it.

// Assume consumer-based indexing. Needs to revisit for ops like
// scatter
return ir_utils::getTvOutput(expr)->getLeafDomain();
auto loop_domains = ir_utils::getTvOutput(expr)->getLeafDomain();
Collaborator:

Another place where we assume all outputs have same domain; in this case leaf domain.

Collaborator (Author):

Yes. I think lifting that restriction would be quite challenging. We would need to change many things, including expression sorting, loop generation, etc.

const Expr* expr,
const std::vector<IterDomain*>& index_domains) const;

// Check if the loop index of a a loop group should be always
Collaborator:

Suggested change
// Check if the loop index of a a loop group should be always
// Check if the loop index of a loop group should be always

#include <ops/all_ops.h>
#include <scheduler/utils.h>

#include <functional>
Collaborator:

std::forward is defined in <utility> so <functional> might not be needed.


template <typename... Args>
Val* addExpr(Args&&... args) {
return SimplifyingIrBuilder::addExpr(std::forward<Args>(args)...);
Collaborator:

These do help readability. If you're planning to do a lot of structural checking of indices we could consider using user-defined literals like what @zasdfgbnm used in test_expr_simplifier.cpp.

Comment on lines +380 to +384
if (std::all_of(loop_group->begin(), loop_group->end(), [](Val* val) {
return val->as<IterDomain>()->isBroadcast();
})) {
return true;
}
Collaborator:

Is this sufficient?

  auto leaf_id =
      getLoopPromotion(loop_group->front()->as<IterDomain>(), id_model_);
  leaf_id->isBroadcast();

// Trivial loop
auto leaf_id =
getLoopPromotion(loop_group->front()->as<IterDomain>(), id_model_);
if (!leaf_id->maybePartial() && simplifyExpr(leaf_id->extent())->isOneInt()) {
Collaborator:

Do we support partial IterDomains?

Collaborator (Author):

No. I think I just put this mostly by following what we have in kir::ForLoop::isTrivial before the shift removal. Now that it's removed, this seems more confusing than necessary. I'll remove it.

if (!is_loop) {
continue;
}
allocation_domain = getLoopPromotion(allocation_domain, id_model);
Collaborator:

isPartitionedLoop uses the parallel type of id instead of the getParallelType of the loop group of id, can this be a problem? Similarly, in line 117 above, we are also using loop_id->getParallelType() instead of the parallel type of the loop group. IIRC, if the loop promotion id is replayed, we will not set its parallelization type.

Collaborator:

If I have

smem_tv[b, I1] ca_pos(1)
tv[TIDx, I1] = smem_tv[b, I1] + concrete_tv[I0, I1]

then should smem_tv be allocated as [I0, I1] or [I1]?

How about

smem_tv[bTIDx, I1] ca_pos(1)
tv[TIDx, I1] = smem_tv[bTIDx, I1] + concrete_tv[I0, I1]

?

Collaborator (Author):

Thank you. That's a real problem. We should either always look at a loop group or just propagate parallel types to all IDs, including promotion domains. I think the latter is a simpler solution. Will work on it.

Collaborator (Author):

If I have

smem_tv[b, I1] ca_pos(1)
tv[TIDx, I1] = smem_tv[b, I1] + concrete_tv[I0, I1]

then should smem_tv be allocated as [I0, I1] or [I1]?

How about

smem_tv[bTIDx, I1] ca_pos(1)
tv[TIDx, I1] = smem_tv[bTIDx, I1] + concrete_tv[I0, I1]

?

In both cases, the allocation size should be [I0, I1], right?

Collaborator (Author):

Maybe not. In the first case, [I1] should be enough as long as a proper predicate is added.

Collaborator (Author):

I think what we currently do is we don't inline broadcast domains like I0 of smem_tv. Let me check.

Collaborator (Author):

This is what I was referring to:

https://github.com/NVIDIA/Fuser/blob/main/csrc/tensor_view.cpp#L167-L170

But that only applies to innermost broadcast domains, so in this case it is inlined.

Collaborator (Author):

The current lowering only allocates [I1]. More specifically, when a domain is a broadcast domain, it's not allocated even when it's promoted. I'd keep this behavior as is.

Collaborator (Author):

Looking at the code again, this should already be what happens (with #2371). Pure broadcast domains should be filtered out by !mayRequireAllocation. I'll add a test.

Collaborator (Author):

Added.

Comment on lines +140 to +142
// Loop promotion may affect allocations. Promotions of intermediate
// domains may not be defined correctly. Only consider loop domains
// for now.
Collaborator:

[schedule diagram attached]
What if we have the above schedule? How should we handle it? Should we just raise an error, allocate the tv with broadcasting as I0*I1, or allocate it as 5?

Collaborator (Author):

When the allocation domain is not the loop domain, we don't have any logic other than fully allocating bxI1. If we want to just allocate 5, we could do it by setting the allocation domain as the loop domain.

I wonder what the domains and parallelization would look like with TMA.

I believe this is more of a question on what the allocation domain should be. I think that ideally getAllocationDomains here should be just a trivial function call to tv->getAllocationDomain and each tensor should always have its correct allocation domains.

Collaborator:

I believe this is more of a question on what the allocation domain should be.

I agree. I think we need a restriction that each ID in the allocation domain must be either fully inlined or fully not inlined. It cannot be an ID coming from a merge of an inlined ID with an uninlined ID, or an ID that is the parent of a split whose outer is inlined but whose inner is not.

I wonder what the domains and parallelization would look like with TMA.

In my mental model, it is similar to above: IDs in the allocation domain must be either a tile or a non-tile, it can not be a mix of both. However, in practice, for now, even if we have a mix, the code should still work (thanks to some existing hack in our system? and the same hack will also make the above example just allocate 5 today?)

Collaborator (Author):

I believe this is more of a question on what the allocation domain should be.

I agree. I think we need a restriction that, each ID in the allocation domain must be either fully inlined or fully not inlined. It can not have an ID coming from a merge of an inlined ID with an uninlined ID, or an ID who is the parent of a split whose outer is inlined but inner not.

Yeah, I think you could say that the mechanism of promotion is making partially inlined domains fully inlined. In the above case, the innermost 5 domain of the broadcast tensor is promoted to the innermost 5 domain of the non-broadcast tensor, meaning it's effectively fully inlined.

I think we have a reasonable understanding of promotions of loop domains. Can we propagate promotions to allocation domains that are not between logical and loop? I guess we also have a similar problem of propagating parallel types from loop domains to allocation domains.

I wonder what the domains and parallelization would look like with TMA.

In my mental model, it is similar to above: IDs in the allocation domain must be either a tile or a non-tile, it can not be a mix of both. However, in practice, for now, even if we have a mix, the code should still work (thanks to some existing hack in our system? and the same hack will also make the above example just allocate 5 today?)

As long as the allocation domain of the tensor is just the loop domain, indexing is trivial thanks to the loop promotion. How it's implemented in the current main branch isn't that different conceptually, but the implementation is quite convoluted since, for example, it always traverses back to logical domains, whereas in this case we can just directly index the loop (=allocation) domains.

@naoyam naoyam merged commit 8443c26 into main Jun 8, 2024
@naoyam naoyam deleted the idmodel_indexing_broadcast branch June 8, 2024 02:17
naoyam added a commit that referenced this pull request Jun 8, 2024
Stacked on top of #2353 

Small changes to allow indexing of tensors with DIDx domains.

CC: @zasdfgbnm @cowanmeg @samnordmann @wujingyue