
Handling allocation domain of the input TensorViews in the matmul scheduler #2309

Merged
protonu merged 28 commits into main from pbasu_experiment_alloc_domai
Jun 4, 2024

Conversation

@protonu
Collaborator

@protonu protonu commented May 28, 2024

In this PR we extend the matmul scheduler to support inputs with allocation domains.

We add two LoadStoreOps to each of the fusion inputs (tv_a and tv_b). The first op loads the input to shared memory and propagates the input's allocation domain. The second op reads from shared memory into registers and does not propagate the allocation domain, since the scheduler is responsible for setting the allocation domain of the register buffers. Depending on the difference between the (maybe) allocation domains of the producer and consumer of this second LoadStoreOp, we may perform a transposed load when reading into registers.

![image](https://github.com/NVIDIA/Fuser/assets/10635897/89395990-9b85-4ce1-8e7d-006e43a86b85)

See also #2315.
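The caching structure described above can be sketched with a toy model (plain Python, not the nvFuser API; `ToyTensor`, `cache_after`, and the axis names are all illustrative). The shared-memory copy inherits the input's allocation domain, while the register copy falls back to the logical (root) order:

```python
from dataclasses import dataclass

# Toy stand-in for a TensorView: just a name and an axis order in memory.
@dataclass
class ToyTensor:
    name: str
    alloc_domain: tuple  # e.g. ("K", "M") for an operand stored K-major

ROOT_ORDER = ("M", "K")  # logical (root) axis order of the operand

def cache_after(tv, propagate_allocation):
    """Create a cached copy of tv; optionally inherit its allocation domain."""
    alloc = tv.alloc_domain if propagate_allocation else ROOT_ORDER
    return ToyTensor(name=tv.name + "_cache", alloc_domain=alloc)

tv_a = ToyTensor("tv_a", alloc_domain=("K", "M"))        # input stored transposed
a_smem = cache_after(tv_a, propagate_allocation=True)    # smem load keeps the layout
a_reg = cache_after(a_smem, propagate_allocation=False)  # register read: scheduler resets it
```

With these toy rules, `a_smem` keeps the `("K", "M")` layout while `a_reg` resets to `("M", "K")`; the mismatch between the two is what later triggers a transposed smem-to-register load.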

Collaborator

@jacobhinkle jacobhinkle left a comment


It seems there are a few changes here:

  • Don't propagate allocation domain when using cacheAfter on smem buffers, since we need the loaded register buffers to have allocation domains matching their root domains.
  • Change scheduleLdMatrix to check consumer/producer innermost allocation ID to see whether transpose is needed. Previously this was signalled by the LoadStoreOpType on that op.
  • Use allocation domain instead of root domain for orderTiledConcreteIdAsRoot. This is called only on shared memory TVs, and we now have possibly non-trivial allocation domains on those tensors.

I think this generally seems fine. I do have a question that we can address in the future: how should we handle cases where a transposed operand has a prologue that comes before the transpose? It seems we still rely on the smem->register load for the transpose, but in such a case that load will come after the prologue.
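The check in the second bullet can be sketched as follows (a hypothetical helper, not the real scheduleLdMatrix code; allocation domains are modeled as tuples of axis names):

```python
def needs_transposed_load(producer_alloc, consumer_alloc):
    """True when the innermost allocation axes of producer and consumer differ,
    i.e. the smem->register load has to transpose."""
    return producer_alloc[-1] != consumer_alloc[-1]

# Operand K-major in smem but read M-major into registers: transpose needed.
assert needs_transposed_load(("M", "K"), ("K", "M"))
# Innermost axes match: plain load suffices.
assert not needs_transposed_load(("M", "K"), ("N", "K"))
```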

```cpp
    {consumer->getMaybeAllocationDomain().back()});

auto ids = ir_utils::filterByType<IterDomain>(vals);
auto idsOnPath = std::vector<IterDomain*>(ids.begin(), ids.end());
```
Collaborator


I don't think this line is needed, is it? Just use ids instead of idsOnPath. Also a nit: const on all these variables.

Collaborator Author


I sort of based it on this:

Fuser/csrc/ir/utils.cpp

Lines 738 to 748 in 5e0c89b

```cpp
std::vector<IterDomain*> allIDsOf(const TensorView* tv) {
  const auto& root_domain = tv->getRootDomain();
  const auto& domain = tv->getLeafDomain();
  // Grab all values in the history of the tensor view's domain
  auto all_vals = DependencyCheck::getAllValsBetween(
      {root_domain.begin(), root_domain.end()}, {domain.begin(), domain.end()});
  // Filter so we only have iteration domains (ignore Ints used in split)
  auto all_ids = ir_utils::filterByType<IterDomain>(all_vals);
  return std::vector<IterDomain*>(all_ids.begin(), all_ids.end());
}
```


```cpp
// Get all the IDs from the innermost ID of the allocation domain of
// the consumer to the root domain of the consumer.
auto vals = DependencyCheck::getAllValsBetween(
    {consumer->getRootDomain().begin(), consumer->getRootDomain().end()},
    {consumer->getMaybeAllocationDomain().back()});
```
Collaborator


Do you need to filter out broadcast and reduction domains?
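The filtering the reviewer suggests could look like this sketch (illustrative Python, not nvFuser IR; the real code would presumably query IterDomain's broadcast/reduction predicates on the collected vals):

```python
from dataclasses import dataclass

# Toy stand-in for an IterDomain with its broadcast/reduction flags.
@dataclass
class ToyIterDomain:
    name: str
    is_broadcast: bool = False
    is_reduction: bool = False

def concrete_ids(ids):
    """Keep only concrete iteration domains, dropping broadcast and reduction IDs."""
    return [i for i in ids if not (i.is_broadcast or i.is_reduction)]

ids = [
    ToyIterDomain("iM"),
    ToyIterDomain("bN", is_broadcast=True),
    ToyIterDomain("rK", is_reduction=True),
]
assert [i.name for i in concrete_ids(ids)] == ["iM"]
```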

@protonu protonu force-pushed the pbasu_experiment_alloc_domai branch from 5e0c89b to 32f15f7 on May 31, 2024 22:29
@zasdfgbnm
Collaborator

This PR needs a rebase so that the changes in #2315 are excluded from the diff of this PR.

@protonu protonu force-pushed the pbasu_experiment_alloc_domai branch from 32f15f7 to 7b0fd07 on June 3, 2024 18:05
@protonu
Collaborator Author

protonu commented Jun 3, 2024

!build

@protonu
Collaborator Author

protonu commented Jun 3, 2024

!build

@protonu
Collaborator Author

protonu commented Jun 3, 2024

!build

@protonu protonu changed the title from "[WIP] Handling allocation domain of the input TensorViews in the matmul scheduler" to "Handling allocation domain of the input TensorViews in the matmul scheduler" Jun 3, 2024
@protonu protonu requested review from jacobhinkle and zasdfgbnm June 3, 2024 19:50
@protonu protonu marked this pull request as ready for review June 3, 2024 19:51
@protonu protonu merged commit 4b427fb into main Jun 4, 2024
@protonu protonu deleted the pbasu_experiment_alloc_domai branch June 4, 2024 05:22
zasdfgbnm pushed a commit that referenced this pull request Jun 5, 2024
…eduler (#2309)
