
Matmul, enable epilogue input vectorization#807

Closed
drzejan2 wants to merge 3 commits into main from ab/matmul_epilogue_smem_input_vectorization

Conversation

@drzejan2
Contributor

@drzejan2 drzejan2 commented Aug 29, 2023

This PR solves the reported issue #682.

The scope of this PR:

  • remove the earlier prepared WAR that blocked vectorization whenever
    the smem epilogue was enabled,
  • introduce a basic analysis for deciding epilogue input vectorization,
  • fix a number of gcc 9.4 build errors,
  • update the tools around the matmul scheduler so there is access to
    the batch iterdomain / its extent.

Verified on tu102, a100 and ad102 GPUs.

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 6 times, most recently from 7016544 to ac703be on August 31, 2023 14:27
@drzejan2 drzejan2 marked this pull request as ready for review August 31, 2023 14:27
@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 3 times, most recently from b62fcc1 to 066300b on August 31, 2023 16:32
@drzejan2
Contributor Author

!build

Collaborator

@liqiangxl liqiangxl left a comment

Maybe we should add a test case to verify the vectorization is as expected.

const auto& tvs_with_vec = scheduler_utils::getInputsOutputsWithInnerDim(
    reference_tv, true /*inner_only*/, true /*vectorize_pass*/);

std::vector<TensorView*> epilogue_input_tvs_with_vec = {};
Collaborator

epilogue_input_tvs_with_vec.reserve(tvs_with_vec.size())

Contributor Author

For this use case I prefer to rely on the initial size of the vector set during construction and, only if there is not enough space, let the STL vector implementation handle the resize (it usually doubles the capacity when the limit is reached).

Details:

The reason is that getInputsOutputsWithInnerDim(..) looks for vectorizable TVs among the fusion's inputs and outputs, while for us the scope is limited to fusion inputs with the INPUT_C role, that is, additional inputs that are used in the epilogue. I explicitly avoid the MMA inputs, as they are processed separately.

As a first example, for a vanilla matmul (D = A x B) this function can return up to 2 TVs (A and B; D is our reference here). Neither of them can be considered, so we end up with an empty epilogue_input_tvs_with_vec.

As a second, more advanced example, for the fusion D = A x B + beta * C the getInputsOutputsWithInnerDim(..) call may return 3 TVs (A, B and C), of which only C will be further considered.

With reserve we would most likely incur unneeded memory operations (allocation, copy and deallocation of the previous buffer). If we were dealing with TensorView objects directly then it might be worth considering, but here we operate on pointers.

Collaborator

I know it is not a big deal in this case since the number of added elements is very small. I am just a little confused by "With reserve we will have most likely unneeded memory operations" (I think reserve is used to avoid these) and "I prefer to rely on the initial size of vector set during construction" (I think the initial size is 0).

Here is my logic:
Initially, without using reserve, the vector is created with both size and capacity set to 0. As elements are added to the vector, its capacity is increased automatically, typically doubling each time: from 1 to 2, then 4, 8, and so on. Each increase in capacity may involve additional memory operations, as the existing elements need to be copied to a new location and the old memory needs to be deallocated. Is there any hole in this logic?

Contributor Author

Thanks @liqiangxl for the reply and for sharing the details. I double-checked what the standard says about the initial size of the space allocated for std::vector: it is not defined, but most current implementations do start with 0, so your logic has no holes there. What I wrote was based on experiments I did in the past, but this appears to have changed. I will update the implementation to follow your advice.

Thanks!

return {epilogue_input_tvs_with_vec, epilogue_inputs};
}

std::set<TensorView*> sink(
Collaborator

The order doesn't seem to matter; std::unordered_set could be used.

Contributor Author

Ok, I will change it. You are right: with a rather small number of elements (<100), inserting into / re-balancing a tree won't be needed.

Contributor Author

In fact, I moved to the approach you mentioned in the other comment, where I search for elements in the vector with std::find, so now no set / unordered_set is needed. This indeed simplifies the code.

Thanks!


for (auto* tv : epilogue_inputs) {
  const auto status = sink.insert(tv);
  if (status.second) {
Collaborator

These can be combined as if (sink.insert(tv).second).

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch from 0ab0853 to 881935c on September 1, 2023 11:11
@drzejan2
Contributor Author

drzejan2 commented Sep 1, 2023

Maybe we should add a test case to verify the vectorization is as expected.

@liqiangxl

I thought about this; the initial thought was that removing the WAR plus the current code passing on ad102 GPUs is a good starting point. I will think about a real test that will have vectorization enabled (a bias that operates on columns rather than rows?), but first I need to stabilize my dev environment. Unfortunately this blocks me from checking whether the latest changes are correct for the existing test cases.

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 2 times, most recently from b301a65 to 63c4d09 on September 1, 2023 15:41
Collaborator

@zasdfgbnm zasdfgbnm left a comment

Generally looks good. Left some minor comments. I agree with @liqiangxl that we should have some test on vectorization. A good reference on how to check the vectorization size could be ResizeTest.SliceVectorization.

partitionEpilogueInputsByParallelizationModes(
    TensorView* reference_tv,
    const std::vector<TensorView*>& epilogue_inputs) {
  const auto& tvs_with_vec = scheduler_utils::getInputsOutputsWithInnerDim(
Collaborator

Is this a reference to a temporary object? getInputsOutputsWithInnerDim returns a std::vector<TensorView*> by value, instead of returning a reference.

Comment on lines +651 to +645
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          epilogue_input_tvs_with_vec.cbegin(),
          epilogue_input_tvs_with_vec.cend(),
          tv) == epilogue_input_tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  }
}
Collaborator

Would the logic be cleaner if we made it like below:

Suggested change
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          epilogue_input_tvs_with_vec.cbegin(),
          epilogue_input_tvs_with_vec.cend(),
          tv) == epilogue_input_tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  }
}
std::vector<TensorView*> epilogue_input_tvs_with_vec = {};
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          tvs_with_vec.cbegin(),
          tvs_with_vec.cend(),
          tv) == tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  } else {
    epilogue_input_tvs_with_vec.push_back(tv);
  }
}

Then I think that's enough for the entire function.

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 2 times, most recently from 40fb98c to 9d8fdae on September 6, 2023 05:07
- expose batch iterdomain through tools used by heuristics in matmul
  scheduler,
- remove WAR for unaligned memory access when smem epilogue is enabled
  and vectorization parallelization mode is propagated,
- introduce infrastructure for vectorization analysis in matmul
  scheduler,
@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch from 9d8fdae to 03b1cd7 on September 6, 2023 05:28
jacobhinkle added a commit that referenced this pull request May 13, 2024
This change collects the maximum supported vectorization size for A, B,
and "epilogue" and attaches these to `MatmulParams`. In case any "C"
role tensors like bias are provided, they are considered when computing
epilogue vectorization. This information becomes available to heuristic
plugins also.

Fixes #2083. Fixes #983. Related to existing PR #807. (Related to issue
#682 but I'm not sure yet if this approach will fix it).

---------

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
@naoyam
Collaborator

naoyam commented Jul 24, 2025

Closing as there has been no activity in the last two years. Reopen it if needed.

@naoyam naoyam closed this Jul 24, 2025

4 participants