
Matmul, enable epilogue input vectorization#807

Closed
drzejan2 wants to merge 3 commits into main from ab/matmul_epilogue_smem_input_vectorization

Conversation

@drzejan2
Contributor

@drzejan2 drzejan2 commented Aug 29, 2023

This PR solves the reported issue #682.

The scope of this PR:

  • remove the earlier prepared WAR that blocked vectorization whenever
    the smem epilogue was enabled,
  • introduce a basic analysis for deciding epilogue input vectorization,
  • fix a number of gcc 9.4 build errors,
  • update the tools around the matmul scheduler so there is access to
    the batch iterdomain / its extent.

Verified on tu102, a100 and ad102 GPUs.

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 6 times, most recently from 7016544 to ac703be on August 31, 2023 14:27
@drzejan2 drzejan2 marked this pull request as ready for review August 31, 2023 14:27
@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 3 times, most recently from b62fcc1 to 066300b on August 31, 2023 16:32
@drzejan2
Contributor Author

!build

Collaborator

@liqiangxl liqiangxl left a comment

Maybe we should add a test case to verify the vectorization is as expected.

const auto& tvs_with_vec = scheduler_utils::getInputsOutputsWithInnerDim(
    reference_tv, true /*inner_only*/, true /*vectorize_pass*/);

std::vector<TensorView*> epilogue_input_tvs_with_vec = {};
Collaborator

epilogue_input_tvs_with_vec.reserve(tvs_with_vec.size())

Contributor Author

For this use case I prefer to rely on the initial size of the vector set during construction and, only if there is not enough space, let the STL vector implementation handle the resize (it usually doubles the capacity when the limit is reached).

Details:

The reason is that getInputsOutputsWithInnerDim(..) looks for vectorizable TVs among the fusion's inputs and outputs, while for us the scope is limited to fusion inputs with the INPUT_C role, that is, additional inputs that are used in the epilogue. I explicitly avoid the MMA inputs, as they are processed separately.

As a first example, for a vanilla matmul (D = A x B) this function can return up to 2 TVs (A and B; D is our reference here). Neither of them can be considered, so we end up with an empty epilogue_input_tvs_with_vec.

As a second, more advanced example, for the fusion D = A x B + beta * C the getInputsOutputsWithInnerDim(..) call may return 3 TVs (A, B and C), of which only C will be further considered.

With reserve we would most likely incur unneeded memory operations (allocation, copy and deallocation of the previous buffer). If we were dealing with TensorView objects directly then it might be worth considering, but here we operate on pointers.

Collaborator

I know it is not a big deal in this case since the number of added elements is very small. I am just a little confused by "With reserve we will have most likely unneeded memory operations" (I think reserve is used to avoid these) and "I prefer to rely on the initial size of vector set during construction" (I think the initial size is 0).

Here is my logic:
Initially, without using reserve, the vector is created with both size and capacity set to 0. As elements are added to the vector, its capacity is increased automatically, typically doubling each time: from 1 to 2, then 4, 8, and so on. Each increase in capacity may involve additional memory operations, as the existing elements need to be copied to a new location and the old memory needs to be deallocated. Is there any hole in this logic?

Contributor Author

Thanks @liqiangxl for the reply and for sharing the details. I double-checked what the standard says about the initial size of the space allocated for std::vector: it is not defined, but most current implementations do start with 0, so your logic has no holes there. What I wrote was based on experiments I did in the past, but this appears to have changed. I will update the implementation to follow your advice.

Thanks!

return {epilogue_input_tvs_with_vec, epilogue_inputs};
}

std::set<TensorView*> sink(
Collaborator

The order doesn't seem to matter; std::unordered_set could be used.

Contributor Author

Ok, I will change it. You are right: with a rather small number of elements (<100), inserting into / re-balancing a tree won't be needed.

Contributor Author

In fact, I moved to the approach you mentioned in the other comment, where I search for elements in the vector with std::find, so now no set / unordered_set is needed. This indeed simplifies the code.

Thanks!


for (auto* tv : epilogue_inputs) {
  const auto status = sink.insert(tv);
  if (status.second) {
Collaborator

These can be combined as if (sink.insert(tv).second).

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch from 0ab0853 to 881935c on September 1, 2023 11:11
@drzejan2
Contributor Author

drzejan2 commented Sep 1, 2023

Maybe we should add a test case to verify the vectorization is as expected.

@liqiangxl

I thought about this; the initial thought was that removing the WAR plus the current code passing on ad102 GPUs is a good starting point. I will think about a real test that will have vectorization enabled (a bias that operates on columns rather than rows?), but first I need to stabilize my dev environment. Unfortunately this blocks me from checking whether the latest changes are correct for the existing test cases.

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 2 times, most recently from b301a65 to 63c4d09 on September 1, 2023 15:41
Collaborator

@zasdfgbnm zasdfgbnm left a comment

Generally looks good. Left some minor comments. I agree with @liqiangxl that we should have some test on vectorization. A good reference on how to check the vectorization size could be ResizeTest.SliceVectorization.

partitionEpilogueInputsByParallelizationModes(
    TensorView* reference_tv,
    const std::vector<TensorView*>& epilogue_inputs) {
  const auto& tvs_with_vec = scheduler_utils::getInputsOutputsWithInnerDim(
Collaborator

Is this a reference to a temporary object? getInputsOutputsWithInnerDim returns a std::vector<TensorView*> by value, instead of returning a reference.

Comment on lines +651 to +645
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          epilogue_input_tvs_with_vec.cbegin(),
          epilogue_input_tvs_with_vec.cend(),
          tv) == epilogue_input_tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  }
}
Collaborator

Would the logic be cleaner if we made it like below:

Suggested change
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          epilogue_input_tvs_with_vec.cbegin(),
          epilogue_input_tvs_with_vec.cend(),
          tv) == epilogue_input_tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  }
}
std::vector<TensorView*> epilogue_input_tvs_with_vec = {};
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          tvs_with_vec.cbegin(),
          tvs_with_vec.cend(),
          tv) == tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  } else {
    epilogue_input_tvs_with_vec.push_back(tv);
  }
}

Then I think that's enough for the entire function.

@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch 2 times, most recently from 40fb98c to 9d8fdae on September 6, 2023 05:07
- expose batch iterdomain through tools used by heuristics in matmul
  scheduler,
- remove WAR for unaligned memory access when smem epilogue is enabled
  and vectorization parallelization mode is propagated,
- introduce infrastructure for vectorization analysis in matmul
  scheduler,
@drzejan2 drzejan2 force-pushed the ab/matmul_epilogue_smem_input_vectorization branch from 9d8fdae to 03b1cd7 on September 6, 2023 05:28
jacobhinkle added a commit that referenced this pull request May 13, 2024
This change collects the maximum supported vectorization size for A, B,
and "epilogue" and attaches these to `MatmulParams`. In case any "C"
role tensors like bias are provided, they are considered when computing
epilogue vectorization. This information becomes available to heuristic
plugins also.

Fixes #2083. Fixes #983. Related to existing PR #807. (Related to issue
#682 but I'm not sure yet if this approach will fix it).

---------

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
@naoyam
Collaborator

naoyam commented Jul 24, 2025

Closing as there has been no activity in the last two years. Reopen it if needed.

@naoyam naoyam closed this Jul 24, 2025

4 participants