Conversation
!build
liqiangxl
left a comment
Maybe we should add a test case to verify the vectorization is as expected.
```cpp
const auto& tvs_with_vec = scheduler_utils::getInputsOutputsWithInnerDim(
    reference_tv, true /*inner_only*/, true /*vectorize_pass*/);

std::vector<TensorView*> epilogue_input_tvs_with_vec = {};
```
`epilogue_input_tvs_with_vec.reserve(tvs_with_vec.size())`
For this use case I prefer to rely on the initial size of the vector set during construction, and only if there is not enough space, let the STL implementation of vector handle the resize (usually it doubles the reserved capacity when the limit is reached).
Details:
The reason is that getInputsOutputsWithInnerDim(..) will look for vectorizable TVs among the fusion's inputs and outputs, while for us the scope is limited to fusion inputs with INPUT_C roles, that is, additional inputs that are used in the epilogue. I explicitly avoid MMA inputs, as they are processed separately.
For the first example, a vanilla matmul (D = A x B), this function can return up to 2 TVs (A and B; D is our reference here), none of which can be considered, so we will end up with an empty epilogue_input_tvs_with_vec.
For the second, more advanced example, a fusion D = A x B + beta * C, the getInputsOutputsWithInnerDim(..) call may return 3 TVs (A, B and C), where only C will be further considered.
With reserve we will most likely have unneeded memory operations (allocation, copy and deallocation of the previous buffer). If we were dealing with TensorView objects directly then it might be worth considering, but here we operate on pointers.
I know it is not a big deal in this case since the number of added elements is very small. I am just a little confused by "With reserve we will have most likely unneeded memory operations" (I think reserve is used to avoid these) and "I prefer to rely on the initial size of vector set during construction" (I think the initial size is 0).
Here is my logic:
Initially, without using reserve, the vector is created with both size and capacity set to 0. As elements are added to the vector, its capacity is automatically increased, typically doubling each time: from 1 to 2, then 4, 8, and so on. Each increase in capacity may involve additional memory operations, as the existing elements need to be copied to a new location, and the old memory needs to be deallocated. Is there any hole in this reasoning?
Thanks @liqiangxl for the reply and for sharing the details. I double-checked what the standard says about the initial size of the allocated space for std::vector: it is not defined, but most current implementations do start with 0. So your logic has no holes there. What I wrote was based on experiments I did in the past, but this seems to have changed. I will update the implementation to follow your advice.
Thanks!
csrc/scheduler/matmul.cpp
Outdated
```cpp
  return {epilogue_input_tvs_with_vec, epilogue_inputs};
}

std::set<TensorView*> sink(
```
The order doesn't seem to matter; you may use std::unordered_set.
Ok, I will change it. You are right, with a rather small number of elements (<100), inserting into / re-balancing a tree won't be needed.
In fact, I moved to the approach you mentioned in the other comment, where I search for elements in the vector with std::find, so now no set/unordered_set is needed. This indeed simplifies the code.
Thanks!
csrc/scheduler/matmul.cpp
Outdated
```cpp
for (auto* tv : epilogue_inputs) {
  const auto status = sink.insert(tv);
  if (status.second) {
```
Can be combined as `if (sink.insert(tv).second)`.
I thought about this; the initial thought was that removing the WAR and having the current code pass on ad102 GPUs is a good starting point. I will think about a real test that will have vectorization enabled (a bias that operates on columns, rather than rows?), but first I need to stabilize my dev environment. Unfortunately this blocks me from checking whether the latest changes are correct for the existing test cases.
zasdfgbnm
left a comment
Generally looks good. Left some minor comments. I do agree with @liqiangxl that we should have some test on vectorization. A good reference on how to check vectorization size could be ResizeTest.SliceVectorization.
csrc/scheduler/matmul.cpp
Outdated
```cpp
partitionEpilogueInputsByParallelizationModes(
    TensorView* reference_tv,
    const std::vector<TensorView*>& epilogue_inputs) {
  const auto& tvs_with_vec = scheduler_utils::getInputsOutputsWithInnerDim(
```
Is this a reference to a temporary object? getInputsOutputsWithInnerDim returns std::vector<TensorView*> by value, not a reference.
```cpp
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          epilogue_input_tvs_with_vec.cbegin(),
          epilogue_input_tvs_with_vec.cend(),
          tv) == epilogue_input_tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  }
}
```
Would the logic be cleaner if we make this like below:

```cpp
std::vector<TensorView*> epilogue_input_tvs_with_vec = {};
std::vector<TensorView*> epilogue_input_tvs_without_vec = {};
for (auto* tv : epilogue_inputs) {
  if (std::find(
          tvs_with_vec.cbegin(),
          tvs_with_vec.cend(),
          tv) == tvs_with_vec.cend()) {
    epilogue_input_tvs_without_vec.push_back(tv);
  } else {
    epilogue_input_tvs_with_vec.push_back(tv);
  }
}
```
Then I think that's enough for the entire function.
- expose batch iterdomain through tools used by heuristics in the matmul scheduler,
- remove the WAR for unaligned memory access when smem epilogue is enabled and the vectorization parallelization mode is propagated,
- introduce infrastructure for vectorization analysis in the matmul scheduler,
This change collects the maximum supported vectorization size for A, B, and "epilogue" and attaches these to `MatmulParams`. In case any "C" role tensors like bias are provided, they are considered when computing epilogue vectorization. This information becomes available to heuristic plugins also.

Fixes #2083. Fixes #983. Related to existing PR #807. (Related to issue #682 but I'm not sure yet if this approach will fix it.)

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
Closing as there's no activity in the last two years. Reopen it if needed.
This PR will solve the reported issue:
#682
The scope of this PR:
- smem epilogue is enabled,
- iterdomain / its extent,
Verified on tu102, a100 and ad102 GPUs.