Reordering optimization in the resize scheduler #3693
Conversation
!test

!test

!test
jjsjann123
left a comment
WIP. Still need to look at reorderTensorLike.
csrc/scheduler/resize.cpp (Outdated)

```cpp
// Make sure the DID ID located at the outermost position
const auto outermost_pos = scheduler_utils::reorderDevicesToOuter(ref_tv);

const int64_t bdimx = 128;
```
Looks like unintended changes here.

Yeah, I probably just accidentally moved it when merging PRs.
```cpp
// The tensors are going to be reordered to align with the largest
// input. To make it work, merge operations for reshape should be
// cancelled.
scheduler_tools::cancelReshapeInLoopDomains(largest_input);
```
I wonder what would happen if we have a transformation that can't be cancelled.
Fuser/csrc/scheduler/tools/loop_domain_scheduler.h
Lines 102 to 104 in 3d27e10
Is this reshape cancellation applied here so that, later, when we call scheduler_tools::scheduleLoopDomainsLike from the reference, it would successfully apply?
To rephrase the question: are we expecting the reshape cancellation to successfully cancel transformations on all tensors, or are we just trying to cancel the transformation on the reference tv?
All tensors that are direct or indirect consumers of largest_input are targets of the cancellation. Reshapes are not done only on largest_input: any reshape op that depends on largest_input, directly or indirectly, will be cancelled.
If a cancellation isn't valid, it'll just be skipped, so this is done in a best-effort manner. It should not affect correctness, but the reference reordering may become suboptimal.
To be honest, this is only tested with the RoPE patterns, so there are likely some corner cases that I'll need to work on before enabling the resize scheduler by default.
@jjsjann123 Please let me know if you have any additional comments. I'd like to get this merged quickly if not.
jjsjann123
left a comment
I have a couple of questions, but they're not blocking the merge of the PR. Stamping.
csrc/scheduler/utils.cpp (Outdated)

```cpp
for (auto it = inputs.rbegin(); it != inputs.rend(); ++it) {
  innermost_it =
      std::find(ordered_domain.begin(), ordered_domain.end(), *it);
  NVF_ERROR(innermost_it != ordered_domain.end());
```
Why is this iterator called innermost_it when the loop breaks upon the first reference encountered?

So it's the innermost input? But I thought the order of inputs wasn't significant, while the position in ordered_domain is...
Hmm, I don't remember why I had the loop here. It just needs to find the position of the innermost input in ordered_domain. It should be just:

```cpp
std::deque<ValGroup>::iterator innermost_it =
    std::find(ordered_domain.begin(), ordered_domain.end(), inputs.back());
NVF_ERROR(innermost_it != ordered_domain.end());
```
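For illustration, the suggested simplification can be sketched as a standalone function. ValGroup here is replaced by a plain std::string stand-in, since the real type lives inside nvFuser, and innermostPosition is a hypothetical name, not a scheduler API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <iterator>
#include <string>
#include <vector>

// Stand-in for nvFuser's ValGroup: here just a string label.
using ValGroup = std::string;

// The innermost input is inputs.back(), so a single std::find locates its
// position in the ordered loop domain; no loop over all inputs is needed.
// Assumes inputs is non-empty.
std::size_t innermostPosition(
    const std::deque<ValGroup>& ordered_domain,
    const std::vector<ValGroup>& inputs) {
  auto innermost_it =
      std::find(ordered_domain.begin(), ordered_domain.end(), inputs.back());
  // Mirrors the NVF_ERROR check: the innermost input must appear in the
  // ordered domain.
  assert(innermost_it != ordered_domain.end());
  return static_cast<std::size_t>(
      std::distance(ordered_domain.begin(), innermost_it));
}
```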
```cpp
const auto& tv_loop_domain = target_tv->getLoopDomain();

IdModel id_model(target_tv->fusion(), /*build_graphs=*/false);
const auto& graph = id_model.buildBroadcastGraph();
```
Out of curiosity, why are we using the broadcast graph here?

That's because I think it makes sense for the ordering to consider broadcast domains and their corresponding non-broadcast domains as mapped.
```cpp
// Place IDs that do not appear in ref at the outer position
int64_t new_id_pos = 0;
for (const auto i : c10::irange(tv_loop_domain.size())) {
```
I see self-mapping defaults to false, so we won't have multiple elements in tv_loop_domain that belong to the same ValGroup. Is that a general assumption we'll keep holding in the future, or is it purely an implementation shortcut we're taking for now?

I'm not sure. Does that matter here?
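As a rough sketch of the reordering rule in the snippet above (IDs absent from the reference go to the outer positions, the rest follow the reference order), using plain strings in place of IterDomains; reorderLikeRef is a hypothetical name, not the scheduler's actual function:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Reorder tv_loop_domain so that IDs not found in ref_order come first
// (outermost), keeping their original relative order, followed by the
// remaining IDs in the order given by the reference.
std::vector<std::string> reorderLikeRef(
    const std::vector<std::string>& tv_loop_domain,
    const std::vector<std::string>& ref_order) {
  std::vector<std::string> result;
  // IDs that do not appear in ref go to the outer (leftmost) positions.
  for (const auto& id : tv_loop_domain) {
    if (std::find(ref_order.begin(), ref_order.end(), id) == ref_order.end()) {
      result.push_back(id);
    }
  }
  // The remaining IDs follow the reference ordering.
  for (const auto& ref_id : ref_order) {
    if (std::find(tv_loop_domain.begin(), tv_loop_domain.end(), ref_id) !=
        tv_loop_domain.end()) {
      result.push_back(ref_id);
    }
  }
  return result;
}
```

Note that this assumes the no-self-mapping property discussed above: each ID belongs to its own group, so string equality stands in for ValGroup membership.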
!test
Stacked on #3693. This PR adds preliminary vectorization support to the resize scheduler. It currently only considers vectorization of the innermost dimension, just because that's good enough for the RoPE cases. It should eventually be extended to support vectorizing multiple innermost dimensions.
This is a WAR for an issue with vectorization in the resize scheduler (unrelated to #3640). #3693 introduced a reordering optimization for the resize scheduler that attempts to minimize strides in read accesses of fusion inputs by canceling reshapes. It turned out that this can conflict with vectorization: the scheduler uses the fusion input as the reference for the vectorization analysis, assuming any reshape is canceled, which is not always the case. So, in this PR, the vectorization analysis is changed to use the fusion output as the reference. However, that isn't enough either, since when a reshape is indeed canceled, the analysis should actually be done using the pre-reshape shape. To work around that, this PR also adds a flag to disable canceling reshapes that use innermost logical IDs. This should ensure it's always valid to use the fusion output as the reference for the vectorization analysis. It's an ad-hoc WAR but should be good enough for the RoPE cases. The real problem is somewhat intertwined here, and I'm not attempting to address it completely in this PR.
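The innermost-dimension analysis described above boils down to picking a vectorization width that evenly divides the innermost extent. A minimal sketch, assuming the width is a power of two capped by the 16-byte maximum vector load on current NVIDIA GPUs; vectorWidth is a hypothetical helper, not the scheduler's actual analysis (which also accounts for alignment and resize offsets):

```cpp
#include <cstdint>

// Largest power-of-two vectorization width (in elements) that evenly
// divides the innermost extent, capped so a single vector access stays
// within 16 bytes.
int64_t vectorWidth(int64_t innermost_extent, int64_t element_size_bytes) {
  const int64_t max_elems = 16 / element_size_bytes;
  int64_t width = 1;
  // Double the width while it still divides the extent and fits the cap.
  while (width * 2 <= max_elems && innermost_extent % (width * 2) == 0) {
    width *= 2;
  }
  return width;
}
```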
Depends on #3674, #3675, #3679
Reorder tensors to align with the largest input. This should improve memory accesses by minimizing strides. Store throughput may be lowered, but optimizing load accesses should generally matter more.
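To see why aligning the loop order with the largest input's allocation order minimizes read strides, here is a toy stride computation for a row-major allocation; innermostLoopStride is a hypothetical name for illustration, not nvFuser code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stride (in elements) of the innermost loop axis, given the tensor's sizes
// in allocation order and a mapping from loop axes to allocation axes. In a
// row-major layout, the stride of allocation axis `a` is the product of the
// sizes of all axes to its right; a stride of 1 means coalesced reads.
int64_t innermostLoopStride(
    const std::vector<int64_t>& alloc_sizes,  // sizes in allocation order
    const std::vector<int>& loop_to_alloc) {  // loop axis -> allocation axis
  const int a = loop_to_alloc.back();  // allocation axis of the innermost loop
  int64_t stride = 1;
  for (std::size_t i = a + 1; i < alloc_sizes.size(); ++i) {
    stride *= alloc_sizes[i];
  }
  return stride;
}
```

When the loop order matches the allocation order, the innermost loop walks with stride 1; a mismatched order can make every read jump by the product of the inner allocation sizes.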
I do not have actual performance results for this change. I just remember it was effective in some cases while manually trying out different optimization strategies. We may eventually need a heuristic to enable or disable this reordering.