
Reordering optimization in the resize scheduler #3693

Merged
naoyam merged 29 commits into main from resize_scheduler_reorder
Jan 15, 2025
Conversation

@naoyam
Collaborator

@naoyam naoyam commented Jan 10, 2025

Depends on #3674, #3675, #3679

Reorder tensors to align with the largest input. This should improve memory accesses by minimizing strides. Store throughput may be reduced, but optimizing load accesses should generally matter more.

I do not have actual performance results for this change. I just remember this was effective in some cases while manually trying out different optimization strategies. We may eventually need to enable or disable this reordering based on some heuristic.

@naoyam naoyam added the rope label Jan 10, 2025
@naoyam
Collaborator Author

naoyam commented Jan 10, 2025

!test

@naoyam
Collaborator Author

naoyam commented Jan 10, 2025

!test

@naoyam naoyam marked this pull request as ready for review January 14, 2025 22:00
@naoyam
Collaborator Author

naoyam commented Jan 14, 2025

!test

@naoyam naoyam requested a review from jjsjann123 January 14, 2025 22:01
Collaborator

@jjsjann123 jjsjann123 left a comment

WIP. Still need to look at reorderTensorLike.

// Make sure the DID ID is located at the outermost position
const auto outermost_pos = scheduler_utils::reorderDevicesToOuter(ref_tv);

const int64_t bdimx = 128;
Collaborator

Looks like unintended changes here.

Collaborator Author

Yeah, it's probably just something I accidentally moved when merging PRs.

// The tensors are going to be reordered to align with the largest
// input. To make it work, merge operations for reshape should be
// cancelled.
scheduler_tools::cancelReshapeInLoopDomains(largest_input);
Collaborator

I wonder what would happen if we have transformations that can't be cancelled.

// This scheduling is not always feasible. Specifically, if a reshape
// output iter domain is resized, the loop domain needs to keep using
// the reshape output iter domain. Similarly, if a reshape output iter

Is this reshape cancellation applied here so that a later call to scheduler_tools::scheduleLoopDomainsLike from the reference would apply successfully?
To rephrase the question: are we expecting the reshape cancellation to succeed on all tensors, or are we just trying to cancel the transformations on the reference tv?

Collaborator Author

All tensors that are direct or indirect consumers of largest_input will be targets of the cancellation. Reshapes are not just done on largest_input. Any reshape op that depends on largest_input, directly or indirectly, will be cancelled.

If a cancellation isn't valid, it's simply skipped, so this is done in a best-effort manner. It should not affect correctness, but the reference reordering may become suboptimal.

To be honest, this is only tested with the RoPE patterns, so there are likely some corner cases I'd need to work on before enabling the resize scheduler by default.

@naoyam
Collaborator Author

naoyam commented Jan 15, 2025

@jjsjann123 Please let me know if you have any additional comments. I'd like to get this merged quickly if not.

Collaborator

@jjsjann123 jjsjann123 left a comment

I have a couple of questions, but they're not blocking the merge of this PR. Stamping.

for (auto it = inputs.rbegin(); it != inputs.rend(); ++it) {
innermost_it =
std::find(ordered_domain.begin(), ordered_domain.end(), *it);
NVF_ERROR(innermost_it != ordered_domain.end());
Collaborator

Why is this iterator called innermost_it when it breaks upon the first reference encountered?

Collaborator

So it's the innermost input? But I thought the order of inputs isn't significant, while the position in ordered_domain is...

Collaborator Author

Hmm, I don't remember why I had the loop here. It just needs to find the position of the innermost input in ordered_domain. It should be just:

std::deque<ValGroup>::iterator innermost_it =
    std::find(ordered_domain.begin(), ordered_domain.end(), inputs.back());
NVF_ERROR(innermost_it != ordered_domain.end());

const auto& tv_loop_domain = target_tv->getLoopDomain();

IdModel id_model(target_tv->fusion(), /*build_graphs=*/false);
const auto& graph = id_model.buildBroadcastGraph();
Collaborator

Out of curiosity, why are we using the broadcast graph here?

Collaborator Author

That's because I think it makes sense for the ordering to treat broadcast domains and their corresponding non-broadcast domains as mapped.


// Place IDs that do not appear in ref at the outer position
int64_t new_id_pos = 0;
for (const auto i : c10::irange(tv_loop_domain.size())) {
Collaborator

I see self-mapping defaults to false, so we won't have multiple elements in tv_loop_domain that belong to the same ValGroup. Is that a general assumption we will keep holding, or purely an implementation shortcut we are taking for now?

Collaborator Author

I'm not sure. Does that matter here?

@naoyam
Collaborator Author

naoyam commented Jan 15, 2025

!test

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Logic Change

The schedule function in ResizeScheduler class has been modified to reorder tensors to align with the largest input. This change may impact the performance of the resize scheduler and requires careful review.

    continue;
  }
  auto inp_tv_copy = RecomputeTv::recompute(inp_tv);
  ir_utils::replaceValInExprInputs(resize_tensor_op, inp_tv, inp_tv_copy);
}

TensorView* largest_input = nullptr;
if (resize_params->largest_input >= 0) {
  largest_input =
      fusion->inputs().at(resize_params->largest_input)->as<TensorView>();

  // The tensors are going to be reordered to align with the largest
  // input. To make it work, merge operations for reshape should be
  // cancelled.
  scheduler_tools::cancelReshapeInLoopDomains(largest_input);
}

for (auto expr : fusion->exprs()) {
  if (!expr->isOneOf<SliceOp, PadOp>()) {
    continue;
  }

  scheduler_tools::propagateResizeToInputs(expr);
}

auto ref_tv = getReferenceTensor(fusion);
NVF_ERROR(ref_tv != nullptr);

// Just simple scheduling for now.
// TODO: Do something smarter. Can just use the pointwise scheduler?

// Reorder tensors to align with the largest input. This is expected
// to improve the memory read performance, while the write
// performance could be lowered. This should generally be more
// important to optimize the read performance, but more robust
// decision would be needed.
if (largest_input != nullptr) {
  std::vector<IterDomain*> ref_alloc;
  ref_alloc.reserve(largest_input->getMaybeAllocationDomain().size());
  std::copy_if(
      largest_input->getMaybeAllocationDomain().begin(),
      largest_input->getMaybeAllocationDomain().end(),
      std::back_inserter(ref_alloc),
      [](IterDomain* alloc_id) {
        return !alloc_id->isBroadcast() && !alloc_id->isReduction() &&
            !alloc_id->isDeviceDim();
      });

  // Reorder the reference as the allocation domain of the largest fusion
  // input
  scheduler_utils::reorderTensorLike(ref_tv, ref_alloc);
}
New Function

A new function reorderTensorLike has been added to reorder the loop domain of a given tensor to align with a given list of reference IDs. This function requires review to ensure it is correctly implemented and does not introduce any bugs.

void reorderTensorLike(
    TensorView* target_tv,
    const std::vector<IterDomain*>& ref) {
  const auto& tv_loop_domain = target_tv->getLoopDomain();

  IdModel id_model(target_tv->fusion(), /*build_graphs=*/false);
  const auto& graph = id_model.buildBroadcastGraph();

  ValGroups target_groups = graph.toGroups(tv_loop_domain);

  ValGroups ref_groups = graph.toGroups(ref);

  // Traverse from the reference to the target tv. The reference is
  // not guaranteed to cover all loop IDs of target, so
  // require_all_to_visited needs to be false
  auto path = ValGraphBFS::getExprGroupsBetween(
                  graph,
                  ref_groups,
                  target_groups,
                  /*require_all_to_visited=*/false)
                  .first;

  // Traverse the expr path to create an ordered ID groups
  std::deque<ValGroup> ordered_domain{
      ref_groups.vector().begin(), ref_groups.vector().end()};

  for (const auto& [expr_g, dir] : path) {
    auto inputs = getInputsOfExpr(
        expr_g, dir, ValGraphInputs(graph), ValGraphOutputs(graph));
    auto outputs = getOutputsOfExpr(
        expr_g, dir, ValGraphInputs(graph), ValGraphOutputs(graph));

    // Inserts the outputs at the innermost position
    auto innermost_it =
        std::find(ordered_domain.begin(), ordered_domain.end(), inputs.back());
    NVF_ERROR(innermost_it != ordered_domain.end());
    ordered_domain.insert(innermost_it, outputs.begin(), outputs.end());

    // Removes the inputs
    for (const auto& inp : inputs) {
      ordered_domain.erase(
          std::remove(ordered_domain.begin(), ordered_domain.end(), inp),
          ordered_domain.end());
    }
  }

  std::unordered_map<int64_t, int64_t> old2new;

  // Place IDs that do not appear in ref at the outer position
  int64_t new_id_pos = 0;
  for (const auto i : c10::irange(tv_loop_domain.size())) {
    const auto& loop_id_group = graph.toGroup(tv_loop_domain.at(i));
    auto it =
        std::find(ordered_domain.begin(), ordered_domain.end(), loop_id_group);
    if (it == ordered_domain.end()) {
      old2new.emplace((int64_t)i, new_id_pos);
      ++new_id_pos;
    }
  }
  for (const auto i : c10::irange(tv_loop_domain.size())) {
    const auto& loop_id_group = graph.toGroup(tv_loop_domain.at(i));
    auto it =
        std::find(ordered_domain.begin(), ordered_domain.end(), loop_id_group);
    if (it != ordered_domain.end()) {
      int64_t new_pos =
          (int64_t)std::distance(ordered_domain.begin(), it) + new_id_pos;
      old2new.emplace((int64_t)i, new_pos);
    }
  }

  target_tv->reorder(old2new);
}

@naoyam naoyam merged commit be2cc2a into main Jan 15, 2025
7 checks passed
@naoyam naoyam deleted the resize_scheduler_reorder branch January 15, 2025 20:29
naoyam added a commit that referenced this pull request Jan 15, 2025
Stacked on #3693 

This PR adds a preliminary vectorization support to the resize
scheduler. It currently only considers vectorization of the innermost
dimension, just because that's good enough for the RoPE cases. It should
eventually be extended to support vectorizing multiple innermost
dimensions.
naoyam added a commit that referenced this pull request Feb 25, 2025
This is a WAR for an issue with the vectorization by the resize
scheduler (unrelated to #3640).

#3693 introduced a reordering optimization for the resize scheduler that
attempted to minimize strides in read accesses of fusion inputs by
canceling reshapes. It turned out it can result in conflicts with
vectorization. The scheduler uses the fusion input as the reference of
the vectorization analysis, assuming any reshape is canceled, which is
not always the case.

So, in this PR, the vectorization analysis is changed to use the
reference output. However, that isn't enough, since when a reshape is indeed canceled, the analysis should actually be done using the
pre-reshape shape.

To work around that, this PR also adds a flag to disable canceling
reshapes that use innermost logical IDs. This should make sure it's
always valid to use the fusion output as the reference of the
vectorization analysis.

This is an ad-hoc WAR but should be good enough for the RoPE cases. The
real problem is a bit intertwined here, and I'm not attempting to
address it completely in this PR.
