
Reordering optimization in the resize scheduler #3693

Merged
naoyam merged 29 commits into main from resize_scheduler_reorder
Jan 15, 2025
Conversation

@naoyam
Collaborator

@naoyam naoyam commented Jan 10, 2025

Depends on #3674, #3675, #3679

Reorder tensors to align with the largest input. This should improve memory accesses by minimizing strides. Store throughput may be reduced, but optimizing load accesses should generally matter more.

I do not have actual performance results for this change. I just remember this was effective in some cases while manually trying out different optimization strategies. We may eventually need to enable or disable this reordering based on some heuristic.

@naoyam naoyam added the rope label Jan 10, 2025
@naoyam
Collaborator Author

naoyam commented Jan 10, 2025

!test

@naoyam
Collaborator Author

naoyam commented Jan 10, 2025

!test

@naoyam naoyam marked this pull request as ready for review January 14, 2025 22:00
@naoyam
Collaborator Author

naoyam commented Jan 14, 2025

!test

@naoyam naoyam requested a review from jjsjann123 January 14, 2025 22:01
Collaborator

@jjsjann123 jjsjann123 left a comment

WIP. Still need to look at reorderTensorLike.

// Make sure the DID ID is located at the outermost position
const auto outermost_pos = scheduler_utils::reorderDevicesToOuter(ref_tv);

const int64_t bdimx = 128;
Collaborator

Looks like unintended changes here.

Collaborator Author

Yeah, it's probably just something I accidentally moved when merging PRs.

// The tensors are going to be reordered to align with the largest
// input. To make it work, merge operations for reshape should be
// cancelled.
scheduler_tools::cancelReshapeInLoopDomains(largest_input);
Collaborator

I wonder what would happen if we have transformations that can't be cancelled.

// This scheduling is not always feasible. Specifically, if a reshape
// output iter domain is resized, the loop domain needs to keep using
// the reshape output iter domain. Similarly, if a reshape output iter

Is this reshape cancellation applied here so that a later call to scheduler_tools::scheduleLoopDomainsLike from the reference would apply successfully?
To rephrase the question: are we expecting the reshape cancellation to succeed on all tensors, or are we just trying to cancel the transformations on the reference tv?

Collaborator Author

All tensors that are direct or indirect consumers of largest_input will be targets of the cancellation. Reshapes are not just done on largest_input. Any reshape op that depends on largest_input, directly or indirectly, will be cancelled.

If a cancellation isn't valid, it's simply skipped, so this is done in a best-effort manner. It should not affect correctness, but the reference reordering may become suboptimal.

To be honest, this is only tested with the RoPE patterns, so there are likely some corner cases I'd need to work on before enabling the resize scheduler by default.

@naoyam
Collaborator Author

naoyam commented Jan 15, 2025

@jjsjann123 Please let me know if you have any additional comments. I'd like to get this merged quickly if not.

Collaborator

@jjsjann123 jjsjann123 left a comment

I have a couple of questions, but they're not blocking the merge of this PR. Stamping.

for (auto it = inputs.rbegin(); it != inputs.rend(); ++it) {
innermost_it =
std::find(ordered_domain.begin(), ordered_domain.end(), *it);
NVF_ERROR(innermost_it != ordered_domain.end());
Collaborator

Why is this iterator called innermost_it when it breaks upon the first reference encountered?

Collaborator

So it's the innermost input? But I thought the order of inputs isn't significant, while the position in ordered_domain is...

Collaborator Author

Hmm, I don't remember why I had the loop here. It just needs to find the position of the innermost input in ordered_domain. It should be just:

std::deque<ValGroup>::iterator innermost_it =
    std::find(ordered_domain.begin(), ordered_domain.end(), inputs.back());
NVF_ERROR(innermost_it != ordered_domain.end());

const auto& tv_loop_domain = target_tv->getLoopDomain();

IdModel id_model(target_tv->fusion(), /*build_graphs=*/false);
const auto& graph = id_model.buildBroadcastGraph();
Collaborator

Out of curiosity, why are we using the broadcast graph here?

Collaborator Author

That's because I think it makes sense for the ordering to treat broadcast domains and their corresponding non-broadcast domains as mapped.


// Place IDs that do not appear in ref at the outer position
int64_t new_id_pos = 0;
for (const auto i : c10::irange(tv_loop_domain.size())) {
Collaborator

I see self-mapping defaults to false, so we won't have multiple elements in tv_loop_domain that belong to the same ValGroup. Is that a general assumption we will keep holding, or purely an implementation shortcut we are taking for now?

Collaborator Author

I'm not sure. Does that matter here?

@naoyam
Collaborator Author

naoyam commented Jan 15, 2025

!test

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Logic Change

The schedule function in ResizeScheduler class has been modified to reorder tensors to align with the largest input. This change may impact the performance of the resize scheduler and requires careful review.

    continue;
  }
  auto inp_tv_copy = RecomputeTv::recompute(inp_tv);
  ir_utils::replaceValInExprInputs(resize_tensor_op, inp_tv, inp_tv_copy);
}

TensorView* largest_input = nullptr;
if (resize_params->largest_input >= 0) {
  largest_input =
      fusion->inputs().at(resize_params->largest_input)->as<TensorView>();

  // The tensors are going to be reordered to align with the largest
  // input. To make it work, merge operations for reshape should be
  // cancelled.
  scheduler_tools::cancelReshapeInLoopDomains(largest_input);
}

for (auto expr : fusion->exprs()) {
  if (!expr->isOneOf<SliceOp, PadOp>()) {
    continue;
  }

  scheduler_tools::propagateResizeToInputs(expr);
}

auto ref_tv = getReferenceTensor(fusion);
NVF_ERROR(ref_tv != nullptr);

// Just simple scheduling for now.
// TODO: Do something smarter. Can just use the pointwise scheduler?

// Reorder tensors to align with the largest input. This is expected
// to improve the memory read performance, while the write
// performance could be lowered. This should generally be more
// important to optimize the read performance, but more robust
// decision would be needed.
if (largest_input != nullptr) {
  std::vector<IterDomain*> ref_alloc;
  ref_alloc.reserve(largest_input->getMaybeAllocationDomain().size());
  std::copy_if(
      largest_input->getMaybeAllocationDomain().begin(),
      largest_input->getMaybeAllocationDomain().end(),
      std::back_inserter(ref_alloc),
      [](IterDomain* alloc_id) {
        return !alloc_id->isBroadcast() && !alloc_id->isReduction() &&
            !alloc_id->isDeviceDim();
      });

  // Reorder the reference as the allocation domain of the largest fusion
  // input
  scheduler_utils::reorderTensorLike(ref_tv, ref_alloc);
}
New Function

A new function reorderTensorLike has been added to reorder the loop domain of a given tensor to align with a given list of reference IDs. This function requires review to ensure it is correctly implemented and does not introduce any bugs.

void reorderTensorLike(
    TensorView* target_tv,
    const std::vector<IterDomain*>& ref) {
  const auto& tv_loop_domain = target_tv->getLoopDomain();

  IdModel id_model(target_tv->fusion(), /*build_graphs=*/false);
  const auto& graph = id_model.buildBroadcastGraph();

  ValGroups target_groups = graph.toGroups(tv_loop_domain);

  ValGroups ref_groups = graph.toGroups(ref);

  // Traverse from the reference to the target tv. The reference is
  // not guaranteed to cover all loop IDs of target, so
  // require_all_to_visited needs to be false
  auto path = ValGraphBFS::getExprGroupsBetween(
                  graph,
                  ref_groups,
                  target_groups,
                  /*require_all_to_visited=*/false)
                  .first;

  // Traverse the expr path to create an ordered ID groups
  std::deque<ValGroup> ordered_domain{
      ref_groups.vector().begin(), ref_groups.vector().end()};

  for (const auto& [expr_g, dir] : path) {
    auto inputs = getInputsOfExpr(
        expr_g, dir, ValGraphInputs(graph), ValGraphOutputs(graph));
    auto outputs = getOutputsOfExpr(
        expr_g, dir, ValGraphInputs(graph), ValGraphOutputs(graph));

    // Inserts the outputs at the innermost position
    auto innermost_it =
        std::find(ordered_domain.begin(), ordered_domain.end(), inputs.back());
    NVF_ERROR(innermost_it != ordered_domain.end());
    ordered_domain.insert(innermost_it, outputs.begin(), outputs.end());

    // Removes the inputs
    for (const auto& inp : inputs) {
      ordered_domain.erase(
          std::remove(ordered_domain.begin(), ordered_domain.end(), inp),
          ordered_domain.end());
    }
  }

  std::unordered_map<int64_t, int64_t> old2new;

  // Place IDs that do not appear in ref at the outer position
  int64_t new_id_pos = 0;
  for (const auto i : c10::irange(tv_loop_domain.size())) {
    const auto& loop_id_group = graph.toGroup(tv_loop_domain.at(i));
    auto it =
        std::find(ordered_domain.begin(), ordered_domain.end(), loop_id_group);
    if (it == ordered_domain.end()) {
      old2new.emplace((int64_t)i, new_id_pos);
      ++new_id_pos;
    }
  }
  for (const auto i : c10::irange(tv_loop_domain.size())) {
    const auto& loop_id_group = graph.toGroup(tv_loop_domain.at(i));
    auto it =
        std::find(ordered_domain.begin(), ordered_domain.end(), loop_id_group);
    if (it != ordered_domain.end()) {
      int64_t new_pos =
          (int64_t)std::distance(ordered_domain.begin(), it) + new_id_pos;
      old2new.emplace((int64_t)i, new_pos);
    }
  }

  target_tv->reorder(old2new);
}

@naoyam naoyam merged commit be2cc2a into main Jan 15, 2025
7 checks passed
@naoyam naoyam deleted the resize_scheduler_reorder branch January 15, 2025 20:29
naoyam added a commit that referenced this pull request Jan 15, 2025
Stacked on #3693 

This PR adds a preliminary vectorization support to the resize
scheduler. It currently only considers vectorization of the innermost
dimension, just because that's good enough for the RoPE cases. It should
eventually be extended to support vectorizing multiple innermost
dimensions.
naoyam added a commit that referenced this pull request Feb 25, 2025
This is a WAR for an issue with the vectorization by the resize
scheduler (unrelated to #3640).

#3693 introduced a reordering optimization for the resize scheduler that
attempted to minimize strides in read accesses of fusion inputs by
canceling reshapes. It turned out it can result in conflicts with
vectorization. The scheduler uses the fusion input as the reference of
the vectorization analysis, assuming any reshape is canceled, which is
not always the case.

So, in this PR, the vectorization analysis is changed to use the
reference output. However, that isn't enough, since when a reshape is indeed canceled, the analysis should actually be done using the
pre-reshape shape.

To work around that, this PR also adds a flag to disable canceling
reshapes that use innermost logical IDs. This should make sure it's
always valid to use the fusion output as the reference of the
vectorization analysis.

This is an ad-hoc WAR but should be good enough for the RoPE cases. The
real problem is a bit intertwined here, and I'm not attempting to
address it completely in this PR.
