
Enable resize scheduler by default #3848

Merged
naoyam merged 47 commits into main from enable_resize_scheduler_by_default on Mar 19, 2025

Conversation

@naoyam
Collaborator

@naoyam naoyam commented Feb 7, 2025

This enables the resize scheduler by default. In summary:

  • The resize scheduler is now on by default and can be disabled with NVFUSER_DISABLE=resize_scheduler.
  • Test option settings were adjusted accordingly.
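As a usage sketch (the script name here is hypothetical), disabling the scheduler for a single run looks like:

```shell
# Disable the resize scheduler for one invocation (script name hypothetical)
NVFUSER_DISABLE=resize_scheduler python run_fusion.py

# NVFUSER_DISABLE appears to accept a comma-separated list of options
NVFUSER_DISABLE=resize_scheduler,index_hoist python run_fusion.py
```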

@naoyam
Collaborator Author

naoyam commented Feb 7, 2025

!test

@github-actions

github-actions bot commented Feb 7, 2025

Review updated until commit f473ecb

Description

  • Enable resize scheduler by default

  • Update option handling in multiple files

  • Adjust tests to accommodate new default


Changes walkthrough 📝

Relevant files

Configuration changes (2 files)
  options.cpp — Move resize scheduler option to disable map (+1/-1)
  options.h — Move resize scheduler option to disable map (+1/-1)

Code modification (4 files)
  pre_segmenter.cpp — Update resize scheduler check (+1/-1)
  resize.cpp — Update resize scheduler checks and logic (+3/-5)
  loop_domain_scheduler.cpp — Simplify replay direction checks (+2/-5)
  resize_utils.cpp — Add direction parameter to scheduleLoopDomainsBy (+2/-1)

Test modification (4 files)
  test_move_pad.cpp — Disable resize scheduler in test setup (+7/-1)
  test_resize.cpp — Update test setup to disable resize scheduler (+4/-16)
  test_rope.cpp — Remove resize scheduler enable in test setup (+1/-7)
  test_python_frontend.py — Add cache reset for long-running fusion test (+10/-0)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Option Handling

Ensure that the enabling and disabling of the resize scheduler are correctly handled and do not introduce any conflicts or unintended behavior.

          {"kernel_db", EnableOption::KernelDb},
          {"kernel_debug", EnableOption::KernelDebug},
          {"kernel_lineinfo", EnableOption::KernelLineInfo},
          {"kernel_profile", EnableOption::KernelProfile},
          {"memory_promotion", EnableOption::MemoryPromotion},
          {"reuse_zeroed_memory", EnableOption::ReuseZeroedMemory},
          {"static_fusion_count", EnableOption::StaticFusionCount},
          {"wait_debugger", EnableOption::WaitDebugger},
          {"warn_register_spill", EnableOption::WarnRegisterSpill},
          {"host_ir_lowering", EnableOption::HostIrLowering},
      };
  return available_options;
}

template <>
std::unordered_map<EnableOption, std::vector<std::string>> Options<
    EnableOption>::getOptionsFromEnv() {
  const auto& available_options = getEnableOptions();
  return parseEnvOptions("ENABLE", available_options);
}

std::optional<EnableOption> stringToEnableOption(
    const std::string& enable_option) {
  const auto& opts = getEnableOptions();
  auto it = opts.find(enable_option);
  if (it != opts.end()) {
    return it->second;
  }
  return std::nullopt;
}

const std::unordered_map<std::string, DisableOption>& getDisableOptions() {
  static const std::unordered_map<std::string, DisableOption>
      available_options = {
          {"compile_to_sass", DisableOption::CompileToSass},
          {"contig_indexing", DisableOption::ContigIndexing},
          {"expr_simplify", DisableOption::ExprSimplify},
          {"fallback", DisableOption::Fallback},
          {"fma", DisableOption::Fma},
          {"grouped_grid_welford_outer_opt",
           DisableOption::GroupedGridWelfordOuterOpt},
          {"id_model", DisableOption::IdModel},
          {"index_hoist", DisableOption::IndexHoist},
          {"magic_zero", DisableOption::MagicZero},
          {"matmul_expr_eval", DisableOption::MatmulExprEval},
          {"nvtx", DisableOption::Nvtx},
          {"parallel_compile", DisableOption::ParallelCompile},
          {"parallel_serde", DisableOption::ParallelSerde},
          {"predicate_elimination", DisableOption::PredicateElimination},
          {"python_inline_definitions", DisableOption::PythonInlineDefinitions},
          {"kernel_reuse", DisableOption::KernelReuse},
          {"var_name_remapping", DisableOption::VarNameRemapping},
          {"welford_vectorization", DisableOption::WelfordVectorization},
          {"resize_scheduler", DisableOption::ResizeScheduler},
Conditional Logic

Verify that the conditional logic for the resize scheduler is correctly implemented and that the pre-segmentation pass behaves as expected when the resize scheduler is disabled.

if (isOptionDisabled(DisableOption::ResizeScheduler)) {
  OptimizationPass<MovePadPass>::runPass(fusion);
}
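The gating above can be sketched generically. As a hedged illustration (simplified names, not nvFuser's actual implementation), the mechanism behind NVFUSER_DISABLE amounts to parsing a comma-separated list from the environment into a set and checking membership:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <unordered_set>

// Hedged sketch (simplified names, not nvFuser's actual implementation):
// NVFUSER_DISABLE holds a comma-separated list of option names, parsed
// once into a set so that passes can be gated on membership.
std::unordered_set<std::string> parseDisableEnv(const char* env_value) {
  std::unordered_set<std::string> disabled;
  if (env_value == nullptr) {
    return disabled;
  }
  std::stringstream ss(env_value);
  std::string token;
  while (std::getline(ss, token, ',')) {
    if (!token.empty()) {
      disabled.insert(token);
    }
  }
  return disabled;
}

// Mirrors the isOptionDisabled(DisableOption::ResizeScheduler) check above:
// the MovePadPass fallback path would run only when the option is present.
bool isResizeSchedulerDisabled() {
  return parseDisableEnv(std::getenv("NVFUSER_DISABLE"))
             .count("resize_scheduler") > 0;
}
```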
Test Coverage

Ensure that the tests cover all scenarios, including cases where the resize scheduler is enabled and disabled, to validate the correctness of the implementation.

TEST_F(ResizeTest, TraversalForInliningPosition) {
  auto fusion_ptr = std::make_unique<Fusion>();
  auto& fusion = *fusion_ptr;
  FusionGuard fg(fusion_ptr.get());

  // Disable the resize schedule because the original issue happened
  // with the pointwise scheduler
  DisableOptionsGuard::getCurOptions().set(DisableOption::ResizeScheduler);

  auto tv0 = makeContigConcreteTensor({16});
  fusion.addInput(tv0);
  auto tv1 = makeContigConcreteTensor({8});
  fusion.addInput(tv1);

  auto tv2 =
      slice(tv0, {{IrBuilder::create<Val>(0L), IrBuilder::create<Val>(8L)}});
  auto tv3 = sin(tv2);
  fusion.addOutput(tv3);

  auto tv4 =
      slice(tv0, {{IrBuilder::create<Val>(0L), IrBuilder::create<Val>(8L)}});
  auto tv5 = add(tv1, tv4);
  auto tv6 = add(tv2, tv5);
  fusion.addOutput(tv6);

  // This fusion will be scheduled as a pointwise kernel. The issue
  // was that the cache of the tv1 input was not inlined at all. That
  // is because the spanning tree propagation from the reference
  // tensor, which is tv3, arrives at the cache tensor through tv0 and
  // tv4, which means that no mapped ID is returned by
  // getPositionsMappedTo since resized IDs are not mapped in
  // TransformReplay::getMatchedLeafPosWithoutReplayPasC.
  //
  // This issue should not happen if the spanning tree traversal took
  // the path from tv2 -> tv6 -> tv5 -> tv1_cache since there's no
  // resize along that path.

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
  auto t0 = at::randn({16}, options);
  auto t1 = at::randn({8}, options);

  FusionExecutorCache executor_cache(std::move(fusion_ptr));
  auto outputs = executor_cache.runFusionWithInputs({t0, t1});
  testValidate(&fusion, outputs, {t0, t1}, __LINE__, __FILE__);

  // Make sure all the tensors are at least inlined at some
  // position. The cache of tv1 was not inlined at all due to the issue.
  FusionKernelRuntime* runtime = executor_cache.getMostRecentKernelRuntime();
  EXPECT_FALSE(runtime->isSegmented());
  const auto& heuristic_param =
      runtime->schedulerHeuristics()->heuristicsList().front();
  EXPECT_EQ(heuristic_param->scheduler_type, SchedulerType::PointWise);
  auto scheduled_fusion = runtime->executors()
                              .at(0)
                              ->as<KernelExecutor>()
                              ->compiledKernel()
                              ->kernel();

  auto ref_tv = scheduled_fusion->outputs().at(0)->as<TensorView>();
  for (auto tv : ref_tv->fusion()->allTvs()) {
    if (tv->isFusionInput() || tv->isFusionOutput()) {
      continue;
    }
    EXPECT_GT(tv->getComputeAtPosition(), 0)
        << "Unexpected computeAt position: " << tv->toString();
  }
}

// Repro of issue 3801 (https://github.com/NVIDIA/Fuser/issues/3801)
// clang-format off
/*
Inputs:
  T13_g_float[bS35{1}, iS36{16}]
  T59_g_float[bS405{1}, iS406{4}, iS407{3}, bS408{1}, iS409{16}]
Outputs:
  T64_g_float[bS215{1}, iS216{4}, bS217{1}, iS218{16}]
  T89_g_float[bS319{1}, iS320{4}, bS321{1}, iS322{16}]
  T63_g_float[bS211{1}, iS212{4}, bS213{1}, iS214{16}]
  T78_g_float[bS271{1}, iS272{4}, bS273{1}, iS274{16}]

%kernel_math {
T61_g_float[bS199{1}, iS200{4}, bS202{1}rf, bS203{1}, iS204{16}]
   = slice( T59_g_float[bS405{1}, iS406{4}, iS407{3}, bS408{1}, iS409{16}], { {0, 1, 1} {0, 4, 1} {1, 2, 1} {0, 1, 1} {0, 16, 1} } )
T64_g_float[bS215{1}, iS216{4}, bS217{1}, iS218{16}]
   = squeeze( T61_g_float[bS199{1}, iS200{4}, bS202{1}rf, bS203{1}, iS204{16}], flags = {false, false, false, true, false} )
T106_l_float[bS399{1}, bS400{1}, bS401{1}, iS402{16}]
   = broadcast( T13_g_float[bS35{1}, iS36{16}], flags = {true, true, false, false} )
T77_g_float[bS267{1}, bS268{1 ex 4}, bS269{1}, iS270{16}] = expand( T106_l_float[bS399{1}, bS400{1}, bS401{1}, iS402{16}], {1, 4, 1, 16} )
T89_g_float[bS319{1}, iS320{4}, bS321{1}, iS322{16}]
   = T64_g_float[bS215{1}, iS216{4}, bS217{1}, iS218{16}]
   * T77_g_float[bS267{1}, bS268{1 ex 4}, bS269{1}, iS270{16}];
T60_g_float[bS193{1}, iS194{4}, bS196{1}rf, bS197{1}, iS198{16}]
   = slice( T59_g_float[bS405{1}, iS406{4}, iS407{3}, bS408{1}, iS409{16}], { {0, 1, 1} {0, 4, 1} {0, 1, 1} {0, 1, 1} {0, 16, 1} } )
T63_g_float[bS211{1}, iS212{4}, bS213{1}, iS214{16}]
   = squeeze( T60_g_float[bS193{1}, iS194{4}, bS196{1}rf, bS197{1}, iS198{16}], flags = {false, false, false, true, false} )
T78_g_float[bS271{1}, iS272{4}, bS273{1}, iS274{16}]
   = T63_g_float[bS211{1}, iS212{4}, bS213{1}, iS214{16}]
   * T77_g_float[bS267{1}, bS268{1 ex 4}, bS269{1}, iS270{16}];
} // %kernel_math
*/
// clang-format on
TEST_F(ResizeTest, Repro3801) {
  auto fusion_ptr = std::make_unique<Fusion>();
  auto& fusion = *fusion_ptr;
  FusionGuard fg(fusion_ptr.get());

  // Disable the resize schedule because the original issue happened
  // with the pointwise scheduler
  DisableOptionsGuard::getCurOptions().set(DisableOption::ResizeScheduler);

  auto T13 = makeContigConcreteTensor({1, 16});
  fusion.addInput(T13);
  auto T59 = makeContigConcreteTensor({1, 4, 3, 1, 16});
  fusion.addInput(T59);

Base automatically changed from fix_rope_llama3_bwd to main February 7, 2025 17:05
@naoyam naoyam force-pushed the enable_resize_scheduler_by_default branch from 006ef2b to ae17a5a Compare February 7, 2025 17:08
@naoyam
Collaborator Author

naoyam commented Feb 10, 2025

!test

1 similar comment
@naoyam
Collaborator Author

naoyam commented Feb 10, 2025

!test

@naoyam naoyam force-pushed the enable_resize_scheduler_by_default branch 2 times, most recently from 9044331 to dd5d144 Compare February 12, 2025 21:42
@naoyam naoyam force-pushed the enable_resize_scheduler_by_default branch from dd5d144 to 4cb7fbd Compare February 12, 2025 21:43
@naoyam
Collaborator Author

naoyam commented Feb 13, 2025

!test

@naoyam
Collaborator Author

naoyam commented Feb 14, 2025

!test

@naoyam
Collaborator Author

naoyam commented Mar 3, 2025

!test

@naoyam naoyam changed the title [WIP] Enable resize scheduler by default Enable resize scheduler by default Mar 3, 2025
@naoyam naoyam marked this pull request as ready for review March 3, 2025 19:17
nvf_out, _ = self.exec_nvfuser(fusion_func, inputs, supports_segmentation=False)
# self.assertEqual(nvf_out[0], t24)

# This fusion takes a long time to segment and schedule
Collaborator Author

Without this, the python tests took about 2 hours to complete. This is a WAR @rdspring1 suggested.

Direction replay_dir_tv = Direction::Undefined;
if (replay_dir != Direction::Backward &&
input_ids.size() == transform->inputs().size()) {
NVF_ERROR(output_ids.empty());
Collaborator Author

@naoyam naoyam Mar 4, 2025

I originally thought this would always be the case, but it actually isn't. In particular, there can be a resize that produces an output that is mapped with the input. In that case, output_ids won't be empty, but as long as all the mapped inputs are found, that should not be a problem.

Collaborator

nitpick: we should remove the comment above on lines 525-527, then.


scheduler_tools::scheduleLoopDomainsBy(tvs_to_schedule, resize);
scheduler_tools::scheduleLoopDomainsBy(
tvs_to_schedule, resize, Direction::Forward);
Collaborator Author

Added the direction option to make it explicit, since these are always forward transformations.
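The pattern here can be sketched generically (names hypothetical, not the actual nvFuser signatures): a replay-direction enum with a defaulted parameter, where call sites pass the value explicitly to document intent.

```cpp
// Hypothetical sketch of making a defaulted direction explicit at call sites.
enum class Direction { Forward, Backward, Undefined };

// The direction defaults to Forward, but call sites in the style of
// scheduleLoopDomainsBy(tvs, resize, Direction::Forward) spell it out,
// so the replay direction is visible where the call is made.
int stepIndex(int index, Direction dir = Direction::Forward) {
  return dir == Direction::Backward ? index - 1 : index + 1;
}
```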

@naoyam
Collaborator Author

naoyam commented Mar 4, 2025

!test

@naoyam naoyam requested a review from jjsjann123 March 4, 2025 10:36
@naoyam naoyam added the rope label Mar 4, 2025
Collaborator

@jjsjann123 jjsjann123 left a comment

LGTM.

auto vec_ref_tv = largest_input != nullptr ? largest_input : ref_tv;
const auto tvs_to_vectorize =
scheduler_utils::getInputsOutputsWithInnerDim(vec_ref_tv, true, true);
scheduler_utils::getInputsOutputsWithInnerDim(ref_tv, true, true);
Collaborator

Can you remind me why we were using largest_input as vectorization reference in the first place?

Collaborator Author

Forgot to mention this, but this was actually a bug. It should have been changed in #3955.

@naoyam
Collaborator Author

naoyam commented Mar 5, 2025

!build

@naoyam
Collaborator Author

naoyam commented Mar 5, 2025

I'll merge this after #3906.

@naoyam
Collaborator Author

naoyam commented Mar 15, 2025

!test

@naoyam
Collaborator Author

naoyam commented Mar 17, 2025

!test

@naoyam
Collaborator Author

naoyam commented Mar 19, 2025

!test

@naoyam naoyam merged commit 8101ff4 into main Mar 19, 2025
53 checks passed
@naoyam naoyam deleted the enable_resize_scheduler_by_default branch March 19, 2025 19:35
IvanYashchuk pushed a commit to Lightning-AI/lightning-thunder that referenced this pull request May 2, 2025
…1949)

This PR removes the `torchcompile_cat` executor from Thunder's default executor list in favor of nvFuser's implementation of RoPE and the surrounding logic.

nvFuser significantly upgraded its RoPE and surrounding-logic performance in [nvFuser PR 3848](NVIDIA/Fuser#3848). Regression testing on a large set of NeMo-based HuggingFace models showed a 1-9% end-to-end speedup, with many models showing no difference. Most importantly, there were no slowdowns, so this should be purely a performance bonus over current expectations.