
Concretization replays loop transforms#3950

Merged
wujingyue merged 15 commits into main from wjy/replay on Mar 18, 2025
Conversation

@wujingyue (Collaborator) commented Feb 23, 2025

For #2563

Also extends selfAllocationReplay to replay the loop domain so it can be reused for concretization.

@wujingyue wujingyue marked this pull request as draft February 23, 2025 18:51
@wujingyue (Collaborator, Author) commented:

!test

github-actions bot commented Feb 23, 2025

Review updated until commit 75ee3fb

Description

  • Renamed selfAllocationReplay to selfReplay in TransformReplay.

  • Extended selfReplay to handle loop domains in addition to allocation domains.

  • Updated test cases to use the new selfReplay method.

  • Added a new test case for loop splitting in DynamicTransformTest.


Changes walkthrough 📝

Relevant files

Enhancement

dynamic_transform.cpp: Update concretizeNonEmptyReshape to use selfReplay
csrc/dynamic_transform.cpp
  • Included transform_replay.h.
  • Replaced selfAllocationReplay with selfReplay in concretizeNonEmptyReshape.
  +8/-3

fusion.cpp: Update aliasOutputToInput to use selfReplay
csrc/fusion.cpp
  • Replaced selfAllocationReplay with selfReplay in aliasOutputToInput.
  +1/-1

remove_bcast_squeeze.cpp: Update maybeDoReplacement to use selfReplay
csrc/preseg_passes/remove_bcast_squeeze.cpp
  • Replaced selfAllocationReplay with selfReplay in maybeDoReplacement.
  +1/-2

transform_replay.cpp: Extend selfReplay to handle loop domains
csrc/transform_replay.cpp
  • Renamed selfAllocationReplay to selfReplay.
  • Extended selfReplay to handle loop domains.
  +59/-41

transform_replay.h: Rename and update selfReplay
csrc/transform_replay.h
  • Updated the documentation for selfReplay.
  • Renamed selfAllocationReplay to selfReplay.
  +3/-5

test_python_frontend.py: Simplify contiguity parameters
tests/python/test_python_frontend.py
  • Simplified contiguity parameters in define_tensor calls.
  +2/-2

Tests

test_dynamic_transform.cpp: Update and add test cases
tests/cpp/test_dynamic_transform.cpp
  • Updated test cases to use DynamicTransformTest instead of NVFuserTest.
  • Added a new test case, LoopSplit, to verify loop splitting.
  +64/-28

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Loop Handling

The new selfReplay function includes logic to handle loop domains, but it's unclear if this is correctly implemented and tested. Ensure that the loop domain replay is accurate and that it doesn't introduce any unintended side effects.

      std::vector<IterDomain*> new_loop;
      ReplaySelf replay(self->loop(), axis_map);
      for (auto id : new_self->logical()) {
        if (id->isReduction()) {
          new_loop.push_back(id);
        }
      }
    
      for (IterDomain* id : self->loop()) {
        auto it = replay.getReplay().find(id);
        NVF_ERROR(
            it != replay.getReplay().end(), "failed to replay IterDomain: ", id);
        it->second->parallelize(id->getParallelType());
        new_loop.push_back(it->second);
      }
    
      new_self->setLoopDomain(new_loop);
    }
Test Coverage

The new test cases added (DynamicTransformTest) should be reviewed to ensure they cover all the new functionality introduced by selfReplay. Verify that the tests are comprehensive and that they exercise the new loop domain replay logic.

    // Simple test of analyzing dynamic reshape
    TEST_F(DynamicTransformTest, DynamicTransform1) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion.addInput(tv1);
    
      auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
      fusion.addInput(reshape_shape0);
      auto reshape_shape1 = IrBuilder::create<Val>(DataType::Int);
      fusion.addInput(reshape_shape1);
    
      auto tv2 = reshape(tv0, {reshape_shape0, reshape_shape1});
      auto tv3 = add(tv1, tv2);
    
      fusion.addOutput(tv3);
    
      // tv2 has symbolic axes as reshape is dynamic
      NVF_CHECK(
          tv2->domain()->hasSymbolicAxis(),
          "Expected to have symbolic axes: ",
          tv2->toString());
    
      // The symbolic axes of tv2 should not be propagated to tv3 as tv1
      // is fully concrete
      NVF_CHECK(
          !tv3->domain()->hasSymbolicAxis(),
          "Not expected to have symbolic axes: ",
          tv3->toString());
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 3, 4
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        expr_eval.bind(reshape_shape0, 3L);
        expr_eval.bind(reshape_shape1, 4L);
        // We cannot infer the shape of tv1 from the above bound values, since
        // either axis of tv2 might be broadcast against one from tv1.
        expr_eval.bind(tv1->axis(0)->extent(), 3L);
        expr_eval.bind(tv1->axis(1)->extent(), 4L);
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
        NVF_CHECK(
            info.getReshapeTransforms().size() == 1,
            "Expected to have one reshape transform: ",
            info.toString());
      }
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 3, -1
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        expr_eval.bind(reshape_shape0, 3L);
        expr_eval.bind(reshape_shape1, -1L);
    
        // This should throw an exception since any reshape size of -1 must be
        // specified as a definition-time constant, as opposed to an input scalar.
        EXPECT_THAT(
            [&]() {
              auto initial_info = DynamicTransform::getInitialInfo(&fusion);
              auto info =
                  DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
            },
            ::testing::ThrowsMessage<nvfError>(::testing::HasSubstr(
                "Values of -1 passed to reshape must be constant at definition")));
      }
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 5, 4
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        expr_eval.bind(reshape_shape0, 5L);
        expr_eval.bind(reshape_shape1, 4L);
    
        // This should fail as (4 * 3) is not equal to (5 * 4)
        EXPECT_THAT(
            [&]() {
              auto initial_info = DynamicTransform::getInitialInfo(&fusion);
              auto info =
                  DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
            },
            ::testing::ThrowsMessage<nvfuser::nvfError>(::testing::HasSubstr(
                "Total element counts across view operation must match:")));
      }
    }
    
    // Reshape a tensor like another tensor
    TEST_F(DynamicTransformTest, DynamicTransform2) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      // All tensors are 2D symbolic tensors. tv1 and tv2 have the same shape
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion.addInput(tv1);
      auto tv2 = makeSymbolicTensor(2);
      fusion.addInput(tv2);
    
      // Reshape to the same shape as tv1
      auto tv3 = reshape(tv0, {tv1->axis(0)->extent(), tv1->axis(1)->extent()});
      auto tv4 = add(tv1, tv2);
      auto tv5 = add(tv3, tv4);
      fusion.addOutput(tv5);
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 3, 4
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        // Bind only tv2 extents. It should be enough as tv1 has the same
        // shape
        expr_eval.bind(tv2->axis(0)->extent(), 3L);
        expr_eval.bind(tv2->axis(1)->extent(), 4L);
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        NVF_CHECK(
            info.getReshapeTransforms().size() == 1,
            "Expected to have one reshape transform: ",
            info.toString());
      }
    }
    
    // Analyze dynamic reshape and concretize
    TEST_F(DynamicTransformTest, DynamicTransform3) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion.addInput(tv1);
    
      auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
      auto reshape_shape1 = IrBuilder::create<Val>(DataType::Int);
    
      auto tv2 = reshape(tv0, {reshape_shape0, reshape_shape1});
      auto tv3 = add(tv1, tv2);
    
      fusion.addOutput(tv3);
    
      std::vector<int64_t> shape_before({4, 3});
      std::vector<int64_t> shape_after({3, 4});
    
      ExpressionEvaluator expr_eval;
    
      // input: 4, 3
      // output: 3, 4
      expr_eval.bind(tv0->axis(0)->extent(), shape_before.at(0));
      expr_eval.bind(tv0->axis(1)->extent(), shape_before.at(1));
      expr_eval.bind(tv1->axis(0)->extent(), shape_after.at(0));
      expr_eval.bind(tv1->axis(1)->extent(), shape_after.at(1));
      // We cannot infer reshape_shape0 and reshape_shape1 from tv0's and tv1's
      // extents alone, since either of these reshaped extents could either match
      // that of tv1 or be 1, resulting in a broadcast.
      expr_eval.bind(reshape_shape0, shape_after.at(0));
      expr_eval.bind(reshape_shape1, shape_after.at(1));
    
      auto initial_info = DynamicTransform::getInitialInfo(&fusion);
      auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
      DynamicTransform::concretizeFusion(&fusion, &info);
      NVF_CHECK(
          !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn(shape_before, options);
      at::Tensor t1 = at::randn(shape_after, options);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      testValidate(
          executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    // Test multiple patterns of reshape
    TEST_F(DynamicTransformTest, DynamicTransform4) {
      std::vector<std::pair<std::vector<int64_t>, std::vector<int64_t>>>
          before_after_shapes = {
              {{4, 3}, {3, 4}},
              {{4, 3}, {12, 1}},
              {{4, 3}, {4, 3}},
              {{4, 6}, {4, 2, 3}},
          };
      for (const auto& before_after : before_after_shapes) {
        const auto& before_shape = before_after.first;
        const auto& after_shape = before_after.second;
    
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(before_shape.size());
        fusion.addInput(tv0);
        auto tv1 = makeSymbolicTensor(after_shape.size());
        fusion.addInput(tv1);
    
        std::vector<Val*> shape_arg;
        for (const auto i : c10::irange(after_shape.size())) {
          (void)i;
          shape_arg.push_back(IrBuilder::create<Val>(DataType::Int));
        }
    
        auto tv2 = reshape(tv0, shape_arg);
    
        // tv3 will also have symbolic axes
        auto tv3 = set(tv2);
        auto tv4 = add(tv1, tv3);
    
        fusion.addOutput(tv4);
    
        ExpressionEvaluator expr_eval;
    
        for (const auto i : c10::irange(before_shape.size())) {
          expr_eval.bind(tv0->axis(i)->extent(), before_shape.at(i));
        }
    
        for (const auto i : c10::irange(after_shape.size())) {
          expr_eval.bind(tv2->axis(i)->extent(), after_shape.at(i));
          // We must bind tv1's extents, since they cannot be inferred until after
          // concretization. Because tv2 is a dynamic reshape both its IterDomains
          // are Symbolic, which means both of tv3's IterDomains are also Symbolic.
      // tv1 has both IterDomains of type Iteration, but since we add tv3 to
          // it to get tv4, we do not know whether this will resolve broadcasts from
          // tv3 or not until concretization.
          expr_eval.bind(tv1->axis(i)->extent(), after_shape.at(i));
        }
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        DynamicTransform::concretizeFusion(&fusion, &info);
    
        NVF_CHECK(
            !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
      }
    }
    
    // Dynamic reshape followed by static resize
    TEST_F(DynamicTransformTest, DynamicTransform5) {
      std::vector<std::pair<std::vector<int64_t>, std::vector<int64_t>>>
          before_after_shapes = {
              {{4, 3}, {3, 4}},
          //{{4, 3}, {12, 1}}, not possible to pad a broadcast domain yet
          };
    
      for (auto before_after : before_after_shapes) {
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(2);
        fusion.addInput(tv0);
    
        auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(reshape_shape0);
        auto reshape_shape1 = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(reshape_shape1);
    
        auto tv1 = reshape(tv0, {reshape_shape0, reshape_shape1});
        auto tv2 =
            pad(tv1,
                {IrBuilder::create<Val>(1L),
                 IrBuilder::create<Val>(1L),
                 IrBuilder::create<Val>(1L),
                 IrBuilder::create<Val>(1L)});
        auto tv3 = set(tv2);
    
        fusion.addOutput(tv3);
    
        ExpressionEvaluator expr_eval;
    
        expr_eval.bind(tv0->axis(0)->extent(), before_after.first.at(0));
        expr_eval.bind(tv0->axis(1)->extent(), before_after.first.at(1));
        expr_eval.bind(tv1->axis(0)->extent(), before_after.second.at(0));
        expr_eval.bind(tv1->axis(1)->extent(), before_after.second.at(1));
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        DynamicTransform::concretizeFusion(&fusion, &info);
    
        NVF_CHECK(
            !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
      }
    }
    
    // Reshape of reshape
    TEST_F(DynamicTransformTest, DynamicTransform6) {
      std::vector<std::vector<std::vector<int64_t>>> reshape_lists = {
          {{4, 3}, {3, 4}},
          {{4, 3}, {3, 4}, {12}},
          {{4, 3}, {3, 1, 4}, {12, 1}},
          {{4, 3}, {12}, {3, 4}},
          {{4, 3}, {1, 2, 1, 3, 2}, {3, 4}},
      };
    
      for (auto reshape_list : reshape_lists) {
        std::vector<TensorView*> reshape_tvs;
    
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(reshape_list.at(0).size());
        fusion.addInput(tv0);
    
        reshape_tvs.push_back(tv0);
    
        for (auto it = reshape_list.begin() + 1; it != reshape_list.end(); ++it) {
          auto shape = *it;
          std::vector<Val*> shape_arg;
          for (const auto i : c10::irange(shape.size())) {
            (void)i;
            shape_arg.push_back(IrBuilder::create<Val>(DataType::Int));
          }
    
          auto tv = reshape(reshape_tvs.back(), shape_arg);
          reshape_tvs.push_back(tv);
        }
        fusion.addOutput(reshape_tvs.back());
    
        ExpressionEvaluator expr_eval;
    
        for (const auto i : c10::irange(reshape_list.size())) {
          const auto& shape = reshape_list.at(i);
          for (const auto j : c10::irange(shape.size())) {
            expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
          }
        }
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        DynamicTransform::concretizeFusion(&fusion, &info);
    
        NVF_CHECK(
            !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
      }
    }
    
    // Test equality of DynamicTransformInfo
    TEST_F(DynamicTransformTest, DynamicTransform7) {
      // Represents a series of reshapes
      struct TransformList {
        std::vector<std::vector<int64_t>> shapes;
      };
    
      struct ShapeInfo {
        TransformList ref_transform;
        std::vector<TransformList> equal_transforms;
        std::vector<TransformList> different_transforms;
      };
    
      std::vector<ShapeInfo> patterns;
    
      patterns.push_back(ShapeInfo{
          .ref_transform = {{{3, 4}, {4, 3}}},
          .equal_transforms =
              {{{{3, 4}, {4, 3}}}, {{{2, 8}, {4, 4}}}, {{{3, 8}, {4, 6}}}},
          .different_transforms = {{{{3, 4}, {2, 6}}}}});
    
      patterns.push_back(ShapeInfo{
          .ref_transform = {{{3, 4}, {12}, {1, 4, 3}}},
          .equal_transforms =
              {
                  {{{3, 4}, {12}, {1, 4, 3}}},
                  {{{5, 8}, {40}, {1, 4, 10}}},
              },
          .different_transforms = {
              {{{3, 4}, {12}, {4, 1, 3}}},
              {{{3, 4}, {12}, {4, 3, 1}}},
          }});
    
      for (const auto& pattern : patterns) {
        const auto& ref_transform = pattern.ref_transform;
        std::vector<TensorView*> reshape_tvs;
    
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(ref_transform.shapes.at(0).size());
        fusion.addInput(tv0);
    
        reshape_tvs.push_back(tv0);
    
        for (auto it = ref_transform.shapes.begin() + 1;
             it != ref_transform.shapes.end();
             ++it) {
          const auto& shape = *it;
          std::vector<Val*> shape_arg;
          for (const auto i : c10::irange(shape.size())) {
            (void)i;
            shape_arg.push_back(IrBuilder::create<Val>(DataType::Int));
          }
    
          auto tv = reshape(reshape_tvs.back(), shape_arg);
          reshape_tvs.push_back(tv);
        }
        fusion.addOutput(reshape_tvs.back());
    
        ExpressionEvaluator ref_expr_eval;
    
        for (const auto i : c10::irange(ref_transform.shapes.size())) {
          const auto& shape = ref_transform.shapes.at(i);
          for (const auto j : c10::irange(shape.size())) {
            ref_expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
          }
        }
    
        auto ref_initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto ref_info =
            DynamicTransformConcretizationInfo(&ref_initial_info, &ref_expr_eval);
    
        for (const auto& transform : pattern.equal_transforms) {
          NVF_CHECK(transform.shapes.size() == ref_transform.shapes.size());
          ExpressionEvaluator expr_eval;
          for (const auto i : c10::irange(transform.shapes.size())) {
            const auto& shape = transform.shapes.at(i);
            for (const auto j : c10::irange(shape.size())) {
              expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
            }
          }
    
          auto initial_info = DynamicTransform::getInitialInfo(&fusion);
          auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
          NVF_CHECK(
              ref_info == info,
              "Expected to be equal: ",
              ref_info.toString(),
              "\n",
              info.toString());
        }
    
        for (const auto& transform : pattern.different_transforms) {
          NVF_CHECK(transform.shapes.size() == ref_transform.shapes.size());
          ExpressionEvaluator expr_eval;
          for (const auto i : c10::irange(transform.shapes.size())) {
            const auto& shape = transform.shapes.at(i);
            for (const auto j : c10::irange(shape.size())) {
              expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
            }
          }
    
          auto initial_info = DynamicTransform::getInitialInfo(&fusion);
          auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
          NVF_CHECK(
              ref_info != info,
              "Expected to be different: ",
              ref_info.toString(),
              "\n",
              info.toString());
        }
      }
    }
    
    // Make sure non-dynamic reshape op is created when possible
    TEST_F(DynamicTransformTest, DynamicTransform8) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeConcreteTensor({3, 4});
      fusion.addInput(tv0);
    
      auto tv1 =
          reshape(tv0, {IrBuilder::create<Val>(4L), IrBuilder::create<Val>(3L)});
      fusion.addOutput(tv1);
    
      // Make sure the reshape is recognized as a static reshape
      NVF_CHECK(
          !tv1->domain()->hasSymbolicAxis(),
          "Not expected to have symbolic axes: ",
          tv1->toString());
    }
    
    // Mix of static and dynamic reshape. Make sure only dynamic reshape
    // is handled by the dynamic transform concretizer.
    TEST_F(DynamicTransformTest, DynamicTransform9) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
    
      auto tv1 = reshape(tv0, {3, 4}, {4, 3});
    
      auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
    
      auto tv2 = reshape(tv1, {reshape_shape0});
      fusion.addOutput(tv2);
    
      // The first reshape is static
      NVF_CHECK(
          !tv1->domain()->hasSymbolicAxis(),
          "Unexpected to have symbolic axes: ",
          tv1->toString());
      // The second reshape is dynamic
      NVF_CHECK(
          tv2->domain()->hasSymbolicAxis(),
          "Expected to have symbolic axes: ",
          tv2->toString());
    
      ExpressionEvaluator expr_eval;
    
      expr_eval.bind(tv0->axis(0)->extent(), 3L);
      expr_eval.bind(tv0->axis(1)->extent(), 4L);
      expr_eval.bind(reshape_shape0, 12L);
    
      auto initial_info = DynamicTransform::getInitialInfo(&fusion);
      auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
      // There must be only one dynamic reshape entry, and that must be
      // for tv2.
      NVF_CHECK(
          info.getReshapeTransforms().size() == 1 &&
              info.getReshapeTransforms().at(0).first == 0, // first and only reshape
          "Unexpected dynamic transform info:",
          info.toString());
    }
    
    // Make sure inherited symbolic IDs are concretized through producer projection
    TEST_F(DynamicTransformTest, DynamicTransform10) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
    
      auto tv1 = reshape(
          tv0,
          {IrBuilder::create<Val>(DataType::Int),
           IrBuilder::create<Val>(DataType::Int)});
      auto tv2 = slice(
          tv1,
          {Slice(),
           {IrBuilder::create<Val>(1L),
            sub(tv1->axis(0)->extent(), IrBuilder::create<Val>(1L))}});
      fusion.addOutput(tv2);
    
      // tv2 has a producer projection (i.e., resize). The input to the expr is
      // symbolic, and so is the output. When concretized, both the input and
      // the output must be concretized.
    
      ExpressionEvaluator expr_eval;
    
      expr_eval.bind(tv0->axis(0)->extent(), 3L);
      expr_eval.bind(tv0->axis(1)->extent(), 4L);
      expr_eval.bind(tv1->axis(0)->extent(), 4L);
      expr_eval.bind(tv1->axis(1)->extent(), 3L);
    
      auto initial_info = DynamicTransform::getInitialInfo(&fusion);
      auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
      DynamicTransform::concretizeFusion(&fusion, &info);
    
      NVF_CHECK(
          !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
    }
    
    // Simple test of hashing. Create concretization info objects with two
    // similar but different reshape sizes and see if their hashes are different.
    TEST_F(DynamicTransformTest, DynamicTransform11) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
    
      auto tv1 = reshape(
          tv0,
          {IrBuilder::create<Val>(DataType::Int),
           IrBuilder::create<Val>(DataType::Int),
           IrBuilder::create<Val>(DataType::Int)});
      fusion.addOutput(tv1);
    
      ExpressionEvaluator expr_eval1;
      // input: 4, 3
      // output: 2, 2, 3
      expr_eval1.bind(tv0->axis(0)->extent(), 4L);
      expr_eval1.bind(tv0->axis(1)->extent(), 3L);
      expr_eval1.bind(tv1->axis(0)->extent(), 2L);
      expr_eval1.bind(tv1->axis(1)->extent(), 2L);
      expr_eval1.bind(tv1->axis(2)->extent(), 3L);
    
      auto initial_info1 = DynamicTransform::getInitialInfo(&fusion);
      auto info1 = DynamicTransformConcretizationInfo(&initial_info1, &expr_eval1);
    
      ExpressionEvaluator expr_eval2;
      // input: 4, 3
      // output: 3, 2, 2
      expr_eval2.bind(tv0->axis(0)->extent(), 4L);
      expr_eval2.bind(tv0->axis(1)->extent(), 3L);
      expr_eval2.bind(tv1->axis(0)->extent(), 3L);
      expr_eval2.bind(tv1->axis(1)->extent(), 2L);
      expr_eval2.bind(tv1->axis(2)->extent(), 2L);
    
      auto initial_info2 = DynamicTransform::getInitialInfo(&fusion);
      auto info2 = DynamicTransformConcretizationInfo(&initial_info2, &expr_eval2);
    
      // Generally, different concretizations don't always imply different
      // hashes, but in this case they should differ
      auto hash1 = std::hash<DynamicTransformConcretizationInfo>{}(info1);
      auto hash2 = std::hash<DynamicTransformConcretizationInfo>{}(info2);
      NVF_CHECK(
          hash1 != hash2,
          "Unexpected hash collision: ",
          hash1,
          " for\n",
          info1.toString(),
          "and\n",
          info2.toString());
    }
    
    // Test FusionExecutorCache with dynamic reshapes
    TEST_F(DynamicTransformTest, DynamicTransformFusionExecutorCache) {
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      auto tv0 = makeSymbolicTensor(2);
      fusion->addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion->addInput(tv1);
    
      auto tv2 = reshape(tv0, {tv1->axis(0)->extent(), tv1->axis(1)->extent()});
      auto tv3 = add(tv1, tv2);
    
      fusion->addOutput(tv3);
    
      // tv2 has symbolic axes as reshape is dynamic
      NVF_CHECK(
          tv2->domain()->hasSymbolicAxis(),
          "Expected to have symbolic axes: ",
          tv2->toString());
    
      // The symbolic axes of tv2 should not be propagated to tv3 as tv1
      // is fully concrete
      NVF_CHECK(
          !tv3->domain()->hasSymbolicAxis(),
          "Not expected to have symbolic axes: ",
          tv3->toString());
    
      FusionExecutorCache executor_cache(std::move(fusion));
    
      NVF_CHECK(
          executor_cache.countRuntimes() == 0, "Expect to start with no runtimes");
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      { // trivial reshape
        auto t0 = at::randn({3, 4}, options);
        auto t1 = at::randn({3, 4}, options);
        auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
        testValidate(
            executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
        NVF_CHECK(
            executor_cache.countRuntimes() == 1,
            "Expect to create a single runtime");
      }
      { // non-trivial reshape: merge and split
        auto t0 = at::randn({3, 4}, options);
        auto t1 = at::randn({4, 3}, options);
        auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
        testValidate(
            executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
        auto num_rts = executor_cache.countRuntimes();
        auto num_concs = executor_cache.countConcretizations();
        NVF_CHECK(num_rts == 2, "Non-trivial reshape should create new runtime");
        NVF_CHECK(
            num_concs == 2,
            "Non-trivial reshape should create new concretization cache level");
      }
      { // different non-trivial reshape
        auto t0 = at::randn({2, 6}, options);
        auto t1 = at::randn({4, 3}, options);
        auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
        testValidate(
            executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
        auto num_rts = executor_cache.countRuntimes();
        auto num_concs = executor_cache.countConcretizations();
        NVF_CHECK(
            num_rts == 2,
            "Second non-trivial reshape should not create new runtime");
        NVF_CHECK(
            num_concs == 2,
            "Second non-trivial reshape should not create new concretization cache level");
      }
    }
    
    using shape_t = std::vector<int64_t>;
    using dynamic_view_invocation = std::tuple<
        shape_t, // input_shape
        shape_t, // output_shape
        bool // expect miss
        >;
    
    //! Given a collection of input/output shapes, test that FusionExecutorCache
    //! properly caches concretized Fusions. The first argument is a vector of
    //! (input_shape, output_shape, expect_miss) tuples, each of which will be
    //! run using the same FusionExecutorCache. The expect_miss element indicates
    //! whether we expect a cache hit or miss at the concretization level.
    //! reshape_before_reduction has the same meaning as in reductionViewAddFusion
    //! in test_gpu_view.cpp.
    void reductionDynamicViewAddFusion(
        std::vector<dynamic_view_invocation>& invocations,
        bool reshape_before_reduction) {
      constexpr int kReductionAxis = -1;
    
      auto input_dims = std::get<0>(invocations[0]).size();
      auto output_dims = std::get<1>(invocations[0]).size();
    
      auto bias_dims = (reshape_before_reduction) ? input_dims : output_dims;
    
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* x = makeSymbolicTensor(input_dims);
      TensorView* bias = makeSymbolicTensor(bias_dims);
      fusion.addInput(x);
      fusion.addInput(bias);
    
      auto tv1 =
          (reshape_before_reduction) ? add(x, bias) : sum(x, {kReductionAxis});
      // create vectors of input scalars describing this reshape
      std::vector<Val*> output_shape(output_dims);
      for (size_t i : c10::irange(output_dims)) {
        output_shape[i] = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(output_shape[i]);
      }
      auto x_reshape = reshape(tv1, output_shape);
      auto y = (reshape_before_reduction) ? sum(x_reshape, {kReductionAxis})
                                          : add(x_reshape, bias);
      fusion.addOutput(y);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      size_t num_concretizations = executor_cache.countConcretizations();
      // Check that concretizations and runtimes are cache misses only when they
      // should be
      auto checkCache = [&](bool expect_miss) {
        auto current = executor_cache.countConcretizations();
        ASSERT_EQ(current, num_concretizations + (size_t)expect_miss);
        num_concretizations = current;
      };
    
      for (auto& inv : invocations) {
        // Shmoo tests can occupy a lot of memory due to allocating many
        // different tensor sizes. So in order to avoid an OOM during this
        // test, we manually clear the allocator after it's reached a certain
        // threshold.
        maybeClearAllocator();
    
        auto input_shape = std::get<0>(inv);
        auto output_shape = std::get<1>(inv);
        auto expect_miss = std::get<2>(inv);
    
        NVF_ERROR(input_shape.size() == input_dims);
        NVF_ERROR(output_shape.size() == output_dims);
    
        auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
    
        at::Tensor at_x = at::randn(input_shape, options);
        auto bias_shape = (reshape_before_reduction) ? input_shape : output_shape;
        if (!reshape_before_reduction) {
          // When bias_shape = output_shape, it may contain -1s
          // concretize bias_shape so that we can properly initialize at_bias
          size_t other_numel = 1;
          ssize_t negone_dim = -1; // negative if no -1 shape is provided
          for (auto i : c10::irange(bias_shape.size())) {
            if (bias_shape[i] == -1) {
              ASSERT_EQ(negone_dim, -1); // test cases should not have multiple -1s
          negone_dim = (ssize_t)i;
            } else {
              other_numel *= bias_shape[i];
            }
          }
          if (negone_dim >= 0) {
            bias_shape[negone_dim] = (int64_t)at_x.numel() / (int64_t)other_numel;
          }
        }
        at::Tensor at_bias = at::randn(bias_shape, options);
        KernelArgumentHolder args = {at_x, at_bias};
        // Add input scalars describing the reshape size for concretization
        for (size_t i : c10::irange(output_dims)) {
          args.push(output_shape[i]);
        }
    
        auto outputs = executor_cache.runFusionWithInputs(args);
        checkCache(expect_miss);
    
        auto at_tv1 = (reshape_before_reduction) ? (at_x + at_bias)
                                                 : at::sum(at_x, kReductionAxis);
        auto at_x_reshape = at::native::view(at_tv1, output_shape);
    
        testValidate(&fusion, outputs, args, __LINE__, __FILE__);
      }
    }
    
    TEST_F(DynamicTransformTest, FusionDynamicReshapeReductionShmoo) {
      auto invocations = std::vector<dynamic_view_invocation>{
          {{8, 3 * 4, 7, 9}, {8, 3 * 4, 7, 9}, true}, // trivial
          {{8, 3 * 4, 7, 5}, {8, 3 * 4, 7, 5}, false}, // trivial
          {{8, 3 * 4, 7, 9}, {8, 3, 4, 7 * 9}, true}, // merge(2) osplit(1, 3)
          {{8, 3 * 4, 7, 9},
           {8, 3, 4 * 7, 9},
           true}, // merge(1) merge(2) osplit(1, 3)
          {{8, 3 * 4, 7, 5},
           {8, 3, 4 * 7, 5},
           false}, // merge(1) merge(2) osplit(1, 3)
          {{8, 3 * 5, 7, 9}, {8, 3, 5 * 7, 9}, false}, // merge(1) osplit(1, 3)
    
          // test passing -1 dynamically for dimension size
          // This is unsupported. See https://github.com/NVIDIA/Fuser/issues/249
          // Values of -1 must be passed as constants instead of input-dependent
          // scalars.
          //{{8, 3 * 5, 7, 9}, {8, 3, -1, 9}, false} // merge(1) osplit(1, 3)
    
          // Empty reshapes should translate to FullOp
          {{8, 0, 7, 9}, {7, 8, 0, 9}, true}, // symbolic_sizes = [ -1, -1, 0, -1 ]
          // In the case below there's now a separate Val introduced for the output
          // extent, which is zero. This is represented in
          // DynamicTransformConcretizationInfo causing cache miss
          {{8, 0, 7, 9}, {7, 8, -1, 9}, true}, // symbolic_sizes = [ -1, -1, 0, -1 ]
          {{8, 0, 7, 9}, {7, 8, 0, 0}, true}, // symbolic_sizes = [ -1, -1, 0, 0 ]
          {{8, 0, 7, 9}, {47, 0, 13, 0}, true}, // symbolic_sizes = [ -1, 0, -1, 0 ]
      };
      reductionDynamicViewAddFusion(
          invocations, true /* reshape_before_reduction */);
    }
    
    using dynamic_pad_invocation = std::tuple<
        std::vector<int64_t>, // input_shape
        std::vector<int64_t>, // pad_widths
        bool // expect miss
        >;
    
    void reductionDynamicPadAddFusion(
        std::vector<dynamic_pad_invocation>& invocations) {
      constexpr int kReductionAxis = -1;
    
      auto input_dims = std::get<0>(invocations[0]).size();
      auto num_pad_widths = std::get<1>(invocations[0]).size();
    
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* x = makeSymbolicTensor(input_dims);
      fusion.addInput(x);
    
      std::vector<Val*> pad_width_vals(num_pad_widths);
      for (auto i : c10::irange(num_pad_widths)) {
        pad_width_vals[i] = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(pad_width_vals[i]);
      }
      auto x_pad = pad(x, pad_width_vals);
      auto y = sum(x_pad, {kReductionAxis});
      fusion.addOutput(y);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      // Check that concretizations and runtimes are cache misses only when they
      // should be
      size_t num_concretizations = executor_cache.getKernelRuntimes().size();
    #define CHECK_CACHE(expect_miss, ...)                        \
      auto current = executor_cache.getKernelRuntimes().size();  \
      auto expected = num_concretizations + (size_t)expect_miss; \
      NVF_CHECK(                                                 \
          current == expected,                                   \
          "Expected cache size ",                                \
          expected,                                              \
          " but found ",                                         \
          current,                                               \
          ". ",                                                  \
          __VA_ARGS__);                                          \
      num_concretizations = current;
    
      for (auto& inv : invocations) {
        // Shmoo tests can occupy a lot of memory due to allocating many
        // different tensor sizes. So in order to avoid an OOM during this
        // test, we manually clear the allocator after it's reached a certain
        // threshold.
        maybeClearAllocator();
    
        auto input_shape = std::get<0>(inv);
        auto pad_widths = std::get<1>(inv);
        auto expect_miss = std::get<2>(inv);
    
        NVF_ERROR(input_shape.size() == input_dims);
        NVF_ERROR(pad_widths.size() == num_pad_widths);
    
        auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
    
        at::Tensor at_x = at::randn(input_shape, options);
        KernelArgumentHolder args = {at_x};
        // Add input scalars describing the pad widths for concretization
        for (size_t i : c10::irange(pad_widths.size())) {
          args.push(pad_widths[i]);
        }
    
        auto outputs = executor_cache.runFusionWithInputs(args);
        CHECK_CACHE(
            expect_miss, "Input shape=", input_shape, " pad_widths=", pad_widths);
    
        auto at_x_pad = at::pad(at_x, pad_widths);
        auto at_y = at::sum(at_x_pad, kReductionAxis);
    
        testValidate(&fusion, outputs, args, __LINE__, __FILE__);
      }
    }
    #undef CHECK_CACHE
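
As a sanity check on the shapes exercised by these invocations: a pad by {left, right} grows the padded dimension by left + right, and negative widths trim it, clamped at zero. A minimal sketch, where `paddedExtent` is a hypothetical helper and not nvFuser API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Output extent of a dimension of size `in` padded by `left` and `right`
// elements. Negative widths trim the dimension; the result never drops
// below zero.
int64_t paddedExtent(int64_t in, int64_t left, int64_t right) {
  return std::max<int64_t>(in + left + right, 0);
}
```

For the {3, 5} input padded by {-5, 2}, the last dimension becomes 5 - 5 + 2 = 2, which is the full-negative-shift case tested below.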
    
    // Test dynamic pad for various inputs
    TEST_F(DynamicTransformTest, DynamicPadShmoo) {
      // NOLINTBEGIN(bugprone-implicit-widening-of-multiplication-result)
      auto invocations = std::vector<dynamic_pad_invocation>{
          {{3, 5}, {0, 0}, true}, // trivial
    
          {{3, 5}, {2, 1}, false}, // simple pad of both sides
          {{3, 5}, {-1, 1}, false}, // shift by one
          // The following fails with a SIGFPE in innerReductionHeuristic
          // See https://github.com/NVIDIA/Fuser/issues/264
          //{{3, 5}, {-3, -2}, false}, // output is zero-dimensional
    
          // Output has size 1 so is set to broadcast.
          // This was previously "working" by concretizing the size-1 pad to
          // Iteration, even though it should be Broadcast. When set properly to
          // Broadcast, it fails with an error in ConcretizedBroadcastDomains.
          //{{3, 5}, {0, -4}, true},
    
          // Test full negative shifts, so output doesn't overlap input
          {{3, 5}, {-5, 2}, false},
          {{3, 5}, {2, -5}, false}, // full shift the other direction, re-use
    
          // The following reuses the schedule of {3, 5} inputs, and does not set
          // broadcast on the second input dimension.
          {{3, 1}, {1, 1}, false},
    
          // Test zero-dimensional input
          //{{3, 0}, {0, 0}, false}, // SIGFPE (see #264 above)
          {{3, 0}, {1, 1}, true}, // zero-dimensional concretizes differently
          //{{3, 0}, {-1, 1}, false}, // SIGFPE (see #264 above)
      };
      // NOLINTEND(bugprone-implicit-widening-of-multiplication-result)
      reductionDynamicPadAddFusion(invocations);
    }
    
    // Test that a Symbolic root/Broadcast logical is not concretized to
    // Iteration/Iteration
    TEST_F(DynamicTransformTest, FusionDynamicSliceToBroadcast) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(fusion_ptr.get());
      auto tv0 = makeSymbolicTensor(1);
      fusion.addInput(tv0);
      // tv0[:2] introduces symbolic IterDomain
      auto tv1 = slice(
          tv0, {{fusion.zeroVal(), IrBuilder::create<Val>(2L), fusion.oneVal()}});
      // tv1 has Broadcast logical, Iteration root
      auto tv2 = slice(tv1, {{fusion.zeroVal(), fusion.oneVal(), fusion.oneVal()}});
      // tv2 has a Symbolic root related to a Broadcast logical through a Resize op
      fusion.addOutput(tv2);
    
      // At concretization, tv1's logical will be set to Iteration, which will
      // propagate to tv2's root. This test checks that when tv2's root is
      // concretized to Iteration, it does not wind up overwriting the Broadcast
      // logical.
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({5}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0});
      testValidate(&fusion, outputs, {at0}, __LINE__, __FILE__);
    }
    
    // Test that empty input to cat is concretized away
    TEST_F(DynamicTransformTest, FusionDynamicEmptyCat1) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(fusion_ptr.get());
    
      auto tv0 = makeSymbolicTensor(1);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(1);
      fusion.addInput(tv1);
      auto tv2 = makeSymbolicTensor(1);
      fusion.addInput(tv2);
    
      auto tv3 = cat({tv0, tv1, tv2}, 0);
    
      fusion.addOutput(tv3);
    
      // Check correctness
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({5}, options);
      at::Tensor at1 = at::randn({0}, options);
      at::Tensor at2 = at::randn({3}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0, at1, at2});
      testValidate(&fusion, outputs, {at0, at1, at2}, __LINE__, __FILE__);
    }
    
    // Test that empty input to cat is concretized away
    TEST_F(DynamicTransformTest, FusionDynamicEmptyCat2) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(fusion_ptr.get());
    
      auto tv0 = makeSymbolicTensor(1);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(1);
      fusion.addInput(tv1);
    
      auto tv2 = cat({tv0, tv1}, 0);
    
      fusion.addOutput(tv2);
    
      // Check correctness
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({5}, options);
      at::Tensor at1 = at::randn({0}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0, at1});
      testValidate(&fusion, outputs, {at0, at1}, __LINE__, __FILE__);
    
      // Check that fusion consists only of tv2 = set(tv0)
      auto fkr = executor_cache.getMostRecentKernelRuntime();
      auto seg_fusion = fkr->fusionSegments();
      auto output_def = seg_fusion->outputs()[0]->definition();
      EXPECT_TRUE(output_def->isA<LoadStoreOp>());
      EXPECT_EQ(output_def->as<LoadStoreOp>()->opType(), LoadStoreOpType::Set);
      EXPECT_EQ(output_def->input(0), seg_fusion->inputs()[0]);
    }
    
    // Repro of https://github.com/NVIDIA/Fuser/issues/418
    TEST_F(DynamicTransformTest, DynamicTransformIssue418) {
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      auto tv0 = makeSymbolicTensor(4);
      fusion->addInput(tv0);
      auto s0 = IrBuilder::create<Val>(DataType::Int);
      fusion->addInput(s0);
    
      auto sh = shape(tv0);
      auto tv1 = reshape(tv0, {sh[0], div(sh[1], s0), s0, sh[2], sh[3]});
      // Reducing along axis 2 in tv1 is equivalent to a partial reduction across
      // axis 1 of tv0.
      auto vm = variance_mean(tv1, {2, 3, 4}, 0, true);
      fusion->addOutput(vm.mean);
      fusion->addOutput(vm.var);
    
      FusionExecutorCache executor_cache(std::move(fusion));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({256, 128, 28, 28}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0, 32});
    
      testValidate(executor_cache.fusion(), outputs, {at0, 32}, __LINE__, __FILE__);
    }
    
    TEST_F(DynamicTransformTest, Issue249) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* tv0 = makeSymbolicTensor(4);
      fusion.addInput(tv0);
    
      auto tv1 = add(tv0, tv0);
      auto tv2 = reshape(
          tv1,
          {tv1->axis(0)->extent(),
           tv1->axis(2)->extent(),
           IrBuilder::create<Val>(-1L)});
      auto tv3 = add(tv2, tv2);
      fusion.addOutput(tv3);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at_x = at::randn({2, 3, 4, 5}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({at_x});
    
      testValidate(executor_cache.fusion(), outputs, {at_x}, __LINE__, __FILE__);
    }
    
    // This is just like the test above, but uses an input scalar with value -1
    TEST_F(DynamicTransformTest, Issue249InputNegative1) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* tv0 = makeSymbolicTensor(4);
      fusion.addInput(tv0);
    
      auto s0 = IrBuilder::create<Val>(DataType::Int);
      auto s1 = IrBuilder::create<Val>(DataType::Int);
      auto s2 = IrBuilder::create<Val>(DataType::Int);
      fusion.addInput(s0);
      fusion.addInput(s1);
      fusion.addInput(s2);
    
      auto tv1 = add(tv0, tv0);
      auto tv2 = reshape(tv1, {s0, s1, s2});
      auto tv3 = add(tv2, tv2);
      fusion.addOutput(tv3);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at_x = at::randn({2, 3, 4, 5}, options);
    
      // Dynamic reshape sizes that are not constant at definition must be explicit:
      // no -1 allowed
      EXPECT_THROW(
          executor_cache.runFusionWithInputs({at_x, 2, 4, -1}), std::exception);
    
      // Passing explicit sizes works fine
      auto outputs = executor_cache.runFusionWithInputs({at_x, 2, 4, 15});
    
      testValidate(
          executor_cache.fusion(), outputs, {at_x, 2, 4, 15}, __LINE__, __FILE__);
    }
    
    // Test that we can squeeze Symbolic IterDomains and that we properly detect
    // improper concretizations where we have squeezed a dimension with extent
    // other than 1.
    // See https://github.com/NVIDIA/Fuser/issues/1273
    TEST_F(DynamicTransformTest, SymbolicSqueeze) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      auto s0 = IrBuilder::create<Val>(DataType::Index);
      auto s1 = IrBuilder::create<Val>(DataType::Index);
      fusion->addInput(tv0);
      fusion->addInput(s0);
      fusion->addInput(s1);
    
      auto tv1 = reshape(tv0, {s0, s1});
      auto tv2 = squeeze(
          tv1, std::vector<bool>({false, true})); // Squeeze second dimension
      fusion->addOutput(tv2);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({3, 2}, options);
      KernelArgumentHolder valid_args = {t0, 6, 1};
      // An invalid input has a second dimension that cannot be squeezed
      KernelArgumentHolder invalid_args = {t0, 2, 3};
    
      auto outputs = executor_cache.runFusionWithInputs(valid_args);
    
      testValidate(fusion, outputs, valid_args, __LINE__, __FILE__);
    
      // An informative error message should be given by
      // SqueezeOp::checkConcretization
      EXPECT_THAT(
          [&]() { executor_cache.runFusionWithInputs(invalid_args); },
          ::testing::ThrowsMessage<nvfuser::nvfError>(::testing::HasSubstr(
              " must concretize to IterType::Broadcast but found")));
    }
    
    // See https://github.com/NVIDIA/Fuser/issues/1468
    TEST_F(DynamicTransformTest, SymbolicExpand) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion->addInput(tv0);
    
      auto s0 = IrBuilder::create<Val>(DataType::Index);
      auto s1 = IrBuilder::create<Val>(DataType::Index);
      auto s2 = IrBuilder::create<Val>(DataType::Index);
      auto s3 = IrBuilder::create<Val>(DataType::Index);
      fusion->addInput(s0);
      fusion->addInput(s1);
      fusion->addInput(s2);
      fusion->addInput(s3);
    
      auto tv1 = reshape(tv0, {s0, s1});
      auto tv2 = expand(tv1, {s2, s3});
      auto tv3 = add(tv2, tv2);
    
      fusion->addOutput(tv3);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({3, 2}, options);
      KernelArgumentHolder valid_args = {t0, 6, 1, 6, 5};
      // An invalid input has a second dimension that cannot be expanded
      KernelArgumentHolder invalid_args = {t0, 2, 3, 2, 5};
    
      auto outputs = executor_cache.runFusionWithInputs(valid_args);
    
      testValidate(
          executor_cache.fusion(), outputs, valid_args, __LINE__, __FILE__);
    
      // An informative error message should be given during concretization
      EXPECT_THAT(
          [&]() { executor_cache.runFusionWithInputs(invalid_args); },
          ::testing::ThrowsMessage<nvfuser::nvfError>(
              ::testing::HasSubstr("Mismatch in sizes when concretizing expand.")));
    }
    
    // Test that constant zero extents are not overwritten during concretization
    // with non-constant extents.
    // See https://github.com/NVIDIA/Fuser/issues/1572
    TEST_F(DynamicTransformTest, ConcretizeConstantExtents) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion->addInput(tv0);
    
      // Explicitly cast Int to Index, so that these extents are not immediate
      // constants
      auto tv1 = reshape(
          tv0,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(4096, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(32, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(128, DataType::Int)),
          });
      auto tv2 = permute(tv1, {1, 2, 0, 3});
      auto tv3 = slice(tv2, {0, 0, 0, 0}, {32, 1, 4096, 128});
      auto tv4 = reshape(
          tv3,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(32, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(4096, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(128, DataType::Int)),
          });
      // Note this slice has zero extent in last dimension. RemoveEmptyPass should
      // recognize this and replace with full()
      auto tv5 = slice(tv4, {0, 0, 0}, {32, 4096, 0});
    
      fusion->addOutput(tv5);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({4096, 12288}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({t0});
    
      testValidate(fusion, outputs, {t0}, __LINE__, __FILE__);
    }
    
    // Test that dynamic reductions that should result in squeezes are handled
    // properly.
    // See https://github.com/NVIDIA/Fuser/issues/1667
    TEST_F(DynamicTransformTest, DynamicSqueezeTrivialReduction) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(3);
      fusion->addInput(tv0);
    
      // Explicitly cast Int to Index, so that these extents are not immediate
      // constants
      auto tv1 = reshape(
          tv0,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
          });
      auto tv2 = sum(tv1, {0, 2, 3, 4});
      fusion->addOutput(tv2);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({2, 2, 9}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({t0});
    
      testValidate(fusion, outputs, {t0}, __LINE__, __FILE__);
    }
    
    // Same as above but for Welford ops
    // See https://github.com/NVIDIA/Fuser/issues/1667
    TEST_F(DynamicTransformTest, DynamicSqueezeTrivialWelford) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(3);
      fusion->addInput(tv0);
    
      // Explicitly cast Int to Index, so that these extents are not immediate
      // constants
      auto tv1 = reshape(
          tv0,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
          });
      auto res =
          variance_mean(tv1, {0, 2, 3, 4}, /*unbiased=*/true, /*keepdim=*/false);
      fusion->addOutput(res.mean);
      fusion->addOutput(res.var);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({2, 2, 9}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({t0});
    
      testValidate(fusion, outputs, {t0}, __LINE__, __FILE__);
    }
    
    TEST_F(DynamicTransformTest, LoopSplit) {
      const int b = 2, s = 3, h = 96, e = 128;
    
      Fusion fusion;
      FusionGuard fg(&fusion);
      TensorView* in = makeContigConcreteTensor({-1, -1, 12288});
      TensorView* out = reshape(
          in,
          {shape(in)[0],
           shape(in)[1],
           IrBuilder::create<Val>(96),
           IrBuilder::create<Val>(128)});
      fusion.addInput(in);
      fusion.addOutput(out);
    
      const int d = 2;
      auto mesh = DeviceMesh::createForNumDevices(d);
      for (auto* tv : {in, out}) {
        tv->setDeviceMesh(mesh);
        tv->split(2, d, /*inner_split=*/false);
        tv->axis(2)->parallelize(ParallelType::DIDx);
        tv->setAllocationDomain(tv->getLoopDomain(), true);
      }
    
      at::Tensor in_tensor = at::randn({b, s, h * e / d}, at::Device(at::kCUDA));
      KernelArgumentHolder args({in_tensor});
      DynamicTransform::concretizeFusion(&fusion, args);
    
      ASSERT_EQ(fusion.outputs().size(), 1);
      auto* concrete_out = fusion.outputs().at(0)->as<TensorView>();
      EXPECT_EQ(getShardedLogicalAxis(concrete_out, ParallelType::DIDx), 2);
    }
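
The outer split in LoopSplit turns the size-12288 axis into an outer factor of d and an inner extent of 12288 / d = 6144, and DIDx then parallelizes the outer factor. The extent arithmetic can be sketched as follows; `ceilDiv` and `outerSplit` are illustrative helpers here, not nvFuser API:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Ceiling division, as used when a split factor does not evenly divide the
// extent being split.
int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// An outer split by `factor` yields an outer axis of extent `factor` and an
// inner axis of extent ceilDiv(extent, factor).
std::pair<int64_t, int64_t> outerSplit(int64_t extent, int64_t factor) {
  return {factor, ceilDiv(extent, factor)};
}
```

With extent 12288 and d = 2, the split produces axes of extent 2 and 6144; the test then expects the sharded logical axis of the concretized output to be axis 2.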
    
    Deprecated Function

    The selfAllocationReplay function has been renamed to selfReplay. Ensure that all remaining references to selfAllocationReplay are updated to selfReplay to avoid confusion and potential bugs.

    // Self replay.
    static TensorDomain* fullSelfReplay(
        const TensorDomain* new_self_root,
        const TensorDomain* self);
    
    // Replays the transforms on `self` from logical to the loop and
    // allocation domains onto `new_self`.
    static void selfReplay(const TensorDomain* self, TensorDomain* new_self);
    
    // Returns the loop position in producer that matches with `consumer_pos` in

    @wujingyue
    Collaborator Author

    !test

    @wujingyue
    Collaborator Author

    cc @jacobhinkle

    I tried to replay logical-to-leaf transforms from the original TV to the concretized TV. It works except for `test_issue1691`. Any idea how I can fix this?

    According to the concretized IR line `T2[ ?S4{i2} ] = view(T1[ iS2{i0} rS3{i1} ])`, a reduction dimension is only found in the concretized TV but not the original TV. Therefore, the new loop domain I computed is missing that reduction dimension and fails the validateDomainEquivalence check.

    I'm not sure if reduction dimensions have to follow a certain order. If not, I could fix this by appending all unmapped reduction dimensions to the end of the loop domain.
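
The proposed fallback (appending unmapped reduction dimensions to the end of the replayed loop domain) can be sketched with strings standing in for IterDomains; this illustrates the idea only and is not nvFuser code:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Append any reduction dimension that exists in the target domain but was
// not produced by the replay to the end of the replayed loop domain, so
// that a domain-equivalence check still accounts for every IterDomain.
std::vector<std::string> appendUnmappedReductions(
    std::vector<std::string> replayed,
    const std::vector<std::string>& target_reductions) {
  for (const auto& r : target_reductions) {
    if (std::find(replayed.begin(), replayed.end(), r) == replayed.end()) {
      replayed.push_back(r);
    }
  }
  return replayed;
}
```

In the test_issue1691 shape above, replaying onto the concretized TV would miss rS3, and this fallback would append it after the replayed iteration IDs.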

    @jacobhinkle
    Collaborator

    A reduction dimension is only found in the concretized TV but not the original TV. Therefore, the new loop domain I computed is missing that reduction dimension and fails the validateDomainEquivalence check.

    I'm not sure if reduction dimensions have to follow a certain order. If not, I could fix this by appending all unmapped reduction dimensions to the end of the loop domain.

    I am not sure I understand the issue. IIUC the purpose of this PR is to replay loop transforms when we do concretization in order to preserve DID scheduling, right? When you concretize a reshape, neither the original nor the replacement should have any reduction domains, will it?

    @wujingyue
    Collaborator Author

    I am not sure I understand the issue. IIUC the purpose of this PR is to replay loop transforms when we do concretizations in order to preserve DID scheduling, right?

    Correct.

    When you concretize a reshape, neither the original nor the replacement should have any reduction domains, will it?

    I thought so until I hit #1691, which I'm sure you know everything about.

    @jacobhinkle
    Collaborator

    I thought so until I hit #1691, which I'm sure you know everything about.

    Ah ok, actually I had forgotten about that entirely. In a case like this we are actually replacing the ViewOp output with its input, so we're not creating a new TensorView, are we? In that case the loop domain would still be intact. However, if that TV was also replaced during concretization then it might no longer have a non-trivial loop domain... I assume this is the case you are worried about.

    @wujingyue
    Collaborator Author

    wujingyue commented Mar 5, 2025

    Here's a case I want to support:

    Before concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T2[ ?S4{i2} ] = view(T1[ iS2{i0} rS3{i1} ])
    T3[ ?S4{i2} ] = -T2[ ?S4{i2} ]
    

    with all TVs' first dimension being outer-split by d, which is parallelized on DIDx.

    After concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T3[ iS4{i0} ] = -T1[ iS2{i0} rS3{i1} ]
    

    with all TVs' first dimension (i.e. i0) being split the same way as before.

    Therefore, this PR tries to replay loop transforms in addition.

    In a case like this we are actually replacing the ViewOp output with its input, so we're not creating a new TensorView are we?

    That's right -- no new TensorViews are created.

    In fact, I can fix this error by

    diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
    index a016f0af..3c9a3919 100644
    --- a/csrc/dynamic_transform.cpp
    +++ b/csrc/dynamic_transform.cpp
    @@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
         TensorView* incomplete_out_tv,
         const AnalyzeViewResult& view_analysis) {
       TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
    +  if (concrete_reshape_out_tv == inp_tv) {
    +    return inp_tv;
    +  }
     
       // Extent expressions often change when concretizing a reshape. Here we
       // replace these in all downstream expressions so that the Fusion looks just

    Why wasn't #1691 fixed this way? Can `concrete_reshape_out_tv` be different from `inp_tv` but still have more reduction dimensions than `incomplete_out_tv`? (It's currently fixed by removing reduction dimensions from `concrete_reshape_out_tv` before registering concretization.)

    @jacobhinkle
    Collaborator

    jacobhinkle commented Mar 7, 2025

    Here's a case I want to support:

    Before concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T2[ ?S4{i2} ] = view(T1[ iS2{i0} rS3{i1} ])
    T3[ ?S4{i2} ] = -T2[ ?S4{i2} ]
    

    with all TVs' first dimension being outer-split by d, which is parallelized on DIDx.

    After concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T3[ iS4{i0} ] = -T1[ iS2{i0} rS3{i1} ]
    

    with all TVs' first dimension (i.e. i0) being split the same way as before.

    We should handle this particular case in the reshape op. We know that when the input logical domain is 1D and so is the output, the reshape does nothing. We can either convert it to a set or just return the input. In fact, we know the transform needed whenever the output is 1D, regardless of the input dims (as long as they have non-Symbolic IterTypes). Anyway, I'll assume this is 2D for this PR's sake.

    Therefore, this PR tries to replay loop transforms in addition.

    In a case like this we are actually replacing the ViewOp output with its input, so we're not creating a new TensorView, are we?

    That's right -- no new TensorViews are created.

    In fact, I can fix this error by

    diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
    index a016f0af..3c9a3919 100644
    --- a/csrc/dynamic_transform.cpp
    +++ b/csrc/dynamic_transform.cpp
    @@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
         TensorView* incomplete_out_tv,
         const AnalyzeViewResult& view_analysis) {
       TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
    +  if (concrete_reshape_out_tv == inp_tv) {
    +    return inp_tv;
    +  }
     
       // Extent expressions often change when concretizing a reshape. Here we
       // replace these in all downstream expressions so that the Fusion looks just

    I am not sure why you need to return early in this function. What is the error you mention?

    Why wasn't #1691 fixed this way? Can concrete_reshape_out_tv be different from inp_tv but still have more reduction dimensions than incomplete_out_tv? (It's currently fixed by removing reduction dimensions from concrete_reshape_out_tv before registering concretization.)

    The code you linked is not removing that dimension from the replacement TV. This section is just for registering concretization of extents, so that we remove the i2 and replace it with i0 in all downstream TVs (see the comment above there). I don't think it is true that concrete_reshape_out_tv will have any reduction dims unless it is the same as inp_tv.

    @wujingyue
    Collaborator Author

    This section is just for registering concretization of extents, so that we remove the i2 and replace it with i0 in all downstream TVs

    You are right. The following

    diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
    index a016f0af..3c9a3919 100644
    --- a/csrc/dynamic_transform.cpp
    +++ b/csrc/dynamic_transform.cpp
    @@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
         TensorView* incomplete_out_tv,
         const AnalyzeViewResult& view_analysis) {
       TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
    +  if (concrete_reshape_out_tv == inp_tv) {
    +    return inp_tv;
    +  }
     
       // Extent expressions often change when concretizing a reshape. Here we
       // replace these in all downstream expressions so that the Fusion looks just

    is a wrong fix, because i2 wouldn't be replaced in downstream ops.

    I'm experimenting with putting the extra reduction dimensions in concrete_reshape_out_tv at the end of its loop domain. Will come back once I have an update...

    @wujingyue
    Collaborator Author

    !test

    @wujingyue
    Collaborator Author

    !test

    @wujingyue
    Collaborator Author

    !test --diff

    @wujingyue wujingyue requested a review from jjsjann123 March 13, 2025 02:53
    @wujingyue wujingyue marked this pull request as ready for review March 13, 2025 02:53

    @jjsjann123 jjsjann123 left a comment


    stamping since code change looks good.

    TensorDomain* new_self);
    // Self replay the transformation on `self` from logical to loop and
    // allocation onto `new_self`.
    static void selfReplay(const TensorDomain* self, TensorDomain* new_self);
    Collaborator

    I'm wondering whether we would want to replay both all the time?

    Like the existing use of fixing the allocation domain for aliases (and graph mutation, where we preserve the output memory layout). We don't necessarily need to preserve the loop transform there.
    And what does replaying the loop transform mean for those use cases?

    Collaborator

    Granted, the current uses in those cases are prior to scheduling, so there isn't any transform from root to loop in the first place. Realized that my question isn't really relevant.

    Collaborator Author

    Granted, the current uses in those cases are prior to scheduling

    That's right. I think my answer in the other comment also applies here. I should probably try to fix #3479 again after I'm done with DID loop split.

    axis_map[id] = new_id;
    i++;
    }
    for (auto&& [id, new_id] : zip(self_logical, new_self_logical)) {
    Collaborator

    👏

    tv->setDeviceMesh(mesh);
    tv->split(2, d, /*inner_split=*/false);
    tv->axis(2)->parallelize(ParallelType::DIDx);
    tv->setAllocationDomain(tv->getLoopDomain(), true);
    Collaborator

    Out of curiosity, is the reason we needed a replay on loop that, somewhere else, we expect to see DIDx on the loop domain as well as on the allocation domain?

    My naive mental model sees DIDx as specified on the allocation domain; with the existing replay, it feels almost like we have some unnecessary redundancy, i.e. loop and allocation need to have some matching parallel type.

    Collaborator Author

    The current assumption is that, before (intra-GPU) scheduling, loop has to post-dominate allocation. In other words, loop must be the "leaf" domain before scheduling. My #3621 attempted to lift this assumption but failed (cf. #3706).

    It doesn't mean that loop and allocation have to be the same order. For example, loop can be a permutation of allocation, or a split of allocation.

    Also, it doesn't mean that loop and allocation need to have the same parallel types. Loop may be more parallelized than allocation. For example, this example for overlapping parallelizes the column dimension of operand A on Stream only in loop but not in allocation. In practice, it can be implemented as a Set from a non-stream-parallelized IterDomain to a stream-parallelized IterDomain.

    @wujingyue
    Collaborator Author

    @jacobhinkle I'm surprised by the number of codegen diffs. The numbers of kernels don't match -- WARNING: Number of kernels in 6a56066b (3126) does not match number of kernels in 0e095bff (3588), which is suspicious. In many cases, the code being compared doesn't seem to describe the same math.

    @wujingyue
    Collaborator Author

    !test --diff

    @wujingyue
    Collaborator Author

    !test --diff

    @wujingyue
    Collaborator Author

    I ran the codegen diff locally for bin/test_nvfuser. The result makes more sense: the diffs don't change functionality and in fact make the code better. For example, the first diff and the second diff avoid unnecessary alloc_stride accesses. This is because of the check added in this PR.

    NVFuserTest.FusionMagicSchedulerBatchNormalization_CUDA 0 --- b8197081
    
    +++ 28aad7c4
    
    @@ -335,11 +335,11 @@
    
       Array<float, 1LL, 1> T34;
       T34[0LL]
         = T9[0LL]
         + T8[0LL];
       if (b35) {
    -    T10[(T10.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T10[((nvfuser_index_t)blockIdx.x)]
            = T34[0LL];
       }
       Array<float, 1LL, 1> T33;
       T33[0LL] = 0LL;
       T33[0LL]
    @@ -359,11 +359,11 @@
    
       Array<float, 1LL, 1> T35;
       T35[0LL]
         = T13[0LL]
         + T12[0LL];
       if (b35) {
    -    T14[(T14.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T14[((nvfuser_index_t)blockIdx.x)]
            = T35[0LL];
       }
       Array<float, 1LL, 1> T17;
       T17[0LL]
         = T6[0LL]
    NVFuserTest.FusionMagicSchedulerInstanceNormalization_CUDA 1 --- b8197081
    
    +++ 28aad7c4
    
    @@ -11,124 +11,124 @@
    
       Array<nvfuser_index_t, 2, 1> a2;
       a2 = s1.logical_size;
       nvfuser_index_t i3;
       i3 = a2[0LL];
       nvfuser_index_t i4;
    -  i4 = ceilDiv((ceilDiv(i3, ((nvfuser_index_t)blockDim.y))), 8LL);
    +  i4 = ceilDiv((ceilDiv(i3, ((nvfuser_index_t)blockDim.y))), 8);
       nvfuser_index_t i5;
       i5 = ((nvfuser_index_t)blockDim.x) * ((nvfuser_index_t)blockIdx.x);
       nvfuser_index_t i6;
       i6 = ((T3.logical_size[0LL] * ((nvfuser_index_t)threadIdx.y)) + ((nvfuser_index_t)threadIdx.x)) + i5;
       nvfuser_index_t i7;
    -  i7 = ((nvfuser_index_t)blockDim.y) * 8LL;
    +  i7 = ((nvfuser_index_t)blockDim.y) * 8;
       nvfuser_index_t i8;
       i8 = i7 * T3.logical_size[0LL];
       nvfuser_index_t i9;
       i9 = ((nvfuser_index_t)blockDim.y) * T3.logical_size[0LL];
       nvfuser_index_t i10;
       i10 = ((nvfuser_index_t)threadIdx.x) + i5;
       bool b11;
       b11 = i10 < T3.logical_size[0LL];
       nvfuser_index_t i12;
    -  i12 = (((nvfuser_index_t)blockDim.y) * 7LL) + ((nvfuser_index_t)threadIdx.y);
    +  i12 = (((nvfuser_index_t)blockDim.y) * 7) + ((nvfuser_index_t)threadIdx.y);
       nvfuser_index_t i13;
       i13 = (-i3) + ((nvfuser_index_t)threadIdx.y);
       double d14;
       d14 = (double)(i3);
       double d15;
       d15 = 1.00000000000000000e+00 * d14;
       double d16;
       d16 = reciprocal(d15);
    -  Array<float, 1LL, 1> T39;
    -  T39[0LL] = 0LL;
    +  Array<float, 1, 1> T39;
    +  T39[0] = 0;
       if (b11) {
    -    T39[0LL]
    +    T39[0]
            = T3[((T3.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((((nvfuser_index_t)blockDim.x) * T3.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
       }
    -  Array<float, 1LL, 1> T9;
    -  T9[0LL]
    -    = T39[0LL]
    +  Array<float, 1, 1> T9;
    +  T9[0]
    +    = T39[0]
         * (float) d0;
    -  Array<float, 1LL, 1> T10;
    -  T10[0LL]
    -     = T9[0LL];
    -  Array<float, 1LL, 1> T42;
    -  T42[0LL] = 0.000000000e+00f;
    +  Array<float, 1, 1> T10;
    +  T10[0]
    +     = T9[0];
    +  Array<float, 1, 1> T42;
    +  T42[0] = 0.000000000e+00f;
       #pragma unroll 1
    -  for(nvfuser_index_t i17 = 0LL; i17 < i4; ++i17) {
    +  for(nvfuser_index_t i17 = 0; i17 < i4; ++i17) {
         nvfuser_index_t i18;
         i18 = i6 + (i8 * i17);
         nvfuser_index_t i19;
         i19 = i7 * i17;
         nvfuser_index_t i20;
         i20 = i13 + i19;
         if ((b11 && ((i12 + i19) < i3))) {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    -        T40[i21] = 0LL;
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
    +        T40[i21] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
             T40[i21]
                = T8[(i18 + (i9 * (i21 + nvfuser_zero)))];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i22 = 0LL; i22 < 8LL; ++i22) {
    -        Array<float, 1LL, 1> T11;
    -        T11[0LL]
    -          = T10[0LL]
    +      for(nvfuser_index_t i22 = 0; i22 < 8; ++i22) {
    +        Array<float, 1, 1> T11;
    +        T11[0]
    +          = T10[0]
               + T40[i22];
    -        T42[0LL]
    -          = T42[0LL]
    -          + T11[0LL];
    +        T42[0]
    +          = T42[0]
    +          + T11[0];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         } else {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    -        T40[i21] = 0LL;
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
    +        T40[i21] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
             nvfuser_index_t i23;
             i23 = i21 + nvfuser_zero;
             if ((b11 && (i20 < (-(((nvfuser_index_t)blockDim.y) * i23))))) {
               T40[i21]
                  = T8[(i18 + (i9 * i23))];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i22 = 0LL; i22 < 8LL; ++i22) {
    -        Array<float, 1LL, 1> T11;
    -        T11[0LL]
    -          = T10[0LL]
    +      for(nvfuser_index_t i22 = 0; i22 < 8; ++i22) {
    +        Array<float, 1, 1> T11;
    +        T11[0]
    +          = T10[0]
               + T40[i22];
             if ((b11 && (i20 < (-(((nvfuser_index_t)blockDim.y) * (i22 + nvfuser_zero)))))) {
    -          T42[0LL]
    -            = T42[0LL]
    -            + T11[0LL];
    +          T42[0]
    +            = T42[0]
    +            + T11[0];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         }
       }
    -  Array<float, 1LL, 1> T12;
    -  T12[0LL] = 0.000000000e+00f;
    -  blockReduce<false, true, false, true>(T12[0LL], T42[0LL], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    -  Array<float, 1LL, 1> T13;
    -  T13[0LL]
    -    = T12[0LL]
    +  Array<float, 1, 1> T12;
    +  T12[0] = 0.000000000e+00f;
    +  blockReduce<false, true, false, true>(T12[0], T42[0], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    +  Array<float, 1, 1> T13;
    +  T13[0]
    +    = T12[0]
         * (float) d16;
    -  Array<float, 1LL, 1> T41;
    -  T41[0LL]
    -     = T13[0LL];
    -  if (((((nvfuser_index_t)threadIdx.y) == 0LL) && b11)) {
    +  Array<float, 1, 1> T41;
    +  T41[0]
    +     = T13[0];
    +  if (((((nvfuser_index_t)threadIdx.y) == 0) && b11)) {
         T32[i10]
    -       = T41[0LL];
    +       = T41[0];
       }
     }
    NVFuserTest.FusionMagicSchedulerInstanceNormalization_CUDA 2 --- b8197081
    
    +++ 28aad7c4
    
    @@ -23,140 +23,140 @@
    
       Array<nvfuser_index_t, 2, 1> a10;
       a10 = s9.logical_size;
       nvfuser_index_t i11;
       i11 = a10[0LL];
       nvfuser_index_t i12;
    -  i12 = ceilDiv((ceilDiv(i11, ((nvfuser_index_t)blockDim.y))), 8LL);
    +  i12 = ceilDiv((ceilDiv(i11, ((nvfuser_index_t)blockDim.y))), 8);
       nvfuser_index_t i13;
       i13 = ((nvfuser_index_t)blockDim.x) * ((nvfuser_index_t)blockIdx.x);
       nvfuser_index_t i14;
       i14 = ((T4.logical_size[0LL] * ((nvfuser_index_t)threadIdx.y)) + ((nvfuser_index_t)threadIdx.x)) + i13;
       nvfuser_index_t i15;
    -  i15 = ((nvfuser_index_t)blockDim.y) * 8LL;
    +  i15 = ((nvfuser_index_t)blockDim.y) * 8;
       nvfuser_index_t i16;
       i16 = i15 * T4.logical_size[0LL];
       nvfuser_index_t i17;
       i17 = ((nvfuser_index_t)blockDim.y) * T4.logical_size[0LL];
       nvfuser_index_t i18;
       i18 = ((nvfuser_index_t)threadIdx.x) + i13;
       bool b19;
       b19 = i18 < T4.logical_size[0LL];
       nvfuser_index_t i20;
    -  i20 = (((nvfuser_index_t)blockDim.y) * 7LL) + ((nvfuser_index_t)threadIdx.y);
    +  i20 = (((nvfuser_index_t)blockDim.y) * 7) + ((nvfuser_index_t)threadIdx.y);
       nvfuser_index_t i21;
       i21 = (-i11) + ((nvfuser_index_t)threadIdx.y);
       double d22;
       d22 = (double)(i11);
       double d23;
       d23 = 1.00000000000000000e+00 * d22;
       double d24;
       d24 = reciprocal(d23);
    -  Array<float, 1LL, 1> T39;
    -  T39[0LL] = 0LL;
    +  Array<float, 1, 1> T39;
    +  T39[0] = 0;
       if (b19) {
    -    T39[0LL]
    +    T39[0]
            = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((((nvfuser_index_t)blockDim.x) * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
       }
    -  Array<float, 1LL, 1> T16;
    -  T16[0LL]
    -    = T39[0LL]
    +  Array<float, 1, 1> T16;
    +  T16[0]
    +    = T39[0]
         * (float) d8;
    -  Array<float, 1LL, 1> T17;
    -  T17[0LL]
    -     = T16[0LL];
    -  Array<float, 1LL, 1> T42;
    -  T42[0LL] = 0.000000000e+00f;
    +  Array<float, 1, 1> T17;
    +  T17[0]
    +     = T16[0];
    +  Array<float, 1, 1> T42;
    +  T42[0] = 0.000000000e+00f;
       #pragma unroll 1
    -  for(nvfuser_index_t i25 = 0LL; i25 < i12; ++i25) {
    +  for(nvfuser_index_t i25 = 0; i25 < i12; ++i25) {
         nvfuser_index_t i26;
         i26 = i14 + (i16 * i25);
         nvfuser_index_t i27;
         i27 = i15 * i25;
         nvfuser_index_t i28;
         i28 = i21 + i27;
         if ((b19 && ((i20 + i27) < i11))) {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    -        T40[i29] = 0LL;
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
    +        T40[i29] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
             T40[i29]
                = T6[(i26 + (i17 * (i29 + nvfuser_zero)))];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i30 = 0LL; i30 < 8LL; ++i30) {
    -        Array<float, 1LL, 1> T14;
    -        T14[0LL]
    +      for(nvfuser_index_t i30 = 0; i30 < 8; ++i30) {
    +        Array<float, 1, 1> T14;
    +        T14[0]
               = T40[i30]
               * (float) d7;
    -        Array<float, 1LL, 1> T15;
    -        T15[0LL]
    -          = T14[0LL]
    +        Array<float, 1, 1> T15;
    +        T15[0]
    +          = T14[0]
               * (float) 1.00000001490116119e-01;
    -        Array<float, 1LL, 1> T18;
    -        T18[0LL]
    -          = T17[0LL]
    -          + T15[0LL];
    -        T42[0LL]
    -          = T42[0LL]
    -          + T18[0LL];
    +        Array<float, 1, 1> T18;
    +        T18[0]
    +          = T17[0]
    +          + T15[0];
    +        T42[0]
    +          = T42[0]
    +          + T18[0];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         } else {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    -        T40[i29] = 0LL;
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
    +        T40[i29] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
             nvfuser_index_t i31;
             i31 = i29 + nvfuser_zero;
             if ((b19 && (i28 < (-(((nvfuser_index_t)blockDim.y) * i31))))) {
               T40[i29]
                  = T6[(i26 + (i17 * i31))];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i30 = 0LL; i30 < 8LL; ++i30) {
    -        Array<float, 1LL, 1> T14;
    -        T14[0LL]
    +      for(nvfuser_index_t i30 = 0; i30 < 8; ++i30) {
    +        Array<float, 1, 1> T14;
    +        T14[0]
               = T40[i30]
               * (float) d7;
    -        Array<float, 1LL, 1> T15;
    -        T15[0LL]
    -          = T14[0LL]
    +        Array<float, 1, 1> T15;
    +        T15[0]
    +          = T14[0]
               * (float) 1.00000001490116119e-01;
    -        Array<float, 1LL, 1> T18;
    -        T18[0LL]
    -          = T17[0LL]
    -          + T15[0LL];
    +        Array<float, 1, 1> T18;
    +        T18[0]
    +          = T17[0]
    +          + T15[0];
             if ((b19 && (i28 < (-(((nvfuser_index_t)blockDim.y) * (i30 + nvfuser_zero)))))) {
    -          T42[0LL]
    -            = T42[0LL]
    -            + T18[0LL];
    +          T42[0]
    +            = T42[0]
    +            + T18[0];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         }
       }
    -  Array<float, 1LL, 1> T19;
    -  T19[0LL] = 0.000000000e+00f;
    -  blockReduce<false, true, false, true>(T19[0LL], T42[0LL], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    -  Array<float, 1LL, 1> T20;
    -  T20[0LL]
    -    = T19[0LL]
    +  Array<float, 1, 1> T19;
    +  T19[0] = 0.000000000e+00f;
    +  blockReduce<false, true, false, true>(T19[0], T42[0], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    +  Array<float, 1, 1> T20;
    +  T20[0]
    +    = T19[0]
         * (float) d24;
    -  Array<float, 1LL, 1> T41;
    -  T41[0LL]
    -     = T20[0LL];
    -  if (((((nvfuser_index_t)threadIdx.y) == 0LL) && b19)) {
    +  Array<float, 1, 1> T41;
    +  T41[0]
    +     = T20[0];
    +  if (((((nvfuser_index_t)threadIdx.y) == 0) && b19)) {
         T34[i18]
    -       = T41[0LL];
    +       = T41[0];
       }
     }
    NVFuserTest.FusionBNRepro_CUDA 0 --- b8197081
    
    +++ 28aad7c4
    
    @@ -292,11 +292,11 @@
    
       Array<float, 1LL, 1> T34;
       T34[0LL]
         = T9[0LL]
         + T8[0LL];
       if (b31) {
    -    T10[(T10.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T10[((nvfuser_index_t)blockIdx.x)]
            = T34[0LL];
       }
       Array<float, 1LL, 1> T33;
       T33[0LL] = 0LL;
       T33[0LL]
    @@ -316,11 +316,11 @@
    
       Array<float, 1LL, 1> T35;
       T35[0LL]
         = T13[0LL]
         + T12[0LL];
       if (b31) {
    -    T14[(T14.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T14[((nvfuser_index_t)blockIdx.x)]
            = T35[0LL];
       }
       Array<float, 1LL, 1> T17;
       T17[0LL]
         = T6[0LL]
    AliasTest.ReuseBuffer_AliasAcrossSegments 1 --- b8197081
    
    +++ 28aad7c4
    
    @@ -1,36 +1,36 @@
    
     __global__ void nvfuser_N(Tensor<float, 2, 2> T2, Tensor<float, 1, 1> T5, Tensor<float, 2, 2> T8, Tensor<float, 2, 2> T9) {
       nvfuser_index_t i0;
    -  i0 = ((nvfuser_index_t)threadIdx.x) + (128LL * ((nvfuser_index_t)blockIdx.x));
    +  i0 = ((nvfuser_index_t)threadIdx.x) + (128 * ((nvfuser_index_t)blockIdx.x));
       nvfuser_index_t i1;
       i1 = i0 % T2.logical_size[1LL];
       nvfuser_index_t i2;
       i2 = T2.logical_size[0LL] * T2.logical_size[1LL];
    -  if (((((nvfuser_index_t)blockIdx.x) < (ceilDiv(i2, 128LL))) && (i0 < i2))) {
    -    Array<float, 1LL, 1> T11;
    -    T11[0LL] = 0LL;
    -    T11[0LL]
    +  if (((((nvfuser_index_t)blockIdx.x) < (ceilDiv(i2, 128))) && (i0 < i2))) {
    +    Array<float, 1, 1> T11;
    +    T11[0] = 0;
    +    T11[0]
            = T2[((T2.alloc_stride[0LL] * (i0 / T2.logical_size[1LL])) + (T2.alloc_stride[1LL] * i1))];
    -    Array<float, 1LL, 1> T12;
    -    T12[0LL] = 0LL;
    -    T12[0LL]
    +    Array<float, 1, 1> T12;
    +    T12[0] = 0;
    +    T12[0]
            = T5[i1];
    -    Array<float, 1LL, 1> T6;
    -    T6[0LL]
    -       = T12[0LL];
    -    Array<float, 1LL, 1> T7;
    -    T7[0LL]
    -      = T6[0LL]
    -      + T11[0LL];
    -    Array<float, 1LL, 1> T13;
    -    T13[0LL]
    -      = T7[0LL]
    +    Array<float, 1, 1> T6;
    +    T6[0]
    +       = T12[0];
    +    Array<float, 1, 1> T7;
    +    T7[0]
    +      = T6[0]
    +      + T11[0];
    +    Array<float, 1, 1> T13;
    +    T13[0]
    +      = T7[0]
           + (float) 1.00000000000000000e+00;
         T8[i0]
    -       = T13[0LL];
    -    Array<float, 1LL, 1> T14;
    -    T14[0LL]
    -       = T7[0LL];
    +       = T13[0];
    +    Array<float, 1, 1> T14;
    +    T14[0]
    +       = T7[0];
         T9[i0]
    -       = T14[0LL];
    +       = T14[0];
       }
     }
    5 kernel differences from 4 tests found
    26 new tests found
    25 removed tests found
    

    @wujingyue wujingyue merged commit 342336b into main Mar 18, 2025
    54 of 60 checks passed
    @wujingyue wujingyue deleted the wjy/replay branch March 18, 2025 22:23