
Concretization replays loop transforms#3950

Merged
wujingyue merged 15 commits into main from wjy/replay on Mar 18, 2025
Conversation

@wujingyue (Collaborator) commented Feb 23, 2025

For #2563

Also extends selfAllocationReplay to replay the loop domain so it can be reused for concretization.

@wujingyue wujingyue marked this pull request as draft February 23, 2025 18:51
@wujingyue (Collaborator, Author) commented:

!test

github-actions bot commented Feb 23, 2025

Review updated until commit 75ee3fb

Description

  • Renamed selfAllocationReplay to selfReplay in TransformReplay.

  • Extended selfReplay to handle loop domains in addition to allocation domains.

  • Updated test cases to use the new selfReplay method.

  • Added a new test case for loop splitting in DynamicTransformTest.


Changes walkthrough 📝

Relevant files

Enhancement

dynamic_transform.cpp: Update concretizeNonEmptyReshape to use selfReplay
csrc/dynamic_transform.cpp
  • Included transform_replay.h.
  • Replaced selfAllocationReplay with selfReplay in concretizeNonEmptyReshape.
  +8/-3

fusion.cpp: Update aliasOutputToInput to use selfReplay
csrc/fusion.cpp
  • Replaced selfAllocationReplay with selfReplay in aliasOutputToInput.
  +1/-1

remove_bcast_squeeze.cpp: Update maybeDoReplacement to use selfReplay
csrc/preseg_passes/remove_bcast_squeeze.cpp
  • Replaced selfAllocationReplay with selfReplay in maybeDoReplacement.
  +1/-2

transform_replay.cpp: Extend selfReplay to handle loop domains
csrc/transform_replay.cpp
  • Renamed selfAllocationReplay to selfReplay.
  • Extended selfReplay to handle loop domains.
  +59/-41

transform_replay.h: Rename and update selfReplay
csrc/transform_replay.h
  • Updated the documentation for selfReplay.
  • Renamed selfAllocationReplay to selfReplay.
  +3/-5

test_python_frontend.py: Simplify contiguity parameters
tests/python/test_python_frontend.py
  • Simplified contiguity parameters in define_tensor calls.
  +2/-2

Tests

test_dynamic_transform.cpp: Update and add test cases
tests/cpp/test_dynamic_transform.cpp
  • Updated test cases to use DynamicTransformTest instead of NVFuserTest.
  • Added a new test case, LoopSplit, to verify loop splitting.
  +64/-28

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Loop Handling

The new selfReplay function includes logic to handle loop domains, but it's unclear if this is correctly implemented and tested. Ensure that the loop domain replay is accurate and that it doesn't introduce any unintended side effects.

      std::vector<IterDomain*> new_loop;
      ReplaySelf replay(self->loop(), axis_map);
      for (auto id : new_self->logical()) {
        if (id->isReduction()) {
          new_loop.push_back(id);
        }
      }
    
      for (IterDomain* id : self->loop()) {
        auto it = replay.getReplay().find(id);
        NVF_ERROR(
            it != replay.getReplay().end(), "failed to replay IterDomain: ", id);
        it->second->parallelize(id->getParallelType());
        new_loop.push_back(it->second);
      }
    
      new_self->setLoopDomain(new_loop);
    }
Test Coverage

The new test cases added (DynamicTransformTest) should be reviewed to ensure they cover all the new functionality introduced by selfReplay. Verify that the tests are comprehensive and that they exercise the new loop domain replay logic.

    // Simple test of analyzing dynamic reshape
    TEST_F(DynamicTransformTest, DynamicTransform1) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion.addInput(tv1);
    
      auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
      fusion.addInput(reshape_shape0);
      auto reshape_shape1 = IrBuilder::create<Val>(DataType::Int);
      fusion.addInput(reshape_shape1);
    
      auto tv2 = reshape(tv0, {reshape_shape0, reshape_shape1});
      auto tv3 = add(tv1, tv2);
    
      fusion.addOutput(tv3);
    
      // tv2 has symbolic axes as reshape is dynamic
      NVF_CHECK(
          tv2->domain()->hasSymbolicAxis(),
          "Expected to have symbolic axes: ",
          tv2->toString());
    
      // The symbolic axes of tv2 should not be propagated to tv3 as tv1
      // is fully concrete
      NVF_CHECK(
          !tv3->domain()->hasSymbolicAxis(),
          "Not expected to have symbolic axes: ",
          tv3->toString());
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 3, 4
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        expr_eval.bind(reshape_shape0, 3L);
        expr_eval.bind(reshape_shape1, 4L);
        // We cannot infer the shape of tv1 from the above bound values, since
        // either axis of tv2 might be broadcast against one from tv1.
        expr_eval.bind(tv1->axis(0)->extent(), 3L);
        expr_eval.bind(tv1->axis(1)->extent(), 4L);
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
        NVF_CHECK(
            info.getReshapeTransforms().size() == 1,
            "Expected to have one reshape transform: ",
            info.toString());
      }
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 3, -1
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        expr_eval.bind(reshape_shape0, 3L);
        expr_eval.bind(reshape_shape1, -1L);
    
        // This should throw an exception since any reshape size of -1 must be
        // specified as a definition-time constant, as opposed to an input scalar.
        EXPECT_THAT(
            [&]() {
              auto initial_info = DynamicTransform::getInitialInfo(&fusion);
              auto info =
                  DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
            },
            ::testing::ThrowsMessage<nvfError>(::testing::HasSubstr(
                "Values of -1 passed to reshape must be constant at definition")));
      }
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 5, 4
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        expr_eval.bind(reshape_shape0, 5L);
        expr_eval.bind(reshape_shape1, 4L);
    
        // This should fail as (4 * 3) is not equal to (5 * 4)
        EXPECT_THAT(
            [&]() {
              auto initial_info = DynamicTransform::getInitialInfo(&fusion);
              auto info =
                  DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
            },
            ::testing::ThrowsMessage<nvfuser::nvfError>(::testing::HasSubstr(
                "Total element counts across view operation must match:")));
      }
    }
    
    // Reshape a tensor like another tensor
    TEST_F(DynamicTransformTest, DynamicTransform2) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      // All tensors are 2D symbolic tensors. tv1 and tv2 have the same shape
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion.addInput(tv1);
      auto tv2 = makeSymbolicTensor(2);
      fusion.addInput(tv2);
    
      // Reshape to the same shape as tv1
      auto tv3 = reshape(tv0, {tv1->axis(0)->extent(), tv1->axis(1)->extent()});
      auto tv4 = add(tv1, tv2);
      auto tv5 = add(tv3, tv4);
      fusion.addOutput(tv5);
    
      {
        ExpressionEvaluator expr_eval;
    
        // input: 4, 3
        // output: 3, 4
        expr_eval.bind(tv0->axis(0)->extent(), 4L);
        expr_eval.bind(tv0->axis(1)->extent(), 3L);
        // Bind only tv2 extents. It should be enough as tv1 has the same
        // shape
        expr_eval.bind(tv2->axis(0)->extent(), 3L);
        expr_eval.bind(tv2->axis(1)->extent(), 4L);
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        NVF_CHECK(
            info.getReshapeTransforms().size() == 1,
            "Expected to have one reshape transform: ",
            info.toString());
      }
    }
    
    // Analyze dynamic reshape and concretize
    TEST_F(DynamicTransformTest, DynamicTransform3) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion.addInput(tv1);
    
      auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
      auto reshape_shape1 = IrBuilder::create<Val>(DataType::Int);
    
      auto tv2 = reshape(tv0, {reshape_shape0, reshape_shape1});
      auto tv3 = add(tv1, tv2);
    
      fusion.addOutput(tv3);
    
      std::vector<int64_t> shape_before({4, 3});
      std::vector<int64_t> shape_after({3, 4});
    
      ExpressionEvaluator expr_eval;
    
      // input: 4, 3
      // output: 3, 4
      expr_eval.bind(tv0->axis(0)->extent(), shape_before.at(0));
      expr_eval.bind(tv0->axis(1)->extent(), shape_before.at(1));
      expr_eval.bind(tv1->axis(0)->extent(), shape_after.at(0));
      expr_eval.bind(tv1->axis(1)->extent(), shape_after.at(1));
      // We cannot infer reshape_shape0 and reshape_shape1 from tv0's and tv1's
      // extents alone, since either of these reshaped extents could either match
      // that of tv1 or be 1, resulting in a broadcast.
      expr_eval.bind(reshape_shape0, shape_after.at(0));
      expr_eval.bind(reshape_shape1, shape_after.at(1));
    
      auto initial_info = DynamicTransform::getInitialInfo(&fusion);
      auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
      DynamicTransform::concretizeFusion(&fusion, &info);
      NVF_CHECK(
          !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn(shape_before, options);
      at::Tensor t1 = at::randn(shape_after, options);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
    
      testValidate(
          executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
    }
    
    // Test multiple patterns of reshape
    TEST_F(DynamicTransformTest, DynamicTransform4) {
      std::vector<std::pair<std::vector<int64_t>, std::vector<int64_t>>>
          before_after_shapes = {
              {{4, 3}, {3, 4}},
              {{4, 3}, {12, 1}},
              {{4, 3}, {4, 3}},
              {{4, 6}, {4, 2, 3}},
          };
      for (const auto& before_after : before_after_shapes) {
        const auto& before_shape = before_after.first;
        const auto& after_shape = before_after.second;
    
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(before_shape.size());
        fusion.addInput(tv0);
        auto tv1 = makeSymbolicTensor(after_shape.size());
        fusion.addInput(tv1);
    
        std::vector<Val*> shape_arg;
        for (const auto i : c10::irange(after_shape.size())) {
          (void)i;
          shape_arg.push_back(IrBuilder::create<Val>(DataType::Int));
        }
    
        auto tv2 = reshape(tv0, shape_arg);
    
        // tv3 will also have symbolic axes
        auto tv3 = set(tv2);
        auto tv4 = add(tv1, tv3);
    
        fusion.addOutput(tv4);
    
        ExpressionEvaluator expr_eval;
    
        for (const auto i : c10::irange(before_shape.size())) {
          expr_eval.bind(tv0->axis(i)->extent(), before_shape.at(i));
        }
    
        for (const auto i : c10::irange(after_shape.size())) {
          expr_eval.bind(tv2->axis(i)->extent(), after_shape.at(i));
          // We must bind tv1's extents, since they cannot be inferred until after
          // concretization. Because tv2 is a dynamic reshape both its IterDomains
          // are Symbolic, which means both of tv3's IterDomains are also Symbolic.
      // tv1 has both IterDomains of type Iteration, but since we add tv3 to
          // it to get tv4, we do not know whether this will resolve broadcasts from
          // tv3 or not until concretization.
          expr_eval.bind(tv1->axis(i)->extent(), after_shape.at(i));
        }
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        DynamicTransform::concretizeFusion(&fusion, &info);
    
        NVF_CHECK(
            !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
      }
    }
    
    // Dynamic reshape followed by static resize
    TEST_F(DynamicTransformTest, DynamicTransform5) {
      std::vector<std::pair<std::vector<int64_t>, std::vector<int64_t>>>
          before_after_shapes = {
              {{4, 3}, {3, 4}},
          //{{4, 3}, {12, 1}}, not possible to pad a broadcast domain yet
          };
    
      for (auto before_after : before_after_shapes) {
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(2);
        fusion.addInput(tv0);
    
        auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(reshape_shape0);
        auto reshape_shape1 = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(reshape_shape1);
    
        auto tv1 = reshape(tv0, {reshape_shape0, reshape_shape1});
        auto tv2 =
            pad(tv1,
                {IrBuilder::create<Val>(1L),
                 IrBuilder::create<Val>(1L),
                 IrBuilder::create<Val>(1L),
                 IrBuilder::create<Val>(1L)});
        auto tv3 = set(tv2);
    
        fusion.addOutput(tv3);
    
        ExpressionEvaluator expr_eval;
    
        expr_eval.bind(tv0->axis(0)->extent(), before_after.first.at(0));
        expr_eval.bind(tv0->axis(1)->extent(), before_after.first.at(1));
        expr_eval.bind(tv1->axis(0)->extent(), before_after.second.at(0));
        expr_eval.bind(tv1->axis(1)->extent(), before_after.second.at(1));
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        DynamicTransform::concretizeFusion(&fusion, &info);
    
        NVF_CHECK(
            !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
      }
    }
    
    // Reshape of reshape
    TEST_F(DynamicTransformTest, DynamicTransform6) {
      std::vector<std::vector<std::vector<int64_t>>> reshape_lists = {
          {{4, 3}, {3, 4}},
          {{4, 3}, {3, 4}, {12}},
          {{4, 3}, {3, 1, 4}, {12, 1}},
          {{4, 3}, {12}, {3, 4}},
          {{4, 3}, {1, 2, 1, 3, 2}, {3, 4}},
      };
    
      for (auto reshape_list : reshape_lists) {
        std::vector<TensorView*> reshape_tvs;
    
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(reshape_list.at(0).size());
        fusion.addInput(tv0);
    
        reshape_tvs.push_back(tv0);
    
        for (auto it = reshape_list.begin() + 1; it != reshape_list.end(); ++it) {
          auto shape = *it;
          std::vector<Val*> shape_arg;
          for (const auto i : c10::irange(shape.size())) {
            (void)i;
            shape_arg.push_back(IrBuilder::create<Val>(DataType::Int));
          }
    
          auto tv = reshape(reshape_tvs.back(), shape_arg);
          reshape_tvs.push_back(tv);
        }
        fusion.addOutput(reshape_tvs.back());
    
        ExpressionEvaluator expr_eval;
    
        for (const auto i : c10::irange(reshape_list.size())) {
          const auto& shape = reshape_list.at(i);
          for (const auto j : c10::irange(shape.size())) {
            expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
          }
        }
    
        auto initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
        DynamicTransform::concretizeFusion(&fusion, &info);
    
        NVF_CHECK(
            !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
      }
    }
    
    // Test equality of DynamicTransformInfo
    TEST_F(DynamicTransformTest, DynamicTransform7) {
      // Represents a series of reshapes
      struct TransformList {
        std::vector<std::vector<int64_t>> shapes;
      };
    
      struct ShapeInfo {
        TransformList ref_transform;
        std::vector<TransformList> equal_transforms;
        std::vector<TransformList> different_transforms;
      };
    
      std::vector<ShapeInfo> patterns;
    
      patterns.push_back(ShapeInfo{
          .ref_transform = {{{3, 4}, {4, 3}}},
          .equal_transforms =
              {{{{3, 4}, {4, 3}}}, {{{2, 8}, {4, 4}}}, {{{3, 8}, {4, 6}}}},
          .different_transforms = {{{{3, 4}, {2, 6}}}}});
    
      patterns.push_back(ShapeInfo{
          .ref_transform = {{{3, 4}, {12}, {1, 4, 3}}},
          .equal_transforms =
              {
                  {{{3, 4}, {12}, {1, 4, 3}}},
                  {{{5, 8}, {40}, {1, 4, 10}}},
              },
          .different_transforms = {
              {{{3, 4}, {12}, {4, 1, 3}}},
              {{{3, 4}, {12}, {4, 3, 1}}},
          }});
    
      for (const auto& pattern : patterns) {
        const auto& ref_transform = pattern.ref_transform;
        std::vector<TensorView*> reshape_tvs;
    
        Fusion fusion;
        FusionGuard fg(&fusion);
    
        auto tv0 = makeSymbolicTensor(ref_transform.shapes.at(0).size());
        fusion.addInput(tv0);
    
        reshape_tvs.push_back(tv0);
    
        for (auto it = ref_transform.shapes.begin() + 1;
             it != ref_transform.shapes.end();
             ++it) {
          const auto& shape = *it;
          std::vector<Val*> shape_arg;
          for (const auto i : c10::irange(shape.size())) {
            (void)i;
            shape_arg.push_back(IrBuilder::create<Val>(DataType::Int));
          }
    
          auto tv = reshape(reshape_tvs.back(), shape_arg);
          reshape_tvs.push_back(tv);
        }
        fusion.addOutput(reshape_tvs.back());
    
        ExpressionEvaluator ref_expr_eval;
    
        for (const auto i : c10::irange(ref_transform.shapes.size())) {
          const auto& shape = ref_transform.shapes.at(i);
          for (const auto j : c10::irange(shape.size())) {
            ref_expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
          }
        }
    
        auto ref_initial_info = DynamicTransform::getInitialInfo(&fusion);
        auto ref_info =
            DynamicTransformConcretizationInfo(&ref_initial_info, &ref_expr_eval);
    
        for (const auto& transform : pattern.equal_transforms) {
          NVF_CHECK(transform.shapes.size() == ref_transform.shapes.size());
          ExpressionEvaluator expr_eval;
          for (const auto i : c10::irange(transform.shapes.size())) {
            const auto& shape = transform.shapes.at(i);
            for (const auto j : c10::irange(shape.size())) {
              expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
            }
          }
    
          auto initial_info = DynamicTransform::getInitialInfo(&fusion);
          auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
          NVF_CHECK(
              ref_info == info,
              "Expected to be equal: ",
              ref_info.toString(),
              "\n",
              info.toString());
        }
    
        for (const auto& transform : pattern.different_transforms) {
          NVF_CHECK(transform.shapes.size() == ref_transform.shapes.size());
          ExpressionEvaluator expr_eval;
          for (const auto i : c10::irange(transform.shapes.size())) {
            const auto& shape = transform.shapes.at(i);
            for (const auto j : c10::irange(shape.size())) {
              expr_eval.bind(reshape_tvs.at(i)->axis(j)->extent(), shape.at(j));
            }
          }
    
          auto initial_info = DynamicTransform::getInitialInfo(&fusion);
          auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
          NVF_CHECK(
              ref_info != info,
              "Expected to be different: ",
              ref_info.toString(),
              "\n",
              info.toString());
        }
      }
    }
    
    // Make sure non-dynamic reshape op is created when possible
    TEST_F(DynamicTransformTest, DynamicTransform8) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeConcreteTensor({3, 4});
      fusion.addInput(tv0);
    
      auto tv1 =
          reshape(tv0, {IrBuilder::create<Val>(4L), IrBuilder::create<Val>(3L)});
      fusion.addOutput(tv1);
    
      // Make sure the reshape is recognized as a static reshape
      NVF_CHECK(
          !tv1->domain()->hasSymbolicAxis(),
          "Not expected to have symbolic axes: ",
          tv1->toString());
    }
    
    // Mix of static and dynamic reshape. Make sure only dynamic reshape
    // is handled by the dynamic transform concretizer.
    TEST_F(DynamicTransformTest, DynamicTransform9) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
    
      auto tv1 = reshape(tv0, {3, 4}, {4, 3});
    
      auto reshape_shape0 = IrBuilder::create<Val>(DataType::Int);
    
      auto tv2 = reshape(tv1, {reshape_shape0});
      fusion.addOutput(tv2);
    
      // The first reshape is static
      NVF_CHECK(
          !tv1->domain()->hasSymbolicAxis(),
          "Unexpected to have symbolic axes: ",
          tv1->toString());
      // The second reshape is dynamic
      NVF_CHECK(
          tv2->domain()->hasSymbolicAxis(),
          "Expected to have symbolic axes: ",
          tv2->toString());
    
      ExpressionEvaluator expr_eval;
    
      expr_eval.bind(tv0->axis(0)->extent(), 3L);
      expr_eval.bind(tv0->axis(1)->extent(), 4L);
      expr_eval.bind(reshape_shape0, 12L);
    
      auto initial_info = DynamicTransform::getInitialInfo(&fusion);
      auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
      // There must be only one dynamic reshape entry, and that must be
      // for tv2.
      NVF_CHECK(
          info.getReshapeTransforms().size() == 1 &&
              info.getReshapeTransforms().at(0).first == 0, // first and only reshape
          "Unexpected dynamic transform info:",
          info.toString());
    }
    
    // Make sure inherited symbolic IDs are concretized through producer projection
    TEST_F(DynamicTransformTest, DynamicTransform10) {
      Fusion fusion;
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
    
      auto tv1 = reshape(
          tv0,
          {IrBuilder::create<Val>(DataType::Int),
           IrBuilder::create<Val>(DataType::Int)});
      auto tv2 = slice(
          tv1,
          {Slice(),
           {IrBuilder::create<Val>(1L),
            sub(tv1->axis(0)->extent(), IrBuilder::create<Val>(1L))}});
      fusion.addOutput(tv2);
    
      // tv2 has a producer projection (i.e., resize). The input to the expr is
      // symbolic, and so is the output. When concretized, both the input and
      // the output must be concretized.
    
      ExpressionEvaluator expr_eval;
    
      expr_eval.bind(tv0->axis(0)->extent(), 3L);
      expr_eval.bind(tv0->axis(1)->extent(), 4L);
      expr_eval.bind(tv1->axis(0)->extent(), 4L);
      expr_eval.bind(tv1->axis(1)->extent(), 3L);
    
      auto initial_info = DynamicTransform::getInitialInfo(&fusion);
      auto info = DynamicTransformConcretizationInfo(&initial_info, &expr_eval);
    
      DynamicTransform::concretizeFusion(&fusion, &info);
    
      NVF_CHECK(
          !fusion.hasDynamicTransform(), "Expected to have no dynamic transform");
    }
    
    // Simple test of hashing. Create concretization info objects with two
    // similar but different reshape sizes and see if their hashes are different.
    TEST_F(DynamicTransformTest, DynamicTransform11) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion.addInput(tv0);
    
      auto tv1 = reshape(
          tv0,
          {IrBuilder::create<Val>(DataType::Int),
           IrBuilder::create<Val>(DataType::Int),
           IrBuilder::create<Val>(DataType::Int)});
      fusion.addOutput(tv1);
    
      ExpressionEvaluator expr_eval1;
      // input: 4, 3
      // output: 2, 2, 3
      expr_eval1.bind(tv0->axis(0)->extent(), 4L);
      expr_eval1.bind(tv0->axis(1)->extent(), 3L);
      expr_eval1.bind(tv1->axis(0)->extent(), 2L);
      expr_eval1.bind(tv1->axis(1)->extent(), 2L);
      expr_eval1.bind(tv1->axis(2)->extent(), 3L);
    
      auto initial_info1 = DynamicTransform::getInitialInfo(&fusion);
      auto info1 = DynamicTransformConcretizationInfo(&initial_info1, &expr_eval1);
    
      ExpressionEvaluator expr_eval2;
      // input: 4, 3
      // output: 3, 2, 2
      expr_eval2.bind(tv0->axis(0)->extent(), 4L);
      expr_eval2.bind(tv0->axis(1)->extent(), 3L);
      expr_eval2.bind(tv1->axis(0)->extent(), 3L);
      expr_eval2.bind(tv1->axis(1)->extent(), 2L);
      expr_eval2.bind(tv1->axis(2)->extent(), 2L);
    
      auto initial_info2 = DynamicTransform::getInitialInfo(&fusion);
      auto info2 = DynamicTransformConcretizationInfo(&initial_info2, &expr_eval2);
    
      // Generally, different concretizations don't always imply different
      // hashes, but in this case they should differ
      auto hash1 = std::hash<DynamicTransformConcretizationInfo>{}(info1);
      auto hash2 = std::hash<DynamicTransformConcretizationInfo>{}(info2);
      NVF_CHECK(
          hash1 != hash2,
          "Unexpected hash collision: ",
          hash1,
          " for\n",
          info1.toString(),
          "and\n",
          info2.toString());
    }
    
    // Test FusionExecutorCache with dynamic reshapes
    TEST_F(DynamicTransformTest, DynamicTransformFusionExecutorCache) {
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      auto tv0 = makeSymbolicTensor(2);
      fusion->addInput(tv0);
      auto tv1 = makeSymbolicTensor(2);
      fusion->addInput(tv1);
    
      auto tv2 = reshape(tv0, {tv1->axis(0)->extent(), tv1->axis(1)->extent()});
      auto tv3 = add(tv1, tv2);
    
      fusion->addOutput(tv3);
    
      // tv2 has symbolic axes as reshape is dynamic
      NVF_CHECK(
          tv2->domain()->hasSymbolicAxis(),
          "Expected to have symbolic axes: ",
          tv2->toString());
    
      // The symbolic axes of tv2 should not be propagated to tv3 as tv1
      // is fully concrete
      NVF_CHECK(
          !tv3->domain()->hasSymbolicAxis(),
          "Not expected to have symbolic axes: ",
          tv3->toString());
    
      FusionExecutorCache executor_cache(std::move(fusion));
    
      NVF_CHECK(
          executor_cache.countRuntimes() == 0, "Expect to start with no runtimes");
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      { // trivial reshape
        auto t0 = at::randn({3, 4}, options);
        auto t1 = at::randn({3, 4}, options);
        auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
        testValidate(
            executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
        NVF_CHECK(
            executor_cache.countRuntimes() == 1,
            "Expect to create a single runtime");
      }
      { // non-trivial reshape: merge and split
        auto t0 = at::randn({3, 4}, options);
        auto t1 = at::randn({4, 3}, options);
        auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
        testValidate(
            executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
        auto num_rts = executor_cache.countRuntimes();
        auto num_concs = executor_cache.countConcretizations();
        NVF_CHECK(num_rts == 2, "Non-trivial reshape should create new runtime");
        NVF_CHECK(
            num_concs == 2,
            "Non-trivial reshape should create new concretization cache level");
      }
      { // different non-trivial reshape
        auto t0 = at::randn({2, 6}, options);
        auto t1 = at::randn({4, 3}, options);
        auto cg_outputs = executor_cache.runFusionWithInputs({t0, t1});
        testValidate(
            executor_cache.fusion(), cg_outputs, {t0, t1}, __LINE__, __FILE__);
        auto num_rts = executor_cache.countRuntimes();
        auto num_concs = executor_cache.countConcretizations();
        NVF_CHECK(
            num_rts == 2,
            "Second non-trivial reshape should not create new runtime");
        NVF_CHECK(
            num_concs == 2,
            "Second non-trivial reshape should not create new concretization cache level");
      }
    }
    
    using shape_t = std::vector<int64_t>;
    using dynamic_view_invocation = std::tuple<
        shape_t, // input_shape
        shape_t, // output_shape
        bool // expect miss
        >;
    
    //! Given a collection of input/output shapes, test that FusionExecutorCache
    //! properly caches concretized Fusions. The first argument is a vector of
    //! (input_shape, output_shape, expect_miss) tuples, each of which will be
    //! run using the same FusionExecutorCache. The expect_miss element indicates
    //! whether we expect a cache hit or miss at the concretization level.
    //! reshape_before_reduction has the same meaning as in reductionViewAddFusion
    //! in test_gpu_view.cpp.
    void reductionDynamicViewAddFusion(
        std::vector<dynamic_view_invocation>& invocations,
        bool reshape_before_reduction) {
      constexpr int kReductionAxis = -1;
    
      auto input_dims = std::get<0>(invocations[0]).size();
      auto output_dims = std::get<1>(invocations[0]).size();
    
      auto bias_dims = (reshape_before_reduction) ? input_dims : output_dims;
    
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* x = makeSymbolicTensor(input_dims);
      TensorView* bias = makeSymbolicTensor(bias_dims);
      fusion.addInput(x);
      fusion.addInput(bias);
    
      auto tv1 =
          (reshape_before_reduction) ? add(x, bias) : sum(x, {kReductionAxis});
      // create vectors of input scalars describing this reshape
      std::vector<Val*> output_shape(output_dims);
      for (size_t i : c10::irange(output_dims)) {
        output_shape[i] = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(output_shape[i]);
      }
      auto x_reshape = reshape(tv1, output_shape);
      auto y = (reshape_before_reduction) ? sum(x_reshape, {kReductionAxis})
                                          : add(x_reshape, bias);
      fusion.addOutput(y);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      size_t num_concretizations = executor_cache.countConcretizations();
      // Check that concretizations and runtimes are cache misses only when they
      // should be
      auto checkCache = [&](bool expect_miss) {
        auto current = executor_cache.countConcretizations();
        ASSERT_EQ(current, num_concretizations + (size_t)expect_miss);
        num_concretizations = current;
      };
    
      for (auto& inv : invocations) {
        // Shmoo tests can occupy a lot of memory due to allocating many
        // different tensor sizes. So in order to avoid an OOM during this
        // test, we manually clear the allocator after it's reached a certain
        // threshold.
        maybeClearAllocator();
    
        auto input_shape = std::get<0>(inv);
        auto output_shape = std::get<1>(inv);
        auto expect_miss = std::get<2>(inv);
    
        NVF_ERROR(input_shape.size() == input_dims);
        NVF_ERROR(output_shape.size() == output_dims);
    
        auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
    
        at::Tensor at_x = at::randn(input_shape, options);
        auto bias_shape = (reshape_before_reduction) ? input_shape : output_shape;
        if (!reshape_before_reduction) {
          // When bias_shape = output_shape, it may contain -1s
          // concretize bias_shape so that we can properly initialize at_bias
          size_t other_numel = 1;
          ssize_t negone_dim = -1; // negative if no -1 shape is provided
          for (auto i : c10::irange(bias_shape.size())) {
            if (bias_shape[i] == -1) {
              ASSERT_EQ(negone_dim, -1); // test cases should not have multiple -1s
          negone_dim = (ssize_t)i;
            } else {
              other_numel *= bias_shape[i];
            }
          }
          if (negone_dim >= 0) {
            bias_shape[negone_dim] = (int64_t)at_x.numel() / (int64_t)other_numel;
          }
        }
        at::Tensor at_bias = at::randn(bias_shape, options);
        KernelArgumentHolder args = {at_x, at_bias};
        // Add input scalars describing the reshape size for concretization
        for (size_t i : c10::irange(output_dims)) {
          args.push(output_shape[i]);
        }
    
        auto outputs = executor_cache.runFusionWithInputs(args);
        checkCache(expect_miss);
    
        auto at_tv1 = (reshape_before_reduction) ? (at_x + at_bias)
                                                 : at::sum(at_x, kReductionAxis);
        auto at_x_reshape = at::native::view(at_tv1, output_shape);
    
        testValidate(&fusion, outputs, args, __LINE__, __FILE__);
      }
    }
    
    TEST_F(DynamicTransformTest, FusionDynamicReshapeReductionShmoo) {
      auto invocations = std::vector<dynamic_view_invocation>{
          {{8, 3 * 4, 7, 9}, {8, 3 * 4, 7, 9}, true}, // trivial
          {{8, 3 * 4, 7, 5}, {8, 3 * 4, 7, 5}, false}, // trivial
          {{8, 3 * 4, 7, 9}, {8, 3, 4, 7 * 9}, true}, // merge(2) osplit(1, 3)
          {{8, 3 * 4, 7, 9},
           {8, 3, 4 * 7, 9},
           true}, // merge(1) merge(2) osplit(1, 3)
          {{8, 3 * 4, 7, 5},
           {8, 3, 4 * 7, 5},
           false}, // merge(1) merge(2) osplit(1, 3)
          {{8, 3 * 5, 7, 9}, {8, 3, 5 * 7, 9}, false}, // merge(1) osplit(1, 3)
    
          // test passing -1 dynamically for dimension size
          // This is unsupported. See https://github.com/NVIDIA/Fuser/issues/249
          // Values of -1 must be passed as constants instead of input-dependent
          // scalars.
          //{{8, 3 * 5, 7, 9}, {8, 3, -1, 9}, false} // merge(1) osplit(1, 3)
    
          // Empty reshapes should translate to FullOp
          {{8, 0, 7, 9}, {7, 8, 0, 9}, true}, // symbolic_sizes = [ -1, -1, 0, -1 ]
          // In the case below there's now a separate Val introduced for the output
          // extent, which is zero. This is represented in
          // DynamicTransformConcretizationInfo causing cache miss
          {{8, 0, 7, 9}, {7, 8, -1, 9}, true}, // symbolic_sizes = [ -1, -1, 0, -1 ]
          {{8, 0, 7, 9}, {7, 8, 0, 0}, true}, // symbolic_sizes = [ -1, -1, 0, 0 ]
          {{8, 0, 7, 9}, {47, 0, 13, 0}, true}, // symbolic_sizes = [ -1, 0, -1, 0 ]
      };
      reductionDynamicViewAddFusion(
          invocations, true /* reshape_before_reduction */);
    }
    
    using dynamic_pad_invocation = std::tuple<
        std::vector<int64_t>, // input_shape
        std::vector<int64_t>, // pad_widths
        bool // expect miss
        >;
    
    void reductionDynamicPadAddFusion(
        std::vector<dynamic_pad_invocation>& invocations) {
      constexpr int kReductionAxis = -1;
    
      auto input_dims = std::get<0>(invocations[0]).size();
      auto num_pad_widths = std::get<1>(invocations[0]).size();
    
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* x = makeSymbolicTensor(input_dims);
      fusion.addInput(x);
    
      std::vector<Val*> pad_width_vals(num_pad_widths);
      for (auto i : c10::irange(num_pad_widths)) {
        pad_width_vals[i] = IrBuilder::create<Val>(DataType::Int);
        fusion.addInput(pad_width_vals[i]);
      }
      auto x_pad = pad(x, pad_width_vals);
      auto y = sum(x_pad, {kReductionAxis});
      fusion.addOutput(y);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      // Check that concretizations and runtimes are cache misses only when they
      // should be
      size_t num_concretizations = executor_cache.getKernelRuntimes().size();
    #define CHECK_CACHE(expect_miss, ...)                        \
      auto current = executor_cache.getKernelRuntimes().size();  \
      auto expected = num_concretizations + (size_t)expect_miss; \
      NVF_CHECK(                                                 \
          current == expected,                                   \
          "Expected cache size ",                                \
          expected,                                              \
          " but found ",                                         \
          current,                                               \
          ". ",                                                  \
          __VA_ARGS__);                                          \
      num_concretizations = current;
    
      for (auto& inv : invocations) {
        // Shmoo tests can occupy a lot of memory due to allocating many
        // different tensor sizes. So in order to avoid an OOM during this
        // test, we manually clear the allocator after it's reached a certain
        // threshold.
        maybeClearAllocator();
    
        auto input_shape = std::get<0>(inv);
        auto pad_widths = std::get<1>(inv);
        auto expect_miss = std::get<2>(inv);
    
        NVF_ERROR(input_shape.size() == input_dims);
        NVF_ERROR(pad_widths.size() == num_pad_widths);
    
        auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
    
        at::Tensor at_x = at::randn(input_shape, options);
        KernelArgumentHolder args = {at_x};
        // Add input scalars describing the pad widths for concretization
        for (size_t i : c10::irange(pad_widths.size())) {
          args.push(pad_widths[i]);
        }
    
        auto outputs = executor_cache.runFusionWithInputs(args);
        CHECK_CACHE(
            expect_miss, "Input shape=", input_shape, " pad_widths=", pad_widths);
    
        auto at_x_pad = at::pad(at_x, pad_widths);
        auto at_y = at::sum(at_x_pad, kReductionAxis);
    
        testValidate(&fusion, outputs, args, __LINE__, __FILE__);
      }
    }
    #undef CHECK_CACHE
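
As a sanity check on the shapes exercised by these invocations: a pad by {left, right} grows the padded dimension by left + right, and negative widths trim it, clamped at zero. A minimal sketch, where `paddedExtent` is a hypothetical helper and not nvFuser API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Output extent of a dimension of size `in` padded by `left` and `right`
// elements. Negative widths trim the dimension; the result never drops
// below zero.
int64_t paddedExtent(int64_t in, int64_t left, int64_t right) {
  return std::max<int64_t>(in + left + right, 0);
}
```

For the {3, 5} input padded by {-5, 2}, the last dimension becomes 5 - 5 + 2 = 2, which is the full-negative-shift case tested below.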
    
    // Test dynamic pad for various inputs
    TEST_F(DynamicTransformTest, DynamicPadShmoo) {
      // NOLINTBEGIN(bugprone-implicit-widening-of-multiplication-result)
      auto invocations = std::vector<dynamic_pad_invocation>{
          {{3, 5}, {0, 0}, true}, // trivial
    
          {{3, 5}, {2, 1}, false}, // simple pad of both sides
          {{3, 5}, {-1, 1}, false}, // shift by one
          // The following fails with a SIGFPE in innerReductionHeuristic
          // See https://github.com/NVIDIA/Fuser/issues/264
          //{{3, 5}, {-3, -2}, false}, // output is zero-dimensional
    
          // Output has size 1 so is set to broadcast.
          // This was previously "working" by concretizing the size-1 pad to
          // Iteration, even though it should be Broadcast. When set properly to
          // Broadcast, it fails with an error in ConcretizedBroadcastDomains.
          //{{3, 5}, {0, -4}, true},
    
          // Test full negative shifts, so output doesn't overlap input
          {{3, 5}, {-5, 2}, false},
          {{3, 5}, {2, -5}, false}, // full shift the other direction, re-use
    
          // The following reuses the schedule of {3, 5} inputs, and does not set
          // broadcast on the second input dimension.
          {{3, 1}, {1, 1}, false},
    
          // Test zero-dimensional input
          //{{3, 0}, {0, 0}, false}, // SIGFPE (see #264 above)
          {{3, 0}, {1, 1}, true}, // zero-dimensional concretizes differently
          //{{3, 0}, {-1, 1}, false}, // SIGFPE (see #264 above)
      };
      // NOLINTEND(bugprone-implicit-widening-of-multiplication-result)
      reductionDynamicPadAddFusion(invocations);
    }
    
    // Test that a Symbolic root/Broadcast logical is not concretized to
    // Iteration/Iteration
    TEST_F(DynamicTransformTest, FusionDynamicSliceToBroadcast) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(fusion_ptr.get());
      auto tv0 = makeSymbolicTensor(1);
      fusion.addInput(tv0);
      // tv0[:2] introduces symbolic IterDomain
      auto tv1 = slice(
          tv0, {{fusion.zeroVal(), IrBuilder::create<Val>(2L), fusion.oneVal()}});
      // tv1 has Broadcast logical, Iteration root
      auto tv2 = slice(tv1, {{fusion.zeroVal(), fusion.oneVal(), fusion.oneVal()}});
      // tv2 has a Symbolic root related to a Broadcast logical through a Resize op
      fusion.addOutput(tv2);
    
      // At concretization, tv1's logical will be set to Iteration, which will
      // propagate to tv2's root. This test checks that when tv2's root is
      // concretized to Iteration, it does not wind up overwriting the Broadcast
      // logical.
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({5}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0});
      testValidate(&fusion, outputs, {at0}, __LINE__, __FILE__);
    }
    
    // Test that empty input to cat is concretized away
    TEST_F(DynamicTransformTest, FusionDynamicEmptyCat1) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(fusion_ptr.get());
    
      auto tv0 = makeSymbolicTensor(1);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(1);
      fusion.addInput(tv1);
      auto tv2 = makeSymbolicTensor(1);
      fusion.addInput(tv2);
    
      auto tv3 = cat({tv0, tv1, tv2}, 0);
    
      fusion.addOutput(tv3);
    
      // Check correctness
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({5}, options);
      at::Tensor at1 = at::randn({0}, options);
      at::Tensor at2 = at::randn({3}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0, at1, at2});
      testValidate(&fusion, outputs, {at0, at1, at2}, __LINE__, __FILE__);
    }
    
    // Test that empty input to cat is concretized away
    TEST_F(DynamicTransformTest, FusionDynamicEmptyCat2) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(fusion_ptr.get());
    
      auto tv0 = makeSymbolicTensor(1);
      fusion.addInput(tv0);
      auto tv1 = makeSymbolicTensor(1);
      fusion.addInput(tv1);
    
      auto tv2 = cat({tv0, tv1}, 0);
    
      fusion.addOutput(tv2);
    
      // Check correctness
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({5}, options);
      at::Tensor at1 = at::randn({0}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0, at1});
      testValidate(&fusion, outputs, {at0, at1}, __LINE__, __FILE__);
    
      // Check that fusion consists only of tv2 = set(tv0)
      auto fkr = executor_cache.getMostRecentKernelRuntime();
      auto seg_fusion = fkr->fusionSegments();
      auto output_def = seg_fusion->outputs()[0]->definition();
      EXPECT_TRUE(output_def->isA<LoadStoreOp>());
      EXPECT_EQ(output_def->as<LoadStoreOp>()->opType(), LoadStoreOpType::Set);
      EXPECT_EQ(output_def->input(0), seg_fusion->inputs()[0]);
    }
    
    // Repro of https://github.com/NVIDIA/Fuser/issues/418
    TEST_F(DynamicTransformTest, DynamicTransformIssue418) {
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      auto tv0 = makeSymbolicTensor(4);
      fusion->addInput(tv0);
      auto s0 = IrBuilder::create<Val>(DataType::Int);
      fusion->addInput(s0);
    
      auto sh = shape(tv0);
      auto tv1 = reshape(tv0, {sh[0], div(sh[1], s0), s0, sh[2], sh[3]});
      // Reducing along axis 2 in tv1 is equivalent to a partial reduction across
      // axis 1 of tv0.
      auto vm = variance_mean(tv1, {2, 3, 4}, 0, true);
      fusion->addOutput(vm.mean);
      fusion->addOutput(vm.var);
    
      FusionExecutorCache executor_cache(std::move(fusion));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at0 = at::randn({256, 128, 28, 28}, options);
      auto outputs = executor_cache.runFusionWithInputs({at0, 32});
    
      testValidate(executor_cache.fusion(), outputs, {at0, 32}, __LINE__, __FILE__);
    }
    
    TEST_F(DynamicTransformTest, Issue249) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* tv0 = makeSymbolicTensor(4);
      fusion.addInput(tv0);
    
      auto tv1 = add(tv0, tv0);
      auto tv2 = reshape(
          tv1,
          {tv1->axis(0)->extent(),
           tv1->axis(2)->extent(),
           IrBuilder::create<Val>(-1L)});
      auto tv3 = add(tv2, tv2);
      fusion.addOutput(tv3);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at_x = at::randn({2, 3, 4, 5}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({at_x});
    
      testValidate(executor_cache.fusion(), outputs, {at_x}, __LINE__, __FILE__);
    }
    
    // This is just like the test above, but uses an input scalar with value -1
    TEST_F(DynamicTransformTest, Issue249InputNegative1) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion& fusion = *fusion_ptr.get();
      FusionGuard fg(&fusion);
    
      TensorView* tv0 = makeSymbolicTensor(4);
      fusion.addInput(tv0);
    
      auto s0 = IrBuilder::create<Val>(DataType::Int);
      auto s1 = IrBuilder::create<Val>(DataType::Int);
      auto s2 = IrBuilder::create<Val>(DataType::Int);
      fusion.addInput(s0);
      fusion.addInput(s1);
      fusion.addInput(s2);
    
      auto tv1 = add(tv0, tv0);
      auto tv2 = reshape(tv1, {s0, s1, s2});
      auto tv3 = add(tv2, tv2);
      fusion.addOutput(tv3);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor at_x = at::randn({2, 3, 4, 5}, options);
    
      // Dynamic reshape sizes that are not constant at definition must be explicit:
      // no -1 allowed
      EXPECT_THROW(
          executor_cache.runFusionWithInputs({at_x, 2, 4, -1}), std::exception);
    
      // Passing explicit sizes works fine
      auto outputs = executor_cache.runFusionWithInputs({at_x, 2, 4, 15});
    
      testValidate(
          executor_cache.fusion(), outputs, {at_x, 2, 4, 15}, __LINE__, __FILE__);
    }
    
    // Test that we can squeeze Symbolic IterDomains and that we properly detect
    // improper concretizations where we have squeezed a dimension with extent
    // other than 1.
    // See https://github.com/NVIDIA/Fuser/issues/1273
    TEST_F(DynamicTransformTest, SymbolicSqueeze) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      auto s0 = IrBuilder::create<Val>(DataType::Index);
      auto s1 = IrBuilder::create<Val>(DataType::Index);
      fusion->addInput(tv0);
      fusion->addInput(s0);
      fusion->addInput(s1);
    
      auto tv1 = reshape(tv0, {s0, s1});
      auto tv2 = squeeze(
          tv1, std::vector<bool>({false, true})); // Squeeze second dimension
      fusion->addOutput(tv2);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({3, 2}, options);
      KernelArgumentHolder valid_args = {t0, 6, 1};
      // An invalid input has a second dimension that cannot be squeezed
      KernelArgumentHolder invalid_args = {t0, 2, 3};
    
      auto outputs = executor_cache.runFusionWithInputs(valid_args);
    
      testValidate(fusion, outputs, valid_args, __LINE__, __FILE__);
    
      // An informative error message should be given by
      // SqueezeOp::checkConcretization
      EXPECT_THAT(
          [&]() { executor_cache.runFusionWithInputs(invalid_args); },
          ::testing::ThrowsMessage<nvfuser::nvfError>(::testing::HasSubstr(
              " must concretize to IterType::Broadcast but found")));
    }
    
    // See https://github.com/NVIDIA/Fuser/issues/1468
    TEST_F(DynamicTransformTest, SymbolicExpand) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion->addInput(tv0);
    
      auto s0 = IrBuilder::create<Val>(DataType::Index);
      auto s1 = IrBuilder::create<Val>(DataType::Index);
      auto s2 = IrBuilder::create<Val>(DataType::Index);
      auto s3 = IrBuilder::create<Val>(DataType::Index);
      fusion->addInput(s0);
      fusion->addInput(s1);
      fusion->addInput(s2);
      fusion->addInput(s3);
    
      auto tv1 = reshape(tv0, {s0, s1});
      auto tv2 = expand(tv1, {s2, s3});
      auto tv3 = add(tv2, tv2);
    
      fusion->addOutput(tv3);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({3, 2}, options);
      KernelArgumentHolder valid_args = {t0, 6, 1, 6, 5};
      // An invalid input has a second dimension that cannot be expanded
      KernelArgumentHolder invalid_args = {t0, 2, 3, 2, 5};
    
      auto outputs = executor_cache.runFusionWithInputs(valid_args);
    
      testValidate(
          executor_cache.fusion(), outputs, valid_args, __LINE__, __FILE__);
    
      // An informative error message should be given during concretization
      EXPECT_THAT(
          [&]() { executor_cache.runFusionWithInputs(invalid_args); },
          ::testing::ThrowsMessage<nvfuser::nvfError>(
              ::testing::HasSubstr("Mismatch in sizes when concretizing expand.")));
    }
    
    // Test that constant zero extents are not overwritten during concretization
    // with non-constant extents.
    // See https://github.com/NVIDIA/Fuser/issues/1572
    TEST_F(DynamicTransformTest, ConcretizeConstantExtents) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(2);
      fusion->addInput(tv0);
    
      // Explicitly cast Int to Index, so that these extents are not immediate
      // constants
      auto tv1 = reshape(
          tv0,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(4096, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(32, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(128, DataType::Int)),
          });
      auto tv2 = permute(tv1, {1, 2, 0, 3});
      auto tv3 = slice(tv2, {0, 0, 0, 0}, {32, 1, 4096, 128});
      auto tv4 = reshape(
          tv3,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(32, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(4096, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(128, DataType::Int)),
          });
      // Note this slice has zero extent in last dimension. RemoveEmptyPass should
      // recognize this and replace with full()
      auto tv5 = slice(tv4, {0, 0, 0}, {32, 4096, 0});
    
      fusion->addOutput(tv5);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({4096, 12288}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({t0});
    
      testValidate(fusion, outputs, {t0}, __LINE__, __FILE__);
    }
    
    // Test that dynamic reductions that should result in squeezes are handled
    // properly.
    // See https://github.com/NVIDIA/Fuser/issues/1667
    TEST_F(DynamicTransformTest, DynamicSqueezeTrivialReduction) {
      auto fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(3);
      fusion->addInput(tv0);
    
      // Explicitly cast Int to Index, so that these extents are not immediate
      // constants
      auto tv1 = reshape(
          tv0,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
          });
      auto tv2 = sum(tv1, {0, 2, 3, 4});
      fusion->addOutput(tv2);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({2, 2, 9}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({t0});
    
      testValidate(fusion, outputs, {t0}, __LINE__, __FILE__);
    }
    
    // Same as above but for Welford ops
    // See https://github.com/NVIDIA/Fuser/issues/1667
    TEST_F(DynamicTransformTest, DynamicSqueezeTrivialWelford) {
      std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
      Fusion* fusion = fusion_ptr.get();
      FusionGuard fg(fusion);
    
      auto tv0 = makeSymbolicTensor(3);
      fusion->addInput(tv0);
    
      // Explicitly cast Int to Index, so that these extents are not immediate
      // constants
      auto tv1 = reshape(
          tv0,
          {
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(2, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(1, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
              castOp(DataType::Index, IrBuilder::create<Val>(3, DataType::Int)),
          });
      auto res =
          variance_mean(tv1, {0, 2, 3, 4}, /*unbiased=*/true, /*keepdim=*/false);
      fusion->addOutput(res.mean);
      fusion->addOutput(res.var);
    
      FusionExecutorCache executor_cache(std::move(fusion_ptr));
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      at::Tensor t0 = at::randn({2, 2, 9}, options);
    
      auto outputs = executor_cache.runFusionWithInputs({t0});
    
      testValidate(fusion, outputs, {t0}, __LINE__, __FILE__);
    }
    
    TEST_F(DynamicTransformTest, LoopSplit) {
      const int b = 2, s = 3, h = 96, e = 128;
    
      Fusion fusion;
      FusionGuard fg(&fusion);
      TensorView* in = makeContigConcreteTensor({-1, -1, 12288});
      TensorView* out = reshape(
          in,
          {shape(in)[0],
           shape(in)[1],
           IrBuilder::create<Val>(96),
           IrBuilder::create<Val>(128)});
      fusion.addInput(in);
      fusion.addOutput(out);
    
      const int d = 2;
      auto mesh = DeviceMesh::createForNumDevices(d);
      for (auto* tv : {in, out}) {
        tv->setDeviceMesh(mesh);
        tv->split(2, d, /*inner_split=*/false);
        tv->axis(2)->parallelize(ParallelType::DIDx);
        tv->setAllocationDomain(tv->getLoopDomain(), true);
      }
    
      at::Tensor in_tensor = at::randn({b, s, h * e / d}, at::Device(at::kCUDA));
      KernelArgumentHolder args({in_tensor});
      DynamicTransform::concretizeFusion(&fusion, args);
    
      ASSERT_EQ(fusion.outputs().size(), 1);
      auto* concrete_out = fusion.outputs().at(0)->as<TensorView>();
      EXPECT_EQ(getShardedLogicalAxis(concrete_out, ParallelType::DIDx), 2);
    }
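
The outer split in LoopSplit turns the size-12288 axis into an outer factor of d and an inner extent of 12288 / d = 6144, and DIDx then parallelizes the outer factor. The extent arithmetic can be sketched as follows; `ceilDiv` and `outerSplit` are illustrative helpers here, not nvFuser API:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Ceiling division, as used when a split factor does not evenly divide the
// extent being split.
int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// An outer split by `factor` yields an outer axis of extent `factor` and an
// inner axis of extent ceilDiv(extent, factor).
std::pair<int64_t, int64_t> outerSplit(int64_t extent, int64_t factor) {
  return {factor, ceilDiv(extent, factor)};
}
```

With extent 12288 and d = 2, the split produces axes of extent 2 and 6144; the test then expects the sharded logical axis of the concretized output to be axis 2.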
    
    Deprecated Function

    The selfAllocationReplay function has been renamed to selfReplay. Ensure that all remaining references to selfAllocationReplay are updated to selfReplay to avoid confusion and potential bugs.

    // Self replay.
    static TensorDomain* fullSelfReplay(
        const TensorDomain* new_self_root,
        const TensorDomain* self);
    
    // Replays the transforms on `self` from logical to the loop and
    // allocation domains onto `new_self`.
    static void selfReplay(const TensorDomain* self, TensorDomain* new_self);
    
    // Returns the loop position in producer that matches with `consumer_pos` in

    @wujingyue
    Collaborator Author

    !test

    @wujingyue
    Collaborator Author

    cc @jacobhinkle

    I tried to replay logical-to-leaf transforms from the original TV to the concretized TV. It works except for `test_issue1691`. Any idea how I can fix this?

    According to the concretized IR line `T2[ ?S4{i2} ] = view(T1[ iS2{i0} rS3{i1} ])`, a reduction dimension is only found in the concretized TV but not the original TV. Therefore, the new loop domain I computed is missing that reduction dimension and fails the validateDomainEquivalence check.

    I'm not sure if reduction dimensions have to follow a certain order. If not, I could fix this by appending all unmapped reduction dimensions to the end of the loop domain.
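
The proposed fallback (appending unmapped reduction dimensions to the end of the replayed loop domain) can be sketched with strings standing in for IterDomains; this illustrates the idea only and is not nvFuser code:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Append any reduction dimension that exists in the target domain but was
// not produced by the replay to the end of the replayed loop domain, so
// that a domain-equivalence check still accounts for every IterDomain.
std::vector<std::string> appendUnmappedReductions(
    std::vector<std::string> replayed,
    const std::vector<std::string>& target_reductions) {
  for (const auto& r : target_reductions) {
    if (std::find(replayed.begin(), replayed.end(), r) == replayed.end()) {
      replayed.push_back(r);
    }
  }
  return replayed;
}
```

In the test_issue1691 shape above, replaying onto the concretized TV would miss rS3, and this fallback would append it after the replayed iteration IDs.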

    @jacobhinkle
    Collaborator

    A reduction dimension is only found in the concretized TV but not the original TV. Therefore, the new loop domain I computed is missing that reduction dimension and fails the validateDomainEquivalence check.

    I'm not sure if reduction dimensions have to follow a certain order. If not, I could fix this by appending all unmapped reduction dimensions to the end of the loop domain.

    I am not sure I understand the issue. IIUC the purpose of this PR is to replay loop transforms when we do concretization in order to preserve DID scheduling, right? When you concretize a reshape, neither the original nor the replacement should have any reduction domains, will it?

    @wujingyue
    Collaborator Author

    I am not sure I understand the issue. IIUC the purpose of this PR is to replay loop transforms when we do concretizations in order to preserve DID scheduling, right?

    Correct.

    When you concretize a reshape, neither the original nor the replacement should have any reduction domains, will it?

    I thought so until I hit #1691, which I'm sure you know everything about.

    @jacobhinkle
    Collaborator

    I thought so until I hit #1691, which I'm sure you know everything about.

    Ah ok, actually I had forgotten about that entirely. In a case like this we are actually replacing the ViewOp output with its input, so we're not creating a new TensorView, are we? In that case the loop domain would still be intact. However, if that TV was also replaced during concretization then it might no longer have a non-trivial loop domain... I assume this is the case you are worried about.

    @wujingyue
    Collaborator Author

    wujingyue commented Mar 5, 2025

    Here's a case I want to support:

    Before concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T2[ ?S4{i2} ] = view(T1[ iS2{i0} rS3{i1} ])
    T3[ ?S4{i2} ] = -T2[ ?S4{i2} ]
    

    with all TVs' first dimension being outer-split by d, which is parallelized on DIDx.

    After concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T3[ iS4{i0} ] = -T1[ iS2{i0} rS3{i1} ]
    

    with all TVs' first dimension (i.e. i0) being split the same way as before.

    Therefore, this PR tries to replay loop transforms in addition.

    In a case like this we are actually replacing the ViewOp output with its input, so we're not creating a new TensorView are we?

    That's right -- no new TensorViews are created.

    In fact, I can fix this error by

    diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
    index a016f0af..3c9a3919 100644
    --- a/csrc/dynamic_transform.cpp
    +++ b/csrc/dynamic_transform.cpp
    @@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
         TensorView* incomplete_out_tv,
         const AnalyzeViewResult& view_analysis) {
       TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
    +  if (concrete_reshape_out_tv == inp_tv) {
    +    return inp_tv;
    +  }
     
       // Extent expressions often change when concretizing a reshape. Here we
       // replace these in all downstream expressions so that the Fusion looks just

    Why wasn't #1691 fixed this way? Can `concrete_reshape_out_tv` be different from `inp_tv` but still have more reduction dimensions than `incomplete_out_tv`? (It's currently fixed by removing reduction dimensions from `concrete_reshape_out_tv` before registering concretization.)

    @jacobhinkle
    Collaborator

    jacobhinkle commented Mar 7, 2025

    Here's a case I want to support:

    Before concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T2[ ?S4{i2} ] = view(T1[ iS2{i0} rS3{i1} ])
    T3[ ?S4{i2} ] = -T2[ ?S4{i2} ]
    

    with all TVs' first dimension being outer-split by d, which is parallelized on DIDx.

    After concretization:

    T1[ iS2{i0} rS3{i1} ] = sum(T0[ iS0{i0} iS1{i1} ])
    T3[ iS4{i0} ] = -T1[ iS2{i0} rS3{i1} ]
    

    with all TVs' first dimension (i.e. i0) being split the same way as before.

    We should handle this particular case in the reshape op. We know that when the input logical domain is 1D and so is the output, the reshape does nothing. We can either convert it to a set or just return the input. In fact, we know the transform needed whenever the output is 1D, regardless of the input dims (as long as they have non-Symbolic IterTypes). Anyway, I'll assume this is 2D for this PR's sake.

    Therefore, this PR tries to replay loop transforms in addition.

    In a case like this we are actually replacing the ViewOp output with its input, so we're not creating a new TensorView, are we?

    That's right -- no new TensorViews are created.

    In fact, I can fix this error by

    diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
    index a016f0af..3c9a3919 100644
    --- a/csrc/dynamic_transform.cpp
    +++ b/csrc/dynamic_transform.cpp
    @@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
         TensorView* incomplete_out_tv,
         const AnalyzeViewResult& view_analysis) {
       TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
    +  if (concrete_reshape_out_tv == inp_tv) {
    +    return inp_tv;
    +  }
     
       // Extent expressions often change when concretizing a reshape. Here we
       // replace these in all downstream expressions so that the Fusion looks just

    I am not sure why you need to return early in this function. What is the error you mention?

    Why wasn't #1691 fixed this way? Can concrete_reshape_out_tv be different from inp_tv but still have more reduction dimensions than incomplete_out_tv? (It's currently fixed by removing reduction dimensions from concrete_reshape_out_tv before registering concretization.)

    The code you linked is not removing that dimension from the replacement TV. This section is just for registering concretization of extents, so that we remove the i2 and replace it with i0 in all downstream TVs (see the comment above there). I don't think it is true that concrete_reshape_out_tv will have any reduction dims unless it is the same as inp_tv.

    @wujingyue
    Collaborator Author

    This section is just for registering concretization of extents, so that we remove the i2 and replace it with i0 in all downstream TVs

    You are right. The following

    diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
    index a016f0af..3c9a3919 100644
    --- a/csrc/dynamic_transform.cpp
    +++ b/csrc/dynamic_transform.cpp
    @@ -795,6 +795,9 @@ TensorView* DynamicTransformConcretizer::concretizeNonEmptyReshape(
         TensorView* incomplete_out_tv,
         const AnalyzeViewResult& view_analysis) {
       TensorView* concrete_reshape_out_tv = reshape(inp_tv, view_analysis);
    +  if (concrete_reshape_out_tv == inp_tv) {
    +    return inp_tv;
    +  }
     
       // Extent expressions often change when concretizing a reshape. Here we
       // replace these in all downstream expressions so that the Fusion looks just

    is a wrong fix, because i2 wouldn't be replaced in downstream ops.

    I'm experimenting with putting the extra reduction dimensions in concrete_reshape_out_tv at the end of its loop domain. Will come back once I have an update...

    @wujingyue
    Collaborator Author

    !test

    @wujingyue
    Collaborator Author

    !test

    @wujingyue
    Collaborator Author

    !test --diff

    @wujingyue wujingyue requested a review from jjsjann123 March 13, 2025 02:53
    @wujingyue wujingyue marked this pull request as ready for review March 13, 2025 02:53

    @jjsjann123 jjsjann123 left a comment


    stamping since code change looks good.

    TensorDomain* new_self);
    // Self replay the transformation on `self` from logical to loop and
    // allocation onto `new_self`.
    static void selfReplay(const TensorDomain* self, TensorDomain* new_self);
    Collaborator

    I'm wondering whether we would want to replay both all the time?

    Like the existing use of fixing the allocation domain for aliases (and graph mutation, where we preserve the output memory layout). We don't necessarily need to preserve the loop transform there.
    And what does replaying the loop transform mean for those use cases?

    Collaborator

    Granted, the current uses in those cases are prior to scheduling, so there isn't any transform from root to loop in the first place. Realized that my question isn't really relevant.

    Collaborator Author

    Granted, the current uses in those cases are prior to scheduling

    That's right. I think my answer in the other comment also applies here. I should probably try to fix #3479 again after I'm done with DID loop split.

    axis_map[id] = new_id;
    i++;
    }
    for (auto&& [id, new_id] : zip(self_logical, new_self_logical)) {
    Collaborator

    👏

    tv->setDeviceMesh(mesh);
    tv->split(2, d, /*inner_split=*/false);
    tv->axis(2)->parallelize(ParallelType::DIDx);
    tv->setAllocationDomain(tv->getLoopDomain(), true);
    Collaborator

    Out of curiosity, is the reason we needed a replay on loop that, somewhere else, we expect to see DIDx on the loop domain as well as on the allocation domain?

    My naive mental model sees DIDx as specified on the allocation domain; with the existing replay, it feels almost like we have some unnecessary redundancy, i.e. loop and allocation need to have some matching parallel type.

    Collaborator Author

    The current assumption is that, before (intra-GPU) scheduling, loop has to post-dominate allocation. In other words, loop must be the "leaf" domain before scheduling. My #3621 attempted to lift this assumption but failed (cf. #3706).

    It doesn't mean that loop and allocation have to be the same order. For example, loop can be a permutation of allocation, or a split of allocation.

    Also, it doesn't mean that loop and allocation need to have the same parallel types. Loop may be more parallelized than allocation. For example, this example for overlapping parallelizes the column dimension of operand A on Stream only in loop but not in allocation. In practice, it can be implemented as a Set from a non-stream-parallelized IterDomain to a stream-parallelized IterDomain.

    @wujingyue
    Collaborator Author

    @jacobhinkle I'm surprised by the number of codegen diffs. The numbers of kernels don't match -- WARNING: Number of kernels in 6a56066b (3126) does not match number of kernels in 0e095bff (3588), which is suspicious. In many cases, the code being compared doesn't seem to describe the same math.

    @wujingyue
    Collaborator Author

    !test --diff

    @wujingyue
    Collaborator Author

    !test --diff

    @wujingyue
    Collaborator Author

    I ran the codegen diff locally for bin/test_nvfuser. The result makes more sense: the diffs don't change functionality and in fact make the code better. For example, the first diff and the second diff avoid unnecessary alloc_stride accesses. This is because of the check added in this PR.

    NVFuserTest.FusionMagicSchedulerBatchNormalization_CUDA 0 --- b8197081
    
    +++ 28aad7c4
    
    @@ -335,11 +335,11 @@
    
       Array<float, 1LL, 1> T34;
       T34[0LL]
         = T9[0LL]
         + T8[0LL];
       if (b35) {
    -    T10[(T10.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T10[((nvfuser_index_t)blockIdx.x)]
            = T34[0LL];
       }
       Array<float, 1LL, 1> T33;
       T33[0LL] = 0LL;
       T33[0LL]
    @@ -359,11 +359,11 @@
    
       Array<float, 1LL, 1> T35;
       T35[0LL]
         = T13[0LL]
         + T12[0LL];
       if (b35) {
    -    T14[(T14.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T14[((nvfuser_index_t)blockIdx.x)]
            = T35[0LL];
       }
       Array<float, 1LL, 1> T17;
       T17[0LL]
         = T6[0LL]
    NVFuserTest.FusionMagicSchedulerInstanceNormalization_CUDA 1 --- b8197081
    
    +++ 28aad7c4
    
    @@ -11,124 +11,124 @@
    
       Array<nvfuser_index_t, 2, 1> a2;
       a2 = s1.logical_size;
       nvfuser_index_t i3;
       i3 = a2[0LL];
       nvfuser_index_t i4;
    -  i4 = ceilDiv((ceilDiv(i3, ((nvfuser_index_t)blockDim.y))), 8LL);
    +  i4 = ceilDiv((ceilDiv(i3, ((nvfuser_index_t)blockDim.y))), 8);
       nvfuser_index_t i5;
       i5 = ((nvfuser_index_t)blockDim.x) * ((nvfuser_index_t)blockIdx.x);
       nvfuser_index_t i6;
       i6 = ((T3.logical_size[0LL] * ((nvfuser_index_t)threadIdx.y)) + ((nvfuser_index_t)threadIdx.x)) + i5;
       nvfuser_index_t i7;
    -  i7 = ((nvfuser_index_t)blockDim.y) * 8LL;
    +  i7 = ((nvfuser_index_t)blockDim.y) * 8;
       nvfuser_index_t i8;
       i8 = i7 * T3.logical_size[0LL];
       nvfuser_index_t i9;
       i9 = ((nvfuser_index_t)blockDim.y) * T3.logical_size[0LL];
       nvfuser_index_t i10;
       i10 = ((nvfuser_index_t)threadIdx.x) + i5;
       bool b11;
       b11 = i10 < T3.logical_size[0LL];
       nvfuser_index_t i12;
    -  i12 = (((nvfuser_index_t)blockDim.y) * 7LL) + ((nvfuser_index_t)threadIdx.y);
    +  i12 = (((nvfuser_index_t)blockDim.y) * 7) + ((nvfuser_index_t)threadIdx.y);
       nvfuser_index_t i13;
       i13 = (-i3) + ((nvfuser_index_t)threadIdx.y);
       double d14;
       d14 = (double)(i3);
       double d15;
       d15 = 1.00000000000000000e+00 * d14;
       double d16;
       d16 = reciprocal(d15);
    -  Array<float, 1LL, 1> T39;
    -  T39[0LL] = 0LL;
    +  Array<float, 1, 1> T39;
    +  T39[0] = 0;
       if (b11) {
    -    T39[0LL]
    +    T39[0]
            = T3[((T3.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((((nvfuser_index_t)blockDim.x) * T3.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
       }
    -  Array<float, 1LL, 1> T9;
    -  T9[0LL]
    -    = T39[0LL]
    +  Array<float, 1, 1> T9;
    +  T9[0]
    +    = T39[0]
         * (float) d0;
    -  Array<float, 1LL, 1> T10;
    -  T10[0LL]
    -     = T9[0LL];
    -  Array<float, 1LL, 1> T42;
    -  T42[0LL] = 0.000000000e+00f;
    +  Array<float, 1, 1> T10;
    +  T10[0]
    +     = T9[0];
    +  Array<float, 1, 1> T42;
    +  T42[0] = 0.000000000e+00f;
       #pragma unroll 1
    -  for(nvfuser_index_t i17 = 0LL; i17 < i4; ++i17) {
    +  for(nvfuser_index_t i17 = 0; i17 < i4; ++i17) {
         nvfuser_index_t i18;
         i18 = i6 + (i8 * i17);
         nvfuser_index_t i19;
         i19 = i7 * i17;
         nvfuser_index_t i20;
         i20 = i13 + i19;
         if ((b11 && ((i12 + i19) < i3))) {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    -        T40[i21] = 0LL;
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
    +        T40[i21] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
             T40[i21]
                = T8[(i18 + (i9 * (i21 + nvfuser_zero)))];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i22 = 0LL; i22 < 8LL; ++i22) {
    -        Array<float, 1LL, 1> T11;
    -        T11[0LL]
    -          = T10[0LL]
    +      for(nvfuser_index_t i22 = 0; i22 < 8; ++i22) {
    +        Array<float, 1, 1> T11;
    +        T11[0]
    +          = T10[0]
               + T40[i22];
    -        T42[0LL]
    -          = T42[0LL]
    -          + T11[0LL];
    +        T42[0]
    +          = T42[0]
    +          + T11[0];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         } else {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    -        T40[i21] = 0LL;
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
    +        T40[i21] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i21 = 0LL; i21 < 8LL; ++i21) {
    +      for(nvfuser_index_t i21 = 0; i21 < 8; ++i21) {
             nvfuser_index_t i23;
             i23 = i21 + nvfuser_zero;
             if ((b11 && (i20 < (-(((nvfuser_index_t)blockDim.y) * i23))))) {
               T40[i21]
                  = T8[(i18 + (i9 * i23))];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i22 = 0LL; i22 < 8LL; ++i22) {
    -        Array<float, 1LL, 1> T11;
    -        T11[0LL]
    -          = T10[0LL]
    +      for(nvfuser_index_t i22 = 0; i22 < 8; ++i22) {
    +        Array<float, 1, 1> T11;
    +        T11[0]
    +          = T10[0]
               + T40[i22];
             if ((b11 && (i20 < (-(((nvfuser_index_t)blockDim.y) * (i22 + nvfuser_zero)))))) {
    -          T42[0LL]
    -            = T42[0LL]
    -            + T11[0LL];
    +          T42[0]
    +            = T42[0]
    +            + T11[0];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         }
       }
    -  Array<float, 1LL, 1> T12;
    -  T12[0LL] = 0.000000000e+00f;
    -  blockReduce<false, true, false, true>(T12[0LL], T42[0LL], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    -  Array<float, 1LL, 1> T13;
    -  T13[0LL]
    -    = T12[0LL]
    +  Array<float, 1, 1> T12;
    +  T12[0] = 0.000000000e+00f;
    +  blockReduce<false, true, false, true>(T12[0], T42[0], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    +  Array<float, 1, 1> T13;
    +  T13[0]
    +    = T12[0]
         * (float) d16;
    -  Array<float, 1LL, 1> T41;
    -  T41[0LL]
    -     = T13[0LL];
    -  if (((((nvfuser_index_t)threadIdx.y) == 0LL) && b11)) {
    +  Array<float, 1, 1> T41;
    +  T41[0]
    +     = T13[0];
    +  if (((((nvfuser_index_t)threadIdx.y) == 0) && b11)) {
         T32[i10]
    -       = T41[0LL];
    +       = T41[0];
       }
     }
    NVFuserTest.FusionMagicSchedulerInstanceNormalization_CUDA 2 --- b8197081
    
    +++ 28aad7c4
    
    @@ -23,140 +23,140 @@
    
       Array<nvfuser_index_t, 2, 1> a10;
       a10 = s9.logical_size;
       nvfuser_index_t i11;
       i11 = a10[0LL];
       nvfuser_index_t i12;
    -  i12 = ceilDiv((ceilDiv(i11, ((nvfuser_index_t)blockDim.y))), 8LL);
    +  i12 = ceilDiv((ceilDiv(i11, ((nvfuser_index_t)blockDim.y))), 8);
       nvfuser_index_t i13;
       i13 = ((nvfuser_index_t)blockDim.x) * ((nvfuser_index_t)blockIdx.x);
       nvfuser_index_t i14;
       i14 = ((T4.logical_size[0LL] * ((nvfuser_index_t)threadIdx.y)) + ((nvfuser_index_t)threadIdx.x)) + i13;
       nvfuser_index_t i15;
    -  i15 = ((nvfuser_index_t)blockDim.y) * 8LL;
    +  i15 = ((nvfuser_index_t)blockDim.y) * 8;
       nvfuser_index_t i16;
       i16 = i15 * T4.logical_size[0LL];
       nvfuser_index_t i17;
       i17 = ((nvfuser_index_t)blockDim.y) * T4.logical_size[0LL];
       nvfuser_index_t i18;
       i18 = ((nvfuser_index_t)threadIdx.x) + i13;
       bool b19;
       b19 = i18 < T4.logical_size[0LL];
       nvfuser_index_t i20;
    -  i20 = (((nvfuser_index_t)blockDim.y) * 7LL) + ((nvfuser_index_t)threadIdx.y);
    +  i20 = (((nvfuser_index_t)blockDim.y) * 7) + ((nvfuser_index_t)threadIdx.y);
       nvfuser_index_t i21;
       i21 = (-i11) + ((nvfuser_index_t)threadIdx.y);
       double d22;
       d22 = (double)(i11);
       double d23;
       d23 = 1.00000000000000000e+00 * d22;
       double d24;
       d24 = reciprocal(d23);
    -  Array<float, 1LL, 1> T39;
    -  T39[0LL] = 0LL;
    +  Array<float, 1, 1> T39;
    +  T39[0] = 0;
       if (b19) {
    -    T39[0LL]
    +    T39[0]
            = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((((nvfuser_index_t)blockDim.x) * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
       }
    -  Array<float, 1LL, 1> T16;
    -  T16[0LL]
    -    = T39[0LL]
    +  Array<float, 1, 1> T16;
    +  T16[0]
    +    = T39[0]
         * (float) d8;
    -  Array<float, 1LL, 1> T17;
    -  T17[0LL]
    -     = T16[0LL];
    -  Array<float, 1LL, 1> T42;
    -  T42[0LL] = 0.000000000e+00f;
    +  Array<float, 1, 1> T17;
    +  T17[0]
    +     = T16[0];
    +  Array<float, 1, 1> T42;
    +  T42[0] = 0.000000000e+00f;
       #pragma unroll 1
    -  for(nvfuser_index_t i25 = 0LL; i25 < i12; ++i25) {
    +  for(nvfuser_index_t i25 = 0; i25 < i12; ++i25) {
         nvfuser_index_t i26;
         i26 = i14 + (i16 * i25);
         nvfuser_index_t i27;
         i27 = i15 * i25;
         nvfuser_index_t i28;
         i28 = i21 + i27;
         if ((b19 && ((i20 + i27) < i11))) {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    -        T40[i29] = 0LL;
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
    +        T40[i29] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
             T40[i29]
                = T6[(i26 + (i17 * (i29 + nvfuser_zero)))];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i30 = 0LL; i30 < 8LL; ++i30) {
    -        Array<float, 1LL, 1> T14;
    -        T14[0LL]
    +      for(nvfuser_index_t i30 = 0; i30 < 8; ++i30) {
    +        Array<float, 1, 1> T14;
    +        T14[0]
               = T40[i30]
               * (float) d7;
    -        Array<float, 1LL, 1> T15;
    -        T15[0LL]
    -          = T14[0LL]
    +        Array<float, 1, 1> T15;
    +        T15[0]
    +          = T14[0]
               * (float) 1.00000001490116119e-01;
    -        Array<float, 1LL, 1> T18;
    -        T18[0LL]
    -          = T17[0LL]
    -          + T15[0LL];
    -        T42[0LL]
    -          = T42[0LL]
    -          + T18[0LL];
    +        Array<float, 1, 1> T18;
    +        T18[0]
    +          = T17[0]
    +          + T15[0];
    +        T42[0]
    +          = T42[0]
    +          + T18[0];
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         } else {
    -      Array<float, 8LL, 1> T40;
    +      Array<float, 8, 1> T40;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    -        T40[i29] = 0LL;
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
    +        T40[i29] = 0;
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i29 = 0LL; i29 < 8LL; ++i29) {
    +      for(nvfuser_index_t i29 = 0; i29 < 8; ++i29) {
             nvfuser_index_t i31;
             i31 = i29 + nvfuser_zero;
             if ((b19 && (i28 < (-(((nvfuser_index_t)blockDim.y) * i31))))) {
               T40[i29]
                  = T6[(i26 + (i17 * i31))];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
           #pragma unroll
    -      for(nvfuser_index_t i30 = 0LL; i30 < 8LL; ++i30) {
    -        Array<float, 1LL, 1> T14;
    -        T14[0LL]
    +      for(nvfuser_index_t i30 = 0; i30 < 8; ++i30) {
    +        Array<float, 1, 1> T14;
    +        T14[0]
               = T40[i30]
               * (float) d7;
    -        Array<float, 1LL, 1> T15;
    -        T15[0LL]
    -          = T14[0LL]
    +        Array<float, 1, 1> T15;
    +        T15[0]
    +          = T14[0]
               * (float) 1.00000001490116119e-01;
    -        Array<float, 1LL, 1> T18;
    -        T18[0LL]
    -          = T17[0LL]
    -          + T15[0LL];
    +        Array<float, 1, 1> T18;
    +        T18[0]
    +          = T17[0]
    +          + T15[0];
             if ((b19 && (i28 < (-(((nvfuser_index_t)blockDim.y) * (i30 + nvfuser_zero)))))) {
    -          T42[0LL]
    -            = T42[0LL]
    -            + T18[0LL];
    +          T42[0]
    +            = T42[0]
    +            + T18[0];
             }
           }
           NVFUSER_UPDATE_MAGIC_ZERO;
         }
       }
    -  Array<float, 1LL, 1> T19;
    -  T19[0LL] = 0.000000000e+00f;
    -  blockReduce<false, true, false, true>(T19[0LL], T42[0LL], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    -  Array<float, 1LL, 1> T20;
    -  T20[0LL]
    -    = T19[0LL]
    +  Array<float, 1, 1> T19;
    +  T19[0] = 0.000000000e+00f;
    +  blockReduce<false, true, false, true>(T19[0], T42[0], [](float &a, float b) { a = a + b; }, static_cast<float*>(shared_mem), true, true, float(0.000000000e+00f), DefaultBlockDim());
    +  Array<float, 1, 1> T20;
    +  T20[0]
    +    = T19[0]
         * (float) d24;
    -  Array<float, 1LL, 1> T41;
    -  T41[0LL]
    -     = T20[0LL];
    -  if (((((nvfuser_index_t)threadIdx.y) == 0LL) && b19)) {
    +  Array<float, 1, 1> T41;
    +  T41[0]
    +     = T20[0];
    +  if (((((nvfuser_index_t)threadIdx.y) == 0) && b19)) {
         T34[i18]
    -       = T41[0LL];
    +       = T41[0];
       }
     }
    NVFuserTest.FusionBNRepro_CUDA 0 --- b8197081
    
    +++ 28aad7c4
    
    @@ -292,11 +292,11 @@
    
       Array<float, 1LL, 1> T34;
       T34[0LL]
         = T9[0LL]
         + T8[0LL];
       if (b31) {
    -    T10[(T10.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T10[((nvfuser_index_t)blockIdx.x)]
            = T34[0LL];
       }
       Array<float, 1LL, 1> T33;
       T33[0LL] = 0LL;
       T33[0LL]
    @@ -316,11 +316,11 @@
    
       Array<float, 1LL, 1> T35;
       T35[0LL]
         = T13[0LL]
         + T12[0LL];
       if (b31) {
    -    T14[(T14.alloc_stride[0LL] * ((nvfuser_index_t)blockIdx.x))]
    +    T14[((nvfuser_index_t)blockIdx.x)]
            = T35[0LL];
       }
       Array<float, 1LL, 1> T17;
       T17[0LL]
         = T6[0LL]
    AliasTest.ReuseBuffer_AliasAcrossSegments 1 --- b8197081
    
    +++ 28aad7c4
    
    @@ -1,36 +1,36 @@
    
     __global__ void nvfuser_N(Tensor<float, 2, 2> T2, Tensor<float, 1, 1> T5, Tensor<float, 2, 2> T8, Tensor<float, 2, 2> T9) {
       nvfuser_index_t i0;
    -  i0 = ((nvfuser_index_t)threadIdx.x) + (128LL * ((nvfuser_index_t)blockIdx.x));
    +  i0 = ((nvfuser_index_t)threadIdx.x) + (128 * ((nvfuser_index_t)blockIdx.x));
       nvfuser_index_t i1;
       i1 = i0 % T2.logical_size[1LL];
       nvfuser_index_t i2;
       i2 = T2.logical_size[0LL] * T2.logical_size[1LL];
    -  if (((((nvfuser_index_t)blockIdx.x) < (ceilDiv(i2, 128LL))) && (i0 < i2))) {
    -    Array<float, 1LL, 1> T11;
    -    T11[0LL] = 0LL;
    -    T11[0LL]
    +  if (((((nvfuser_index_t)blockIdx.x) < (ceilDiv(i2, 128))) && (i0 < i2))) {
    +    Array<float, 1, 1> T11;
    +    T11[0] = 0;
    +    T11[0]
            = T2[((T2.alloc_stride[0LL] * (i0 / T2.logical_size[1LL])) + (T2.alloc_stride[1LL] * i1))];
    -    Array<float, 1LL, 1> T12;
    -    T12[0LL] = 0LL;
    -    T12[0LL]
    +    Array<float, 1, 1> T12;
    +    T12[0] = 0;
    +    T12[0]
            = T5[i1];
    -    Array<float, 1LL, 1> T6;
    -    T6[0LL]
    -       = T12[0LL];
    -    Array<float, 1LL, 1> T7;
    -    T7[0LL]
    -      = T6[0LL]
    -      + T11[0LL];
    -    Array<float, 1LL, 1> T13;
    -    T13[0LL]
    -      = T7[0LL]
    +    Array<float, 1, 1> T6;
    +    T6[0]
    +       = T12[0];
    +    Array<float, 1, 1> T7;
    +    T7[0]
    +      = T6[0]
    +      + T11[0];
    +    Array<float, 1, 1> T13;
    +    T13[0]
    +      = T7[0]
           + (float) 1.00000000000000000e+00;
         T8[i0]
    -       = T13[0LL];
    -    Array<float, 1LL, 1> T14;
    -    T14[0LL]
    -       = T7[0LL];
    +       = T13[0];
    +    Array<float, 1, 1> T14;
    +    T14[0]
    +       = T7[0];
         T9[i0]
    -       = T14[0LL];
    +       = T14[0];
       }
     }
    5 kernel differences from 4 tests found
    26 new tests found
    25 removed tests found
    

    @wujingyue wujingyue merged commit 342336b into main Mar 18, 2025
    54 of 60 checks passed
    @wujingyue wujingyue deleted the wjy/replay branch March 18, 2025 22:23