add HirAliasSelect #4301

Merged
samnordmann merged 5 commits into main from host_ir/add_hir_alias_select
Apr 27, 2025

Conversation

@samnordmann
Collaborator

What

Add a SelectOp-like HIR node to express indexing into an ATen tensor.

Why

It is used in the context of stream lowering; see #4147 and
especially the discussion in #4147 (comment).

@github-actions
github-actions bot commented Apr 23, 2025

Review updated until commit 82bc595

Description

  • Added HirAliasSelect for tensor indexing in HIR.

  • Implemented handling in HostIrEvaluator.

  • Added tests for HirAliasSelect.

  • Updated dispatch and header files.


Changes walkthrough 📝

Relevant files

Enhancement

  • executor.cpp (csrc/host_ir/executor.cpp): Implement HirAliasSelect handling
    • Added handle method for HirAliasSelect.
    +9/-0

  • host_ir.cpp (csrc/host_ir/host_ir.cpp): Define HirAliasSelect class
    • Added HirAliasSelect class definition.
    • Implemented constructor and methods.
    +45/-0

  • executor.h (csrc/host_ir/executor.h): Declare HirAliasSelect handling
    • Added handle method declaration for HirAliasSelect.
    +1/-0

  • host_ir.h (csrc/host_ir/host_ir.h): Declare HirAliasSelect class
    • Added HirAliasSelect class declaration.
    +43/-0

Tests

  • test_host_irs.cpp (tests/cpp/test_host_irs.cpp): Add HirAliasSelect test
    • Added test case for HirAliasSelect.
    +38/-0

Configuration changes

  • dispatch.h (csrc/dispatch.h): Update dispatch list
    • Added HirAliasSelect to dispatch list.
    +2/-1

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Axis Validation

The axis validation in the constructor of HirAliasSelect checks whether the axis is within the bounds of the input tensor's dimensions. However, it uses in()->getLogicalDomain().at(axis) in the error message, which might be incorrect if the axis is out of bounds. It should use axis directly in the error message for clarity.

```
static_cast<int64_t>(in->getLogicalDomain().size()) > axis,
"Select axis ",
axis,
" is out of bounds for tensor ",
in->toString(),
" with ",
in->getLogicalDomain().size(),
```

Output Handling

The comment mentions that "out" is not added as an output because the current op doesn't "define" it, but rather sets its allocation. This might lead to confusion. It would be helpful to clarify why "out" is treated differently from other outputs.

```
// but rather sets its allocation. Since "out" will be used in another
// producing expression, this avoids unnecessary cyclic dependencies. This
// ressembles how kir::Allocate treats its allocated TensorView.
```

Test Coverage

The test SelectingTensor covers a basic case. It would be beneficial to add more test cases to cover edge cases, such as selecting the first or last dimension, and using different data types.

```
using HirAliasSelectHostIrTest = NVFuserTest;

TEST_F(HirAliasSelectHostIrTest, SelectingTensor) {
  constexpr int64_t ndims = 2;
  constexpr int64_t dim = 1;
  constexpr int64_t index = 3;
  const std::vector<int64_t> input_sizes = {32, 32};

  ASSERT_LT(dim, ndims);
  ASSERT_EQ(input_sizes.size(), ndims);
  ASSERT_LT(index, input_sizes.at(dim));

  auto hic = std::make_unique<HostIrContainer>();
  FusionGuard fg(hic.get());

  TensorView* in = makeContigTensor(ndims);
  TensorView* out = makeContigTensor(ndims - 1);
  auto* index_val = IrBuilder::create<Val>(index, DataType::Index);
  auto* select_op = IrBuilder::create<HirAliasSelect>(in, out, dim, index_val);

  hic->addInput(in);
  hic->addOutput(out);
  hic->pushBackTopLevelExprs(select_op);

  HostIrEvaluator hie(std::move(hic));

  auto options = at::TensorOptions().device(at::kCUDA, 0).dtype(torch::kFloat);
  auto in_aten = at::randn(input_sizes, options);
  std::unordered_map<Val*, PolymorphicValue> concrete_input_buffers = {
      {in, in_aten}};

  auto out_aten = hie.runWithInput(concrete_input_buffers)[0].as<at::Tensor>();

  // validate
  auto ref_out = in_aten.select(dim, index);
  EXPECT_TRUE(ref_out.equal(out_aten));
}
```

@samnordmann
Collaborator Author

!test

@samnordmann samnordmann requested a review from wujingyue April 23, 2025 22:20

@samnordmann
Collaborator Author

!test

@samnordmann
Collaborator Author

!test

@samnordmann
Collaborator Author

!test

@samnordmann
Collaborator Author

!test

@samnordmann samnordmann merged commit bfc7ba8 into main Apr 27, 2025
53 checks passed
@samnordmann samnordmann deleted the host_ir/add_hir_alias_select branch April 27, 2025 14:21
samnordmann added a commit that referenced this pull request Apr 28, 2025

This PR belongs to a series of stacked PRs:
1. #4144
2. #4145
3. #4146
4. #4301
5. **=> You are here:** #4147
    
# What

Implement a proper lowering for handling ParallelType::Stream. This PR
has the following restrictions:
- Single-device fusion
- No split/merge of the Stream axis

We add to the HIR lowering a new pass that reads the HIR container's
top-level expressions, reads the consumer's stream parallelization, and
creates for-loops with stream management and synchronization to express
the stream parallelism. Basic logic for merging for-loops is
implemented.

Let me explain through some examples that can be found in the PR. We
suggest running those examples as follows:
    ```
    NVFUSER_DUMP=host_ir test_host_ir --gtest_filter=*
    ```
    
    ## Single expr and for-loop
    Look at `MultiDeviceExecutorLowerStreamTest.SingleSetOp` simple
    scenario:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      fusion->addInput(tv0);
      fusion->addOutput(tv1);
      tv1->axis(0)->parallelize(ParallelType::Stream);
    ```
    the dumped generated Host Ir program is:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T1_g_float[iStreamIdx2{i0}, iS3{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T2_l_float[iS4{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = Set( T2_l_float[iS4{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
We can see that the expr, here the "Set", gets embedded into a for-loop.
Let us analyze further:
- Outside the for-loop, we allocate the global output buffer.
- The start of the for-loop body assigns a new stream and syncs that
stream with the user stream.
- Then, we "select" (i.e., slice) into the input and output through
`HirAliasSelect`.
- The "Set" operation is executed on the "selected" I/O. Note that the
output is an alias to the output's slice.
- At the end of the for-loop, we reset to the user's stream (the stream
that was current before entering the program) and sync the user's
stream with the running stream.
    
    ## Merging for loops
    
    To avoid unnecessary synchronization across streams, it is important to
    be able to fuse the stream for-loop. This is exercised by the test
    `MultiDeviceExecutorLowerStreamTest.TwoSetOps`:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      fusion->addInput(tv0);
      fusion->addOutput(tv2);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv2->axis(0)->parallelize(ParallelType::Stream);
    ```
    dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T2_g_float[iStreamIdx4{i0}, iS5{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      T2_g_float[iStreamIdx4{i0}, iS5{i2}] = ALLOCATE(buffer=T2_g_float[iStreamIdx4{i0}, iS5{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T3_l_float[iS6{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = Set( T3_l_float[iS6{i2}], cache_op=Streaming )
        T5_l_float[iS8{i2}]
           = HirAliasSelect( T2_g_float[iStreamIdx4{i0}, iS5{i2}], axis = iStreamIdx4{i0}, index = StreamIdx )
        T5_l_float[iS8{i2}]
           = Set( T4_l_float[iS7{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
We observe that the for-loops are indeed merged.
**Possible future optimization:** the allocation of the intermediate
buffer could be only of length `numberOfStreams`.
    
## Separating for loops
    
    We also need to be able to separate and create new for loops if
    necessary, as exercised in `ThreeSetOpsWithDisjointsForLoops`, which
    considers the Fusion:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      TensorView* tv3 = set(tv2);
      fusion->addInput(tv0);
      fusion->addOutput(tv3);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv3->axis(0)->parallelize(ParallelType::Stream);
    ```
Here, tv2 is not stream-parallelized, so it should not be produced in a
stream for-loop. Dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T3_g_float[iStreamIdx6{i0}, iS7{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T4_l_float[iS8{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = Set( T4_l_float[iS8{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
      T2_g_float[iS4{i0}, iS5{i2}]
         = Set( T1_g_float[iStreamIdx2{i0}, iS3{i2}], cache_op=Streaming )
      T3_g_float[iStreamIdx6{i0}, iS7{i2}] = ALLOCATE(buffer=T3_g_float[iStreamIdx6{i0}, iS7{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx6{i0}:
        GetCurrentStream into Stream 2
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 2
        T6_l_float[iS10{i2}]
           = HirAliasSelect( T2_g_float[iS4{i0}, iS5{i2}], axis = iS4{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = HirAliasSelect( T3_g_float[iStreamIdx6{i0}, iS7{i2}], axis = iStreamIdx6{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = Set( T6_l_float[iS10{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 2
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    
    ---------
    
    Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
    Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
    Co-authored-by: Ryan Spring <rspring@nvidia.com>
    Co-authored-by: Liqiang Lu <116412316+liqiangxl@users.noreply.github.com>
    Co-authored-by: jjsjann123 <jiej@nvidia.com>
    Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
    Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
    Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
    Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
    Co-authored-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
    Co-authored-by: Wang, Xiao <24860335+xwang233@users.noreply.github.com>
    Co-authored-by: root <26priya11@gmail.com>
    wujingyue pushed a commit that referenced this pull request May 22, 2025
    aliasing is not needed anymore in Host IR after
    #4301
    jacobhinkle pushed a commit that referenced this pull request May 23, 2025
    aliasing is not needed anymore in Host IR after
    #4301
    nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
    aliasing is not needed anymore in Host IR after
    NVIDIA#4301
