This PR belongs to a series of stacked PRs:
1. #4144
2. #4145
3. #4146
4. #4301
5. **=> You are here:** #4147
# What
Implement a proper lowering for handling ParallelType::Stream. This PR
has the following restrictions:
- Single device fusion
- No split/merge of Stream axis
We add a new pass to HIR lowering that walks the HIR container's top-level expressions, reads each consumer's stream parallelization, and wraps the expression in a for-loop with the stream management and synchronization needed to express that parallelization. Basic logic for merging for-loops is also implemented.
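The grouping decision the pass makes can be sketched as follows. This is a toy Python model with hypothetical names, not nvFuser code: the real pass works on HIR expressions and IterDomains, and "same stream domain" is decided by iterdomain mapping rather than simple equality. Each expression is represented as `(name, stream_domain)`, where `stream_domain` identifies the consumer's Stream-parallelized axis, or is `None` when the expression is not stream-parallelized.

```python
def group_stream_loops(exprs):
    """Group consecutive top-level expressions into for-loops.

    Consecutive expressions whose consumers share the same Stream
    domain are merged into one loop; a non-stream expression stays at
    the top level and breaks the current loop.
    """
    groups = []  # each group: [stream_domain, [expr names]]
    for name, stream_domain in exprs:
        if (stream_domain is not None
                and groups
                and groups[-1][0] == stream_domain):
            groups[-1][1].append(name)  # merge into the previous loop
        else:
            groups.append([stream_domain, [name]])
    return groups
```

With this model, the `TwoSetOps` test below yields a single loop containing both sets, while `ThreeSetOpsWithDisjointsForLoops` yields two loops separated by a top-level expression.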
Let me explain through some examples that can be found in the PR. We suggest running those examples as follows:
```sh
NVFUSER_DUMP=host_ir test_host_ir --gtest_filter=*
```
## Single expr and for-loop
Consider the simple scenario in `MultiDeviceExecutorLowerStreamTest.SingleSetOp`:
```cpp
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
fusion->addInput(tv0);
fusion->addOutput(tv1);
tv1->axis(0)->parallelize(ParallelType::Stream);
```
The generated Host IR program, as dumped, is:
```
%HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T1_g_float[iStreamIdx2{i0}, iS3{i2}]) :
T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
FOR StreamIdx in iStreamIdx2{i0}:
GetCurrentStream into Stream 0
SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
Synchronize Stream 0
T2_l_float[iS4{i2}]
= HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
T3_l_float[iS5{i2}]
= HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
T3_l_float[iS5{i2}]
= Set( T2_l_float[iS4{i2}], cache_op=Streaming )
SetCurrentStream to Stream 0
Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```
We can see that the expression, here the `Set`, gets embedded into a for-loop. Let us analyze further:
- Outside the for-loop, we allocate the global output buffer.
- The start of the for-loop body assigns the worker stream and synchronizes it with the user stream.
- Then we "select" (i.e., slice) into the input and output through `HirAliasSelect`.
- The `Set` operation is executed on the "selected" I/O. Note that the output of the `Set` is an alias of the output buffer's slice.
- At the end of the for-loop body, we restore the user's stream (i.e., the stream that was current before entering the program) and synchronize it with the worker stream.
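The per-iteration stream assignment and synchronization can be modeled abstractly. The sketch below is my reading of the dump, not nvFuser code: `("wait", a, b)` means stream `a` waits for all work previously enqueued on stream `b`, `"user"` stands for the stream that was current on entry, and worker streams are picked round-robin via `StreamIdx % numberOfStreams`.

```python
def loop_sync_trace(extent, number_of_streams):
    """Return an abstract event trace of the lowered stream loop."""
    trace = []
    for stream_idx in range(extent):
        worker = stream_idx % number_of_streams  # round-robin assignment
        # "Synchronize Stream 0": the worker waits for the user stream
        trace.append(("wait", worker, "user"))
        # the sliced expression runs on the worker stream
        trace.append(("run", worker, stream_idx))
        # tail of the body: the user stream waits for the worker
        trace.append(("wait", "user", worker))
    return trace
```

For `extent=3` and two streams, slices 0 and 2 run on stream 0 and slice 1 on stream 1, each bracketed by the two synchronizations, so all worker streams are drained back into the user stream by the end of the loop.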
## Merging for loops
To avoid unnecessary synchronization across streams, it is important to
be able to fuse the stream for-loop. This is exercised by the test
`MultiDeviceExecutorLowerStreamTest.TwoSetOps`:
```cpp
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
TensorView* tv2 = set(tv1);
fusion->addInput(tv0);
fusion->addOutput(tv2);
tv1->axis(0)->parallelize(ParallelType::Stream);
tv2->axis(0)->parallelize(ParallelType::Stream);
```
Dump:
```
%HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T2_g_float[iStreamIdx4{i0}, iS5{i2}]) :
T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
T2_g_float[iStreamIdx4{i0}, iS5{i2}] = ALLOCATE(buffer=T2_g_float[iStreamIdx4{i0}, iS5{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
FOR StreamIdx in iStreamIdx2{i0}:
GetCurrentStream into Stream 0
SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
Synchronize Stream 0
T3_l_float[iS6{i2}]
= HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
T4_l_float[iS7{i2}]
= HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
T4_l_float[iS7{i2}]
= Set( T3_l_float[iS6{i2}], cache_op=Streaming )
T5_l_float[iS8{i2}]
= HirAliasSelect( T2_g_float[iStreamIdx4{i0}, iS5{i2}], axis = iStreamIdx4{i0}, index = StreamIdx )
T5_l_float[iS8{i2}]
= Set( T4_l_float[iS7{i2}], cache_op=Streaming )
SetCurrentStream to Stream 0
Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```
We observe that the two for-loops are indeed merged.
**Possible future optimization:** the intermediate buffer could be allocated with only `numberOfStreams` slices along the stream axis, instead of the full extent.
## Separating for loops
We also need to be able to create separate for-loops when necessary, as exercised in `ThreeSetOpsWithDisjointsForLoops`, which considers the fusion:
```cpp
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
TensorView* tv2 = set(tv1);
TensorView* tv3 = set(tv2);
fusion->addInput(tv0);
fusion->addOutput(tv3);
tv1->axis(0)->parallelize(ParallelType::Stream);
tv3->axis(0)->parallelize(ParallelType::Stream);
```
Here, `tv2` is not stream-parallelized, so it should not be produced inside a for-loop. Dump:
```
%HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T3_g_float[iStreamIdx6{i0}, iS7{i2}]) :
T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
FOR StreamIdx in iStreamIdx2{i0}:
GetCurrentStream into Stream 0
SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
Synchronize Stream 0
T4_l_float[iS8{i2}]
= HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
T5_l_float[iS9{i2}]
= HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
T5_l_float[iS9{i2}]
= Set( T4_l_float[iS8{i2}], cache_op=Streaming )
SetCurrentStream to Stream 0
Synchronize Stream ( StreamIdx % numberOfStreams )
T2_g_float[iS4{i0}, iS5{i2}]
= Set( T1_g_float[iStreamIdx2{i0}, iS3{i2}], cache_op=Streaming )
T3_g_float[iStreamIdx6{i0}, iS7{i2}] = ALLOCATE(buffer=T3_g_float[iStreamIdx6{i0}, iS7{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
FOR StreamIdx in iStreamIdx6{i0}:
GetCurrentStream into Stream 2
SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
Synchronize Stream 2
T6_l_float[iS10{i2}]
= HirAliasSelect( T2_g_float[iS4{i0}, iS5{i2}], axis = iS4{i0}, index = StreamIdx )
T7_l_float[iS11{i2}]
= HirAliasSelect( T3_g_float[iStreamIdx6{i0}, iS7{i2}], axis = iStreamIdx6{i0}, index = StreamIdx )
T7_l_float[iS11{i2}]
= Set( T6_l_float[iS10{i2}], cache_op=Streaming )
SetCurrentStream to Stream 2
Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```
---------
Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
Co-authored-by: Ryan Spring <rspring@nvidia.com>
Co-authored-by: Liqiang Lu <116412316+liqiangxl@users.noreply.github.com>
Co-authored-by: jjsjann123 <jiej@nvidia.com>
Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
Co-authored-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
Co-authored-by: Wang, Xiao <24860335+xwang233@users.noreply.github.com>
Co-authored-by: root <26priya11@gmail.com>
# What
Add a SelectOp-like HIR node to express indexing into an ATen tensor.
# Why
It is used in the context of stream lowering; see #4147 and especially the discussion in #4147 (comment).