
[Host irs] alias and preallocated output support #4144

Merged
samnordmann merged 8 commits into main from host_irs/alias_support
Apr 16, 2025

Conversation

samnordmann (Collaborator) commented Mar 26, 2025

This PR belongs to a series of stacked PRs:

  1. => You are here: [Host irs] alias and preallocated output support #4144
  2. [Host Ir] refactor and cleanup lowering and segmentation #4145
  3. [Host ir] support for set reduce and binary op #4146
  4. [Host irs] Stream lowering of single device fusions #4147

What

  • Support for aliases in HostIrContainer. When a TensorView tv1 is marked as an alias of tv0, then, at runtime, tv0's concrete data/buffer is used for the op. This is a way to reuse buffers that have been allocated elsewhere while staying within the TensorView SSA paradigm. Chained aliasing (tv2 --> tv1 --> tv0) is supported.
  • Fix preallocated outputs in HostIrEvaluator

Why

It is necessary for stream parallelization, where typically we allocate the full output buffer but each stream writes to a slice of this buffer.

How

The aliasing is stored in the HostIrContainer through a map.

At the HostIrEvaluator level, instead of operating directly on the ExprEvaluator to read/write concrete data, we first apply the alias indirection.

github-actions bot commented Mar 26, 2025

Review updated until commit 97b1743

Description

  • Added support for aliases in HostIrContainer for buffer reuse.

  • Enhanced HostIrEvaluator to handle preallocated outputs.

  • Improved error handling and validation in HostIrEvaluator.

  • Added new tests for aliasing and preallocated outputs.


Changes walkthrough 📝

Relevant files

Enhancement

csrc/host_ir/container.cpp — Add alias support in HostIrContainer (+6/-1)
  • Added alias printing in print method.
  • Added markAlias and alias methods.

csrc/host_ir/executor.cpp — Enhance HostIrEvaluator for aliases and preallocated outputs (+110/-108)
  • Removed redundant getKnownTensorOrUndefined functions.
  • Added getAlias, isKnown, getKnownConcreteValue, getKnownTensorOrUndefined, bind, and invalidate methods.
  • Enhanced runWithInput to handle preallocated outputs.
  • Improved error handling in handle methods.

csrc/host_ir/container.h — Add alias support in HostIrContainer (+12/-0)
  • Added markAlias and alias methods.

csrc/host_ir/executor.h — Enhance HostIrEvaluator for aliases and preallocated outputs (+32/-1)
  • Added getAlias, isKnown, getKnownConcreteValue, getKnownTensorOrUndefined, bind, and invalidate methods.

Tests

tests/cpp/test_host_irs.cpp — Add tests for preallocated outputs and aliasing (+145/-8)
  • Added new test cases for preallocated outputs and aliasing.
  • Updated existing tests to use EXPECT_TRUE instead of GTEST_EXPECT_TRUE.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The getKnownConcreteValue and getKnownTensorOrUndefined methods in HostIrEvaluator assume that getAlias will always return a Val* that is known to the expr_evaluator_. However, there is no guarantee that the aliased Val* is known, which could lead to errors.

```
        host_ir_container_->outputs().end(),
        communication->out()));

NVF_ERROR(
    out_idx < (int64_t)host_ir_container_->outputs().size(),
    "Output tensor not found in fusion outputs");
auto out_tensor = output_args[out_idx].as<at::Tensor>();

// Inputs are already validated in bindInputs.
validateTensors({out_tensor}, {communication->out()}, expr_eval);
c10::intrusive_ptr<c10d::Work> work = postSingleCommunication(
    communication,
    communicator_->deviceId(),
    backend,
```

Performance Concern

The introduction of getAlias calls in multiple methods could introduce performance overhead, especially if the alias chain is long. Consider optimizing this if performance becomes an issue.

```
for (Expr* e : host_ir_container_->topLevelExprs()) {
  NVF_ERROR(e->isA<Communication>());
  auto* communication = e->as<Communication>();
  c10d::Backend* backend =
      communicator_->getBackendForTeam(communication->team(), std::nullopt);
  auto in_tensor = expr_eval.evaluate(communication->in()).as<at::Tensor>();
  auto out_idx = std::distance(
      host_ir_container_->outputs().begin(),
      std::find(
          host_ir_container_->outputs().begin(),
```

Code Duplication

The getKnownTensorOrUndefined method is similar to the getKnownConcreteValue method but returns an at::Tensor. Consider refactoring to avoid code duplication.

```
validateTensors({out_tensor}, {communication->out()}, expr_eval);
c10::intrusive_ptr<c10d::Work> work = postSingleCommunication(
    communication,
    communicator_->deviceId(),
    backend,
    in_tensor,
```

    samnordmann (Collaborator, Author):

    !test

    ```
    std::vector<Expr*> top_level_exprs_;
    std::vector<std::unique_ptr<KernelExecutor>> kernel_executors_;
    Stream* default_stream_ = nullptr;
    std::unordered_map<const Val*, Val*> alias_;
    ```
    Collaborator:

    Why not TensorView* since markAlias takes TensorView?

    samnordmann (Collaborator, Author):

    It is more convenient imo to keep a Val* here, so that Scalars and TensorViews get uniform treatment when we get/set through the method getAlias.
    If we changed the signature to TensorView* here, getAlias would need to branch: maybe downcast the Val* to a TensorView*, get the alias, and then upcast the obtained TensorView* back to a Val*.

    This is doable but imo slightly less natural.
    Besides, we restrict aliasing to TensorViews only for simplicity, not for a structural reason, which is why I made this choice.

    Let me know what you prefer.

    Collaborator:

    but not for a structural reason

    What does it mean for a scalar to alias another scalar?

    samnordmann (Collaborator, Author) commented Apr 15, 2025:

    but not for a structural reason

    What does it mean for a scalar to alias another scalar?

    Like for a Tensor: a different name pointing to the same data (IOW, seeing the scalar as a 0-dim tensor).
    But there is no real motivation to support that for now.

    Collaborator:

    FYI, scalar is not a 0-dim tensor. The difference is like int vs int*. To me, aliases for pointers make sense but aliases for scalars don't.

    I'm fine with this code as is. Personally, I prefer std::unordered_map<TensorView*, TensorView*> because it gives the right impression that only TensorViews can alias. std::unordered_map<Val*, TensorView*> is OK as well if the key is too often a Val* and you want to save too much typing of ->as<TensorView>().

    samnordmann (Collaborator, Author) commented Apr 16, 2025:

    FYI, scalar is not a 0-dim tensor. The difference is like int vs int*. To me, aliases for pointers make sense but aliases for scalars don't.

    Ok, I'll leave it as is for now. For the sake of the discussion, though, and to make sure I am not missing something:

    A scalar can be viewed as a 0-dim tensor, mathematically. It is then an implementation detail whether the scalar type owns a pointer or a value. I understand that pytorch chooses to have at::Scalar represent the value rather than the pointer, contrary to at::Tensor. However, in our context we add one level of indirection (through hir-aliasing and the expression evaluator), so aliasing is always possible, even for scalars. More precisely: the symbolic object (scalar or tensor) is a Val*, mapped through (hir-alias + ExprEvaluator) to a PolymorphicValue (e.g. an at::Tensor or an at::Scalar). Two Val* can always be mapped to the same concrete value; IOW, aliasing is always possible.

    samnordmann (Collaborator, Author):

    see #4147 (comment) for a discussion where aliasing is needed for now even though HIR doesn't make an SSA assumption

    @samnordmann samnordmann requested a review from wujingyue April 14, 2025 13:53
    wujingyue (Collaborator) left a comment:

    I'll finish reviewing this tomorrow. US tax day is driving me crazy unfortunately... :)

    ```
    hic->markAlias(tv1, tv0);
    hic->addInput(tv0);

    EXPECT_ANY_THROW(HostIrEvaluator hie(std::move(hic)));
    ```
    Collaborator:

    Try to be specific about what exceptions to expect. Eg:

    This helps readers understand the intention of the test and makes the test stricter.

    samnordmann (Collaborator, Author):

    Using the gmock templates with the move-only std::unique_ptr<HostIrContainer> hic gives a segfault:

    ```
    EXPECT_THAT(
        [&]() { HostIrEvaluator hie(std::move(hic)); },
        ::testing::ThrowsMessage<std::runtime_error>(
            ::testing::HasSubstr("Inputs cannot be aliased")));
    ```

    After a quick search on the internet, I found this: https://google.github.io/googletest/gmock_cook_book.html#mocking-methods-that-use-move-only-types
    Achieving what you suggest is doable but challenging -- at least I can't figure out how to do it easily.

    samnordmann (Collaborator, Author):

    !test

    @samnordmann samnordmann merged commit a064f6e into main Apr 16, 2025
    53 checks passed
    @samnordmann samnordmann deleted the host_irs/alias_support branch April 16, 2025 14:55
    samnordmann added a commit that referenced this pull request Apr 16, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. **=> You are here:** #4145
    3. #4146
    4. #4147
    
    # What
    
    1. We replace the bool option `SegmentCandidateFinderOptions::
    only_segment_resharding_exprs` with a pointer to a predicate function.
    This allows the user of the segmenter to (optionally) provide a custom
    function that decides whether two given groups should be merged. This
    achieves better separation of responsibilities: with this option, the
    segmenter is only responsible for applying the segmentation algorithm
    and does not embed the application-specific rule for merging groups.
    In our context, that rule is decided by the HIR lowering. Imo this
    refactoring should ideally go further and make the segmenter a more
    abstract class used in both Host IR and FusionExecutorCache lowering,
    changing only the newly introduced function pointer.
    2. In HIR lowering, we clearly separate (in a distinct for-loop, which
    will later become a preseg pass) the pass that transforms resharding
    exprs into a Communication.
    
    # Why 
    That's a preliminary refactoring, useful for more advanced Host IR
    lowering, notably ParallelType::Stream lowering.
    samnordmann added a commit that referenced this pull request Apr 27, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. #4145
    3. **=> You are here:** #4146
    4. #4147
    
    Add support for `LoadStoreOp`, `BinaryOp`, and `ReductionOp`, including
    support for pre-allocated outputs, which ExprEvaluator does not
    provide.
    
    ---------
    
    Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
    samnordmann added a commit that referenced this pull request Apr 28, 2025
    Porting PR #4147 to here, which is based on main.
    
    This PR only serves as a reference since it got broken down into the
    following PRs to ease reviewing and merging:
    1. #4144
    2. #4145
    3. #4146
    4. #4147
    samnordmann added a commit that referenced this pull request Apr 28, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. #4145
    3. #4146
    4. #4301
    5. **=> You are here:** #4147
    
    # What
    
    Implement a proper lowering for handling ParallelType::Stream. This PR
    has the following restrictions:
    - Single device fusion
    - No split/merge of Stream axis
    
    We add to Hir lowering a new pass that reads the hir container's top
    level expressions, reads the consumer's stream parallelization and
    create For Loop with stream management and sync for expressing the
    stream parallelization. Basic logic for merging For-Loop are written.
    
    Let me explain through some examples that can be found in the PR. We
    suggest running those examples as follows:
    ```
    NVFUSER_DUMP=host_ir test_host_ir --gtest_filter=*
    ```
    
    ## Single expr and for-loop
    Look at `MultiDeviceExecutorLowerStreamTest.SingleSetOp`, a simple
    scenario:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      fusion->addInput(tv0);
      fusion->addOutput(tv1);
      tv1->axis(0)->parallelize(ParallelType::Stream);
    ```
    the dumped generated Host Ir program is:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T1_g_float[iStreamIdx2{i0}, iS3{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T2_l_float[iS4{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = Set( T2_l_float[iS4{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    We can see that the expr, here the "Set", gets embedded into a for loop.
    Let us analyze further:
    - Outside the for loop, we allocate the global output buffer.
    - The start of the for-loop body assigns the new stream and syncs that
    stream with the user stream.
    - Then, we "Select" (aka slice) through `HirAliasSelect` into the input
    and output.
    - The "Set" operation is executed on the "selected" I/O. Note that the
    output is an alias to the output's slice.
    - At the end of the for loop, we reset to the user's stream (i.e., the
    stream that was current before entering the program) and sync the
    user's stream with the running stream.
    
    ## Merging for loops
    
    To avoid unnecessary synchronization across streams, it is important to
    be able to fuse the stream for-loop. This is exercised by the test
    `MultiDeviceExecutorLowerStreamTest.TwoSetOps`:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      fusion->addInput(tv0);
      fusion->addOutput(tv2);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv2->axis(0)->parallelize(ParallelType::Stream);
    ```
    dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T2_g_float[iStreamIdx4{i0}, iS5{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      T2_g_float[iStreamIdx4{i0}, iS5{i2}] = ALLOCATE(buffer=T2_g_float[iStreamIdx4{i0}, iS5{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T3_l_float[iS6{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = Set( T3_l_float[iS6{i2}], cache_op=Streaming )
        T5_l_float[iS8{i2}]
           = HirAliasSelect( T2_g_float[iStreamIdx4{i0}, iS5{i2}], axis = iStreamIdx4{i0}, index = StreamIdx )
        T5_l_float[iS8{i2}]
           = Set( T4_l_float[iS7{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    We observe that the for-loops are indeed merged.
    **Possible future optimization:** the allocation of the intermediate
    buffer could be of length `numberOfStreams` only.
    
    ## Separating for loops
    
    We also need to be able to separate and create new for loops when
    necessary, as exercised in `ThreeSetOpsWithDisjointsForLoops`, which
    considers the fusion:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      TensorView* tv3 = set(tv2);
      fusion->addInput(tv0);
      fusion->addOutput(tv3);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv3->axis(0)->parallelize(ParallelType::Stream);
    ```
    Here, tv2 is not stream-parallelized, so it should not be produced in a
    for-loop. Dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T3_g_float[iStreamIdx6{i0}, iS7{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T4_l_float[iS8{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = Set( T4_l_float[iS8{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
      T2_g_float[iS4{i0}, iS5{i2}]
         = Set( T1_g_float[iStreamIdx2{i0}, iS3{i2}], cache_op=Streaming )
      T3_g_float[iStreamIdx6{i0}, iS7{i2}] = ALLOCATE(buffer=T3_g_float[iStreamIdx6{i0}, iS7{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx6{i0}:
        GetCurrentStream into Stream 2
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 2
        T6_l_float[iS10{i2}]
           = HirAliasSelect( T2_g_float[iS4{i0}, iS5{i2}], axis = iS4{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = HirAliasSelect( T3_g_float[iStreamIdx6{i0}, iS7{i2}], axis = iStreamIdx6{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = Set( T6_l_float[iS10{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 2
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    
    ---------
    
    Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
    Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
    Co-authored-by: Ryan Spring <rspring@nvidia.com>
    Co-authored-by: Liqiang Lu <116412316+liqiangxl@users.noreply.github.com>
    Co-authored-by: jjsjann123 <jiej@nvidia.com>
    Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
    Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
    Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
    Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
    Co-authored-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
    Co-authored-by: Wang, Xiao <24860335+xwang233@users.noreply.github.com>
    Co-authored-by: root <26priya11@gmail.com>