
[Host irs] alias and preallocated output support #4144

Merged
samnordmann merged 8 commits into main from host_irs/alias_support
Apr 16, 2025

Conversation

samnordmann (Collaborator) commented Mar 26, 2025

This PR belongs to a series of stacked PRs:

  1. => You are here: [Host irs] alias and preallocated output support #4144
  2. [Host Ir] refactor and cleanup lowering and segmentation #4145
  3. [Host ir] support for set reduce and binary op #4146
  4. [Host irs] Stream lowering of single device fusions #4147

What

  • Support for aliases in HostIrContainer. When a TensorView tv1 is marked as an alias of tv0, then, at runtime, tv0's concrete data/buffer is used for the op. This is a way to reuse buffers that have been allocated elsewhere while staying within the TensorView SSA paradigm. Chained aliasing (tv2 --> tv1 --> tv0) is supported.
  • Fix preallocated outputs in HostIrEvaluator

Why

It is necessary for stream parallelization, where typically we allocate the full output buffer but each stream writes to a slice of this buffer.

How

The aliasing is stored in the HostIrContainer through a map.

At the HostIrEvaluator level, instead of operating directly on the ExprEvaluator to read/write concrete data, we first apply the alias indirection.

github-actions bot commented Mar 26, 2025

Review updated until commit 97b1743

Description

  • Added support for aliases in HostIrContainer for buffer reuse.

  • Enhanced HostIrEvaluator to handle preallocated outputs.

  • Improved error handling and validation in HostIrEvaluator.

  • Added new tests for aliasing and preallocated outputs.


Changes walkthrough 📝

Relevant files

Enhancement

csrc/host_ir/container.cpp — Add alias support in HostIrContainer (+6/-1)
  • Added alias printing in print method.
  • Added markAlias and alias methods.

csrc/host_ir/executor.cpp — Enhance HostIrEvaluator for aliases and preallocated outputs (+110/-108)
  • Removed redundant getKnownTensorOrUndefined functions.
  • Added getAlias, isKnown, getKnownConcreteValue, getKnownTensorOrUndefined, bind, and invalidate methods.
  • Enhanced runWithInput to handle preallocated outputs.
  • Improved error handling in handle methods.

csrc/host_ir/container.h — Add alias support in HostIrContainer (+12/-0)
  • Added markAlias and alias methods.

csrc/host_ir/executor.h — Enhance HostIrEvaluator for aliases and preallocated outputs (+32/-1)
  • Added getAlias, isKnown, getKnownConcreteValue, getKnownTensorOrUndefined, bind, and invalidate methods.

Tests

tests/cpp/test_host_irs.cpp — Add tests for preallocated outputs and aliasing (+145/-8)
  • Added new test cases for preallocated outputs and aliasing.
  • Updated existing tests to use EXPECT_TRUE instead of GTEST_EXPECT_TRUE.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Issue

The getKnownConcreteValue and getKnownTensorOrUndefined methods in HostIrEvaluator assume that getAlias will always return a Val* that is known to the expr_evaluator_. However, there is no guarantee that the aliased Val* is known, which could lead to errors.

```
        host_ir_container_->outputs().end(),
        communication->out()));

NVF_ERROR(
    out_idx < (int64_t)host_ir_container_->outputs().size(),
    "Output tensor not found in fusion outputs");
auto out_tensor = output_args[out_idx].as<at::Tensor>();

// Inputs are already validated in bindInputs.
validateTensors({out_tensor}, {communication->out()}, expr_eval);
c10::intrusive_ptr<c10d::Work> work = postSingleCommunication(
    communication,
    communicator_->deviceId(),
    backend,
```

Performance Concern

The introduction of getAlias calls in multiple methods could introduce performance overhead, especially if the alias chain is long. Consider optimizing this if performance becomes an issue.

```
for (Expr* e : host_ir_container_->topLevelExprs()) {
  NVF_ERROR(e->isA<Communication>());
  auto* communication = e->as<Communication>();
  c10d::Backend* backend =
      communicator_->getBackendForTeam(communication->team(), std::nullopt);
  auto in_tensor = expr_eval.evaluate(communication->in()).as<at::Tensor>();
  auto out_idx = std::distance(
      host_ir_container_->outputs().begin(),
      std::find(
          host_ir_container_->outputs().begin(),
```

Code Duplication

The getKnownTensorOrUndefined method is similar to the getKnownConcreteValue method but returns an at::Tensor. Consider refactoring to avoid code duplication.

```
validateTensors({out_tensor}, {communication->out()}, expr_eval);
c10::intrusive_ptr<c10d::Work> work = postSingleCommunication(
    communication,
    communicator_->deviceId(),
    backend,
    in_tensor,
```

    samnordmann (Collaborator, Author):

    !test

    ```
    std::vector<Expr*> top_level_exprs_;
    std::vector<std::unique_ptr<KernelExecutor>> kernel_executors_;
    Stream* default_stream_ = nullptr;
    std::unordered_map<const Val*, Val*> alias_;
    ```
    Collaborator:

    Why not TensorView* since markAlias takes TensorView?

    samnordmann (Collaborator, Author):

    It is more convenient imo to keep a Val* here, so that Scalars and TensorViews get uniform treatment when we get/set through the method getAlias.
    If we changed the signature to TensorView* here, getAlias would need to branch: maybe downcast the Val* to a TensorView*, get the alias, and then upcast the obtained TensorView* back to a Val*.

    This is doable but imo slightly less natural.
    Besides, we restrict aliasing to TensorViews only for simplicity, not for a structural reason, which is why I made this choice.

    Let me know what you prefer.

    Collaborator:

    but not for a structural reason

    What does it mean for a scalar to alias another scalar?

    samnordmann (Collaborator, Author) commented Apr 15, 2025:

    but not for a structural reason

    What does it mean for a scalar to alias another scalar?

    Like for a Tensor: a different name pointing to the same data (IOW, seeing the scalar as a 0-dim tensor).
    But there is no real motivation to support that for now.

    Collaborator:

    FYI, scalar is not a 0-dim tensor. The difference is like int vs int*. To me, aliases for pointers make sense but aliases for scalars don't.

    I'm fine with this code as is. Personally, I prefer std::unordered_map<TensorView*, TensorView*> because it gives the right impression that only TensorViews can alias. std::unordered_map<Val*, TensorView*> is OK as well if the key is too often a Val* and you want to save too much typing of ->as<TensorView>().

    samnordmann (Collaborator, Author) commented Apr 16, 2025:

    FYI, scalar is not a 0-dim tensor. The difference is like int vs int*. To me, aliases for pointers make sense but aliases for scalars don't.

    Ok, I'll leave it as is for now. For the sake of the discussion, though, and to make sure I am not missing something:

    A scalar can be viewed as a 0-dim tensor, mathematically. It is then an implementation detail whether the scalar type owns a pointer or a value. I understand that pytorch chooses to have at::Scalar represent the value rather than the pointer, contrary to at::Tensor. However, in our context we add one level of indirection (through hir-aliasing and the expression evaluator), so aliasing is always possible, even for scalars. More precisely: the symbolic object (scalar or tensor) is a Val*, mapped through (hir-alias + ExprEvaluator) to a PolymorphicValue (e.g. an at::Tensor or an at::Scalar). Two Val* can always be mapped to the same concrete value; IOW, aliasing is always possible.

    samnordmann (Collaborator, Author):

    see #4147 (comment) for a discussion where aliasing is needed for now even though HIR doesn't make an SSA assumption

    @samnordmann samnordmann requested a review from wujingyue April 14, 2025 13:53
    wujingyue (Collaborator) left a comment:

    I'll finish reviewing this tomorrow. US tax day is driving me crazy unfortunately... :)

    ```
    hic->markAlias(tv1, tv0);
    hic->addInput(tv0);

    EXPECT_ANY_THROW(HostIrEvaluator hie(std::move(hic)));
    ```
    Collaborator:

    Try to be specific about what exceptions to expect. Eg:

    This helps readers understand the intention of the test and makes the test stricter.

    samnordmann (Collaborator, Author):

    Using the gmock templates with the move-only std::unique_ptr<HostIrContainer> hic gives a segfault:

    ```
    EXPECT_THAT(
        [&]() { HostIrEvaluator hie(std::move(hic)); },
        ::testing::ThrowsMessage<std::runtime_error>(
            ::testing::HasSubstr("Inputs cannot be aliased")));
    ```

    After a quick search on the internet, I found this: https://google.github.io/googletest/gmock_cook_book.html#mocking-methods-that-use-move-only-types
    Achieving what you suggest is doable but challenging -- at least I can't figure out how to do it easily.

    samnordmann (Collaborator, Author):

    !test

    @samnordmann samnordmann merged commit a064f6e into main Apr 16, 2025
    53 checks passed
    @samnordmann samnordmann deleted the host_irs/alias_support branch April 16, 2025 14:55
    samnordmann added a commit that referenced this pull request Apr 16, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. **=> You are here:** #4145
    3. #4146
    4. #4147
    
    # What
    
    1. We replace the bool option `SegmentCandidateFinderOptions::
    only_segment_resharding_exprs` with a pointer to a predicate function.
    This allows the user of the segmenter to (optionally) provide a custom
    function that decides whether two given groups should be merged. This
    achieves better separation of responsibilities: with this option, the
    segmenter is only responsible for applying the segmentation algorithm
    and does not embed the application-specific rule for merging groups.
    In our context, that rule is decided by the HIR lowering. Imo this
    refactoring should ideally go further and make the segmenter a more
    abstract class used in both Host IR and FusionExecutorCache lowering,
    changing only the newly introduced function pointer.
    2. In HIR lowering, we clearly separate (in a distinct for-loop, which
    will later become a preseg pass) the pass that transforms resharding
    exprs into a Communication.
    
    # Why 
    That's a preliminary refactoring, useful for more advanced Host IR
    lowering, notably ParallelType::Stream lowering.
    samnordmann added a commit that referenced this pull request Apr 27, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. #4145
    3. **=> You are here:** #4146
    4. #4147
    
    Add support for `LoadStoreOp`, `BinaryOp`, and `ReductionOp`, including
    support for pre-allocated outputs, which ExprEvaluator does not
    provide.
    
    ---------
    
    Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
    samnordmann added a commit that referenced this pull request Apr 28, 2025
    Porting PR #4147 to here, which is based on main.
    
    This PR only serves as a reference since it got broken down into the
    following PRs to ease reviewing and merging:
    1. #4144
    2. #4145
    3. #4146
    4. #4147
    samnordmann added a commit that referenced this pull request Apr 28, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. #4145
    3. #4146
    4. #4301
    5. **=> You are here:** #4147
    
    # What
    
    Implement a proper lowering for handling ParallelType::Stream. This PR
    has the following restrictions:
    - Single device fusion
    - No split/merge of Stream axis
    
    We add to Hir lowering a new pass that reads the hir container's top
    level expressions, reads the consumer's stream parallelization and
    create For Loop with stream management and sync for expressing the
    stream parallelization. Basic logic for merging For-Loop are written.
    
    Let me explain through some examples that can be found in the PR. We
    suggest running those examples as follows:
    ```
    NVFUSER_DUMP=host_ir test_host_ir --gtest_filter=*
    ```
    
    ## Single expr and for-loop
    Look at `MultiDeviceExecutorLowerStreamTest.SingleSetOp`, a simple
    scenario:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      fusion->addInput(tv0);
      fusion->addOutput(tv1);
      tv1->axis(0)->parallelize(ParallelType::Stream);
    ```
    the dumped generated Host Ir program is:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T1_g_float[iStreamIdx2{i0}, iS3{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T2_l_float[iS4{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = Set( T2_l_float[iS4{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    We can see that the expr, here the "Set", gets embedded into a for loop.
    Let us analyze further:
    - Outside the for loop, we allocate the global output buffer.
    - The start of the for-loop body assigns the new stream and syncs that
    stream with the user stream.
    - Then, we "Select" (aka slice) through `HirAliasSelect` into the input
    and output.
    - The "Set" operation is executed on the "selected" I/O. Note that the
    output is an alias to the output's slice.
    - At the end of the for loop, we reset to the user's stream (i.e., the
    stream that was current before entering the program) and sync the
    user's stream with the running stream.
    
    ## Merging for loops
    
    To avoid unnecessary synchronization across streams, it is important to
    be able to fuse the stream for-loop. This is exercised by the test
    `MultiDeviceExecutorLowerStreamTest.TwoSetOps`:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      fusion->addInput(tv0);
      fusion->addOutput(tv2);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv2->axis(0)->parallelize(ParallelType::Stream);
    ```
    dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T2_g_float[iStreamIdx4{i0}, iS5{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      T2_g_float[iStreamIdx4{i0}, iS5{i2}] = ALLOCATE(buffer=T2_g_float[iStreamIdx4{i0}, iS5{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T3_l_float[iS6{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = Set( T3_l_float[iS6{i2}], cache_op=Streaming )
        T5_l_float[iS8{i2}]
           = HirAliasSelect( T2_g_float[iStreamIdx4{i0}, iS5{i2}], axis = iStreamIdx4{i0}, index = StreamIdx )
        T5_l_float[iS8{i2}]
           = Set( T4_l_float[iS7{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    We observe that the for-loops are indeed merged.
    **Possible future optimization:** the allocation of the intermediate
    buffer could be of length `numberOfStreams` only.
    
    ## Separating for loops
    
    We also need to be able to separate and create new for loops when
    necessary, as exercised in `ThreeSetOpsWithDisjointsForLoops`, which
    considers the fusion:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      TensorView* tv3 = set(tv2);
      fusion->addInput(tv0);
      fusion->addOutput(tv3);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv3->axis(0)->parallelize(ParallelType::Stream);
    ```
    Here, tv2 is not stream-parallelized, so it should not be produced in a
    for-loop. Dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T3_g_float[iStreamIdx6{i0}, iS7{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T4_l_float[iS8{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = Set( T4_l_float[iS8{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
      T2_g_float[iS4{i0}, iS5{i2}]
         = Set( T1_g_float[iStreamIdx2{i0}, iS3{i2}], cache_op=Streaming )
      T3_g_float[iStreamIdx6{i0}, iS7{i2}] = ALLOCATE(buffer=T3_g_float[iStreamIdx6{i0}, iS7{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx6{i0}:
        GetCurrentStream into Stream 2
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 2
        T6_l_float[iS10{i2}]
           = HirAliasSelect( T2_g_float[iS4{i0}, iS5{i2}], axis = iS4{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = HirAliasSelect( T3_g_float[iStreamIdx6{i0}, iS7{i2}], axis = iStreamIdx6{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = Set( T6_l_float[iS10{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 2
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    
    ---------
    
    Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
    Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
    Co-authored-by: Ryan Spring <rspring@nvidia.com>
    Co-authored-by: Liqiang Lu <116412316+liqiangxl@users.noreply.github.com>
    Co-authored-by: jjsjann123 <jiej@nvidia.com>
    Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
    Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
    Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
    Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
    Co-authored-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
    Co-authored-by: Wang, Xiao <24860335+xwang233@users.noreply.github.com>
    Co-authored-by: root <26priya11@gmail.com>