
[Host Ir] refactor and cleanup lowering and segmentation #4145

Merged
samnordmann merged 16 commits into main from
host_irs/refactor_lowering_and_segmentation
Apr 16, 2025

Conversation

@samnordmann
Collaborator

@samnordmann samnordmann commented Mar 26, 2025

This PR belongs to a series of stacked PRs:

  1. [Host irs] alias and preallocated output support #4144
  2. => You are here: [Host Ir] refactor and cleanup lowering and segmentation #4145
  3. [Host ir] support for set reduce and binary op #4146
  4. [Host irs] Stream lowering of single device fusions #4147

What

  1. We replace the bool option SegmentCandidateFinderOptions::only_segment_resharding_exprs by a pointer to a predicate function. This allows the user of the segmenter to (optionally) provide a custom function used to decide whether two given groups should be merged. This achieves a better separation of responsibilities: with this option, the segmenter is only responsible for applying the segmentation algorithm, and does not embed the application-specific rule for merging groups. In our context, that rule is decided by the Hir lowering. IMO this refactoring should ideally go further and make the segmenter a more abstract class that would be used in both Host Ir and FusionExecutorCache lowering, differing only in the newly introduced function pointer.
  2. In HIR lowering, we clearly separate (in a distinct for-loop, but this will later become a preseg pass) the pass that transforms resharding exprs into a Communication.

Why

This is a preliminary refactoring useful for more advanced Host Ir lowering, notably ParallelType::Stream lowering.

@github-actions

github-actions bot commented Mar 26, 2025

Review updated until commit 684118f

Description

  • Replace only_segment_resharding_exprs with custom_should_merge_groups function pointer.

  • Introduce isLowerableAsStandaloneHostOp and shouldMergeSegmentedGroups methods in HostIrLower.

  • Update SegmentCandidateFinderOptions to use the new function pointer.

  • Modify HostIrContainer to include resetTopLevelExprs method.


Changes walkthrough 📝

Relevant files: Enhancement

fusion_segmenter.cpp (csrc/fusion_segmenter.cpp): Update segmenter options handling (+8/-11)

  • Replace only_segment_resharding_exprs with custom_should_merge_groups checks.
  • Update codeGenSupportedMerge and deriveSchedulerType methods.

lower.cpp (csrc/host_ir/lower.cpp): Add custom merge function and update lowering logic (+55/-26)

  • Add isLowerableAsStandaloneHostOp and shouldMergeSegmentedGroups methods.
  • Update lower method to use custom_should_merge_groups.
  • Modify logic for handling resharding expressions.

test_resharding.cpp (tests/cpp/test_resharding.cpp): Update test options (+1/-1)

  • Update SegmentCandidateFinderOptions to use custom_should_merge_groups.

fusion_segmenter.h (csrc/fusion_segmenter.h): Add custom merge function pointer (+7/-1)

  • Add custom_should_merge_groups function pointer to SegmentCandidateFinderOptions.

container.h (csrc/host_ir/container.h): Add method to reset top-level expressions (+4/-0)

  • Add resetTopLevelExprs method to HostIrContainer.

lower.h (csrc/host_ir/lower.h): Declare new methods and include header (+7/-0)

  • Include fusion_segmenter.h.
  • Declare isLowerableAsStandaloneHostOp and shouldMergeSegmentedGroups methods.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

    Code Clarity

    The comment in codeGenSupportedMerge suggests a redesign of the segmenter, but the current implementation still includes a return statement outside the if block. Ensure the logic is clear and the comment reflects the current state.

      // The segmenter should ideally be redesigned to be more flexible and
      // decoupled from the schedulers, but for now, we just return
      // `SchedulerType::None` as it is not relevant when the segmenter is
      // used with a custom should-merge function.
      if (options_.custom_should_merge_groups != nullptr) {
        return (options_.custom_should_merge_groups)(group1, group2);
      }
      return tryMerge(segmented_fusion_.get(), runtimeInfo(), group1, group2) !=
          SchedulerType::None;
    }
    Logic Verification

    The new shouldMergeSegmentedGroups function and its usage in lower need to be verified for correctness, especially the logic around isLowerableAsStandaloneHostOp. Ensure that the new logic does not introduce any regressions.

    bool HostIrLower::shouldMergeSegmentedGroups(
        SegmentedGroup* group1,
        SegmentedGroup* group2) {
      for (auto group : {group1, group2}) {
        for (Expr* expr : group->exprs()) {
          if (isLowerableAsStandaloneHostOp(expr)) {
            return false;
          }
        }
      }
      return true;
    }
    Code Duplication

    The logic for handling resharding expressions in lower seems to be duplicated. Consider refactoring to avoid duplication and improve maintainability.

    std::vector<Expr*> new_top_level_exprs;
    for (auto top_level_expr : hic->topLevelExprs()) {
      if (!isResharding(top_level_expr)) {
        new_top_level_exprs.push_back(top_level_expr);
        continue;
      }
      for (auto* expr : HostIrLower::lower(top_level_expr, my_device_index)) {
        // Allocate the recv buffers of communications
        if (expr->isA<Communication>()) {
          auto* communication = expr->as<Communication>();
          TensorView* tv = communication->out();
          if (tv->getDeviceMesh().has(my_device_index)) {
            auto* allocate =
                IrBuilder::create<kir::Allocate>(tv, MemoryType::Global);
            new_top_level_exprs.push_back(allocate);
          }
        }
        new_top_level_exprs.push_back(expr);
        if (expr->isA<Communication>()) {
          auto wait = IrBuilder::create<hir::Wait>(expr->as<Communication>());
          new_top_level_exprs.push_back(wait);
        }
      }

    @samnordmann samnordmann changed the title refactor and clean host ir lowering and segmentation [Host Ir] refactor and cleanup lowering and segmentation Mar 26, 2025
    @samnordmann
    Collaborator Author

    !test

    1 similar comment
    @samnordmann
    Collaborator Author

    !test

    Comment on lines +41 to +43
    static bool ShouldMergeSegmentedGroups(
    SegmentedGroup* group1,
    SegmentedGroup* group2);
Collaborator Author

This function would not need to be exposed if it weren't used in tests/cpp/test_resharding.cpp. We could alternatively not expose it and just duplicate its implementation (which is short and simple) in the test.

    @samnordmann samnordmann requested review from nsarka and wujingyue March 26, 2025 14:44
    @samnordmann
    Collaborator Author

    !test

    }
    }
    return true;
    if (options_.custom_should_merge_groups != nullptr) {
Collaborator

    There's also segment_set that serves a similar purpose. But it's hard to tell why it's not sufficient without reviewing the following PRs.

Collaborator Author

You are right that both mechanisms serve the same purpose. Btw, this was also true before this PR with the only_segment_resharding_exprs option. For the time being, besides the fact that passing a function is much closer to the existing code than moving to segmenter sets, I also find this way more lightweight and usable in this context. More precisely, it saves me:

    1. a pass for adding the sets
    2. adding an option in the segmenter to only segment according to the segmenter set
    3. a pass for removing the segmenter sets

    }
    }
    }
    hic->resetTopLevelExprs(new_top_level_exprs);
Collaborator

Can you clarify why we need resetting? In general, I tend to avoid APIs that reset. They make it hard to reason about the life cycles of things. I might even create a new HostIrContainer in order not to have to reset.

Collaborator Author

@samnordmann samnordmann Apr 9, 2025

    We need resetting here because we want to replace an expression with its lowered version, i.e. replace
    Expr1 Expr2 Expr3
    with
    Expr1 Expr2' Expr3

    There are other examples in forthcoming PR where we instead want to move expression inside a for loop, i.e., replace

    Expr1
    Expr2
    Expr3
    

    To

    Expr1
    For Loop { Expr2}
    Expr3
    

    resetting the top level expression seemed like a simple ok way to do it, but I'm open to suggestions.

    reason about life cycles of things

What do you mean exactly? Would it be OK if we free the deleted expressions?

Collaborator

Instead of resetting, can we add the correct top-level expressions in one go? The high-level logic seems to be:

    for each group in topo order: 
      if group leads to a kernel:
        create a PostOnStream and put it in hir's top-level
      else:
        for each expression in topo order:
          if the expression needs lowering:
            lower it to host IR and append the lowered into hir
          else:
            append the expression directly to hir
    

Collaborator Author

@samnordmann samnordmann Apr 11, 2025

In this PR we are separating out the "segmentation step" from "lowering a resharding expr into comms". So, with this patch there is

1. a first pass in topological order where we create the top_level_expression_, mapping exactly the segments of the SegmentedFusion
2. a second lowering pass that goes through the expressions and maybe lowers them to (allocation + communication + wait)

In a coming PR this second pass will be moved to a different file and become a proper preseg_pass.

Besides being IMO cleaner and more readable this way, it also allows us to control the preseg passes and their order of execution. This is needed for stream lowering.

For this reason, in this PR, I just separated into two loops what used to be done in "one go" as you suggest here. Therefore, I cannot do as you suggest.

Collaborator

    In a coming PR this second pass will be moved to a different file and become a proper preseg_pass.

    Don't bother doing this. Host IR lowering is after segmentation, so I'd rather it being kept in csrc/host_ir not csrc/preseg_passes.

Collaborator

    I understood the need to change host IR in a container. Many (if not all) host IR optimizations will be implemented as "passes".

    I don't think "reset" is the right API. Most optimizations will be "surgical", e.g., fusing two adjacent for loops and adding stream assignments for overlapping. Such optimizations would "reset" the container with a list of mostly identical host IR besides those being changed. This makes it hard to figure out what's actually changed in the code. I think we'll end up building something like kir::IrVisitor in the future, which is of course not required in this PR.

Collaborator Author

@samnordmann samnordmann Apr 15, 2025

    In a coming PR this second pass will be moved to a different file and become a proper preseg_pass.

    Don't bother doing this. Host IR lowering is after segmentation, so I'd rather it being kept in csrc/host_ir not csrc/preseg_passes.

    Why is it a problem? It is very useful to have it in a separate pass, in order to:

• being able to enable/disable the pass with the Optimization Pass Guard
• getting useful debug prints with the NVFUSER_DUMP=preseg_pass option
• having the code more factored and structured, and treating the passes uniformly. For example, it took me many trials-and-errors to figure out the right order of the passes.

    Host IR lowering is after segmentation

    It is debatable, it depends on which segmentation, right? In the classical sense, it is still a preseg pass; the only segmentation that happens before is the hostIr segmentation.

If the naming or file organization is an issue, should I create another class of optimization pass for HIR passes happening after HIR segmentation?

Collaborator

    which segmentation

    I understood MultiDeviceExecutor runs two segmentations. FusionExecutorCache is different. It runs only one and host IR lowering is after segmentation (cf. https://docs.google.com/document/d/1QrRmN27XsVjZu7QrZWJJyRENO50878LC3MvlQY1cRYA/edit?tab=t.0)

    should I create another class of Optimization pass for HIR passes

    Yes and in a different folder, e.g., csrc/host_ir. This is similar to device_lower/passes. Optimization passes can be as simple as a function, and I'd probably start with just that. When you need to add more features like guards for these passes, you can make them classes.

Collaborator

@wujingyue wujingyue left a comment

    LGTM otherwise

    @samnordmann
    Collaborator Author

    !test

@samnordmann samnordmann force-pushed the host_irs/refactor_lowering_and_segmentation branch from b3ce2b4 to 4964680 on April 11, 2025 12:15
    @samnordmann samnordmann requested a review from wujingyue April 14, 2025 12:35
    SegmentedGroup* group) {
    FUSER_PERF_SCOPE("SegmentCandidateFinder::deriveSchedulerType");
    if (options_.only_segment_resharding_exprs) {
    if (options_.custom_should_merge_groups != nullptr) {
Collaborator

    Why is this always None?

Collaborator Author

@samnordmann samnordmann Apr 15, 2025

It is nullptr by default, and if it is, we fall back to the traditional single-device segmenter using the schedulers.
Does that answer your question?

Collaborator Author

@samnordmann samnordmann Apr 15, 2025

Sorry, I misunderstood your question. I guess this one is more for @wujingyue -- here I'm only reproducing the previous behavior, but replacing the option "only_segment_resharding_exprs" with a more agnostic one.

The idea of returning None here has to do with how FusionExecutorCache decides to lower segments. However, this is not used in MultiDeviceExecutor, so I am not so familiar with this part.

Collaborator

    I think this is where the extension of the custom "should merge" function feels more like a hack. The overall design of the segmenter is tightly coupled with scheduling, so it is assumed to have this scheduler type. However, what we are finding is that sometimes we also want to use this without scheduling.

    This is a good learning for when we redesign the segmenter. For now, can you please leave a note? Something like:

The segmenter should ideally be redesigned to be more flexible and decoupled from the schedulers, but for now, we just return `SchedulerType::None` as it is not relevant when the segmenter is used with a custom should-merge function.
    

Collaborator Author

    Yes, agreed. For the record, this hack has been present for quite a long time now. Let me add the comment as you suggest

Collaborator

    The overall design of the segmenter is tightly coupled with scheduling, so it is assumed to have this scheduler type

    That's correct, @naoyam. FWIW, this flag is only turned on for MultiDeviceExecutor. In FusionExecutorCache, schedulers test isResharding as you suggested.

    }

    bool HostIrLower::isLowerableAsStandaloneHostOp(Expr* expr) {
    return isResharding(expr);
Member

    Are you expecting this to change in the future to something other than return isResharding(expr)? I'm just wondering, why not just use isResharding like before?

Collaborator Author

Yes, this will change from #4146 on.

    @samnordmann
    Collaborator Author

    !test

    @samnordmann
    Collaborator Author

    !test

    samnordmann added a commit that referenced this pull request Apr 16, 2025
    This PR belongs to a series of stacked PRs:
    1. **=> You are here: #4144**
    2. #4145
    3. #4146
    4. #4147
    
    # What
    
    - Support for aliases in HostIrContainer. When a Tensor tv1 is marked as
    being the alias of tv0, then, at runtime, tv0's concrete data/buffer
    will be used for the op. It is a way to reuse buffers that have been
    allocated elsewhere within the TensorView's SSA paradigm. Chained
aliasing (tv2-->tv1-->tv0) is supported.
    - Fix preallocated outputs in HostIrEvaluator 
    
    # Why
    
    It is necessary for stream parallelization, where typically we allocate
    the full output buffer but each stream writes to a slice of this buffer.
    
    # How 
    
    The aliasing is stored in the HostIrContainer through a map.
    
    At the HostIrEvaluator level, instead of operating directly on the
    ExprEvaluator to write/read concrete data, we first apply the alias
indirection.
    Base automatically changed from host_irs/alias_support to main April 16, 2025 14:55
    @samnordmann samnordmann merged commit 729967b into main Apr 16, 2025
    53 checks passed
    @samnordmann samnordmann deleted the host_irs/refactor_lowering_and_segmentation branch April 16, 2025 14:57
    samnordmann added a commit that referenced this pull request Apr 27, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. #4145
    3. **=> You are here:** #4146
    4. #4147
    
    Add support for `LoadStoreOp`, `BinaryOp`, `ReductionOp`, including
    support for pre-allocated output, which is not provided by
    ExprEvaluator.
    
    ---------
    
    Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
    samnordmann added a commit that referenced this pull request Apr 28, 2025
Porting PR #4147 to here, which is based on main.
    
    This PR only serves as a reference since it got broken down into the
    following PRs to ease reviewing and merging:
    1. #4144
    2. #4145
    3. #4146
    4. #4147
    samnordmann added a commit that referenced this pull request Apr 28, 2025
    This PR belongs to a series of stacked PRs:
    1. #4144
    2. #4145
    3. #4146
    4. #4301
    5. **=> You are here:** #4147
    
    # What
    
    Implement a proper lowering for handling ParallelType::Stream. This PR
    has the following restrictions:
    - Single device fusion
    - No split/merge of Stream axis
    
We add to Hir lowering a new pass that reads the hir container's top-level
expressions, reads the consumer's stream parallelization, and creates For
Loops with stream management and syncs for expressing the stream
parallelization. Basic logic for merging For Loops is written.
    
Let me explain through some examples that can be found in the PR. We
suggest running those examples as follows:
    ```
    NVFUSER_DUMP=host_ir test_host_ir --gtest_filter=*
    ```
    
    ## Single expr and for-loop
    Look at `MultiDeviceExecutorLowerStreamTest.SingleSetOp` simple
    scenario:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      fusion->addInput(tv0);
      fusion->addOutput(tv1);
      tv1->axis(0)->parallelize(ParallelType::Stream);
    ```
    the dumped generated Host Ir program is:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T1_g_float[iStreamIdx2{i0}, iS3{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T2_l_float[iS4{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T3_l_float[iS5{i2}]
           = Set( T2_l_float[iS4{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    We can see that the expr, here the "Set", gets embedded into a For Loop.
    Let us analyze further:
    - outside the for loop, we allocate the global output buffer.
    - The start of the for loop body does the new stream assignment and sync
    of that stream to the user stream
    - Then, we "Select" (aka slice) through `HirAliasSelect` into the input
    and output
    - The "Set" operation is executed on the "selected" I/O. Note that the
    output is an alias to the output's slice.
    - At the end of the for loop, we reset to the user's stream (I mean, the
    currently selected stream before entering the program) and sync the
    user's stream with the running stream.
    
    ## Merging for loops
    
    To avoid unnecessary synchronization across streams, it is important to
    be able to fuse the stream for-loop. This is exercised by the test
    `MultiDeviceExecutorLowerStreamTest.TwoSetOps`:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      fusion->addInput(tv0);
      fusion->addOutput(tv2);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv2->axis(0)->parallelize(ParallelType::Stream);
    ```
    dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T2_g_float[iStreamIdx4{i0}, iS5{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      T2_g_float[iStreamIdx4{i0}, iS5{i2}] = ALLOCATE(buffer=T2_g_float[iStreamIdx4{i0}, iS5{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T3_l_float[iS6{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T4_l_float[iS7{i2}]
           = Set( T3_l_float[iS6{i2}], cache_op=Streaming )
        T5_l_float[iS8{i2}]
           = HirAliasSelect( T2_g_float[iStreamIdx4{i0}, iS5{i2}], axis = iStreamIdx4{i0}, index = StreamIdx )
        T5_l_float[iS8{i2}]
           = Set( T4_l_float[iS7{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
We observe that the For-loops are indeed merged.
    **Possible future optimization:** the allocation of the intermediate
    buffer could be only of length `numberOfStreams`
    
    ## separating for loops
    
    We also need to be able to separate and create new for loops if
    necessary, as exercised in `ThreeSetOpsWithDisjointsForLoops`, which
    considers the Fusion:
    ```
      TensorView* tv0 = makeContigTensor(2);
      TensorView* tv1 = set(tv0);
      TensorView* tv2 = set(tv1);
      TensorView* tv3 = set(tv2);
      fusion->addInput(tv0);
      fusion->addOutput(tv3);
      tv1->axis(0)->parallelize(ParallelType::Stream);
      tv3->axis(0)->parallelize(ParallelType::Stream);
    ```
Here, tv2 is not stream-parallelized, so it should not be produced in a
for-loop. Dump:
    ```
    %HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T3_g_float[iStreamIdx6{i0}, iS7{i2}]) :
      T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx2{i0}:
        GetCurrentStream into Stream 0
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 0
        T4_l_float[iS8{i2}]
           = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
        T5_l_float[iS9{i2}]
           = Set( T4_l_float[iS8{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 0
        Synchronize Stream ( StreamIdx % numberOfStreams )
      T2_g_float[iS4{i0}, iS5{i2}]
         = Set( T1_g_float[iStreamIdx2{i0}, iS3{i2}], cache_op=Streaming )
      T3_g_float[iStreamIdx6{i0}, iS7{i2}] = ALLOCATE(buffer=T3_g_float[iStreamIdx6{i0}, iS7{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
      FOR StreamIdx in iStreamIdx6{i0}:
        GetCurrentStream into Stream 2
        SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
        Synchronize Stream 2
        T6_l_float[iS10{i2}]
           = HirAliasSelect( T2_g_float[iS4{i0}, iS5{i2}], axis = iS4{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = HirAliasSelect( T3_g_float[iStreamIdx6{i0}, iS7{i2}], axis = iStreamIdx6{i0}, index = StreamIdx )
        T7_l_float[iS11{i2}]
           = Set( T6_l_float[iS10{i2}], cache_op=Streaming )
        SetCurrentStream to Stream 2
        Synchronize Stream ( StreamIdx % numberOfStreams )
    } // %HostIrContainer
    ```
    
    ---------
    
    Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
    Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
    Co-authored-by: Ryan Spring <rspring@nvidia.com>
    Co-authored-by: Liqiang Lu <116412316+liqiangxl@users.noreply.github.com>
    Co-authored-by: jjsjann123 <jiej@nvidia.com>
    Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
    Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
    Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
    Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
    Co-authored-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
    Co-authored-by: Wang, Xiao <24860335+xwang233@users.noreply.github.com>
    Co-authored-by: root <26priya11@gmail.com>
