[Host irs] alias and preallocated output support #4144
Review updated until commit 97b1743
!test
```cpp
std::vector<Expr*> top_level_exprs_;
std::vector<std::unique_ptr<KernelExecutor>> kernel_executors_;
Stream* default_stream_ = nullptr;
std::unordered_map<const Val*, Val*> alias_;
```
Why not TensorView* since markAlias takes TensorView?
It is more convenient imo to keep a `Val*` here, so that Scalars and TensorViews are treated uniformly when we get/set through the method `getAlias`.
If we changed the stored type to `TensorView*`, `getAlias` would need to downcast the `Val*` to a `TensorView*`, look up the alias, and then upcast the obtained `TensorView*` back to a `Val*`.
This is doable but imo slightly less natural.
Besides, since we restrict aliasing to TensorViews only for simplicity, not for a structural reason, this is why I made this choice.
Let me know what you prefer.
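To make the trade-off concrete, here is a minimal standalone sketch of the two storage choices. All types here (`Val`, `TensorView`, the two container structs) are simplified, hypothetical stand-ins, not nvFuser code; the `TensorView*` variant shows the down/upcast round-trip described above.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical, simplified stand-ins for the Val/TensorView hierarchy.
struct Val {
  virtual ~Val() = default;
};
struct TensorView : Val {};

// Variant 1: store Val* keys/values. get/set need no casts at all.
struct ContainerValKeys {
  std::unordered_map<const Val*, Val*> alias_;
  Val* getAlias(const Val* v) const {
    auto it = alias_.find(v);
    return it == alias_.end() ? nullptr : it->second;
  }
};

// Variant 2: store TensorView*. getAlias must downcast the query and
// upcast the result back to Val*, as described in the comment above.
struct ContainerTvKeys {
  std::unordered_map<const TensorView*, TensorView*> alias_;
  Val* getAlias(const Val* v) const {
    const auto* tv = dynamic_cast<const TensorView*>(v);
    if (tv == nullptr) {
      return nullptr; // non-TensorView Vals can never alias in this variant
    }
    auto it = alias_.find(tv);
    return it == alias_.end() ? nullptr : static_cast<Val*>(it->second);
  }
};
```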
> but not for a structural reason
What does it mean for a scalar to alias another scalar?
> but not for a structural reason
>
> What does it mean for a scalar to alias another scalar?
Like for a Tensor: a different name pointing to the same data (IOW, seeing the scalar as a 0-dim tensor).
But there is no real motivation to support that for now.
FYI, a scalar is not a 0-dim tensor. The difference is like `int` vs `int*`. To me, aliases for pointers make sense but aliases for scalars don't.
I'm fine with this code as is. Personally, I prefer `std::unordered_map<TensorView*, TensorView*>` because it gives the right impression that only TensorViews can alias. `std::unordered_map<Val*, TensorView*>` is OK as well if the key is too often a `Val*` and you want to save the repeated typing of `->as<TensorView>()`.
> FYI, scalar is not a 0-dim tensor. The difference is like `int` vs `int*`. To me, aliases for pointers make sense but aliases for scalars don't.
OK, I'll leave it as is for now. For the sake of the discussion though, and to make sure I am not missing something:
A scalar can be viewed as a 0-dim tensor, mathematically. It is then an implementation detail whether the scalar type owns a pointer or a value. I understand that pytorch makes the choice that `at::Scalar` represents the value and not the pointer, contrary to `at::Tensor`. However, in our context, we add one level of indirection (through hir-aliasing and the expression evaluator), so aliasing is always possible, even for scalars. More precisely: the symbolic object (scalar or tensor) is a `Val*`, mapped through (hir-alias + ExprEvaluator) to a `PolymorphicValue` (e.g. an `at::Tensor` or an `at::Scalar`). Two `Val*` can always be mapped to the same concrete value; IOW, aliasing is always possible.
See #4147 (comment) for a discussion where aliasing is needed for now, even though HIR doesn't make an SSA assumption.
wujingyue left a comment
I'll finish reviewing this tomorrow. US tax day is driving me crazy unfortunately... :)
```cpp
hic->markAlias(tv1, tv0);
hic->addInput(tv0);

EXPECT_ANY_THROW(HostIrEvaluator hie(std::move(hic)));
```
Try to be specific about what exceptions to expect, e.g.:

Fuser/tests/cpp/test_preseg_passes.cpp, line 57 in f44f191
This helps readers understand the intention of the test and makes the test stricter.
Using the gmock templates with the move-only `std::unique_ptr<HostIrContainer> hic` gives a segfault:

```cpp
EXPECT_THAT(
    [&]() { HostIrEvaluator hie(std::move(hic)); },
    ::testing::ThrowsMessage<std::runtime_error>(
        ::testing::HasSubstr("Inputs cannot be aliased")));
```
After a quick search on the internet, I found this: https://google.github.io/googletest/gmock_cook_book.html#mocking-methods-that-use-move-only-types
Achieving what you suggest is doable but challenging -- at least I can't figure out how to do it easily.
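For what it's worth, one workaround that avoids the gmock lambda entirely is a plain try/catch around the construction. The sketch below is self-contained with hypothetical stand-in types (`HirContainer`, `HirEvaluator` are simplifications, not the real classes); the pattern is what matters: move the `unique_ptr` exactly once and check the message substring directly.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins: a container whose inputs are aliased, and an
// evaluator that rejects it in its constructor, mirroring the test above.
struct HirContainer {
  bool inputs_aliased = true;
};
struct HirEvaluator {
  explicit HirEvaluator(std::unique_ptr<HirContainer> c) {
    if (c->inputs_aliased) {
      throw std::runtime_error("Inputs cannot be aliased");
    }
  }
};

// Plain try/catch alternative to gmock's ThrowsMessage: returns true iff
// construction throws a std::runtime_error containing the substring.
bool throwsWithSubstr(std::unique_ptr<HirContainer> hic,
                      const std::string& substr) {
  try {
    HirEvaluator hie(std::move(hic));
  } catch (const std::runtime_error& e) {
    return std::string(e.what()).find(substr) != std::string::npos;
  }
  return false;
}
```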
!test
This PR belongs to a series of stacked PRs:
1. #4144
2. **=> You are here:** #4145
3. #4146
4. #4147

# What
1. We replace the bool option `SegmentCandidateFinderOptions::only_segment_resharding_exprs` with a pointer to a predicate function. This allows the user of the segmenter to (optionally) provide a custom function used to decide whether two given groups should be merged. This achieves better separation of responsibilities: with this option, the segmenter is only responsible for applying the segmentation algorithm, and does not embed the application-specific rule for merging groups. In our context, the specific rule is decided by the HIR lowering. Imo this refactoring should ideally go further and make the segmenter a more abstract class that could be used in both Host IR and FusionExecutorCache lowering, changing only the newly introduced function pointer.
2. In HIR lowering, we clearly separate (in a distinct for-loop, but this will later become a preseg pass) the pass that transforms resharding exprs into a Communication.

# Why
That's a preliminary refactoring useful for more advanced Host IR lowering, notably ParallelType::Stream lowering.
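The refactor in point 1 (a bool option replaced by a caller-supplied predicate) might look roughly like the minimal sketch below. All names here (`Group`, `SegmenterOptions`, `can_merge`, `shouldMerge`) are hypothetical, not the actual nvFuser API.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch: instead of a hard-coded bool such as
// only_segment_resharding_exprs, the segmenter options carry a predicate
// deciding whether two candidate groups may be merged.
struct Group {
  bool is_resharding = false;
};

struct SegmenterOptions {
  // nullptr means "use the default rule: merge anything".
  std::function<bool(const Group&, const Group&)> can_merge = nullptr;
};

// The segmenter consults the predicate; the application-specific merge
// rule lives with the caller (e.g. HIR lowering), not in the segmenter.
bool shouldMerge(const SegmenterOptions& opts, const Group& a, const Group& b) {
  if (!opts.can_merge) {
    return true; // default rule
  }
  return opts.can_merge(a, b);
}
```

The benefit is exactly the separation described above: the segmentation algorithm stays generic, while each lowering plugs in its own merge rule.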
This PR belongs to a series of stacked PRs:
1. #4144
2. #4145
3. **=> You are here:** #4146
4. #4147

Add support for `LoadStoreOp`, `BinaryOp`, and `ReductionOp`, including support for pre-allocated outputs, which is not provided by ExprEvaluator.

---------

Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
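As an aside, the difference between the ExprEvaluator-style path (which materializes a fresh output) and the pre-allocated-output path can be illustrated with a plain-C++ analogy; this is not nvFuser code, just a sketch of the two calling conventions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Analogy for an evaluator that allocates its own output buffer
// (the behavior ExprEvaluator provides).
std::vector<float> evalAddAllocating(const std::vector<float>& a,
                                     const std::vector<float>& b) {
  std::vector<float> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = a[i] + b[i];
  }
  return out;
}

// Analogy for the pre-allocated path: the caller (e.g. an earlier
// ALLOCATE instruction) owns the buffer, and evaluation only fills it,
// preserving the buffer's identity for later aliasing.
void evalAddPreallocated(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& out) {
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = a[i] + b[i];
  }
}
```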
This PR belongs to a series of stacked PRs:
1. #4144
2. #4145
3. #4146
4. #4301
5. **=> You are here:** #4147

# What
Implement a proper lowering for handling ParallelType::Stream. This PR has the following restrictions:
- Single-device fusion
- No split/merge of the Stream axis

We add to HIR lowering a new pass that reads the HIR container's top-level expressions, reads the consumer's stream parallelization, and creates for-loops with stream management and synchronization to express the stream parallelization. Basic logic for merging for-loops is implemented.

Let me explain through some examples that can be found in the PR. We suggest running those examples as follows:
```
NVFUSER_DUMP=host_ir test_host_ir --gtest_filter=*
```

## Single expr and for-loop
Look at the simple scenario `MultiDeviceExecutorLowerStreamTest.SingleSetOp`:
```cpp
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
fusion->addInput(tv0);
fusion->addOutput(tv1);
tv1->axis(0)->parallelize(ParallelType::Stream);
```
The dumped generated Host IR program is:
```
%HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T1_g_float[iStreamIdx2{i0}, iS3{i2}]) :
  T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
  FOR StreamIdx in iStreamIdx2{i0}:
    GetCurrentStream into Stream 0
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    Synchronize Stream 0
    T2_l_float[iS4{i2}] = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
    T3_l_float[iS5{i2}] = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
    T3_l_float[iS5{i2}] = Set( T2_l_float[iS4{i2}], cache_op=Streaming )
    SetCurrentStream to Stream 0
    Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```
We can see that the expr, here the "Set", gets embedded into a for-loop. Let us analyze further:
- Outside the for-loop, we allocate the global output buffer.
- The start of the for-loop body does the new stream assignment and syncs that stream with the user stream.
- Then, we "Select" (aka slice) into the input and output through `HirAliasSelect`.
- The "Set" operation is executed on the "selected" I/O. Note that the output is an alias to the output's slice.
- At the end of the for-loop, we reset to the user's stream (i.e., the stream that was current before entering the program) and sync the user's stream with the running stream.

## Merging for-loops
To avoid unnecessary synchronization across streams, it is important to be able to fuse the stream for-loops. This is exercised by the test `MultiDeviceExecutorLowerStreamTest.TwoSetOps`:
```cpp
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
TensorView* tv2 = set(tv1);
fusion->addInput(tv0);
fusion->addOutput(tv2);
tv1->axis(0)->parallelize(ParallelType::Stream);
tv2->axis(0)->parallelize(ParallelType::Stream);
```
Dump:
```
%HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T2_g_float[iStreamIdx4{i0}, iS5{i2}]) :
  T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
  T2_g_float[iStreamIdx4{i0}, iS5{i2}] = ALLOCATE(buffer=T2_g_float[iStreamIdx4{i0}, iS5{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
  FOR StreamIdx in iStreamIdx2{i0}:
    GetCurrentStream into Stream 0
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    Synchronize Stream 0
    T3_l_float[iS6{i2}] = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
    T4_l_float[iS7{i2}] = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
    T4_l_float[iS7{i2}] = Set( T3_l_float[iS6{i2}], cache_op=Streaming )
    T5_l_float[iS8{i2}] = HirAliasSelect( T2_g_float[iStreamIdx4{i0}, iS5{i2}], axis = iStreamIdx4{i0}, index = StreamIdx )
    T5_l_float[iS8{i2}] = Set( T4_l_float[iS7{i2}], cache_op=Streaming )
    SetCurrentStream to Stream 0
    Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```
We observe that the for-loops are indeed merged.

**Possible future optimization:** the allocation of the intermediate buffer could be only of length `numberOfStreams`.

## Separating for-loops
We also need to be able to separate and create new for-loops if necessary, as exercised in `ThreeSetOpsWithDisjointsForLoops`, which considers the Fusion:
```cpp
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
TensorView* tv2 = set(tv1);
TensorView* tv3 = set(tv2);
fusion->addInput(tv0);
fusion->addOutput(tv3);
tv1->axis(0)->parallelize(ParallelType::Stream);
tv3->axis(0)->parallelize(ParallelType::Stream);
```
Here, tv2 is not stream-parallelized, so it should not be produced in a for-loop. Dump:
```
%HostIrContainer { (T0_g_float[iS0{i0}, iS1{i2}]) -> (T3_g_float[iStreamIdx6{i0}, iS7{i2}]) :
  T1_g_float[iStreamIdx2{i0}, iS3{i2}] = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
  FOR StreamIdx in iStreamIdx2{i0}:
    GetCurrentStream into Stream 0
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    Synchronize Stream 0
    T4_l_float[iS8{i2}] = HirAliasSelect( T0_g_float[iS0{i0}, iS1{i2}], axis = iS0{i0}, index = StreamIdx )
    T5_l_float[iS9{i2}] = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}], axis = iStreamIdx2{i0}, index = StreamIdx )
    T5_l_float[iS9{i2}] = Set( T4_l_float[iS8{i2}], cache_op=Streaming )
    SetCurrentStream to Stream 0
    Synchronize Stream ( StreamIdx % numberOfStreams )
  T2_g_float[iS4{i0}, iS5{i2}] = Set( T1_g_float[iStreamIdx2{i0}, iS3{i2}], cache_op=Streaming )
  T3_g_float[iStreamIdx6{i0}, iS7{i2}] = ALLOCATE(buffer=T3_g_float[iStreamIdx6{i0}, iS7{i2}], mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
  FOR StreamIdx in iStreamIdx6{i0}:
    GetCurrentStream into Stream 2
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    Synchronize Stream 2
    T6_l_float[iS10{i2}] = HirAliasSelect( T2_g_float[iS4{i0}, iS5{i2}], axis = iS4{i0}, index = StreamIdx )
    T7_l_float[iS11{i2}] = HirAliasSelect( T3_g_float[iStreamIdx6{i0}, iS7{i2}], axis = iStreamIdx6{i0}, index = StreamIdx )
    T7_l_float[iS11{i2}] = Set( T6_l_float[iS10{i2}], cache_op=Streaming )
    SetCurrentStream to Stream 2
    Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```

---------

Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
Co-authored-by: Ryan Spring <rspring@nvidia.com>
Co-authored-by: Liqiang Lu <116412316+liqiangxl@users.noreply.github.com>
Co-authored-by: jjsjann123 <jiej@nvidia.com>
Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
Co-authored-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
Co-authored-by: Wang, Xiao <24860335+xwang233@users.noreply.github.com>
Co-authored-by: root <26priya11@gmail.com>
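The `Stream ( StreamIdx % numberOfStreams )` pattern visible in the dumps above assigns loop iterations to a fixed pool of streams in round-robin order. A tiny standalone sketch of that mapping (plain C++, no CUDA; the function name is hypothetical):

```cpp
#include <cassert>
#include <vector>

// For each loop iteration StreamIdx in [0, num_iterations), compute the
// stream it runs on: StreamIdx % number_of_streams, i.e. the iterations
// cycle round-robin through a fixed pool of streams.
std::vector<int> assignStreams(int num_iterations, int number_of_streams) {
  std::vector<int> streams;
  streams.reserve(num_iterations);
  for (int stream_idx = 0; stream_idx < num_iterations; ++stream_idx) {
    streams.push_back(stream_idx % number_of_streams);
  }
  return streams;
}
```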
This PR belongs to a series of stacked PRs:
# What

# Why
It is necessary for stream parallelization, where typically we allocate the full output buffer but each stream writes to a slice of this buffer.

# How
The aliasing is stored in the HostIrContainer through a map.
At the HostIrEvaluator level, instead of operating directly on the ExprEvaluator to write/read concrete data, we first apply the alias indirection.
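A minimal sketch of this alias indirection, with simplified, hypothetical stand-in types (a bare `Val` struct and a plain map in place of the ExprEvaluator): resolving the alias before every read and write means an aliased output and its alias target share one concrete value.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical stand-in for a symbolic IR value.
struct Val {};

// Sketch of an evaluator that applies alias indirection before touching
// concrete data, instead of reading/writing the environment directly.
struct Evaluator {
  std::unordered_map<const Val*, const Val*> alias_; // from the container
  std::unordered_map<const Val*, double> known_;     // concrete values

  // One-step alias resolution: map an aliased Val to its target.
  const Val* resolveAlias(const Val* v) const {
    auto it = alias_.find(v);
    return it == alias_.end() ? v : it->second;
  }

  // Writes go through the indirection, so writing an alias fills the
  // target's storage (e.g. a slice of a preallocated buffer).
  void bind(const Val* v, double value) {
    known_[resolveAlias(v)] = value;
  }

  // Reads go through the same indirection.
  double evaluate(const Val* v) const {
    return known_.at(resolveAlias(v));
  }
};
```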