Propagate stream in loop irrespective of device mesh #5363
Review updated until commit f356951
```cpp
extent_val = promoteSize(extent_val, id->extent());
if (iter_type.has_value()) {
  iter_type = promoteIterType(iter_type.value(), id->getIterType());
} else if (id->isGatherScatter()) {
```
Issue #5309

Unlike device parallelization, a stream-parallel TensorView (in its loop domain) may or may not have a stream-parallel allocation domain. We propagate based on the following rules:

1. If the ID has a device parallel type, always propagate.
2. If the TensorView is a fusion input or output, the ID is not stream parallelized.
3. If the stream ID in a TensorView is not mapped to a stream ID in all of its consumers, the ID is not stream parallelized.

For cases like https://github.com/NVIDIA/Fuser/blob/f8e84e52296cdecd318dd2ce904139616d7bd434/tests/cpp/test_overlap.cpp#L155, we want to start by replicating the stream-parallel ID, that is, leaving the allocation unparallelized. However, this ID will appear in the logical domain due to rfactor and, under the current contract, will be allocated fully regardless of parallelization. So I am not making this a condition in the pass yet; it can be changed in the future when needed.

Depends on #5363

---

Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
Issue #5309 Unlike device parallelization, a stream parallel tensorview (in loop) may or may not have a stream-parallel allocation domain. We propagate based on the following: 1. If it is a device parallel type -> always propagate 2. If it is a fusion input or output -> id is not stream parallelized 3. If the stream ID in a tensorview is not mapped to stream ID in all of its consumers -> id is not stream parallelized For cases like: https://github.com/NVIDIA/Fuser/blob/f8e84e52296cdecd318dd2ce904139616d7bd434/tests/cpp/test_overlap.cpp#L155, we want to start with replicating Stream-parallel ID, that is the allocation is not parallelized. However, this ID will appear in the logical domain due to rfactor and with the current contract, be allocated fully regardless of parallelization. So I am not making this a condition in the pass, yet. This can be changed in future when we need. Depends on #5363 --------- Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
When filtering the reference inputs, inputs without a device mesh were removed. This caused fusions containing only stream-parallel TensorViews to skip propagation.