Don't propagate allocation domain to cached inputs in normalization scheduler #4723
Conversation
Review updated until commit 4250b71
…d broadcast domains (force-pushed from ab17e68 to 0a5b731)
```cpp
        tv->getLoopDomain().end(),
        tv->getAllocationDomain().begin(),
        tv->getAllocationDomain().end())) {
  if (!ir_utils::isCpAsyncBulk1D(tv->definition()) &&
```
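Read as a sketch, the guard amounts to something like the following. This is a hypothetical Python rendering of the condition being discussed, not nvFuser's actual lowering code:

```python
def use_preset_allocation_domain(loop_domain, allocation_domain,
                                 loaded_with_cp_async_bulk_1d):
    # Sketch of the guarded condition: honor a pre-set allocation domain
    # only if it is a permutation of the loop domain, and never for a
    # shared memory tensor loaded with 1D TMA (CpAsyncBulk1D), whose
    # layout must instead be derived from the loop domain.
    return (sorted(loop_domain) == sorted(allocation_domain)
            and not loaded_with_cp_async_bulk_1d)
```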
So we are just excluding the use of a pre-set allocation domain on a shared memory TV. Nitpick: add this in a comment.
I'm not quite following the logic flow here. To establish a baseline:
Here we are dealing with the cached_input problem. T7_s_float is the cached input from loading T0_g
I think this PR is saying that the allocation domain set on T7_s isn't relevant for CpAsyncBulk1D. I'm wondering where we insert CpAsyncBulk1D; shouldn't that be the place where we back up the allocation domain on its output?
We are saying: Don't use the pre-set allocation domain if the shared memory tensor is loaded with CpAsyncBulk1D. It is not safe.
For example, in this case, T7_s_float is a shared memory tensor loaded with CpAsyncBulk1D. Its loop domain is (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}) with ca_pos = 2, so its allocation domain in shared memory should be iS120{2}, iS116{2}, bS20{1}, iB22{4096}, which can't be recovered by selecting domains from the pre-set allocation domain because iS120{2} and iS116{2} do not exist in the pre-set allocation domain.
```
T7_s_float[iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096}] ca_pos( 2 )
  logical domain : (bS20{1}, iS21{4096}, iB22{4096})
  allocation domain : (iS21{4096}, bS20{1}, iB22{4096})
  contiguity: t n t
  Split: iS21{4096} by factor 2 -> iS115{2048}, iS116{2}
  Split: iS115{2048} by factor 132 -> iS117{16}, iblockIdx.y118{132}
  Split: iS117{16} by factor 2 -> iS119{8}, iS120{2}
  loop domain : (iblockIdx.y118{132}, iS119{8}, iS120{2}, iS116{2}, bS20{1}, iB22{4096})
```
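The split arithmetic in that dump can be replayed in a short sketch. The `split` helper below is hypothetical; the extents and ceilDiv behavior follow the dump above:

```python
from math import ceil

def split(extent, factor):
    # nvFuser-style inner split: outer = ceilDiv(extent, factor), inner = factor
    return ceil(extent / factor), factor

# Logical domain of T7_s_float: (bS20{1}, iS21{4096}, iB22{4096})
iS115, iS116 = split(4096, 2)     # iS21{4096}  -> iS115{2048}, iS116{2}
iS117, y118 = split(iS115, 132)   # iS115{2048} -> iS117{16}, iblockIdx.y118{132}
iS119, iS120 = split(iS117, 2)    # iS117{16}   -> iS119{8}, iS120{2}

# The IDs produced by the splits are exactly the ones needed for the
# shared-memory allocation, yet none of them exist in the pre-set
# allocation domain, so that domain cannot describe the layout.
new_ids = {"iS116", "iS119", "iS120", "iblockIdx.y118"}
preset_allocation = {"iS21", "bS20", "iB22"}
assert new_ids.isdisjoint(preset_allocation)
```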
Thanks for that explanation.
IIUC, somewhere in the scheduler the allocation domain is added. So I was wondering if it makes more sense to fix that in the scheduler, rather than trying to patch it afterwards in lowering logic.
Looks like there is a long history and many discussions about allocation domains.
(1) The allocation domain is passed from input to cached input; this feature was added in #2309 for matmul and also applied to all schedulers.
(2) #3621 changed this to propagate the allocation domain only for matmul schedulers, to fix #3479.
(3) #3621 was then reverted due to misaligned memory access from a transpose kernel, #3701.
Looks like the transpose and matmul schedulers need the propagated allocation domain. For reduction & normalization we can skip it for now.
Generally speaking, when a tensor has an allocation domain, that should truly represent the allocation of the tensor. If it doesn't have an allocation domain, we need to make some inference, but if it does, it shouldn't have an invalid domain. There are several exceptions to this general rule due to some historical reasons, but we should not add more exceptions whenever possible.
In the case of input caches, propagated allocation domains usually don't matter because they are usually Local tensors, for which allocation domains are almost always ignored as one of the existing exceptions. In this case, however, it exposed the problem because the cache tensor is Shared.
I think what we should do is to schedule the allocation domain explicitly rather than relying on propagation or inference. We schedule the tensor to be allocated on the shared memory and be loaded with TMA, and in addition to them, we should also schedule its allocation domain so that there should be no more additional inference rule.
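A minimal sketch of that suggestion, assuming the allocation of a shared memory tensor is the loop-domain IDs to the right of its compute-at position (the helper is hypothetical; nvFuser's actual API sets the allocation domain on the TensorView in C++):

```python
def explicit_smem_allocation(loop_domain, ca_pos):
    # Explicitly record the shared-memory allocation at scheduling time:
    # the loop-domain IDs past the compute-at position are the per-CTA
    # allocated extent, so no later inference rule is needed.
    return list(loop_domain[ca_pos:])

# For T7_s_float with ca_pos = 2:
alloc = explicit_smem_allocation(
    ["iblockIdx.y118", "iS119", "iS120", "iS116", "bS20", "iB22"], ca_pos=2)
```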
Thanks for the suggestion; I will take a look at this approach.
@naoyam and @jjsjann123, I drafted a PR to schedule the allocation domain, can you take a quick look? Just want to make sure I am on the right track.
Updating the allocation domain on the shared TV in the reduction scheduler looks right to me.
Wondering why we are doing the refactor, though.
!test

!test
We decided to transform the allocation domain following the loop domain transformations; see #4795.
The fusion IR of the newly added cpp test is a simple fusion that uses the inner-outer persistent scheduler.
Its input has an allocation domain.
This allocation domain is passed to its consumer T7_s_float. The allocation of T7 on shared memory should be derived from its compute-at position and loop domain, since iS21{4096} has been transformed and its compute-at position is between the transformed domains.

One remaining issue: this specific case is not affected, but the reduction and normalization schedulers may fail to derive the correct reduction type when an allocation domain exists, e.g. #2202.