replay loop domain transforms to allocation domain #4795
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Review updated until commit 1579436
jjsjann123
left a comment
I wonder what's behind the refactor on the allocation lowering pass?
csrc/transform_replay.cpp
Outdated
```cpp
IterDomainMap logical_to_alloc_map;
for (auto logical_id : logical) {
  auto it = std::find(alloc.begin(), alloc.end(), logical_id);
  NVF_ERROR(
      it != alloc.end(),
      "Could not find matching allocation ID for logical ID: ",
      logical_id);
  logical_to_alloc_map[logical_id] = *it;
}
```
nitpick, if we can compute the permutation, we can directly use that.
> nitpick, if we can compute the permutation, we can directly use that.
This permutation maps the logical domain to the allocation domain. The logical domain is usually further transformed to get the loop domain, so we may not directly use the allocation domain as-is.
For example, in the test added in #4791, we have
```cpp
// T2_s_float[iS6{2}, iS11{3}, iS12{4}, iB8{16}] ca_pos( 2 )
// logical domain : (iS6{2}, iS7{12}, iB8{16})
// allocation domain : (iS7{12}, iS6{2}, iB8{16})
// contiguity: t t t
// Split: iS7{12} by factor 4 -> iS11{3}, iS12{4}
// loop domain : (iS6{2}, iS11{3}, iS12{4}, iB8{16})
// T2 is computed at pos 2, so we don't need to allocate domains iS6{2} and
// iS11{3}. nvFuser tries to exclude these two domains from the allocation
// domain; however, iS11{3} doesn't exist in the allocation domain, so it's
// not excluded, and this is considered a failed case.
```
sorry, I meant just to refactor how we produce logical_to_alloc_map. i.e. if we compute the permutation order with this
Lines 661 to 679 in 1353ec7
It's a nitpick, since the logic there is identical to what you used here. 😉
Got you! I gave computePermutation a try; it returns the permutation index, which is helpful in some contexts. However, in this case, we need the actual IterDomain mapping. I believe the current implementation is a bit more straightforward, as it lets us directly find the IterDomain without needing to compute the index with std::distance and then retrieve the IterDomain from that. Let me know if I'm missing something!
csrc/transform_replay.cpp
Outdated
```cpp
// Happens within ReplaySelf in the following steps:
// 1. Given a loop domain, find the logical domain that produces it.
// 2. Map the logical domain to the allocation domain.
// 3. Do the same transformation on the allocation domain.
```
do we actually need a replay like this? I'm wondering if there's anything blocking us from just setting current loop domain as allocation domain as-is.
> do we actually need a replay like this? I'm wondering if there's anything blocking us from just setting current loop domain as allocation domain as-is.
Directly setting the loop domain as the allocation domain may change the allocation domain. When it is a smem tensor, this change leads to bank conflicts. See the newly added test SmemAllocationDomainChanged.
In that case, the input tensor has allocation domain (32, 512) and loop domain (512, 32); to achieve a coalesced load from gmem to smem, we want to parallelize 512 with TIDx. So the parallelized loop domain is 512(TIDx), 32(S). Both loop and allocation domains are passed to the shared-memory-cached input. If we set loop as allocation, then the allocation becomes 512(TIDx), 32(S), which has bank conflicts.
// smem tensor has allocation domain (32, 512)
// and loop domain (512(TIDx), 32(S))
// there is no bank conflict since the index goes to allocation
// domain where 512 is the inner-most dim.
ASSERT_TRUE(fusion->bankConflictInfo().empty());
// If we reset its allocation domain to (512, 32) and still keep loop
// domain as (512(TIDx), 32(S)), then there are bank conflicts, e.g.
// all threads in a warp access bank-0, then bank-1, then bank-2, etc.
tv1->setAllocationDomain(tv1->getLoopDomain(), /*new_contiguity=*/true);
ASSERT_FALSE(fusion->bankConflictInfo().empty());
Are you saying it's just ordering? If so, why not just reorder the loop domain?
> Are you saying it's just ordering? If so, why not just reorder the loop domain?
If a tv's loop domain is re-ordered, the inlined position is affected, and there may be other issues.
Without reorder: T2_s_float[ithreadIdx.x4{512}, iS5{32}] ca_pos( 2 )
Inputs:
T0_g_float[ithreadIdx.x0{512}, iS1{32}]
Outputs:
T1_g_float[ithreadIdx.x2{512}, iS3{32}] ca_pos( 2 ) produce_pos( 2 )
%kernel {
T2_s_float[ithreadIdx.x4{512}, iS5{32}] ca_pos( 2 )
= Set( T0_g_float[ithreadIdx.x0{512}, iS1{32}], cache_op=Streaming )
T1_g_float[ithreadIdx.x2{512}, iS3{32}] ca_pos( 2 ) produce_pos( 2 )
= T2_s_float[ithreadIdx.x4{512}, iS5{32}] ca_pos( 2 )
+ T2_s_float[ithreadIdx.x4{512}, iS5{32}] ca_pos( 2 );
With reorder: T2_s_float[iS5{32}, ithreadIdx.x4{512}]
Inputs:
T0_g_float[ithreadIdx.x0{512}, iS1{32}]
Outputs:
T1_g_float[ithreadIdx.x2{512}, iS3{32}] ca_pos( 2 )
%kernel {
T2_s_float[iS5{32}, ithreadIdx.x4{512}]
= Set( T0_g_float[ithreadIdx.x0{512}, iS1{32}], cache_op=Streaming )
T1_g_float[ithreadIdx.x2{512}, iS3{32}] ca_pos( 2 )
= T2_s_float[iS5{32}, ithreadIdx.x4{512}]
+ T2_s_float[iS5{32}, ithreadIdx.x4{512}];
Hmm, I'm not sure what you mean by that.
Here's the example you gave for the IdModel usage:
T2_s_float[iS6{2}, iS11{3}, iS12{4}, iB8{16}] ca_pos( 2 )
logical domain : (iS6{2}, iS7{12}, iB8{16})
allocation domain : (iS15{3}, iS16{4}, iS6{2}, iB8{16}) contiguity: t t t t
Split: iS7{12} by factor 4 -> iS15{3}, iS16{4}
Split: iS7{12} by factor 4 -> iS11{3}, iS12{4}
loop domain : (iS6{2}, iS11{3}, iS12{4}, iB8{16})
What I'm suggesting is that since we want the allocation domain to have the same iter-domain expressions, we could use iS6{2}, iS11{3}, iS12{4}, iB8{16} to create the allocation domain as: iS11{3}, iS12{4}, iS6{2}, iB8{16}. Why do we need to create new iter-domains?
Got you. So what you suggested is: we don't change the loop domain; we create the allocation domain from the loop domain IDs by reordering. Then we won't create new IDs while scheduling the allocation domain. For example:
(1) current approach, direct split allocation domain, iS7{12}, iS6{2}, iB8{16} ---> iS15{3}, iS16{4}, iS6{2}, iB8{16}
(2) new approach, use IDs from loop domain, iS7{12}, iS6{2}, iB8{16} ---> iS11{3}, iS12{4}, iS6{2}, iB8{16}
The difference is we re-use iS11{3}, iS12{4} instead of creating new iS15{3}, iS16{4}. Is this understanding correct?
csrc/ir/iostream.cpp
Outdated
```cpp
os() << " contiguity: " << tv->domain()->getContiguityString() << "\n";

for (const auto exp : tv->domain()->allExprs()) {
  const auto& loop_domain = tv->getLoopDomain();
```
Why this change? logical_to_loop does not capture all exprs returned by allExprs(). In some cases, the loop domain may not be a dependent of the logical domain either.
I made this change to avoid printing transforms made on allocation domains. For example, there are two splits in the following tensor, one for logical domain and the other for allocation domain.
I don't think we want to print out the transforms on allocation domain since the original allocation domain was replaced. Then, this additional split expr looks confusing to me.
T2_s_float[iS6{2}, iS11{3}, iS12{4}, iB8{16}] ca_pos( 2 )
logical domain : (iS6{2}, iS7{12}, iB8{16})
allocation domain : (iS15{3}, iS16{4}, iS6{2}, iB8{16})
contiguity: t t t t
Split: iS7{12} by factor 4 -> iS15{3}, iS16{4}
Split: iS7{12} by factor 4 -> iS11{3}, iS12{4}
loop domain : (iS6{2}, iS11{3}, iS12{4}, iB8{16})
I didn't realize the loop domain may not be a dependent of the logical domain. Then we should revise to still use allExprs() but exclude exprs that generate allocation domains, that is, Split: iS7{12} by factor 4 -> iS15{3}, iS16{4} in this case.
Yes, it is not the best way to show them, since these transformations are no longer just straight-line transformations from root to logical, etc. Maybe we could have multiple sections for exprs, like "root to logical", "logical to loop", and "logical to allocation", as those expr sequences are typically what matter most.
Or maybe we change to
T2_s_float[iS6{2}, iS11{3}, iS12{4}, iB8{16}] ca_pos( 2 )
logical domain : (iS6{2}, iS7{12}, iB8{16})
Split: iS7{12} by factor 4 -> iS15{3}, iS16{4}
allocation domain : (iS15{3}, iS16{4}, iS6{2}, iB8{16})
contiguity: t t t t
Split: iS7{12} by factor 4 -> iS11{3}, iS12{4}
loop domain : (iS6{2}, iS11{3}, iS12{4}, iB8{16})
then we know Split: iS7{12} by factor 4 -> iS15{3}, iS16{4} was used to generate allocation domain : (iS15{3}, iS16{4}, iS6{2}, iB8{16})
Yeah, something like that would be more helpful than just dumping all expressions. See TensorDomain::allExprs() to see how to grab each set of expressions.
I extended allExprs() to allExprsToIds(alloc_domain)
```cpp
if (exclude_it != exclude_ca_ids.end()) {
  return *exclude_it;
}
// Fallback: use IdModel to check if any excluded ID is mapped
```
For example, we have allocation domain (iS15{3}, iS16{4}, iS6{2}, iB8{16}) and loop domain (iS6{2}, iS11{3}, iS12{4}, iB8{16}) in
T2_s_float[iS6{2}, iS11{3}, iS12{4}, iB8{16}] ca_pos( 2 )
logical domain : (iS6{2}, iS7{12}, iB8{16})
allocation domain : (iS15{3}, iS16{4}, iS6{2}, iB8{16})
contiguity: t t t t
Split: iS7{12} by factor 4 -> iS15{3}, iS16{4}
Split: iS7{12} by factor 4 -> iS11{3}, iS12{4}
loop domain : (iS6{2}, iS11{3}, iS12{4}, iB8{16})
Based on loop domain and compute pos, we don't need to allocate iS6{2} and iS11{3}.
Then the corresponding allocation domains iS6{2} and iS15{3} should be excluded.
iS6{2} exists in both allocation and loop domains, it is found directly by pointer comparison.
iS15{3} only exists in allocation domain, but it is mapped with iS11{3} in loop domain. Here we need IdModel to find this pair.
I see. I think that makes sense. Perhaps, we could simplify the code a bit:
```cpp
const auto excluded_ca_groups =
    GpuLower::current()->idModel().idGraph(IdMappingMode::EXACT).toGroups(exclude_ca_ids);
return excluded_ca_groups.has(
    GpuLower::current()->idModel().idGraph(IdMappingMode::EXACT).toGroup(id));
```
Then we can also remove lines 165 to 168.
That's great, thanks for the suggestion!
In addition to checking for existence, we also need to remove the actual ID from exclude_ca_ids to ensure all intended exclusions are correctly applied. I've slightly extended this approach and submitted a refactor PR at #4843
csrc/scheduler/utils.h
Outdated
```cpp
// For shared memory tensor, replay loop domain transformations to allocation
// domain
void replayLoopToAllocation(Fusion* fusion);
```
The name and the comment of the function seem vague. Nothing is mentioned about the fact that it silently ignores any tensor that already has an allocation domain. Ideally, this should be clear from the function name alone.
renamed to replayLoopToAllocationForSharedMemoryTvs
Summary

Add selfReplayLoopToAllocation. Assume the allocation domain is a permutation of the logical domain; then we can use ReplaySelf to replay the loop domain transformations onto the allocation domain. The replay happens within ReplaySelf in the following steps:
1. Given a loop domain, find the logical domain that produces it.
2. Map the logical domain to the allocation domain.
3. Do the same transformation on the allocation domain.
After replay, reset the allocation domain to the transformed version.

Add scheduler_utils::replayLoopToAllocation(fusion).

Two PRs are split from this PR:
(1) refactor getAllocationDomainsAndContiguity #4792
(2) Schedule allocation domain manually and use IdModel to detect mapping between scheduled allocation domain and loop domain. #4791