Review updated until commit 58dcb3a
CC: @jjsjann123

!test --diff

!test --diff

!test --diff

!test --diff

!test --diff

!test --diff

!test --diff

!test

!test
tests/cpp/test_host_irs.cpp

```cpp
TEST_F(LinearHostIrTest, HostIrLinearOut) {
  GTEST_SKIP();
```
Temporarily disabled as it prevents the binary tests from running on H100.
Related: #3996
!test
jjsjann123 left a comment:

Some quick comments. I still need to go over graph_traversal.h and the tests.
```cpp
auto resize_factors_entry =
    HeuristicDataCacheEntry<HeuristicCompileTime::ResizeVectorizationFactors>(
```

```cpp
    std::vector<NodeType> from,
    std::vector<NodeType> to,
    bool require_all_to_visited = true,
    Direction allowed_direction = Direction::Undefined)
```
nitpick: I know this ship has sailed, but `Direction::Undefined` being used to mean `Direction::Forward_and_Backward` is confusing to read.

Yeah, I was also planning to rename it to something like `Direction::Unspecified`.
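For readers following along, the point of confusion is that a default of `Direction::Undefined` effectively means "both directions allowed". A minimal sketch of that pattern (the enum values mirror the discussion, but `isAllowed` is a hypothetical helper, not nvFuser's actual code):

```cpp
// Approximation of the enum under discussion. In the traversal code,
// Direction::Undefined doubles as "forward and backward both allowed",
// which is the naming the reviewer found confusing.
enum class Direction { Forward, Backward, Undefined };

// Hypothetical helper: Undefined behaves as a wildcard that admits
// edges in either direction.
bool isAllowed(Direction edge_dir, Direction allowed_direction) {
  return allowed_direction == Direction::Undefined ||
      edge_dir == allowed_direction;
}
```

A name like `Unspecified` (or an explicit `ForwardAndBackward`) would make the wildcard behavior self-documenting.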
```cpp
Direction edge_dir = getDirection(edge);
NVF_ERROR(
    edge_dir == Direction::Forward || edge_dir == Direction::Backward);
```
nitpick: we already have NVF_THROW() in getDirection, so the check here seems redundant.
```cpp
virtual std::vector<Edge> getConsumerOrProducerEdges(
    const Edge& edge,
    bool is_consumer,
    Direction allowed_direction = Direction::Undefined) const {
```
naive question: it felt like is_consumer and allowed_direction need to agree here in order for this to traverse anywhere. Is allowed_direction placed here in anticipation of some future expansion, similar to excludeFromTraversal being a placeholder?

Never mind, I realized we are just trying to have a single function handle both directions of traversal.
```cpp
        outputs.begin(), outputs.end(), [&](const ValT& output) -> bool {
          return isVisited(Edge(output, from_expr));
        })) {
  std::vector<Edge> prev_edges;
```
nitpick: why is it named prev_edges?

Because those are the edges that should have been visited previously.
csrc/graph_traversal.h

```cpp
return std::nullopt;
```
This should be an NVF_THROW() as well.

No, I don't think so, since not all of the dependencies may be satisfied yet.
```cpp
  continue;
}

auto prev_edges = isReady(edge_to_visit);
```
Actually, we don't use prev_edges.value(), so isReady could just return a boolean instead.

I was actually using that information in some of the prior versions. I ended up not using it, but I'm not sure that will always be the case, so I'll keep it as is for now.
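To summarize the exchange, here is a hedged sketch of the shape being discussed (a hypothetical simplification, not the actual nvFuser signature): `std::nullopt` is a normal "dependencies not yet satisfied" result rather than an error, and the previously visited edges ride along in the optional even though callers currently only test for presence:

```cpp
#include <optional>
#include <set>
#include <vector>

using Edge = int;  // stand-in for the real edge type

// Hypothetical sketch: an edge is ready once all of its dependency edges
// have been visited. Returning std::nullopt means "not ready yet", which
// is why it should not be an NVF_THROW.
std::optional<std::vector<Edge>> isReady(
    const std::vector<Edge>& dependencies,
    const std::set<Edge>& visited) {
  std::vector<Edge> prev_edges;
  for (Edge dep : dependencies) {
    if (visited.count(dep) == 0) {
      return std::nullopt;  // a dependency is not satisfied yet
    }
    prev_edges.push_back(dep);
  }
  return prev_edges;
}
```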
jjsjann123 left a comment:

Thanks for addressing my nitpicks and questions. LGTM
```cpp
auto tv1 = reshape(tv0, {10}, {2, 5});
auto tv2 = reshape(tv1, {2, 5}, {10});
// Another cycle from tv0 to tv3 and then tv4
auto tv3 = reshape(tv0, {10}, {5, 2});
```
IIUC, since we do not have tv1 and tv3 merged together, e.g., as an add(tv1, tv3), id_model does not map tv1 and tv3 into the same group, even though the reshape is a static split.

tv1's logical shape is {2, 5}, while tv3's is {5, 2}, so they are not mapped.
```cpp
{graph.toGroup(tv4->getLogicalDomain().at(0)->definition()),
 Direction::Forward},
{graph.toGroup(tv2->getLogicalDomain().at(0)->definition()),
 Direction::Forward}};
```
Just confirming: if we were to map tv1 and tv3 together, we wouldn't have 4 groups here, because in that scenario they would be merged into only 2 distinct groups.

They are not mappable.
tests/cpp/test_bfs.cpp

```cpp
//
// A ----> B ----> D
//         ^       |
//         +-- C <-+
```
nitpick: I struggle with mapping {tvX} onto (A, B, C, D).
```cpp
{{fusion.zeroVal(), tv1->getLogicalDomain().at(0)->extent()},
 {fusion.zeroVal(), IrBuilder::create<Val>(shape[1] / 2)}});

auto tv3 = sin(tv0);
```
nitpick: this isn't really RoPE, but that doesn't matter.

Another nitpick/question: do the two sin calls add anything to the test? I think we could slice tv0 directly, which would test the same pattern.

It is not the entire RoPE but just the rotation part, which is what tends to matter most. sin is just there to prevent the preseg passes from taking over the slice ops.
!build
This is a follow-up to #3906, which added a WAR to #3640. While it's safe, it turned out to be just too conservative. For example, here's a concat pattern appearing in the backward of the LitGPT Llama RoPE:

```
Inputs:
  T0_g___bfloat[bS0{1}, iS1{8}, iS2{4}, iS3{8192}, iS4{128}]
  T1_g___bfloat[bS5{1}, iS6{8}, bS7{1}, iS8{8192}, iS9{128}]
  T2_g___bfloat[bS10{1}, iS11{8}, bS12{1}, iS13{8192}, iS14{128}]
Outputs:
  T8_g___bfloat[bS43{1}, iS44{8192}, iS52{6144}rf]

%kernel_math {
T3_l___bfloat[bS15{1}, iS16{8}, iS18{6}rf, iS19{8192}, iS20{128}]
   = pad( T0_g___bfloat[bS0{1}, iS1{8}, iS2{4}, iS3{8192}, iS4{128}],
          {0, 0, 0, 0, 0, 2, 0, 0, 0, 0} )
i31 = 0 + 4;
T4_l___bfloat[bS21{1}, iS22{8}, iS24{( ( ( 0 + 4 ) + 1 ) + 1 )}rf, iS25{8192}, iS26{128}]
   = pad( T1_g___bfloat[bS5{1}, iS6{8}, bS7{1}, iS8{8192}, iS9{128}],
          {0, 0, 0, 0, i31, 1, 0, 0, 0, 0} )
i47 = i31 + 1;
T5_l___bfloat[bS27{1}, iS28{8}, iS30{( ( ( 0 + 4 ) + 1 ) + 1 )}rf, iS31{8192}, iS32{128}]
   = pad( T2_g___bfloat[bS10{1}, iS11{8}, bS12{1}, iS13{8192}, iS14{128}],
          {0, 0, 0, 0, i47, 0, 0, 0, 0, 0} )
T6_l___bfloat[bS33{1}, iS34{8}, iS35{6}, iS36{8192}, iS37{128}]
   = cat( T3_l___bfloat[bS15{1}, iS16{8}, iS18{6}rf, iS19{8192}, iS20{128}],
          T4_l___bfloat[bS21{1}, iS22{8}, iS24{( ( ( 0 + 4 ) + 1 ) + 1 )}rf, iS25{8192}, iS26{128}],
          T5_l___bfloat[bS27{1}, iS28{8}, iS30{( ( ( 0 + 4 ) + 1 ) + 1 )}rf, iS31{8192}, iS32{128}],
          2 )
T7_l___bfloat[bS38{1}, iS41{8192}, iS39{8}, iS40{6}, iS42{128}]
   = Set.Permute( T6_l___bfloat[bS33{1}, iS34{8}, iS35{6}, iS36{8192}, iS37{128}], cache_op=Streaming )
T8_g___bfloat[bS43{1}, iS44{8192}, iS52{6144}rf]
   = view( T7_l___bfloat[bS38{1}, iS41{8192}, iS39{8}, iS40{6}, iS42{128}] )
} // %kernel_math
```

This is currently taken by the pointwise scheduler, which attempts to vectorize the innermost ID of the output (i.e., `iS52{6144}`).
Since the resize ops of the three pad ops are reachable from `iS52`, the WAR of #3640 simply takes them into consideration by computing the gcd with the left and right expand factors. In this case, since there's an expand factor of 1, the resulting vectorization factor is also just 1, which is clearly not what we want. Here, while the resized ID itself is not vectorizable due to the expand factor of 1, all of the resized tensors have large enough inner IDs that should allow the maximum vectorization.

To make the WAR a little less conservative, this PR also checks whether the constraint imposed by a Resize expr may be missed by the vectorization analysis. In the above case, that should not happen, as there's only one path through each of the resize-based tensor ops.

This change is still not able to eliminate false positives completely. See one of the new tests, which is currently disabled.

The codediff results all seem to make sense: http://nv/eFb. Previously some of the tests did not have vectorization due to the WAR, which is relaxed in this PR and allows some vectorization.
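The gcd computation described above can be illustrated with a small self-contained sketch (`resizeLimitedFactor` is a hypothetical name; the actual WAR lives in nvFuser's vectorization analysis). The three pads above contribute left/right expand factors of {0, 2}, {4, 1}, and {5, 0}, and the factor of 1 collapses any candidate:

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical sketch of the WAR from #3640: fold the left and right
// expand factors of every reachable Resize into the candidate
// vectorization factor via gcd. A factor of 0 means that side is not
// resized, so it imposes no constraint.
int resizeLimitedFactor(
    int candidate,
    const std::vector<std::pair<int, int>>& resize_factors) {
  int factor = candidate;
  for (const auto& [left, right] : resize_factors) {
    if (left != 0) {
      factor = std::gcd(factor, left);
    }
    if (right != 0) {
      factor = std::gcd(factor, right);
    }
  }
  return factor;
}
```

With a candidate factor of 8 and the pad factors above, the result is gcd over {8, 2, 4, 1, 5}, i.e., 1, matching the over-conservative behavior this PR relaxes.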
Fixes #3640.
The current issue with resize vectorization is that, because of the spanning-tree-based traversal, not all paths are taken into consideration when determining vectorization factors. This PR addresses the issue by finding all `Resize` ops that have dependencies with vectorized IDs. To do so, it adds a new graph traversal class, `FindAllExprs`, based on `BFS`. Note that `BFS` only finds shortest paths, so it is not suitable when, for example, a `Resize` appears in a cycle, which is the case with RoPE.

I was also considering overhauling the vectorization analysis to be more ID-traversal based, but we would still need to look at each tensor's allocation domain, so I don't think we can completely move away from tensor op traversals.
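As a standalone illustration of the distinction (this is not nvFuser's actual `FindAllExprs` implementation): a shortest-path BFS stops expanding a node once it has been reached, so edges that lie only on longer or cyclic paths are never enumerated. Collecting every reachable edge instead requires marking visited *edges* rather than visited nodes, which also terminates cleanly on cycles:

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;
using Edge = std::pair<std::string, std::string>;

// Collect every edge reachable from `start`. Visited edges (not nodes)
// are recorded, so a cycle such as B -> D -> C -> B is fully covered
// without looping forever.
std::set<Edge> findAllEdges(const Graph& graph, const std::string& start) {
  std::set<Edge> visited;
  std::queue<std::string> to_visit;
  to_visit.push(start);
  while (!to_visit.empty()) {
    std::string node = to_visit.front();
    to_visit.pop();
    auto it = graph.find(node);
    if (it == graph.end()) {
      continue;
    }
    for (const std::string& next : it->second) {
      // Re-enqueue a node only when reached via a not-yet-visited edge.
      if (visited.insert({node, next}).second) {
        to_visit.push(next);
      }
    }
  }
  return visited;
}
```

On a graph A → B → D → C → B, this yields all four edges, including the back edge C → B that closes the cycle, whereas a node-visited shortest-path search would never walk the cycle back into B.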