Visit extent scalars in SegmentCandidateFinder::resolveScalarsInGroup #840
jacobhinkle merged 27 commits into main

Conversation
!build
Previously, we only called `mutate(TensorView*)`, which itself mutates `IterDomain`s. We need to do this instead of mutating `IterDomain`s in the usual topological order, since we need to do exact-map propagation before root->rfactor propagation. However, we also need to propagate mutations through scalars and scalar expressions, and for that we need to call `mutate` at some point. Previously, scalar expressions were never mutated, so straggler scalars were left in the concretized fusion even though they had been registered for mutation. This change fixes that particular problem, meaning ReshapeToSlice now works properly.
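As an illustrative model of the bug described above (plain Python, not the actual nvfuser mutator API; all names here are hypothetical): if registered replacements are applied only while visiting tensors, a scalar expression that references a replaced scalar is never rewritten and keeps the stale operand.

```python
# Toy model of mutation registration/propagation. "Mutator" and the
# tuple-based expressions are illustrative only; nvfuser's real mutator
# walks Val/Expr IR nodes.

class Mutator:
    def __init__(self):
        self.registered = {}  # old scalar name -> replacement scalar name

    def register(self, old, new):
        self.registered[old] = new

    def mutate_scalar_expr(self, expr):
        # expr is a tuple like ("ceilDiv", "i2", "i5"); rewrite any
        # operands that were registered for mutation.
        op, *args = expr
        return (op, *[self.registered.get(a, a) for a in args])

m = Mutator()
m.register("i5", "c7")  # concretize i5 to a known value c7
# Without an explicit pass over scalar expressions, ("ceilDiv","i2","i5")
# would survive unchanged; mutating it picks up the replacement:
assert m.mutate_scalar_expr(("ceilDiv", "i2", "i5")) == ("ceilDiv", "i2", "c7")
```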
!build
```cpp
for (auto output : expr->outputs()) {
  // We must be able to compute output extents for expression, so here we
  // ensure the scalars involved are all available to this group
  if (auto tv = dynamic_cast<TensorView*>(output)) {
```
Do we have tests to exercise this case?
The ReshapeToPad test hits this since the missing scalar appeared only in the root domain of the pad output in the second segment.
Hmm, then the input tensor of the pad op should have the same extent, so shouldn't it use the same scalar?
In the unsegmented fusion the input has the same extent as you say. But when it is segmented the inputs have their roots set as the rfactor with new input scalars for the extents.
> But when it is segmented the inputs have their roots set as the rfactor with new input scalars for the extents.
Are you referring to the inputs of a fusion segment?
Yes. In the test we have a segment with a single pad expression. In that case the expression input is a segment input so it has had its extents replaced.
OK, so the pad input tensor becomes a segment input. It originally has the root and rfactor domains as it's the output of the reshape op, but we only keep the rfactor domain for segment inputs, so the input only has the root domain that was originally the rfactor domain. Am I right?
Still not sure why the pad output would be a problem...
I think you're right; the pad example does not need output scalars. However, the slice example from #656 (comment) does need this. In that case the slice output contains `i5`, which is not found in the segment and is exact-mapped to the remapped rfactor->root of `T3`. The only way to find it is to traverse root->rfactor in the output of the `SliceOp`. In fact, since we have the segment input scalars, and since every input to a segment expr is either a segment input or the output of another expr in the segment, it should suffice to look at only the attributes and outputs of all expressions in the segment...
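The observation above can be sketched as follows (a hypothetical Python model, not nvfuser code; the dict-based expression and extent structures are illustrative): because every expression input is either a segment input or another expression's output, scanning the attributes and outputs of each expression reaches all needed scalars, including ones like `i5` that appear only in an output's rfactor extents.

```python
# Illustrative sketch: collect scalars reachable from the attributes and
# outputs of a segment's expressions. "extent_of" stands in for walking
# an output TensorView's root->rfactor IterDomain extents.

def collect_needed_scalars(exprs, extent_of):
    """exprs: list of expr dicts; extent_of: maps an output name to the
    scalar names appearing in its extents."""
    needed = set()
    for e in exprs:
        needed.update(e.get("attributes", []))     # e.g. pad widths
        for out in e.get("outputs", []):
            needed.update(extent_of.get(out, []))  # output extents
    return needed

# i5 appears only in T8's rfactor extents, never as an expression input:
exprs = [{"outputs": ["T8"], "attributes": []}]
extent_of = {"T8": ["i0", "i2", "i5"]}
assert "i5" in collect_needed_scalars(exprs, extent_of)
```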
In #912 we see that even when we do not replace sizes with ceilDiv expressions in reshape, we can hit this problem. In that case there is no longer a problem in the second segment, but the first segment is

```
Inputs:
  T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4} ], float
Outputs:
  T8_g[ iS40{i0}, iS47{( i2 / i5 )}rf, iS48{( (nvfuser_index_t)(i5) )}rf, iS42{i3}, iS43{i4} ], float

%kernel_math {
T8_g[ iS40{i0}, iS47{( i2 / i5 )}rf, iS48{( (nvfuser_index_t)(i5) )}rf, iS42{i3}, iS43{i4} ]
   = view( T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4} ] )
}
```

and since `i5` is not an input to the ViewOp, it is not included as a segment input. We could of course have the new shapes included in `ViewOp::inputs()` instead, which would also address this particular case.
This makes sense and seems reasonable. Can you please leave a slightly more detailed comment explaining why this is necessary?
!build
naoyam left a comment
LGTM. Thanks for the fix.
I converted this back to draft since the current method is including more scalars than needed. Any root domain extent does not need to be included if it is exact-mapped to an extent that is already computable. We could temporarily bind inputs to an
I have been chasing down codegen changes in #840 and #947 and have needed to dig through a lot of spurious diffs. I decided to extend the codegen diff tool to output HTML, and to also modify the diffing a bit. This PR:

- Changes `tools/compare_codegen.sh` to output env information as well as add a `ptxas_verbose` dump option.
- Changes the diffs performed by that tool to ignore both the kernel name and the preamble. The preamble is estimated by skipping the typedef of `nvfuser_index_t`. If preambles between two runs differ, we report that with a warning and show the diff in the output.
- Adds an `--html` option to `tools/diff_codegen_nvfuser_tests.py` which will write a self-contained HTML file holding all the differing kernels and diffs. To use this option you must have previously run `pip install jinja2`.
- Adds a `--json` option to `tools/diff_codegen_nvfuser_tests.py` which writes a JSON file containing all the information contained in the HTML file in an easier-to-parse format.
- Changes the default to not print the diffs to STDOUT. This can be re-enabled with the `--show-diffs` argument.

This lets us communicate code differences easily by sharing these files, which could be generated by our CI. An example output is attached. Github doesn't support uploading HTML so I have uploaded a zipped example: [codediff_f7786819_feda1e1e_binary_tests.html.zip](https://github.com/NVIDIA/Fuser/files/12793721/codediff_f7786819_feda1e1e_binary_tests.html.zip). Note that this file is probably typical for a medium-sized change: it results in a zipped file size of 184KB, and unzipped it is 2.1MB.

Some ideas left out of this PR that might be nice in the future:

- Handle not just `nvfuser_tests` output but also `nvfuser_bench` and `pytest` output. We could also fall back to arbitrary command output where we just group everything into one big "test" if we can't associate each kernel with a specific test/benchmark.
- Show multiple commands in one HTML file. Especially if the first bullet is addressed, we could then have a single summary for our whole suite.
- Include benchmark results. This could be done in another hidden div with a "benchmarks" button. It might be tricky, especially if the number of benchmark items associated with each kernel changes between commits, but it might also be handy to refer to benchmark regressions and have the codegen output one click away.

Fixes #1007
Note this is kind of a slow test so we might want to remove it or reduce it
!build --diff

http://nv/e5M/nvfuser_github_ci/codegen_diff_p11175548_j76052171_1701449937649638555_codediff_a693a66_bdd502e_custom_command_20231201_164454.html
!build --diff

!build --diff
This happened in `ResizePadToBroadcastDynamic_CUDA`, where a segmentation edge was placed at the pad output, whose rfactor expressions included the original input sizes. Those sizes were added as inputs to the mul segment but were unneeded. This change fixes that type of situation.
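The pruning criterion mentioned earlier (a root-domain extent need not become a segment input if it is exact-mapped to an extent that is already computable) might be sketched like this. This is a hypothetical Python model, not nvfuser's exact-map machinery; the class-id encoding of the exact map is an assumption for illustration.

```python
# Illustrative filter: drop candidate extents whose exact-map class
# already contains an extent computable within the segment.

def prune_redundant_extents(candidates, computable, exact_map):
    """exact_map: dict mapping each extent name to an exact-map class id."""
    computable_classes = {exact_map[e] for e in computable if e in exact_map}
    return [c for c in candidates
            if exact_map.get(c) not in computable_classes]

exact_map = {"i2": 0, "i9": 0, "i5": 1}  # i9 is exact-mapped to i2
computable = {"i2"}                      # i2 is already in the segment
# i9 is redundant (same class as i2); i5 must still be added:
assert prune_redundant_extents(["i9", "i5"], computable, exact_map) == ["i5"]
```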
!build --diff

!build --diff

!build --diff

!build --diff
No test failures, and I verified that this no longer increases the number of kernel args. Merging.
During segmentation, groups are defined as lists of `Expr`s. When finalizing the segmentation, scalar inputs that might be needed are sprinkled in using `resolveScalarsInGroup`. The method `resolveScalarsInGroup` attempts to find all scalars needed in the group and adds their defining exprs to the group; in this way scalars can be recomputed for multiple groups if needed. When the `SegmentedGroup` is finalized, all expression inputs in all the `Expr`s in the group are added as inputs (deduplicated, removing constants, etc.).

Currently, `resolveScalarsInGroup` only processes the inputs of current `Expr`s to look for input scalars using `input->isScalar()`. As seen in #656, we also need to look at output extents in some cases. This PR modifies that method to also look at intermediate `IterDomain` expressions and their attributes, to more aggressively search for scalars.

Fixes #656.
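The "adds their defining exprs to the group" step described above can be sketched as a recursive walk over scalar definitions (a hypothetical Python model; the real code traverses nvfuser `Val`/`Expr` objects). Once a needed scalar is found, its defining expression and that expression's own scalar inputs are pulled into the group so the value can be recomputed there.

```python
# Illustrative sketch of resolving a scalar's definition chain into a
# group. "def_of" maps a scalar name to its (hypothetical) defining
# expression; scalars with no entry are fusion inputs or constants.

def resolve(scalar, def_of, group_exprs, resolved):
    if scalar in resolved:
        return
    resolved.add(scalar)
    d = def_of.get(scalar)  # None => fusion input or constant: stop here
    if d is not None:
        group_exprs.append(d)          # recompute scalar in this group
        for inp in d["inputs"]:
            resolve(inp, def_of, group_exprs, resolved)

# i7 = ceilDiv(i2, i5); i2 and i5 are fusion inputs:
def_of = {"i7": {"op": "ceilDiv", "inputs": ["i2", "i5"]}}
group, seen = [], set()
resolve("i7", def_of, group, seen)
assert seen == {"i7", "i2", "i5"} and len(group) == 1
```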