
Visit extent scalars in SegmentCandidateFinder::resolveScalarsInGroup #840

Merged
jacobhinkle merged 27 commits into main from scalar_seg_edges on Dec 20, 2023

Conversation

@jacobhinkle
Collaborator

@jacobhinkle jacobhinkle commented Sep 5, 2023

During segmentation, groups are defined as lists of Exprs. When finalizing the segmentation, scalar inputs that might be needed are sprinkled in using resolveScalarsInGroup. The method resolveScalarsInGroup attempts to find all scalars needed in the group and adds their defining exprs to the group; in this way scalars can be recomputed for multiple groups if needed. When the SegmentedGroup is finalized, all expression inputs in all the Exprs in the group are added as inputs (deduplicated, removing constants, etc.).

Currently, resolveScalarsInGroup only processes the inputs of the group's Exprs, looking for scalars via input->isScalar(). As seen in #656, in some cases we also need to look at output extents. This PR modifies the method to also visit intermediate IterDomain expressions and their attributes, searching more aggressively for scalars.
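The change can be sketched with toy types (Val, Expr, and resolveScalarsInGroup here are simplified stand-ins, not nvfuser's real classes):

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

struct Expr;
struct Val {
  bool is_scalar = false;
  Expr* definition = nullptr;
  std::vector<Val*> extents; // non-empty for tensors: one extent scalar per axis
};
struct Expr {
  std::vector<Val*> inputs;
  std::vector<Val*> outputs;
};

// Collect every scalar the group needs, following defining expressions so
// scalars can be recomputed inside the group.
void resolveScalarsInGroup(const std::vector<Expr*>& group,
                           std::unordered_set<Val*>& scalars) {
  std::vector<Val*> stack;
  for (Expr* e : group) {
    // Previous behavior: scan only scalar inputs of each Expr.
    for (Val* in : e->inputs)
      if (in->is_scalar) stack.push_back(in);
    // This PR's addition: also scan output extents, so a scalar that appears
    // only in an output's domain is still found.
    for (Val* out : e->outputs)
      for (Val* ext : out->extents) stack.push_back(ext);
  }
  while (!stack.empty()) {
    Val* v = stack.back();
    stack.pop_back();
    if (!scalars.insert(v).second) continue; // already visited
    if (v->definition)
      for (Val* in : v->definition->inputs)
        if (in->is_scalar) stack.push_back(in);
  }
}
```

In this toy model, an extent scalar that never appears among an Expr's inputs is still discovered through the output scan, and its defining inputs are pulled in transitively.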

Fixes #656.

@jacobhinkle jacobhinkle reopened this Sep 6, 2023
@jacobhinkle
Collaborator Author

!build

Previously, we only called mutate(TensorView*), which itself mutates
IterDomains. We need to do this instead of mutating IterDomains in the
usual topo order since we need to do exact map propagation before
root->rfactor propagation. However, we also need to propagate mutations
through scalars and scalar expressions, and for that we need to call
mutate at some point.

Previously, scalar expressions were never mutated, resulting in
straggler scalars being left in the concretized fusion even though they
had been registered for mutation. This change fixes that particular
problem, meaning ReshapeToSlice now works properly.
@jacobhinkle
Collaborator Author

!build

for (auto output : expr->outputs()) {
// We must be able to compute output extents for expression, so here we
// ensure the scalars involved are all available to this group
if (auto tv = dynamic_cast<TensorView*>(output)) {
Collaborator

Do we have tests to exercise this case?

Collaborator Author

The ReshapeToPad test hits this since the missing scalar appeared only in the root domain of the pad output in the second segment.

Collaborator

Hmm, then the input tensor of the pad op should have the same extent, so shouldn't it use the same scalar?

Collaborator Author

In the unsegmented fusion the input has the same extent as you say. But when it is segmented the inputs have their roots set as the rfactor with new input scalars for the extents.

Collaborator

> But when it is segmented the inputs have their roots set as the rfactor with new input scalars for the extents.

Are you referring to the inputs of a fusion segment?

Collaborator Author

Yes. In the test we have a segment with a single pad expression. In that case the expression input is a segment input so it has had its extents replaced.

Collaborator

OK, so the pad input tensor becomes a segment input. It originally has the root and rfactor domains as it's the output of the reshape op, but we only keep the rfactor domain for segment inputs, so the input only has the root domain that was originally the rfactor domain. Am I right?

Still not sure why the pad output would be a problem...

Collaborator Author

I think you're right; the pad example does not need output scalars. However, the slice example from #656 (comment) does need this. In that case the slice output contains i5 which is not found in the segment and is exact mapped to the remapped rfactor->root of T3. The only way to find it is to traverse root->rfactor in the output of the SliceOp. In fact, I think since we have the segment input scalars and since every input to a segment expr is either a segment input or the output of another expr in the segment, it should suffice to look at only attributes and outputs of all expressions in the segment...
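The observation in this comment can be sketched with toy types (these are illustrative stand-ins, not nvfuser's IR classes): a scalar like i5 may appear only as an attribute of the IterDomain expressions (e.g. a Resize offset or Split factor) between an output's root and rfactor domains, so finding it requires walking those expressions rather than the op's inputs.

```cpp
#include <cassert>
#include <vector>

struct Scalar { const char* name; };

// One IterDomain expression between root and rfactor, carrying scalar
// attributes (e.g. a split factor) that are not op inputs.
struct IterDomainExpr { std::vector<Scalar*> attributes; };

// Gather every scalar attribute along an output's root->rfactor path.
std::vector<Scalar*> scalarsFromDomainExprs(
    const std::vector<IterDomainExpr>& exprs) {
  std::vector<Scalar*> found;
  for (const IterDomainExpr& e : exprs)
    for (Scalar* s : e.attributes) found.push_back(s);
  return found;
}
```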

Collaborator Author

@jacobhinkle jacobhinkle Sep 21, 2023

In #912 we see that even when we do not replace sizes with ceilDiv expressions in reshape, we can hit this problem. In that case there is no longer a problem in the second segment but the first segment is

Inputs:
  T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4} ], float
Outputs:
  T8_g[ iS40{i0}, iS47{( i2 / i5 )}rf, iS48{( (nvfuser_index_t)(i5) )}rf, iS42{i3}, iS43{i4} ], float

%kernel_math {
T8_g[ iS40{i0}, iS47{( i2 / i5 )}rf, iS48{( (nvfuser_index_t)(i5) )}rf, iS42{i3}, iS43{i4} ] = view( T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4} ] )
}

and since i5 is not an input to the ViewOp, it is not included as a segment input. We could of course have the new shapes included in ViewOp::inputs() instead, which would also address this particular case.

Collaborator

This makes sense and seems reasonable. Can you please leave a little more detailed comment why this is necessary?

@jacobhinkle
Collaborator Author

!build

Collaborator

@naoyam naoyam left a comment

LGTM. Thanks for the fix.

@jacobhinkle jacobhinkle marked this pull request as draft October 2, 2023 13:42
@jacobhinkle
Collaborator Author

I converted this back to draft since the current method is including more scalars than needed. Any root domain extent does not need to be included if it is exact mapped to an extent that is already computable. We could temporarily bind inputs to an ExpressionEvaluator and find InputsOf uncomputable scalars but that is a bit heavy handed. I will work on this then mark ready when this doesn't impact existing tests.
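The "bind inputs, then find uncomputable scalars" idea mentioned above can be sketched with toy types (not nvfuser's actual ExpressionEvaluator or InputsOf): a scalar is computable if it is bound directly or if every input of its defining expression is computable; only the uncomputable extents would need to become segment inputs.

```cpp
#include <unordered_set>
#include <vector>

struct Expr;
struct Val { Expr* definition = nullptr; };
struct Expr { std::vector<Val*> inputs; };

// Returns true if v is bound or derivable from bound values.
// Assumes definitions form an acyclic graph.
bool isComputable(Val* v, const std::unordered_set<Val*>& bound) {
  if (bound.count(v)) return true;
  if (!v->definition) return false; // free scalar with no definition
  for (Val* in : v->definition->inputs)
    if (!isComputable(in, bound)) return false;
  return true;
}
```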

jacobhinkle added a commit that referenced this pull request Oct 4, 2023
I have been chasing down codegen changes in #840 and #947 and have
needed to dig through a lot of spurious diffs. I decided to extend the
codegen diff tool to output HTML, and to also modify the diffing a bit.
This PR:

- Changes `tools/compare_codegen.sh` to output env information as well
as add `ptxas_verbose` dump option.
- Changes the diffs performed by that tool to ignore both the kernel
name and the preamble. The preamble is estimated by skipping the typedef
of `nvfuser_index_t`. If preambles between two runs differ, we report
that with a warning and show the diff in the output.
- Adds an `--html` option to `tools/diff_codegen_nvfuser_tests.py` which
will write a self-contained HTML file holding all the differing kernels
and diffs. To use this option you must have previously run `pip install
jinja2`.
- Adds a `--json` option to `tools/diff_codegen_nvfuser_tests.py` which
writes a JSON file containing all the information contained in the HTML
file in an easier-to-parse format.
- Changes the default to not printing the diffs to STDOUT. This can be
re-enabled with the `--show-diffs` argument.

This lets us communicate code differences easily by sharing these files,
which could be generated by our CI. An example output is attached.

Github doesn't support uploading html so I have uploaded a zipped
example:

[codediff_f7786819_feda1e1e_binary_tests.html.zip](https://github.com/NVIDIA/Fuser/files/12793721/codediff_f7786819_feda1e1e_binary_tests.html.zip)

Note that this file is probably typical for a medium sized change: it
results in a zipped file size of 184KB and unzipped it is 2.1MB.

Some ideas left out of this PR that might be nice in the future:
- Handle not just `nvfuser_tests` output but also `nvfuser_bench` and
`pytest` output. We could also fall back to arbitrary command output
where we just group everything to one big "test" if we can't associate
each kernel with a specific test/benchmark.
- Show multiple commands in one HTML file. Especially if the first
bullet is addressed, then we could have a single summary for our whole
suite.
- Include benchmark results. This could be done in another hidden div
with a "benchmarks" button. It might be tricky especially if the number
of benchmark items associated to each kernel is changed between commits,
but it might also be handy to refer to benchmark regressions and have
the codegen output one click away.

Fixes #1007
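The preamble-skipping rule described above can be illustrated with a minimal sketch (the real tool is tools/diff_codegen_nvfuser_tests.py; this toy version only shows the idea of treating everything up to and including the nvfuser_index_t typedef as preamble and comparing what follows):

```cpp
#include <sstream>
#include <string>

// Strip the preamble: drop every line up to and including the
// `typedef ... nvfuser_index_t;` line, returning only the kernel body.
std::string stripPreamble(const std::string& kernel) {
  std::istringstream in(kernel);
  std::string line, body;
  bool past_preamble = false;
  while (std::getline(in, line)) {
    if (past_preamble) {
      body += line + "\n";
    } else if (line.find("typedef") != std::string::npos &&
               line.find("nvfuser_index_t") != std::string::npos) {
      past_preamble = true;
    }
  }
  return body;
}
```

With this rule, two kernels whose preambles differ but whose bodies match compare as equal.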
Note this is kind of a slow test so we might want to remove it or reduce it
@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

> I converted this back to draft since the current method is including more scalars than needed. Any root domain extent does not need to be included if it is exact mapped to an extent that is already computable. We could temporarily bind inputs to an ExpressionEvaluator and find InputsOf uncomputable scalars but that is a bit heavy handed. I will work on this then mark ready when this doesn't impact existing tests.

http://nv/e5M/nvfuser_github_ci/codegen_diff_p11175548_j76052171_1701449937649638555_codediff_a693a66_bdd502e_custom_command_20231201_164454.html
The differing test there shows this issue. The first kernel there has inputs T0 and T4, so we should not also need to bind i0, i1, i2, i3, which are the dimensions of T0, but we do after the change in this PR. This is because we are adding purely scalar expressions in resolveScalarsInGroup, which then leads to adding their inputs in resolveInputsInGroup.

@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

!build --diff

This happened in ResizePadToBroadcastDynamic_CUDA, where a seg edge was
placed at the pad output, whose rfactor expressions included the
original input sizes. Those sizes were added as inputs to the mul
segment, but were unneeded. This change fixes that type of situation.
@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle jacobhinkle marked this pull request as ready for review December 19, 2023 14:39
@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

!build --diff

@jacobhinkle
Collaborator Author

jacobhinkle commented Dec 20, 2023

No test failures, and verified that this is finally never increasing the number of kernel args. Merging.

@jacobhinkle jacobhinkle merged commit 42bf555 into main Dec 20, 2023
@jacobhinkle jacobhinkle deleted the scalar_seg_edges branch December 20, 2023 17:56
jacobhinkle added a commit that referenced this pull request Dec 20, 2023
Fixes #1277. The bug was actually fixed in #840, but the comment was
inaccurate.
jjsjann123 pushed a commit that referenced this pull request Dec 26, 2023
Fixes #1277. The bug was actually fixed in #840, but the comment was
inaccurate.
Successfully merging this pull request may close these issues.

More lost scalars during segmentation for dynamic reshape
