Fix non-determinism in `mma_utils::getTensorsRoles` by jacobhinkle · Pull Request #947 · NVIDIA/Fuser

jacobhinkle · 2023-09-26T15:37:21Z

This sorts the output of mma_utils::getTensorsRoles so that the matmul scheduler is repeatable. This should fix the false positives in codegen diff CI jobs. 🤞

Fixes #799.

zasdfgbnm

Thanks for fixing!

jacobhinkle · 2023-09-26T18:46:11Z

There should be only one A, B, or D tensor; see https://github.com/NVIDIA/Fuser/blob/main/csrc/scheduler/matmul.cpp#L725-L728. However, there can be multiple C tensors (epilogue producers), and the RolesMap determines the order in which we cache them, hence the different numbering for cached tensors in some matmul fusions.
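To illustrate why sorting makes the cache order repeatable, here is a minimal language-neutral sketch in Python (the actual change is in C++). The `Tensor` class, the string role labels, and `get_tensor_roles` are hypothetical stand-ins, and using the tensor's integer name as the sort key is an assumption made for illustration, not a description of the merged code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Tensor:
    """Hypothetical stand-in for a fusion tensor."""
    name: int   # assumed stable integer id, independent of traversal order
    role: str   # one of "A", "B", "C", "D" (illustrative labels)


def get_tensor_roles(tensors: Iterable[Tensor]) -> Dict[str, List[Tensor]]:
    """Group tensors by role, then sort each group by a stable key."""
    roles: Dict[str, List[Tensor]] = {}
    for tv in tensors:
        # Append in discovery order, which may differ from run to run.
        roles.setdefault(tv.role, []).append(tv)
    # Sort after looping so the result no longer depends on discovery order.
    return {
        role: sorted(roles[role], key=lambda tv: tv.name)
        for role in sorted(roles)
    }


tensors = [
    Tensor(0, "A"),
    Tensor(1, "B"),
    Tensor(5, "C"),
    Tensor(3, "C"),
    Tensor(7, "D"),
]

# Two different discovery orders produce the same roles map.
forward = get_tensor_roles(tensors)
backward = get_tensor_roles(reversed(tensors))
```

With only one A, B, or D tensor, the sort matters only for the C entry: both `forward["C"]` and `backward["C"]` list tensor 3 before tensor 5, so anything numbered in that order comes out the same on every run.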

I have been chasing down codegen changes in #840 and #947 and have needed to dig through a lot of spurious diffs. I decided to extend the codegen diff tool to output HTML, and to also modify the diffing a bit. This PR:

- Changes `tools/compare_codegen.sh` to output env information as well as add `ptxas_verbose` dump option.
- Changes the diffs performed by that tool to ignore both the kernel name and the preamble. The preamble is estimated by skipping the typedef of `nvfuser_index_t`. If preambles between two runs differ, we report that with a warning and show the diff in the output.
- Adds an `--html` option to `tools/diff_codegen_nvfuser_tests.py` which will write a self-contained HTML file holding all the differing kernels and diffs. To use this option you must have previously run `pip install jinja2`.
- Adds a `--json` option to `tools/diff_codegen_nvfuser_tests.py` which writes a JSON file containing all the information contained in the HTML file in an easier-to-parse format.
- Changes the default to not printing the diffs to STDOUT. This can be re-enabled with the `--show-diffs` argument.

This lets us communicate code differences easily by sharing these files, which could be generated by our CI. An example output is attached. GitHub doesn't support uploading HTML so I have uploaded a zipped example: [codediff_f7786819_feda1e1e_binary_tests.html.zip](https://github.com/NVIDIA/Fuser/files/12793721/codediff_f7786819_feda1e1e_binary_tests.html.zip)

Note that this file is probably typical for a medium sized change: it results in a zipped file size of 184KB and unzipped it is 2.1MB.

Some ideas left out of this PR that might be nice in the future:

- Handle not just `nvfuser_tests` output but also `nvfuser_bench` and `pytest` output. We could also fall back to arbitrary command output where we just group everything to one big "test" if we can't associate each kernel with a specific test/benchmark.
- Show multiple commands in one HTML file. Especially if the first bullet is addressed, then we could have a single summary for our whole suite.
- Include benchmark results. This could be done in another hidden div with a "benchmarks" button. It might be tricky especially if the number of benchmark items associated to each kernel is changed between commits, but it might also be handy to refer to benchmark regressions and have the codegen output one click away.

Fixes #1007
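The idea of diffing kernels while ignoring the kernel name and the preamble can be sketched roughly as follows. This is not the code in `tools/diff_codegen_nvfuser_tests.py`: the helper names are hypothetical, the `__global__ void <name>(` pattern is an assumed kernel signature, and treating everything up to and including the `nvfuser_index_t` typedef line as the preamble is an assumption made for illustration (the real tool's heuristic may differ).

```python
import difflib
import re
from typing import List, Tuple

# Assumption: kernel entry points look like "__global__ void <name>(".
KERNEL_NAME_RE = re.compile(r"(__global__\s+void\s+)\w+(\s*\()")
# Assumption: the preamble ends at the line that typedefs nvfuser_index_t.
INDEX_TYPEDEF_RE = re.compile(r"^\s*typedef\s+.*\bnvfuser_index_t\s*;")


def split_preamble(source: str) -> Tuple[List[str], List[str]]:
    """Split kernel source into (preamble lines, body lines)."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if INDEX_TYPEDEF_RE.match(line):
            return lines[: i + 1], lines[i + 1:]
    # No typedef found: treat the whole text as the body.
    return [], lines


def normalize_body(body: List[str]) -> List[str]:
    """Replace the kernel name with a fixed placeholder."""
    return [KERNEL_NAME_RE.sub(r"\g<1>KERNEL\g<2>", line) for line in body]


def diff_kernels(old: str, new: str) -> Tuple[bool, List[str]]:
    """Return (preamble_differs, unified diff of the normalized bodies)."""
    old_pre, old_body = split_preamble(old)
    new_pre, new_body = split_preamble(new)
    diff = list(
        difflib.unified_diff(
            normalize_body(old_body),
            normalize_body(new_body),
            "old",
            "new",
            lineterm="",
        )
    )
    return old_pre != new_pre, diff


old_src = (
    "// preamble v1\n"
    "typedef int64_t nvfuser_index_t;\n"
    "__global__ void kernel1(float* T0) {\n"
    "  T0[0] = 1.0f;\n"
    "}\n"
)
# Same kernel, different generated name: this should not count as a diff.
renamed_src = old_src.replace("kernel1", "kernel7")
preamble_differs, diff = diff_kernels(old_src, renamed_src)
```

In this sketch a pure rename yields an empty diff, while a change in the preamble is reported separately through the boolean flag so it can be surfaced as a warning instead of polluting every per-kernel diff.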

jacobhinkle added 5 commits September 26, 2023 11:34

Sort output of getTensorsRoles

954b9c4

Minor

85f9c4e

Sort after looping instead of before

e8ad33e

Shorten diff

f2420e3

Fix constness

0e27b9e

jacobhinkle marked this pull request as ready for review September 26, 2023 16:10

zasdfgbnm approved these changes Sep 26, 2023

jacobhinkle merged commit 079b58d into main Sep 26, 2023

jacobhinkle deleted the matmul_tensorroles_determinism branch September 26, 2023 18:46

liqiangxl mentioned this pull request Sep 27, 2023

clean normalization_inner_outer #928

Merged

jacobhinkle mentioned this pull request Sep 29, 2023

Output HTML and JSON from codegen diff tool #996

Merged
