From 7ef15e0aad14844deb5c3971fd617f3255f83abb Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Mon, 20 Mar 2023 10:59:00 -0700 Subject: [PATCH 01/22] [MatMul] Prolog build out, adding automatic swizzle generator for a few tile sizes (#1900) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * use custom propagator in ampere TN * add tile ordering utilities * initial matmul scheduler implementation * use matmul scheduler prototype on ampere and turing test cases * extend to support Volta * minor cleanup * comment cleanup * minor fix * add fragment iteration and use it in matmul scheduler * use scheduler params for tests * fragment support in double buffer * add register double buffering test cases * clean up custom transform propagator * rebase fix * comment * move bounded selector to common area * Add logic to handle fake boundary tensors in selection. * naming and comment * remove unused parameters from mma node * remove unnecessary parameters from mma ir node * rename scheduling variables * change accumulator tv interface * Update torch/csrc/jit/codegen/cuda/scheduler/utils.h Co-authored-by: Gao, Xiang * PR feedback * pipe through parallel type position * Revert "fragment support in double buffer" This reverts commit d12a90fcce5cd02aca7c98ea5f29ea01bc85df6f.
* use cache op to handle double buffer input * add more comment in matmul scheduler * more comments * comment fix * rebase fix * add inline pred for cpasync * minor cleanup * add inlining test in unit * add option to dump ptx * add ampere xor swizzle gen * minor scheduler fix; add bank conflict helper * minor update and enable single word access checker * minor fixes and symmetric 4 warp recipe tests * rebase fix * fix rebase * add cyclic shift for non-power-of-2 swizzle period * fix swizzle handling in replay * add a few more tile support * minor fix * add 6 warp test cases * add skip swizzle option for replay matching * cleanup * add small repro for the replay fix * Fix missing thread predicates Unlikely to matter, but should be necessary * fix merge * fix merge * format * Rebase #1900 (#2009) * hash update - bug fix for branches (#83865) hash updates for xla were failing because the current pinned hash is a branch, so the git command for getting the date couldn't find the branch due to not having a local version of the branch. Fixed by checking out the branch to make sure it exists locally. example of failure: https://github.com/pytorch/pytorch/runs/7913835742?check_suite_focus=true Test plan: made it pull request trigger and ran, to get this: https://github.com/pytorch/pytorch/runs/7959221184?check_suite_focus=true Pull Request resolved: https://github.com/pytorch/pytorch/pull/83865 Approved by: https://github.com/zengk95 * [FSDP] Remove unneeded checks (#83150) @awgu pointed out these checks aren't really doing anything, as they just make sure we're setting training state in certain ways throughout FSDP and are sort of arbitrary. So, removing them to avoid confusion. We still keep the checking around `_post_backward_called` because this is needed in `finalize_params` for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83150 Approved by: https://github.com/awgu * [BE] Revert distributed change in https://github.com/pytorch/pytorch/pull/68779 (#83181) https://github.com/pytorch/pytorch/issues/82641 points out a regression in how inputs / outputs are processed by DDP, blocking their HF use case. It was narrowed down to https://github.com/pytorch/pytorch/pull/68779 and reverting the distributed change there fixes the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83181 Approved by: https://github.com/kumpera * Transpose scheduler small dim sizes better support (#1910) * Optimize transpose copy on CPU using fbgemm transpose (#83327) Optimize transpose copy on CPU using fbgemm transpose single socket (28 cores): ``` before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 2.439e-05 ms; bf16: 2.152e-05 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms ``` single core: ``` before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms; bf16: 0.00103 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566 ms; bf16: 0.000382 ms torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327 Approved by: https://github.com/frank-wei * Grouped grid welford (#1921) Enables grouping of grid welford ops across iterations. Same functionality as the iteration grouping for GridReduction.
This is intended to improve the outer-norm grid persistence in batchnorm-like fusions. * [ONNX] Use `errors.SymbolicValueError` for more context (#83332) Replace runtime errors in torch.onnx with `errors.SymbolicValueError` for more context around jit values. - Extend `_unimplemented`, `_onnx_unsupported`, `_onnx_opset_unsupported`, `_onnx_opset_unsupported_detailed` errors to include JIT value information - Replace plain RuntimeError with `errors.SymbolicValueError` - Clean up: Use `_is_bool` to replace string comparison on jit types - Clean up: Remove the todo `Remove type ignore after #81112` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83332 Approved by: https://github.com/AllenTiTaiWang, https://github.com/thiagocrepaldi, https://github.com/BowenBao * [quant][fx] Add support for quantized matmul (#83885) Summary: att, probably missed the op during migration to the reference flow Test Plan: python test/test_quantization.py TestQuantizeFxOps.test_qmatmul Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/83885 Approved by: https://github.com/andrewor14 * Misc fixes/tuning for transpose scheduler (#1912) * [nn] split rnn_utils test from test_nn.py (#83675) Ref: https://github.com/pytorch/pytorch/issues/63085 Proposed folder structure ``` -> test -> nn -> test_conv.py -> test_pooling.py -> ..... ``` This PR: Moves test related RNN utilities to a different file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675 Approved by: https://github.com/albanD * [optim] rprop: handle complex params as independent real params (#83858) Ref #65711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858 Approved by: https://github.com/albanD * [xla hash update] update the pinned xla hash (#83899) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83899 Approved by: https://github.com/pytorchbot * [ROCm] More Sparse UTs enablement and more hipification mappings. (#78939) Enables: test_bmm_cuda_float64 test_bmm_deterministic_cuda_float64 test_csr_matvec_cuda_complex128 test_csr_matvec_cuda_complex64 test_csr_matvec_cuda_float32 test_csr_matvec_cuda_float64 To enable the above tests had to add some more hip mappings for the hipification process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939 Approved by: https://github.com/pruthvistony, https://github.com/malfet * Normalize DLPack stride to 1 where shape < 2 (#83158) Fixes #83069. Also move all the dlpack tests to a new file, `test_dlpack.py`. The fix involves always allocating a "strides" int array when converting to dlPack and deleting the strides when the capsule destructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158 Approved by: https://github.com/ezyang * Remove DBR quantization from the codebase (#83642) Summary: DBR quantization is a no-go for now because it does not align well with PyTorch 2.0 plans and we do not want to build yet another tracing system. Deleting it from the codebase for now since there are no plans to develop this in the near future. We can bring it back at a later time if necessary. Test plan: CI Differential Revision: [D38839556](https://our.internmc.facebook.com/intern/diff/D38839556) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83642 Approved by: https://github.com/andrewor14, https://github.com/jerryzh168 * Refactored ops on size to be dispatcher ops (#83719) An example of how the graph looks now.
``` def forward(self, x_1): size = torch.ops.math.size(x_1, 0) size_1 = torch.ops.math.size(x_1, 1); x_1 = None ones = torch.ops.aten.ones.default([1], device = device(type='cpu'), pin_memory = False) expand_sym_int = torch.ops.aten.expand.SymInt(ones, [size, size_1]); ones = size = size_1 = None cos_default = torch.ops.aten.cos.default(expand_sym_int); expand_sym_int = None return (cos_default,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83719 Approved by: https://github.com/ezyang * Fix stride issue with faketensors (#83822) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83822 Approved by: https://github.com/ezyang, https://github.com/ngimel * Nullary RNGOp (#1892) * [ROCm] restore MIOpen benchmark flag default to true (#82656) PR https://github.com/pytorch/pytorch/pull/77438 allowed MIOpen to support the benchmark flag. Previously, the benchmark flag was ignored by MIOpen such that benchmarking was always turned on. This commit restores the behavior that MIOpen benchmarking is by default turned on. CI unit tests cover this capability. Torchvision models demonstrate the performance delta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82656 Approved by: https://github.com/ngimel * Update retry action to latest version (#83911) We're running into EPERM issues when trying to install nvidia tools, see failure example https://github.com/pytorch/pytorch/runs/7975726013?check_suite_focus=true. ``` WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver. 
/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1049 throw err; ^ Error: kill EPERM at process.kill (internal/process/per_thread.js:199:13) at killPid (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1059:17) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1036:21 at Array.forEach () at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1034:23 at Array.forEach () at killAll (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1033:27) at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1024:13 at ChildProcess.onClose (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1080:17) at ChildProcess.emit (events.js:314:20) { errno: 'EPERM', code: 'EPERM', syscall: 'kill' } ``` The root issue probably lies elsewhere but this action is not helping/the errors seem to say it's unable to kill child processes. A more recent commit in that repo uses spawn instead of exec which might make a difference. Regardless, we should keep our actions up to date anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83911 Approved by: https://github.com/malfet * [PyTorch] Remove unused sstream/string includes from c10/macros/Macros.h (#83353) Nothing in the rest of the header seems to use these. 
Differential Revision: [D38672680](https://our.internmc.facebook.com/intern/diff/D38672680/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83353 Approved by: https://github.com/malfet * [functorch] add linalg cross batch rule (#83759) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83759 Approved by: https://github.com/zou3519 * Improve DistanceKernel.cu (#83811) include device_sqrt; replace reduce_agg by BlockReduce; choose implementation by impl_fptr instead of error-prone copy-and-paste Pull Request resolved: https://github.com/pytorch/pytorch/pull/83811 Approved by: https://github.com/ngimel * reinplace pass: bugfix for output node replacement (#83845) Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers. See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845 Approved by: https://github.com/ezyang * reinplace pass: special handling for view_scatter ops (#83846) There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling.
In this code: ``` def f(): a = torch.zeros(4, 4, 4) a[:, 2:] = torch.ones(4, 2, 4) return a ``` Tracing normally with `make_fx()` gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones); slice_tensor_1 = ones = None return zeros ``` Functionalizing it gives you: ``` def forward(self): zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False) ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False) slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807); slice_tensor = None slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807) slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807); slice_tensor_2 = ones = None slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807); zeros = slice_scatter_default = None return slice_scatter_default_1 ``` Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`). 
What we actually want is to replace this line: ``` slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...); ``` with this: ``` new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...); _ = torch.ops.aten.copy_.default(new_slice, ones) ``` In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed. We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing) I also updated the docs for re-inplacing to more closely match the order of the logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846 Approved by: https://github.com/ezyang * Move ATenNVRTC.h include from `jit_utils.h` to `jit_utils.cpp` (#83886) In general, `.h` files should only include headers that are used in the header. Fixes https://github.com/pytorch/pytorch/issues/83856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83886 Approved by: https://github.com/ngimel * Allow None arguments for elementwise type promotion wrapper and fix clamp with None arguments (#83586) Fixes https://github.com/pytorch/torchdynamo/issues/759 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83586 Approved by: https://github.com/ezyang, https://github.com/ngimel * Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881) Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`. Saving users from setting two env variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881 Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang * Strenghten preconditions of linalg.cross (#83798) This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.
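To illustrate the kind of precondition being strengthened here: `linalg.cross` requires both inputs to have extent 3 along the dimension being crossed. A minimal pure-Python sketch (the `cross` helper and list-based representation are illustrative, not PyTorch's implementation):

```python
def cross(a, b):
    # The strengthened precondition: both inputs must have size 3
    # along the dimension being crossed; reject anything else up front.
    if len(a) != 3 or len(b) != 3:
        raise ValueError("cross: inputs must have length 3")
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

print(cross([1, 0, 0], [0, 1, 0]))  # → [0, 0, 1]
```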
Fixes https://github.com/pytorch/pytorch/issues/77629 Fixes https://github.com/pytorch/pytorch/issues/83756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798 Approved by: https://github.com/mruberry * Fix view_func replay in no-grad mode (#83872) Fixes https://github.com/pytorch/pytorch/issues/83828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83872 Approved by: https://github.com/albanD * [vulkan] Add VMA as a third_party subrepo (#83906) The [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend has a dependency on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third-party submodule, since it allows better version tracking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906 Approved by: https://github.com/kimishpatel * [torchgen] Add documentation for `autogen` keyword (#83610) This is a follow-up for #81437. This PR explains which operators can use `autogen` and what will be generated. Also talked about generated kernels and where to find them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83610 Approved by: https://github.com/albanD, https://github.com/bdhirsh * remove assertEqualIgnoreTypes from test/distributions/test_distributions.py (#83709) See https://github.com/pytorch/pytorch/issues/38095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83709 Approved by: https://github.com/kit1980 * [fix] edge case in `MaxPool1d` and add ErrorInputs (#83553) Fixes #83224 cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553 Approved by: https://github.com/albanD * [complex] conv_transpose1d (#79694) Reference: https://github.com/pytorch/pytorch/issues/71108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/79694 Approved by: https://github.com/ngimel * Revert "Strenghten preconditions of linalg.cross (#83798)" This reverts commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e. Reverted https://github.com/pytorch/pytorch/pull/83798 on behalf of https://github.com/janeyx99 due to Sorry, land race caused functorch issues https://hud.pytorch.org/pytorch/pytorch/commit/7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e * Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855) `_load_extra_only_for_mobile` API hasn't handled flatbuffers logic yet. Update the api accordingly. Also found out the mobile build in OSS doesn't build with flatbuffers. Filed task T129996445 to track Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855 Approved by: https://github.com/qihqi * Prefer signal from land checks over PR signals (#83715) When a dev forks their branch from a red master build, their branch can fail CI checks for reasons unrelated to their changes, but the same checks would however pass in the land validation commit (which is rebased off of viable/strict) Today, in the above scenario the `merge -l` command fails because mergebot sees the failing checks in the PR, which is not helpful when that same check passes in land validation. This PR changes the behavior so that: 1.
If only the PR ran a specific workflow (e.g. for CLA Check or a nightly run) then continue to look at the result from the PR (which matches existing behavior) It also includes a few extra BE fixes: - Replaces the tuple we used to pass workflow check results around with a named tuple so that it's easier to tell what data is being used - Reduces the number of API calls to github by ~50% during merges. Before, we were pulling results from github every time and then filtering it down to the relevant category of checks (e.g. failed/pending/startup_failed). Now, our filters share the check results Pull Request resolved: https://github.com/pytorch/pytorch/pull/83715 Approved by: https://github.com/zengk95 * Don't introduce new overload for SymInt (#83628) Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented. This PR takes a simpler but more risky approach: just take the original function and change its ints to SymInts. This is BC-breaking in the following ways: * The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this. This is not BC-breaking in the following ways: * The user facing C++ API remains compatible.
Even if a function changes from int to SymInt, the default C++ binding still takes only ints. (e.g., at::empty(IntArrayRef, ...). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed. * This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type. Structure of the PR: * The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other: * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular: * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences. * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
* In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway. * Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other cases. * The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK. * I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it. * I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload) * I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.) * I deleted the LazyDynamicOpsTest tests.
These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints. * I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading. Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628 Approved by: https://github.com/albanD, https://github.com/bdhirsh * Remove CoreMLMemoryObserver (#83703) Summary: We added this observer to help us diagnose memory issues that have since been resolved. It should be safe to clean this up. Test Plan: Diff just removed logging, so just build IG and confirm no errors. Differential Revision: D38843701 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83703 Approved by: https://github.com/mcr229 * ci: Remove dead code related to android uploads (#83930) These uploads never actually happened in nightlies, so removing it altogether. Someone can re-add in the future if they feel these are important but I can't find an instance of this running since we migrated so I have a hard time believing anyone will miss it.
https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=android Signed-off-by: Eli Uriegas Pull Request resolved: https://github.com/pytorch/pytorch/pull/83930 Approved by: https://github.com/atalman, https://github.com/malfet * [fx][pass infra] Adding error catching (#83933) Example: ``` ====================================================================== ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__ res = fn(module) File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail raise RuntimeError("bad") RuntimeError: bad The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error pm(traced_m) File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__ raise RuntimeError(msg) from e RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass'] ``` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933 Approved by: https://github.com/SherlockNoMad * Back out "Support regex-style matching for Any and Oneof (#82853)" (#83922) Reviewed By: hl475 Differential Revision: D38945806 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83922 Approved by: https://github.com/hl475 * Fix use-dict-literal lint (#83718) Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor. 
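As a hedged aside on why the use-dict-literal lint prefers the literal spelling: both forms build the same mapping, but `{}` compiles to a single map-building opcode while `dict()` needs a name lookup plus a call. This snippet is illustrative, not code from the PR:

```python
import dis

# Both spellings produce the same dict...
assert dict(a=1, b=2) == {"a": 1, "b": 2}

# ...but the literal compiles to a single map-building opcode,
# while dict() requires loading the global name and calling it,
# which is also slower and can be shadowed by a local `dict`.
dis.dis(compile("{'a': 1}", "<s>", "eval"))
dis.dis(compile("dict(a=1)", "<s>", "eval"))
```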
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718 Approved by: https://github.com/albanD * Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)" This reverts commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a. Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/weiwangmeta due to breaking internal builds/causing out-of-bounds errors/training accuracy * Add hypothesis to requirements.txt (#83740) Signed-off-by: Edward Z. Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83740 Approved by: https://github.com/zhxchen17, https://github.com/janeyx99, https://github.com/zou3519 * [fbia] Keep Track of full qualified name before and after remote sharding (#83889) Summary: track qualname changes in embedding sharding & FX split, and compose target qualname in the end of FBIA transform stage, so we can use the qualname mapping in XL materialize stage Test Plan: CI/CD with DISABLE_XLEBB_MATERIALIZATION = True https://fburl.com/fblearner/a8yljbux with DISABLE_XLEBB_MATERIALIZATION = False https://fburl.com/fblearner/2nvi0dam Reviewed By: lliu315gt Differential Revision: D38772525 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83889 Approved by: https://github.com/houseroad * add merge blocking to ci: sev template (#83940) as in title, so that by default, ci: sev will block merges; the line can be removed to not block merges Pull Request resolved: https://github.com/pytorch/pytorch/pull/83940 Approved by: https://github.com/huydhn, https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere * Move nnapi code from ATen common code to specific library (#83748) Summary: Currently we include nnapi code in all targets using ATen even if it's not used (actually there is no usage and it is being deprecated). Move it to `nnapi_backend_lib` for now. Test Plan: Sandcastle.
Differential Revision: D38761095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748 Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA * Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870) See https://github.com/pytorch/pytorch/issues/38095 Replaced assertEqualIgnoreType with assertEqual Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870 Approved by: https://github.com/kit1980 * [Nested Tensor] Make offset copy and move assignment more explicit. (#83488) Currently the nested tensor construction for the offset_ parameter takes in references and in the chain of delegation uses value. This could lead to unnecessary copies. Whenever a nested tensor impl is constructed it should take ownership of all its metadata. The only non-trivially copyable metadata associated with the class is `offsets_`. The goal of this PR is to make sure that consumers of nested_tensor_impl constructors ensure that they are passing offsets as a temporary - either by explicitly copying a reference, or by constructing the offsets vector in the scope of construction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83488 Approved by: https://github.com/albanD, https://github.com/bdhirsh * Remove conj kernels for real dtypes (#80374) `conj_physical_stub` is currently implemented for all dtypes despite it just being a plain copy for real dtypes. So, instead we should defer to the existing copy kernel in these cases. On my build for one CUDA architecture, I see a 2.2 MB decrease in `libtorch_cuda.so` size.
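The reasoning above — conjugation is the identity on real values, so the real-dtype kernels reduce to a plain copy — can be sketched in Python (a toy dispatch over plain lists, not the actual `conj_physical_stub` machinery):

```python
# For real inputs conj(x) == x elementwise, so a dedicated conj kernel
# is just a copy; only complex dtypes need the imaginary-part sign flip.
def conj_physical(values):
    if any(isinstance(v, complex) for v in values):
        return [v.conjugate() for v in values]  # complex: real conj work
    return list(values)                         # real: defer to plain copy

assert conj_physical([1.0, -2.5]) == [1.0, -2.5]
assert conj_physical([1 + 2j, 3 - 4j]) == [1 - 2j, 3 + 4j]
```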
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374 Approved by: https://github.com/ngimel, https://github.com/atalman * [BE][CUDA] Use packed_accessor64 (#83949) Not sure why we are ignoring those, but SoftMax.cu alone generates 100+ lines of warnings: ``` /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In function ‘at::Tensor at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::get_offsets(const at::Tensor&, const IntArrayRef&, int64_t)’: /home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:261:69: warning: ‘at::GenericPackedTensorAccessor at::Tensor::packed_accessor() const & [with T = long int; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations] auto indices_accessor = indices.packed_accessor(); ^ /home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here GenericPackedTensorAccessor packed_accessor() const & { ^ ~~~~~~~~~~~~~ [... the same -Wdeprecated-declarations warning repeats for every packed_accessor() call site and template instantiation in SoftMax.cu ...] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/83949 Approved by: https://github.com/ngimel * Support returning symbolic strides from t.stride() in Python (#83842) Signed-off-by: Edward Z.
Yang Pull Request resolved: https://github.com/pytorch/pytorch/pull/83842 Approved by: https://github.com/albanD, https://github.com/Chillee, https://github.com/bdhirsh * Support the XPU backend untyped storage (#83952) Simply adds the XPU backend to untyped torch storage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83952 Approved by: https://github.com/ezyang * Support NCCL Premul Sum (#81272) This PR adds support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum. The major changes include - convert enum ReduceOp to struct - add premul sum specific paths to init.cpp and Ops.cpp. note: - For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed The commit titled "add nccl premul" whose current hash is https://github.com/pytorch/pytorch/pull/81272/commits/cb99ad67447b5899ecf8c4c3d78deaafa1cc09b8 was authored by @mcarilli and @ptrblck.
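Conceptually, `ncclRedOpCreatePreMulSum` fuses a per-rank scalar multiply into the sum reduction; in plain Python terms (a toy sketch over lists, not the NCCL API):

```python
# Premul-sum allreduce: each rank's contribution is scaled before the
# sum reduction, i.e. result[j] = sum_i(scale_i * x_i[j]) in one op.
def premul_sum(contribs, scales):
    return [sum(s * v for s, v in zip(scales, col))
            for col in zip(*contribs)]

ranks = [[1.0, 2.0], [3.0, 4.0]]  # per-rank buffers
assert premul_sum(ranks, [0.5, 0.5]) == [2.0, 3.0]
```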
cc @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272 Approved by: https://github.com/kwen2501 * Test type promotion assertignoretypes (#83867) See #38095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83867 Approved by: https://github.com/kit1980, https://github.com/mruberry * [Profiler] record nn.Module's parameters (#83209) Summary: Record nn.Module's parameters for detailed memory profiling: - extend 'module_' in value cache & NNModuleInfo to save parameters - python binding and unit test case Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule Differential Revision: D38379717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209 Approved by: https://github.com/robieta * [xla hash update] update the pinned xla hash (#83967) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83967 Approved by: https://github.com/pytorchbot * Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924) * Fix LTC build warnings (#83955) Addresses `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing the extraneous semicolon after autogen LTC native function definitions. ``` /home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi] }; ^ ``` cc: @wconstab @desertfire @ke1337 @antoniojkim Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955 Approved by: https://github.com/wconstab * Strengthen preconditions of linalg.cross (#83798) This makes `linalg.cross` array API compliant (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.
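The strengthened preconditions amount to rejecting inputs whose reduction dimension is not of size 3 before computing anything; a minimal sketch of the check (plain lists instead of tensors, hypothetical error text):

```python
# A cross product is only defined for 3-vectors, so validate the
# operand size up front rather than silently producing garbage.
def cross(a, b):
    if len(a) != 3 or len(b) != 3:
        raise ValueError(
            "linalg.cross: inputs must have size 3 in the given dimension")
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

assert cross([1, 0, 0], [0, 1, 0]) == [0, 0, 1]
try:
    cross([1, 0], [0, 1])       # wrong size: rejected, not computed
except ValueError:
    pass
```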
Fixes https://github.com/pytorch/pytorch/issues/77629 Fixes https://github.com/pytorch/pytorch/issues/83756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798 Approved by: https://github.com/mruberry * Make linalg.inv composite of linalg.solve (#80074) The `getri` kernel calls `getrs` internally, so we can do so explicitly ourselves and save ourselves from having to maintain an extra kernel. This way we just need to optimise `lu_factor` and `lu_solve` and `inv` will be as efficient as it can be, as it'll be choosing the best backend to perform the factorisation and the best backend (not necessarily the same) to perform the solve. Fixes https://github.com/pytorch/pytorch/issues/77498 The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074 Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet * Support a stable double backward on linalg.det for real inputs (#80217) The complex case still fails. I do not know why. Fixes https://github.com/pytorch/pytorch/issues/62327 Fixes https://github.com/pytorch/pytorch/issues/53364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80217 Approved by: https://github.com/nikitaved, https://github.com/albanD, https://github.com/malfet * [LTC] Add custom lazy tensor save function (#83294) We need a custom `save` function for checkpointing a lazy model, similar to what exists in PyTorch/XLA: https://github.com/pytorch/xla/blob/3eb8a9d9eb4ebb0b064461c3704650241625654e/torch_xla/core/xla_model.py#L994 The purpose of this function is to move any lazy tensors to CPU before saving the checkpoint. The way I implemented it was to create a general structure visitor, adapted from a function that we use quite often in Cerebras internal repositories. If there is a better tool already available in PyTorch that does the same things, I'm open to suggestions.
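The "general structure visitor" idea described above — recursively mapping a function over nested containers so every tensor in a checkpoint can be moved to CPU before saving — can be sketched as (a hypothetical helper, not the actual Cerebras/LTC code):

```python
# Recursively applies fn to every leaf of a nested structure of
# dicts / lists / tuples, rebuilding the same shape. A checkpoint
# saver would pass fn = lambda t: t.cpu() over the state dict.
def tree_map(fn, obj):
    if isinstance(obj, dict):
        return {k: tree_map(fn, v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(tree_map(fn, v) for v in obj)
    return fn(obj)  # leaf: apply the visitor

state = {"w": [1, 2], "meta": ("x", 3)}
moved = tree_map(lambda v: v * 2 if isinstance(v, int) else v, state)
assert moved == {"w": [2, 4], "meta": ("x", 6)}
```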
CC: @wconstab @Krovatkin @JackCaoG Pull Request resolved: https://github.com/pytorch/pytorch/pull/83294 Approved by: https://github.com/wconstab * move pooling test from test_nn to test/nn/test_pooling (#83915) Ref #63085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915 Approved by: https://github.com/albanD * [ONNX] Remove static None graph output (#82623) Fixes #82370 * Unify the export behavior regarding static None outputs. These are dropped for both traced graph and TorchScript graph export. * `Optional` outputs are not affected. Fixes #82370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/82623 Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock * [TorchTidy Fix] Don't try to collect strides for non-strided tensors (#83935) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83935 Approved by: https://github.com/robieta, https://github.com/slgong-fb * [WIP] Validating input_col for certain datapipes (#80267) Follow up from #79344. Currently WIP due to multiple test failures. 
Waiting for #80140 to land Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267 Approved by: https://github.com/ejguan * support more symintnode operations (#83877) remove debug code Pull Request resolved: https://github.com/pytorch/pytorch/pull/83877 Approved by: https://github.com/ezyang * add arithmetic ops (#83878) arithmetic ops tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/83878 Approved by: https://github.com/ezyang * logical ops (#83879) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83879 Approved by: https://github.com/ezyang * strip SymIntNodes off in the mobile builds (#83938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83938 Approved by: https://github.com/ezyang * [pthreadpool] Cap max thread count to fix TSAN issues (#83950) Summary: Cap the thread count to 64 unconditionally to solve this tsan issue which leads to harder to debug, flaky test failures. Test Plan: CI Reviewed By: kimishpatel Differential Revision: D38136212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83950 Approved by: https://github.com/kimishpatel * Skip NCCL slimming for cxx11 libtorch builds (#83959) Fixes https://github.com/pytorch/pytorch/issues/83887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83959 Approved by: https://github.com/atalman * add hud link to merge failure message (#83946) as in title, related to https://github.com/pytorch/test-infra/issues/568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/83946 Approved by: https://github.com/huydhn * Check all CUDA API calls for errors in benchmarks/cpp/nvfuser (#74920) (#81817) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74920 Test Plan: Sandcastle Differential Revision: D35194656 Pull Request resolved: https://github.com/pytorch/pytorch/pull/81817 Approved by: https://github.com/malfet * [frontend] Fix tensor list alias annotation (#84005) For issue 
https://github.com/pytorch/pytorch/issues/77920 and a retry of https://github.com/pytorch/pytorch/pull/8… * cleanup * fix merge conflict error * cleanup clone * fix * remove bank conflicting util * revert position * more revert * save * fix * no gcd * comment * cleanup --------- Co-authored-by: Gao, Xiang Co-authored-by: Naoya Maruyama Co-authored-by: Catherine Lee Co-authored-by: Rohan Varma Co-authored-by: CaoE Co-authored-by: Naoya Maruyama Co-authored-by: Justin Chu Co-authored-by: Jerry Zhang Co-authored-by: Kshiteej K Co-authored-by: PyTorch MergeBot Co-authored-by: jpvillam Co-authored-by: mattip Co-authored-by: Vasiliy Kuznetsov Co-authored-by: Horace He Co-authored-by: Jeff Daily Co-authored-by: Jane Xu Co-authored-by: Scott Wolchok Co-authored-by: samdow Co-authored-by: chengscott <60510scott@gmail.com> Co-authored-by: Brian Hirsh Co-authored-by: Nikita Shulga Co-authored-by: Ivan Yashchuk Co-authored-by: Ke Wen Co-authored-by: lezcano Co-authored-by: soulitzer Co-authored-by: Stephen Jia Co-authored-by: Mengwei Liu Co-authored-by: Kaichen Liu Co-authored-by: Khushi Agrawal Co-authored-by: chenlai Co-authored-by: Zain Rizvi Co-authored-by: Edward Z. 
Yang Co-authored-by: John Detloff Co-authored-by: Eli Uriegas Co-authored-by: Angela Yi Co-authored-by: Shirong Wu Co-authored-by: Sergii Dymchenko Co-authored-by: Nan Xiao Co-authored-by: Hansong Zhang Co-authored-by: Ishan-Rajgarhia Co-authored-by: Driss Guessous Co-authored-by: Peter Bell Co-authored-by: Lu, Chengjun Co-authored-by: Masaki Kozuki Co-authored-by: Souranil Sen Co-authored-by: Seonglyong Gong Co-authored-by: Henry Tu Co-authored-by: Antonio Kim Co-authored-by: BowenBao Co-authored-by: John Clow Co-authored-by: Robert Co-authored-by: Nikolay Korovaiko Co-authored-by: Digant Desai Co-authored-by: Richard Barnes Co-authored-by: Larry Liu <8188269+larryliu0820@users.noreply.github.com> Co-authored-by: Sherlock Huang Co-authored-by: thomasw21 <24695242+thomasw21@users.noreply.github.com> Co-authored-by: Huy Do Co-authored-by: Jagadish Krishnamoorthy Co-authored-by: Bin Chen Co-authored-by: Chen, Jian Ping Co-authored-by: ProGamerGov Co-authored-by: Weiwen Xia Co-authored-by: atalman Co-authored-by: jjsjann123 Co-authored-by: XiaobingSuper Co-authored-by: Andrew Gallagher Co-authored-by: Mandar Deshpande Co-authored-by: Alex Beloi Co-authored-by: Richard Zou Co-authored-by: erjia Co-authored-by: Animesh Jain Co-authored-by: Jianyu Huang Co-authored-by: zaf Co-authored-by: Michael Voznesensky Co-authored-by: migeedz Co-authored-by: Christian Jauvin Co-authored-by: Min Si Co-authored-by: Christian Sarofeen Co-authored-by: Adam J. 
Stewart Co-authored-by: Shen Li Co-authored-by: Taylor Robie Co-authored-by: Ian Graves Co-authored-by: Natalia Gimelshein Co-authored-by: Ivan Yashchuk Co-authored-by: kuttire42 <64169153+kuttire42@users.noreply.github.com> Co-authored-by: Ryan Spring --- --- csrc/scheduler/matmul.cpp | 212 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 204 insertions(+), 8 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 7d85ddd5e95..fc1d45e4801 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -12,6 +12,12 @@ namespace nvfuser { namespace { + +// Returns true if the given number is a power of 2 (excluding 1) +bool isPowOf2(int x) { + return x > 1 && (x & (x - 1)) == 0; +} + // Move the broadcast axes to the left on the specified number of inner // dimensions e.g. (when number_of_inner_pos == 3): // [... I0, B, I1] -> [... B, I0, I1] @@ -44,6 +50,200 @@ void moveInnerBroadcastLeft(TensorView* tv, int number_of_inner_pos = 3) { tv->reorder(order_map); } +//! Automatically generates the shared memory swizzled data layout +//! for matmul mainloop. +//! The shared mem datalayout is always 2D currently, and this utility +//! function assumes that the innermost 2 dimensions on shared_mem_tv +//! are the ones being swizzled. +void prologSwizzle(TensorView* shared_mem_tv, const MatmulParam& params) { + // Check that the innermost 2 dimensions are concrete and static + // sized so that the swizzle function can be defined.
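The helper added above relies on the standard single-set-bit check. A standalone sketch of the same semantics (note the deliberate `x > 1`: a swizzle period of 1 never needs the power-of-2 XOR path, so 1 returns false):

```cpp
#include <cassert>

// True only for powers of 2 greater than 1: x & (x - 1) clears the
// lowest set bit, so the result is 0 exactly when x has a single set
// bit. Mirrors the helper added in the patch.
bool isPowOf2(int x) {
  return x > 1 && (x & (x - 1)) == 0;
}
```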
+ + // Utility to check concrete static size: + auto check_concrete_static_dim = [](IterDomain* id) { + TORCH_INTERNAL_ASSERT( + !id->isBroadcast() && !id->isReduction(), + "no support for reduction or broadcast dims, but got ", + id->toString()); + TORCH_INTERNAL_ASSERT( + id->extent()->isConstInt(), + "swizzled dimensions need to be statically sized, but got ", + id->toString()); + }; + + TORCH_INTERNAL_ASSERT( + shared_mem_tv->nDims() >= 2, + "At least 2D input needed for swizzling, but got ", + shared_mem_tv->toString()); + check_concrete_static_dim(shared_mem_tv->axis(-2)); + check_concrete_static_dim(shared_mem_tv->axis(-1)); + + auto mma_config = params.mma_builder.build(); + + // Extract the constant sizes of the swizzled tile + const auto tile_size_x = shared_mem_tv->axis(-2)->extent()->evaluateInt(); + const auto tile_size_y = shared_mem_tv->axis(-1)->extent()->evaluateInt(); + + // TODO: add support for tf32(different macro) and fp32(ffma) + if (isTuring(mma_config.macro) || isAmpere(mma_config.macro)) { + // Dimension of each inner unit of swizzled indices. + // Turing and Ampere case, ldmatrix access assumed (see TODO above) + // Each ldmatrix access is 8x8 + int row_unit = 8; + int col_unit = 8; + + // Column size of the tile needs to be a multiple of 8 for ldmatrix to work. + TORCH_INTERNAL_ASSERT( + tile_size_x >= row_unit && tile_size_x % row_unit == 0 && + tile_size_y >= col_unit && tile_size_y % col_unit == 0, + "Prolog swizzle for ldmatrix, illegal tile size for prolog swizzle", + tile_size_x, + "x", + tile_size_y); + + int units_per_row = tile_size_y / col_unit; + + // Number of column units that can fit in a conflict free shared mem wave + // with memory width = 128 Byte assumed.
+ const int units_per_memory_row = + 128 / dataTypeSize(DataType::Half) / col_unit; + + // Calculate swizzle period: + int residue_unit_count = units_per_row % units_per_memory_row; + + // In the case where tile row is a multiple of memory row, the whole memory + // row + // is the repeated pattern of swizzle. In the case where tile row is not + // divisible, the residue part is the repeated pattern. + int repeated_pattern_size_in_units = + residue_unit_count == 0 ? units_per_memory_row : residue_unit_count; + + // Calculate row multiplier, which is defined as the minimum number of rows + // to look down from an element until the same bank index is observed. + c10::optional<int> maybe_row_multiplier = c10::nullopt; + + if (units_per_memory_row % repeated_pattern_size_in_units == 0) { + maybe_row_multiplier = + units_per_memory_row / repeated_pattern_size_in_units; + } else if ( + units_per_memory_row > repeated_pattern_size_in_units && + units_per_memory_row % + (units_per_memory_row - repeated_pattern_size_in_units) == + 0) { + maybe_row_multiplier = units_per_memory_row / + (units_per_memory_row - repeated_pattern_size_in_units); + } + + // The case where the row multiplier cannot be an integer would be where + // fractional tiling support is needed. Would gradually build out support + // on this one. + if (!maybe_row_multiplier.has_value()) { + // calculate effective row_period = lcm(row_period, repeated_pattern) / + // repeated_pattern_size which is the same as below + int row_period = units_per_memory_row / + std::gcd(units_per_memory_row, repeated_pattern_size_in_units); + + if (row_period < row_unit) { + TORCH_WARN_ONCE( + "Fractional pattern not yet implemented for swizzling memory row of size :", + units_per_memory_row, + " and tile row of size: ", + repeated_pattern_size_in_units); + // This would not lead to functional issue but just perf regression, so + // just do not swizzle anything yet.
+ // TODO: add support for swizzles with different row and col periods to + // enable this case. + return; + } else { + // This case would not need swizzling at all as the period of + // memory bank index over the row is wider than the access window. + return; + } + } else if (maybe_row_multiplier.value() >= row_unit) { + // No need to swizzle in this case. + return; + } + + // Calculate swizzle period, only equal row/col periods at the moment: + // TODO: aperiodic swizzle could also be supported in a follow up: + int max_swizzle_period = repeated_pattern_size_in_units; + + int swizzle_period = max_swizzle_period; + + // Do not have to use the max_swizzle period if we already had + // enough swizzle to permute a row_unit. This would encourage + // usage of power of 2 swizzle periods. + if (row_unit % maybe_row_multiplier.value() == 0) { + swizzle_period = + std::min(swizzle_period, row_unit / maybe_row_multiplier.value()); + } + + int row_multiplier = maybe_row_multiplier.value(); + + TORCH_INTERNAL_ASSERT( + tile_size_x % (swizzle_period * row_multiplier) == 0 && + tile_size_y % (swizzle_period * col_unit) == 0, + "need aperiodic swizzle config for tile size ", + tile_size_x, + "x", + tile_size_y, + "with units ", + row_unit, + "x", + col_unit); + + // add the swizzling op: + shared_mem_tv->split(-2, row_multiplier * swizzle_period); + shared_mem_tv->split(-2, row_multiplier); + + shared_mem_tv->split(-1, col_unit * swizzle_period); + shared_mem_tv->split(-1, col_unit); + + // -6 -5 -4 -3 -2 -1 + // [..., Irow_o, Irow_period, Irow_multiplier, Icol_o, Icol_period, + // Icol_unit] + if (isPowOf2(swizzle_period)) { + shared_mem_tv->swizzle(Swizzle2DType::XOR, -5, -2); + } else { + shared_mem_tv->swizzle(Swizzle2DType::CyclicShift, -5, -2); + } + + // Merge back the tile for subsequent vectorization scheduling + // TODO: could potentially simplify away the merges + shared_mem_tv->merge(-6); + shared_mem_tv->merge(-5); + shared_mem_tv->merge(-3); + 
shared_mem_tv->merge(-2); + } else if (isVolta(mma_config.macro)) { + // TODO: Volta is slightly more complex, and a fixed recipe would + // not scale. In a follow up this would be inferred from the mma + // macro layout themselves as we already have them registered in + // the utils. + return; + } else { + TORCH_INTERNAL_ASSERT(false, "Prolog swizzle: unsupported mma macro"); + } +} + +//! Generates the prolog schedule on the shared memory buffer +//! tensor. The scheduling performs two steps: +//! +//! 1. Swizzle the shared mem data layout. +//! 2. Coalesce and vectorize the read/write schedule. +void scheduleProlog(TensorView* shared_mem_tv, const MatmulParam& params) { + // Swizzle the shared memory data layout + prologSwizzle(shared_mem_tv, params); + + // Assuming we are always vectorizing smem write by 128b at the moment: + // TODO: would need a data-type and alignment dependent interface + // to support non-vectorizable shapes. + // The vectorizable width logic would be in a separate PR as the + // current effort tries to focus on generating swizzles. + shared_mem_tv->merge(-2); + scheduler_utils::matmul_utils::scheduleContiguousVectorLoad( + shared_mem_tv, params.tile_sizes, 8, false); +} + } // namespace void scheduleMatmul( @@ -198,15 +398,11 @@ void scheduleMatmul( // ------------------------------------------------------------------ scheduler_utils::matmul_utils::orderTiledConcreteIdAsRoot(acw_smem); // [... M, K] - acw_smem->merge(-2); - scheduler_utils::matmul_utils::scheduleContiguousVectorLoad( - acw_smem, gemm_tile, 8, false); + scheduleProlog(acw_smem, params); - // [... N, K] scheduler_utils::matmul_utils::orderTiledConcreteIdAsRoot(bcw_smem); - bcw_smem->merge(-2); - scheduler_utils::matmul_utils::scheduleContiguousVectorLoad( - bcw_smem, gemm_tile, 8, false); + // [... N, K] + scheduleProlog(bcw_smem, params); // Propagate prolog tensors // propagate up the DAG, and propagate parallel type.
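The `Swizzle2DType::XOR` op applied above at positions (-5, -2) pairs the row-period and column-period dimensions. As a rough model only (nvfuser's actual swizzle lowering is more involved than this), an XOR swizzle remaps each row's column index by XOR-ing in the row index; this is a bijection within each row, and when the period is a power of 2 it spreads a fixed column across distinct columns (hence distinct banks) on different rows:

```cpp
#include <cassert>

// Toy model of a 2D XOR swizzle over a period x period tile: the
// element at (row, col) is stored at (row, col ^ row). For a fixed
// col, different rows land in different swizzled columns, which is
// what breaks the row-to-row bank conflict.
int xorSwizzledCol(int row, int col, int period) {
  assert(period > 0 && row >= 0 && row < period && col >= 0 && col < period);
  return col ^ row;  // stays in [0, period) when period is a power of 2
}
```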
@@ -230,7 +426,7 @@ void scheduleMatmul( // CTA tile: // Swizzle block tiles: - c->swizzle(Swizzle2DType::ZShape, 0, 1, SwizzleMode::Loop); + // c->swizzle(Swizzle2DType::ZShape, 0, 1, SwizzleMode::Loop); a->computeAt(c, 2); b->computeAt(c, 2); From 7b8e75a328cc694e49eef0dab1449e72d5f8a81b Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Mon, 3 Apr 2023 15:59:07 -0700 Subject: [PATCH 02/22] fix --- csrc/scheduler/matmul.cpp | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 21f701d8bcb..517bb6e456c 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -62,7 +62,7 @@ void moveInnerBroadcastLeft(TensorView* tv, int number_of_inner_pos = 3) { //! The shared mem datalayout is always 2D currently, and this utility //! function assumes that the innermost 2 dimensions on shared_mem_tv //! are the ones begin swizzled. -void prologSwizzle(TensorView* shared_mem_tv, const MatmulParam& params) { +void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { // Check that the innermost 2 dimensions are concrete and static // sized so that the swizzle function can be defined. @@ -85,14 +85,12 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParam& params) { check_concrete_static_dim(shared_mem_tv->axis(-2)); check_concrete_static_dim(shared_mem_tv->axis(-1)); - auto mma_config = params.mma_builder.build(); - // Extract the constant sizes of the swizzled tile const auto tile_size_x = shared_mem_tv->axis(-2)->extent()->evaluateInt(); const auto tile_size_y = shared_mem_tv->axis(-1)->extent()->evaluateInt(); // TODO: add support for tf32(different macro) and fp32(ffma) - if (isTuring(mma_config.macro) || isAmpere(mma_config.macro)) { + if (isTuring(params.mma_op) || isAmpere(params.mma_op)) { // Dimension of each inner unit of swizzled indices. 
// Turing and Ampere case, ldmatrix access assumed (see TODO above) // Each ldmatrix access is 8x8 @@ -221,7 +219,7 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParam& params) { shared_mem_tv->merge(-5); shared_mem_tv->merge(-3); shared_mem_tv->merge(-2); - } else if (isVolta(mma_config.macro)) { + } else if (isVolta(params.mma_op)) { // TODO: Volta is slightly more complex, and a fixed recipe would // not scale. In a follow up this would be inferred from the mma // macro layout themselves as we already have them registered in @@ -237,7 +235,7 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParam& params) { //! //! 1. Swizzled the shared mem data layout. //! 2. Coalesce and vectorize the read write schedule. -void scheduleProlog(TensorView* shared_mem_tv, const MatmulParam& params) { +void scheduleProlog(TensorView* shared_mem_tv, const MatmulParams& params) { // Swizzle the shared memory data layout prologSwizzle(shared_mem_tv, params); From 9ac4f51cc336f57765cc78e2dd857b0fc0829868 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Mon, 10 Apr 2023 08:27:42 -0700 Subject: [PATCH 03/22] test bank conflict --- csrc/scheduler/matmul.cpp | 10 +++++++--- test/test_gpu_tensorcore.cpp | 28 ++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+), 3 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 2d7d5be1899..f5e9ed1c2c4 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -89,10 +89,14 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { const auto tile_size_x = shared_mem_tv->axis(-2)->extent()->evaluateInt(); const auto tile_size_y = shared_mem_tv->axis(-1)->extent()->evaluateInt(); - // TODO: add support for tf32(different macro) and fp32(ffma) if (isTuring(params.mma_op) || isAmpere(params.mma_op)) { - // Dimension of each inner unit of swizzled indices. 
- // Turing and Ampere case, ldmatrix access assumed (see TODO above) + // TODO: right now, we are assuming ldmatrix access, which only supports + // 16bit load according to official doc: + // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-load-instruction-ldmatrix + // In the future, when we start adding support for tf32(different macro), + // fp32(ffma), double, int8, fp8, etc. we need to update this function. + TORCH_INTERNAL_ASSERT(dataTypeSize(*shared_mem_tv->getDataType()) == 2); + // Each ldmatrix access is 8x8 int row_unit = 8; int col_unit = 8; diff --git a/test/test_gpu_tensorcore.cpp b/test/test_gpu_tensorcore.cpp index cd61bce84a1..999decb2888 100644 --- a/test/test_gpu_tensorcore.cpp +++ b/test/test_gpu_tensorcore.cpp @@ -317,6 +317,9 @@ TEST_F(NVFuserTest, FusionVoltaMatmul_CUDA) { params.tile_sizes = gemm_tile; scheduleMatmul(&fusion, params); + // prologSwizzle on Volta is not supported yet + // ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -366,6 +369,9 @@ TEST_F(NVFuserTest, FusionVoltaMatmulRegDoubleBuffer_CUDA) { params.double_buffer_options.double_buffer_smem_read = true; scheduleMatmul(&fusion, params); + // prologSwizzle on Volta is not supported yet + // ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -650,6 +656,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmul_CUDA) { params.double_buffer_options.smem_double_buffer_stage = 4; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -705,6 +713,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmulPipelineGmem_CUDA) { params.double_buffer_options.smem_double_buffer_stage = stage; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K,
layout); @@ -771,6 +781,8 @@ TEST_F(NVFuserTest, FusionAmpereSwizzle_CUDA) { scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -878,6 +890,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmulRegDoubleBuffer_CUDA) { params.double_buffer_options.double_buffer_smem_read = true; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -1819,6 +1833,8 @@ TEST_F(NVFuserTest, FusionTuringMatmul_CUDA) { params.tile_sizes = gemm_tile; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -2859,6 +2875,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmulLargeLoad_CUDA) { params.double_buffer_options.smem_double_buffer_stage = 3; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -2907,6 +2925,8 @@ TEST_F(NVFuserTest, FusionTuringMatmulLargeLoad_CUDA) { params.tile_sizes = gemm_tile; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -2962,6 +2982,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmulTileCheck4warp_CUDA) { params.double_buffer_options.double_buffer_smem_write = true; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -3027,6 +3049,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmulTileCheck8warp_CUDA) { scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -3086,6 +3110,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmulTileCheck6warp_CUDA) { scheduleMatmul(&fusion, params); + 
ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -3138,6 +3164,8 @@ TEST_F(NVFuserTest, FusionAmpereMatmulLargeLoadLargeK_CUDA) { params.double_buffer_options.smem_double_buffer_stage = 3; scheduleMatmul(&fusion, params); + ASSERT_TRUE(fusion.bankConflictInfo().empty()); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); From d1c15f159b58e9ec49d9eace3ed446df8b8f2690 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Mon, 10 Apr 2023 09:30:11 -0700 Subject: [PATCH 04/22] cleanup --- csrc/scheduler/matmul.cpp | 55 ++++++++++++++------------------------- csrc/scheduler/utils.cpp | 8 +++--- 2 files changed, 23 insertions(+), 40 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index f5e9ed1c2c4..5049e99e8a8 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -121,9 +121,8 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { int residue_unit_count = units_per_row % units_per_memory_row; // In the case where tile row is a multiple of memory - // row - // is the repeated pattern of swizzle. In the case where tile row is not - // divisible, the residue part is the repeated pattern. + // row is the repeated pattern of swizzle. In the case where tile row is + // not divisible, the residue part is the repeated pattern. int repeated_pattern_size_in_units = residue_unit_count == 0 ?
units_per_memory_row : residue_unit_count; @@ -208,9 +207,8 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { shared_mem_tv->split(-1, col_unit * swizzle_period); shared_mem_tv->split(-1, col_unit); - // -6 -5 -4 -3 -2 -1 - // [..., Irow_o, Irow_period, Irow_multiplier, Icol_o, Icol_period, - // Icol_unit] + // -6 -5 -4 -3 -2 -1 + // [..., row_o, row_period, row_multiplier, col_o, col_period, col_unit] if (isPowOf2(swizzle_period)) { shared_mem_tv->swizzle(Swizzle2DType::XOR, -5, -2); } else { @@ -240,6 +238,10 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { //! 1. Swizzled the shared mem data layout. //! 2. Coalesce and vectorize the read write schedule. void scheduleProlog(TensorView* shared_mem_tv, const MatmulParams& params) { + shared_mem_tv->setMemoryType(MemoryType::Shared); + + scheduler_utils::matmul_utils::orderTiledConcreteIdAsRoot(shared_mem_tv); + // Swizzle the shared memory data layout prologSwizzle(shared_mem_tv, params); @@ -250,7 +252,16 @@ void scheduleProlog(TensorView* shared_mem_tv, const MatmulParams& params) { // current effort tries to focus on generating swizzles. shared_mem_tv->merge(-2); scheduler_utils::matmul_utils::scheduleContiguousVectorLoad( - shared_mem_tv, params.tile_sizes, 8, false); + shared_mem_tv, params.tile_sizes, 8, true); + + // Propagate prolog tensors + // propagate up the DAG, and propagate parallel type. + scheduler_utils::BoundedDirectionalTransformPropagator::backward( + shared_mem_tv, + -1, + {}, + scheduler_utils::BoundedDirectionalTransformPropagator::Options() + .propagateParallelType()); } } // namespace @@ -439,32 +450,11 @@ void scheduleMatmul(Fusion* fusion, const MatmulParams& params) { cc, -1, {acw_smem, bcw_smem}, {c}); // Schedule prolog: - // TODO: this section goes to a separate matmul util, - // and needs more configurability. + // TODO: this section needs more configurability. 
// ------------------------------------------------------------------ - scheduler_utils::matmul_utils::orderTiledConcreteIdAsRoot(acw_smem); - // [... M, K] scheduleProlog(acw_smem, params); - - scheduler_utils::matmul_utils::orderTiledConcreteIdAsRoot(bcw_smem); - // [... N, K] scheduleProlog(bcw_smem, params); - // Propagate prolog tensors - // propagate up the DAG, and propagate parallel type. - scheduler_utils::BoundedDirectionalTransformPropagator::backward( - acw_smem, - -1, - {a}, - scheduler_utils::BoundedDirectionalTransformPropagator::Options() - .propagateParallelType()); - scheduler_utils::BoundedDirectionalTransformPropagator::backward( - bcw_smem, - -1, - {b}, - scheduler_utils::BoundedDirectionalTransformPropagator::Options() - .propagateParallelType()); - // Set computeAt, setup the loop nesting structure on the kernel. // TODO: this section goes to a separate matmul util, // and needs more configurability. @@ -512,19 +502,12 @@ void scheduleMatmul(Fusion* fusion, const MatmulParams& params) { cc->applyMmaSwizzle( mma_builder.operand(MmaOptions::Operand::Accumulator).build()); - // Set memory type: - acw_smem->setMemoryType(MemoryType::Shared); - bcw_smem->setMemoryType(MemoryType::Shared); - // Set parallelization: // TODO: this section goes to a separate matmul util, // and needs more configurability. // ------------------------------------------------------------------ // Vectorize smem stores/loads: - acw_smem->axis(-1)->parallelize(ParallelType::Vectorize); - bcw_smem->axis(-1)->parallelize(ParallelType::Vectorize); - acr->axis(-1)->parallelize(ParallelType::Vectorize); bcr->axis(-1)->parallelize(ParallelType::Vectorize); diff --git a/csrc/scheduler/utils.cpp b/csrc/scheduler/utils.cpp index 1697ef988b5..35dd51b96ab 100644 --- a/csrc/scheduler/utils.cpp +++ b/csrc/scheduler/utils.cpp @@ -1949,7 +1949,7 @@ bool isFakeBoundaryTensorview( //! transform to by BoundedDirectionalTransformPropagator. 
std::unordered_set<TensorView*> getDirectionalPropagatePathSet( TensorView* from_tv, - std::vector<TensorView*> boundary_tvs, + const std::vector<TensorView*>& boundary_tvs, BoundedDirectionalTransformPropagator::Options options, PropagateDirection direction) { // Prepare to collect all candidate tensorviews @@ -2061,9 +2061,9 @@ void BoundedDirectionalTransformPropagator::backward( if (!options.has_value()) { options = Options(); } - TORCH_INTERNAL_ASSERT( - !to.empty(), - "Propagation needs to be bounded, so no support for empty boundary."); + if (to.empty()) { + to = ir_utils::inputTvsOf(from); + } // Collect all tvs to be included on the backward path as specified // by boundary and options. From e084d4b8cd337471aff743624ef8e9995fd3400b Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Mon, 10 Apr 2023 15:18:40 -0700 Subject: [PATCH 05/22] Matmul prolog swizzle new algo --- benchmark/matmul.cpp | 7 +++++- csrc/scheduler/matmul.cpp | 50 +++++---------------------------------- 2 files changed, 12 insertions(+), 45 deletions(-) diff --git a/benchmark/matmul.cpp b/benchmark/matmul.cpp index 10c32fecdaf..53b378acda3 100644 --- a/benchmark/matmul.cpp +++ b/benchmark/matmul.cpp @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -231,8 +232,12 @@ static void SingleMatmulBase( cparams.enable_magic_zero = false; // Compile kernel + auto launch_constraints = LaunchParams(); FusionExecutor fe; - fe.compileFusion(fusion, args, LaunchParams(), cparams); + fe.compileFusion(fusion, args, launch_constraints, cparams); + TORCH_CHECK( + getBankConflictInfo(fe.kernel(), launch_constraints).empty(), + "Shared memory bank conflict not removed."); // Warm up run auto outputs = fe.runFusion({inputs.first, inputs.second}); diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 5049e99e8a8..c0db01f5787 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -124,50 +124,13 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { // row is the
repeated pattern of swizzle. In the case where tile row is // not divisible, the residue part is the repeated pattern. int repeated_pattern_size_in_units = - residue_unit_count == 0 ? units_per_memory_row : residue_unit_count; + std::gcd(units_per_memory_row, residue_unit_count); // Calculate row multiplier, which is defined as the minimum number of rows // to look down from an element until the same bank index is observed. - c10::optional<int> maybe_row_multiplier = c10::nullopt; + int multiplier = units_per_memory_row / repeated_pattern_size_in_units; - if (units_per_memory_row % repeated_pattern_size_in_units == 0) { - maybe_row_multiplier = - units_per_memory_row / repeated_pattern_size_in_units; - } else if ( - units_per_memory_row > repeated_pattern_size_in_units && - units_per_memory_row % - (units_per_memory_row - repeated_pattern_size_in_units) == - 0) { - maybe_row_multiplier = units_per_memory_row / - (units_per_memory_row - repeated_pattern_size_in_units); - } - - // The case where the row multiplier cannot be an integer would be where - // fractional tiling support is needed. Would gradually build out support - // on this one. - if (!maybe_row_multiplier.has_value()) { - // calculate effective row_period = lcm(row_period, repeated_pattern) / - // repeated_pattern_size which is the same as below - int row_period = units_per_memory_row / - std::gcd(units_per_memory_row, repeated_pattern_size_in_units); - - if (row_period < row_unit) { - TORCH_WARN_ONCE( - "Fractional pattern not yet implemented for swizzling memory row of size :", - units_per_memory_row, - " and tile row of size: ", - repeated_pattern_size_in_units); - // This would not lead to functional issue but just perf regression, so - // just do not swizzle anything yet. - // TODO: add support for swizzles with different row and col periods to - // enable this case.
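The `std::gcd` form introduced above collapses the earlier case analysis: since `gcd(x, 0) == x`, a tile row that divides the memory row evenly yields `repeated_pattern_size_in_units == units_per_memory_row` and a multiplier of 1, and the division is always exact. A small sketch of the computation with illustrative inputs (`units_per_memory_row = 8`, as derived earlier for fp16 ldmatrix units):

```cpp
#include <cassert>
#include <numeric>  // std::gcd

// Sketch of the simplified computation: the row multiplier is the
// minimum number of rows to look down from an element until the same
// bank index is observed. std::gcd(x, 0) == x, so an evenly divisible
// tile row gives repeated_pattern == units_per_memory_row and a
// multiplier of 1.
int rowMultiplier(int units_per_row, int units_per_memory_row) {
  int residue_unit_count = units_per_row % units_per_memory_row;
  int repeated_pattern_size_in_units =
      std::gcd(units_per_memory_row, residue_unit_count);
  return units_per_memory_row / repeated_pattern_size_in_units;
}
```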
- return; - } else { - // This case would not need swizzling at all as the period of - // memory bank index over the row is wider than the access window. - return; - } - } else if (maybe_row_multiplier.value() >= row_unit) { + if (multiplier >= row_unit) { // No need to swizzle in this case. return; } @@ -181,12 +144,11 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { // Do not have to use the max_swizzle period if we already had // enough swizzle to permute a row_unit. This would encourage // usage of power of 2 swizzle periods. - if (row_unit % maybe_row_multiplier.value() == 0) { - swizzle_period = - std::min(swizzle_period, row_unit / maybe_row_multiplier.value()); + if (row_unit % multiplier == 0) { + swizzle_period = std::min(swizzle_period, row_unit / multiplier); } - int row_multiplier = maybe_row_multiplier.value(); + int row_multiplier = multiplier; TORCH_INTERNAL_ASSERT( tile_size_x % (swizzle_period * row_multiplier) == 0 && From 566d26eb1a8f033002aa67272162fb25bc1735ec Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Tue, 11 Apr 2023 09:49:04 -0700 Subject: [PATCH 06/22] save --- csrc/scheduler/matmul.cpp | 169 ++++++++++++++++++++++++++++++------ 1 file changed, 142 insertions(+), 27 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index c0db01f5787..f35a5bdecbe 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -91,46 +91,160 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { if (isTuring(params.mma_op) || isAmpere(params.mma_op)) { // TODO: right now, we are assuming ldmatrix access, which only supports - // 16bit load according to official doc: + // sizeof(T) == 16bit (i.e.
half/bfloat16) load according to official doc: + // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-load-instruction-ldmatrix + // In the future, when we start adding support for tf32(different macro), + // fp32(ffma), double, int8, fp8, etc. we need to update this function. TORCH_INTERNAL_ASSERT(dataTypeSize(*shared_mem_tv->getDataType()) == 2); - // Each ldmatrix access is 8x8 - int row_unit = 8; - int col_unit = 8; + // ldmatrix loads a ldmatrix_rows x ldmatrix_cols = 8 x 8 matrix each time, + constexpr int ldmatrix_rows = 8; + constexpr int ldmatrix_cols = 8; // Column size of the tile needs to be a multiple of 8 for ldmatrix to work. TORCH_INTERNAL_ASSERT( - tile_size_x >= row_unit && tile_size_x % row_unit == 0 && - tile_size_y >= col_unit && tile_size_y % col_unit == 0, + tile_size_x >= ldmatrix_rows && tile_size_x % ldmatrix_rows == 0 && + tile_size_y >= ldmatrix_cols && tile_size_y % ldmatrix_cols == 0, "Prolog swizzle for ldmatrix, illegal tile size for prolog swizzle", tile_size_x, "x", tile_size_y); - int units_per_row = tile_size_y / col_unit; - - // Number of column units that can fit in a conflict free shared mem wave - // with memory width = 128 Byte assumed. - const int units_per_memory_row = - 128 / dataTypeSize(DataType::Half) / col_unit; - - // Calculate swizzle period: - int residue_unit_count = units_per_row % units_per_memory_row; + /* Note [How to remove bank conflict for ldmatrix?] + * + * **This note is interleaved with code, I suggest reading this note like + * reading a jupyter notebook** + * + * Our task is to make sure different rows do not fall into the same + * bank of shared memory.
+ * + * Introduction to bank conflict can be found at page 54-72 of: + * https://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf + * + * When we talk about bank conflict removal, we are talking about the + * following task: + * "there are 32 banks, and each bank contains one 4-byte word, we want to + * make sure different lanes in a warp do not access different words in + * the same bank" + */ + + constexpr int smem_bytes_per_word = 4; + constexpr int smem_banks = 32; + + /* but here, for our convenience, because ldmatrix always uses vectorized + * access of 8 items = 16 bytes = 4 words, we further group words into + * units: we consider each 4 words as a "unit", and each 4 banks as a + * "megabank". So we can rephrase our task as: + * "there are 8 megabanks, and each megabank contains one 4-word unit, we + * want to make sure different lanes in a warp do not access different + * units in the same megabank" + * In this terminology, each matrix has 8 rows, and each row has exactly one + * unit. + */ + + constexpr int items_per_unit = ldmatrix_cols; + const int bytes_per_unit = items_per_unit * dataTypeSize(DataType::Half); + const int words_per_unit = bytes_per_unit / smem_bytes_per_word; + const int num_megabanks = smem_banks / words_per_unit; + + /* In the following example, each CTA tile contains 2 rows and 3 columns of + * matrices, each 8x8 size: + * +----------+----------+----------+ + * | matrix 0 | matrix 1 | matrix 2 | + * +----------+----------+----------+ + * | matrix 3 | matrix 4 | matrix 5 | + * +----------+----------+----------+ + * The addresses of different rows in the same matrix are offset by 3
In this perspective, loading a matrix is a strided memory access
+   * with the following stride (in units):
+   */
+
+  const int row_stride = tile_size_y / items_per_unit;
+
+  /* So the bank conflicting problem is now converted to the following game:
+   * I have a clock that has one pointer and `num_megabanks` ticks. I start
+   * my game by making my pointer point somewhere, and turn forward
+   * the pointer `ldmatrix_rows` times, each time by `row_stride` ticks.
+   * This problem can be well modeled by modular arithmetic in number theory
+   * using the concept "integers modulo n" a.k.a. "Z/nZ"[1].
+   * Take n = 6 as an example, Z/6Z only has 6 elements: 0, 1, 2, 3, 4, 5.
+   * Additions and multiplications are defined in a cyclic manner:
+   *   5 + 1 = 0, 5 + 2 = 1, 5 + 3 = 2, 5 + 4 = 3, ...
+   *   2 * 1 = 2, 2 * 2 = 4, 2 * 3 = 0, 2 * 4 = 2, ...
+   * With this definition, Z is mapped to Z/nZ naturally by i -> i % n [2]
+   *
+   * It is worth mentioning that Z/nZ is a "commutative ring", that is, we
+   * can use addition and multiplication rules just like using normal
+   * integers:
+   *   a + b = b + a, a * (b + c) = a * b + a * c, ...
+   * In short, we can reason about Z/nZ just like we are reasoning about
+   * integers, except that every number is automatically "% n".
+   *
+   * Reference:
+   * [1] https://en.wikipedia.org/wiki/Modular_arithmetic#Integers_modulo_n
+   * [2] The % is under Euclidean definition, that is -1 % 6 is 5 instead of
+   *     -1, see [The Mathematics of Integer Arithmetic] for more detail.
But
+   * we are only interested in non-negative numbers here, so there is no
+   * need to worry about this problem
+   */
+
+  // row_stride in Z/nZ, where n is num_megabanks:
+  // assert(row_stride >= 0);
+  // assert(num_megabanks >= 0);
+  int row_stride_znz = row_stride % num_megabanks;
+
+  /* Consider the following function in Z/nZ:
+   *   f(i) = init + i * stride
+   * where init is the initial position of the pointer in the clock when we
+   * start the game, and stride is the number of ticks we move forward each
+   * time, and i is the number of times we move forward.
+   *
+   * In our problem, f(i) is the megabank of the `i`th row of the matrix, and
+   * `init` is the megabank of the 0th row of the matrix.
+   *
+   * One very important property of f(i) is:
+   * - if f(i1) == f(i2), then for every j, f(i1 + j) = f(i2 + j)
+   * This property is true because:
+   *   f(i1 + j) = f(i1) + j * stride = f(i2) + j * stride = f(i2 + j)
+   *
+   * The above property tells us, as we turn the clock forward:
+   * - initially, we will go to a never-visited tick in each turn, but,
+   * - at some point, we will return to our original position, and,
+   * - after we return, we start repeating the previous pattern again and
+   *   again.
+   *
+   * As an example, consider f(i) where init = 0, stride = 6, under Z/8Z:
+   *    i   0 1 2 3 4 5 6 7
+   *   f(i) 0 6 4 2 0 6 4 2
+   * We can see that f(i) is repeating a pattern of four unique numbers
+   * "0 6 4 2" twice. In our bank conflict problem, this means we are using 4
+   * different megabanks, and we have a 2-way conflict.
+   *
+   * The question of interest is, does the above observation generalize? That
+   * is, does f(i) always repeat a pattern of p unique numbers q times? Note
+   * that p and q must satisfy p * q = n.
+   *
+   * The answer to the above question is: yes! Consider the following
+   * equation:
+   *   f(i1 + j) == f(i1)
+   * We want to know what is the smallest j that makes the above equation
+   * true. Because this tells us in how many steps we will see repeat.
This
+   * equation can be simplified as:
+   *   f(i1 + j) == f(i1) + j * stride == f(i1)
+   *   ==> j * stride == 0
+   * That is, we are interested in finding the minimum j that makes
+   *   j * stride == 0
+   */

   // In the case where tile row is a multiple of memory row, the whole memory
   // row is the repeated pattern of swizzle. In the case where tile row is
   // not divisible, the residual part is the repeated pattern.
   int repeated_pattern_size_in_units =
-      std::gcd(units_per_memory_row, residue_unit_count);
+      std::gcd(num_megabanks, row_stride_znz);

   // Calculate row multiplier, which is defined as minimum number of rows
   // to look down from an element until the same bank index is observed.
-  int multiplier = units_per_memory_row / repeated_pattern_size_in_units;
+  int multiplier = num_megabanks / repeated_pattern_size_in_units;

-  if (multiplier >= row_unit) {
+  if (multiplier >= ldmatrix_rows) {
     // No need to swizzle in this case.
     return;
   }
@@ -142,35 +256,36 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) {
   int swizzle_period = max_swizzle_period;

   // Do not have to use the max_swizzle period if we already had
-  // enough swizzle to permute a row_unit. This would encourage
+  // enough swizzle to permute a ldmatrix_rows. This would encourage
   // usage of power of 2 swizzle periods.
- if (row_unit % multiplier == 0) { - swizzle_period = std::min(swizzle_period, row_unit / multiplier); + if (ldmatrix_rows % multiplier == 0) { + swizzle_period = std::min(swizzle_period, ldmatrix_rows / multiplier); } int row_multiplier = multiplier; TORCH_INTERNAL_ASSERT( tile_size_x % (swizzle_period * row_multiplier) == 0 && - tile_size_y % (swizzle_period * col_unit) == 0, + tile_size_y % (swizzle_period * ldmatrix_cols) == 0, "need aperiodic swizzle config for tile size ", tile_size_x, "x", tile_size_y, "with units ", - row_unit, + ldmatrix_rows, "x", - col_unit); + ldmatrix_cols); // add the swizzling op: shared_mem_tv->split(-2, row_multiplier * swizzle_period); shared_mem_tv->split(-2, row_multiplier); - shared_mem_tv->split(-1, col_unit * swizzle_period); - shared_mem_tv->split(-1, col_unit); + shared_mem_tv->split(-1, ldmatrix_cols * swizzle_period); + shared_mem_tv->split(-1, ldmatrix_cols); // -6 -5 -4 -3 -2 -1 - // [..., row_o, row_period, row_multiplier, col_o, col_period, col_unit] + // [..., row_o, row_period, row_multiplier, col_o, col_period, + // ldmatrix_cols] if (isPowOf2(swizzle_period)) { shared_mem_tv->swizzle(Swizzle2DType::XOR, -5, -2); } else { From 915231921e989b7f38ea0a5d043215561ef4846a Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Tue, 11 Apr 2023 11:25:48 -0700 Subject: [PATCH 07/22] save --- csrc/scheduler/matmul.cpp | 15 +++++++++++---- csrc/type.cpp | 29 +---------------------------- csrc/type.h | 30 ++++++++++++++++++++++++++++++ 3 files changed, 42 insertions(+), 32 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index f35a5bdecbe..db3dbdd1d5c 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -143,9 +143,9 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { */ constexpr int items_per_unit = ldmatrix_cols; - const int bytes_per_unit = items_per_unit * dataTypeSize(DataType::Half); - const int words_per_unit = bytes_per_unit / 
smem_bytes_per_word; - const int num_megabanks = smem_banks / words_per_unit; + constexpr int bytes_per_unit = items_per_unit * primDataTypeSize(DataType::Half); + constexpr int words_per_unit = bytes_per_unit / smem_bytes_per_word; + constexpr int num_megabanks = smem_banks / words_per_unit; /* In the following example, each CTA tile contains 2 rows and 3 colums of * matrices, each 8x8 size: @@ -159,7 +159,7 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * with the following stride (in units): */ - const int row_stride = tile_size_y / items_per_unit; + int row_stride = tile_size_y / items_per_unit; /* So the bank conflicting problem is now converted to the following game: * I have a clock that has one pointer and `num_megabanks` ticks. I start @@ -232,6 +232,13 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * ==> j * stride == 0 * That is, we are interested in finding the minimum j that makes * j * stride == 0 + * + * Further study of equation j * stride == 0 needs knowledge on + * "multiplicative group of integers modulo n", which can be found at: + * https://en.wikipedia.org/wiki/Multiplicative_group_of_integers_modulo_n + * Readers not interested in this detail can skip this paragraph and jump to + * conclusions in the next paragraph. 
+ * */ // In the case where tile row is a multiple of memory row, the whole memory diff --git a/csrc/type.cpp b/csrc/type.cpp index f512a00487f..a3157e2c0a0 100644 --- a/csrc/type.cpp +++ b/csrc/type.cpp @@ -1097,34 +1097,7 @@ size_t dataTypeSize(DataType type) { [](auto&& dtype) -> size_t { using T = std::decay_t; if constexpr (std::is_same_v) { - switch (dtype) { - case DataType::Bool: - return sizeof(bool); - case DataType::ComplexDouble: - return sizeof(std::complex); - case DataType::ComplexFloat: - return sizeof(std::complex); - case DataType::Double: - return sizeof(double); - case DataType::Float: - return sizeof(float); - case DataType::Half: - return sizeof(at::Half); - case DataType::BFloat16: - return sizeof(at::BFloat16); - case DataType::Index: - TORCH_INTERNAL_ASSERT( - false, - "The actual type of Index is only known at compile time."); - case DataType::Int: - return sizeof(uint64_t); - case DataType::Int32: - return sizeof(uint32_t); - case DataType::SMemAddress: - return sizeof(unsigned); - default: - TORCH_INTERNAL_ASSERT(false, "Size undefined for data type."); - } + return primDataTypeSize(dtype); } else if constexpr (std::is_same_v) { return sizeof(void*); } else if constexpr (std::is_same_v) { diff --git a/csrc/type.h b/csrc/type.h index 8d29bf40dc5..c6f3cb44486 100644 --- a/csrc/type.h +++ b/csrc/type.h @@ -638,6 +638,36 @@ TORCH_CUDA_CU_API const char* load_store_type2string(LoadStoreOpType t); TORCH_CUDA_CU_API c10::optional cast_func_str( const std::pair&); +constexpr inline size_t primDataTypeSize(PrimDataType type) { + switch (type) { + case DataType::Bool: + return sizeof(bool); + case DataType::ComplexDouble: + return sizeof(std::complex); + case DataType::ComplexFloat: + return sizeof(std::complex); + case DataType::Double: + return sizeof(double); + case DataType::Float: + return sizeof(float); + case DataType::Half: + return sizeof(at::Half); + case DataType::BFloat16: + return sizeof(at::BFloat16); + case DataType::Index: + 
TORCH_INTERNAL_ASSERT( + false, "The actual type of Index is only known at compile time."); + case DataType::Int: + return sizeof(uint64_t); + case DataType::Int32: + return sizeof(uint32_t); + case DataType::SMemAddress: + return sizeof(unsigned); + default: + TORCH_INTERNAL_ASSERT(false, "Size undefined for data type."); + } +} + TORCH_CUDA_CU_API size_t dataTypeSize(DataType type); // If the index type is known it will be automatically used here From a6e0878f885552dc4d853be65721e8a0b05ec191 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Tue, 11 Apr 2023 12:21:50 -0700 Subject: [PATCH 08/22] save --- csrc/scheduler/matmul.cpp | 63 +++++++++++++++++++++++---------------- 1 file changed, 37 insertions(+), 26 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index db3dbdd1d5c..14b2320f84b 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -143,7 +143,8 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { */ constexpr int items_per_unit = ldmatrix_cols; - constexpr int bytes_per_unit = items_per_unit * primDataTypeSize(DataType::Half); + constexpr int bytes_per_unit = + items_per_unit * primDataTypeSize(DataType::Half); constexpr int words_per_unit = bytes_per_unit / smem_bytes_per_word; constexpr int num_megabanks = smem_banks / words_per_unit; @@ -225,37 +226,46 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * The answer to the above question is: yes! Consider the following * equation: * f(i1 + j) == f(i1) - * We want to know what is the smallest j that makes the above equation - * true. Because this tells us in how many steps we will see repeat. This - * equation can be simplified as: + * We want to know what is the smallest positive number j that makes the + * above equation true. Because this tells us in how many steps we will see + * repeat. 
This equation can be simplified as:
    *   f(i1 + j) == f(i1) + j * stride == f(i1)
    *   ==> j * stride == 0
-   * That is, we are interested in finding the minimum j that makes
-   *   j * stride == 0
    *
-   * Further study of equation j * stride == 0 needs knowledge on
-   * "multiplicative group of integers modulo n", which can be found at:
-   * https://en.wikipedia.org/wiki/Multiplicative_group_of_integers_modulo_n
-   * Readers not interested in this detail can skip this paragraph and jump
-   * to conclusions in the next paragraph.
-   *
+   * An important tool to study this equation is the multiplicative inverse:
+   * https://en.wikipedia.org/wiki/Modular_multiplicative_inverse
+   * stride has a multiplicative inverse if and only if stride is coprime
+   * with n, that is, gcd(stride, n) == 1. In such a case, the solution to
+   * our equation j * stride == 0 is j = stride^(-1) * 0 = 0, that is: f(i)
+   * does not repeat, that is: there is no bank conflict.
    */

-  // In the case where tile row is a multiple of memory row, the whole memory
-  // row is the repeated pattern of swizzle. In the case where tile row is
-  // not divisible, the residual part is the repeated pattern.
-  int repeated_pattern_size_in_units =
-      std::gcd(num_megabanks, row_stride_znz);
+  int g = std::gcd(num_megabanks, row_stride_znz);
+  if (g == 1) {
+    return; // No need to swizzle in this case.
+  }
+
+  /* If stride is not coprime with n, then we can write stride as:
+   *   stride = s * gcd(stride, n)
+   * where s is coprime with n. Then the equation j * stride == 0 becomes:
+   *   j * s * gcd(stride, n) == 0
+   * which can be simplified as
+   *   j * gcd(stride, n) == s^(-1) * 0
+   *   ==> j * gcd(stride, n) == 0
+   * It is easy to see that j is n / gcd(stride, n).
+   * That is: f(i) always repeats a pattern of n / gcd(stride, n) unique
+   * numbers gcd(stride, n) times
+   */

-  // Calculate row multiplier, which is defined as minimum number of rows
-  // to look down from an element until the same bank index is observed.
- int multiplier = num_megabanks / repeated_pattern_size_in_units; + int repeated_pattern_size = num_megabanks / g; - if (multiplier >= ldmatrix_rows) { - // No need to swizzle in this case. - return; + if (repeated_pattern_size >= ldmatrix_rows) { + return; // No need to swizzle in this case. } + /* Now we are ready to implement our swizzle + */ + // Calculate swizzle period, only equal row/col periods at the moment: // TODO: aperiodic swizzle could also be supported in a follow up: int max_swizzle_period = repeated_pattern_size_in_units; @@ -265,11 +275,12 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { // Do not have to use the max_swizzle period if we already had // enough swizzle to permute a ldmatrix_rows. This would encourage // usage of power of 2 swizzle periods. - if (ldmatrix_rows % multiplier == 0) { - swizzle_period = std::min(swizzle_period, ldmatrix_rows / multiplier); + if (ldmatrix_rows % repeated_pattern_size == 0) { + swizzle_period = + std::min(swizzle_period, ldmatrix_rows / repeated_pattern_size); } - int row_multiplier = multiplier; + int row_multiplier = repeated_pattern_size; TORCH_INTERNAL_ASSERT( tile_size_x % (swizzle_period * row_multiplier) == 0 && From 278261efb688bdc372b992a5960a03249fa3a178 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Tue, 11 Apr 2023 13:44:02 -0700 Subject: [PATCH 09/22] save --- csrc/scheduler/matmul.cpp | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 14b2320f84b..282ecf6b981 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -263,12 +263,41 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { return; // No need to swizzle in this case. 
} - /* Now we are ready to implement our swizzle + /* Now we have the idea about how to remove bank conflict: We can do an + * inner split of our row dimension by `repeated_pattern_size`, then + * different indices of the outer row dimension will be using the same + * megabank, and different indices of the inner row dimension will be using + * different megabank. We don't need to touch the inner dimension, but we + * need to play with the outer dimension to interleave it with matrice ids + * so that each matrix is distributed across different banks. + * + * For example, if we have repeated_pattern_size = 4, we would want to do + * something like below: + * +----------+----------+ + * 0| | | + * 1| matrix 0 | matrix 1 | + * 2| | | + * 3| | | + * +----------+----------+ + * 4| | | + * 5| matrix 1 | matrix 0 | + * 6| | | + * 7| | | + * +----------+----------+ */ + // -2 -1 + // [row, col] + TORCH_INTERNAL_ASSERT(tile_size_x % repeated_pattern_size); + shared_mem_tv->split(-2, repeated_pattern_size); + TORCH_INTERNAL_ASSERT(tile_size_y % ldmatrix_cols); + shared_mem_tv->split(-1, ldmatrix_cols); + // -4 -3 -2 -1 + // [pattern id, pattern, matrix id, matrix] + // Calculate swizzle period, only equal row/col periods at the moment: // TODO: aperiodic swizzle could also be supported in a follow up: - int max_swizzle_period = repeated_pattern_size_in_units; + int max_swizzle_period = g; int swizzle_period = max_swizzle_period; From 0fb2d83b3e56177773da7467b04939df823c5031 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Tue, 11 Apr 2023 15:13:50 -0700 Subject: [PATCH 10/22] save --- csrc/scheduler/matmul.cpp | 105 +++++++++++++++++++++++--------------- 1 file changed, 63 insertions(+), 42 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 282ecf6b981..59ad033e883 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -263,13 +263,50 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { return; // 
No need to swizzle in this case.
   }

-  /* Now we have the idea about how to remove bank conflict: We can do an
-   * inner split of our row dimension by `repeated_pattern_size`, then
-   * different indices of the outer row dimension will be using the same
-   * megabank, and different indices of the inner row dimension will be using
-   * different megabank. We don't need to touch the inner dimension, but we
-   * need to play with the outer dimension to interleave it with matrix ids
-   * so that each matrix is distributed across different banks.
+  /* We've just studied the behavior of f(i) w.r.t. different `i`s, and f(i)
+   * repeats with a period of n / gcd(stride, n). With fixed stride, for each
+   * given `init`, the values f(i) at different `i` form a "pattern". Now we
+   * would study the behavior of f(i) for different `init` values. In other
+   * words, we just studied the megabank usage behavior of different rows of
+   * the same matrix, now we study the megabank usage behavior of the same
+   * row of different matrices.
+   *
+   * Let's slightly change our notation f(i) as f(i;init) for convenience.
+   * Because Z/nZ has n items, each pattern has n / gcd(stride, n) different
+   * items, so we have in total gcd(stride, n) different patterns. In Z/nZ,
+   * `init` has n possible values, we want to know when different `init`
+   * correspond to different patterns and when they correspond to the same
+   * pattern.
+   *
+   * Consider the equation
+   *   f(i1; init1) == f(i2; init2)
+   * which simplifies to
+   *   init1 + i1 * stride == init2 + i2 * stride
+   *   ==> init1 - init2 = (i2 - i1) * stride
+   *   ==> init1 - init2 = (i2 - i1) * s * gcd(stride, n)
+   * Let si = (i2 - i1) * s, because s is coprime with n, we know that for an
+   * arbitrary value in Z/nZ, there exist i1 and i2 to make si take that
+   * value. That said, for init values that are off by a multiple of
+   * gcd(stride, n) they correspond to the same pattern, otherwise they
+   * belong to different patterns.
So, we can use + * init = 0, 1, ..., gcd(stride, n) - 1 + * to canonically represent gcd(stride, n) patterns. Let's call the above + * `init` values "pattern id". + * + * For the example of stride = 6 under Z/8Z, we have different patterns + * f(i): 01234567 + * pattern 0: x_x_x_x_ + * pattern 1: _x_x_x_x + * (x => occupied, _ => unoccupied) + * + * Now we have the idea about how to remove bank conflict: We can do an + * inner split of our row dimension by `repeated_pattern_size` to get + * (repeat, pattern), then different indices of the "repeat" dimension will + * be using the same megabank, and different indices of the "pattern" + * dimension will be using different megabank. We don't need to touch the + * "pattern" dimension, but we need to play with the "repeat" dimension to + * interleave it with matrice ids so that each matrix is distributed across + * different banks. * * For example, if we have repeated_pattern_size = 4, we would want to do * something like below: @@ -288,32 +325,23 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { // -2 -1 // [row, col] - TORCH_INTERNAL_ASSERT(tile_size_x % repeated_pattern_size); - shared_mem_tv->split(-2, repeated_pattern_size); - TORCH_INTERNAL_ASSERT(tile_size_y % ldmatrix_cols); + TORCH_INTERNAL_ASSERT( + tile_size_x % ldmatrix_rows == 0, "Partial matrices not supported"); + shared_mem_tv->split(-2, ldmatrix_rows); + TORCH_INTERNAL_ASSERT( + tile_size_y % ldmatrix_cols == 0, "Partial matrices not supported"); shared_mem_tv->split(-1, ldmatrix_cols); - // -4 -3 -2 -1 - // [pattern id, pattern, matrix id, matrix] - - // Calculate swizzle period, only equal row/col periods at the moment: - // TODO: aperiodic swizzle could also be supported in a follow up: - int max_swizzle_period = g; - - int swizzle_period = max_swizzle_period; - - // Do not have to use the max_swizzle period if we already had - // enough swizzle to permute a ldmatrix_rows. 
This would encourage - // usage of power of 2 swizzle periods. - if (ldmatrix_rows % repeated_pattern_size == 0) { - swizzle_period = - std::min(swizzle_period, ldmatrix_rows / repeated_pattern_size); - } - - int row_multiplier = repeated_pattern_size; - + // -4 -3 -2 -1 + // [matrix id, matrix, matrix id, matrix] + TORCH_INTERNAL_ASSERT( + ldmatrix_rows % repeated_pattern_size == 0, + "ldmatrix_rows is assumed to be a multiple of repeated_pattern_size"); + shared_mem_tv->split(-3, repeated_pattern_size); + // -5 -4 -3 -2 -1 + // [matrix id, repeat, pattern, matrix id, matrix] + int swizzle_period = ldmatrix_rows / repeated_pattern_size; TORCH_INTERNAL_ASSERT( - tile_size_x % (swizzle_period * row_multiplier) == 0 && - tile_size_y % (swizzle_period * ldmatrix_cols) == 0, + tile_size_y % (swizzle_period * ldmatrix_cols) == 0, "need aperiodic swizzle config for tile size ", tile_size_x, "x", @@ -322,17 +350,10 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { ldmatrix_rows, "x", ldmatrix_cols); - - // add the swizzling op: - shared_mem_tv->split(-2, row_multiplier * swizzle_period); - shared_mem_tv->split(-2, row_multiplier); - - shared_mem_tv->split(-1, ldmatrix_cols * swizzle_period); - shared_mem_tv->split(-1, ldmatrix_cols); - - // -6 -5 -4 -3 -2 -1 - // [..., row_o, row_period, row_multiplier, col_o, col_period, - // ldmatrix_cols] + shared_mem_tv->split(-2, swizzle_period); + // -6 -5 -4 -3 -2 -1 + // [matrix id, repeat, pattern, matrix id outer, pattern id, matrix] + // swizzle repeat with pattern id to make repeat no longer repeat if (isPowOf2(swizzle_period)) { shared_mem_tv->swizzle(Swizzle2DType::XOR, -5, -2); } else { From 1266943e4debf27d315800476830aa159c4c6c8d Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Wed, 12 Apr 2023 16:30:43 -0700 Subject: [PATCH 11/22] save --- csrc/scheduler/matmul.cpp | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/csrc/scheduler/matmul.cpp 
b/csrc/scheduler/matmul.cpp index a9934b03e15..96358338d03 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -124,8 +124,14 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * When we talk about bank conflict removal, we are talking about the * following task: * "there are 32 banks, and each bank contains one 4-byte word, we want to - * make sure different lanes in a warp does not access different words in - * the same bank" + * make sure different lanes in a warp does not access different word + * addresses in the same bank" + * For example, if thread 0 is accessing word address 1, and thread 1 is + * accessing word address 33, then these two threads will have a bank + * conflict because they are accessing different word addresses in the same + * bank. However, if thread 0 is accessing byte address 4 and thread 1 is + * accessing byte address 6 then there will be no bank conflict because 4 + * and 6 both belong to word 1. */ constexpr int smem_bytes_per_word = 4; @@ -137,7 +143,7 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * "megabank". So we can rephrase our task as: * "there are 8 megabanks, and each megabanks contains one 4-word unit, we * want to make sure different lanes in a warp does not access different - * units in the same megabank" + * unit addresses in the same megabank" * In this terminology, each matrix has 8 rows, and each row has exactly one * unit. 
*/ From b39a04f3c22f02414767074613036ac4b18aa655 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Wed, 12 Apr 2023 22:35:13 -0700 Subject: [PATCH 12/22] update --- csrc/scheduler/matmul.cpp | 67 ++++++++++++++++++++++----------------- 1 file changed, 38 insertions(+), 29 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 96358338d03..1ea1a534504 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -21,7 +21,7 @@ namespace nvfuser { namespace { // Returns true if given number is power of 2 -bool isPowOf2(int x) { +constexpr bool isPowOf2(int x) { return x > 1 && (x & (x - 1)) == 0; } @@ -241,7 +241,7 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * An important tool to study this equation is multiplicative inverse: * https://en.wikipedia.org/wiki/Modular_multiplicative_inverse * stride has an multiplicative inverse if and only if stride coprime with - * n, that is, gcd(stride, n) == 1. In such case, the solution to our + * n, that is, g := gcd(stride, n) == 1. In such case, the solution to our * equation j * stride == 0 is j = stride^(-1) * 0 = 0, that is: f(i) does * not repeat, that is: there is no bank conflict. */ @@ -251,16 +251,26 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { return; // No need to swizzle in this case. } - /* If stride does not coprime with n, then we can write stride as: - * stride = s * gcd(stride, n) + /* For the case where stride does not coprime with n, if n is a power of a + * prime, then we can write stride as: + * stride = s * g * where s coprime with n. Then the equation j * stride == 0 becomes: - * j * s * gcd(stride, n) == 0 + * j * s * g == 0 * which can be simplified as - * j * gcd(stride, n) == s^(-1) * 0 - * ==> j * gcd(stride, n) == 0 - * It is easy to see that j is n / gcd(stride, n). 
- * That is: f(i) always repeat a pattern of n / gcd(stride, n) unique - * numbers gcd(stride, n) times + * j * g == s^(-1) * 0 + * ==> j * g == 0 + * + * It is easy to see that j is n / g. + * That is: f(i) always repeat a pattern of n / g unique numbers g times + * + * For our application here, n is always 8, which is a power of 2. So this + * conclusion holds. + * (Actually, if n is not a power of a prime, we can decompose the ring Z/nZ + * into the direct product of Z/p1^k1Z x Z/p2^k2Z x ... according to the + * Chinese remainder theorem, and repeat the above proof in each ring, we + * will get the same conclusion here. I will not go deep into the details + * about this here as it is unrelated, but on the other hand I will not add + * static_assert(isPowOf2(num_megabanks)); here either) */ int repeated_pattern_size = num_megabanks / g; @@ -270,33 +280,32 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { } /* We've just studied the behavior of f(i) w.r.t. different `i`s, and f(i) - * repeat with a period of n / gcd(stride, n). With fixed stride, for each - * given `init`, the values f(i) at different `i` form a "pattern". Now we - * would study the behavior of f(i) for different `init` values. In other - * word, we just studied the megabank usage behavior of different rows of - * the same matrix, now we study the megabank usage behavior of the same row - * of different matrices. + * repeat with a period of n / g. With fixed stride, for each given `init`, + * the values f(i) at different `i` form a "pattern". Now we would study the + * behavior of f(i) for different `init` values. In other word, we just + * studied the megabank usage behavior of different rows of the same matrix, + * now we study the megabank usage behavior of the same row of different + * matrices. * * Let's slightly change our notation f(i) as f(i;init) for convenience. 
- * Because Z/nZ has n items, each pattern has n / gcd(stride, n) different - * items, so we have in total gcd(stride, n) different patterns. in Z/nZ, - * `init` has n possible values, we want to know when different `init` - * correspond to different patterns and when they correspond to the same - * pattern. + * Because Z/nZ has n items, each pattern has n / g different items, so we + * have in total g different patterns. in Z/nZ, `init` has n possible + * values, we want to know when different `init` correspond to different + * patterns and when they correspond to the same pattern. * * Consider the equation * f(i1; init1) == f(i2; init2) * which simplifies to * init1 + i1 * stride == init2 + i2 * stride * ==> init1 - init2 = (i2 - i1) * stride - * ==> init1 - init2 = (i2 - i1) * s * gcd(stride, n) - * Let si = (i2 - i1) * s, because s coprime with n, we know that for an - * arbitrary value in Z/nZ, there exist an i1 and i2 to make si take that - * value. That said, for init values that are off by a multiple of - * gcd(stride, n) they correspond to the same pattern, otherwise they - * belongs to different patterns. So, we can use - * init = 0, 1, ..., gcd(stride, n) - 1 - * to canonically represent gcd(stride, n) patterns. Let's call the above + * ==> init1 - init2 = (i2 - i1) * s * g + * Let si = (i2 - i1) * s, because s coprime with (m = n / g), we know that + * for an arbitrary value in Z/mZ, there exist an i1 and i2 to make si take + * that value. That said, for init values that are off by a multiple of g + * they correspond to the same pattern, otherwise they belongs to different + * patterns. So, we can use + * init = 0, 1, ..., g - 1 + * to canonically represent g patterns. Let's call the above * `init` values "pattern id". 
* * For the example of stride = 6 under Z/8Z, we have different patterns From de1f0583ffbddbfc471eaac40eb208c74558314c Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Wed, 12 Apr 2023 23:05:43 -0700 Subject: [PATCH 13/22] another update --- csrc/scheduler/matmul.cpp | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 1ea1a534504..5f7ffe7878a 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -299,11 +299,14 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * init1 + i1 * stride == init2 + i2 * stride * ==> init1 - init2 = (i2 - i1) * stride * ==> init1 - init2 = (i2 - i1) * s * g - * Let si = (i2 - i1) * s, because s coprime with (m = n / g), we know that - * for an arbitrary value in Z/mZ, there exist an i1 and i2 to make si take - * that value. That said, for init values that are off by a multiple of g - * they correspond to the same pattern, otherwise they belongs to different + * Let m = n / g, according to Theorem 4.13 in [The Mathematics of + * Integer Arithmetic], ((i2 - i1) * stride) % n = ((i2 - i1) * s) % m * g. + * Let si = (i2 - i1) * s, because s coprime with m, we know that for an + * arbitrary value in Z/mZ, there exist an i1 and i2 to make si take that + * value. That said, for init values that are off by a multiple of g they + * correspond to the same pattern, otherwise they belongs to different * patterns. So, we can use + * * init = 0, 1, ..., g - 1 * to canonically represent g patterns. Let's call the above * `init` values "pattern id". 
From f4ac9c8724cce4d6ec5f2027d1024014eec9768b Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Thu, 13 Apr 2023 11:20:36 -0700 Subject: [PATCH 14/22] guard assert --- benchmark/matmul.cpp | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/benchmark/matmul.cpp b/benchmark/matmul.cpp index 53b378acda3..1a3e2c6343a 100644 --- a/benchmark/matmul.cpp +++ b/benchmark/matmul.cpp @@ -235,9 +235,13 @@ static void SingleMatmulBase( auto launch_constraints = LaunchParams(); FusionExecutor fe; fe.compileFusion(fusion, args, launch_constraints, cparams); - TORCH_CHECK( - getBankConflictInfo(fe.kernel(), launch_constraints).empty(), - "Shared memory bank conflict not removed."); + auto properties = at::cuda::getDeviceProperties(inputs.first.get_device()); + if (properties->major >= 8 || + (properties->major == 7 && properties->minor >= 5)) { + TORCH_CHECK( + getBankConflictInfo(fe.kernel(), launch_constraints).empty(), + "Shared memory bank conflict not removed."); + } // Warm up run auto outputs = fe.runFusion({inputs.first, inputs.second}); From f5848b623fb271d264266ba7fb498efc4cb49928 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Thu, 13 Apr 2023 11:30:30 -0700 Subject: [PATCH 15/22] save --- csrc/scheduler/matmul.cpp | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 5f7ffe7878a..ff2bada8554 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -144,8 +144,8 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * "there are 8 megabanks, and each megabank contains one 4-word unit, we * want to make sure different lanes in a warp do not access different * unit addresses in the same megabank" - * In this terminology, each matrix has 8 rows, and each row has exactly one - * unit. + * In this terminology, matrices are in the row-major format, each matrix + * has 8 rows, and each row has exactly one unit.
*/ constexpr int items_per_unit = ldmatrix_cols; @@ -161,11 +161,12 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * +----------+----------+----------+ * | matrix 3 | matrix 4 | matrix 5 | * +----------+----------+----------+ - * The addresses of different rows in the same matrix are offseted by 3 - * units. In this perspective, loading a matrix is a strided memory access - * with the following stride (in units): + * The addresses of different rows in the same matrix are offset by 3 units. + * From this perspective, loading a matrix is a strided memory access with the + * following stride (in units): */ + // number of units per row int row_stride = tile_size_y / items_per_unit; /* So the bank conflict problem is now converted to the following game: From d1256b1fee149b2ba8daf0d73ce02f016fb59fb3 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Thu, 13 Apr 2023 12:31:25 -0700 Subject: [PATCH 16/22] update --- csrc/scheduler/matmul.cpp | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index ff2bada8554..e8dd7b94b4a 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -262,7 +262,9 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * ==> j * g == 0 * * It is easy to see that j is n / g. - * That is: f(i) always repeats a pattern of n / g unique numbers g times + * That is: f(i) always repeats a pattern of n / g unique numbers g times. + * In other words: we are using n/g megabanks, and we have a g-way bank + * conflict. * * For our application here, n is always 8, which is a power of 2. So this * conclusion holds. @@ -307,7 +309,6 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * value. That said, for init values that are off by a multiple of g they * correspond to the same pattern, otherwise they belong to different * patterns.
So, we can use - * * init = 0, 1, ..., g - 1 * to canonically represent g patterns. Let's call the above * `init` values "pattern id". From 5610465940b438e1ea7af46abf17f0817a302c5c Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Thu, 13 Apr 2023 22:25:02 -0700 Subject: [PATCH 17/22] simplify the case with g != 1 --- csrc/scheduler/matmul.cpp | 34 +++++++++++++++------------------- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index e8dd7b94b4a..f818ec02614 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -252,28 +252,24 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { return; // No need to swizzle in this case. } - /* For the case where stride is not coprime with n, if n is a power of a - * prime, then we can write stride as: - * stride = s * g - * where s is coprime with n. Then the equation j * stride == 0 becomes: - * j * s * g == 0 - * which can be simplified as - * j * g == s^(-1) * 0 - * ==> j * g == 0 + /* For the case where stride is not coprime with n, we note that + * j * stride == 0 in Z/nZ is equivalent to (j * stride) % n = 0 in Z. We + * can write stride and n as: + * stride = s * g, n = k * g + * According to Theorem 4.13 in [The Mathematics of Integer Arithmetic], we + * have: + * (j * stride) % n = 0 + * ==> (j * s) % k * g = 0 + * ==> (j * s) % k = 0 + * which is equivalent to j * s == 0 in Z/kZ. Because s is coprime with k, we + * further get: + * j == 0 (in Z/kZ) + * That is, j is a multiple of k in Z. So the smallest positive j that makes + * the equation hold is n / g. + * - * It is easy to see that j is n / g. - * That is: f(i) always repeats a pattern of n / g unique numbers g times. + * That is: f(i) always repeats a pattern of n/g unique numbers g times. * In other words: we are using n/g megabanks, and we have a g-way bank * conflict. - * - * For our application here, n is always 8, which is a power of 2.
So this - * conclusion holds. - * (Actually, if n is not a power of a prime, we can decompose the ring Z/nZ - * into the direct product of Z/p1^k1Z x Z/p2^k2Z x ... according to the - * Chinese remainder theorem, and repeat the above proof in each ring, we - * will get the same conclusion here. I will not go deep into the details - * about this here as it is unrelated, but on the other hand I will not add - * static_assert(isPowOf2(num_megabanks)); here either) */ int repeated_pattern_size = num_megabanks / g; From 77cef3523068dfdd76a7977366170a8a90fc8152 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Thu, 13 Apr 2023 22:27:11 -0700 Subject: [PATCH 18/22] k->m --- csrc/scheduler/matmul.cpp | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index b86115682c0..9a4d4d44070 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -253,16 +253,16 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { /* For the case where stride is not coprime with n, we note that * j * stride == 0 in Z/nZ is equivalent to (j * stride) % n = 0 in Z. We * can write stride and n as: - * stride = s * g, n = k * g + * stride = s * g, n = m * g * According to Theorem 4.13 in [The Mathematics of Integer Arithmetic], we * have: * (j * stride) % n = 0 - * ==> (j * s) % k * g = 0 - * ==> (j * s) % k = 0 - * which is equivalent to j * s == 0 in Z/kZ. Because s is coprime with k, we + * ==> (j * s) % m * g = 0 + * ==> (j * s) % m = 0 + * which is equivalent to j * s == 0 in Z/mZ. Because s is coprime with m, we * further get: - * j == 0 (in Z/kZ) - * That is, j is a multiple of k in Z. So the smallest positive j that makes + * j == 0 (in Z/mZ) + * That is, j is a multiple of m in Z. So the smallest positive j that makes * the equation hold is n / g. * * That is: f(i) always repeats a pattern of n/g unique numbers g times.
From d54a76febc84cc8024ea0a58b73c31a0fe1aa256 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Fri, 14 Apr 2023 13:31:06 -0700 Subject: [PATCH 19/22] fix indexing error --- csrc/index_compute.cpp | 21 +++++++-------------- 1 file changed, 7 insertions(+), 14 deletions(-) diff --git a/csrc/index_compute.cpp b/csrc/index_compute.cpp index ae126efb40d..b9964df33af 100644 --- a/csrc/index_compute.cpp +++ b/csrc/index_compute.cpp @@ -559,21 +559,14 @@ void IndexCompute::handle(Swizzle2D* swizzle_2d) { // Handle inactive swizzles by just passing through index // and extent information. - TORCH_INTERNAL_ASSERT( - index_map_.count(in_x_id) == index_map_.count(in_y_id), - "input index should be either both defined or both undefined"); - if (index_map_.count(in_x_id)) { - // Only propagate original index through if - // the input index hasn't been computed. - // TODO: - // This part should be cleaner once we remove the - // second index traversal pass. - return; + if (!index_map_.count(in_x_id)) { + index_map_[in_x_id] = out_x_ind; + extent_map_[in_x_id] = getExtent(out_x_id); + } + if (!index_map_.count(in_y_id)) { + index_map_[in_y_id] = out_y_ind; + extent_map_[in_y_id] = getExtent(out_y_id); } - index_map_[in_x_id] = out_x_ind; - index_map_[in_y_id] = out_y_ind; - extent_map_[in_y_id] = getExtent(out_y_id); - extent_map_[in_x_id] = getExtent(out_x_id); } else { // Generate integer swizzle math if the // swizzle is activated.
See also From 35be76849e5e3950cb8d1e170c2c1b4456b6b847 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Fri, 14 Apr 2023 14:32:37 -0700 Subject: [PATCH 20/22] update doc --- csrc/scheduler/matmul.cpp | 66 +++++++++++++++++++++------------------ 1 file changed, 36 insertions(+), 30 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index 9a4d4d44070..aa5530e1756 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -199,10 +199,11 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { int row_stride_znz = row_stride % num_megabanks; /* Consider the following function in Z/nZ: - * f(i) = init + i * stride + * f(i; init) = init + i * stride * where init is the initial position of the pointer in the clock when we * start the game, and stride is the number of ticks we move forward each - * time, and i is the number of times we move forward. + * time, and i is the number of times we move forward. For a fixed init, we + * abbreviate f(i; init) as f(i). * * In our problem, f(i) is the megabank of the `i`th row of the matrix, and * `init` is the megabank of the 0th row of the matrix. @@ -268,6 +269,15 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * That is: f(i) always repeats a pattern of n/g unique numbers g times. * In other words: we are using n/g megabanks, and we have a g-way bank * conflict. + * + * Let's use the word "pattern" to refer to the set of values of `f` at + * different `i`, that is: + * pattern k = { f(i; init=k) | i in Z/nZ } + * For the example of stride = 6 under Z/8Z, we have the following patterns + * f(i): 01234567 + * pattern 0: x_x_x_x_ + * pattern 1: _x_x_x_x + * (x => occupied, _ => unoccupied) */ int repeated_pattern_size = num_megabanks / g; @@ -276,43 +286,39 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { return; // No need to swizzle in this case. } - /* We've just studied the behavior of f(i) w.r.t.
different `i`s, and f(i) - * repeat with a period of n / g. With fixed stride, for each given `init`, - * the values f(i) at different `i` form a "pattern". Now we would study the - * behavior of f(i) for different `init` values. In other words, we just - * studied the megabank usage behavior of different rows of the same matrix, - * now we study the megabank usage behavior of the same row of different - * matrices. + /* Now we know that we have a g-way bank conflict. How do we remove this + * bank conflict? The answer is to mix the storage of different matrices. + * We first split the matrices along the row axis into g pieces, each piece + * having n/g rows. With this split, each piece occupies exactly one pattern. + * We want to use some non-traditional storage to let different pieces of + * the same matrix occupy different patterns. * - * Let's slightly change our notation f(i) as f(i;init) for convenience. - * Because Z/nZ has n items, each pattern has n / g different items, so we - * have in total g different patterns. In Z/nZ, `init` has n possible - * values, we want to know when different `init` correspond to different - * patterns and when they correspond to the same pattern. + * Because Z/nZ has n items, each pattern has n/g different items, so we + * have in total g different patterns. We want to find the corresponding + * `init` values of these g different patterns. * - * Consider the equation - * f(i1; init1) == f(i2; init2) + * Consider two different init values `init1` and `init2`. When do they + * represent the same pattern?
They represent the same pattern if and only + * if `f(0; init2)` falls on the pattern of `init1`, that is, there exists an + * i such that + * f(i; init1) == f(0; init2) * which simplifies to - * init1 + i1 * stride == init2 + i2 * stride - * ==> init1 - init2 = (i2 - i1) * stride - * ==> init1 - init2 = (i2 - i1) * s * g - * Let m = n / g, according to Theorem 4.13 in [The Mathematics of - * Integer Arithmetic], ((i2 - i1) * stride) % n = ((i2 - i1) * s) % m * g. - * Let si = (i2 - i1) * s, because s is coprime with m, we know that for an - * arbitrary value in Z/mZ, there exist i1 and i2 to make si take that - * value. That said, for init values that are off by a multiple of g they + * init1 + i * stride == init2 + * ==> init2 - init1 == i * stride + * What values can `i * stride` be? It can be an arbitrary multiple of g: + * i * stride in Z/nZ is (i * stride) % n in Z. Let m = n/g, according to + * Theorem 4.13 in [The Mathematics of Integer Arithmetic] + * (i * stride) % n = (i * s) % m * g + * Because s is coprime with m, we know that for an arbitrary value `j` in + * Z/mZ, we can take `i = s^(-1) * j` to make `i * s == j`. + * + * That said, for init values that are off by a multiple of g they * correspond to the same pattern, otherwise they belong to different * patterns. So, we can use * init = 0, 1, ..., g - 1 * to canonically represent g patterns. Let's call the above * `init` values "pattern id".
* - * For the example of stride = 6 under Z/8Z, we have different patterns - * f(i): 01234567 - * pattern 0: x_x_x_x_ - * pattern 1: _x_x_x_x - * (x => occupied, _ => unoccupied) - * * Now we have the idea of how to remove the bank conflict: We can do an * inner split of our row dimension by `repeated_pattern_size` to get * (repeat, pattern), then different indices of the "repeat" dimension will From 9d8db7f7e149cacce34e5b3d1b7de5c51eb274c1 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Fri, 14 Apr 2023 14:59:05 -0700 Subject: [PATCH 21/22] minv --- csrc/scheduler/matmul.cpp | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/csrc/scheduler/matmul.cpp b/csrc/scheduler/matmul.cpp index aa5530e1756..e86eb3df484 100644 --- a/csrc/scheduler/matmul.cpp +++ b/csrc/scheduler/matmul.cpp @@ -240,9 +240,15 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * * An important tool to study this equation is multiplicative inverse: * https://en.wikipedia.org/wiki/Modular_multiplicative_inverse + * A number i has multiplicative inverse `minv(i)` in Z/nZ if and only if it + * is coprime with n. `minv(i)` is the number such that `i * minv(i) == 1`. So in + * Z/nZ, the equation `ax = b` has solution `x = minv(a)*b` if a has a + * multiplicative inverse. For example, in Z/15Z, `minv(2) = 8` because + * (2 * 8) % 15 = 1 + * * stride has a multiplicative inverse if and only if stride is coprime with * n, that is, g := gcd(stride, n) == 1. In that case, the solution to our - * equation j * stride == 0 is j = stride^(-1) * 0 = 0, that is: f(i) does + * equation j * stride == 0 is j = minv(stride) * 0 = 0, that is: f(i) does * not repeat, that is: there is no bank conflict.
*/ @@ -310,7 +316,7 @@ void prologSwizzle(TensorView* shared_mem_tv, const MatmulParams& params) { * Theorem 4.13 in [The Mathematics of Integer Arithmetic] * (i * stride) % n = (i * s) % m * g * Because s is coprime with m, we know that for an arbitrary value `j` in - * Z/mZ, we can take `i = s^(-1) * j` to make `i * s == j`. + * Z/mZ, we can take `i = minv(s) * j` to make `i * s == j`. * * That said, for init values that are off by a multiple of g they * correspond to the same pattern, otherwise they belong to different From 1da2441d09f89a8e2c3c7d03f4ea76dc7b72f4c4 Mon Sep 17 00:00:00 2001 From: Xiang Gao Date: Fri, 14 Apr 2023 16:27:10 -0700 Subject: [PATCH 22/22] disable FusionAmpereMatmulSASSRegisterUsageLDSM_CUDA --- test/test_gpu_matmul_sass.cpp | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/test/test_gpu_matmul_sass.cpp b/test/test_gpu_matmul_sass.cpp index ff5370a3831..3eaed52f61e 100644 --- a/test/test_gpu_matmul_sass.cpp +++ b/test/test_gpu_matmul_sass.cpp @@ -69,6 +69,8 @@ sass::Container getSASSFor( params.double_buffer_options.smem_double_buffer_stage = 4; scheduleMatmul(&fusion, params); + fusion.printTransforms(); + at::manual_seed(0); auto inputs = fp16MatmulAtInput(M, N, K, layout); @@ -245,6 +247,20 @@ TEST_F(NVFuserTest, FusionAmpereMatmulSASSModifiersCheck_CUDA) { } } +#if 0 + +TODO: With swizzle, the CUDA code looks like: + +#pragma unroll +for(nvfuser_index_t i507 = 0; i507 < 8; ++i507) { + int i18439; + i18439 = i18438 + i507; + Turing::ldMatrixT (*reinterpret_cast*>(&T9[(4 * i507)]),((i18437 + (128 * (i18439 / 8))) + (16 * (i6455 ^ (i18439 % 8))))); +} + +where i6455 = (((nvfuser_index_t)threadIdx.x) % 16) % 8 so it no longer makes sense to require the memory access pattern below. +We need to reinvestigate the test below to determine whether to change it or delete it.
+ // Check that all LDSM instructions have the following pattern: // LDSM.16.M88.2 R2, [R213] ; // LDSM.16.M88.2 R136, [R213+0x200] ; @@ -317,5 +333,6 @@ } } } +#endif } // namespace nvfuser