Closed
Description
Codegen gives the wrong result with the transpose scheduler. This is a relatively complicated case where all the tricks for small transpose dimensions are in action (split_before_tiling and two merges).
Here's the C++ repro. I'm almost certain there's an indexing issue here, since this program gives me a misaligned access when I enable vectorization. 😛
// small transpose dimension with merge and split
TEST_F(NVFuserTest, TransposeIndexingIssue_CUDA) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  auto tv0 = makeContigTensor(5);
  fusion->addInput(tv0);
  auto tv1 = transpose(tv0, 1, 4);
  auto tv2 = transpose(tv1, 0, 3);
  fusion->addOutput(tv2);

  std::vector<int64_t> shape({2, 7, 102400, 4, 5});
  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
  auto t0 = at::randn(shape, options);
  std::vector<c10::IValue> aten_inputs({t0});

  FusionExecutorCache executor_cache(std::move(fusion));
  auto cg_outputs = executor_cache.runFusionWithInputs(aten_inputs);
  auto runtime = executor_cache.getMostRecentKernelRuntime();
  TORCH_CHECK(!runtime->isSegmented(), "Segmentation not expected");

  auto ref = t0.transpose(1, 4).transpose(0, 3);
  TORCH_CHECK(ref.equal(cg_outputs.at(0)));
}
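For readers who want to check the expected output layout without a GPU, here is a minimal standalone sketch of how the two transposes permute the input shape. The helper names (`swapDims`, `reproOutputShape`, `numElements`) are hypothetical and not part of nvFuser; the code only models the logical sizes, not the kernel itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: swap two dimensions of a shape, which is the effect
// transpose(tv, a, b) has on the logical sizes.
std::vector<int64_t> swapDims(std::vector<int64_t> shape, std::size_t a, std::size_t b) {
  assert(a < shape.size() && b < shape.size());
  std::swap(shape[a], shape[b]);
  return shape;
}

// Shape after transpose(1, 4) followed by transpose(0, 3), as in the repro.
std::vector<int64_t> reproOutputShape(const std::vector<int64_t>& in) {
  return swapDims(swapDims(in, 1, 4), 0, 3);
}

// Total number of elements in a tensor of the given shape.
int64_t numElements(const std::vector<int64_t>& shape) {
  return std::accumulate(
      shape.begin(), shape.end(), int64_t{1}, std::multiplies<int64_t>());
}
```

Calling `reproOutputShape({2, 7, 102400, 4, 5})` yields `{4, 5, 102400, 2, 7}`, and `numElements` on either shape gives 28672000. Both values match the `shape` and `num_elems` lines in the scheduler dump.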
===== Transpose Stats ========
inputs: T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
outputs: T2_g[ iS13{i4}, iS11{i5}, iS12{i3}, iS10{i0}, iS14{i2} ]
shape: 4 5 102400 2 7
num_elems: 28672000
n_input_tensors: 1
max_input_dtype_size: 4
group 1: T2_g[ iS13{i4}, iS11{i5}, iS12{i3}, iS10{i0}, iS14{i2} ]
reference1: T2_g[ iS13{i4}, iS11{i5}, iS12{i3}, iS10{i0}, iS14{i2} ]
inner_most_id1 position: 4 (in reference 1)
group 2: T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
reference2: T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
inner_most_id2 position: 1 (in reference 1)
small transposed dim, needs virtual inner-most dim
===== Transpose Parameters ========
Tag: Transpose heuristics Transpose Characteristics:
BlckX: 128
input tile size: 32
output tile size: 32
elements per tile: 1024
elements per thread: 8
Unroll group 1, Factor: 8
Unroll group 2, Factor: 8
Virtual inner-most dim:
split(2, 3)
merge 4, 3 with innermost1
merge 0, 2 with innermost2
====================================
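The numbers in the dump above can be sanity-checked with a little arithmetic. The sketch below is not nvFuser code and the helper names are hypothetical. It assumes that "elements per tile" is the product of the two tile sizes, that "elements per thread" is that product divided by BlckX, and that `split(2, 3)` means "split dimension 2 by a factor of 3". That last reading is an assumption taken from how the line is printed and is not confirmed here. Under it, the extent 102400 is not divisible by 3, which may or may not be related to the wrong result.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for positive integers.
inline int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

// Assumed relation: each tile holds in_tile * out_tile elements, and they are
// distributed evenly over block_x threads.
inline int64_t elementsPerThread(int64_t in_tile, int64_t out_tile, int64_t block_x) {
  assert(block_x > 0);
  return in_tile * out_tile / block_x;
}

// Number of out-of-range positions created when `extent` is split by `factor`
// and the outer extent is rounded up. Zero means the split is divisible.
inline int64_t splitPadding(int64_t extent, int64_t factor) {
  return ceilDiv(extent, factor) * factor - extent;
}
```

With the values from the dump, `elementsPerThread(32, 32, 128)` is 8, matching the reported unroll factor. `splitPadding(102400, 3)` is 2, so under the assumed reading the split would need predication for the last two positions.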
C++ exception with description "Expected ref.equal(cg_outputs.at(0)) to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Exception raised from TestBody at /opt/pytorch/nvfuser/test/test_gpu_view.cpp:2537 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7feb6df2e50e in /opt/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x7d (0x7feb6dee4aeb in /opt/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc6111d (0x55e3084b211d in ./build/nvfuser_tests)
frame #3: <unknown function> + 0xe0288b (0x55e30865388b in ./build/nvfuser_tests)
frame #4: <unknown function> + 0xdfb831 (0x55e30864c831 in ./build/nvfuser_tests)
frame #5: <unknown function> + 0xdd0648 (0x55e308621648 in ./build/nvfuser_tests)
frame #6: <unknown function> + 0xdd10d6 (0x55e3086220d6 in ./build/nvfuser_tests)
frame #7: <unknown function> + 0xdd19dd (0x55e3086229dd in ./build/nvfuser_tests)
frame #8: <unknown function> + 0xde18ef (0x55e3086328ef in ./build/nvfuser_tests)
frame #9: <unknown function> + 0xe0376a (0x55e30865476a in ./build/nvfuser_tests)
frame #10: <unknown function> + 0xdfc7f5 (0x55e30864d7f5 in ./build/nvfuser_tests)
frame #11: <unknown function> + 0xde0057 (0x55e308631057 in ./build/nvfuser_tests)
frame #12: <unknown function> + 0x409b0d (0x55e307c5ab0d in ./build/nvfuser_tests)
frame #13: <unknown function> + 0x409082 (0x55e307c5a082 in ./build/nvfuser_tests)
frame #14: <unknown function> + 0x29d90 (0x7feb417d0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #15: __libc_start_main + 0x80 (0x7feb417d0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x3b8545 (0x55e307c09545 in ./build/nvfuser_tests)
" thrown in the test body.