
codegen wrong result from transpose kernel #667

@jjsjann123


Codegen gives a wrong result with the transpose scheduler. This is a relatively complicated case in which all of the tricks for small transpose dimensions are in play (split_before_tiling and two merges).

Here's the C++ repro. I'm almost certain this is an indexing issue, since this program gives me a misaligned access when I enable vectorization. 😛

// small transpose dimension with merge and split
TEST_F(NVFuserTest, TransposeIndexingIssue_CUDA) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  auto tv0 = makeContigTensor(5);
  fusion->addInput(tv0);

  auto tv1 = transpose(tv0, 1, 4);
  auto tv2 = transpose(tv1, 0, 3);
  fusion->addOutput(tv2);

  std::vector<int64_t> shape({2, 7, 102400, 4, 5});

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);

  auto t0 = at::randn(shape, options);
  std::vector<c10::IValue> aten_inputs({t0});

  FusionExecutorCache executor_cache(std::move(fusion));
  auto cg_outputs = executor_cache.runFusionWithInputs(aten_inputs);

  auto runtime = executor_cache.getMostRecentKernelRuntime();
  TORCH_CHECK(!runtime->isSegmented(), "Segmentation not expected");

  auto ref = t0.transpose(1, 4).transpose(0, 3);

  TORCH_CHECK(ref.equal(cg_outputs.at(0)));
}
===== Transpose Stats ========
inputs: T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
outputs: T2_g[ iS13{i4}, iS11{i5}, iS12{i3}, iS10{i0}, iS14{i2} ]
shape: 4 5 102400 2 7
num_elems: 28672000
n_input_tensors: 1
max_input_dtype_size: 4
group 1: T2_g[ iS13{i4}, iS11{i5}, iS12{i3}, iS10{i0}, iS14{i2} ]
reference1: T2_g[ iS13{i4}, iS11{i5}, iS12{i3}, iS10{i0}, iS14{i2} ]
inner_most_id1 position: 4 (in reference 1)
group 2: T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
reference2: T0_g[ iS0{i0}, iS1{i2}, iS2{i3}, iS3{i4}, iS4{i5} ]
inner_most_id2 position: 1 (in reference 1)
small transposed dim, needs virtual inner-most dim


===== Transpose Parameters ========
Tag: Transpose heuristics Transpose Characteristics:
 BlckX: 128
 input tile size: 32
 output tile size: 32
 elements per tile: 1024
 elements per thread: 8
Unroll group 1, Factor: 8
Unroll group 2, Factor: 8
Virtual inner-most dim:
  split(2, 3)
  merge 4, 3 with innermost1
  merge 0, 2 with innermost2
====================================

C++ exception with description "Expected ref.equal(cg_outputs.at(0)) to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
Exception raised from TestBody at /opt/pytorch/nvfuser/test/test_gpu_view.cpp:2537 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7feb6df2e50e in /opt/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x7d (0x7feb6dee4aeb in /opt/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0xc6111d (0x55e3084b211d in ./build/nvfuser_tests)
frame #3: <unknown function> + 0xe0288b (0x55e30865388b in ./build/nvfuser_tests)
frame #4: <unknown function> + 0xdfb831 (0x55e30864c831 in ./build/nvfuser_tests)
frame #5: <unknown function> + 0xdd0648 (0x55e308621648 in ./build/nvfuser_tests)
frame #6: <unknown function> + 0xdd10d6 (0x55e3086220d6 in ./build/nvfuser_tests)
frame #7: <unknown function> + 0xdd19dd (0x55e3086229dd in ./build/nvfuser_tests)
frame #8: <unknown function> + 0xde18ef (0x55e3086328ef in ./build/nvfuser_tests)
frame #9: <unknown function> + 0xe0376a (0x55e30865476a in ./build/nvfuser_tests)
frame #10: <unknown function> + 0xdfc7f5 (0x55e30864d7f5 in ./build/nvfuser_tests)
frame #11: <unknown function> + 0xde0057 (0x55e308631057 in ./build/nvfuser_tests)
frame #12: <unknown function> + 0x409b0d (0x55e307c5ab0d in ./build/nvfuser_tests)
frame #13: <unknown function> + 0x409082 (0x55e307c5a082 in ./build/nvfuser_tests)
frame #14: <unknown function> + 0x29d90 (0x7feb417d0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #15: __libc_start_main + 0x80 (0x7feb417d0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x3b8545 (0x55e307c09545 in ./build/nvfuser_tests)
" thrown in the test body.
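
As a sanity check on the numbers in the log above, the sketch below recomputes the expected output layout in plain C++ without nvFuser or ATen. The helpers `swapAxes`, `composedPermutation`, `permuteShape` and `numel` are hypothetical names introduced here for illustration; they are not part of nvFuser. The tile arithmetic at the end assumes that "elements per tile" is the product of the two tile sizes and that "elements per thread" is that product divided by BlckX, which is consistent with the printed values but is not confirmed by the log itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

using Perm = std::array<int, 5>;
using Shape = std::array<int64_t, 5>;

// Hypothetical helper (not an nvFuser API): swap axes a and b of the current
// dimension order, mirroring what transpose(tv, a, b) does to the logical axes.
Perm swapAxes(Perm perm, int a, int b) {
  std::swap(perm[a], perm[b]);
  return perm;
}

// Compose the two transposes from the repro: transpose(tv0, 1, 4) followed by
// transpose(tv1, 0, 3). Entry i names the input axis that ends up at output axis i.
Perm composedPermutation() {
  Perm perm{0, 1, 2, 3, 4};
  perm = swapAxes(perm, 1, 4);
  perm = swapAxes(perm, 0, 3);
  return perm;
}

// Apply a permutation to a shape: out[i] = shape[perm[i]].
Shape permuteShape(const Shape& shape, const Perm& perm) {
  Shape out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = shape[perm[i]];
  }
  return out;
}

// Total number of elements in a shape.
int64_t numel(const Shape& shape) {
  int64_t n = 1;
  for (int64_t extent : shape) {
    n *= extent;
  }
  return n;
}

// Assumed relationship between the printed heuristic parameters.
constexpr int64_t kInputTile = 32;
constexpr int64_t kOutputTile = 32;
constexpr int64_t kBlockX = 128;
static_assert(kInputTile * kOutputTile == 1024, "elements per tile");
static_assert(kInputTile * kOutputTile / kBlockX == 8, "elements per thread");
```

For the repro's input shape {2, 7, 102400, 4, 5}, the composed permutation is {3, 4, 2, 0, 1}, so `permuteShape` yields {4, 5, 102400, 2, 7} and `numel` yields 28672000. Both agree with the `shape` and `num_elems` lines in the Transpose Stats, which suggests the scheduler sees the expected problem and the mismatch arises later, in indexing.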
