`FusionMatmulSchedulerEpilogueBias_CUDA` is failing with misaligned address on RTX4090

```console
❯ ./build/nvfuser_tests --gtest_filter=*NVFuserTest.FusionMatmulSchedulerEpilogueBias_CUDA*
Note: Google Test filter = *NVFuserTest.FusionMatmulSchedulerEpilogueBias_CUDA*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NVFuserTest
[ RUN      ] NVFuserTest.FusionMatmulSchedulerEpilogueBias_CUDA
unknown file: Failure
C++ exception with description "CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f6f3a8477 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2f6f36489b in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f2f6d538298 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: void at::native::gpu_kernel_impl<at::native::BinaryFunctor<float, float, bool, at::native::(anonymous namespace)::CompareEqFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, bool, at::native::(anonymous namespace)::CompareEqFunctor<float> > const&) + 0xc9f (0x7f2f0bc3187f in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: void at::native::gpu_kernel<at::native::BinaryFunctor<float, float, bool, at::native::(anonymous namespace)::CompareEqFunctor<float> > >(at::TensorIteratorBase&, at::native::BinaryFunctor<float, float, bool, at::native::(anonymous namespace)::CompareEqFunctor<float> > const&) + 0x32b (0x7f2f0bc3209b in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: void at::native::opmath_symmetric_gpu_kernel_with_scalars<float, bool, at::native::(anonymous namespace)::CompareEqFunctor<float> >(at::TensorIteratorBase&, at::native::(anonymous namespace)::CompareEqFunctor<float> const&) + 0x105 (0x7f2f0bc479b5 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::native::compare_eq_ne_kernel(at::TensorIteratorBase&, at::native::(anonymous namespace)::EqOpType) + 0x1a9 (0x7f2f0bc24979 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x303ee53 (0x7f2f0d43ee53 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x303eef0 (0x7f2f0d43eef0 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: at::_ops::eq_Tensor::call(at::Tensor const&, at::Tensor const&) + 0x161 (0x7f2f51bd5951 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::native::isclose(at::Tensor const&, at::Tensor const&, double, double, bool) + 0xa5 (0x7f2f516a5da5 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2fe38c4 (0x7f2f525e38c4 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::_ops::isclose::call(at::Tensor const&, at::Tensor const&, double, double, bool) + 0x18b (0x7f2f521b47fb in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::native::allclose(at::Tensor const&, at::Tensor const&, double, double, bool) + 0x21 (0x7f2f516a3f21 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::allclose::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, double, double, bool) + 0x91 (0x7f2f51b8ac11 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x41e0136 (0x7f2f537e0136 in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: at::_ops::allclose::call(at::Tensor const&, at::Tensor const&, double, double, bool) + 0x17f (0x7f2f51bce45f in /home/gaoxiang/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x58e13e (0x55fdc218413e in ./build/nvfuser_tests)
frame #18: <unknown function> + 0x687f17 (0x55fdc227df17 in ./build/nvfuser_tests)
frame #19: <unknown function> + 0x67a7cd (0x55fdc22707cd in ./build/nvfuser_tests)
frame #20: <unknown function> + 0x67a9a5 (0x55fdc22709a5 in ./build/nvfuser_tests)
frame #21: <unknown function> + 0x67ab80 (0x55fdc2270b80 in ./build/nvfuser_tests)
frame #22: <unknown function> + 0x67e35d (0x55fdc227435d in ./build/nvfuser_tests)
frame #23: <unknown function> + 0x6884a7 (0x55fdc227e4a7 in ./build/nvfuser_tests)
frame #24: <unknown function> + 0x67adb2 (0x55fdc2270db2 in ./build/nvfuser_tests)
frame #25: <unknown function> + 0x18ff89 (0x55fdc1d85f89 in ./build/nvfuser_tests)
frame #26: <unknown function> + 0x23850 (0x7f2f09e39850 in /usr/lib/libc.so.6)
frame #27: __libc_start_main + 0x8a (0x7f2f09e3990a in /usr/lib/libc.so.6)
frame #28: <unknown function> + 0x1c7ce5 (0x55fdc1dbdce5 in ./build/nvfuser_tests)
" thrown in the test body.
[  FAILED  ] NVFuserTest.FusionMatmulSchedulerEpilogueBias_CUDA (993 ms)
```
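
CUDA reports kernel errors asynchronously, so the `allclose` frames in the trace above may only be where the error surfaced, not the kernel that caused it. A minimal sketch for narrowing this down, following the hint in the log, is to rerun the same test with `CUDA_LAUNCH_BLOCKING=1`. The helper name `run_blocking` is hypothetical and used only for illustration; the binary path and filter are the ones from the command above.

```shell
# Hypothetical helper (not part of nvFuser): run any command with
# CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous and the
# error is reported at the launch that triggered it.
run_blocking() {
  CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Usage, with the binary and filter from this report:
#   run_blocking ./build/nvfuser_tests --gtest_filter='*NVFuserTest.FusionMatmulSchedulerEpilogueBias_CUDA*'
```

If the blocking run is still inconclusive, running the same command under `compute-sanitizer --tool memcheck` may point at the exact misaligned access.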

`FusionMatmulSchedulerEpilogueBias_CUDA` is failing with misaligned address on RTX4090 #682
