Torch.inverse should be replaced to achieve stable inverse result. #5983

@mingxin-zheng

Describe the bug

The Spacing transform relies on torch.inverse() when computing the affine matrix. However, the result of this operation deviates from NumPy's, and the size of the deviation changes between PyTorch versions. As a result, the integration test test_integration_segmentation3d is affected when the PyTorch version changes.

As an example, torch.inverse() gives different results in the PyTorch 22.09 and 22.11 containers. Steps to reproduce are listed below. The error appears to be on the order of 1e-7 on Ampere GPUs.
We use the NumPy function np.linalg.inv() as the gold standard.
For this particular input, the error in 22.11/22.12/23.01 is about 3 times higher than in 22.09 (3.4737e-07 in 22.11 vs. 1.1921e-07 in 22.09).
We tried turning off TF32 and setting the matmul precision to highest (FP32) in the PyTorch containers (22.11-23.01), and the result is the same.
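One possible way to get a version-independent reference is to invert in float64 and cast back to the input dtype. The following is a minimal NumPy-only sketch of that idea, not the fix adopted by the project; the helper names `inverse_via_float64` and `identity_residual` are hypothetical, and the residual against the identity is a different metric from the one printed by the reproduction snippets.

```python
import numpy as np

# Float32 affine matrix from this report (a batch containing one 4x4 matrix).
C = np.array(
    [[[0.01869158819317817688, 0.0, 0.0, -0.99065423011779785156],
      [0.0, 0.01250000018626451492, 0.0, -0.99374997615814208984],
      [0.0, 0.0, 0.01098901126533746719, -0.99450546503067016602],
      [0.0, 0.0, 0.0, 1.0]]],
    dtype=np.float32,
)


def inverse_via_float64(matrix):
    """Invert in float64, then cast back to the input dtype (hypothetical helper)."""
    matrix = np.asarray(matrix)
    return np.linalg.inv(matrix.astype(np.float64)).astype(matrix.dtype)


def identity_residual(matrix, inverse):
    """Summed absolute deviation of matrix @ inverse from the identity, in float64."""
    m = np.asarray(matrix, dtype=np.float64)
    inv = np.asarray(inverse, dtype=np.float64)
    return float(np.abs(m @ inv - np.eye(m.shape[-1])).sum())


C_inv = inverse_via_float64(C)
print(C_inv)
print("residual:", identity_residual(C, C_inv))
```

The same residual function can be applied to the output of torch.inverse() (converted to a NumPy array) to compare containers against a fixed float64 baseline.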

Result in 22.09:

Input tensor:
tensor([[[1.8692e-02, 0.0000e+00, 0.0000e+00, -9.9065e-01],
[0.0000e+00, 1.2500e-02, 0.0000e+00, -9.9375e-01],
[0.0000e+00, 0.0000e+00, 1.0989e-02, -9.9451e-01],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00]]])
Output of torch.inverse in PyTorch 22.09:
tensor([[[5.3500e+01, 0.0000e+00, -0.0000e+00, 5.3000e+01],
[0.0000e+00, 8.0000e+01, -0.0000e+00, 7.9500e+01],
[0.0000e+00, 0.0000e+00, 9.1000e+01, 9.0500e+01],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00]]])
Error compared with numpy
tensor(1.1921e-07)

Result in 22.11:

Input tensor:
tensor([[[1.8692e-02, 0.0000e+00, 0.0000e+00, -9.9065e-01],
[0.0000e+00, 1.2500e-02, 0.0000e+00, -9.9375e-01],
[0.0000e+00, 0.0000e+00, 1.0989e-02, -9.9451e-01],
[0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00]]])
Output of torch.inverse in PyTorch 22.11:
tensor([[[ 5.3500e+01, -2.2252e-06, -1.3113e-06, 5.3000e+01],
[ 6.3144e-06, 8.0000e+01, -1.9670e-06, 7.9500e+01],
[-6.6143e-06, -9.4399e-06, 9.1000e+01, 9.0500e+01],
[-7.4506e-09, -4.1986e-08, -2.4742e-08, 1.0000e+00]]])
Error compared with numpy
tensor(3.4737e-07)
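In the 22.11 output above, the entries that should be zero, including the last row, pick up small nonzero values. If the input is known to be a homogeneous affine matrix of the form [[A, t], [0, 1]], one alternative is to invert it block-wise as [[A^-1, -A^-1 t], [0, 1]], which keeps the last row exact by construction. The sketch below illustrates this in NumPy under that assumption; `invert_affine` is a hypothetical helper, not an existing MONAI or PyTorch function, and it does not validate the last row of its input.

```python
import numpy as np


def invert_affine(affine):
    """Invert a homogeneous affine [[A, t], [0, 1]] block-wise (hypothetical helper).

    Assumes a floating-point input whose last row is exactly [0, ..., 0, 1];
    this assumption is not checked.
    """
    affine = np.asarray(affine)
    n = affine.shape[-1] - 1
    a = affine[..., :n, :n].astype(np.float64)  # linear part
    t = affine[..., :n, n:].astype(np.float64)  # translation column
    a_inv = np.linalg.inv(a)
    out = np.zeros(affine.shape, dtype=np.float64)
    out[..., :n, :n] = a_inv
    out[..., :n, n:] = -a_inv @ t
    out[..., n, n] = 1.0  # last row stays exactly [0, ..., 0, 1]
    return out.astype(affine.dtype)


# Float32 affine matrix from this report (a batch containing one 4x4 matrix).
C = np.array(
    [[[0.01869158819317817688, 0.0, 0.0, -0.99065423011779785156],
      [0.0, 0.01250000018626451492, 0.0, -0.99374997615814208984],
      [0.0, 0.0, 0.01098901126533746719, -0.99450546503067016602],
      [0.0, 0.0, 0.0, 1.0]]],
    dtype=np.float32,
)

C_inv = invert_affine(C)
print(C_inv)
```

Whether this is appropriate depends on the caller guaranteeing the affine structure; for general matrices a full inverse is still needed.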

To Reproduce

  1. Run the following shell snippet to get the result in the PyTorch 22.09 container.
docker run -it --rm --gpus=all -e NVIDIA_TF32_OVERRIDE=0 nvcr.io/nvidia/pytorch:22.09-py3 bash -c "python << EOF
import torch
import numpy as np
torch.set_printoptions(sci_mode=True)
torch.set_float32_matmul_precision(\"highest\")
C = torch.tensor([[[ 0.01869158819317817687988281250000,
          0.00000000000000000000000000000000,
          0.00000000000000000000000000000000,
         -0.99065423011779785156250000000000],
        [ 0.00000000000000000000000000000000,
          0.01250000018626451492309570312500,
          0.00000000000000000000000000000000,
         -0.99374997615814208984375000000000],
        [ 0.00000000000000000000000000000000,
          0.00000000000000000000000000000000,
          0.01098901126533746719360351562500,
         -0.99450546503067016601562500000000],
        [ 0.00000000000000000000000000000000,
          0.00000000000000000000000000000000,
          0.00000000000000000000000000000000,
          1.00000000000000000000000000000000]]])

print(\"Input tensor: \")
print(C)
print(\"Output of torch.inverse in PyTorch 22.09: \")
print(torch.inverse(C))
print(\"Error compared with numpy\")
print((torch.tensor(np.array(C) @ np.linalg.inv(np.array(C))) - C @ torch.inverse(C)).abs().sum())
EOF
"
  2. Run the same snippet in a container from release 22.11 or later. Here we use 22.11 as an example:
docker run -it --rm --gpus=all -e NVIDIA_TF32_OVERRIDE=0 nvcr.io/nvidia/pytorch:22.11-py3 bash -c "python << EOF
import torch
import numpy as np
torch.set_printoptions(sci_mode=True)
torch.set_float32_matmul_precision(\"highest\")
C = torch.tensor([[[ 0.01869158819317817687988281250000,
           0.00000000000000000000000000000000,
           0.00000000000000000000000000000000,
          -0.99065423011779785156250000000000],
         [ 0.00000000000000000000000000000000,
           0.01250000018626451492309570312500,
           0.00000000000000000000000000000000,
          -0.99374997615814208984375000000000],
         [ 0.00000000000000000000000000000000,
           0.00000000000000000000000000000000,
           0.01098901126533746719360351562500,
          -0.99450546503067016601562500000000],
         [ 0.00000000000000000000000000000000,
           0.00000000000000000000000000000000,
           0.00000000000000000000000000000000,
           1.00000000000000000000000000000000]]])

print(\"Input tensor: \")
print(C)
print(\"Output of torch.inverse in PyTorch 22.11: \")
print(torch.inverse(C))
print(\"Error compared with numpy\")
print((torch.tensor(np.array(C) @ np.linalg.inv(np.array(C))) - C @ torch.inverse(C)).abs().sum())
EOF
"

