Re-enable skipped test_normalization.py::test_instance_norm_multigpu#4833
jacobhinkle wants to merge 1 commit into `main`
Conversation
!test
Reading #1728, I wonder what fixed the issue. Is the fusion scheduled differently?
jjsjann123
left a comment
More than happy to enable the test again.
But the same question as @naoyam's: do we know that we can close the original issue now?
The test was fixed by #4748. Repro of the failing segment:

```python
# CUDA devices:
#  0: NVIDIA H100 NVL
#  1: NVIDIA H100 NVL
# torch version: 2.8.0a0+34c6371d24.nvInternal
# cuda version: 13.0
# nvfuser version: 0.2.27+git8830aff
import torch
from nvfuser import FusionDefinition, DataType


def nvfuser_fusion_id2(fd: FusionDefinition) -> None:
    T0 = fd.define_tensor(shape=[-1, -1, -1, -1, -1], contiguity=[True, None, None, None, None], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T2 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False)
    T3 = fd.define_tensor(shape=[-1, -1, -1, -1, -1], contiguity=[True, True, True, True, True], dtype=DataType.Float, is_cpu=False)
    S4 = fd.define_scalar(1, dtype=DataType.Int)
    S5 = fd.ops.size(T3, dim=2)
    S6 = fd.ops.mul(S4, S5)
    S7 = fd.ops.size(T3, dim=3)
    S8 = fd.ops.mul(S6, S7)
    S9 = fd.ops.size(T3, dim=4)
    S10 = fd.ops.mul(S8, S9)
    S11 = fd.ops.cast(S10, dtype=DataType.Float)
    S12 = fd.ops.reciprocal(S11)
    T13 = fd.ops.mul(T2, S12)
    T14 = fd.ops.mul(T1, T1)
    T15 = fd.ops.mul(T13, T14)
    T16 = fd.ops.broadcast(T15, is_broadcast_dim=[False, False, True, True, True])
    T17 = fd.ops.mul(T3, T16)
    T18 = fd.ops.sub(T0, T17)
    T19 = fd.ops.squeeze(T0, dims=[2, 3, 4], squeeze_expanded=True)
    S20 = fd.ops.size(T0, dim=2)
    S21 = fd.ops.size(T0, dim=3)
    S22 = fd.ops.mul(S20, S21)
    S23 = fd.ops.size(T0, dim=4)
    S24 = fd.ops.mul(S22, S23)
    S25 = fd.ops.cast(S24, dtype=DataType.Float)
    T26 = fd.ops.mul(T19, S25)
    T27 = fd.ops.mul(T26, S12)
    T28 = fd.ops.broadcast(T27, is_broadcast_dim=[False, False, True, True, True])
    T29 = fd.ops.sub(T18, T28)
    T30 = fd.ops.broadcast(T1, is_broadcast_dim=[False, False, True, True, True])
    T31 = fd.ops.mul(T29, T30)
    fd.add_output(T31)


with FusionDefinition() as fd:
    nvfuser_fusion_id2(fd)

inputs = [
    torch.randn(2, dtype=torch.float32, device='cuda:1').as_strided((2, 4, 128, 128, 128), (1, 0, 0, 0, 0)),
    torch.testing.make_tensor((2, 4), dtype=torch.float32, device='cuda:1'),
    torch.testing.make_tensor((2, 4), dtype=torch.float32, device='cuda:1'),
    torch.testing.make_tensor((2, 4, 128, 128, 128), dtype=torch.float32, device='cuda:1'),
]
fd.execute(inputs)
```

Params before #4748 and after #4748: *(scheduler parameter dumps were attached as screenshots and are not reproduced here.)*

We are no longer using a 2D grid, but I'm not sure yet what part of that PR caused the change in heuristic. cc @zasdfgbnm
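A side note on the repro's first input: `as_strided` with zero strides builds an "expanded" tensor in which only two real elements back the full logical shape, which is presumably why the fusion squeezes `T0` with `squeeze_expanded=True`. A minimal NumPy sketch of the same zero-stride trick (note that NumPy strides are in bytes, while torch strides are in elements):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Mimic torch.randn(2).as_strided((2, 4, 128, 128, 128), (1, 0, 0, 0, 0))
# with smaller dims for the demo: a zero stride makes every index along
# that dim alias the same memory, so 2 real elements back the whole array.
base = np.arange(2, dtype=np.float32)
view = as_strided(base, shape=(2, 4, 8), strides=(base.itemsize, 0, 0))

assert view.shape == (2, 4, 8)
# Every entry of view[i] aliases base[i]:
assert (view[0] == base[0]).all()
assert (view[1] == base[1]).all()
```

Because the view aliases storage rather than copying it, the 2x4x128x128x128 first input in the repro occupies only two floats of memory.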
It seems to be just luck. That PR surfaced an unfixed bug in the breakpoint logic (#4773). The bug is still there, since the PR only changed the use of bytes to bits; however, that did affect the breakpoint decision, which I believe is why this segment no longer uses the 2D scheduling. Since the bug remains unfixed, this segment may use the 2D scheduling again once the bug is fixed. I also wonder what the root cause of this test case's failure is. If there is a certain case where the 2D scheduling should not be used, we should fold that condition into the algorithm that determines the breakpoint.
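To make the flavor of that failure mode concrete, here is a toy sketch (entirely hypothetical, not nvFuser's actual breakpoint heuristic) of how comparing a size in one unit against a threshold meant for another can silently flip a scheduling decision:

```python
# Hypothetical illustration only: a threshold intended in bits, fed sizes
# that may be in bytes, flips a 2D-vs-1D scheduling decision.
BITS_PER_BYTE = 8
THRESHOLD_BITS = 1024 * BITS_PER_BYTE  # threshold meant to be 1 KiB


def use_2d_schedule(tensor_bytes: int, sizes_in_bits: bool) -> bool:
    # Correct path converts bytes to bits; the buggy path compares
    # raw bytes against a bit threshold, an 8x mismatch.
    size = tensor_bytes * BITS_PER_BYTE if sizes_in_bits else tensor_bytes
    return size > THRESHOLD_BITS


# The same 2 KiB tensor gets opposite decisions under the two conventions:
assert use_2d_schedule(2048, sizes_in_bits=True) is True
assert use_2d_schedule(2048, sizes_in_bits=False) is False
```

An 8x unit mismatch like this would shift where the breakpoint lands without raising any error, matching the observation above that the segment merely stopped taking the 2D path rather than being scheduled correctly by design.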
This test, which was skipped in #1730, no longer fails.
Fixes #1728