Conversation
Review updated until commit 045202a.
Force-pushed 5c424bd to 99954c6.
Force-pushed 99954c6 to 3662100.
csrc/host_ir/executor.cpp (outdated):

```cpp
FusionGuard fg(container_.get());
expr_evaluator_.bind(
    NamedScalar::getParallelIndex(ParallelType::DIDx),
    communicator_->deviceId());
```
Is this the right thing to do in the foreseeable future? Isn't DIDx decided also by the mesh?
Thanks for the question. IIUC, you're asking whether `deviceIdx.x` should be:
- an absolute device ID (e.g., always `1` for device `1`), or
- a mesh-relative index (e.g., `0` for device `1` if the mesh is `{1}`).

The current PR implements option 1. However, your question made me think, and I now see that option 2 makes more sense, especially when we move to 2D (with the caveat that the mesh is per-Tensor and can change during a fusion).

I decided to rename that NamedScalar to "myDeviceId" for now. Let me know if this sounds good to you.
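To make the two options concrete, here is a minimal, hypothetical sketch (not nvFuser API) of how a mesh-relative index differs from the absolute device ID, assuming the mesh is represented as a flat vector of device IDs:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper, not part of nvFuser: option 1 would bind the
// absolute device ID directly, while option 2 would bind the device's
// position inside the (per-Tensor) mesh, as computed here.
std::optional<int64_t> meshRelativeIndex(
    const std::vector<int64_t>& mesh,
    int64_t device_id) {
  for (size_t i = 0; i < mesh.size(); ++i) {
    if (mesh[i] == device_id) {
      return static_cast<int64_t>(i); // index within the mesh
    }
  }
  return std::nullopt; // this device does not participate in the mesh
}
```

With the mesh `{1}`, device `1` has absolute ID `1` (option 1) but mesh-relative index `0` (option 2), matching the example above.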
> with the caveat that the mesh is per-Tensor and can change during a fusion
Yes. Therefore, I was also unsure about deviceIdx.x being a "global" variable as in the previous version. Let me read your new changes...
on top of - NVIDIA#4387

# What

Add Stream lowering to Allgather p2p linear, with NCCL backend.

For example, `MultiDeviceStreamParallelTypeTest.AllgatherP2p` from `tests/cpp/test_multidevice_stream_parallel_type.cpp`:

```cpp
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
fusion->addInput(tv0);
fusion->addOutput(tv1);
const DeviceMesh mesh = DeviceMesh::createForNumDevices(communicator_->size());
tv0->setDeviceMesh(mesh);
tv1->setDeviceMesh(mesh);
tv0->axis(0)->parallelize(ParallelType::DIDx);
tv1->axis(0)->parallelize(ParallelType::Stream);
```

is lowered to:

```
%HostIrContainer { (T0_g_float[ideviceIdx.x0{i0}, iS1{i2}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T1_g_float[iStreamIdx2{i0}, iS3{i2}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T1_g_float[iStreamIdx2{i0}, iS3{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T1_g_float[iStreamIdx2{i0}, iS3{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( i0 * i2 ), zero_init=false, resets_to_zero=false)
  GetCurrentStream into Stream 0
  FOR StreamIdx in iStreamIdx2{i0}:
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR StreamIdx in iStreamIdx2{i0}:
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    T3_l_float[iS5{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T1_g_float[iStreamIdx2{i0}, iS3{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iStreamIdx2{i0}, index = StreamIdx )
    IF Manual ( StreamIdx == deviceIdx.x ):
      T2_l_float[iS4{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T0_g_float[ideviceIdx.x0{i0}, iS1{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{i0}, index = 0 )
      T3_l_float[iS5{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}) = Set( T2_l_float[iS4{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      StartCoalescing
      P2PCommunication 30 (type=recv, buffer=T3_l_float[iS5{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=StreamIdx, backend=NCCL)
      P2PCommunication 31 (type=send, buffer=T0_g_float[ideviceIdx.x0{i0}, iS1{i2}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=StreamIdx, backend=NCCL)
      EndCoalescing 32
      Wait Communication 32
    SetCurrentStream to Stream 0
    Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```

A test with an overlapped matmul is also proposed in `AG_matmul_P2p`, which generates the following host program:

```
%HostIrContainer { (T0_g_float[ideviceIdx.x0{i0}, iS1{i2}, iS2{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS3{i4}, iS4{i5}] (DeviceMesh{0 1 2 3 4 5 6 7})) -> (T2_g_float[iStreamIdx5{i0}, iS6{i2}, iS7{i5}, rS8{i3}] (DeviceMesh{0 1 2 3 4 5 6 7})) :
  T3_g_float[iStreamIdx9{i0}, iS10{i2}, iS11{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T3_g_float[iStreamIdx9{i0}, iS10{i2}, iS11{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( i0 * i2 ) * i3 ), zero_init=false, resets_to_zero=false)
  T2_g_float[iStreamIdx5{i0}, iS6{i2}, iS7{i5}, rS8{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}) = ALLOCATE(buffer=T2_g_float[iStreamIdx5{i0}, iS6{i2}, iS7{i5}, rS8{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), mem_type=global, size=( ( i0 * i2 ) * i5 ), zero_init=false, resets_to_zero=false )
  GetCurrentStream into Stream 0
  FOR StreamIdx in iStreamIdx9{i0}:
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    Synchronize Stream 0
  FOR StreamIdx in iStreamIdx9{i0}:
    SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    T5_l_float[iS14{i2}, iS15{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T3_g_float[iStreamIdx9{i0}, iS10{i2}, iS11{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iStreamIdx9{i0}, index = StreamIdx )
    IF Manual ( StreamIdx == deviceIdx.x ):
      T4_l_float[iS12{i2}, iS13{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T0_g_float[ideviceIdx.x0{i0}, iS1{i2}, iS2{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = ideviceIdx.x0{i0}, index = 0 )
      T5_l_float[iS14{i2}, iS15{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}) = Set( T4_l_float[iS12{i2}, iS13{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), cache_op=Streaming )
    ELSE:
      StartCoalescing
      P2PCommunication 41 (type=recv, buffer=T5_l_float[iS14{i2}, iS15{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=StreamIdx, backend=NCCL)
      P2PCommunication 42 (type=send, buffer=T0_g_float[ideviceIdx.x0{i0}, iS1{i2}, iS2{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), peer=StreamIdx, backend=NCCL)
      EndCoalescing 43
      Wait Communication 43
    T6_l_float[iS16{i2}, iS17{i5}, rS18{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}) = HirAliasSelect( T2_g_float[iStreamIdx5{i0}, iS6{i2}, iS7{i5}, rS8{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), axis = iStreamIdx5{i0}, index = StreamIdx )
    T6_l_float[iS16{i2}, iS17{i5}, rS18{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}) = matmul(T5_l_float[iS14{i2}, iS15{i3}] (DeviceMesh{0 1 2 3 4 5 6 7}), T1_g_float[iS3{i4}, iS4{i5}] (DeviceMesh{0 1 2 3 4 5 6 7}))
    SetCurrentStream to Stream 0
    Synchronize Stream ( StreamIdx % numberOfStreams )
} // %HostIrContainer
```
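The control flow of the lowered allgather can be summarized with a small, hypothetical host-side simulation (plain C++, no NCCL, names invented for illustration): each loop iteration `StreamIdx` runs on stream `StreamIdx % numberOfStreams`; the iteration whose index equals the local rank copies its own shard (the IF branch), while every other iteration receives that peer's shard (the ELSE branch, whose coalesced send of the local shard is not modeled here):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simulation of the allgather-via-p2p loop: shards[r] stands
// for the input shard owned by rank r, and the return value is the
// allgathered output buffer as assembled on rank my_rank.
std::vector<float> allgatherP2p(
    const std::vector<std::vector<float>>& shards,
    int64_t my_rank,
    int64_t num_streams) {
  const int64_t world = static_cast<int64_t>(shards.size());
  std::vector<float> out;
  for (int64_t stream_idx = 0; stream_idx < world; ++stream_idx) {
    // SetCurrentStream to Stream ( StreamIdx % numberOfStreams )
    const int64_t stream = stream_idx % num_streams;
    (void)stream; // streams only affect scheduling, not the result
    if (stream_idx == my_rank) {
      // IF branch: copy our own input shard into output slot my_rank
      out.insert(out.end(), shards[my_rank].begin(), shards[my_rank].end());
    } else {
      // ELSE branch: recv peer stream_idx's shard into output slot
      // stream_idx (the matching send of our shard is omitted)
      out.insert(
          out.end(), shards[stream_idx].begin(), shards[stream_idx].end());
    }
  }
  return out;
}
```

Because each output slot is written by exactly one iteration, the per-iteration work is independent and can overlap across the round-robin streams, which is what the `AG_matmul_P2p` variant exploits by issuing a matmul on each slot as soon as it arrives.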