FusionExecutor fe;
fe.registerPostLoweringHook([](kir::Kernel* kernel) {
Kernel after modification:
__global__ void kernel1(Tensor<float, 2, 2> T0, Tensor<float, 2, 2> T2) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.y) + (32 * ((nvfuser_index_t)threadIdx.x));
  nvfuser_index_t i1;
  i1 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)threadIdx.y));
  float* T1 = reinterpret_cast<float*>(array + smem_offset + 0);
  uint64_t* T3 = reinterpret_cast<uint64_t*>(array + smem_offset + 4096);
  mbarrier::init(toSmem(T3), 1024);
  T1[i0]
     = T0[i0];
  uint64_t i2;
  i2 = mbarrier::arrive(toSmem(T3));
  mbarrier::wait(toSmem(T3), i2);
  T2[i1]
     = T1[i1];
  mbarrier::inval(toSmem(T3));
}
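For readers new to the arrive/wait pattern used in the kernel above, here is a host-side C++ analogy. It is only a sketch and not the CUDA implementation: HostMBarrier and stagedCopy are hypothetical names, std::thread stands in for CUDA threads, and a mutex plus condition variable stands in for the hardware barrier. Like the kernel, each thread writes one staging element, arrives, waits on the returned token, and then reads a different staging element.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical host-side analogy of an arrive-wait barrier (not the PTX mbarrier).
class HostMBarrier {
 public:
  explicit HostMBarrier(uint32_t expected)
      : expected_(expected), pending_(expected) {
    assert(expected > 0);
  }

  // Records one arrival and returns a token naming the phase arrived in.
  uint64_t arrive() {
    std::lock_guard<std::mutex> lock(mutex_);
    const uint64_t phase = phase_;
    if (--pending_ == 0) {
      // Last arrival: reset the count and complete the current phase.
      pending_ = expected_;
      ++phase_;
      cv_.notify_all();
    }
    return phase;
  }

  // Blocks until the phase named by the token has completed.
  void wait(uint64_t phase) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] { return phase_ != phase; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  const uint32_t expected_;
  uint32_t pending_;
  uint64_t phase_ = 0;
};

// Copies `in` to the result through a staging buffer using n * n threads.
// Each thread writes staging[i0] but reads staging[i1], so the read is only
// safe after every thread has arrived at the barrier.
inline std::vector<float> stagedCopy(const std::vector<float>& in, int n) {
  assert(static_cast<int>(in.size()) == n * n);
  std::vector<float> staging(in.size());
  std::vector<float> out(in.size());
  HostMBarrier barrier(static_cast<uint32_t>(n * n));
  std::vector<std::thread> threads;
  for (int x = 0; x < n; ++x) {
    for (int y = 0; y < n; ++y) {
      threads.emplace_back([&, x, y] {
        const int i0 = y + n * x;
        const int i1 = x + n * y;
        staging[i0] = in[i0];
        const uint64_t token = barrier.arrive();
        barrier.wait(token);
        out[i1] = staging[i1];
      });
    }
  }
  for (auto& t : threads) {
    t.join();
  }
  return out;
}
```

Splitting the barrier into arrive and wait is what distinguishes it from a plain block sync: a thread could do independent work between the two calls, although this sketch (like the kernel) waits immediately.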
Thanks for this. This is really a nice way of testing initial build-out features
!build
mbarrier: arrive wait barrier on smem
runtime/mbarrier.cu
Outdated
"{\n"
".reg .pred P1;\n"
"LAB_WAIT:\n"
"mbarrier.try_wait.shared.b64 P1, [%0], %1;\n"
Try wait is only available on SM90
jacobhinkle left a comment
This is a great step! I will have a look at updating the smem allocator to recognize/use these. One question: I think we support sm75, so will these new kernel nodes work in that case, and will they fall back to a synchronous barrier?
struct DataTypeToNativeType<data_type> { \
  using type = native_type; \
}; \
Was this unused? Could we use this in the switch statement below in getPrimDataTypeSize?
They were used. They were just a copy-paste of DEFINE_DATATYPE_TO_NATIVE_TYPE, so I replaced the copy-pasted code with DEFINE_DATATYPE_TO_NATIVE_TYPE.
This cannot be used in primDataTypeSize either, because it requires the data type to be a compile-time constant, which is not the case for primDataTypeSize.
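To illustrate the compile-time constraint described in this reply, here is a minimal sketch. PrimType, ToNative, DEFINE_TO_NATIVE, sizeOfStatic, and sizeOfRuntime are hypothetical stand-ins for illustration, not nvFuser's actual definitions. The trait can only be indexed by a template argument, so a function that receives the data type as a runtime value has to enumerate the cases itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a primitive data type enum.
enum class PrimType { Float, Double, Int32, Int64 };

// Compile-time mapping from an enum value to a native C++ type.
template <PrimType T>
struct ToNative;

#define DEFINE_TO_NATIVE(data_type, native_type) \
  template <>                                    \
  struct ToNative<data_type> {                   \
    using type = native_type;                    \
  }

DEFINE_TO_NATIVE(PrimType::Float, float);
DEFINE_TO_NATIVE(PrimType::Double, double);
DEFINE_TO_NATIVE(PrimType::Int32, int32_t);
DEFINE_TO_NATIVE(PrimType::Int64, int64_t);

#undef DEFINE_TO_NATIVE

// Works because T is a template argument, known at compile time.
template <PrimType T>
constexpr std::size_t sizeOfStatic() {
  return sizeof(typename ToNative<T>::type);
}

// Here `t` is a runtime value, so ToNative<t> would not compile.
// The function has to list every case explicitly.
inline std::size_t sizeOfRuntime(PrimType t) {
  switch (t) {
    case PrimType::Float:
      return sizeof(float);
    case PrimType::Double:
      return sizeof(double);
    case PrimType::Int32:
      return sizeof(int32_t);
    case PrimType::Int64:
      return sizeof(int64_t);
  }
  assert(false && "unhandled PrimType");
  return 0;
}
```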
On sm < 80, we should not lower into code that uses mbarrier. It must use sync threads.
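The architecture rule discussed in this thread could be expressed as a small selection function. This is a sketch under the thresholds stated in the comments (mbarrier only on sm >= 80, try_wait only on sm90 and later); SyncKind, chooseSync, and hasTryWait are hypothetical names and do not correspond to any existing lowering pass.

```cpp
#include <cassert>

// Hypothetical enum naming the two synchronization strategies.
enum class SyncKind { BlockSync, MBarrier };

// Encodes a compute capability such as 8.6 as the integer 86.
constexpr int smVersion(int major, int minor) {
  return major * 10 + minor;
}

// Assumption taken from the review thread: below sm80, fall back to a
// block-wide sync instead of emitting mbarrier code.
constexpr SyncKind chooseSync(int major, int minor) {
  return smVersion(major, minor) >= 80 ? SyncKind::MBarrier
                                       : SyncKind::BlockSync;
}

// Assumption taken from the review thread: try_wait needs sm90 or newer.
constexpr bool hasTryWait(int major, int minor) {
  return smVersion(major, minor) >= 90;
}
```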
liqiangxl left a comment
LGTM. Just 2 minor comments.
}

__device__ inline void wait(uint32_t smem_barrier_ptr, uint64_t state) {
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
Yeah, agree. Changed that.
for (auto expr : fe.kernel()->topLevelExprs()) {
  remaining_mbarrier_exprs.erase(&typeid(*expr));
}
EXPECT_TRUE(remaining_mbarrier_exprs.empty());
What's the purpose of this part? Does it ensure that all MBarrier expressions are correctly integrated into the kir? I saw that other test cases directly check the kernel string, e.g. FusionCodegenAllocatedScalars_CUDA.
Yes, that's it. Directly checking the kernel string would also work.
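The checking idea in this thread can be sketched in isolation as follows. The expression classes and containsAllMBarrierExprs are hypothetical stand-ins, not the kir node types. The sketch also swaps the raw type_info pointers used in the test for std::type_index, which is the standard wrapper intended for use as a container key.

```cpp
#include <cassert>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <unordered_set>
#include <vector>

// Hypothetical polymorphic expression hierarchy.
struct Expr {
  virtual ~Expr() = default;
};
struct MBarrierInit : Expr {};
struct MBarrierArrive : Expr {};
struct MBarrierWait : Expr {};
struct MBarrierInvalidate : Expr {};
struct LoadStore : Expr {};

// Returns true when every expected mbarrier expression type appears at
// least once in `exprs`.
inline bool containsAllMBarrierExprs(
    const std::vector<std::unique_ptr<Expr>>& exprs) {
  std::unordered_set<std::type_index> remaining{
      std::type_index(typeid(MBarrierInit)),
      std::type_index(typeid(MBarrierArrive)),
      std::type_index(typeid(MBarrierWait)),
      std::type_index(typeid(MBarrierInvalidate))};
  for (const auto& expr : exprs) {
    const Expr& e = *expr;
    // typeid on a polymorphic reference yields the dynamic type.
    remaining.erase(std::type_index(typeid(e)));
  }
  return remaining.empty();
}
```

Compared with matching the generated kernel string, this kind of check is insensitive to formatting changes in codegen, but it says nothing about the order or operands of the expressions.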
@zasdfgbnm I merged #996 so you might want to retry a !build and check the code diff output.
!build
nvfuser-ci/job-70932017: codegen_diff_4/9 It seems the codegen diff script wrote so much output to stdout that it exceeded the CI log size limit. I've fixed this in the CI. If it's a concern to you, feel free to restart a new build.
!build
LGTM
Fixes: #992
Required by: #993

This PR introduces mbarrier, an arrive-wait barrier on shared memory. The code for mbarrier itself is ready to use; however, no pass in our lowering currently uses this barrier. In a future PR, I will explore replacing our block syncs with mbarrier where it makes sense.

In this PR, a new test, MBarrierTest.Simple, is added. This test is a simple gmem->smem->gmem copy kernel. The fusion is scheduled in a way that requires a block sync, and the test modifies the lowered kernel to replace the block sync with mbarrier. Because no lowering pass uses mbarrier yet, the test is written in a hacky way: it first lowers to a kernel and then modifies the lowered kernel.
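The lower-then-modify approach described above can be modeled in miniature as follows. Everything here is a hypothetical simplification: MiniKernel holds expression names as strings rather than kir nodes, and MiniExecutor only mimics the idea of a registerPostLoweringHook callback; neither reflects the real FusionExecutor API beyond the hook name seen in the test.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature kernel: top-level expressions as plain names.
struct MiniKernel {
  std::vector<std::string> top_level_exprs;
};

// Hypothetical executor that runs registered hooks after "lowering".
class MiniExecutor {
 public:
  using Hook = std::function<void(MiniKernel*)>;

  void registerPostLoweringHook(Hook hook) {
    hooks_.push_back(std::move(hook));
  }

  // Stands in for lowering: takes a kernel and applies every hook to it.
  MiniKernel lower(MiniKernel kernel) const {
    for (const auto& hook : hooks_) {
      hook(&kernel);
    }
    return kernel;
  }

 private:
  std::vector<Hook> hooks_;
};

// Rewrites the kernel so that each block sync becomes arrive + wait, with
// barrier initialization at the start and invalidation at the end.
inline void replaceBlockSyncWithMBarrier(MiniKernel* kernel) {
  assert(kernel != nullptr);
  std::vector<std::string> rewritten;
  rewritten.push_back("mbarrier::init");
  for (const auto& expr : kernel->top_level_exprs) {
    if (expr == "block_sync") {
      rewritten.push_back("mbarrier::arrive");
      rewritten.push_back("mbarrier::wait");
    } else {
      rewritten.push_back(expr);
    }
  }
  rewritten.push_back("mbarrier::inval");
  kernel->top_level_exprs = std::move(rewritten);
}
```

A hook of this kind keeps the experiment local to the test: the lowering passes stay untouched until a real pass that emits mbarrier exists.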