Skip to content

[initial build up] mbarrier: arrive wait barrier on smem#995

Merged
zasdfgbnm merged 19 commits intomainfrom
mbarrier
Oct 10, 2023
Merged

[initial build up] mbarrier: arrive wait barrier on smem#995
zasdfgbnm merged 19 commits intomainfrom
mbarrier

Conversation

@zasdfgbnm
Copy link
Collaborator

@zasdfgbnm zasdfgbnm commented Sep 29, 2023

Fixes: #992 Required by: #993

This PR introduces mbarrier, an arrive-wait barrier on shared memory. The code for mbarrier itself is ready-to-use, however, there is no passes in our lowering currently using this barrier. In future PR, I will explore changing our block syncs with mbarrier when makes sense.

In this PR, a new test MBarrierTest.Simple is added. This test is a simple gmem->smem->gmem copy kernel. The fusion is scheduled in a way that block sync is needed. And in the test, the lowered kernel is modified to replace the block sync with mbarrier. Because there is no lowering pass using mbarrier, this test is written in a hacky way that it lowers to a kernel first and then modifies the lowered kernel.


FusionExecutor fe;

fe.registerPostLoweringHook([](kir::Kernel* kernel) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kernel after modification:

__global__ void kernel1(Tensor<float, 2, 2> T0, Tensor<float, 2, 2> T2) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.y) + (32 * ((nvfuser_index_t)threadIdx.x));
  nvfuser_index_t i1;
  i1 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)threadIdx.y));
  float* T1 = reinterpret_cast<float*>(array + smem_offset + 0);
  uint64_t* T3 = reinterpret_cast<uint64_t*>(array + smem_offset + 4096);
  mbarrier::init(toSmem(T3), 1024);
  T1[i0]
     = T0[i0];
  uint64_t i2;
  i2 = mbarrier::arrive(toSmem(T3));
  mbarrier::wait(toSmem(T3), i2);
  T2[i1]
     = T1[i1];
  mbarrier::inval(toSmem(T3));
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. This is really a nice way of testing initial build-out features

@zasdfgbnm zasdfgbnm changed the title [Not ready] mbarrier: arrive wait barrier on smem [initial build up] mbarrier: arrive wait barrier on smem Sep 30, 2023
@zasdfgbnm
Copy link
Collaborator Author

!build

@zasdfgbnm zasdfgbnm marked this pull request as ready for review September 30, 2023 06:23
@zasdfgbnm zasdfgbnm changed the title [initial build up] mbarrier: arrive wait barrier on smem [initial build up] mbarrier: arrive wait barrier on smem Sep 30, 2023
"{\n"
".reg .pred P1;\n"
"LAB_WAIT:\n"
"mbarrier.try_wait.shared.b64 P1, [%0], %1;\n"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try wait is only available on SM90

Copy link
Collaborator

@jacobhinkle jacobhinkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great step! I will have a look at updating the smem allocator to recognize/use these. One question: I think we support sm75 so will these new kernel nodes work in that case and they fall back to a synchronous barrier?

Comment on lines -342 to -344
struct DataTypeToNativeType<data_type> { \
using type = native_type; \
}; \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this unused? Could we use this in the switch statement below in getPrimDataTypeSize?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were used. They were just a copy-paste of DEFINE_DATATYPE_TO_NATIVE_TYPE, so I replaced the copy-pasted code with DEFINE_DATATYPE_TO_NATIVE_TYPE.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can not be used in primDataTypeSize either, because this requires the data type to be compile-time constant, which is not the case for primDataTypeSize.

@zasdfgbnm
Copy link
Collaborator Author

so will these new kernel nodes work in that case and they fall back to a synchronous barrier?

On sm < 80, we should not lower into code that uses mbarrier. It must use sync threads.

Copy link
Collaborator

@liqiangxl liqiangxl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just 2 minor comments.

}

__device__ inline void wait(uint32_t smem_barrier_ptr, uint64_t state) {
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sample code from the doc seems simpler:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree. Changed that.

for (auto expr : fe.kernel()->topLevelExprs()) {
remaining_mbarrier_exprs.erase(&typeid(*expr));
}
EXPECT_TRUE(remaining_mbarrier_exprs.empty());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of this part? Does it ensure that all MBarrier expressions are correctly integrated into the kir? I saw other test cases are directly checking kernel string, e.g. FusionCodegenAllocatedScalars_CUDA

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's it. And directly checking kernel string is also a way.

@jacobhinkle
Copy link
Collaborator

@zasdfgbnm I merged #996 so you might want to retry a !build and check the code diff output.

@xwang233
Copy link
Collaborator

xwang233 commented Oct 5, 2023

!build

@xwang233
Copy link
Collaborator

xwang233 commented Oct 5, 2023

nvfuser-ci/job-70932017: codegen_diff_4/9

http://nv/e5M/nvfuser_github_ci/codegen_diff_p10125137_j70932017_1696543064832465742_codediff_48bda6c_5c3d61f_custom_command_20231005_140551.html

Seems like the codegen diff script created too many outputs to stdout that exceeded CI log size limit. I've fixed this in the CI. If it's a concern to you, feel free to restart a new build.

@zasdfgbnm
Copy link
Collaborator Author

!build

3 similar comments
@xwang233
Copy link
Collaborator

xwang233 commented Oct 6, 2023

!build

@xwang233
Copy link
Collaborator

xwang233 commented Oct 6, 2023

!build

@xwang233
Copy link
Collaborator

xwang233 commented Oct 6, 2023

!build

Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


FusionExecutor fe;

fe.registerPostLoweringHook([](kir::Kernel* kernel) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. This is really a nice way of testing initial build-out features

@drzejan2
Copy link
Contributor

LGTM

@zasdfgbnm zasdfgbnm merged commit 877edeb into main Oct 10, 2023
@zasdfgbnm zasdfgbnm deleted the mbarrier branch October 10, 2023 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mbarrier

6 participants