
Add unit tests for subset of kernels#1384

Merged
lockshaw merged 28 commits into repo-refactor from kernel-tests
Jul 10, 2024

Conversation

@dylanllim
Contributor

@dylanllim dylanllim commented May 6, 2024

Add unit tests for the kernel functions of the following operators:

  • MultiHeadAttention
  • Cast
  • Combine
  • Concat
  • Flat
  • Partition
  • Replicate
  • Reshape
  • Softmax
  • Transpose
  • Reduce
  • Dropout
  • Split
  • Reverse
  • Reduction
  • Batch Matmul
  • Batch Norm
  • Gather
  • Pool 2D
  • Layer Norm

The changes made to initializer_kernels.h, array_shape.h, and utils/fmt.h were needed in order for tests to compile.



@dylanllim dylanllim requested a review from lockshaw May 6, 2024 22:28
@dylanllim dylanllim marked this pull request as draft May 6, 2024 23:29
@dylanllim dylanllim added bug Something isn't working repo-refactor kernels Kernels library labels May 6, 2024
@lockshaw lockshaw changed the title pr for debugging kernel driver issues Add unit tests for lib/kernels May 30, 2024
@lockshaw lockshaw requested review from reyna-abhyankar and removed request for lockshaw May 30, 2024 21:45
@dylanllim dylanllim removed the bug Something isn't working label May 31, 2024
@lockshaw lockshaw self-requested a review May 31, 2024 05:07
Collaborator

@lockshaw lockshaw left a comment


Can you update the PR description and title? (nevermind, took care of this already)

Overall the tests look okay, but they need two things: (1) use some of the higher level interfaces (I think Allocator already has a way of producing a GenericTensorAccessor for a given ArrayShape, though if not it should be added), and (2) pull out duplicate testing code--currently it's nearly impossible to read what the test functions are doing, because 90% of the code is the same "generate a random tensor" code over and over again, which should be pulled out into separate functions (kernels-tests can have helper functions and header files in it)
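The "generate a random tensor" helper suggested above might look roughly like the following sketch. The name and signature here are assumptions, not the actual FlexFlow API; in the real test suite this would presumably produce a GenericTensorAccessor through an Allocator rather than a raw std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fill a host-side buffer with uniform random
// floats, replacing the per-test "for (auto &val : ...) val = dist(gen);"
// boilerplate. Name and signature are illustrative only.
std::vector<float> make_random_host_buffer(std::size_t num_elements,
                                           float lo = -1.0f,
                                           float hi = 1.0f) {
  std::mt19937 gen{std::random_device{}()};
  std::uniform_real_distribution<float> dist{lo, hi};
  std::vector<float> result(num_elements);
  for (float &val : result) {
    val = dist(gen);
  }
  return result;
}
```

Each test could then obtain its inputs in one line instead of repeating the generator setup.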

Reviewed 8 of 13 files at r1, 23 of 31 files at r2, all commit messages.
Reviewable status: 31 of 39 files reviewed, 28 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


.proj.toml line 16 at r2 (raw file):

test_targets = [
  "kernel-tests",

Make sure these changes don't get merged--the only change here should be adding kernel-tests to the list of tests to run, not removing anything


lib/CMakeLists.txt line 9 at r2 (raw file):

add_subdirectory(ffi)
add_subdirectory(substitutions)
# add_subdirectory(local-execution)

Why?


lib/kernels/CMakeLists.txt line 11 at r2 (raw file):

     src/*.cc
     src/cuda/cuda_helper.cu
     src/cuda/ops/cast_kernels.cu

I assume eventually this will get changed back to src/cuda/ops/*.cu once everything builds successfully?


lib/kernels/include/kernels/concat_kernels.h line 6 at r2 (raw file):

#include "device.h"
#include "kernels/accessor.h"
#include "kernels/concat_kernels.h"

Why does this file include itself?


lib/kernels/include/kernels/cuda_helper.h line 21 at r2 (raw file):

template <typename T>
__global__ void apply_add_with_scale(T *data_ptr, T const *grad_ptr, size_t size, T scale);

This shouldn't be exposed by kernels, only the wrapped kernel functions for each operator should be exposed through the public headers


lib/kernels/include/kernels/device.h line 31 at r2 (raw file):

#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA)
typedef cudaStream_t ffStream_t;
// cudaError_t get_legion_stream(cudaStream_t *stream);

Why this change?


lib/kernels/include/kernels/softmax_kernels.h line 22 at r2 (raw file):

SoftmaxPerDeviceState init_kernel(PerDeviceFFHandle const &, int, 
                                  int, int, int, int);

Add argument names


lib/kernels/src/accessor.cc line 6 at r2 (raw file):

int32_t *GenericTensorAccessorW::get_int32_ptr() const {
  return get<DataType::INT32>();

Generally we prefer explicitly using this as it makes it clearer what is a method and what is a function

Suggestion:

  return this->get<DataType::INT32>();

lib/kernels/src/device.h line 96 at r2 (raw file):

// template <typename DT>
// __global__ void apply_add_with_scale(DT *data_ptr,

Why commented out?


lib/kernels/src/cuda/cuda_helper.cu line 274 at r2 (raw file):

}

template __global__ void

Can/should we move these manual instantiations over to use the stuff in datatype_dispatch.h?


lib/kernels/src/cuda/cuda_helper.cu line 294 at r2 (raw file):

    add_kernel<int64_t>(int64_t *dst, int64_t const *src, size_t size);
template __global__ void
    add_kernel<bool>(bool *dst, bool const *src, unsigned long size);

Why do we need to add bools? Is it just because DatatypeDispatch forces all of the instantiations?


lib/kernels/src/cuda/ops/attention_kernels.cu line 47 at r2 (raw file):

  ffSeqDataDescriptor_t oDesc;
  void *reserveSpace;
  // void *dropoutStates; // NOT USED

Not even on inference/master? In that case they should be removed


lib/kernels/src/cuda/ops/attention_kernels.cu line 307 at r2 (raw file):

void cleanup_kernel(Allocator &allocator,
                    MHAPerDeviceState const &device_state) {
  /* Noticed that loWinIdx and hiWinIdx are not allocated on GPU? Should 

Where are they used?


lib/kernels/src/cuda/ops/cast_kernels.cu line 61 at r2 (raw file):

};

// void forward_kernel(PerDeviceFFHandle handle,

If these shouldn't be here, delete don't just comment out


lib/kernels/test/CMakeLists.txt line 3 at r2 (raw file):

ff_add_test_executable(
  NAME
    kernel-tests

Suggestion:

    kernels-tests

lib/kernels/test/CMakeLists.txt line 21 at r2 (raw file):

    cudart
    cublas
)

Definitely don't need compiler or pcg, I also don't think you need nccl

Suggestion:

    utils
    doctest
    utils-test-common
    kernels
    op-attrs
    cuda
    cudnn
    cudart
    cublas
)

lib/kernels/test/CMakeLists.txt line 24 at r2 (raw file):

# set(project_target kernel-tests)

Fix up commented-out stuff before merging


lib/kernels/test/src/test_attention_kernel.cc line 24 at r2 (raw file):

    handle.workSpaceSize = 1024 * 1024;
    cudaMalloc(&handle.workSpace, handle.workSpaceSize);
    handle.allowTensorOpMathConversion = true;

Probably best to pull this out into a helper function

Code quote:

    PerDeviceFFHandle handle;
    cudnnCreate(&handle.dnn);
    cublasCreate(&handle.blas);
    handle.workSpaceSize = 1024 * 1024;
    cudaMalloc(&handle.workSpace, handle.workSpaceSize);
    handle.allowTensorOpMathConversion = true;

lib/kernels/test/src/test_attention_kernel.cc line 36 at r2 (raw file):

        allocator.allocate(num_samples * kvSeqLength * kSize * sizeof(float));
    void *value_ptr =
        allocator.allocate(num_samples * kvSeqLength * vSize * sizeof(float));

@reyna-abhyankar Isn't there a way to directly allocate a GenericTensorAccessor using an Allocator?

Code quote:

allocator.allocate(

lib/kernels/test/src/test_attention_kernel.cc line 51 at r2 (raw file):

    for (auto &val : host_query)
      val = dist(gen);

Pull generating a random GenericTensorAccessor out into a separate function

Code quote:

    for (auto &val : host_query)
      val = dist(gen);

lib/kernels/test/src/test_attention_kernel.cc line 86 at r2 (raw file):

    // TODO: PROBABLY NEED DIFFERENT CHECK?!!??!
    REQUIRE(std::any_of(host_output.begin(), host_output.end(),

No check needed, as long as it doesn't crash I'm happy


lib/kernels/test/src/test_attention_kernel.cc line 205 at r2 (raw file):

    REQUIRE(std::any_of(output_grad.begin(), output_grad.end(),
                        [](float v) { return v != 0; }));

Prefer functions from containers.h which don't require passing iterators

Code quote:

    REQUIRE(std::any_of(output_grad.begin(), output_grad.end(),
                        [](float v) { return v != 0; }));
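A container-level wrapper in the spirit of utils/containers.h (the exact name and signature there are assumed here) hides the begin()/end() plumbing:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Sketch: forward a whole container to std::any_of so call sites
// never pass iterators explicitly.
template <typename Container, typename Pred>
bool any_of(Container const &c, Pred &&p) {
  return std::any_of(c.begin(), c.end(), std::forward<Pred>(p));
}
```

With such a helper the check would read `REQUIRE(any_of(output_grad, [](float v) { return v != 0; }));`.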

lib/kernels/test/src/test_cast_kernel.cc line 8 at r2 (raw file):

namespace FlexFlow {
TEST_SUITE(FF_TEST_SUITE) {

Suggestion:

using namespace ::FlexFlow;

TEST_SUITE(FF_TEST_SUITE) {

lib/kernels/test/src/test_cast_kernel.cc line 12 at r2 (raw file):

    std::size_t dims[] = {100, 100};
    std::size_t num_dims = 2;
    FlexFlow::ArrayShape shape(dims, num_dims);

Suggestion:

    ArrayShape shape = ArrayShape{
      std::vector<size_t>{100, 100},
    };

lib/kernels/test/src/test_cast_kernel.cc line 83 at r2 (raw file):

    std::vector<int> host_data(100 * 100);
    for (auto &val : host_data) {

Generally we avoid using auto unless the type name is really long

Suggestion:

   for (int &val : host_data) {

lib/kernels/test/src/test_cast_kernel.cc line 108 at r2 (raw file):

    for (size_t i = 0; i < host_int_data.size(); ++i) {
      REQUIRE(typeid(host_float_data[i]) == typeid(float));

Just remove the check, not necessary here


lib/kernels/test/src/test_transpose_kernel.cc line 68 at r2 (raw file):

    std::vector<int> in_strides(num_dims, 1);
    std::vector<int> out_strides(num_dims, 1);
    for (int i = 1; i < num_dims; i++) {

It seems like this is actually computing the transpose? In that case I'd rather this get moved into kernels itself as a CPU kernel, and then the tests can just compare that the CUDA and the CPU kernels do the same thing. Mainly this is because (1) super complicated tests are not great, but primarily (2) we have uses for CPU-only versions of the kernels in testing other parts of the codebase
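A CPU reference kernel of the kind suggested above might, for the 2D case, look like this sketch. The function name is hypothetical; the real CPU kernel would operate on GenericTensorAccessors and support arbitrary dimension permutations rather than flat row-major vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference transpose for a row-major rows x cols matrix:
// out[j][i] = in[i][j]. The CUDA kernel's output can then be compared
// element-wise against this result.
std::vector<float> transpose_2d_cpu(std::vector<float> const &in,
                                    std::size_t rows,
                                    std::size_t cols) {
  std::vector<float> out(in.size());
  for (std::size_t i = 0; i < rows; i++) {
    for (std::size_t j = 0; j < cols; j++) {
      out[j * rows + i] = in[i * cols + j];
    }
  }
  return out;
}
```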


lib/utils/include/utils/fmt.h line 57 at r2 (raw file):

  // CHECK_FMTABLE(T); 
  // std::string result = fmt::to_string(t); 
  std::string result = "debugging"; 

Why?

@lockshaw lockshaw changed the title Add unit tests for lib/kernels Add unit tests for subset of kernels May 31, 2024
Contributor Author

@dylanllim dylanllim left a comment


Reviewable status: 13 of 63 files reviewed, 28 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/CMakeLists.txt line 9 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

I think I copied this CMakeLists.txt over from a different branch before I actually had local-execution, so I commented it out in order to compile.


lib/kernels/CMakeLists.txt line 11 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I assume eventually this will get changed back to src/cuda/ops/*.cu once everything builds successfully?

Yeah, it's just src/cuda/ops/*.cu now


lib/kernels/include/kernels/device.h line 31 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why this change?

Was running into dependency issues, but forgot to uncomment => Fixed


lib/kernels/include/kernels/softmax_kernels.h line 22 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add argument names

Added


lib/kernels/src/accessor.cc line 6 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Generally we prefer explicitly using this as it makes it clearer what is a method and what is a function

Added


lib/kernels/src/device.h line 96 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why commented out?

Fixed, forgot to uncomment


lib/kernels/src/cuda/cuda_helper.cu line 294 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why do we need to add bools? Is it just because DatatypeDispatch forces all of the instantiations?

Yeah, ran into compile issues without the bool


lib/kernels/src/cuda/ops/attention_kernels.cu line 47 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not even on inference/master? In that case they should be removed

I just don't see the dropOutState fields being used anywhere in the init kernel


lib/kernels/src/cuda/ops/attention_kernels.cu line 307 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Where are they used?

In init_kernel:

// allocate memory for loWinIdx/hiWinIdx
int *loWinIdx = (int *)malloc(sizeof(int) * qoSeqLength);
int *hiWinIdx = (int *)malloc(sizeof(int) * qoSeqLength);


lib/kernels/src/cuda/ops/cast_kernels.cu line 61 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If these shouldn't be here, delete don't just comment out

Done.


lib/kernels/test/CMakeLists.txt line 3 at r2 (raw file):

ff_add_test_executable(
  NAME
    kernel-tests

Done.


lib/kernels/test/CMakeLists.txt line 21 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Definitely don't need compiler or pcg, I also don't think you need nccl

Removed


lib/kernels/test/CMakeLists.txt line 24 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Fix up commented-out stuff before merging

Removed


lib/kernels/test/src/test_attention_kernel.cc line 24 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably best to pull this out into a helper function

Created


lib/kernels/test/src/test_attention_kernel.cc line 51 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Pull generating a random GenericTensorAccessor out into a separate function

Done


lib/kernels/test/src/test_attention_kernel.cc line 86 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

No check needed, as long as it doesn't crash I'm happy

Done


lib/kernels/test/src/test_attention_kernel.cc line 205 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer functions from containers.h which don't require passing iterators

Done.


lib/kernels/test/src/test_cast_kernel.cc line 8 at r2 (raw file):

namespace FlexFlow {
TEST_SUITE(FF_TEST_SUITE) {

Done.


lib/kernels/test/src/test_cast_kernel.cc line 83 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Generally we avoid using auto unless the type name is really long

Done.


lib/kernels/test/src/test_cast_kernel.cc line 108 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just remove the check, not necessary here

Done.


lib/kernels/test/src/test_transpose_kernel.cc line 68 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It seems like this is actually computing the transpose? In that case I'd rather this get moved into kernels itself as a CPU kernel, and then the tests can just compare that the CUDA and the CPU kernels do the same thing. Mainly this is because (1) super complicated tests are not great, but primarily (2) we have uses for CPU-only versions of the kernels in testing other parts of the codebase

Done.


lib/kernels/include/kernels/concat_kernels.h line 6 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why does this file include itself?

Fixed


lib/kernels/include/kernels/cuda_helper.h line 21 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This shouldn't be exposed by kernels, only the wrapped kernel functions for each operator should be exposed through the public headers

Deleted file

@reyna-abhyankar reyna-abhyankar marked this pull request as ready for review June 10, 2024 16:46
Collaborator

@reyna-abhyankar reyna-abhyankar left a comment


Agree with this, and I'd add that we should separate the kernel tests by forward and backward (and then pull out the duplicated code). Currently it's a giant end-to-end test for each operator, so if one fails we don't actually know what part went wrong.

Reviewable status: 10 of 63 files reviewed, 37 unresolved discussions (waiting on @lockshaw and @oOTigger)


lib/kernels/include/kernels/device.h line 31 at r6 (raw file):

#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA)
typedef cudaStream_t ffStream_t;
cudaError_t get_legion_stream(cudaStream_t *stream);

Why deleted? iirc this is called in a couple of places


lib/kernels/test/src/test_attention_kernel.cc line 36 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

@reyna-abhyankar Isn't there a way to directly allocate a GenericTensorAccessor using an Allocator?

Yes, this is in the latest repo-refactor from 1400.


lib/kernels/test/src/test_cast_kernel.cc line 8 at r6 (raw file):

template <typename T>
void allocate_ptrs(std::vector<T **> &gpu_data_ptrs,

This code can be moved out since it's replicated a bunch


lib/kernels/test/src/test_concat_kernel.cc line 27 at r6 (raw file):

    for (int i = 0; i < num_inputs; i++) {
      void *input_data_ptr = allocator.allocate(size_per_input * sizeof(float));

Merge latest repo-refactor, I think the allocator interface now allows you to pass in a tensor shape and get back a GTAW


lib/kernels/test/src/test_layer_norm_kernels.cc line 21 at r6 (raw file):

TEST_SUITE("kernel-tests") {
  TEST_CASE("Test LayerNorm Forward and Backward Kernel") {

In general, I think we should have separate test cases for the 3 kernels since we aren't actually testing for correctness in this. Then, later we can add correctness tests that do it end to end


lib/kernels/test/src/test_partition_kernel.cc line 19 at r6 (raw file):

}

using namespace ::FlexFlow;

Suggestion

Code snippet:

using namespace FlexFlow;

lib/kernels/test/src/test_partition_kernel.cc line 25 at r6 (raw file):

    std::size_t dims[] = {num_elements};
    std::size_t num_dims = 1;
    FlexFlow::ArrayShape shape(dims, num_dims);

Suggestion

Code snippet:

ArrayShape shape(...);

lib/kernels/test/src/test_partition_kernel.cc line 44 at r6 (raw file):

        Kernels::Repartition::init_kernel(handle, DataType::FLOAT);

    

What's this?


lib/kernels/test/src/test_softmax_kernel.cc line 22 at r6 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Softmax Forward") {
    std::size_t num_elements = 100;

This is good, splitting up the kernels. But then you can also just refactor any shared code between the test cases (can call it like softmax_test_setup() that does any other work for you)


lib/kernels/test/src/test_utils.h line 10 at r6 (raw file):

#include <vector>
#include <algorithm>

move that pointers allocation to here. we can maybe push it to allocator too but for now, the utils file should be fine

Collaborator

@reyna-abhyankar reyna-abhyankar left a comment


Reviewable status: 10 of 63 files reviewed, 38 unresolved discussions (waiting on @lockshaw and @oOTigger)


lib/kernels/include/kernels/conv_2d_kernels.h line 54 at r6 (raw file):

                                 float *filter_grad_ptr);

void forward_kernel(cudaStream_t stream,

Also, why? This is the header file for both the CUDA and HIP kernels, so it should be generic, right?

Contributor Author

@dylanllim dylanllim left a comment


Reviewable status: 45 of 89 files reviewed, 26 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


flake.lock line 84 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

You should be able to set up whatever editorconfig integration your editor uses: https://editorconfig.org/

Is there a .editorconfig file that I'm supposed to be matching somewhere?

It looks like the default one is below. Is this fine?

root = true

# Unix-style newlines with a newline ending every file
[*]
end_of_line = lf
insert_final_newline = true

[{CMakeLists.txt,*.cmake}]
indent_style = space
indent_size = 2

[*.{cc,h,cu,cpp}]
indent_style = space
indent_size = 2

[*.py]
indent_style = space
indent_size = 4

[*.toml]
indent_style = space
indent_size = 2


lib/kernels/include/kernels/managed_handle.h line 14 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is move construction disabled? Should copy-assignment be deleted?

I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.

I noticed that's what the Allocators had, so I thought it would be safe to carry over.


lib/kernels/include/kernels/managed_handle.h line 19 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not really necessary here? ManagedHandle already has a default constructor

Done.


lib/kernels/src/cuda/cuda_helper.cu line 283 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

As a side note, real_type should really be renamed to real_type_t to follow naming conventions, @oOTigger can you create a separate PR to do that renaming?

The code below I think should work, if it does it should get propagated to the other kernel structs below in this file

This code gave issues regarding dispatch. I ended up just sticking with the original code, but changed it to make use of real_type.


lib/kernels/src/cuda/cuda_helper.cu line 285 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Make sure these get exposed through the header files.

Done.

I exposed everything except dispatch_print_tensor_kernel and ended up reverting that function back to its old template format.

For some reason, the compiler didn't like me exposing the print_tensor kernel, and I got a long list of cascading errors. They seem to have something to do with ff_handle and nccl.h, but I wasn't able to figure out how to fix them.


lib/kernels/test/src/test_cast_kernel.cc line 10 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

m has a bit of an overloaded meaning, and managed_stream is substantially clearer. Also, ManagedFFStream makes it much clearer what stream you're talking about

Done.


lib/kernels/test/src/test_dropout.cc line 20 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Clearer names are really preferred: the codebase is going to have many "handle objects", so naming something ManagedHandle makes the code really difficult to read

Done.


lib/kernels/test/src/test_partition_kernel.cc line 52 at r13 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is the host_grad_input_data being read from the output_accessor? Either the names here are incredibly confusing or something weird is going on: the output of the backward kernel should generally be referred to as input_grad, not as output, as that would confuse things with the output tensor generated in the forward pass

Done.


lib/kernels/test/src/test_partition_kernel.cc line 55 at r13 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See above comment, something weird seems to be going on with the names here

Done.

Names should now be correct. Added back some accessors in the backward kernels to be explicit about grad accessors.


lib/kernels/test/src/test_replicate_kernel.cc line 10 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

"shape" by itself is a bit unclear, also wouldn't hurt to use this in the other tests where you just have "shape"--even if you don't want the second output_shape, replacing shape with input_shape should already help quite a bit

Done.


lib/kernels/test/src/test_utils.h line 53 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why are we using memcpy? Can't you just use vector's constructor?

Done. Also removed; I will add it back when adding CPU kernels.


lib/kernels/test/src/test_utils.h line 58 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why do we have both this and load_data_to_host_from_device?

This was an old function from when I was still dealing with raw pointers in some places. It has been removed.


lib/kernels/test/src/test_utils.cc line 90 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is the purpose of this function? In theory the placement of GenericTensorAccessors (i.e., whether they are on gpu or cpu) shouldn't really be explicitly managed and should be left to whatever object is behind the Allocator. If this information does need to be tracked (e.g., for filling with data, etc.) it is probably best to have it be a private field on GenericTensorAccessor and provide a method on GenericTensorAccessor for setting the contents that handles all the internal details. That would also remove the cpu_fill parameters in the functions above

That said, considering this complication I'm unsure why cpu allocation is getting addressed in this PR? afaict it's not necessary and is just going to further delay merging this?

Done.


lib/kernels/include/kernels/local_allocator.h line 6 at r14 (raw file):

namespace FlexFlow {

struct LocalAllocator : public IAllocator {

Done.


lib/kernels/include/kernels/local_allocator.h line 22 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is this part of this PR? I guess there's not an issue with it being included, but I don't think it should be used here?

Removed


lib/kernels/src/local_allocator.cc line 31 at r14 (raw file):

void *LocalCPUAllocator::allocate(size_t requested_memory_size) {
  void *ptr = malloc(requested_memory_size);
  if (ptr) {

Done.


lib/kernels/src/local_allocator.cc line 46 at r14 (raw file):

  } else {
    throw std::runtime_error(
        "Deallocating a pointer that was not allocated by this allocator");

Done.


lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):

    void *ptr = *it;
    it++;
    this->deallocate(ptr);

This causes issues since we're modifying the ptrs list while iterating over it, so I think we need to iterate over the ptrs list safely, as before.
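The erase-during-iteration hazard under discussion can be modeled with a toy allocator (all names here are illustrative): deallocate() erases from the set the destructor is iterating over, so the iterator must be advanced before the erase, which is exactly the advance-then-erase pattern in the original code.

```cpp
#include <cassert>
#include <unordered_set>

// Minimal model of the allocator teardown problem: release_all()
// iterates ptrs while deallocate() erases from it. Advancing the
// iterator before calling deallocate() keeps it valid, since
// unordered_set::erase only invalidates iterators to the erased element.
struct ToyAllocator {
  std::unordered_set<int> ptrs;
  int freed = 0;

  void deallocate(int p) {
    ptrs.erase(p);
    freed++;
  }

  void release_all() {
    for (auto it = ptrs.begin(); it != ptrs.end();) {
      int p = *it;
      ++it; // advance before deallocate() erases *it
      deallocate(p);
    }
  }
};
```

A plain range-based loop over ptrs would invoke undefined behavior here, because the element under the hidden iterator gets erased mid-loop.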


lib/kernels/src/managed_handle.cc line 9 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Do these functions also need to be checked, e.g., by checkCUDA?

Done.


lib/kernels/src/managed_handle.cc line 15 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Do these functions also need to be checked, e.g., by checkCUDA?

Done.


lib/kernels/src/cuda/ops/reverse_kernels.cu line 39 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete?

Deleted.

On a side note: is there a way to easily and quickly view the diff of the files I'm committing, to make sure these errors don't happen? I've mostly just been using git status and trying to make sure that spam files aren't added, but this isn't always the greatest, and if there's junk in a file I'm not necessarily catching it.


lib/local-execution/CMakeLists.txt line 16 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Remove, kernels once is plenty 😛

Done.


lib/kernels/include/kernels/managed_stream.h line 14 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is move construction disabled? Should copy-assignment be deleted?

I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.

I noticed that's what the Allocator's had so thought it would be safe to transfer over.


lib/kernels/include/kernels/managed_stream.h line 19 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not really necessary here? ManagedStream already has a default constructor

Done.

@lockshaw
Collaborator

flake.lock line 84 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Is there a .editorconfig file that I'm supposed to be matching somewhere?

It looks like the default one is below. Is this fine?

root = true

# Unix-style newlines with a newline ending every file
[*]
end_of_line = lf
insert_final_newline = true

[{CMakeLists.txt,*.cmake}]
indent_style = space
indent_size = 2

[*.{cc,h,cu,cpp}]
indent_style = space
indent_size = 2

[*.py]
indent_style = space
indent_size = 4

[*.toml]
indent_style = space
indent_size = 2

Yes, just the one in the root of the repository: https://github.com/flexflow/FlexFlow/blob/repo-refactor/.editorconfig. If you enable whatever editorconfig support your editor has it should automatically find it and apply those settings

Collaborator

@lockshaw lockshaw left a comment


Reviewed 44 of 44 files at r15, 2 of 2 files at r16, all commit messages.
Reviewable status: all files reviewed, 20 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


.proj.toml line 18 at r16 (raw file):

test_targets = [
  "kernels-tests",

What does this do on machines without GPUs? (also, what does this do on machines with GPUs, doesn't it need nixGL?)


lib/kernels/include/kernels/managed_handle.h line 14 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.

I noticed that's what the Allocator's had so thought it would be safe to transfer over.

But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing it deleted. I think you'd want move construction and move assignment enabled, and copy construction and copy assignment disabled, just like unique_ptr
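The ownership scheme described here (movable, not copyable, single deletion) can be sketched as follows. The struct and member names are hypothetical; a plain heap int stands in for the CUDA stream/handle that the real ManagedHandle/ManagedStream wrap.

```cpp
#include <cassert>
#include <utility>

// unique_ptr-style semantics: copy is deleted, move transfers
// ownership and nulls out the source, so the destructor frees at
// most once no matter how the object is passed around.
struct ManagedResource {
  int *handle;

  explicit ManagedResource(int value) : handle(new int(value)) {}
  ~ManagedResource() { delete handle; } // deleting nullptr is a no-op

  ManagedResource(ManagedResource const &) = delete;
  ManagedResource &operator=(ManagedResource const &) = delete;

  ManagedResource(ManagedResource &&other) noexcept
      : handle(std::exchange(other.handle, nullptr)) {}
  ManagedResource &operator=(ManagedResource &&other) noexcept {
    if (this != &other) {
      delete handle;
      handle = std::exchange(other.handle, nullptr);
    }
    return *this;
  }
};
```

Disabling moves as well (the original approach) is safe but forces the wrapper to live in one place forever; enabling only the move operations keeps it usable while still preventing double frees.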


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 8 at r16 (raw file):

namespace FlexFlow {

struct ManagedPerDeviceFFHandle {

This appears duplicated with the file managed_handle.h which seems to still exist?


lib/kernels/src/local_cuda_allocator.cc line 14 at r16 (raw file):

void LocalCudaAllocator::deallocate(void *ptr) {
  checkCUDA(cudaFree(ptr));
  this->ptrs.erase(ptr);

Add a check if the pointer is not in ptrs?


lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):

LocalCudaAllocator::~LocalCudaAllocator() {
  for (auto it = this->ptrs.begin(); it != this->ptrs.end();) {

Replace with range-based loop


lib/kernels/src/local_cuda_allocator.cc line 25 at r16 (raw file):

}

Allocator get_local_cuda_memory_allocator() {

Suggestion:

Allocator create_local_cuda_memory_allocator() {

lib/kernels/src/cuda/cuda_helper.cu line 283 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

This code gave issues regarding dispatch. I ended up just sticking with original code but changing to make use of real_type

The static_casting in your version is very messy, so I'd prefer to adapt the version above. If you want help resolving the errors let me know, but just dumping everything to a void pointer and then doing a convoluted set of casts may pass the compiler but often doesn't actually result in the behavior you want
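The kind of functor-based dispatch being suggested might look like this sketch (the names are assumed, not the actual datatype_dispatch.h interface): the switch fixes the concrete element type once, so no void* round-trips or static_cast chains are needed afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy DataType enum and functor; the real code would cover all of
// FlexFlow's data types and run real kernels instead of sizeof.
enum class DataType { FLOAT, INT64 };

template <typename T>
struct ElemSize {
  std::size_t operator()() const { return sizeof(T); }
};

// Dispatch a template-template functor on a runtime DataType value.
template <template <typename> class F>
std::size_t dispatch(DataType dt) {
  switch (dt) {
    case DataType::FLOAT:
      return F<float>{}();
    case DataType::INT64:
      return F<std::int64_t>{}();
  }
  return 0;
}
```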


lib/kernels/src/cuda/cuda_helper.cu line 285 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

I exposed everything except dispatch_print_tensor_kernel and ended up reverting that function back to it's old template format.

For some reason, the compiler didn't like me exposing the print_tensor kernel and got a long list of cascading errors. It says it has something to do with ff_handle and nccl.h but I wasn't really able to figure out how to fix.

What's the error message?


lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):

#include "test_utils.h"

Why remove doctest?


lib/kernels/test/src/test_pool_2d_kernels.cc line 54 at r16 (raw file):

                                      output_accessor.ptr);

      std::vector<float> host_output_accessor =

This isn't an accessor?


lib/kernels/test/src/test_split_kernel.cc line 26 at r16 (raw file):

      std::vector<float *> output_ptrs(num_outputs);
      for (int i = 0; i < num_outputs; i++) {

Replace loop with repeat?
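The repeat helper presumably meant here (name and signature assumed from utils/containers.h conventions) would build the pointer vector without an explicit index loop:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sketch: construct a vector by invoking f() n times, replacing
// "std::vector<T> v(n); for (int i = 0; i < n; i++) v[i] = ...;".
template <typename F>
auto repeat(int n, F &&f) -> std::vector<decltype(f())> {
  std::vector<decltype(f())> result;
  result.reserve(n);
  for (int i = 0; i < n; i++) {
    result.push_back(f());
  }
  return result;
}
```

The test would then call something like `repeat(num_outputs, [&] { return allocate_output(); })` with a hypothetical per-output allocation lambda.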


lib/kernels/test/src/test_split_kernel.cc line 43 at r16 (raw file):

    SUBCASE("backward_kernel") {
      std::vector<float *> output_grad_ptrs(num_outputs);
      for (int i = 0; i < num_outputs; i++) {

Replace loop with repeat?


lib/kernels/test/src/test_utils.h line 4 at r16 (raw file):

#define _FLEXFLOW_KERNELS_TEST_UTILS

#include "doctest/doctest.h"

Why? I don't see doctest being used in this file?


lib/kernels/test/src/test_utils.cc line 90 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

I'm still seeing it here as of the latest revision?


lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

This gives issues, as we're modifying the ptr list while iterating over it => I think we need to safely iterate over the ptrs list as before

I don't see how we're modifying the ptr list at all, so afaict the code above should work


lib/kernels/src/cuda/ops/reverse_kernels.cu line 39 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Deleted.

On a side note ~ Is there a way to easily and quickly view the diff of the files I'm committing to make sure these errors don't happen? I've kind of just been using git status and trying to make sure that spam files aren't added, but this isn't always the greatest, and if there's junk in the file I'm not necessarily catching it.

git diff, or if they're already added, git diff --cached. You can also add each change individually (letting you preview each change as it's added) via git add -p
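As a concrete sketch of that workflow (`notes.txt` is just a hypothetical file used for illustration):

```shell
# Throwaway repo to demonstrate the diff workflow
tmp=$(mktemp -d) && cd "$tmp" && git init -q
echo "first draft" > notes.txt
git add notes.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "initial"

echo "second draft" > notes.txt

# Unstaged changes (what `git status` only hints at)
git diff

# Stage it; `git add -p` instead previews and stages hunk-by-hunk interactively
git add notes.txt

# What is staged, i.e. exactly what the next commit will contain
git diff --cached

# Quick check that no junk files are about to be committed
git status --short
```

Running `git diff --cached` right before committing is usually enough to catch stray junk in files that `git status` alone would miss.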


lib/local-execution/include/local-execution/local_allocator.h line 8 at r15 (raw file):

namespace FlexFlow {

struct LocalAllocator : public IAllocator {

Why do we have both LocalAllocator and LocalCudaAllocator?


lib/kernels/include/kernels/managed_stream.h line 14 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.

I noticed that's what the Allocator's had so thought it would be safe to transfer over.

But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing it deleted.

Contributor Author

@dylanllim dylanllim left a comment

Reviewable status: 59 of 90 files reviewed, 19 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


.proj.toml line 18 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What does this do on machines without GPUs? (also, what does this do on machines with GPUs, doesn't it need nixGL?)

If you're asking whether things build fine on a machine with GPUs => proj test builds fine but runs into an error of:

CUDA failure: CUDA driver version is insufficient for CUDA runtime version (35)

when actually trying to test things.

Everything is fine though when using:
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:nix-community/nixGL -- build/lib/kernels/test/kernels-tests


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 8 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This appears duplicated with the file managed_handle.h which seems to still exist?

Deleted managed_handle.h


lib/kernels/src/local_cuda_allocator.cc line 14 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add a check if the pointer is not in ptrs?

Done.


lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Replace with range-based loop

Same issue as mentioned on local_allocator.cc ~ When I change this to use a range-based loop I get the following issue:

kernels-tests: /home/dylanlim/ff/pr/622/lib/kernels/src/local_cuda_allocator.cc:13: virtual void FlexFlow::LocalCudaAllocator::deallocate(void*): Assertion `false' failed.

/home/dylanlim/ff/pr/622/lib/kernels/test/src/test_attention_kernel.cc:8: FATAL ERROR: test case CRASHED: SIGABRT - Abort (abnormal termination) signal

and then a bunch of out of bounds memory accesses are printed. ie:

[lines of garbled binary output omitted]

lib/kernels/src/local_cuda_allocator.cc line 25 at r16 (raw file):

}

Allocator get_local_cuda_memory_allocator() {

Done.


lib/kernels/src/cuda/cuda_helper.cu line 283 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The static_casting in your version is very messy, so I'd prefer to adapt the version above. If you want help resolving the errors let me know, but just dumping everything to a void pointer and then doing a convoluted set of casts may pass the compiler but often doesn't actually result in the behavior you want

Ended up just getting rid of this for a later PR; I feel like I'm spending too much time with no progress, and I'm still not fully sure why it's not working


lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why remove doctest?

Since all test files include "doctest" I thought it'd maybe be cleaner to just include doctest in "test_utils"?


lib/kernels/test/src/test_pool_2d_kernels.cc line 54 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This isn't an accessor?

Done.


lib/kernels/test/src/test_split_kernel.cc line 26 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Replace loop with repeat?

Done.


lib/kernels/test/src/test_split_kernel.cc line 43 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Replace loop with repeat?

Done.


lib/kernels/test/src/test_utils.h line 4 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why? I don't see doctest being used in this file?

I thought that since all tests use doctest, I could clean things up by putting it in test_utils, but I guess that's bad practice. I reverted.


lib/kernels/test/src/test_utils.cc line 90 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'm still seeing it here as of the latest revision?

Done.


lib/local-execution/include/local-execution/local_allocator.h line 8 at r15 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why do we have both LocalAllocator and LocalCudaAllocator?

This must've gotten added back with my repo-refactor merge. It's deleted.


lib/kernels/include/kernels/managed_handle.h line 14 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing it deleted. I think you'd want move construction and move assignment enabled, and copy construction and copy assignment disabled, just like unique_ptr

Done.


lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't see how we're modifying the ptr list at all, so afaict the code above should work

That's what I thought too, but I get a memory error when I use the range-based for loop when I run the tests


lib/kernels/include/kernels/managed_stream.h line 14 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing it deleted.

Done.

Collaborator

@lockshaw lockshaw left a comment

Reviewed 31 of 31 files at r17, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


.proj.toml line 18 at r16 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

If you're asking whether things build fine on a machine with GPUs => proj test builds fine but runs into an error of:

CUDA failure: CUDA driver version is insufficient for CUDA runtime version (35)

when actually trying to test things.

Everything is fine though when using:
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:nix-community/nixGL -- build/lib/kernels/test/kernels-tests

Let's leave this out of test_targets then for now until we have a fix so proj doesn't crash on machines without a GPU


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 15 at r17 (raw file):

  ~ManagedPerDeviceFFHandle();

  ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other) noexcept

These implementations should be in the .cc file


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 16 at r17 (raw file):

  ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other) noexcept
      : handle(std::move(other.handle)) {}

Right now you're going to run into issues with the moved-from value double-freeing your objects, I think. Making this a pointer (1) makes it easy to fix your double-free problem, and (2) as a nice side effect probably makes moves cheaper

Note that you'll need to modify your destructor accordingly, mostly along the lines of the implementation in ManagedFFStream but with slightly different behavior in terms of the actual checkCUDA(freething(...)) calls

Suggestion:

  PerDeviceFFHandle *handle;

  ManagedPerDeviceFFHandle();

  ~ManagedPerDeviceFFHandle();

  ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other) noexcept
      : handle(std::exchange(other.handle, nullptr)) {}

lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):

      : handle(std::move(other.handle)) {}

  ManagedPerDeviceFFHandle &

Probably a good idea to test the move behaviors of both this and ManagedFFStream just to make sure you actually got the implementations correct. I think you should just be able to (in your tests) declare an object, move from it, and then check the state of both the moved-from object and the moved-to object
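As a sketch of what such a test could check (`ToyManaged` is a hypothetical stand-in, since the real classes wrap CUDA resources that can't run in a plain unit test):

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a move-only RAII wrapper like ManagedFFStream:
// the moved-from object is left holding nullptr, so its destructor has
// nothing to release and no double-free can occur.
struct ToyManaged {
  int *raw; // stands in for the owned CUDA stream/handle pointer

  ToyManaged() : raw(new int(42)) {}
  ToyManaged(ToyManaged const &) = delete;
  ToyManaged &operator=(ToyManaged const &) = delete;
  ToyManaged(ToyManaged &&other) noexcept
      : raw(std::exchange(other.raw, nullptr)) {}
  ToyManaged &operator=(ToyManaged &&other) noexcept {
    std::swap(this->raw, other.raw); // old resource dies with `other`
    return *this;
  }
  ~ToyManaged() {
    delete raw; // deleting nullptr is a harmless no-op
  }
};
```

The test then just move-constructs and move-assigns, asserting that the moved-from object ends up holding nullptr and the moved-to object owns the resource.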


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 23 at r17 (raw file):

      handle = std::move(other.handle);
    }
    return *this;

Otherwise you'll double free your handle things

Suggestion:

    using std::swap;
    
    swap(this->handle, other.handle);
    
    return *this;

lib/kernels/include/kernels/managed_per_device_ff_handle.h line 30 at r17 (raw file):

  ManagedPerDeviceFFHandle &
      operator=(ManagedPerDeviceFFHandle const &) = delete;
};

Just for consistency we tend to layout struct definitions like this--obviously there's some variation, but trying to follow the convention makes the codebase a bit more approachable

Suggestion:

struct ManagedPerDeviceFFHandle {
public:
  ManagedPerDeviceFFHandle();

  ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle const &) = delete;
  ManagedPerDeviceFFHandle &
      operator=(ManagedPerDeviceFFHandle const &) = delete;
      
  ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other);
  ManagedPerDeviceFFHandle &
      operator=(ManagedPerDeviceFFHandle &&other) noexcept;
      
  ~ManagedPerDeviceFFHandle();
  
  PerDeviceFFHandle const &raw_handle() const;
private:
    PerDeviceFFHandle *handle;
};

lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Same issue as mentioned on local_allocator.cc ~ When I change this to use a range-based loop I get the following issue:

kernels-tests: /home/dylanlim/ff/pr/622/lib/kernels/src/local_cuda_allocator.cc:13: virtual void FlexFlow::LocalCudaAllocator::deallocate(void*): Assertion `false' failed.

/home/dylanlim/ff/pr/622/lib/kernels/test/src/test_attention_kernel.cc:8: FATAL ERROR: test case CRASHED: SIGABRT - Abort (abnormal termination) signal

and then a bunch of out of bounds memory accesses are printed. ie:

[lines of garbled binary output omitted]

Without seeing what code you tried I can't really help. In general, try to post not just the output but also the code you tried, unless it's included in the PR


lib/kernels/src/local_cuda_allocator.cc line 14 at r17 (raw file):

void LocalCudaAllocator::deallocate(void *ptr) {
  auto it = this->ptrs.find(ptr);
  if (it != this->ptrs.end()) {

Suggestion:

  if (contains(this->ptrs, ptr)) {

lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Since all test files include "doctest" I thought it'd maybe be cleaner to just include doctest in "test_utils"?

Just include it in the tests directly; in my experience, trying tricks like this to avoid repeating headers just makes everything more confusing and harms build times.


lib/kernels/test/src/test_utils.h line 4 at r16 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Thought since all tests use doctest I could clean things by putting it in test_utils, but I guess that's bad practice. I reverted.

I don't know if it's widely thought of as a bad practice, but I've had issues with it in the past so I try to avoid it. I generally find it makes everything more complicated and can harm build times with little concrete benefit.


lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

That's what I thought too, but I get a memory error when I use the ranged based for loop when I run the tests

See other comment about posting the code you tried


lib/kernels/include/kernels/managed_ff_stream.h line 18 at r17 (raw file):

      : stream(std::exchange(other.stream, nullptr)) {}

  ManagedFFStream &operator=(ManagedFFStream &&other) noexcept {

These should be in the .cc file


lib/kernels/include/kernels/managed_ff_stream.h line 24 at r17 (raw file):

    }
    return *this;
  }

Suggestion:

  ManagedFFStream &operator=(ManagedFFStream &&other) noexcept {
    std::swap(this->stream, other.stream);
    return *this;
  }

lib/kernels/src/managed_ff_stream.cc line 9 at r17 (raw file):

ManagedFFStream::~ManagedFFStream() {
  checkCUDA(cudaStreamDestroy(stream));

Since moved-from values will have a stream with value nullptr

Suggestion:

  if (stream != nullptr) {
    checkCUDA(cudaStreamDestroy(stream));
  }

Collaborator

@reyna-abhyankar reyna-abhyankar left a comment

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @lockshaw and @oOTigger)


lib/kernels/src/array_shape.cc line 64 at r11 (raw file):

Previously, lockshaw (Colin Unger) wrote…

@reyna-abhyankar Can we get either a fix or an issue created for this?

Yeah, I can fix this in #1418 (issue created is #1421)


lib/kernels/src/cuda/ops/batch_norm_kernels.cu line 134 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't see any issue in passing in a stream in local-execution (I assume there's an ArgRef of some kind for that)? @reyna-abhyankar

init_kernel should be wrapped in a profile(...) like the other kernel calls. If you look at kernels/profiling.h, you'll see the profiling wrapper passes in the stream, so local execution doesn't handle it.

Contributor Author

@dylanllim dylanllim left a comment

Reviewable status: 66 of 90 files reviewed, 14 unresolved discussions (waiting on @lockshaw)


.proj.toml line 18 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Let's leave this out of test_targets then for now until we have a fix so proj doesn't crash on machines without a GPU

idk why I didn't understand your question initially. Things still build fine, even on machines without a GPU


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 15 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

These implementations should be in the .cc file

Done.


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 16 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Right now you're going to run into issues with the moved-from value double-freeing your objects, I think. Making this a pointer (1) makes it easy to fix your double-free problem, and (2) as a nice side effect probably makes moves cheaper

Note that you'll need to modify your destructor accordingly, mostly along the lines of the implementation in ManagedFFStream but with slightly different behavior in terms of the actual checkCUDA(freething(...)) calls

Done.


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably a good idea to test the move behaviors of both this and ManagedFFStream just to make sure you actually got the implementations correct. I think you should just be able to (in your tests) declare an object, move from it, and then check the state of both the moved-from object and the moved-to object

Done.


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 23 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Otherwise you'll double free your handle things

Done.


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 30 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just for consistency we tend to layout struct definitions like this--obviously there's some variation, but trying to follow the convention makes the codebase a bit more approachable

Done.


lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Without seeing what code you tried I can't really help. In general try to post not just the output but also the code you tried out unless it's included in the PR

Oh, I was just trying to show that it led to printing out random memory.

But yeah, I've been testing this for a while, and a range-based loop always seems to cause problems: all the tests fail because it deallocates a pointer that it didn't allocate. I verified the tests as well, and none of them look like they should be calling deallocate on an invalid pointer; only with a non-range-based loop do all allocations and deallocations work as expected.

I rewrote it to use a while loop, so potentially that looks a little cleaner?


lib/kernels/src/local_cuda_allocator.cc line 14 at r17 (raw file):

void LocalCudaAllocator::deallocate(void *ptr) {
  auto it = this->ptrs.find(ptr);
  if (it != this->ptrs.end()) {

Done.


lib/kernels/src/cuda/ops/batch_norm_kernels.cu line 134 at r6 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

init_kernel should be wrapped in a profile(...) like the other kernel calls. If you look at kernels/profiling.h, you'll see the profiling wrapper passes in the stream, so local execution doesn't handle it.

In that case I'll just keep things as is?


lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just include it in the tests directly; in my experience, trying tricks like this to avoid repeating headers just makes everything more confusing and harms build times.

Done.


lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See other comment about posting the code you tried

Done.


lib/kernels/include/kernels/managed_ff_stream.h line 18 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

These should be in the .cc file

Done.


lib/kernels/include/kernels/managed_ff_stream.h line 24 at r17 (raw file):

    }
    return *this;
  }

Done.


lib/kernels/src/managed_ff_stream.cc line 9 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Since moved-from values will have a stream with value nullptr

Done.

Collaborator

@reyna-abhyankar reyna-abhyankar left a comment

Reviewable status: 64 of 90 files reviewed, 14 unresolved discussions (waiting on @lockshaw)


lib/kernels/test/src/test_dropout.cc line 54 at r11 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Originally wanted to keep this unnested, but fixed it after our discussion on Friday

I believe this is still nested?

Contributor Author

@dylanllim dylanllim left a comment

Reviewable status: 64 of 90 files reviewed, 14 unresolved discussions (waiting on @lockshaw)


lib/kernels/test/src/test_dropout.cc line 54 at r11 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

I believe this is still nested?

It doesn't look like it is?

Collaborator

@lockshaw lockshaw left a comment

Reviewed 24 of 24 files at r18, 2 of 2 files at r19, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @oOTigger)


.proj.toml line 18 at r16 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

idk why I didn't understand your question initially. Things still build fine, even on machines without a GPU

Yes, but proj test runs the tests, at which point you'll get an error on non-GPU machines


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Where? Unless I missed something I'm not seeing any new tests added


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 22 at r19 (raw file):

  ~ManagedPerDeviceFFHandle();

  PerDeviceFFHandle const &raw_handle();

Suggestion:

  PerDeviceFFHandle const &raw_handle() const;

lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Oh, I was just trying to show that it led to printing out random memory.

But yeah, I've been testing this for a while, and a range-based loop always seems to cause problems: all the tests fail because it deallocates a pointer that it didn't allocate. I verified the tests as well, and none of them look like they should be calling deallocate on an invalid pointer; only with a non-range-based loop do all allocations and deallocations work as expected.

I rewrote it to use a while loop, so potentially that looks a little cleaner?

I still don't see a reason that the range-based for loop wouldn't work. Can you show me the code you're trying that isn't working properly?
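One hypothesis worth checking (an assumption on my part, not something confirmed in this thread): if deallocate() erases the pointer from this->ptrs, then a range-based for loop over ptrs in the destructor iterates a container that mutates under it, which invalidates the loop's iterator for std::unordered_set. Iterating over a snapshot sidesteps that. A toy sketch:

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_set>

// Toy model of an allocator whose deallocate() erases from the very set
// the destructor wants to iterate. Erasing from an unordered_set
// invalidates the erased element's iterator, so a range-based loop
// directly over `ptrs` that calls deallocate() would be undefined
// behavior; looping over a copy is safe.
struct ToyAllocator {
  std::unordered_set<void *> ptrs;

  void *allocate(std::size_t num_bytes) {
    void *ptr = std::malloc(num_bytes);
    ptrs.insert(ptr);
    return ptr;
  }

  void deallocate(void *ptr) {
    assert(ptrs.erase(ptr) == 1); // mutates ptrs, breaking any live loop over it
    std::free(ptr);
  }

  ~ToyAllocator() {
    std::unordered_set<void *> snapshot = ptrs; // copy first
    for (void *ptr : snapshot) {
      deallocate(ptr); // free to erase from the real set
    }
    assert(ptrs.empty());
  }
};
```

If the real destructor looks like the first pattern, that would explain why the non-range-based version behaves while the range-based one deallocates "pointers that were never allocated."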


lib/kernels/src/cuda/ops/batch_norm_kernels.cu line 134 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

In that case I'll just keep things as is?

Sure, this is fine for now.


lib/kernels/include/kernels/managed_ff_stream.h line 20 at r19 (raw file):

  ~ManagedFFStream();

  ffStream_t const &raw_stream();

Suggestion:

  ffStream_t const &raw_stream() const;

Contributor Author

@dylanllim dylanllim left a comment

Reviewable status: 84 of 90 files reviewed, 5 unresolved discussions (waiting on @lockshaw)


.proj.toml line 18 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Yes, but proj test runs the tests, at which point you'll get an error on non-GPU machines

Done.


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Where? Unless I missed something I'm not seeing any new tests added

Created issue for this


lib/kernels/include/kernels/managed_per_device_ff_handle.h line 22 at r19 (raw file):

  ~ManagedPerDeviceFFHandle();

  PerDeviceFFHandle const &raw_handle();

Done.


lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I still don't see a reason that the range-based for loop wouldn't work. Can you show me the code you're trying that isn't working properly?

Done.


lib/kernels/include/kernels/managed_ff_stream.h line 20 at r19 (raw file):

  ~ManagedFFStream();

  ffStream_t const &raw_stream();

Done.

Collaborator

@lockshaw lockshaw left a comment

Reviewed 6 of 6 files at r20, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @oOTigger)


Labels

kernels Kernels library
