Add unit tests for subset of kernels #1384
Conversation
lib/kernels
Can you update the PR description and title? (nevermind, took care of this already)
Overall the tests look okay, but they need two things: (1) use some of the higher level interfaces (I think Allocator already has a way of producing a GenericTensorAccessor for a given ArrayShape, though if not it should be added), and (2) pull out duplicate testing code--currently it's nearly impossible to read what the test functions are doing, because 90% of the code is the same "generate a random tensor" code over and over again, which should be pulled out into separate functions (kernels-tests can have helper functions and header files in it)
Reviewed 8 of 13 files at r1, 23 of 31 files at r2, all commit messages.
Reviewable status: 31 of 39 files reviewed, 28 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
.proj.toml line 16 at r2 (raw file):
test_targets = [ "kernel-tests",
Make sure these changes don't get merged--the only change here should be adding kernel-tests to the list of tests to run, not removing anything
lib/CMakeLists.txt line 9 at r2 (raw file):
add_subdirectory(ffi) add_subdirectory(substitutions) # add_subdirectory(local-execution)
Why?
lib/kernels/CMakeLists.txt line 11 at r2 (raw file):
src/*.cc src/cuda/cuda_helper.cu src/cuda/ops/cast_kernels.cu
I assume eventually this will get changed back to src/cuda/ops/*.cu once everything builds successfully?
lib/kernels/include/kernels/concat_kernels.h line 6 at r2 (raw file):
#include "device.h" #include "kernels/accessor.h" #include "kernels/concat_kernels.h"
Why does this file include itself?
lib/kernels/include/kernels/cuda_helper.h line 21 at r2 (raw file):
template <typename T> __global__ void apply_add_with_scale(T *data_ptr, T const *grad_ptr, size_t size, T scale);
This shouldn't be exposed by kernels, only the wrapped kernel functions for each operator should be exposed through the public headers
lib/kernels/include/kernels/device.h line 31 at r2 (raw file):
#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) typedef cudaStream_t ffStream_t; // cudaError_t get_legion_stream(cudaStream_t *stream);
Why this change?
lib/kernels/include/kernels/softmax_kernels.h line 22 at r2 (raw file):
SoftmaxPerDeviceState init_kernel(PerDeviceFFHandle const &, int, int, int, int, int);
Add argument names
lib/kernels/src/accessor.cc line 6 at r2 (raw file):
int32_t *GenericTensorAccessorW::get_int32_ptr() const { return get<DataType::INT32>();
Generally we prefer explicitly using this as it makes it clearer what is a method and what is a function
Suggestion:
return this->get<DataType::INT32>();
lib/kernels/src/device.h line 96 at r2 (raw file):
// template <typename DT> // __global__ void apply_add_with_scale(DT *data_ptr,
Why commented out?
lib/kernels/src/cuda/cuda_helper.cu line 274 at r2 (raw file):
} template __global__ void
Can/should we move these manual instantiations over to use the stuff in datatype_dispatch.h?
lib/kernels/src/cuda/cuda_helper.cu line 294 at r2 (raw file):
add_kernel<int64_t>(int64_t *dst, int64_t const *src, size_t size); template __global__ void add_kernel<bool>(bool *dst, bool const *src, unsigned long size);
Why do we need to add bools? Is it just because DatatypeDispatch forces all of the instantiations?
lib/kernels/src/cuda/ops/attention_kernels.cu line 47 at r2 (raw file):
ffSeqDataDescriptor_t oDesc; void *reserveSpace; // void *dropoutStates; // NOT USED
Not even on inference/master? In that case they should be removed
lib/kernels/src/cuda/ops/attention_kernels.cu line 307 at r2 (raw file):
void cleanup_kernel(Allocator &allocator, MHAPerDeviceState const &device_state) { /* Noticed that loWinIdx and hiWinIdx are not allocated on GPU? Should
Where are they used?
lib/kernels/src/cuda/ops/cast_kernels.cu line 61 at r2 (raw file):
}; // void forward_kernel(PerDeviceFFHandle handle,
If these shouldn't be here, delete them, don't just comment them out
lib/kernels/test/CMakeLists.txt line 3 at r2 (raw file):
ff_add_test_executable( NAME kernel-tests
Suggestion:
kernels-tests
lib/kernels/test/CMakeLists.txt line 21 at r2 (raw file):
cudart cublas )
Definitely don't need compiler or pcg, I also don't think you need nccl
Suggestion:
utils
doctest
utils-test-common
kernels
op-attrs
cuda
cudnn
cudart
cublas
)
lib/kernels/test/CMakeLists.txt line 24 at r2 (raw file):
# set(project_target kernel-tests)
Fix up commented-out stuff before merging
lib/kernels/test/src/test_attention_kernel.cc line 24 at r2 (raw file):
handle.workSpaceSize = 1024 * 1024; cudaMalloc(&handle.workSpace, handle.workSpaceSize); handle.allowTensorOpMathConversion = true;
Probably best to pull this out into a helper function
Code quote:
PerDeviceFFHandle handle;
cudnnCreate(&handle.dnn);
cublasCreate(&handle.blas);
handle.workSpaceSize = 1024 * 1024;
cudaMalloc(&handle.workSpace, handle.workSpaceSize);
handle.allowTensorOpMathConversion = true;
lib/kernels/test/src/test_attention_kernel.cc line 36 at r2 (raw file):
allocator.allocate(num_samples * kvSeqLength * kSize * sizeof(float)); void *value_ptr = allocator.allocate(num_samples * kvSeqLength * vSize * sizeof(float));
@reyna-abhyankar Isn't there a way to directly allocate a GenericTensorAccessor using an Allocator?
Code quote:
allocator.allocate(
lib/kernels/test/src/test_attention_kernel.cc line 51 at r2 (raw file):
for (auto &val : host_query) val = dist(gen);
Pull generating a random GenericTensorAccessor out into a separate function
Code quote:
for (auto &val : host_query)
val = dist(gen);
lib/kernels/test/src/test_attention_kernel.cc line 86 at r2 (raw file):
// TODO: PROBABLY NEED DIFFERENT CHECK?!!??! REQUIRE(std::any_of(host_output.begin(), host_output.end(),
No check needed, as long as it doesn't crash I'm happy
lib/kernels/test/src/test_attention_kernel.cc line 205 at r2 (raw file):
REQUIRE(std::any_of(output_grad.begin(), output_grad.end(), [](float v) { return v != 0; }));
Prefer functions from containers.h which don't require passing iterators
Code quote:
REQUIRE(std::any_of(output_grad.begin(), output_grad.end(),
[](float v) { return v != 0; }));
lib/kernels/test/src/test_cast_kernel.cc line 8 at r2 (raw file):
namespace FlexFlow { TEST_SUITE(FF_TEST_SUITE) {
Suggestion:
using namespace ::FlexFlow;
TEST_SUITE(FF_TEST_SUITE) {
lib/kernels/test/src/test_cast_kernel.cc line 12 at r2 (raw file):
std::size_t dims[] = {100, 100}; std::size_t num_dims = 2; FlexFlow::ArrayShape shape(dims, num_dims);
Suggestion:
ArrayShape shape = ArrayShape{
std::vector<size_t>{100, 100},
};
lib/kernels/test/src/test_cast_kernel.cc line 83 at r2 (raw file):
std::vector<int> host_data(100 * 100); for (auto &val : host_data) {
Generally we avoid using auto unless the type name is really long
Suggestion:
for (int &val : host_data) {
lib/kernels/test/src/test_cast_kernel.cc line 108 at r2 (raw file):
for (size_t i = 0; i < host_int_data.size(); ++i) { REQUIRE(typeid(host_float_data[i]) == typeid(float));
Just remove the check, not necessary here
lib/kernels/test/src/test_transpose_kernel.cc line 68 at r2 (raw file):
std::vector<int> in_strides(num_dims, 1); std::vector<int> out_strides(num_dims, 1); for (int i = 1; i < num_dims; i++) {
It seems like this is actually computing the transpose? In that case I'd rather this get moved into kernels itself as a CPU kernel, and then the tests can just compare that the CUDA and the CPU kernels do the same thing. Mainly this is because (1) super complicated tests are not great, but primarily (2) we have uses for CPU-only versions of the kernels in testing other parts of the codebase
lib/utils/include/utils/fmt.h line 57 at r2 (raw file):
// CHECK_FMTABLE(T); // std::string result = fmt::to_string(t); std::string result = "debugging";
Why?
lib/kernels
dylanllim
left a comment
Reviewable status: 13 of 63 files reviewed, 28 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/CMakeLists.txt line 9 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
I think I copied over this CMakeLists from a different branch before I actually had local-execution, so I commented it out in order to compile
lib/kernels/CMakeLists.txt line 11 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I assume eventually this will get changed back to
src/cuda/ops/*.cu once everything builds successfully?
Yeah, it's just src/cuda/ops/*.cu now
lib/kernels/include/kernels/device.h line 31 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why this change?
Was running into dependency issues, but forgot to uncomment => Fixed
lib/kernels/include/kernels/softmax_kernels.h line 22 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add argument names
Added
lib/kernels/src/accessor.cc line 6 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Generally we prefer explicitly using
this as it makes it clearer what is a method and what is a function
Added
lib/kernels/src/device.h line 96 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why commented out?
Fixed, forgot to uncomment
lib/kernels/src/cuda/cuda_helper.cu line 294 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why do we need to add
bools? Is it just because DatatypeDispatch forces all of the instantiations?
Yeah, ran into compile issues without the bool
lib/kernels/src/cuda/ops/attention_kernels.cu line 47 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not even on
inference/master? In that case they should be removed
I just don't see the dropOutState fields being used anywhere in the init kernel
lib/kernels/src/cuda/ops/attention_kernels.cu line 307 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Where are they used?
In init_kernel:
// allocate memory for loWinIdx/hiWinIdx
int *loWinIdx = (int *)malloc(sizeof(int) * qoSeqLength);
int *hiWinIdx = (int *)malloc(sizeof(int) * qoSeqLength);
lib/kernels/src/cuda/ops/cast_kernels.cu line 61 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
If these shouldn't be here, delete them, don't just comment them out
Done.
lib/kernels/test/CMakeLists.txt line 3 at r2 (raw file):
ff_add_test_executable( NAME kernel-tests
Done.
lib/kernels/test/CMakeLists.txt line 21 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Definitely don't need
compiler or pcg, I also don't think you need nccl
Removed
lib/kernels/test/CMakeLists.txt line 24 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Fix up commented-out stuff before merging
Removed
lib/kernels/test/src/test_attention_kernel.cc line 24 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably best to pull this out into a helper function
Created
lib/kernels/test/src/test_attention_kernel.cc line 51 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Pull generating a random
GenericTensorAccessor out into a separate function
Done
lib/kernels/test/src/test_attention_kernel.cc line 86 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
No check needed, as long as it doesn't crash I'm happy
Done
lib/kernels/test/src/test_attention_kernel.cc line 205 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer functions from
containers.h which don't require passing iterators
Done.
lib/kernels/test/src/test_cast_kernel.cc line 8 at r2 (raw file):
namespace FlexFlow { TEST_SUITE(FF_TEST_SUITE) {
Done.
lib/kernels/test/src/test_cast_kernel.cc line 83 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Generally we avoid using
auto unless the type name is really long
Done.
lib/kernels/test/src/test_cast_kernel.cc line 108 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just remove the check, not necessary here
Done.
lib/kernels/test/src/test_transpose_kernel.cc line 68 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It seems like this is actually computing the transpose? In that case I'd rather this get moved into
kernels itself as a CPU kernel, and then the tests can just compare that the CUDA and the CPU kernels do the same thing. Mainly this is because (1) super complicated tests are not great, but primarily (2) we have uses for CPU-only versions of the kernels in testing other parts of the codebase
Done.
lib/kernels/include/kernels/concat_kernels.h line 6 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why does this file include itself?
Fixed
lib/kernels/include/kernels/cuda_helper.h line 21 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This shouldn't be exposed by kernels, only the wrapped kernel functions for each operator should be exposed through the public headers
Deleted file
reyna-abhyankar
left a comment
Agree with this, and I'd add: separate the kernel tests by forward and backward (and then pull out the duplicated code). Currently it's a giant end-to-end test for each operator, and if it fails we don't actually know what part went wrong.
Reviewable status: 10 of 63 files reviewed, 37 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/kernels/include/kernels/device.h line 31 at r6 (raw file):
#if defined(FF_USE_CUDA) || defined(FF_USE_HIP_CUDA) typedef cudaStream_t ffStream_t; cudaError_t get_legion_stream(cudaStream_t *stream);
Why deleted? iirc this is called in a couple of places
lib/kernels/test/src/test_attention_kernel.cc line 36 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
@reyna-abhyankar Isn't there a way to directly allocate a
GenericTensorAccessor using an Allocator?
Yes, this is in the latest repo-refactor from 1400.
lib/kernels/test/src/test_cast_kernel.cc line 8 at r6 (raw file):
template <typename T> void allocate_ptrs(std::vector<T **> &gpu_data_ptrs,
This code can be moved out since it's replicated a bunch
lib/kernels/test/src/test_concat_kernel.cc line 27 at r6 (raw file):
for (int i = 0; i < num_inputs; i++) { void *input_data_ptr = allocator.allocate(size_per_input * sizeof(float));
Merge latest repo-refactor, I think the allocator interface now allows you to pass in a tensor shape and get back a GTAW
lib/kernels/test/src/test_layer_norm_kernels.cc line 21 at r6 (raw file):
TEST_SUITE("kernel-tests") { TEST_CASE("Test LayerNorm Forward and Backward Kernel") {
In general, I think we should have separate test cases for the 3 kernels since we aren't actually testing for correctness in this. Then, later we can add correctness tests that do it end to end
lib/kernels/test/src/test_partition_kernel.cc line 19 at r6 (raw file):
} using namespace ::FlexFlow;
Suggestion
Code snippet:
using namespace FlexFlow;
lib/kernels/test/src/test_partition_kernel.cc line 25 at r6 (raw file):
std::size_t dims[] = {num_elements}; std::size_t num_dims = 1; FlexFlow::ArrayShape shape(dims, num_dims);
Suggestion
Code snippet:
ArrayShape shape(...);
lib/kernels/test/src/test_partition_kernel.cc line 44 at r6 (raw file):
Kernels::Repartition::init_kernel(handle, DataType::FLOAT);
What's this?
lib/kernels/test/src/test_softmax_kernel.cc line 22 at r6 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Softmax Forward") { std::size_t num_elements = 100;
This is good, splitting up the kernels. But then you can also just refactor any shared code between the test cases (can call it like softmax_test_setup() that does any other work for you)
lib/kernels/test/src/test_utils.h line 10 at r6 (raw file):
#include <vector> #include <algorithm>
move that pointers allocation to here. we can maybe push it to allocator too but for now, the utils file should be fine
reyna-abhyankar
left a comment
Reviewable status: 10 of 63 files reviewed, 38 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/kernels/include/kernels/conv_2d_kernels.h line 54 at r6 (raw file):
float *filter_grad_ptr); void forward_kernel(cudaStream_t stream,
Also, why? This is the header file for both the cuda and hip kernels, so it should be generic, right?
dylanllim
left a comment
Reviewable status: 45 of 89 files reviewed, 26 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
flake.lock line 84 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You should be able to set up whatever editorconfig integration your editor uses: https://editorconfig.org/
Is there a .editorconfig file that I'm supposed to be matching somewhere?
It looks like the default one is below. Is this fine?
root = true
# Unix-style newlines with a newline ending every file
[*]
end_of_line = lf
insert_final_newline = true
[{CMakeLists.txt,*.cmake}]
indent_style = space
indent_size = 2
[*.{cc,h,cu,cpp}]
indent_style = space
indent_size = 2
[*.py]
indent_style = space
indent_size = 4
[*.toml]
indent_style = space
indent_size = 2
lib/kernels/include/kernels/managed_handle.h line 14 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is move construction disabled? Should copy-assignment be deleted?
I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.
I noticed that's what the Allocator's had so thought it would be safe to transfer over.
lib/kernels/include/kernels/managed_handle.h line 19 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not really necessary here?
ManagedHandle already has a default constructor
Done.
lib/kernels/src/cuda/cuda_helper.cu line 283 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
As a side note,
real_type should really be renamed to real_type_t to follow naming conventions, @oOTigger can you create a separate PR to do that renaming? The code below I think should work; if it does it should get propagated to the other kernel structs below in this file
This code gave issues regarding dispatch. I ended up just sticking with original code but changing to make use of real_type
lib/kernels/src/cuda/cuda_helper.cu line 285 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Make sure these get exposed through the header files.
Done.
I exposed everything except dispatch_print_tensor_kernel and ended up reverting that function back to its old template format.
For some reason, the compiler didn't like me exposing the print_tensor kernel and gave a long list of cascading errors. It says it has something to do with ff_handle and nccl.h, but I wasn't really able to figure out how to fix it.
lib/kernels/test/src/test_cast_kernel.cc line 10 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
m has a bit of an overloaded meaning, and managed_stream is substantially clearer. Also, ManagedFFStream makes it much clearer what stream you're talking about
Done.
lib/kernels/test/src/test_dropout.cc line 20 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Clearer names are really preferred: the codebase is going to have many "handle objects", so naming something
ManagedHandle makes the code really difficult to read
Done.
lib/kernels/test/src/test_partition_kernel.cc line 52 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is the
host_grad_input_data being read from the output_accessor? Either the names here are incredibly confusing or something weird is going on: the output of the backward kernel should generally be referred to as input_grad, not as output, as that would confuse things with the output tensor generated in the forward pass
Done.
lib/kernels/test/src/test_partition_kernel.cc line 55 at r13 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See above comment, something weird seems to be going on with the names here
Done.
Naming should be correct now. Added back some accessors in the backward kernels to be explicit about grad accessors
lib/kernels/test/src/test_replicate_kernel.cc line 10 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
"shape" by itself is a bit unclear, also wouldn't hurt to use this in the other tests where you just have "shape"--even if you don't want the second
output_shape, replacing shape with input_shape should already help quite a bit
Done.
lib/kernels/test/src/test_utils.h line 53 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why are we using
memcpy? Can't you just use vector's constructor?
Done. => also removed as will add back in when adding cpu kernels
lib/kernels/test/src/test_utils.h line 58 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why do we have both this and
load_data_to_host_from_device?
This was an old function from when I was still dealing with raw pointers in some places. It is removed.
lib/kernels/test/src/test_utils.cc line 90 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is the purpose of this function? In theory the placement of
GenericTensorAccessors (i.e., whether they are on gpu or cpu) shouldn't really be explicitly managed and should be left to whatever object is behind the Allocator. If this information does need to be tracked (e.g., for filling with data, etc.) it is probably best to have it be a private field on GenericTensorAccessor and provide a method on GenericTensorAccessor for setting the contents that handles all the internal details. That would also remove the cpu_fill parameters in the functions above.
That said, considering this complication I'm unsure why cpu allocation is getting addressed in this PR? afaict it's not necessary and is just going to further delay merging this?
Done.
lib/kernels/include/kernels/local_allocator.h line 6 at r14 (raw file):
namespace FlexFlow { struct LocalAllocator : public IAllocator {
Done.
lib/kernels/include/kernels/local_allocator.h line 22 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is this part of this PR? I guess there's not an issue with it being included, but I don't think it should be used here?
Removed
lib/kernels/src/local_allocator.cc line 31 at r14 (raw file):
void *LocalCPUAllocator::allocate(size_t requested_memory_size) { void *ptr = malloc(requested_memory_size); if (ptr) {
Done.
lib/kernels/src/local_allocator.cc line 46 at r14 (raw file):
} else { throw std::runtime_error( "Deallocating a pointer that was not allocated by this allocator");
Done.
lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):
void *ptr = *it; it++; this->deallocate(ptr);
This gives issues as we're modifying the ptr list while iterating over it => I think we need to safely iterate over the ptrs list as previously
lib/kernels/src/managed_handle.cc line 9 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Do these functions also need to be checked, e.g., by
checkCUDA?
Done.
lib/kernels/src/managed_handle.cc line 15 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Do these functions also need to be checked, e.g., by
checkCUDA?
Done.
lib/kernels/src/cuda/ops/reverse_kernels.cu line 39 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete?
Deleted.
On a side note ~ Is there a way to easily and quickly view the diff of the files I'm committing to make sure these errors don't happen? I've kind of just been using git status and trying to make sure that spam files aren't added, but this isn't always the greatest, and if there's junk in a file I'm not necessarily catching it.
lib/local-execution/CMakeLists.txt line 16 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove,
kernels once is plenty 😛
Done.
lib/kernels/include/kernels/managed_stream.h line 14 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is move construction disabled? Should copy-assignment be deleted?
I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.
I noticed that's what the Allocator's had so thought it would be safe to transfer over.
lib/kernels/include/kernels/managed_stream.h line 19 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not really necessary here?
ManagedStream already has a default constructor
Done.
Previously, oOTigger (Dylan Lim) wrote…
Yes, just the one in the root of the repository: https://github.com/flexflow/FlexFlow/blob/repo-refactor/.editorconfig. If you enable whatever editorconfig support your editor has, it should automatically find it and apply those settings
lockshaw
left a comment
Reviewed 44 of 44 files at r15, 2 of 2 files at r16, all commit messages.
Reviewable status: all files reviewed, 20 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
.proj.toml line 18 at r16 (raw file):
test_targets = [ "kernels-tests",
What does this do on machines without GPUs? (also, what does this do on machines with GPUs, doesn't it need nixGL?)
lib/kernels/include/kernels/managed_handle.h line 14 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.
I noticed that's what the Allocator's had so thought it would be safe to transfer over.
But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing deleted. I think you'd want move construction and move assignment enabled, and copy construction and copy assignment disabled, just like unique_ptr
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 8 at r16 (raw file):
namespace FlexFlow { struct ManagedPerDeviceFFHandle {
This appears duplicated with the file managed_handle.h which seems to still exist?
lib/kernels/src/local_cuda_allocator.cc line 14 at r16 (raw file):
void LocalCudaAllocator::deallocate(void *ptr) { checkCUDA(cudaFree(ptr)); this->ptrs.erase(ptr);
Add a check if the pointer is not in ptrs?
lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):
LocalCudaAllocator::~LocalCudaAllocator() { for (auto it = this->ptrs.begin(); it != this->ptrs.end();) {
Replace with range-based loop
lib/kernels/src/local_cuda_allocator.cc line 25 at r16 (raw file):
} Allocator get_local_cuda_memory_allocator() {
Suggestion:
Allocator create_local_cuda_memory_allocator() {
lib/kernels/src/cuda/cuda_helper.cu line 283 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
This code gave issues regarding dispatch. I ended up just sticking with original code but changing to make use of real_type
The static_casting in your version is very messy, so I'd prefer to adapt the version above. If you want help resolving the errors let me know, but just dumping everything to a void pointer and then doing a convoluted set of casts may pass the compiler but often doesn't actually result in the behavior you want
lib/kernels/src/cuda/cuda_helper.cu line 285 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
I exposed everything except dispatch_print_tensor_kernel and ended up reverting that function back to its old template format.
For some reason, the compiler didn't like me exposing the print_tensor kernel and gave a long list of cascading errors. It says it has something to do with ff_handle and nccl.h, but I wasn't really able to figure out how to fix it.
What's the error message?
lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):
#include "test_utils.h"
Why remove doctest?
lib/kernels/test/src/test_pool_2d_kernels.cc line 54 at r16 (raw file):
output_accessor.ptr); std::vector<float> host_output_accessor =
This isn't an accessor?
lib/kernels/test/src/test_split_kernel.cc line 26 at r16 (raw file):
std::vector<float *> output_ptrs(num_outputs); for (int i = 0; i < num_outputs; i++) {
Replace loop with repeat?
lib/kernels/test/src/test_split_kernel.cc line 43 at r16 (raw file):
SUBCASE("backward_kernel") { std::vector<float *> output_grad_ptrs(num_outputs); for (int i = 0; i < num_outputs; i++) {
Replace loop with repeat?
lib/kernels/test/src/test_utils.h line 4 at r16 (raw file):
#define _FLEXFLOW_KERNELS_TEST_UTILS #include "doctest/doctest.h"
Why? I don't see doctest being used in this file?
lib/kernels/test/src/test_utils.cc line 90 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
I'm still seeing it here as of the latest revision?
lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
This gives issues as we're modifying the ptr list while iterating over it => I think we need to safely iterate over the ptrs list as previously
I don't see how we're modifying the ptr list at all, so afaict the code above should work
lib/kernels/src/cuda/ops/reverse_kernels.cu line 39 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Deleted.
On a side note ~ Is there a way to easily and quickly view the diff of the files I'm committing to make sure these errors don't happen? I've kind of just been using git status and trying to make sure that spam files aren't added, but this isn't always the greatest, and if there's junk in the file I'm not necessarilly catching it.
git diff, or if they're already added, git diff --cached. You can also add each change individually (letting you preview each change as it's added) via git add -p
lib/local-execution/include/local-execution/local_allocator.h line 8 at r15 (raw file):
namespace FlexFlow { struct LocalAllocator : public IAllocator {
Why do we have both LocalAllocator and LocalCudaAllocator?
lib/kernels/include/kernels/managed_stream.h line 14 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I thought it's considered safer to disable these when dealing with memory, to avoid things like multiple deletions.
I noticed that's what the Allocator's had so thought it would be safe to transfer over.
But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing deleted.
…helper, other clean up
dylanllim
left a comment
Reviewable status: 59 of 90 files reviewed, 19 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
.proj.toml line 18 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What does this do on machines without GPUs? (also, what does this do on machines with GPUs, doesn't it need nixGL?)
If you're asking whether things build fine on a machine with GPUs => proj test builds fine but runs into an error of:
CUDA failure: CUDA driver version is insufficient for CUDA runtime version (35)
when actually trying to test things.
Everything is fine though when using:
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:nix-community/nixGL -- build/lib/kernels/test/kernels-tests
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 8 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This appears duplicated with the file
managed_handle.h which seems to still exist?
Deleted managed_handle.h
lib/kernels/src/local_cuda_allocator.cc line 14 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add a check if the pointer is not in ptrs?
Done.
lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Replace with range-based loop
Same issue as mentioned on local_allocator.cc ~ When I change this to use a range-based loop I get the following issue:
kernels-tests: /home/dylanlim/ff/pr/622/lib/kernels/src/local_cuda_allocator.cc:13: virtual void FlexFlow::LocalCudaAllocator::deallocate(void*): Assertion `false' failed.
/home/dylanlim/ff/pr/622/lib/kernels/test/src/test_attention_kernel.cc:8: FATAL ERROR: test case CRASHED: SIGABRT - Abort (abnormal termination) signal
and then a bunch of out-of-bounds memory accesses are printed, e.g.:
[unprintable binary output omitted]
lib/kernels/src/local_cuda_allocator.cc line 25 at r16 (raw file):
} Allocator get_local_cuda_memory_allocator() {
Done.
lib/kernels/src/cuda/cuda_helper.cu line 283 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The
static_casting in your version is very messy, so I'd prefer to adapt the version above. If you want help resolving the errors let me know, but just dumping everything to a void pointer and then doing a convoluted set of casts may pass the compiler but often doesn't actually result in the behavior you want
Ended up just getting rid of this for a later PR; I felt like I was spending too much time with no progress, and I'm still not fully sure why it isn't working
lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why remove doctest?
Since all test files include "doctest" I thought it'd maybe be cleaner to just include doctest in "test_utils"?
lib/kernels/test/src/test_pool_2d_kernels.cc line 54 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This isn't an accessor?
Done.
lib/kernels/test/src/test_split_kernel.cc line 26 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Replace loop with repeat?
Done.
lib/kernels/test/src/test_split_kernel.cc line 43 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Replace loop with repeat?
Done.
lib/kernels/test/src/test_utils.h line 4 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why? I don't see doctest being used in this file?
Thought since all tests use doctest I could clean things by putting it in test_utils, but I guess that's bad practice. I reverted.
lib/kernels/test/src/test_utils.cc line 90 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm still seeing it here as of the latest revision?
Done.
lib/local-execution/include/local-execution/local_allocator.h line 8 at r15 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why do we have both LocalAllocator and LocalCudaAllocator?
This must've gotten added back with my repo-refactor merge. It's deleted.
lib/kernels/include/kernels/managed_handle.h line 14 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing deleted. I think you'd want move construction and move assignment enabled, and copy construction and copy assignment disabled, just like unique_ptr
Done.
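For readers following this thread, the unique_ptr-style ownership being asked for above can be sketched roughly as follows. This is a hedged stand-in, not the actual FlexFlow API: `DeviceHandle`, `ManagedHandle`, and the plain `new`/`delete` are hypothetical substitutes for the CUDA handle type and its create/destroy calls.

```cpp
#include <cassert>
#include <utility>

// Hypothetical resource type standing in for PerDeviceFFHandle.
struct DeviceHandle { int id; };

// Move-only RAII wrapper, mirroring std::unique_ptr's special-member set.
struct ManagedHandle {
  ManagedHandle() : handle(new DeviceHandle{0}) {}

  ManagedHandle(ManagedHandle const &) = delete;             // no copy-construction
  ManagedHandle &operator=(ManagedHandle const &) = delete;  // no copy-assignment

  ManagedHandle(ManagedHandle &&other) noexcept
      : handle(std::exchange(other.handle, nullptr)) {}      // steal, null out source

  ManagedHandle &operator=(ManagedHandle &&other) noexcept {
    std::swap(this->handle, other.handle);  // our old handle is freed when `other` dies
    return *this;
  }

  ~ManagedHandle() { delete handle; }  // safe: `delete nullptr` is a no-op

  DeviceHandle *handle;
};
```

Because the moved-from object is left holding `nullptr`, the destructor runs at most one real release per resource, which is exactly the double-free protection discussed above.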
lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I don't see how we're modifying the ptr list at all, so afaict the code above should work
That's what I thought too, but I get a memory error when I use the range-based for loop when I run the tests
lib/kernels/include/kernels/managed_stream.h line 14 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
But move-construction is exactly the operation that should be allowed if you don't want multiple deletions, right? And copy-assignment is something you wouldn't want allowed, but I'm not seeing deleted.
Done.
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 31 of 31 files at r17, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
.proj.toml line 18 at r16 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
If you're asking if things build fine when on a machine with GPU's => proj test builds fine but runs into an error of:
CUDA failure: CUDA driver version is insufficient for CUDA runtime version (35)
when actually trying to test things.
Everything is fine though when using:
NIXPKGS_ALLOW_UNFREE=1 nix run --impure github:nix-community/nixGL -- build/lib/kernels/test/kernels-tests
Let's leave this out of test_targets then for now until we have a fix so proj doesn't crash on machines without a GPU
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 15 at r17 (raw file):
~ManagedPerDeviceFFHandle(); ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other) noexcept
These implementations should be in the .cc file
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 16 at r17 (raw file):
ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other) noexcept : handle(std::move(other.handle)) {}
Right now you're going to run into issues with the value being moved from double-freeing your objects I think. Making this a pointer (1) makes it easy to fix your double-free problem, and (2) as a nice side effect probably makes copying cheaper
Note that you'll need to modify your destructor accordingly, mostly along the lines of the implementation in ManagedFFStream but with slightly different behavior in terms of the actual checkCUDA(freething(...)) calls
Suggestion:
PerDeviceFFHandle *handle;
ManagedPerDeviceFFHandle();
~ManagedPerDeviceFFHandle();
ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other) noexcept
    : handle(std::exchange(other.handle, nullptr)) {}
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):
: handle(std::move(other.handle)) {} ManagedPerDeviceFFHandle &
Probably a good idea to test the move behaviors of both this and ManagedFFStream just to make sure you actually got the implementations correct. I think you should just be able to (in your tests), declare an object, move from it, and then check the state of both the moved-from object and the moved to object
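A sketch of what such a move-behavior test could look like, using a free counter to catch double frees. Everything here is a hypothetical stand-in (a `Managed` wrapper over a plain heap allocation rather than the real ManagedFFStream/ManagedPerDeviceFFHandle), so it runs without a GPU; in the actual test suite the assertions would presumably be doctest CHECKs.

```cpp
#include <cassert>
#include <utility>

// Counts resource releases; a double free would over-count this.
inline int free_count = 0;

struct Handle {};  // stand-in resource

struct Managed {
  Handle *h;
  Managed() : h(new Handle{}) {}
  Managed(Managed const &) = delete;
  Managed &operator=(Managed const &) = delete;
  Managed(Managed &&other) noexcept : h(std::exchange(other.h, nullptr)) {}
  Managed &operator=(Managed &&other) noexcept {
    std::swap(this->h, other.h);  // old resource is released when `other` dies
    return *this;
  }
  ~Managed() {
    if (h != nullptr) {  // moved-from objects hold nullptr and release nothing
      delete h;
      ++free_count;
    }
  }
};
```

Declaring an object, moving from it, and checking the moved-from state is null while the total free count matches the number of resources created is enough to catch both a missed null-out and a double free.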
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 23 at r17 (raw file):
handle = std::move(other.handle); } return *this;
Otherwise you'll double free your handle things
Suggestion:
using std::swap;
swap(this->handle, other.handle);
return *this;
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 30 at r17 (raw file):
ManagedPerDeviceFFHandle & operator=(ManagedPerDeviceFFHandle const &) = delete; };
Just for consistency we tend to layout struct definitions like this--obviously there's some variation, but trying to follow the convention makes the codebase a bit more approachable
Suggestion:
struct ManagedPerDeviceFFHandle {
public:
ManagedPerDeviceFFHandle();
ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle const &) = delete;
ManagedPerDeviceFFHandle &
operator=(ManagedPerDeviceFFHandle const &) = delete;
ManagedPerDeviceFFHandle(ManagedPerDeviceFFHandle &&other);
ManagedPerDeviceFFHandle &
operator=(ManagedPerDeviceFFHandle &&other) noexcept;
~ManagedPerDeviceFFHandle();
PerDeviceFFHandle const &raw_handle() const;
private:
PerDeviceFFHandle *handle;
};
lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Same issue as mentioned on local_allocator.cc ~ When I change this to use a range-based loop I get the following issue:
kernels-tests: /home/dylanlim/ff/pr/622/lib/kernels/src/local_cuda_allocator.cc:13: virtual void FlexFlow::LocalCudaAllocator::deallocate(void*): Assertion `false' failed.
/home/dylanlim/ff/pr/622/lib/kernels/test/src/test_attention_kernel.cc:8: FATAL ERROR: test case CRASHED: SIGABRT - Abort (abnormal termination) signal
and then a bunch of out of bounds memory accesses are printed. ie:
(garbled binary output from the out-of-bounds reads, omitted)
Without seeing what code you tried I can't really help. In general try to post not just the output but also the code you tried out unless it's included in the PR
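As background for this thread: the usual way a range-based for loop over a pointer set crashes like this is when the loop body erases from the container it is iterating (e.g. a deallocate(p) that does ptrs.erase(p)), which invalidates the underlying iterator and produces exactly this kind of corrupted-memory output. A sketch of the safe shape, with std::free standing in for cudaFree so it runs without a GPU, and the ptrs set being an assumption about the allocator's internals:

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_set>

// Mirrors the allocator's bookkeeping of live allocations.
std::unordered_set<void *> ptrs;

// Destructor-style "free everything" pass.
void deallocate_all() {
  // Safe: the range-based loop only reads the set; nothing erases mid-iteration.
  for (void *p : ptrs) {
    std::free(p);
  }
  ptrs.clear();  // remove all entries only after iteration finishes
}
```

If the loop body instead calls something that erases the current element from ptrs, the behavior is undefined, which would explain why an index- or while-based version that re-queries the container appears to work while the range-based one does not.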
lib/kernels/src/local_cuda_allocator.cc line 14 at r17 (raw file):
void LocalCudaAllocator::deallocate(void *ptr) { auto it = this->ptrs.find(ptr); if (it != this->ptrs.end()) {
Suggestion:
if (contains(this->ptrs, ptr)) {
lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Since all test files include "doctest" I thought it'd maybe be cleaner to just include doctest in "test_utils"?
Just include it in the tests directly, in my experience trying tricks like this to avoid repeating headers just makes everything more confusing and harms build times.
lib/kernels/test/src/test_utils.h line 4 at r16 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Thought since all tests use doctest I could clean things by putting it in test_utils, but I guess that's bad practice. I reverted.
I don't know if it's widely thought of as a bad practice, but I've had issues with it in the past so I try to avoid it. I generally find it makes everything more complicated and can harm build times with little concrete benefit.
lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
That's what I thought too, but I get a memory error when I use the ranged based for loop when I run the tests
See other comment about posting the code you tried
lib/kernels/include/kernels/managed_ff_stream.h line 18 at r17 (raw file):
: stream(std::exchange(other.stream, nullptr)) {} ManagedFFStream &operator=(ManagedFFStream &&other) noexcept {
These should be in the .cc file
lib/kernels/include/kernels/managed_ff_stream.h line 24 at r17 (raw file):
} return *this; }
Suggestion:
ManagedFFStream &operator=(ManagedFFStream &&other) noexcept {
std::swap(this->stream, other.stream);
return *this;
}
lib/kernels/src/managed_ff_stream.cc line 9 at r17 (raw file):
ManagedFFStream::~ManagedFFStream() { checkCUDA(cudaStreamDestroy(stream));
Since moved-from values will have a stream with value nullptr
Suggestion:
if (stream != nullptr) {
checkCUDA(cudaStreamDestroy(stream));
}
reyna-abhyankar
left a comment
There was a problem hiding this comment.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/kernels/src/array_shape.cc line 64 at r11 (raw file):
Previously, lockshaw (Colin Unger) wrote…
@reyna-abhyankar Can we get either a fix or an issue created for this?
Yeah, I can fix this in #1418 (issue created is #1421)
lib/kernels/src/cuda/ops/batch_norm_kernels.cu line 134 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I don't see any issue in passing in a stream in local-execution (I assume there's an ArgRef of some kind for that)? @reyna-abhyankar
init_kernel should be wrapped in a profile(...) like the other kernel calls. If you look at kernels/profiling.h, you'll see the profiling wrapper passes in the stream, so local execution doesn't handle it.
dylanllim
left a comment
There was a problem hiding this comment.
Reviewable status: 66 of 90 files reviewed, 14 unresolved discussions (waiting on @lockshaw)
.proj.toml line 18 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Let's leave this out of test_targets then for now until we have a fix so proj doesn't crash on machines without a GPU
idk why I didn't understand your question initially. Things still build fine even on machines without a GPU
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 15 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
These implementations should be in the .cc file
Done.
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 16 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Right now you're going to run into issues with the value being moved from double-freeing your objects I think. Making this a pointer (1) makes it easy to fix your double-free problem, and (2) as a nice side effect probably makes copying cheaper
Note that you'll need to modify your destructor accordingly, mostly along the lines of the implementation in ManagedFFStream but with slightly different behavior in terms of the actual checkCUDA(freething(...)) calls
Done.
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably a good idea to test the move behaviors of both this and ManagedFFStream just to make sure you actually got the implementations correct. I think you should just be able to (in your tests), declare an object, move from it, and then check the state of both the moved-from object and the moved-to object
Done.
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 23 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Otherwise you'll double free your handle things
Done.
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 30 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just for consistency we tend to layout struct definitions like this--obviously there's some variation, but trying to follow the convention makes the codebase a bit more approachable
Done.
lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Without seeing what code you tried I can't really help. In general try to post not just the output but also the code you tried out unless it's included in the PR
Oh, I was just trying to show that it led to printing out random memory.
But yeah, I've been testing it for a while and a range-based loop always seems to lead to problems, with all tests failing because it deallocates a pointer that it didn't allocate. I verified the tests as well, and none of them look like they should be calling deallocate on an invalid pointer; only with a non-range-based loop do all allocations and deallocations work as expected.
I rewrote it to use a while loop, so potentially that looks a little cleaner?
lib/kernels/src/local_cuda_allocator.cc line 14 at r17 (raw file):
void LocalCudaAllocator::deallocate(void *ptr) { auto it = this->ptrs.find(ptr); if (it != this->ptrs.end()) {
Done.
lib/kernels/src/cuda/ops/batch_norm_kernels.cu line 134 at r6 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
init_kernel should be wrapped in a profile(...) like the other kernel calls. If you look at kernels/profiling.h, you'll see the profiling wrapper passes in the stream, so local execution doesn't handle it.
In that case I'll just keep things as is?
lib/kernels/test/src/test_cuda.cc line 1 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just include it in the tests directly, in my experience trying tricks like this to avoid repeating headers just makes everything more confusing and harms build times.
Done.
lib/kernels/src/local_allocator.cc line 54 at r14 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See other comment about posting the code you tried
Done.
lib/kernels/include/kernels/managed_ff_stream.h line 18 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
These should be in the .cc file
Done.
lib/kernels/include/kernels/managed_ff_stream.h line 24 at r17 (raw file):
} return *this; }
Done.
lib/kernels/src/managed_ff_stream.cc line 9 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Since moved-from values will have a stream with value nullptr
Done.
reyna-abhyankar
left a comment
There was a problem hiding this comment.
Reviewable status: 64 of 90 files reviewed, 14 unresolved discussions (waiting on @lockshaw)
lib/kernels/test/src/test_dropout.cc line 54 at r11 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Originally wanted to keep this unnested but fixed after our discussion on Friday
I believe this is still nested?
dylanllim
left a comment
There was a problem hiding this comment.
Reviewable status: 64 of 90 files reviewed, 14 unresolved discussions (waiting on @lockshaw)
lib/kernels/test/src/test_dropout.cc line 54 at r11 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I believe this is still nested?
It doesn't look like it is?
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 24 of 24 files at r18, 2 of 2 files at r19, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @oOTigger)
.proj.toml line 18 at r16 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
idk why I didn't understand your question initially. Things still build fine even on machines without a GPU
Yes, but proj test runs the tests, at which point you'll get an error on non-GPU machines
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Where? Unless I missed something I'm not seeing any new tests added
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 22 at r19 (raw file):
~ManagedPerDeviceFFHandle(); PerDeviceFFHandle const &raw_handle();
Suggestion:
PerDeviceFFHandle const &raw_handle() const;
lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Oh, I was just trying to show that it led to printing out random memory.
But yeah, I've been testing it for a while and a range-based loop always seems to lead to problems, with all tests failing because it deallocates a pointer that it didn't allocate. I verified the tests as well, and none of them look like they should be calling deallocate on an invalid pointer; only with a non-range-based loop do all allocations and deallocations work as expected.
I rewrote it to use a while loop so potentially that looks a little cleaner?
I still don't see a reason that the range-based for loop wouldn't work. Can you show me the code you're trying that isn't working properly?
lib/kernels/src/cuda/ops/batch_norm_kernels.cu line 134 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
In that case I'll just keep things as is?
Sure, this is fine for now.
lib/kernels/include/kernels/managed_ff_stream.h line 20 at r19 (raw file):
~ManagedFFStream(); ffStream_t const &raw_stream();
Suggestion:
ffStream_t const &raw_stream() const;
dylanllim
left a comment
There was a problem hiding this comment.
Reviewable status: 84 of 90 files reviewed, 5 unresolved discussions (waiting on @lockshaw)
.proj.toml line 18 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Yes, but proj test runs the tests, at which point you'll get an error on non-GPU machines
Done.
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 18 at r17 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Where? Unless I missed something I'm not seeing any new tests added
Created issue for this
lib/kernels/include/kernels/managed_per_device_ff_handle.h line 22 at r19 (raw file):
~ManagedPerDeviceFFHandle(); PerDeviceFFHandle const &raw_handle();
Done.
lib/kernels/src/local_cuda_allocator.cc line 18 at r16 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I still don't see a reason that the range-based for loop wouldn't work. Can you show me the code you're trying that isn't working properly?
Done.
lib/kernels/include/kernels/managed_ff_stream.h line 20 at r19 (raw file):
~ManagedFFStream(); ffStream_t const &raw_stream();
Done.
lockshaw
left a comment
There was a problem hiding this comment.
Reviewed 6 of 6 files at r20, all commit messages.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on @oOTigger)
Add unit tests for the kernel functions of the following operators: MultiHeadAttention, Cast, Combine, Concat, Flat, Partition, Replicate, Reshape, Softmax, Transpose.
The changes made to initializer_kernels.h, array_shape.h, and utils/fmt.h were needed in order for the tests to compile.