Conversation
```cpp
DeviceSpecific<ElementUnaryPerDeviceState> per_device_state =
    acc.create_device_specific<ElementUnaryPerDeviceState>(
        init_kernel(handle,
                    {input_shape.dims},
                    {output_shape.dims},
                    input_shape.data_type));
```
@lambda7xx the kernel takes `ArrayShape` and we have `ParallelTensorShape`
how about use …
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I think we can use …
|
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
forgot to pass the `op_type`. The definition of `init_kernel` is below.
|
can we bind …
lambda7xx
left a comment
Reviewed 1 of 5 files at r1.
Reviewable status: 1 of 13 files reviewed, 3 unresolved discussions (waiting on @lockshaw, @reyna-abhyankar, and @wmdi)
why …

the original code `int out_dim = output.shape.at(ff_dim_t{0}) + 1;` …
|
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
the original code … so I think the batch_size should be …
|
`auto weight_grad = acc.get_tensor_grad<Permissions::RW>(WEIGHT);`
|
the …

where is scalar?
|
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
I set up a PR about layernorm: LayerNorm OP draft by lambda7xx · Pull Request #1186 · flexflow/FlexFlow (github.com)
reyna-abhyankar
left a comment
Reviewable status: 1 of 13 files reviewed, 7 unresolved discussions (waiting on @lambda7xx, @lockshaw, and @wmdi)
lib/kernels/src/cuda/element_unary_kernels.cu line 78 at r4 (raw file):
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
where is scalar?
Done. See other comment (I think we should merge the two attrs classes)
lib/kernels/src/cuda/layer_norm_kernels.cu line 36 at r4 (raw file):
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
I set up a PR about layernorm: LayerNorm OP draft by lambda7xx · Pull Request #1186 · flexflow/FlexFlow (github.com)
Ok, we can use your layer norm PR
I think you can do something like this:
Code snippet:

```cpp
int n = 6; // number of pointers
mean = (float *)allocator.allocate(sizeof(float) * batch_size * n);
rstd = (float *)mean + batch_size;
...
```

lib/runtime/src/ops/element_unary.cc line 216 at r4 (raw file):
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
can we bind `ElementScalarUnaryAttrs const &attrs`, then we can get `ElementUnaryPerDeviceState`? The `ElementScalarUnaryAttrs const &attrs` and `ElementUnaryAttrs const &attrs` are different classes.
I actually think we should merge them. @lockshaw the op type will determine what is executed in the kernel anyway
lib/runtime/src/ops/embedding.h line 20 at r4 (raw file):
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
why `InputParallelTensorDesc`? original code is `ParallelTensorShape`
InputParallelTensorDesc tells us if an input is trainable or not. Useful for the binding
lib/runtime/src/ops/embedding.cc line 71 at r4 (raw file):
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
the original code is

```cpp
int out_dim = output.domain.hi()[0] - output.domain.lo()[0] + 1;
int effective_batch_size = output.domain.get_volume() / out_dim;
```

so I think the batch_size should be

```cpp
int out_dim = output.shape.at(ff_dim_t{0}) + 1;
int batch_size = output.shape.get_volume() / out_dim;
```
lib/runtime/src/ops/embedding.cc line 85 at r4 (raw file):
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
`auto weight_grad = acc.get_tensor_grad<Permissions::RW>(WEIGHT);`
Done.
lib/runtime/src/ops/embedding.cc line 101 at r4 (raw file):
Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…
the `input.shape[legion_dim_t(1)]` is batch_size? I don't think so.
Done. See previous comment.
lockshaw
left a comment
Reviewed 4 of 5 files at r1, 3 of 4 files at r2, 2 of 2 files at r3, 3 of 3 files at r4, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @lambda7xx, @reyna-abhyankar, and @wmdi)
lib/kernels/include/kernels/element_unary_kernels.h line 19 at r4 (raw file):
```cpp
OperatorType op_type;
DataType data_type;
float scalar;
```
Does this really compile without the req?
Suggestion:
```cpp
req<float> scalar;
```

lib/kernels/include/kernels/layer_norm_kernels.h line 38 at r4 (raw file):
`GenericTensorAccessorW const &beta_grad, DataType data_type, int64_t batch_size,`
Isn't this part of the shape so it can just be accessed via input?
lib/kernels/include/kernels/layer_norm_kernels.h line 39 at r4 (raw file):
`DataType data_type, int64_t batch_size, int64_t num_elements,`
Isn't this part of the shape so it can just be accessed via one of the weights?
lib/kernels/src/cuda/element_unary_kernels.cu line 78 at r4 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done. See other comment (I think we should merge the two attrs classes)
Are you still in favor of this after the meeting yesterday? I'd like to keep the attrs separate, though I don't really care what happens with them in kernels
lib/kernels/src/cuda/layer_norm_kernels.cu line 36 at r4 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Ok, we can use your layer norm PR
I think you can do something like this:
Yeah cudaMalloc should be replaced by Allocator
lib/kernels/src/hip/layer_norm_kernels.cpp line 33 at r4 (raw file):
```cpp
int64_t effective_batch_size) {
  float *mean, *rstd, *ds, *db, *scale, *bias;
  checkCUDA(cudaMalloc(&mean, sizeof(float) * batch_size));
```
Use Allocator here
lib/kernels/src/hip/layer_norm_kernels.cpp line 191 at r4 (raw file):
`GenericTensorAccessorW const &beta_grad, DataType data_type, int64_t batch_size,`
Isn't this accessible through the tensor shapes?
lib/runtime/src/ops/element_unary.cc line 178 at r4 (raw file):
```cpp
init_binding.bind_arg(HANDLE, ff_handle());
init_binding.bind_arg(ATTRS, attrs);
init_binding.bind_arg(INPUT_SHAPE, input_parallel_tensor_shape(0));
```
Suggestion:
```cpp
init_binding.bind_arg(INPUT_SHAPE, input_shape);
```

lib/runtime/src/ops/element_unary.cc line 216 at r4 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I actually think we should merge them. @lockshaw the op type will determine what is executed in the kernel anyway
I'd like to keep them separate at the op-attrs level, but I don't care what happens to them in the runtime/ops and kernels levels
lib/runtime/src/ops/embedding.cc line 71 at r4 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Should be `input.shape[ff_dim_t(0)]` I think, as this is just a `TensorShape` and not a `ParallelTensorShape` and so there shouldn't be a parallel dimension present
lib/runtime/src/ops/embedding.cc line 85 at r4 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done.
I'm still seeing `Permissions::RO` here...
lib/runtime/src/ops/embedding.cc line 101 at r4 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done. See previous comment.
Should be `ff_dim_t(0)`
Previously, lockshaw (Colin Unger) wrote…

what's the implementation @lockshaw?
Description of changes:
Ignore branch name
Related Issues:
Linked Issues:
Issues closed by this PR: