
Embedding Op #1179

Closed
reyna-abhyankar wants to merge 8 commits into flexflow:repo-refactor from reyna-abhyankar:emb-eleu-layno

Conversation

@reyna-abhyankar
Collaborator

@reyna-abhyankar reyna-abhyankar commented Oct 7, 2023

Description of changes:

Ignore branch name

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:



Comment on lines +65 to +70
DeviceSpecific<ElementUnaryPerDeviceState> per_device_state =
acc.create_device_specific<ElementUnaryPerDeviceState>(
init_kernel(handle,
{input_shape.dims},
{output_shape.dims},
input_shape.data_type));
Collaborator Author

@lambda7xx the kernel takes ArrayShape and we have ParallelTensorShape

@lambda7xx
Contributor

lib/kernels/src/cuda/layer_norm_kernels.cu line 36 at r4 (raw file):

  checkCUDA(cudaMalloc(&rstd, sizeof(float) * batch_size));
  checkCUDA(cudaMalloc(&ds, sizeof(float) * batch_size));
  checkCUDA(cudaMalloc(&db, sizeof(float) * batch_size));

How about using the Allocator to allocate this memory?

@lambda7xx
Contributor

lib/runtime/src/ops/element_unary.cc line 70 at r2 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

@lambda7xx the kernel takes ArrayShape and we have ParallelTensorShape

I think we can use input to get its ArrayShape

@lambda7xx
Contributor

lib/runtime/src/ops/element_unary.cc line 70 at r2 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

I think we can use input to get its ArrayShape

This call forgets to pass the op_type. The definition of init_kernel is below.

ElementUnaryPerDeviceState init_kernel(PerDeviceFFHandle const &handle,
                                       ArrayShape const &input_shape,
                                       ArrayShape const &output_shape,
                                       OperatorType op_type,
                                       DataType data_type)

@lambda7xx
Contributor

lib/runtime/src/ops/element_unary.cc line 216 at r4 (raw file):

  SimTaskBinding init_binding;
  init_binding.bind_arg(HANDLE, ff_handle());
  init_binding.bind_arg(ATTRS, attrs);

Can we bind ElementScalarUnaryAttrs const &attrs, so that we can then get the ElementUnaryPerDeviceState?

ElementScalarUnaryAttrs and ElementUnaryAttrs are different classes.

Contributor

@lambda7xx lambda7xx left a comment


Reviewed 1 of 5 files at r1.
Reviewable status: 1 of 13 files reviewed, 3 unresolved discussions (waiting on @lockshaw, @reyna-abhyankar, and @wmdi)

@lambda7xx
Contributor

lib/runtime/src/ops/embedding.h line 20 at r4 (raw file):

CostMetrics measure_operator_cost(SimEnvFactory const &sim_factory,
                                  EmbeddingAttrs const &attrs,
                                  InputParallelTensorDesc const &input_shape,

Why InputParallelTensorDesc? The original code uses ParallelTensorShape.

@lambda7xx
Contributor

lib/runtime/src/ops/embedding.cc line 71 at r4 (raw file):

                 input.shape.get_dim(),
                 output.shape.get_dim(),
                 input.shape[legion_dim_t(1)]);

the original code:

    int out_dim = output.domain.hi()[0] - output.domain.lo()[0] + 1;
    int effective_batch_size = output.domain.get_volume() / out_dim;

so I think the batch_size should be:

    int out_dim = output.shape.at(ff_dim_t{0}) + 1;
    int batch_size = output.shape.get_volume() / out_dim;

@lambda7xx
Contributor

lib/runtime/src/ops/embedding.cc line 71 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi) wrote…

the original code:

    int out_dim = output.domain.hi()[0] - output.domain.lo()[0] + 1;
    int effective_batch_size = output.domain.get_volume() / out_dim;

so I think the batch_size should be:

    int out_dim = output.shape.at(ff_dim_t{0}) + 1;
    int batch_size = output.shape.get_volume() / out_dim;

@lambda7xx
Contributor

lib/runtime/src/ops/embedding.cc line 85 at r4 (raw file):

  auto input = acc.get_tensor<Permissions::RO>(INPUT);
  auto output = acc.get_tensor<Permissions::RO>(OUTPUT);
  auto weight_grad = acc.get_tensor_grad<Permissions::RO>(WEIGHT);

auto weight_grad = acc.get_tensor_grad<Permissions::RW>(WEIGHT);

@lambda7xx
Contributor

lib/runtime/src/ops/embedding.cc line 101 at r4 (raw file):

                 input.shape.get_dim(),
                 output.shape.get_dim(),
                 input.shape[legion_dim_t(1)]);

Is input.shape[legion_dim_t(1)] the batch_size? I don't think so.

@lambda7xx
Contributor

lib/kernels/src/cuda/element_unary_kernels.cu line 78 at r4 (raw file):

  ElementUnaryPerDeviceState per_device_state = {
      handle, inputTensor, outputTensor, actiDesc, op_type, data_type, scalar};

Where is scalar?

@lambda7xx
Contributor

lib/kernels/src/cuda/layer_norm_kernels.cu line 36 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

how about use Allocator to allocate memory? and

I opened a PR about LayerNorm: LayerNorm OP draft by lambda7xx · Pull Request #1186 · flexflow/FlexFlow (github.com)

Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment


Reviewable status: 1 of 13 files reviewed, 7 unresolved discussions (waiting on @lambda7xx, @lockshaw, and @wmdi)


lib/kernels/src/cuda/element_unary_kernels.cu line 78 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

where is scalar

Done. See other comment (I think we should merge the two attrs classes)


lib/kernels/src/cuda/layer_norm_kernels.cu line 36 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

I opened a PR about LayerNorm: LayerNorm OP draft by lambda7xx · Pull Request #1186 · flexflow/FlexFlow (github.com)

Ok, we can use your layer norm PR
I think you can do something like this:

Code snippet:

int n = 6; // number of pointers
mean = (float *) allocator.allocate(sizeof(float) * batch_size * n);
rstd = mean + batch_size; // already a float *, no cast needed
...

lib/runtime/src/ops/element_unary.cc line 216 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

can we bind ElementScalarUnaryAttrs const &attrs , then we can get ElementUnaryPerDeviceState?

The ElementScalarUnaryAttrs const &attrs and ElementUnaryAttrs const &attrs are different class.

I actually think we should merge them. @lockshaw the op type will determine what is executed in the kernel anyway


lib/runtime/src/ops/embedding.h line 20 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

why InputParallelTensorDesc ? original code is ParallelTensorShape

InputParallelTensorDesc tells us if an input is trainable or not. Useful for the binding


lib/runtime/src/ops/embedding.cc line 71 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

the original code

    int out_dim = output.domain.hi()[0] - output.domain.lo()[0] + 1;
    int effective_batch_size = output.domain.get_volume() / out_dim;


so I think the batch_size should be


int out_dim = output.shape.at(ff_dim_t{0}) + 1;
int batch_size = output.shape.get_volume() / out_dim;

@lockshaw


lib/runtime/src/ops/embedding.cc line 85 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

auto weight_grad = acc.get_tensor_grad<Permissions::RW>(WEIGHT);

Done.


lib/runtime/src/ops/embedding.cc line 101 at r4 (raw file):

Previously, lambda7xx (Lambda(Xiaoxiang) Shi ) wrote…

the input.shape[legion_dim_t(1)]) is batch_size? I don't think so.

Done. See previous comment.

Collaborator

@lockshaw lockshaw left a comment


Reviewed 4 of 5 files at r1, 3 of 4 files at r2, 2 of 2 files at r3, 3 of 3 files at r4, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @lambda7xx, @reyna-abhyankar, and @wmdi)


lib/kernels/include/kernels/element_unary_kernels.h line 19 at r4 (raw file):

  OperatorType op_type;
  DataType data_type;
  float scalar;

Does this really compile without the req?

Suggestion:

  req<float> scalar;

lib/kernels/include/kernels/layer_norm_kernels.h line 38 at r4 (raw file):

                     GenericTensorAccessorW const &beta_grad,
                     DataType data_type,
                     int64_t batch_size,

Isn't this part of the shape so it can just be accessed via input?


lib/kernels/include/kernels/layer_norm_kernels.h line 39 at r4 (raw file):

                     DataType data_type,
                     int64_t batch_size,
                     int64_t num_elements,

Isn't this part of the shape so it can jut be accessed via one of the weights?


lib/kernels/src/cuda/element_unary_kernels.cu line 78 at r4 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done. See other comment (I think we should merge the two attrs classes)

Are you still in favor of this after the meeting yesterday? I'd like to keep the attrs separate, though I don't really care what happens with them in kernels


lib/kernels/src/cuda/layer_norm_kernels.cu line 36 at r4 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Ok, we can use your layer norm PR
I think you can do something like this:

Yeah cudaMalloc should be replaced by Allocator


lib/kernels/src/hip/layer_norm_kernels.cpp line 33 at r4 (raw file):

                                    int64_t effective_batch_size) {
  float *mean, *rstd, *ds, *db, *scale, *bias;
  checkCUDA(cudaMalloc(&mean, sizeof(float) * batch_size));

Use Allocator here


lib/kernels/src/hip/layer_norm_kernels.cpp line 191 at r4 (raw file):

                       GenericTensorAccessorW const &beta_grad,
                       DataType data_type,
                       int64_t batch_size,

Isn't this accessible through the tensor shapes?


lib/runtime/src/ops/element_unary.cc line 178 at r4 (raw file):

  init_binding.bind_arg(HANDLE, ff_handle());
  init_binding.bind_arg(ATTRS, attrs);
  init_binding.bind_arg(INPUT_SHAPE, input_parallel_tensor_shape(0));

Suggestion:

init_binding.bind_arg(INPUT_SHAPE, input_shape);

lib/runtime/src/ops/element_unary.cc line 216 at r4 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

I actually think we should merge them. @lockshaw the op type will determine what is executed in the kernel anyway

I'd like to keep them separate at the op-attrs level, but I don't care what happens to them at the runtime/ops and kernels levels


lib/runtime/src/ops/embedding.cc line 71 at r4 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

@lockshaw

Should be input.shape[ff_dim_t(0)] I think, as this is just a TensorShape and not a ParallelTensorShape and so there shouldn't be a parallel dimension present


lib/runtime/src/ops/embedding.cc line 85 at r4 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done.

I'm still seeing Permissions::RO here...


lib/runtime/src/ops/embedding.cc line 101 at r4 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done. See previous comment.

Should be ff_dim_t(0)

@lambda7xx
Contributor

lib/runtime/src/ops/embedding.cc line 71 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should be input.shape[ff_dim_t(0)] I think, as this is just a TensorShape and not a ParallelTensorShape and so there shouldn't be a parallel dimension present

What would the implementation look like, @lockshaw?

@reyna-abhyankar reyna-abhyankar deleted the emb-eleu-layno branch January 1, 2024 20:04