Conversation
    __device__ __inline__ __e4m3 operator|(const __e4m3 x, const __e4m3 y) {
      // ...
      return val;
    }
Not exactly sure if this is reasonable. I don't know why memcpy is extensively used for these fp8 types.
Pinging @jjsjann123
This does look cleaner :)
I thought there may be some magic with memcpy.
!build
jacobhinkle left a comment:
LGTM. I wonder if we could handle all types (other than ComplexDouble) by bitcasting to an integer of the right size then doing bitwise or. For this we would need Int8 and Int16, but we would not need support in the runtime files for bitwise ops on floats then.
Yeah, I did think about it, but I found it's just easier to handle these low-precision types separately than to add new integer types, at least for now.
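The bitcast idea suggested above could be sketched roughly as follows. This is only an illustration under assumed names: `Half` is a hypothetical stand-in for a 16-bit float type like `__half`, not nvFuser's actual definition, and `std::memcpy` is used as the portable pre-C++20 way to express the bitcast.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 16-bit float type such as __half.
struct Half {
  std::uint16_t raw;
};

// Bitcast to a same-sized integer, apply the bitwise op, then
// bitcast back. The same pattern would need 8-bit integers for the
// fp8 types, which is why Int8/Int16 support comes up.
inline Half operator|(Half x, Half y) {
  std::uint16_t xi, yi;
  std::memcpy(&xi, &x, sizeof(x));
  std::memcpy(&yi, &y, sizeof(y));
  const std::uint16_t r = static_cast<std::uint16_t>(xi | yi);
  Half out;
  std::memcpy(&out, &r, sizeof(out));
  return out;
}
```

With this pattern, the runtime files would only need bitwise ops on integers, never directly on the float types.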
`cat` is translated to CUDA code with an if-then-else block. This is correct, but I believe it can be simplified to just

    out[idx] = input0[idx] + input1[idx] + ...

since all of the inputs are padded with zeros, so the result should be equivalent. Since `+` is not defined for some low-precision types, bitwise-or is used instead when addition is not available. On A100, this simplification yielded about a 5% perf improvement on a RoPE module.
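The equivalence above can be sketched in plain C++ with hypothetical names (a two-input `cat` along the last dimension; the actual kernels are generated CUDA):

```cpp
#include <cstddef>

// If-then-else lowering: each output index reads from the input
// that owns it.
float cat_if_then_else(const float* a, std::size_t na,
                       const float* b, std::size_t idx) {
  return (idx < na) ? a[idx] : b[idx - na];
}

// Simplified lowering: both inputs are assumed zero-padded to the
// full output length, so at each index exactly one operand is
// nonzero and a plain sum reproduces the selection.
float cat_sum(const float* a_padded, const float* b_padded,
              std::size_t idx) {
  return a_padded[idx] + b_padded[idx];
}
```

For types without `+`, the sum would be replaced by the bitwise-or operator, which is equivalent here because the padded operand's bits are all zero.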
I didn't add any specific test since I don't know if any new test would be beneficial. Half, bfloat16, and fp8 types are already tested by `ResizeTest.CatMemoryPromotionReducedFloating`.