
Conversation

@yzh119 (Member) commented Jun 30, 2023

Motivation

Currently, our CUDA codegen does not use CUDA's half2 and nv_bfloat162 intrinsics; it calls scalar operators for each element of the vector, which is inefficient. This PR improves the generated CUDA code by emitting half2 and nv_bfloat162 intrinsics when possible, which can make the generated program run faster (in cases where nvcc does not perform this optimization itself).
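
For context, the difference at the CUDA level looks roughly like the hand-written sketch below (a minimal illustration, not code generated by TVM; the kernel names and indexing scheme are made up). The scalar path issues one half multiply and one half add per element, while the packed path folds both lanes into a single half2 fused multiply-add:

#include <cuda_fp16.h>

// Scalar path: one __hmul/__hadd pair per element (half arithmetic needs sm_53+).
__global__ void scale_add_scalar(const half* __restrict__ a, half* __restrict__ b) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
  b[i]     = __hadd(__hmul(a[i],     __float2half_rn(3.0f)), __float2half_rn(1.0f));
  b[i + 1] = __hadd(__hmul(a[i + 1], __float2half_rn(3.0f)), __float2half_rn(1.0f));
}

// Packed path: one half2 load, one __hfma2, one half2 store per thread.
__global__ void scale_add_half2(const half* __restrict__ a, half* __restrict__ b) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  half2 v     = reinterpret_cast<const half2*>(a)[i];
  half2 scale = __float2half2_rn(3.0f);  // broadcast 3.0 into both lanes
  half2 bias  = __float2half2_rn(1.0f);  // broadcast 1.0 into both lanes
  reinterpret_cast<half2*>(b)[i] = __hfma2(v, scale, bias);
}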

The PR is based on #15183 and will be rebased onto mainline after that PR gets merged.

Example

Suppose a user is vectorizing the following operation:

import tvm
import tvm.tir as tir
from tvm.script import tir as T

@T.prim_func
def vec_fp16(a: T.Buffer((128,), "float16"), b: T.Buffer((128,), "float16")):
    for i in range(128):
        with T.block("b"):
            vi = T.axis.spatial(128, i)
            b[vi] = a[vi] * T.float16(3.0) + T.float16(1.0)
    
sch = tir.Schedule(vec_fp16)
b = sch.get_block("b")
i = sch.get_loops(b)[0]
# Split the loop into (blockIdx.x, threadIdx.x, vector lanes) = (2, 32, 2).
bx, tx, vec = sch.split(i, [2, 32, 2])
sch.bind(bx, "blockIdx.x")
sch.bind(tx, "threadIdx.x")
# Vectorize the innermost 2-element loop so each thread processes a half2.
sch.vectorize(vec)

f = tvm.build(sch.mod["main"], target="cuda")
print(f.imported_modules[0].get_source())

Before this PR, TVM would emit the following CUDA code:

extern "C" __global__ void __launch_bounds__(32) default_function_kernel(half* __restrict__ a, half* __restrict__ b) {
  uint1 __1;
    uint1 __2;
      uint1 v_ = *(uint1*)(a + ((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 2)));
      uint1 v__1 = make_uint1(__pack_half2(__float2half_rn(3.000000e+00f), __float2half_rn(3.000000e+00f)));
      ((half2*)(&(__2.x)))->x = (((half2*)(&(v_.x)))->x*((half2*)(&(v__1.x)))->x);
      ((half2*)(&(__2.x)))->y = (((half2*)(&(v_.x)))->y*((half2*)(&(v__1.x)))->y);
    uint1 v__2 = make_uint1(__pack_half2(__float2half_rn(1.000000e+00f), __float2half_rn(1.000000e+00f)));
    ((half2*)(&(__1.x)))->x = (((half2*)(&(__2.x)))->x+((half2*)(&(v__2.x)))->x);
    ((half2*)(&(__1.x)))->y = (((half2*)(&(__2.x)))->y+((half2*)(&(v__2.x)))->y);
  *(uint1*)(b + ((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 2))) = __1;
}

After this PR, TVM would emit code that uses half2 intrinsics directly:

extern "C" __global__ void __launch_bounds__(32) default_function_kernel(half* __restrict__ a, half* __restrict__ b) {
  *(half2*)(b + ((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 2))) = ((*(half2*)(a + ((((int)blockIdx.x) * 64) + (((int)threadIdx.x) * 2))) * make_half2(__float2half_rn(3.000000e+00f), __float2half_rn(3.000000e+00f))) + make_half2(__float2half_rn(1.000000e+00f), __float2half_rn(1.000000e+00f)));
}
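
The nv_bfloat162 path that this PR also enables follows the same pattern on sm_80+ GPUs. A minimal hand-written sketch of packed bf16 arithmetic (again not the exact code TVM emits; the kernel name is hypothetical):

#include <cuda_bf16.h>

// Packed bf16 path: one __nv_bfloat162 load, one __hfma2, one store per thread
// (the bf16 arithmetic intrinsics require sm_80+).
__global__ void scale_add_bf162(const __nv_bfloat16* __restrict__ a,
                                __nv_bfloat16* __restrict__ b) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  __nv_bfloat162 v     = reinterpret_cast<const __nv_bfloat162*>(a)[i];
  __nv_bfloat162 scale = __float2bfloat162_rn(3.0f);  // broadcast 3.0 into both lanes
  __nv_bfloat162 bias  = __float2bfloat162_rn(1.0f);  // broadcast 1.0 into both lanes
  reinterpret_cast<__nv_bfloat162*>(b)[i] = __hfma2(v, scale, bias);
}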

cc @Hzfengsy @masahi @tqchen @junrushao @vinx13

@tvm-bot (Collaborator) commented Jun 30, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: codegen. See #10317 for details.

Generated by tvm-bot

@yzh119 yzh119 changed the title [WIP][Codegen] Use CUDA's half2 and nv_bfloat162 intrinsics for vector fp16/bf16 data types [Codegen] Use CUDA's half2 and nv_bfloat162 intrinsics for vector fp16/bf16 data types Jul 1, 2023
@yzh119 yzh119 marked this pull request as ready for review July 1, 2023 09:56
@github-actions github-actions bot requested review from Hzfengsy, junrushao and masahi July 1, 2023 09:57
@github-actions github-actions bot requested a review from tqchen July 1, 2023 09:57
@github-actions github-actions bot requested a review from vinx13 July 3, 2023 08:58