FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm#1274
FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm#1274crcrpar merged 11 commits intoNVIDIA:masterfrom
Conversation
Some benchmark data on A100: [benchmark table not preserved]
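Since the table itself did not survive, here is a minimal timing harness of the kind that could produce such numbers; the shapes, dtype, and iteration counts below are illustrative assumptions, not the setup actually used in this PR:

```python
import torch
from apex.normalization import FusedRMSNorm

# Illustrative configuration -- not the exact setup behind the A100 numbers.
batch, seq, hidden = 32, 512, 1024
norm = FusedRMSNorm(normalized_shape=hidden).cuda().half()
x = torch.randn(batch, seq, hidden, device="cuda", dtype=torch.half)

# Warm-up iterations exclude one-time launch/initialization cost.
for _ in range(10):
    norm(x)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    norm(x)
end.record()
torch.cuda.synchronize()
print(f"FusedRMSNorm forward: {start.elapsed_time(end) / 100:.3f} ms/iter")
```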
A bunch of suggestions to remove commented-out lines. You can batch the suggestions into one if you like; see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request#applying-suggested-changes.
What do you think about splitting apex/normalization/fused_layer_norm.py into fused_layer_norm.py and fused_rms_norm.py?
```python
class FusedRMSNormAffineFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, normalized_shape, eps):
        #def forward(ctx, input, weight, bias, normalized_shape, eps):
```

Suggested change: remove the commented-out line `#def forward(ctx, input, weight, bias, normalized_shape, eps):`.

```python
        #)
        output, invvar = fused_layer_norm_cuda.rms_forward_affine(
            input_, ctx.normalized_shape, weight_, ctx.eps)
        #ctx.save_for_backward(input_, weight_, bias_, mean, invvar)
```

Suggested change: remove the commented-out line `#ctx.save_for_backward(input_, weight_, bias_, mean, invvar)`.
```cpp
    at::IntList normalized_shape,
#endif
    at::Tensor* gamma,
    // at::Tensor* beta,
```

Suggested change: remove the commented-out line `// at::Tensor* beta,`.

```cpp
    double epsilon,
    at::Tensor* grad_input,
    at::Tensor* grad_gamma)
    // at::Tensor* grad_beta
```

Suggested change: remove the commented-out line `// at::Tensor* grad_beta`.
```cpp
    using accscalar_t = at::acc_type<scalar_t_in, true>;
    HostRMSNormGradient(
        dout->DATA_PTR<scalar_t_out>(),
        // mean->DATA_PTR<accscalar_t>(),
```

Suggested change: remove the commented-out line `// mean->DATA_PTR<accscalar_t>(),`.

```cpp
        // TMJ pass NULL argument for gamma, beta, grad_gamma and grad_beta
        // if gamma Tensor is NULL on input.
        gamma != NULL ? gamma->DATA_PTR<scalar_t_out>() : NULL,
        // gamma != NULL ? beta->DATA_PTR<scalar_t_out>() : NULL,
```

Suggested change: remove the commented-out line `// gamma != NULL ? beta->DATA_PTR<scalar_t_out>() : NULL,`.

```cpp
        epsilon,
        grad_input->DATA_PTR<scalar_t_in>(),
        gamma != NULL ? grad_gamma->DATA_PTR<scalar_t_out>() : NULL);
        // gamma != NULL ? grad_beta->DATA_PTR<scalar_t_out>() : NULL);
```

Suggested change: remove the commented-out line `// gamma != NULL ? grad_beta->DATA_PTR<scalar_t_out>() : NULL);`.
```python
native = apex.normalization.FusedRMSNorm(
    normalized_shape=normalized_shape, elementwise_affine=elementwise_affine
)
fused = apex.normalization.FusedRMSNorm(
    normalized_shape=normalized_shape, elementwise_affine=elementwise_affine
).cuda()
return native, fused
```
this testing won't do much good as it's comparing to itself :)
Since there isn't a torch.nn.RMSNorm, perhaps write one out in plain Python?
It's a bit opaque here, but the "native" version is computed on CPU, which dispatches to a manual plain-Python version sourced from T5LayerNorm:
apex/apex/normalization/fused_layer_norm.py, line 410 at 028ef04
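For context, that plain-Python fallback follows the T5LayerNorm formulation. A minimal sketch of it (reconstructed from that formulation; the class name is hypothetical and this is not a verbatim quote of the code at that line):

```python
import torch


class ManualRMSNorm(torch.nn.Module):
    """T5LayerNorm-style RMSNorm: no mean subtraction and no bias term."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        # RMS statistic: mean of squares over the normalized dimension.
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * x
```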
Thank you for explaining this nuance, @eqy. I can see it now.
@crcrpar this is now refactored to use the existing FusedLayerNorm implementation via an added …
crcrpar left a comment:
I think this is the last iteration
Oof, nice catch, thanks.
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
```
    y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta

The mean and standard-deviation are calculated separately over the last
certain number dimensions which have to be of the shape specified by
:attr:`normalized_shape`.
:math:`\gamma` and :math:`\beta` are learnable affine transform parameters of
:attr:`normalized_shape` if :attr:`elementwise_affine` is ``True``.
```
this looks like a copy-n-paste error, as this version has no bias and no mean subtraction in the math formula.
I think the note below needs updating as well w.r.t. bias.
oh! I missed that one - thank you!
looks much better now - with only one small issue - commented in the other PR.
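For reference, with the mean subtraction and bias dropped, and epsilon placed inside the root (as the follow-up doc fix NVIDIA#1295 documents), the corrected RMSNorm formula would read along these lines (a sketch of the standard RMSNorm definition, not a quote of the final docstring):

```latex
y = \frac{x}{\sqrt{\mathrm{E}[x^2] + \epsilon}} * \gamma
```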
Squashed commits:
* FusedRMSNorm based on FusedLayerNorm
* refactor duplicated kernels
* delete comments
* delete comments
* cleanup
* cleanup
* cleanup, fixed clobbering forward_affine_mixed_dtypes
* fix pybind naming and add MixedFused test
* undo skipping
* check elementwise_affine
* Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
  (Oof, nice catch, thanks. Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>)

Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
* FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (NVIDIA#1274) * FusedRMSNorm based on FusedLayerNorm * refactor duplicated kernels * delete comments * delete comments * cleanup * cleanup * cleanup, fixed clobbering forward_affine_mixed_dtypes * fix pybind naming and add MixedFused test * undo skipping * check elementwise_affine * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py Oof, nice catch, thanks Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com> * fix and generate docs for FusedRMSNorm (NVIDIA#1285) * [FusedRMSNorm doc] document where epsilon is added (NVIDIA#1295) * [FusedRMSNorm doc] add epsilon to formula * correct * better wording * Fix some bugs * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs * Fix NaN issues in FusedRMSNorm * Update test_fused_layer_norm.py * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize Co-authored-by: eqy <eddiey@nvidia.com> Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
#1271
Pattern-matched implementation of FusedRMSNorm based on FusedLayerNorm. Tests are passing (needed a threshold adjustment for float16); awaiting benchmark results and cleanup.
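For completeness, a minimal usage sketch of the module this PR adds (normalized_shape and elementwise_affine appear in the test code quoted above; the eps value shown is an assumed default):

```python
import torch
from apex.normalization import FusedRMSNorm

# Normalize over the last (hidden) dimension; eps=1e-5 is an assumed default.
rms_norm = FusedRMSNorm(normalized_shape=1024, eps=1e-5, elementwise_affine=True).cuda()

x = torch.randn(8, 128, 1024, device="cuda")
y = rms_norm(x)  # output has the same shape as x, scaled by the learnable gamma
```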