
Implement BatchNormInternal for cuda #8172

Merged
mindest merged 16 commits into master from linmin/bni_cuda
Jul 28, 2021
Conversation

@mindest
Contributor

@mindest mindest commented Jun 28, 2021

Description: Implement BatchNormInternal for cuda

  • type binding: input/output, scale/bias, and mean/var are of separate types; support the fp16 case;
  • add corresponding forward test cases;
  • re-enable the gradient test case.

Motivation and Context

  • The original version of BatchNormalization does not support training in fp16, and has an issue when updating running_mean/_var.

@SherlockNoMad SherlockNoMad added the training issues related to ONNX Runtime training; typically submitted using template label Jul 1, 2021
@mindest mindest marked this pull request as ready for review July 6, 2021 03:23
@mindest mindest requested a review from a team as a code owner July 6, 2021 03:23
@mindest mindest requested a review from Lafi7e July 6, 2021 03:23
std::vector<float> running_mean = {-0.1754f, 0.303106f};
std::vector<float> running_var = {0.7812f, 1.5865f};
std::vector<float> saved_mean = {-0.306f, 0.114562f};
std::vector<float> saved_inv_std = {1.2288f, 0.861317f};
Contributor

Let's rename this to saved_inv_var to reflect the reality, and add a comment that this test data will only work for CUDA and not CPU.

Contributor

If cudnn is actually returning saved_inv_std, would this UT also work for the CPU impl?

Contributor Author
@mindest mindest Jul 22, 2021

It should, but I also infer from the results of the calculation that

  • when calculating saved_inv_std and y, cudnn uses the biased std/var;
  • when calculating running_var, it uses the unbiased std/var.

The CPU implementation, by contrast, always uses the biased one. I am not sure why cudnn has such an inconsistency, or which choice is more reasonable.

This difference makes the running_var output differ between CPU and CUDA given the same input data.

Contributor

I see... thank you for the detailed investigation.
Could you please also document this subtle difference in a kernel and UT comment?

I think the CPU impl is the correct one: in the ONNX spec, we explicitly mention that the variance should be the population variance, aka the biased variance.

When the number of samples is large, the difference between the biased and unbiased variance is small. Let's note this down and move on.

@SherlockNoMad
Contributor

Hi @mindest, thanks a lot for the PR.
It's good to merge after adding the comment about the subtle unbiased-var result from cudnnBatchNorm.

Please ask Vincent for sign-off if I am not online.

SherlockNoMad previously approved these changes Jul 27, 2021
@mindest mindest merged commit a71dab6 into master Jul 28, 2021
@mindest mindest deleted the linmin/bni_cuda branch July 28, 2021 08:04
