
Implement BatchNormInternal for cuda #8172

Merged
mindest merged 16 commits into master from linmin/bni_cuda
Jul 28, 2021
Conversation

@mindest
Contributor

@mindest mindest commented Jun 28, 2021

Description: Implement BatchNormInternal for cuda

  • type binding: input/output, scale/bias, and mean/var are of separate types; support the fp16 case;
  • add corresponding forward test cases;
  • re-enable the gradient test case.

Motivation and Context

  • The original version of BatchNormalization does not support training in fp16, and has an issue when updating running_mean/_var.

@SherlockNoMad SherlockNoMad added the training issues related to ONNX Runtime training; typically submitted using template label Jul 1, 2021
@mindest mindest marked this pull request as ready for review July 6, 2021 03:23
@mindest mindest requested a review from a team as a code owner July 6, 2021 03:23
@mindest mindest requested a review from Lafi7e July 6, 2021 03:23
std::vector<float> running_mean = {-0.1754f, 0.303106f};
std::vector<float> running_var = {0.7812f, 1.5865f};
std::vector<float> saved_mean = {-0.306f, 0.114562f};
std::vector<float> saved_inv_std = {1.2288f, 0.861317f};
Contributor

Let's rename this to saved_inv_var to reflect the reality, and add a comment that this test data will only work for CUDA and not CPU.

Contributor

If cudnn is actually returning saved_inv_std, would this UT also work for the CPU impl?

Contributor Author
@mindest mindest Jul 22, 2021

It should, but I also infer from the results of the calculation that

  • when calculating saved_inv_std and y, cudnn uses the biased std/var;
  • when calculating running_var, it uses the unbiased std/var.

The CPU implementation, by contrast, always uses the biased one. I am not sure why cudnn has such an inconsistency, or which choice is more reasonable.

This difference makes the running_var output differ between CPU and CUDA given the same input data.

Contributor

I see... thank you for the detailed investigation.
Could you please also document this subtle difference in a kernel and UT comment?

I think the CPU impl is the correct one: in the ONNX spec, we explicitly mention that the variance should be the population variance, aka the biased variance.

When the number of samples is large, the difference between the biased and unbiased variance is small. Let's note this down and move on.

@SherlockNoMad
Contributor

Hi @mindest, thanks a lot for the PR.
It's good to merge after adding the comment about the subtle unbiased-var result from cudnnBatchNorm.

Please ask Vincent for sign-off if I am not online.

SherlockNoMad previously approved these changes Jul 27, 2021
@mindest mindest merged commit a71dab6 into master Jul 28, 2021
@mindest mindest deleted the linmin/bni_cuda branch July 28, 2021 08:04
