Implement BatchNormGradient kernel for CPU EP (#7622)
Conversation
Hmm, actually I'm seeing occasional failures for this CPU implementation of BNGrad as well when using a random seed for the test cases. The mismatches seem close enough that they're likely just an artifact of FP error; is it safe to bump the tolerance to 0.02 or so?
const Tensor* X = ctx->Input<Tensor>(1);
const Tensor* Scale = ctx->Input<Tensor>(2);
const Tensor* saved_mean = ctx->Input<Tensor>(3);
const Tensor* saved_inv_variance = ctx->Input<Tensor>(4);
Just noticed a minor thing in BatchNormInternal's schema:
.Output(4, "saved_inv_std", "Inverse standard deviation for the batch", "U", OpSchema::Optional, true, 1, OpSchema::NonDifferentiable)
Output 4's name should actually be saved_inv_variance.
We output "1/std_dev" though. Isn't "saved_inv_std" the correct name for that (where "inverse" here denotes the multiplicative inverse, i.e. the reciprocal)? Although I do see that for some reason almost all other implementations of BN grad call it "inv_var" even when it's defined as "1/sqrt(var)"; I'm not sure why this is.
SherlockNoMad left a comment:
LGTM. Should be good to merge when the comments are addressed.
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed,orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using the /azp run [pipelines] command. You can specify multiple pipelines using a comma-separated list.

Azure Pipelines successfully started running 10 pipeline(s).

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed,orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule

/azp run orttraining-ortmodule-distributed

Azure Pipelines successfully started running 1 pipeline(s).

/azp run ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 8 pipeline(s).

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 8 pipeline(s).
// exclude CUDA and ROCm Execution Provider due to different calculation method of `running_var`
// exclude TRT and OpenVINO for same reasons as seen in TestBatchNorm()
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kCudaExecutionProvider, kRocmExecutionProvider, kTensorrtExecutionProvider, kOpenVINOExecutionProvider});
So it only tests on the CPU EP, right?
If so, would doing this be clearer?
std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
execution_providers.emplace_back(DefaultCpuExecutionProvider());
// ...
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {},
         nullptr,
         &execution_providers);
kMSDomain,
1,
kCpuExecutionProvider,
KernelDefBuilder().Alias(3, 1).Alias(4, 2).TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
Consider registering T1 and T2 as well.
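If the schema declares separate T1/T2 type parameters for the scale/bias and statistics inputs, the registration might look like the sketch below. The constraint names "T1" and "T2" are assumptions here; the actual names depend on the BatchNormInternal schema definition.

```cpp
// Sketch only: assumes the schema defines type parameters "T1" and "T2"
// in addition to "T"; all pinned to float for the CPU kernel.
KernelDefBuilder()
    .Alias(3, 1)
    .Alias(4, 2)
    .TypeConstraint("T", DataTypeImpl::GetTensorType<float>())
    .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
    .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
```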
#include "core/util/math_cpuonly.h"
#include "core/providers/common.h"
#include "core/framework/op_kernel_context_internal.h"
#include "core/common/safeint.h"
Consider removing these unused includes.
const TensorShape X_shape = X->Shape();
const TensorShape channel_shape = saved_mean->Shape();

// no B here, but B has same size as Scale, so can validate inputs for gradient with this substitute
This looks like a workaround; maybe we can just check whether the B pointer is null inside ValidateInputs?
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 8 pipeline(s).

1 similar comment

Azure Pipelines successfully started running 8 pipeline(s).
Thanks! @pengwa @baijumeswani

@pranav-prakash thanks for your contribution!
Description: Register an implementation for BatchNormInternal and add a CPU kernel for BatchNormGradient. This is the third in a series of PRs to implement BN training on CPU (the first was #6946, the second #7539).
Motivation and Context
Support training networks that use BatchNorm (e.g. convnets). Also note that a CUDA kernel for BN (forward training and backward) exists but is currently disabled due to flaky failures; someone more familiar with those parts can register the implementation for BNInternal on CUDA (the gradient kernel doesn't have to change).