
Implement BatchNormGradient kernel for CPU EP#7622

Merged
mindest merged 22 commits into microsoft:main from pranav-prakash:batchnorm_grad_impl
Apr 8, 2023
Conversation

@pranav-prakash
Contributor

@pranav-prakash pranav-prakash commented May 8, 2021

Description: Register an implementation for BatchNormInternal and add a CPU kernel for BatchNormGradient. This is the third in a series of PRs to implement BN training on CPU (first was #6946, second was #7539).

Motivation and Context
Support training networks that use BatchNorm (e.g. convnets). Also note that a CUDA kernel for BN (forward training & backward) exists, but it is currently disabled due to flaky failures; someone more familiar with those parts can register the BNInternal implementation on CUDA (the gradient kernel doesn't have to change).

@pranav-prakash pranav-prakash force-pushed the batchnorm_grad_impl branch from abaf1fd to 1824431 Compare May 8, 2021 00:23
@pranav-prakash pranav-prakash force-pushed the batchnorm_grad_impl branch from 1824431 to eb71531 Compare May 8, 2021 00:36
@pranav-prakash
Contributor Author

Hmm, actually I'm seeing occasional failures for this CPU implementation of BNGrad as well. With random seed 181700829, for instance, you get

max_error: 0.011094339191913605; tolerance: 0.0099999997764825821;

for the batch_size (N) = 1 case and the case where epsilon is not explicitly provided (so the default value should be used).

This seems close enough that it is likely just an artifact of floating-point error; is it safe to bump the tolerance to 0.02 or so?
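Floating-point summation order alone can shift a reduction by more than a tight tolerance, which is why test comparisons usually allow some slack. A minimal sketch of such a check (illustrative only, not the actual test harness; `WithinTolerance` is a hypothetical helper):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical tolerance check of the kind a kernel test might use:
// scale the tolerance by max(1, |expected|) to blend absolute and
// relative error for both small and large magnitudes.
bool WithinTolerance(float actual, float expected, float tol) {
  return std::fabs(actual - expected) <= tol * std::max(1.0f, std::fabs(expected));
}
```

With the numbers above, a max error of ~0.011 against an expected value of order 1 fails a 0.01 tolerance but passes 0.02.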

@SherlockNoMad SherlockNoMad added the training (issues related to ONNX Runtime training; typically submitted using template) and external_pr labels May 10, 2021
@pranav-prakash pranav-prakash marked this pull request as ready for review May 10, 2021 22:19
@pranav-prakash pranav-prakash requested a review from a team as a code owner May 10, 2021 22:19
const Tensor* X = ctx->Input<Tensor>(1);
const Tensor* Scale = ctx->Input<Tensor>(2);
const Tensor* saved_mean = ctx->Input<Tensor>(3);
const Tensor* saved_inv_variance = ctx->Input<Tensor>(4);
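These inputs (X, Scale, saved mean, saved inverse std) are exactly what the training-mode gradient math consumes. A standalone per-channel sketch of that math (illustrative only; names and layout are hypothetical, not the actual ORT kernel):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the per-channel, training-mode BatchNorm gradient:
//   xhat  = (X - saved_mean) * saved_inv_std
//   dB    = sum(dY)
//   dScale = sum(dY * xhat)
//   dX    = scale * saved_inv_std * (dY - dB/n - xhat * dScale/n)
void BatchNormGradPerChannel(const std::vector<float>& dY,
                             const std::vector<float>& X,
                             float scale, float saved_mean, float saved_inv_std,
                             std::vector<float>& dX, float& dScale, float& dB) {
  const float n = static_cast<float>(X.size());
  dX.assign(X.size(), 0.0f);
  dScale = 0.0f;
  dB = 0.0f;
  for (std::size_t i = 0; i < X.size(); ++i) {
    const float xhat = (X[i] - saved_mean) * saved_inv_std;  // normalized input
    dScale += dY[i] * xhat;  // dL/dScale = sum(dY * xhat)
    dB += dY[i];             // dL/dB = sum(dY)
  }
  for (std::size_t i = 0; i < X.size(); ++i) {
    const float xhat = (X[i] - saved_mean) * saved_inv_std;
    // Training-mode dX: remove the mean gradient and the component along xhat.
    dX[i] = scale * saved_inv_std * (dY[i] - dB / n - xhat * dScale / n);
  }
}
```

A useful property for sanity checks: when saved_mean is the actual batch mean, the dX values sum to (approximately) zero.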
Contributor

Just noticed a minor thing in BatchNormInternal's schema:
.Output(4, "saved_inv_std", "Inverse standard deviation for the batch", "U", OpSchema::Optional, true, 1, OpSchema::NonDifferentiable)

Output 4's name should actually be save_inv_variance.

Contributor Author

We output "1/std_dev" though. Isn't "saved_inv_std" the correct name for that (where "inverse" denotes the multiplicative inverse, i.e. the reciprocal)? Although I do see that, for some reason, almost all other implementations of BNGrad call it "inv_var" even when it is defined as "1/sqrt(var)"; I'm not sure why.
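To make the naming debate concrete: the saved statistic is the reciprocal of the standard deviation, 1/sqrt(var + epsilon), not the reciprocal of the variance. A small standalone sketch (illustrative; `SavedInvStd` is a hypothetical helper, not ORT code):

```cpp
#include <cmath>
#include <vector>

// Computes 1 / sqrt(var + epsilon) from raw inputs: the quantity BN saves,
// whatever it ends up being called in the schema.
float SavedInvStd(const std::vector<float>& x, float epsilon) {
  float mean = 0.0f;
  for (float v : x) mean += v;
  mean /= static_cast<float>(x.size());
  float var = 0.0f;
  for (float v : x) var += (v - mean) * (v - mean);
  var /= static_cast<float>(x.size());  // biased (population) variance, as in BN training
  return 1.0f / std::sqrt(var + epsilon);
}
```

For x = {1, 2, 3, 4} the variance is 1.25, so the saved value is 1/sqrt(1.25) ≈ 0.894, which differs clearly from 1/var = 0.8.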

SherlockNoMad
SherlockNoMad previously approved these changes May 10, 2021
Contributor

@SherlockNoMad SherlockNoMad left a comment

LGTM.
Should be good to merge when the comments are addressed.

@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed,orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed

@azure-pipelines

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@azure-pipelines

Azure Pipelines successfully started running 10 pipeline(s).

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed,orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule

@SherlockNoMad
Contributor

/azp run orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mindest
Contributor

mindest commented Oct 13, 2021

/azp run ONNX Runtime Web CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@stale stale bot added the stale issues that have not been addressed in a while; categorized by a bot label Apr 16, 2022
@stale stale bot removed the stale issues that have not been addressed in a while; categorized by a bot label Apr 7, 2023
@mindest
Contributor

mindest commented Apr 7, 2023

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@mindest
Contributor

mindest commented Apr 7, 2023

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@mindest mindest requested review from baijumeswani and pengwa and removed request for SherlockNoMad and guoyu-wang April 7, 2023 08:49

// exclude CUDA and ROCm Execution Provider due to different calculation method of `running_var`
// exclude TRT and OpenVINO for same reasons as seen in TestBatchNorm()
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kCudaExecutionProvider, kRocmExecutionProvider, kTensorrtExecutionProvider, kOpenVINOExecutionProvider});
Contributor


So it only tests on the CPU EP, right?

If so, would doing this be clearer?

std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
execution_providers.emplace_back(DefaultCpuExecutionProvider());
...
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {},
         nullptr,
         &execution_providers);

kMSDomain,
1,
kCpuExecutionProvider,
KernelDefBuilder().Alias(3, 1).Alias(4, 2).TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
Contributor


Consider registering T1 and T2 as well.

#include "core/util/math_cpuonly.h"
#include "core/providers/common.h"
#include "core/framework/op_kernel_context_internal.h"
#include "core/common/safeint.h"
Contributor


Consider removing these unused includes.

const TensorShape X_shape = X->Shape();
const TensorShape channel_shape = saved_mean->Shape();

// no B here, but B has same size as Scale, so can validate inputs for gradient with this substitute
Contributor


This looks like a workaround; maybe we can just check whether pointer B is null in ValidateInputs?
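The suggestion above can be sketched as follows: a simplified, hypothetical ValidateInputs that treats a null B as "skip the bias check" instead of requiring a substitute tensor (the real ORT signature and TensorShape type differ; this only shows the shape of the idea):

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for a tensor shape; the real code uses TensorShape.
using Shape = std::vector<int64_t>;

// Hypothetical variant of ValidateInputs: B may be null (as in the gradient
// path), in which case its shape check is simply skipped.
bool ValidateInputs(const Shape& scale, const Shape* b, const Shape& mean) {
  if (scale != mean) return false;                 // per-channel stats must match Scale
  if (b != nullptr && *b != scale) return false;   // only check B when present
  return true;
}
```

This keeps the gradient call site honest (it passes nullptr) rather than reusing Scale as a stand-in for B.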

@mindest
Contributor

mindest commented Apr 7, 2023

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

@mindest
Contributor

mindest commented Apr 7, 2023

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

1 similar comment
@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

Contributor

@pengwa pengwa left a comment

:shipit:

@mindest
Contributor

mindest commented Apr 8, 2023

Thanks! @pengwa @baijumeswani

@mindest mindest merged commit 3c5d02a into microsoft:main Apr 8, 2023
@mindest
Contributor

mindest commented Apr 8, 2023

@pranav-prakash thanks for your contribution!


Labels

training issues related to ONNX Runtime training; typically submitted using template

7 participants