Implement BatchNormGradient kernel for CPU EP (#7622)
Conversation
Hmm, actually I'm seeing occasional failures for this CPU implementation of BNGrad as well when using a random seed for the test cases. The mismatches seem close enough that they're likely just an artifact of FP error; is it safe to bump the tolerance to 0.02 or so?
const Tensor* X = ctx->Input<Tensor>(1);
const Tensor* Scale = ctx->Input<Tensor>(2);
const Tensor* saved_mean = ctx->Input<Tensor>(3);
const Tensor* saved_inv_variance = ctx->Input<Tensor>(4);
Just noticed a minor thing in BatchNormInternal's schema:
.Output(4, "saved_inv_std", "Inverse standard deviation for the batch", "U", OpSchema::Optional, true, 1, OpSchema::NonDifferentiable)
Output 4's name should actually be saved_inv_variance.
We output "1/std_dev" though. Isn't "saved_inv_std" the correct name for that (where "inverse" here denotes the multiplicative inverse, i.e. the reciprocal)? Although I do see that for some reason almost all other implementations of BN grad call it "inv_var" even when it's defined as "1/sqrt(var)"; I'm not sure why this is.
SherlockNoMad left a comment:
LGTM. Should be good to merge when the comments are addressed.
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline, Windows WebAssembly CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed,orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using the /azp run [pipelines] command. You can specify multiple pipelines using a comma-separated list.

Azure Pipelines successfully started running 10 pipeline(s).

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed,orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule

/azp run orttraining-ortmodule-distributed

Azure Pipelines successfully started running 1 pipeline(s).

/azp run ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 8 pipeline(s).

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 8 pipeline(s).
// exclude CUDA and ROCm Execution Provider due to different calculation method of `running_var`
// exclude TRT and OpenVINO for same reasons as seen in TestBatchNorm()
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {kCudaExecutionProvider, kRocmExecutionProvider, kTensorrtExecutionProvider, kOpenVINOExecutionProvider});
So it only tests on the CPU EP, right?
If so, would doing this be clearer?
std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
execution_providers.emplace_back(DefaultCpuExecutionProvider());
// ...
test.Run(OpTester::ExpectResult::kExpectSuccess, "", {},
         nullptr,
         &execution_providers);
kMSDomain,
1,
kCpuExecutionProvider,
KernelDefBuilder().Alias(3, 1).Alias(4, 2).TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
Consider registering T1 and T2 as well.
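If the schema declares separate T1/T2 type parameters for the scale/bias and statistics inputs, the registration might look like the sketch below. The constraint names "T1" and "T2" are assumptions here; the actual names depend on the BatchNormInternal schema definition.

```cpp
// Sketch only: assumes the schema defines type parameters "T1" and "T2"
// in addition to "T"; all pinned to float for the CPU kernel.
KernelDefBuilder()
    .Alias(3, 1)
    .Alias(4, 2)
    .TypeConstraint("T", DataTypeImpl::GetTensorType<float>())
    .TypeConstraint("T1", DataTypeImpl::GetTensorType<float>())
    .TypeConstraint("T2", DataTypeImpl::GetTensorType<float>()),
```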
#include "core/util/math_cpuonly.h"
#include "core/providers/common.h"
#include "core/framework/op_kernel_context_internal.h"
#include "core/common/safeint.h"
Consider removing these unused includes.
const TensorShape X_shape = X->Shape();
const TensorShape channel_shape = saved_mean->Shape();

// no B here, but B has same size as Scale, so can validate inputs for gradient with this substitute
This looks like a workaround; maybe we can just check whether the B pointer is null inside ValidateInputs?
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline

/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 8 pipeline(s).

1 similar comment

Azure Pipelines successfully started running 8 pipeline(s).
Thanks! @pengwa @baijumeswani

@pranav-prakash thanks for your contribution!
Description: Register an implementation for BatchNormInternal and add a CPU kernel for BatchNormGradient. This is the third in a series of PRs to implement BN training on CPU (the first was #6946, the second #7539).
Motivation and Context
Support training networks that use BatchNorm (e.g. convnets). Also note that a CUDA kernel for BN (forward training and backward) exists but is currently disabled due to flaky failures; someone more familiar with those parts can register the implementation for BNInternal on CUDA (the gradient kernel doesn't have to change).