
Fix batch norm training op on CPU #6946

Merged
SherlockNoMad merged 6 commits into microsoft:master from pranav-prakash:master
May 1, 2021

Conversation

@pranav-prakash
Contributor

@pranav-prakash pranav-prakash commented Mar 9, 2021

Description: Add support for training-forward-mode of BatchNorm on CPU EP, and implement BN Gradient on CPU EP

Fixes the issue described in #6087

Motivation and Context

As mentioned in the above issue, currently batch norm is not implemented for training. This PR makes the following changes:

  • Blacklist the mean/variance tensors from having their gradients calculated. These are calculated directly from the batch during training and are not updated via backprop
  • The ONNX spec for BatchNorm currently doesn't provide an attribute to distinguish between training and inference mode (batch norm behaves differently in the two cases); see BatchNorm in OP_SET version 7 has no mode attribute onnx/onnx#1042. While we can assume that the presence of the optional outputs indicates training, it's technically valid for a model serialized for inference to contain them as well. The CUDA kernel nonetheless uses this as an indicator of training, and so do we.
  • The CPU implementation of the above was cribbed from caffe2 (the existing inference-only implementation also appears to come from there). They do something unusual: instead of outputting saved_variance they output inv_std_dev. Apparently this is for ease of interoperability with cuDNN, but it breaks the ONNX spec. Nonetheless, I've chosen to do the same, because the existing CUDA kernel for BatchNormGrad also relies on it actually being inv_std_dev.
  • Added a CPU implementation for BatchNormGrad. Again, to match the CUDA version's behavior, we too break the spec and assume that saved_var is actually inv_std_dev.

As a side note, I wonder whether the flaky CUDA tests for batch norm can also be resolved by d01006f
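The training-mode forward described in the bullets above can be sketched in plain C++ (an illustrative, stand-alone sketch, not the actual ORT kernel; the function and variable names here are mine). Note how saved_var is filled with inv_std_dev rather than the variance, matching the spec-breaking caffe2/cuDNN convention being discussed:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative sketch of BatchNorm training-mode forward over NCHW data.
// saved_var receives 1/sqrt(var + eps) (inv_std_dev), not the variance,
// mirroring the behavior described above rather than the ONNX spec.
void BatchNormTrainSketch(const std::vector<float>& x, int N, int C, int HW,
                          float momentum, float eps,
                          std::vector<float>& running_mean,
                          std::vector<float>& running_var,
                          std::vector<float>& saved_mean,
                          std::vector<float>& saved_var,  // actually inv_std_dev
                          std::vector<float>& y) {
  const int count = N * HW;
  for (int c = 0; c < C; ++c) {
    // Per-channel batch mean and (biased) variance.
    double sum = 0.0, sq_sum = 0.0;
    for (int n = 0; n < N; ++n)
      for (int i = 0; i < HW; ++i) {
        float v = x[(n * C + c) * HW + i];
        sum += v;
        sq_sum += double(v) * v;
      }
    float mean = float(sum / count);
    float var = float(sq_sum / count - double(mean) * mean);
    float inv_std = 1.0f / std::sqrt(var + eps);

    saved_mean[c] = mean;
    saved_var[c] = inv_std;  // spec-breaking: inv_std_dev, not saved_variance
    running_mean[c] = momentum * running_mean[c] + (1.f - momentum) * mean;
    running_var[c] = momentum * running_var[c] + (1.f - momentum) * var;

    for (int n = 0; n < N; ++n)
      for (int i = 0; i < HW; ++i) {
        int idx = (n * C + c) * HW + i;
        y[idx] = (x[idx] - mean) * inv_std;  // scale/bias omitted for brevity
      }
  }
}
```

The scale/bias application and epsilon handling are simplified; the point is only the statistics bookkeeping.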

@pranav-prakash pranav-prakash requested a review from a team as a code owner March 9, 2021 00:17
@neginraoof
Contributor

Hey @pranav-prakash
Can you also review this PR to fix onnx spec? onnx/onnx#3333

@pranav-prakash
Contributor Author

pranav-prakash commented Mar 23, 2021

@neginraoof I'm not a msft employee, so I don't have any more power than you do with regard to code review. Or did you want me to just look it over and join the discussion? If so, I've left my comments.

@snnn snnn added the training issues related to ONNX Runtime training; typically submitted using template label Mar 25, 2021
@pranav-prakash
Contributor Author

pranav-prakash commented Apr 2, 2021

Given that the training mode spec for BN is only fully fleshed out for opset 14, do we still need to support the case of training for non-spatial BN (which could happen with opset < 14)? Removing this codepath would simplify things greatly (we could still maintain correctness by guarding training on opset 14 with ORT_ENFORCE).

@SherlockNoMad
Contributor

Given that the training mode spec for BN is only fully fleshed out for opset 14, do we still need to support the case of training for non-spatial BN (which could happen with opset < 14)? Removing this codepath would simplify things greatly (we could still maintain correctness by guarding training on opset 14 with ORT_ENFORCE).

Yes. Since BN training mode was only officially added in opset 14, let's drop the burden of supporting non-spatial mode for training. :)

@SherlockNoMad
Contributor

Hi @pranav-prakash, thanks a lot for your contribution.
Are you planning to update the PR to deprecate support for non-spatial mode in training? I can help with reviewing and merging the PR.

@SherlockNoMad
Contributor

Hi @pranav-prakash, I am also wondering what your use scenario for ORT training is. Which company/product are you working on?

@pranav-prakash
Contributor Author

pranav-prakash commented Apr 5, 2021

@SherlockNoMad

update the PR to deprecate the support for non-spatial mode in training

Yes, I started the PR for this, but then saw that the spec was also updated to remove the saved_mean and saved_var outputs. With those two removed, it seems you would have to recompute the mean/var in the gradient op. I had asked about the motivation for this change on the associated PR, and it seems the intent was to allow "backends [to] transform the graph/node into a custom-op of their choice (for backward-propagation)."

In terms of ORT, though, does this mean we would define our own variant of the BatchNorm schema that includes outputs for saved_mean and saved_inv_std and use a graph-transform pass to rewrite nodes to it? If not, is there another way to avoid recomputing batch_mean/batch_var in the backward pass?

I am also wondering what's your use scenario of ORT training? Which company/product are you working on?

I'm associated with UC Berkeley's architecture research group, and we're working on an ORT EP for our RISC-V ML accelerator (Gemmini). In terms of use cases, at the moment we're primarily interested in training convolutional neural networks (in both bfloat16 and fp32), e.g. ResNet or MobileNet.

In terms of the exposed ORT training APIs we use: because running Python would be too much overhead (we're targeting edge devices), we call directly into the underlying C++ classes (TrainingRunner, training_session) rather than the Python bindings. (Although this path doesn't seem to be as fully documented or supported as the inference APIs.)

@SherlockNoMad
Contributor

Thanks a lot for the detailed introduction! We are really happy to see external contributions to ORT Training!

As for the BatchNorm problem, the plan is to write a custom op (say, BatchNormInternal) that outputs the saved_mean and saved_inv_std. We will substitute BatchNorm with BatchNormInternal before building the training graph. The BatchNormGrad op can then still assume that O(3) and O(4) are present, to speed up the computation.

The rationale behind the ONNX spec update is that saved_inv_std is an internal implementation detail (other frameworks may use saved_inv_var instead), so it's better to leave it out of the spec.

@SherlockNoMad
Contributor

Just curious, how far have you gotten with training convolutional neural networks, e.g. ResNet or MobileNet? Are they working yet? We are also exploring federated learning on mobile/edge devices. It would be nice to collaborate if our plans align well.

@pranav-prakash
Contributor Author

pranav-prakash commented Apr 6, 2021

@SherlockNoMad

the plan is to write a custom op (say, BatchNormInternal) that outputs the saved_mean and save_inv_std

I see – so for this PR, would you like me to create the schema for such a BatchNormTrainingInternal and move the training-mode calculations there? Or did you still want an implementation for training_mode = true in the BatchNorm kernel just for completeness (albeit one that will never be used if the op gets replaced with BatchNormTrainingInternal before training)?

how far have your reached for training convolutional neural networks

We just recently got a trainer for ResNet-50 working, although we haven't yet verified end-to-end correctness, since FPGA simulation of Gemmini is much too slow for training ResNet from scratch (we could likely compare results from a few dozen iterations against the CPU EP, though).

We are also exploring federated learning on mobile/edge devices. It would be nice to collaborate if our plans align well.

Yeah, federated learning and training at the edge are exactly the scenarios we envisioned Gemmini being a good fit for. We'd love to discuss our roadmap and see whether there's potential for collaboration here; feel free to email us at {pranavprakash,hngenc}@<university>.edu.

@SherlockNoMad
Contributor

BatchNormTrainingInternal and BatchNorm can share the same kernel. Most of the code should be the same, except for the handling of outputs 3 and 4. In the kernel, we can check whether ctx->Output(3, shape) and ctx->Output(4, shape) return nullptr.

If you have bandwidth, you can add the BatchNormalizationTraining schema to training_op_defs.cc.

Also, a sample of the replacement code can be found in concat_replacement.cc, which replaces Concat with ConcatTraining.
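The kernel-sharing idea can be illustrated in plain C++ (a sketch only, not the ORT API: nullable pointers stand in for ctx->Output(3, shape) / ctx->Output(4, shape) returning nullptr in the real kernel, and all names here are mine):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: one function serves both the inference and training variants.
// The optional training outputs are nullable; they are computed only when
// the caller (the training variant) actually wires them up.
void SharedBatchNormKernelSketch(const std::vector<float>& x, float eps,
                                 std::vector<float>* saved_mean,       // may be null
                                 std::vector<float>* saved_inv_std) {  // may be null
  double sum = 0.0, sq_sum = 0.0;
  for (float v : x) { sum += v; sq_sum += double(v) * v; }
  float mean = float(sum / x.size());
  float var = float(sq_sum / x.size() - double(mean) * mean);
  // Inference-only callers pass null, so no extra work happens for them.
  if (saved_mean) saved_mean->assign(1, mean);
  if (saved_inv_std) saved_inv_std->assign(1, 1.0f / std::sqrt(var + eps));
}
```

In the real kernel the same branch would guard the computation of outputs 3 and 4, so one implementation covers both ops.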

@SherlockNoMad
Contributor

@pranav-prakash Great to know that you got ResNet-50 working!
Actually, I am a bit surprised that you didn't run into more missing gradients.

AFAIK, gradients are missing for the LRN, Sum, and GlobalMaxPool ops, and GlobalAveragePool's gradient builder may also need a fix.

@pranav-prakash
Contributor Author

pranav-prakash commented Apr 6, 2021

@SherlockNoMad

BatchNormTrainingInternal and BatchNorm can share the same kernel

Sounds good, I'll update the PR accordingly. Not sure I'll have the bandwidth to add the BatchNormalizationTraining op as well, but that should be an easy follow-up PR.

bit surprised that you didn't find too many missing gradients

We converted Sum into Add before training; IIRC both the ResNet-50 structure from the model zoo and the exported PyTorch model use MaxPool instead of GlobalMaxPool. The PyTorch model does use GlobalAveragePool, but we didn't run into any issues with its gradient builder.

@pranav-prakash pranav-prakash force-pushed the master branch 3 times, most recently from 2f78c94 to a139854 Compare April 6, 2021 23:46
@pranav-prakash
Contributor Author

@SherlockNoMad
I've updated the BN kernel to support the opset-14 case. I don't think the onnx submodule has been bumped to the latest commit yet, though, so that will need to happen before this can be merged.

Since the batch norm grad cannot be implemented until the schema and graph transform for the internal BatchNormalizationTraining are added, I've reverted that part for now. For reference, those files can be found at
https://github.com/microsoft/onnxruntime/blob/8892ee4b6d343109699ab292e66c2c7a5e41925a/orttraining/orttraining/training_ops/cpu/nn/batch_norm_grad.h
https://github.com/microsoft/onnxruntime/blob/8892ee4b6d343109699ab292e66c2c7a5e41925a/orttraining/orttraining/training_ops/cpu/nn/batch_norm_grad.cc
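For reference, the spatial BatchNorm gradient can be sketched per channel as follows (an illustrative C++ sketch under the inv_std_dev convention discussed earlier, not the code in those files; names are mine and scale/bias handling is simplified):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of spatial BatchNormGrad for one channel with m elements.
// Inputs include saved_mean and saved inv_std_dev (NOT saved_variance),
// so nothing has to be recomputed from the batch in the backward pass.
void BatchNormGradSketch(const std::vector<float>& dy,
                         const std::vector<float>& x,
                         float scale, float mean, float inv_std,
                         std::vector<float>& dx,
                         float& dscale, float& dbias) {
  const int m = int(x.size());
  dscale = 0.f;
  dbias = 0.f;
  for (int i = 0; i < m; ++i) {
    float x_hat = (x[i] - mean) * inv_std;  // normalized input
    dscale += dy[i] * x_hat;                // dL/dscale = sum(dy * x_hat)
    dbias += dy[i];                         // dL/dbias  = sum(dy)
  }
  dx.resize(m);
  for (int i = 0; i < m; ++i) {
    float x_hat = (x[i] - mean) * inv_std;
    // Standard BN backward; every term reuses the saved inv_std_dev.
    dx[i] = scale * inv_std * (dy[i] - dbias / m - x_hat * dscale / m);
  }
}
```

This is why saving inv_std_dev (or at least the batch statistics) matters: without outputs 3 and 4, the backward op would have to recompute mean and variance from x.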

@SherlockNoMad
Contributor

#7177 is in progress.

@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@pranav-prakash
Contributor Author

@SherlockNoMad
Now that the onnx submodule has been bumped to opset 14, I've updated this PR to enable the opset-14 BN. Also fixed the previous CI failure.

@SherlockNoMad
Contributor

see line 44 in batch_norm_op_test.cc

  std::unordered_set<std::string> excluded_eps = {kTensorrtExecutionProvider};
  if (spatial_mode == 0) {
    excluded_eps.insert(kOpenVINOExecutionProvider);
  }

I think it's correct to add them to excluded_eps.

SherlockNoMad
SherlockNoMad previously approved these changes Apr 30, 2021
Contributor

@SherlockNoMad SherlockNoMad left a comment


Thanks a lot for this PR.
I think it's ready to be merged, after the TRT and OpenVINO fixes.

@pranav-prakash
Contributor Author

@SherlockNoMad fixed.

One other question I had was about the old CUDA test ForwardTrainingTestWithSavedOutputsOpset9. The old test had

  test.AddOutput<float>("running_mean", channel_dims, {-0.1754f, 0.303106f});
  test.AddOutput<float>("saved_mean", channel_dims, {-0.306f, 0.115f});

That is, it had running_mean equal to saved_mean, when according to the formula it should have been momentum*mean + saved_mean*(1-momentum). Was the old test incorrect, or did I miss something obvious?
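The update in question can be checked numerically (a tiny sketch of the interpolation formula quoted above; the helper name is mine, and `running_mean` plays the role of `mean` in that formula):

```cpp
#include <cassert>
#include <cmath>

// running_mean update as quoted above: momentum*mean + saved_mean*(1-momentum),
// where mean is the incoming running mean and saved_mean the current batch mean.
float UpdateRunningMean(float running_mean, float saved_mean, float momentum) {
  return momentum * running_mean + (1.f - momentum) * saved_mean;
}
```

With the test's saved_mean of -0.306 and a fresh running mean of 0, any momentum other than 0 gives a running_mean different from saved_mean, so the two outputs should not have been equal.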

@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@SherlockNoMad
Contributor

I think the old test case was incorrect.

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

SherlockNoMad
SherlockNoMad previously approved these changes Apr 30, 2021
@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@SherlockNoMad
Contributor

/azp run orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@SherlockNoMad
Contributor

/azp run orttraining-amd-gpu-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@SherlockNoMad SherlockNoMad merged commit 8ba6ed9 into microsoft:master May 1, 2021
mindest added a commit that referenced this pull request Apr 8, 2023
**Description**: Register an implementation for BatchNormInternal and
add a CPU kernel for BatchNormGradient. This is the third in a series of
PRs to implement BN training on CPU (first was #6946, second was #7539).

**Motivation and Context**
Support training networks with BatchNorm (e.g. convnets). Also note that
there exists a CUDA kernel for BN (forward training & backwards) but
it's currently disabled due to flaky failures; someone more familiar
with those parts can register the implementation for BNInternal on CUDA
(gradient kernel doesn't have to change).

---------

Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu>
Co-authored-by: mindest <linminuser@gmail.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>