
Add pre-training transform to convert BatchNorm to BatchNormInternal #7539

Merged
SherlockNoMad merged 6 commits into microsoft:master from pranav-prakash:batchnorm_grad
May 10, 2021
Conversation

@pranav-prakash
Contributor

@pranav-prakash pranav-prakash commented May 1, 2021

Description: As opset 14 lacks the saved_mean/saved_inv_std outputs, we have to convert BatchNorm nodes to a custom op to ensure good performance on the backward pass. This is the second in a series of PRs to implement BN training on CPU (the first was #6946).

As part of this we also remove the existing AdjustBatchNormOutputs pass – this only works for opset 9 BN and is subsumed by the more general BatchNormReplacement we define in this PR.

WIP – I think the only thing missing is a unit test. I'd also appreciate any feedback on whether it's necessary to copy/paste the entire schema-def for the BatchNormInternal op or whether something like protobuf's toBuilder exists for these schema defs.
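The replacement pass described above can be sketched in miniature. This is an illustrative Python model of the idea, not ONNX Runtime's actual C++ transform; the node representation and the extra output names (`_saved_mean`, `_saved_inv_std`) are placeholders for the training outputs that opset 14 BatchNormalization no longer exposes.

```python
# Hypothetical sketch of a BatchNorm -> BatchNormInternal rewrite pass.
# Nodes are plain dicts standing in for ONNX NodeProto.

def replace_batchnorm(nodes):
    """Rewrite BatchNormalization nodes into com.microsoft BatchNormInternal."""
    for node in nodes:
        if node["op_type"] == "BatchNormalization":
            node["op_type"] = "BatchNormInternal"
            node["domain"] = "com.microsoft"
            # Placeholder names for the extra outputs the gradient builder
            # needs (saved mean / saved inverse std).
            base = node["outputs"][0]
            node["outputs"] += [base + "_saved_mean", base + "_saved_inv_std"]
    return nodes

graph = [{"op_type": "BatchNormalization", "domain": "",
          "inputs": ["X", "scale", "B", "mean", "var"], "outputs": ["Y"]}]
graph = replace_batchnorm(graph)
print(graph[0]["op_type"], len(graph[0]["outputs"]))  # BatchNormInternal 3
```

The real pass additionally has to register (or copy) a full op schema for BatchNormInternal, which is what the schema-def question above is about.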

@SherlockNoMad SherlockNoMad requested a review from mindest May 3, 2021 20:30
@SherlockNoMad SherlockNoMad added the training issues related to ONNX Runtime training; typically submitted using template label May 3, 2021
@SherlockNoMad
Contributor

Hi @pranav-prakash, thanks for contributing this!

Let us know when this is ready for review. I will help you merge this change.

@pranav-prakash pranav-prakash changed the title [WIP] Add pre-training transform to convert BatchNorm to BatchNormInternal Add pre-training transform to convert BatchNorm to BatchNormInternal May 3, 2021
@pranav-prakash pranav-prakash marked this pull request as ready for review May 3, 2021 21:24
@pranav-prakash pranav-prakash requested a review from a team as a code owner May 3, 2021 21:24
@pranav-prakash
Contributor Author

@SherlockNoMad
Thanks, I just added the test so it should be ready for review. One question I had was that the test required me to explicitly import the com.microsoft domain in order for it to recognize the BatchNormInternal op – I'm not sure if this is expected behavior. When loading a model from disk is com.microsoft automatically added as a valid import?

@SherlockNoMad
Contributor

I think we also need to update gradient_builder_registry.cc, to register the gradient builder for BatchNormInternal.
We should be able to reuse "GetBatchNormalizationGradient" though.
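The suggestion above, reusing one gradient builder for both op types, can be sketched as follows. This is a hypothetical Python stand-in for the C++ registry in gradient_builder_registry.cc; the `GRADIENT_BUILDERS` dict and the placeholder builder are assumptions for illustration only.

```python
# Sketch of a gradient-builder registry: two op names can map to the
# same builder, so BatchNormInternal reuses the BatchNormalization one.

GRADIENT_BUILDERS = {}

def register_gradient(op_type, builder):
    """Associate an op type with a gradient builder function."""
    GRADIENT_BUILDERS[op_type] = builder

def get_batchnorm_gradient():
    # Placeholder for the real GetBatchNormalizationGradient builder.
    return "BatchNormalizationGrad"

register_gradient("BatchNormalization", get_batchnorm_gradient)
register_gradient("BatchNormInternal", get_batchnorm_gradient)

# Both entries point at the very same builder.
print(GRADIENT_BUILDERS["BatchNormInternal"] is GRADIENT_BUILDERS["BatchNormalization"])  # True
```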

@SherlockNoMad
Contributor

> @SherlockNoMad
> Thanks, I just added the test so it should be ready for review. One question I had was that the test required me to explicitly import the com.microsoft domain in order for it to recognize the BatchNormInternal op – I'm not sure if this is expected behavior. When loading a model from disk is com.microsoft automatically added as a valid import?

I can imagine the msDomain import is needed to make the UT work. In the training process, the msDomain kernel registry is automatically included at program start time, because most of the gradient ops are under MSDomain.

@SherlockNoMad
Contributor

We also need to add BatchNormInternal to STOP_GRADIENT_EDGES in gradient_graph_builder.h.

@SherlockNoMad
Contributor

Thanks again for the PR. The bulk of it is looking good!
We should be good to merge once the comments are addressed.

@pranav-prakash
Contributor Author

pranav-prakash commented May 7, 2021

@SherlockNoMad
Thanks for the quick review! I addressed the comments you left.

> register gradient builder for BatchNormInternal.

Done. I'll add a CPU implementation of the gradient in a subsequent PR.

I'll also register an implementation for BatchNormInternal (reusing the existing BatchNorm kernel) in a subsequent PR. I believe there's also a pre-existing CUDA kernel for batch norm training & gradient, but it seems to be broken (its tests were disabled), so someone who's more familiar with those parts and able to test them will have to modify them.

@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

SherlockNoMad
SherlockNoMad previously approved these changes May 7, 2021
@SherlockNoMad
Contributor

Sure. Appreciate it if you can contribute the CPU BatchNormGradient kernel!

@pranav-prakash
Contributor Author

pranav-prakash commented May 7, 2021

@SherlockNoMad
Seems like we can't remove InsertMaxPoolOutput from gradient_graph_builder.cc after all? It fails the test:

1: C++ exception with description "/onnxruntime_src/orttraining/orttraining/core/graph/gradient_builder_base.h:102 onnxruntime::training::ArgDef onnxruntime::training::GradientBuilderBase::O(size_t) const i < node_->OutputDefs().size() was false. 

although I'm not sure why, since we register it in graph_transformer_utils.cc as well. I'm not sure about the order in which these things are called, but is it possible that the tests only call into gradient_graph_builder.cc and not graph_transformer_utils.cc?

@SherlockNoMad
Contributor

SherlockNoMad commented May 7, 2021

Ah, I see... GradientCheckerTest.MaxPoolGrad is not invoking GeneratePreTrainingTransformers, as it takes the InferenceSession path to construct the test model, and InferenceSession's default graph transformers don't include InsertMaxpoolOutput...

Please revert the MaxPool-related changes; we will fix this in later PRs. Sorry for the inconvenience caused.

@pranav-prakash
Contributor Author

@SherlockNoMad Done, thanks for determining the cause!

@pranav-prakash
Contributor Author

pranav-prakash commented May 8, 2021

> the GradientCheckerTest.MaxPoolGrad is not invoking the GeneratePreTrainingTransformers

Actually this affects the unit tests for BatchNormGradient as well (seen in the below draft PR). Even if we register

rule_based_graph_transformer->Register(std::make_unique<BatchNormReplacement>());

in the constructor of GradientGraphBuilder, the reachable_nodes_ and forward_reachable_nodes sets are determined before the transform is run. So the unit test for BatchNormGradient will fail with the message

Cannot compute the partial derivative for 'input1' as it's unreachable from the output node(s).

since reachable_nodes_ contains the node as it was before transformation (BatchNormalization), but CheckNodeArgsReachable runs after the graph transformation. I've worked around it for now by moving the graph transform before the transitive-closure computation, but as you mentioned, the cleaner solution is to invoke GeneratePreTrainingTransformers directly.
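The ordering bug described above can be reproduced in miniature: a reachability set computed before a graph transform goes stale once the transform renames a node. This is a minimal hypothetical sketch (plain dicts, not ORT's graph classes), with made-up op names.

```python
# Sketch of the stale-reachability problem: the reachable set is
# computed on the pre-transform graph, then checked post-transform.

def reachable_ops(graph, output_op):
    """Collect op names reachable backwards from output_op."""
    seen, stack = set(), [output_op]
    while stack:
        op = stack.pop()
        if op not in seen:
            seen.add(op)
            stack.extend(graph.get(op, []))
    return seen

# Edges map each op to the ops producing its inputs.
graph = {"Loss": ["BatchNormalization"], "BatchNormalization": []}

before = reachable_ops(graph, "Loss")  # computed before the transform

# The transform renames the node, as BatchNormReplacement does.
graph = {"Loss": ["BatchNormInternal"], "BatchNormInternal": []}

# A check against the stale set now misses the renamed node:
print("BatchNormInternal" in before)  # False
```

Recomputing reachability after the transform (or running the transform first, as the workaround does) makes the check pass again.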

@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@SherlockNoMad SherlockNoMad self-assigned this May 10, 2021
@SherlockNoMad
Contributor

/azp run Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@SherlockNoMad SherlockNoMad merged commit a684e9a into microsoft:master May 10, 2021
mindest added a commit that referenced this pull request Apr 8, 2023
**Description**: Register an implementation for BatchNormInternal and
add a CPU kernel for BatchNormGradient. This is the third in a series of
PRs to implement BN training on CPU (first was #6946, second was #7539).

**Motivation and Context**
Support training networks with BatchNorm (e.g. convnets). Also note that
there exists a CUDA kernel for BN (forward training & backwards) but
it's currently disabled due to flaky failures; someone more familiar
with those parts can register the implementation for BNInternal on CUDA
(gradient kernel doesn't have to change).

---------

Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu>
Co-authored-by: mindest <linminuser@gmail.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>