
Add pre-training transform to convert BatchNorm to BatchNormInternal #7539

Merged
SherlockNoMad merged 6 commits into microsoft:master from pranav-prakash:batchnorm_grad
May 10, 2021
Conversation

@pranav-prakash
Contributor

@pranav-prakash pranav-prakash commented May 1, 2021

Description: As opset 14 lacks the saved_mean/saved_inv_std outputs, we have to convert BatchNorm nodes to a custom op to ensure good performance on the backward pass. This is the second in a series of PRs to implement BN training on CPU (the first was #6946).

As part of this we also remove the existing AdjustBatchNormOutputs pass – this only works for opset 9 BN and is subsumed by the more general BatchNormReplacement we define in this PR.

WIP – I think the only thing missing is a unit test. I'd also appreciate any feedback on whether it's necessary to copy/paste the entire schema-def for the BatchNormInternal op or whether something like protobuf's toBuilder exists for these schema defs.
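The replacement pass described above can be sketched in miniature. This is an illustrative Python model of the idea, not ONNX Runtime's actual C++ transform; the node representation and the extra output names (`_saved_mean`, `_saved_inv_std`) are placeholders for the training outputs that opset 14 BatchNormalization no longer exposes.

```python
# Hypothetical sketch of a BatchNorm -> BatchNormInternal rewrite pass.
# Nodes are plain dicts standing in for ONNX NodeProto.

def replace_batchnorm(nodes):
    """Rewrite BatchNormalization nodes into com.microsoft BatchNormInternal."""
    for node in nodes:
        if node["op_type"] == "BatchNormalization":
            node["op_type"] = "BatchNormInternal"
            node["domain"] = "com.microsoft"
            # Placeholder names for the extra outputs the gradient builder
            # needs (saved mean / saved inverse std).
            base = node["outputs"][0]
            node["outputs"] += [base + "_saved_mean", base + "_saved_inv_std"]
    return nodes

graph = [{"op_type": "BatchNormalization", "domain": "",
          "inputs": ["X", "scale", "B", "mean", "var"], "outputs": ["Y"]}]
graph = replace_batchnorm(graph)
print(graph[0]["op_type"], len(graph[0]["outputs"]))  # BatchNormInternal 3
```

The real pass additionally has to register (or copy) a full op schema for BatchNormInternal, which is what the schema-def question above is about.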

@SherlockNoMad SherlockNoMad requested a review from mindest May 3, 2021 20:30
@SherlockNoMad SherlockNoMad added the training issues related to ONNX Runtime training; typically submitted using template label May 3, 2021
@SherlockNoMad
Contributor

Hi @pranav-prakash, thanks for contributing this!

Let us know when this is ready for review. I will help you merge this change.

@pranav-prakash pranav-prakash changed the title [WIP] Add pre-training transform to convert BatchNorm to BatchNormInternal Add pre-training transform to convert BatchNorm to BatchNormInternal May 3, 2021
@pranav-prakash pranav-prakash marked this pull request as ready for review May 3, 2021 21:24
@pranav-prakash pranav-prakash requested a review from a team as a code owner May 3, 2021 21:24
@pranav-prakash
Contributor Author

@SherlockNoMad
Thanks, I just added the test so it should be ready for review. One question I had was that the test required me to explicitly import the com.microsoft domain in order for it to recognize the BatchNormInternal op – I'm not sure if this is expected behavior. When loading a model from disk is com.microsoft automatically added as a valid import?

@SherlockNoMad
Contributor

I think we also need to update gradient_builder_registry.cc, to register the gradient builder for BatchNormInternal.
We should be able to reuse "GetBatchNormalizationGradient" though.
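The suggestion above, reusing one gradient builder for both op types, can be sketched as follows. This is a hypothetical Python stand-in for the C++ registry in gradient_builder_registry.cc; the `GRADIENT_BUILDERS` dict and the placeholder builder are assumptions for illustration only.

```python
# Sketch of a gradient-builder registry: two op names can map to the
# same builder, so BatchNormInternal reuses the BatchNormalization one.

GRADIENT_BUILDERS = {}

def register_gradient(op_type, builder):
    """Associate an op type with a gradient builder function."""
    GRADIENT_BUILDERS[op_type] = builder

def get_batchnorm_gradient():
    # Placeholder for the real GetBatchNormalizationGradient builder.
    return "BatchNormalizationGrad"

register_gradient("BatchNormalization", get_batchnorm_gradient)
register_gradient("BatchNormInternal", get_batchnorm_gradient)

# Both entries point at the very same builder.
print(GRADIENT_BUILDERS["BatchNormInternal"] is GRADIENT_BUILDERS["BatchNormalization"])  # True
```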

@SherlockNoMad
Contributor

> @SherlockNoMad
> Thanks, I just added the test so it should be ready for review. One question I had was that the test required me to explicitly import the com.microsoft domain in order for it to recognize the BatchNormInternal op – I'm not sure if this is expected behavior. When loading a model from disk is com.microsoft automatically added as a valid import?

I can imagine the msDomain import is needed to make the UT work. In the training process, the msDomain kernel registry is automatically included at program start time, because most of the gradient ops are under MSDomain.

@SherlockNoMad
Contributor

We also need to add BatchNormInternal to STOP_GRADIENT_EDGES in gradient_graph_builder.h.

@SherlockNoMad
Contributor

Thanks again for the PR. The bulk of it is looking good!
We should be good to merge once the comments are addressed.

@pranav-prakash
Contributor Author

pranav-prakash commented May 7, 2021

@SherlockNoMad
Thanks for the quick review! I addressed the comments you left.

> register gradient builder for BatchNormInternal.

Done. I'll add a CPU implementation of the gradient in a subsequent PR.

I'll also register an implementation for BatchNormInternal (reusing the existing BatchNorm kernel) in a subsequent PR. I believe there's also a pre-existing CUDA kernel for batch norm training & gradient, but it seems to be broken (its tests were disabled), so someone who's more familiar with those parts and able to test them will have to modify them.

@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

SherlockNoMad
SherlockNoMad previously approved these changes May 7, 2021
@SherlockNoMad
Contributor

Sure. Appreciate it if you can contribute the CPU BatchNormGradient kernel!

@pranav-prakash
Contributor Author

pranav-prakash commented May 7, 2021

@SherlockNoMad
Seems like we can't remove InsertMaxPoolOutput from gradient_graph_builder.cc after all? It fails the test:

1: C++ exception with description "/onnxruntime_src/orttraining/orttraining/core/graph/gradient_builder_base.h:102 onnxruntime::training::ArgDef onnxruntime::training::GradientBuilderBase::O(size_t) const i < node_->OutputDefs().size() was false. 

although I'm not sure why, since we register it in graph_transformer_utils.cc as well. I'm not sure about the order in which these things are called, but is it possible that the tests only call into gradient_graph_builder.cc and not graph_transformer_utils.cc?

@SherlockNoMad
Contributor

SherlockNoMad commented May 7, 2021

Ah, I see... GradientCheckerTest.MaxPoolGrad is not invoking GeneratePreTrainingTransformers, as it takes the InferenceSession path to construct the test model, and InferenceSession's default graph transformers don't include InsertMaxpoolOutput...

Please revert the MaxPool-related changes; we will fix this in later PRs. Sorry for the inconvenience caused.

@pranav-prakash
Contributor Author

@SherlockNoMad Done, thanks for determining the cause!

@pranav-prakash
Contributor Author

pranav-prakash commented May 8, 2021

> the GradientCheckerTest.MaxPoolGrad is not invoking the GeneratePreTrainingTransformers

Actually this affects the unit tests for BatchNormGradient as well (seen in the below draft PR). Even if we register

rule_based_graph_transformer->Register(std::make_unique<BatchNormReplacement>());

in the constructor of GradientGraphBuilder, the reachable_nodes_ and forward_reachable_nodes sets are determined before the transform is run. So the unit test for BatchNormGradient will fail with the message

Cannot compute the partial derivative for 'input1' as it's unreachable from the output node(s).

since reachable_nodes_ contains the node as it was before transformation (BatchNormalization), but CheckNodeArgsReachable runs after the graph transformation. I've worked around it for now by moving the graph transform before the transitive-closure computation, but as you mentioned, the cleaner solution is to invoke GeneratePreTrainingTransformers directly.
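The ordering bug described above can be reproduced in miniature: a reachability set computed before a graph transform goes stale once the transform renames a node. This is a minimal hypothetical sketch (plain dicts, not ORT's graph classes), with made-up op names.

```python
# Sketch of the stale-reachability problem: the reachable set is
# computed on the pre-transform graph, then checked post-transform.

def reachable_ops(graph, output_op):
    """Collect op names reachable backwards from output_op."""
    seen, stack = set(), [output_op]
    while stack:
        op = stack.pop()
        if op not in seen:
            seen.add(op)
            stack.extend(graph.get(op, []))
    return seen

# Edges map each op to the ops producing its inputs.
graph = {"Loss": ["BatchNormalization"], "BatchNormalization": []}

before = reachable_ops(graph, "Loss")  # computed before the transform

# The transform renames the node, as BatchNormReplacement does.
graph = {"Loss": ["BatchNormInternal"], "BatchNormInternal": []}

# A check against the stale set now misses the renamed node:
print("BatchNormInternal" in before)  # False
```

Recomputing reachability after the transform (or running the transform first, as the workaround does) makes the check pass again.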

@SherlockNoMad
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline

@SherlockNoMad
Contributor

/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed

@azure-pipelines

Azure Pipelines successfully started running 9 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 8 pipeline(s).

@SherlockNoMad SherlockNoMad self-assigned this May 10, 2021
@SherlockNoMad
Contributor

/azp run Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@SherlockNoMad SherlockNoMad merged commit a684e9a into microsoft:master May 10, 2021
mindest added a commit that referenced this pull request Apr 8, 2023
**Description**: Register an implementation for BatchNormInternal and
add a CPU kernel for BatchNormGradient. This is the third in a series of
PRs to implement BN training on CPU (first was #6946, second was #7539).

**Motivation and Context**
Support training networks with BatchNorm (e.g. convnets). Also note that
there exists a CUDA kernel for BN (forward training & backwards) but
it's currently disabled due to flaky failures; someone more familiar
with those parts can register the implementation for BNInternal on CUDA
(gradient kernel doesn't have to change).

---------

Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu>
Co-authored-by: mindest <linminuser@gmail.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>