Add pre-training transform to convert BatchNorm to BatchNormInternal #7539
SherlockNoMad merged 6 commits into microsoft:master
Conversation
Hi @pranav-prakash, thanks for contributing this! Let us know when it's ready for review, and I will help you merge the change.
@SherlockNoMad
I think we also need to update gradient_builder_registry.cc to register the gradient builder for BatchNormInternal.
I can imagine msDomain is needed to make the UT work. In the training process, the msDomain kernel registry is automatically included at program start time, because most of the gradient ops are under MSDomain.
orttraining/orttraining/core/optimizer/batchnorm_replacement.cc
Also need to add BatchNormInternal into the STOP_GRADIENT_EDGES in gradient_graph_builder.h.
Thanks again for the PR. The bulk of it is looking good!
@SherlockNoMad
Done. I'll add a CPU implementation of the gradient in a subsequent PR, and I'll also register an implementation for BatchNormInternal (reusing the existing BatchNorm kernel) in a subsequent PR. I believe there's also a pre-existing CUDA kernel for batch norm training & gradient, but it seems to be broken (its tests were disabled), so someone who is both more familiar with those parts and able to test them will have to modify them.
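For context, the gradient computation that a CPU BatchNormGradient kernel has to perform can be sketched as follows. This is a minimal NumPy illustration of the standard batch-norm backward pass, not ORT's actual kernel; the function names, the 2-D `(N, C)` layout, and the argument order are my own assumptions. The key point is that the backward pass reuses `saved_mean` and `saved_inv_std` from the training-mode forward pass rather than recomputing them.

```python
import numpy as np

def bn_forward_training(x, gamma, beta, eps=1e-5):
    # x: (N, C). Returns y plus the saved statistics needed by backward.
    saved_mean = x.mean(axis=0)
    saved_inv_std = 1.0 / np.sqrt(x.var(axis=0) + eps)
    x_hat = (x - saved_mean) * saved_inv_std
    y = gamma * x_hat + beta
    return y, saved_mean, saved_inv_std

def bn_backward(dy, x, gamma, saved_mean, saved_inv_std):
    # Standard batch-norm gradient, consuming the saved statistics.
    n = x.shape[0]
    x_hat = (x - saved_mean) * saved_inv_std
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dx_hat = dy * gamma
    dx = (saved_inv_std / n) * (
        n * dx_hat - dx_hat.sum(axis=0) - x_hat * (dx_hat * x_hat).sum(axis=0)
    )
    return dx, dgamma, dbeta
```

One sanity check on this formulation: because batch norm subtracts the batch mean, a constant upstream gradient produces a zero input gradient.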
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
Sure. I'd appreciate it if you can contribute the CPU BatchNormGradient kernel!!
@SherlockNoMad although I'm not sure why this is, since we register it in
Ah, I see. GradientCheckerTest.MaxPoolGrad is not invoking GeneratePreTrainingTransformers, as it takes the InferenceSession path to construct the test model, and InferenceSession's default graph transformers don't include InsertMaxpoolOutput. Please revert the MaxPool-related changes; we will fix this in later PRs. Sorry for the inconvenience caused.
@SherlockNoMad Done, thanks for determining the cause!
Actually this affects the unit tests for BatchNormGradient as well (seen in the draft PR below). Even if we register for the constructor of since
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline
/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed
Azure Pipelines successfully started running 9 pipeline(s).
Azure Pipelines successfully started running 8 pipeline(s).
/azp run Windows WebAssembly CI Pipeline, orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed
Azure Pipelines successfully started running 4 pipeline(s).
**Description**: Register an implementation for BatchNormInternal and add a CPU kernel for BatchNormGradient. This is the third in a series of PRs to implement BN training on CPU (the first was #6946, the second was #7539).

**Motivation and Context**: Support training networks with BatchNorm (e.g. convnets). Also note that there exists a CUDA kernel for BN (forward training & backward), but it's currently disabled due to flaky failures; someone more familiar with those parts can register the implementation for BNInternal on CUDA (the gradient kernel doesn't have to change).

Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu>
Co-authored-by: mindest <linminuser@gmail.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
Description: As opset 14 lacks the saved_mean/saved_inv_std outputs, we have to convert nodes to a custom op if we want to ensure good performance on the backward pass. This is the second in a series of PRs to implement BN training on CPU (the first was #6946).

As part of this we also remove the existing `AdjustBatchNormOutputs` pass – it only works for opset 9 BN and is subsumed by the more general `BatchNormReplacement` transform we define in this PR.

WIP – I think the only thing missing is a unit test. I'd also appreciate any feedback on whether it's necessary to copy/paste the entire schema def for the `BatchNormInternal` op, or whether something like protobuf's `toBuilder` exists for these schema defs.
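Conceptually, the replacement pass rewrites each BatchNormalization node in place: change its op type and domain, and extend its output list so the saved statistics become available to the backward pass. The toy sketch below illustrates that idea only; the `Node` shape, the padding of optional outputs, and the output names are my own simplifications, not ORT's real graph-transformer API.

```python
from dataclasses import dataclass

@dataclass
class Node:
    op_type: str
    domain: str       # "" is the default ONNX domain
    inputs: list
    outputs: list

def replace_batchnorm(nodes):
    """Toy analogue of a BatchNorm-replacement pass."""
    for node in nodes:
        if node.op_type == "BatchNormalization" and node.domain == "":
            node.op_type = "BatchNormInternal"
            node.domain = "com.microsoft"  # custom op lives in the MS domain
            # Pad the optional running-stat outputs if they were omitted,
            # then expose the training-time statistics as extra outputs.
            while len(node.outputs) < 3:
                node.outputs.append("")
            node.outputs += ["saved_mean", "saved_inv_std"]
    return nodes
```

In the real pass the new output names would have to be unique within the graph and wired into the gradient builder; here they are fixed strings purely for illustration.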