Fix batch norm training op on CPU#6946
Conversation
|
Hey @pranav-prakash |
|
@neginraoof I'm not a msft employee, so I don't have any more power than you do with regard to code review. Or did you want me to just look over it and join the discussion? If so I've left my comments, |
|
Given that the training mode spec for BN is only fully fleshed out for opset 14, do we still need to support the case of training for non-spatial BN (which could happen with opset < 14)? Removing this codepath would simplify things greatly (we could still maintain correctness by guarding training on opset 14 with |
Yes. Since BN training mode was only officially added in opset 14, let's skip supporting the non-spatial mode for training to ease the burden. :) |
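For readers following along, here is a minimal, self-contained sketch of the spatial (per-channel) training-mode computation being discussed, assuming the opset-14 style running-stat update `running_mean = input_mean * momentum + current_mean * (1 - momentum)`. The helper name and NCHW flattening are illustrative, not ORT's actual kernel:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of per-channel (spatial) BatchNorm training-mode forward.
// X is NCHW flattened; HW = H * W. Hypothetical helper, not the real ORT kernel.
void BatchNormTrainForward(const std::vector<float>& X, size_t N, size_t C, size_t HW,
                           const std::vector<float>& scale, const std::vector<float>& bias,
                           float momentum, float epsilon,
                           std::vector<float>& Y,
                           std::vector<float>& running_mean, std::vector<float>& running_var) {
  for (size_t c = 0; c < C; ++c) {
    // Batch statistics over the batch and spatial dims for this channel.
    double sum = 0.0, sq_sum = 0.0;
    for (size_t n = 0; n < N; ++n)
      for (size_t i = 0; i < HW; ++i) {
        float v = X[(n * C + c) * HW + i];
        sum += v;
        sq_sum += double(v) * v;
      }
    double count = double(N * HW);
    float mean = float(sum / count);
    float var = float(sq_sum / count - double(mean) * mean);  // biased variance
    float inv_std = 1.0f / std::sqrt(var + epsilon);
    // Normalize using the *batch* statistics (training mode).
    for (size_t n = 0; n < N; ++n)
      for (size_t i = 0; i < HW; ++i) {
        size_t idx = (n * C + c) * HW + i;
        Y[idx] = (X[idx] - mean) * inv_std * scale[c] + bias[c];
      }
    // Opset-14 style running-stat update (assumption stated in the lead-in).
    running_mean[c] = running_mean[c] * momentum + mean * (1.0f - momentum);
    running_var[c] = running_var[c] * momentum + var * (1.0f - momentum);
  }
}
```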
|
Hi @pranav-prakash, thanks a lot for your contribution. |
|
Hi @pranav-prakash, I am also wondering what's your use scenario of ORT training? Which company/product are you working on? |
Yes I started the PR for this, but then saw that the spec was also updated to remove the In terms of ORT though, does this mean that we would define our own variant of the BatchNorm schema that includes outputs for
I'm associated with UC Berkeley's architecture research group, and we're working on an ORT EP for our RISC-V ML accelerator (Gemmini). In terms of use cases, at the moment we're primarily interested in training convolutional neural networks (in both bfloat16 and fp32), e.g. resnet or mobilenet. In terms of the exposed ORT training APIs we make use of, because running Python would be too much overhead (we're targeting edge devices), we call directly into the underlying C++ functions ( |
|
Thanks a lot for the detailed introduction! We are really happy to see external contributions to ORT Training!!! As for the BatchNorm problem, the plan is to write a custom op (say, BatchNormInternal) that outputs the saved_mean and save_inv_std. We will substitute BatchNorm with BatchNormInternal before building the training graph. The BatchNormGrad op can still assume that O(3) and O(4) are present to speed up the computation. The rationale behind the ONNX spec update is that "save_inv_std" is an internal implementation detail (other frameworks may use save_inv_var instead), so it's better to leave it out of the spec. |
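For context on why that choice can stay internal: given epsilon, the saved inverse standard deviation and the saved variance are interconvertible, so a backward kernel can work from either. A hypothetical illustration (not ORT code):

```cpp
#include <cassert>
#include <cmath>

// The two quantities a training kernel might stash for the backward pass
// are interchangeable, which is why the spec can leave the choice to the
// implementation. Hypothetical helpers for illustration only.
float InvStdFromVar(float var, float epsilon) {
  return 1.0f / std::sqrt(var + epsilon);
}

float VarFromInvStd(float inv_std, float epsilon) {
  return 1.0f / (inv_std * inv_std) - epsilon;
}
```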
|
Just curious, how far have you gotten with training convolutional neural networks, e.g. resnet or mobilenet? Are they working yet? We are also exploring federated learning on mobile/edge devices. Would be nice to collaborate if our plans align well. |
I see – so for this PR would you like me to create the schema for such a
We just recently got a trainer for resnet50 working, although we haven't yet verified end-to-end correctness, since FPGA simulation of Gemmini is much too slow for training resnet from scratch (we could likely compare results from a few dozen iterations against the CPU EP, though).
Yeah federated learning and training at the edge are exactly the scenarios we envisioned Gemmini would be a good fit for. We'd love to discuss our roadmap and see if there's any potential for collaboration here; feel free to email us at |
|
BatchNormTrainingInternal and BatchNorm can share the same kernel. Most of the code should be the same, except for the handling of outputs 3 and 4. In the kernel, we can check whether ctx->Output(3, shape) and ctx->Output(4, shape) return nullptr. If you have the bandwidth, you can add the BatchNormalizationTraining schema to training_op_defs.cc. Also, a sample of the replacement code can be found in concat_replacement.cc, which replaces Concat with ConcatTraining. |
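A self-contained sketch of that optional-output pattern — the `float*` parameters below stand in for what ORT's `ctx->Output(index, shape)` would hand back, which is nullptr when an optional output is absent; everything else here is illustrative, not the real ORT API:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One compute path serving both the inference and training variants of the
// kernel: it simply skips the extra outputs when the caller passes nullptr
// (mirroring ctx->Output(3, shape) / ctx->Output(4, shape) returning nullptr).
void ComputeStats(const std::vector<float>& x, float epsilon,
                  float* saved_mean /* may be nullptr */,
                  float* saved_inv_std /* may be nullptr */) {
  double sum = 0.0, sq = 0.0;
  for (float v : x) {
    sum += v;
    sq += double(v) * v;
  }
  float mean = float(sum / x.size());
  float var = float(sq / x.size() - double(mean) * mean);
  if (saved_mean) *saved_mean = mean;  // output 3: populated only in training
  if (saved_inv_std) *saved_inv_std = 1.0f / std::sqrt(var + epsilon);  // output 4
}
```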
|
@pranav-prakash. Great to know that you got resnet50 working... AFAIK, the gradient is missing for the LRN, Sum, and GlobalMaxPool ops... GlobalAveragePool's gradient builder may also need a fix... |
SG, I'll update the PR accordingly. Not sure I'll have the bandwidth to add the
We converted |
2f78c94 to
a139854
|
@SherlockNoMad Since the batch norm grad cannot be implemented until the schema & graph transform for internal |
|
#7177 |
|
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline |
|
/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed |
|
Azure Pipelines successfully started running 9 pipeline(s). |
|
Azure Pipelines successfully started running 8 pipeline(s). |
|
@SherlockNoMad |
|
See line 44 in batch_norm_op_test.cc; I think it's correct to set them in excluded_eps. |
SherlockNoMad
left a comment
Thanks a lot for this PR.
I think it's ready to be merged, after the TRT and OpenVINO fixes. |
|
@SherlockNoMad fixed. One other question I had was about the old cuda-test for That is, it had |
|
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline |
|
/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed |
|
Azure Pipelines successfully started running 9 pipeline(s). |
|
I think the old test case was incorrect. |
|
Azure Pipelines successfully started running 8 pipeline(s). |
|
/azp run Linux CPU CI Pipeline,Linux CPU x64 NoContribops CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,MacOS CI Pipeline,MacOS NoContribops CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline |
|
/azp run orttraining-linux-ci-pipeline,orttraining-mac-ci-pipeline,orttraining-linux-gpu-ci-pipeline,centos7_cpu,Linux CPU Minimal Build E2E CI Pipeline,Linux Nuphar CI Pipeline,MacOS NoContribops CI Pipeline,Linux OpenVINO CI Pipeline,orttraining-distributed |
|
Azure Pipelines successfully started running 9 pipeline(s). |
|
Azure Pipelines successfully started running 8 pipeline(s). |
|
/azp run orttraining-amd-gpu-ci-pipeline, orttraining-ortmodule, orttraining-ortmodule-distributed |
|
Azure Pipelines successfully started running 3 pipeline(s). |
|
/azp run orttraining-amd-gpu-ci-pipeline |
|
Azure Pipelines successfully started running 1 pipeline(s). |
**Description**: Register an implementation for BatchNormInternal and add a CPU kernel for BatchNormGradient. This is the third in a series of PRs to implement BN training on CPU (the first was #6946, the second #7539).

**Motivation and Context**: Support training networks with BatchNorm (e.g. convnets). Also note that there exists a CUDA kernel for BN (forward training & backward), but it's currently disabled due to flaky failures; someone more familiar with those parts can register the implementation for BNInternal on CUDA (the gradient kernel doesn't have to change).

Co-authored-by: Simon Zirui Guo <simonguozirui@berkeley.edu>
Co-authored-by: mindest <linminuser@gmail.com>
Co-authored-by: mindest <30493312+mindest@users.noreply.github.com>
Description: Add support for the training (forward) mode of BatchNorm on the CPU EP, and implement the BN gradient on the CPU EP
Fixes the issue described in #6087
Motivation and Context
As mentioned in the above issue, batch norm is currently not implemented for training. This PR makes the following changes:
As a side note, I wonder if the flaky CUDA tests for batch norm could also be resolved by d01006f