From a8c2549cd5d5de102519611bdf988915e0eba660 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Tue, 8 Jun 2021 21:04:04 -0700 Subject: [PATCH 01/21] draft v1 --- rfcs/0001-AMP_pass.md | 128 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 128 insertions(+) create mode 100644 rfcs/0001-AMP_pass.md diff --git a/rfcs/0001-AMP_pass.md b/rfcs/0001-AMP_pass.md new file mode 100644 index 00000000..e0b9b26a --- /dev/null +++ b/rfcs/0001-AMP_pass.md @@ -0,0 +1,128 @@ +- Feature Name: Automatic Mixed Precision Pass +- Start Date: 2021-06-08 +- RFC PR: [apache/tvm-rfcs#0001](https://github.com/apache/tvm-rfcs/pull/0002) +- GitHub Issue: [apache/tvm#0001](https://github.com/apache/tvm/issues/0002) + +# Summary +[summary]: #summary + +Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. +These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth. +As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. +Surprisingly, for many operations this has little effect on the results, though some care must had when changing +operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe +due to loss of numerical precision (source). In general for a function `f`, if `|f(x)| >> |x|` for expected +ranges of input we probably do not want to use the 16 bit floating point versions. + +This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit +floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support +for bfloat16 should be in mind. + +# Motivation +[motivation]: #motivation + +Many machine learning models can move significant portions of their computational graphs into the FP16 space +without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In +the past utilizing FP16 in mixed precision training saw signficiant increases in convergence speed (source). + +We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable +for many users. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit +of using a reduced floating point version of the operation. "Green" operations are compute intensive +and almost always see signficant memory and latency savings by utilizing a reduced floating point form. +Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to +know savings in using reduced floating point forms -- at least not enough to justify the overhead of +casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to +use reduced floating point forms on, usually due to numerical precision reasons. + +In general we always want to insert casts into reduced floating point space for "Green" operations, +are fine with transforming "Gray" operations into reduced floating point space if their inputs are already +in that form, and want to explictly cast back into full floating point space for "Red" operations. +Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes` +and returns a color. 
For example, we might have a function which colors only a convolution as "Green" if it +has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple +however and do something like place all convolutions in the "Green" list, all elementwise operations in +the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting +this "coloring" function. + +The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced +floating point types. However, while they for example may take two FP16 operands they may accumulate the +result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. +The final knob we give is a control over how operations accumulate their result. For this, we have +a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output +datatype. The output datatype is the type other operations down the line will likely ingest from the previous +calculation while the accumulation datatype describes the size of buffer where the results are initially +stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype +of FP16. The default implementation will follow this guideline closely and will by default have all +operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular +operation. + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994) + +# Drawbacks +[drawbacks]: #drawbacks + +If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we +will have to make sure it works on a wide range of models or people will be very mad at TVM. + +This might not be useful if mixed precision training becomes super popular in the future in which +case most models might be in a reduced precision floating point form already. + +It also might not be useful if integer quantization becomes super popular, though it may be possible +to mix integer quantization and mixed floating precision techniques. Floating point does have +several advantages still over integer quantization including simplicity and the fact that some +operators like `sin` and `erf` are still designed in hardware with floating point in mind. + +# Rationale and alternatives +[rationale-and-alternatives]: #rationale-and-alternatives + +- Why is this design the best in the space of possible designs? + +Other alternatives require a lot more work and changes and could probably considered future goals of TVM. +This include automatic mixed precision training. + +- What other designs have been considered and what is the rationale for not choosing them? + +We can support automatic mixed precision retraining though that is a much, much larger future goal. It's +good to have this in the meantime. + +- What is the impact of not doing this? + +TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table. 
+ +# Prior art +[prior-art]: #prior-art + +Many of the ideas are taken from Tensorflow's [automatic mixed precision training framework](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf) +and the initial "Green", "Gray", and "Red" lists are based [similarly](github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h). + +# Unresolved questions +[unresolved-questions]: #unresolved-questions + +- What parts of the design do you expect to resolve through the RFC process before this gets merged? + +We still need to make sure that the current design and knobs exposed provide extensibility to every hardware platform out there. + +- What parts of the design do you expect to resolve through the implementation of this feature before stabilization? + +Probably a lot of edge cases of operations within TVM. + +- What related issues do you consider out of scope for this RFC that could be addressed in the future + independently of the solution that comes out of this RFC? + +Making accumulation datatypes a standard idea for all operations within TVM. + +# Future possibilities +[future-possibilities]: #future-possibilities + +Really this can be used for any floating point datatype. A custom FP24 for FPGA? +BFloat16? Some other weird floating point type? We have an easy way to convert +toward utilizing these weird floating point types with FP32 when appropriate +under this framework. From 6698203a65881607b9730ce7ab3433754f721e5e Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Wed, 9 Jun 2021 10:57:38 -0700 Subject: [PATCH 02/21] add sources --- rfcs/0001-AMP_pass.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/rfcs/0001-AMP_pass.md b/rfcs/0001-AMP_pass.md index e0b9b26a..015b97c1 100644 --- a/rfcs/0001-AMP_pass.md +++ b/rfcs/0001-AMP_pass.md @@ -11,7 +11,8 @@ These 16 bit operations typically have higher theoretical throughput and involve As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. Surprisingly, for many operations this has little effect on the results, though some care must had when changing operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe -due to loss of numerical precision (source). In general for a function `f`, if `|f(x)| >> |x|` for expected +due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). +In general for a function `f`, if `|f(x)| >> |x|` for expected ranges of input we probably do not want to use the 16 bit floating point versions. This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit @@ -23,7 +24,7 @@ for bfloat16 should be in mind. Many machine learning models can move significant portions of their computational graphs into the FP16 space without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In -the past utilizing FP16 in mixed precision training saw signficiant increases in convergence speed (source). +the past utilizing FP16 in mixed precision training saw signficiant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable for many users. 
@@ -64,7 +65,11 @@ operation. # Reference-level explanation [reference-level-explanation]: #reference-level-explanation -See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994) +See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994). + +As some have noticed the design can be simplified to a single pass where casting is determined by +running type inference on mutated nodes. With a postorder traversal we can then check if we need to +cast arguments/propagate color. # Drawbacks [drawbacks]: #drawbacks From 2a209ef866985cd9f9b9d6bc5ae1f5e913de832f Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Wed, 9 Jun 2021 11:06:49 -0700 Subject: [PATCH 03/21] editor for spelling --- rfcs/0001-AMP_pass.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/rfcs/0001-AMP_pass.md b/rfcs/0001-AMP_pass.md index 015b97c1..2054c4c2 100644 --- a/rfcs/0001-AMP_pass.md +++ b/rfcs/0001-AMP_pass.md @@ -1,7 +1,7 @@ - Feature Name: Automatic Mixed Precision Pass - Start Date: 2021-06-08 -- RFC PR: [apache/tvm-rfcs#0001](https://github.com/apache/tvm-rfcs/pull/0002) -- GitHub Issue: [apache/tvm#0001](https://github.com/apache/tvm/issues/0002) +- RFC PR: TODO +- GitHub Issue: TODO # Summary [summary]: #summary @@ -24,7 +24,7 @@ for bfloat16 should be in mind. Many machine learning models can move significant portions of their computational graphs into the FP16 space without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In -the past utilizing FP16 in mixed precision training saw signficiant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). +the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable for many users. @@ -34,19 +34,19 @@ for many users. Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit of using a reduced floating point version of the operation. "Green" operations are compute intensive -and almost always see signficant memory and latency savings by utilizing a reduced floating point form. +and almost always see hardware memory and latency savings by utilizing a reduced floating point form. Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to -know savings in using reduced floating point forms -- at least not enough to justify the overhead of +no savings in using reduced floating point forms -- at least not enough to justify the overhead of casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to use reduced floating point forms on, usually due to numerical precision reasons. In general we always want to insert casts into reduced floating point space for "Green" operations, are fine with transforming "Gray" operations into reduced floating point space if their inputs are already -in that form, and want to explictly cast back into full floating point space for "Red" operations. +in that form, and want to explicitly cast back into full floating point space for "Red" operations. 
Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes` and returns a color. For example, we might have a function which colors only a convolution as "Green" if it has a large enough kernel and "Gray" otherwise. For the default implementation we will keep things simple -however and do something like place all convolutions in the "Green" list, all elementwise operations in +however and do something like place all convolutions in the "Green" list, all element-wise operations in the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting this "coloring" function. @@ -68,7 +68,7 @@ operation. See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32-fp16-model-support/9994). As some have noticed the design can be simplified to a single pass where casting is determined by -running type inference on mutated nodes. With a postorder traversal we can then check if we need to +running type inference on mutated nodes. With a post-order traversal we can then check if we need to cast arguments/propagate color. # Drawbacks From 97ddca98f26b99b68e5149ba28c9a539aad18e2a Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Thu, 17 Jun 2021 00:04:16 -0700 Subject: [PATCH 04/21] add plans for benchmarking + tutorial --- rfcs/0001-AMP_pass.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/rfcs/0001-AMP_pass.md b/rfcs/0001-AMP_pass.md index 2054c4c2..7e1bebbd 100644 --- a/rfcs/0001-AMP_pass.md +++ b/rfcs/0001-AMP_pass.md @@ -71,6 +71,10 @@ As some have noticed the design can be simplified to a single pass where casting running type inference on mutated nodes. With a post-order traversal we can then check if we need to cast arguments/propagate color. +Part of the associated RFC issue will also be used dedicated to creating a tutorial on how to control +the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking +the performance gains from the pass. 
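+
+To make the single-pass, post-order idea above concrete, below is a minimal illustrative sketch
+(not the actual implementation): the `color_of` helper and its hard-coded operator list are
+placeholders, and a real pass would additionally run type inference to skip non-floating-point
+tensors, avoid redundant casts, and handle accumulation datatypes.
+
+```python
+import tvm
+from tvm import relay
+from tvm.relay.expr_functor import ExprMutator
+
+GREEN, GRAY, RED = "Green", "Gray", "Red"
+
+
+def color_of(call):
+    # Placeholder coloring function: convolutions and dense layers are "Green",
+    # everything else is treated as "Gray" for this sketch.
+    return GREEN if call.op.name in ("nn.conv2d", "nn.dense") else GRAY
+
+
+class ToFP16Sketch(ExprMutator):
+    """Post-order rewrite: arguments are mutated first, then each CallNode
+    decides whether its (already rewritten) arguments should be cast."""
+
+    def visit_call(self, call):
+        new_args = [self.visit(arg) for arg in call.args]  # post-order traversal
+        if not isinstance(call.op, tvm.ir.Op):
+            # Leave calls to Relay functions untouched in this sketch.
+            return relay.Call(call.op, new_args, call.attrs, call.type_args)
+        color = color_of(call)
+        if color == GREEN:
+            # Always compute "Green" ops in reduced precision.
+            new_args = [relay.cast(a, "float16") for a in new_args]
+        elif color == RED:
+            # Explicitly return "Red" ops to full precision.
+            new_args = [relay.cast(a, "float32") for a in new_args]
+        # "Gray" ops simply follow whatever precision their inputs already have.
+        return relay.Call(call.op, new_args, call.attrs, call.type_args)
+```
+
+Applying `ToFP16Sketch().visit(func)` to a Relay function `func` and then re-running
+`relay.transform.InferType()` completes the rewrite for this sketch.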
+ # Drawbacks [drawbacks]: #drawbacks From 45133d501e66e5aa4277052901a805e198b228a3 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Thu, 5 Aug 2021 20:34:00 -0700 Subject: [PATCH 05/21] rename to rfc name --- rfcs/{0001-AMP_pass.md => 0006-AMP_pass.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename rfcs/{0001-AMP_pass.md => 0006-AMP_pass.md} (100%) diff --git a/rfcs/0001-AMP_pass.md b/rfcs/0006-AMP_pass.md similarity index 100% rename from rfcs/0001-AMP_pass.md rename to rfcs/0006-AMP_pass.md From fcb5325bc6f34f7a1dd5f147968927832513c05a Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Thu, 5 Aug 2021 20:35:14 -0700 Subject: [PATCH 06/21] add links to PRs and Issues --- rfcs/0006-AMP_pass.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index 7e1bebbd..fa6d75df 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -1,7 +1,7 @@ - Feature Name: Automatic Mixed Precision Pass - Start Date: 2021-06-08 -- RFC PR: TODO -- GitHub Issue: TODO +- RFC PR: https://github.com/apache/tvm-rfcs/pull/6 +- GitHub Issue: https://github.com/apache/tvm/issues/8296 # Summary [summary]: #summary From 56831ab0c708ac45d48ee3415062b96f781be2b3 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Thu, 5 Aug 2021 20:53:25 -0700 Subject: [PATCH 07/21] light edits --- rfcs/0006-AMP_pass.md | 48 +++++++++++++++++++++---------------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index fa6d75df..3b1f01d6 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -6,48 +6,48 @@ # Summary [summary]: #summary -Many pieces of hardware support operation not only on 32 bit floating point, but also 16 bit floating point. +Many pieces of hardware support arithmetic not only on IEEE 32 bit floating point numbers, but also IEEE 16 bit floating point numbers. These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth. -As a result, we can see significant increases from changing normal 32 bit operations with 16 bit analogs. -Surprisingly, for many operations this has little effect on the results, though some care must had when changing -operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered less safe +As a result, we can see significant increases in speed from changing 32 bit floating point operations into 16 bit analogs for many models. +Surprisingly, this change has little affect on the results of some models, though some care must had when changing a select few +operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered unsafe to use 16 bit analogs due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). -In general for a function `f`, if `|f(x)| >> |x|` for expected -ranges of input we probably do not want to use the 16 bit floating point versions. +In general for a function `f`, if `|f(x)| >> |x|` for expected ranges of input we probably do not want to use the 16 bit floating point versions. -This feature will be a relay pass which automatically converts a 32 bit floating point model into a reduced bit -floating point analog. For the initial pass IEEE's 16 bit floating point will be targeted though future support -for bfloat16 should be in mind. 
+This RFC describes a relay pass which automatically converts a 32 bit floating point model into a reduced bit +floating point analog. For the initial work, IEEE's 16 bit floating point will be targeted though future support +for bfloat16 will be held in mind. Additionally, we discuss some additional work that must be done to support 16 bit floating point +on some common targets. # Motivation [motivation]: #motivation Many machine learning models can move significant portions of their computational graphs into the FP16 space -without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. In -the past utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). +without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. For example, +Pytorch saw utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). -We should expect similar increases for inference. This speed increase without accuracy loss is highly desirable +We should expect similar increases for inference. This speed increase without significant accuracy loss is highly desirable for many users. # Guide-level explanation [guide-level-explanation]: #guide-level-explanation -Operations are partitioned into colors denoted "Green", "Red", and "Gray" which represents the benefit -of using a reduced floating point version of the operation. "Green" operations are compute intensive +Operations are partitioned into categories denoted "ALLOW", "DENY", and "FOLLOW" which represents the benefit +of using a reduced floating point version of the operation. "ALLOW" operations are compute intensive and almost always see hardware memory and latency savings by utilizing a reduced floating point form. -Examples of these operations are matrix multiplies and convolutions. "Gray" operations see little to +Examples of these operations are matrix multiplication and convolutions. "FOLLOW" operations see little to no savings in using reduced floating point forms -- at least not enough to justify the overhead of -casting values back and forth from FP32. "Red" operations meanwhile are operations we do not want to +casting values back and forth from FP32. "DENY" operations meanwhile are operations we do not want to use reduced floating point forms on, usually due to numerical precision reasons. -In general we always want to insert casts into reduced floating point space for "Green" operations, -are fine with transforming "Gray" operations into reduced floating point space if their inputs are already -in that form, and want to explicitly cast back into full floating point space for "Red" operations. +In general we always want to insert casts into reduced floating point space for "ALLOW" operations, +are fine with transforming "FOLLOW" operations into reduced floating point space if their inputs are already +in that form, and want to explicitly cast back into full floating point space for "DENY" operations. Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes` -and returns a color. For example, we might have a function which colors only a convolution as "Green" if it -has a large enough kernel and "Gray" otherwise. 
For the default implementation we will keep things simple -however and do something like place all convolutions in the "Green" list, all element-wise operations in -the "Gray" list, and so on. Still, the code will be designed to be easily extensible via overwriting +and returns a color. For example, we might have a function which colors only a convolution as "ALLOW" if it +has a large enough kernel and "FOLLOW" otherwise. For the default implementation we will keep things simple +however and do something like place all convolutions in the "ALLOW" list, all element-wise operations in +the "FOLLOW" list, and so on. Still, the code will be designed to be easily extensible via overwriting this "coloring" function. The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced @@ -110,7 +110,7 @@ TVM is not the best tool for making models go fast as we leave a lot of free spe [prior-art]: #prior-art Many of the ideas are taken from Tensorflow's [automatic mixed precision training framework](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf) -and the initial "Green", "Gray", and "Red" lists are based [similarly](github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h). +and the initial "ALLOW", "FOLLOW", and "DENY" lists are based [similarly](github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h). # Unresolved questions [unresolved-questions]: #unresolved-questions From ed104ec7fc7ac488bf12133d74eee7b63636a1fd Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Fri, 13 Aug 2021 11:13:56 -0700 Subject: [PATCH 08/21] flesh out interface for user --- rfcs/0006-AMP_pass.md | 131 +++++++++++++++++++++++++++++++----------- 1 file changed, 99 insertions(+), 32 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index 3b1f01d6..9b44a636 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -1,4 +1,4 @@ -- Feature Name: Automatic Mixed Precision Pass +- Feature Name: Automatic Mixed Precision Pass and support - Start Date: 2021-06-08 - RFC PR: https://github.com/apache/tvm-rfcs/pull/6 - GitHub Issue: https://github.com/apache/tvm/issues/8296 @@ -9,30 +9,42 @@ Many pieces of hardware support arithmetic not only on IEEE 32 bit floating point numbers, but also IEEE 16 bit floating point numbers. These 16 bit operations typically have higher theoretical throughput and involve less use of memory bandwidth. As a result, we can see significant increases in speed from changing 32 bit floating point operations into 16 bit analogs for many models. -Surprisingly, this change has little affect on the results of some models, though some care must had when changing a select few -operations. Some 16 bit floating point operations such as `exp` and `log` for example are considered unsafe to use 16 bit analogs +Surprisingly, this change has little effect on the results of some models, even without retraining, though some care must had when changing a select few +operations. For example, some 16 bit floating point operations such as `exp` and `log` are considered generally unsafe in the 16 bit floating point space due to loss of [numerical precision](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf). -In general for a function `f`, if `|f(x)| >> |x|` for expected ranges of input we probably do not want to use the 16 bit floating point versions. 
+In general for a function `f`, if `|f(x)| >> |x|` for expected ranges of input we probably want to stick to 32 bit floating point versions. +As a result, within models, 16 bit floating point is often interspersed with 32 bit floating point operations for unsafe operations. The usage of +differing precision for floating point in a model is often called "Mixed Precision." -This RFC describes a relay pass which automatically converts a 32 bit floating point model into a reduced bit -floating point analog. For the initial work, IEEE's 16 bit floating point will be targeted though future support -for bfloat16 will be held in mind. Additionally, we discuss some additional work that must be done to support 16 bit floating point -on some common targets. +This RFC describes a plan to support automatic mixed floating point precision models within TVM. Specifically, we focus on the conversion +of an existing, trained 32 bit floating point model, into a mixed precision model without retraining. Note, we do not focus on the conversion +of models already operating in a mixed precision space though much of the work being done will help guarantee support for these models. + +In particular we focus discussion on the following areas: +- Creating a pass to automatically transform a 32 bit floating point Relay model into a 16 bit analog +- The changes in the intermediate representation of TVM that must be made to ensure wide operator support for FP16 +- Issues in some codegen pathways that must be address to ensure wide support for FP16. + +For the initial work, IEEE's 16 bit floating point will be targeted though future support for bfloat16 will be held in mind. # Motivation [motivation]: #motivation -Many machine learning models can move significant portions of their computational graphs into the FP16 space +Many machine learning models can move large portions of their computational graphs into the FP16 space without significant loss of accuracy. For many pieces of hardware this also comes with a boost in speed. For example, -Pytorch saw utilizing FP16 in mixed precision training saw significant [increases in convergence speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). +PyTorch utilized FP16 in mixed precision training and saw significant [increases in training speed](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). -We should expect similar increases for inference. This speed increase without significant accuracy loss is highly desirable -for many users. +We should expect similar increases in speed for inference. + +This speed increase without significant accuracy loss is highly desirable for many users. # Guide-level explanation [guide-level-explanation]: #guide-level-explanation -Operations are partitioned into categories denoted "ALLOW", "DENY", and "FOLLOW" which represents the benefit +## Pass Explanation +The mixed precision pass operates on Relay models and their operations. + +Operations are partitioned into category lists denoted "ALLOW", "DENY", and "FOLLOW" which represents the benefit of using a reduced floating point version of the operation. "ALLOW" operations are compute intensive and almost always see hardware memory and latency savings by utilizing a reduced floating point form. Examples of these operations are matrix multiplication and convolutions. 
"FOLLOW" operations see little to @@ -40,27 +52,82 @@ no savings in using reduced floating point forms -- at least not enough to justi casting values back and forth from FP32. "DENY" operations meanwhile are operations we do not want to use reduced floating point forms on, usually due to numerical precision reasons. -In general we always want to insert casts into reduced floating point space for "ALLOW" operations, +We always want to insert casts into reduced floating point space for inputs to "ALLOW" operations, are fine with transforming "FOLLOW" operations into reduced floating point space if their inputs are already in that form, and want to explicitly cast back into full floating point space for "DENY" operations. -Each operation will be placed into one of these lists via a "coloring" function which take in Relay `CallNodes` -and returns a color. For example, we might have a function which colors only a convolution as "ALLOW" if it -has a large enough kernel and "FOLLOW" otherwise. For the default implementation we will keep things simple -however and do something like place all convolutions in the "ALLOW" list, all element-wise operations in -the "FOLLOW" list, and so on. Still, the code will be designed to be easily extensible via overwriting -this "coloring" function. - -The final variable we must keep in mind is the fact that some hardware platforms can operate on reduced -floating point types. However, while they for example may take two FP16 operands they may accumulate the -result in a 32 bit buffer. An example of this are the Tensor Cores in Nvidia's Turing architecture. -The final knob we give is a control over how operations accumulate their result. For this, we have -a function, which maps operation types like `conv2d` to an accumulation datatype as well as an output -datatype. The output datatype is the type other operations down the line will likely ingest from the previous -calculation while the accumulation datatype describes the size of buffer where the results are initially -stored. For NVidia's tensor cores for example many operations accumulate in FP32 but have an output datatype -of FP16. The default implementation will follow this guideline closely and will by default have all -operations output FP16 and accumulate in FP32 only if TVM supports mixed datatypes for that particular -operation. +Each operation will be placed into one of these lists via a function which take in Relay `CallNodes` +and returns either "ALLOW", "DENY", or "FOLLOW. For example, we might have a function which colors only +a convolution as "ALLOW" if it has a large enough kernel and "FOLLOW" otherwise. + +The final consideration is using higher bit accumulators. For example, for a global average pool, we might +have 16 bit floating point inputs, but accumulate the result in a 32 bit floating point buffer in order to +maintain numerical information. As a result, we must have a way to communicate whether an operator should +accumulate results in a higher bit buffer. An example of hardware with native support for this sort of operation +are the Tensor Cores in Nvidia's Turing architecture. For NVidia's Tensor Cores for example have many operations +accumulate in FP32 but have an output datatype of FP16. + +The interface to control the conversion of an operator for the mixed precision pass is therefore as follows: + - Write a function in python which given a Relay CallNode, decides whether it should be in the "ALLOW", + "FOLLOW", or "DENY" lists of operations. 
Furthermore, the function should decide the accumulation + and output datatypes of the operation. + ```python + def color_func(call_node: "relay.Call", mixed_precision_dtype: str) -> Tuple[int, str, str]: + """ + Parameters + ---------- + call_node: + A Relay Call node which is currently being examined by the mixed precision pass. + + mixed_precision_dtype: + The datatype of the mixed precision pass (i.e. usually float16). + + Returns + ------- + result : Tuple[int, str, str] + A tuple where the first element (int) represents a code describing the operation as belonging to "ALLOW", "DENY", or "FOLLOW" lists. + The second element describes the accumulation datatype of the operation (i.e. usually float32 or mixed_precision_dtype). The third + element describes the output datatype of the operation (i.e. usually mixed_precision_dtype). + """ + ``` + - Register the function as an operator attribute with a provided function: + ```python + def register_mixed_precision_conversion(op_name, func=None, level=10): + """Register mixed precision conversion function for an op + + Given an op the function should return information on how the value should be + converted. Specifically the function should take a call node and the target + mixed precision datatype (e.g. FP16) and return the conversion category + (see python/tvm/relay/transform/mixed_precision.py) as well as the accumulation + and output datatype of the operation in the mixed precision dtype space. + + Parameters + ---------- + op_name : str + The name of the operator + + func: function (call_node: relay.Call, target_dtype: string) + -> [conversion category, accumulation dtype, output dtype]: [int, string, string] + A function which given a call_node and target_dtype (e.g. FP16) returns the + conversion category and associated accumulation/output of the operation + when transformed into the mixed precision dtype space. + + level : int + The priority level + """ + ``` +By default, unregistered operators will always be assumed to be in the "FOLLOW" list of operations and accumulate +and output results as the mixed precision dtype. A default registry of functions will also be provided and be based +on TensorFlow's [similar feature](github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h). + + + + + # Reference-level explanation [reference-level-explanation]: #reference-level-explanation From 87813af9cdc03fdc32436e128d6fc71ba103f7a2 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Fri, 13 Aug 2021 11:29:04 -0700 Subject: [PATCH 09/21] add example --- rfcs/0006-AMP_pass.md | 54 ++++++++++++++++++++++++++++--------------- 1 file changed, 35 insertions(+), 19 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index 9b44a636..3bd691f4 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -41,9 +41,10 @@ This speed increase without significant accuracy loss is highly desirable for ma # Guide-level explanation [guide-level-explanation]: #guide-level-explanation -## Pass Explanation The mixed precision pass operates on Relay models and their operations. +We define an operator as in "mixed precision" space if it's inputs are in reduced precision form (e.g. FP16). + Operations are partitioned into category lists denoted "ALLOW", "DENY", and "FOLLOW" which represents the benefit of using a reduced floating point version of the operation. 
"ALLOW" operations are compute intensive and almost always see hardware memory and latency savings by utilizing a reduced floating point form. @@ -52,9 +53,9 @@ no savings in using reduced floating point forms -- at least not enough to justi casting values back and forth from FP32. "DENY" operations meanwhile are operations we do not want to use reduced floating point forms on, usually due to numerical precision reasons. -We always want to insert casts into reduced floating point space for inputs to "ALLOW" operations, -are fine with transforming "FOLLOW" operations into reduced floating point space if their inputs are already -in that form, and want to explicitly cast back into full floating point space for "DENY" operations. +We always want to move "ALLOW" operations into mixed precision space by casting their inputs, +are fine with transforming "FOLLOW" operations into mixed precision space space if their inputs are already +in reduced form, and want to explicitly cast back into full floating point space for "DENY" operations. Each operation will be placed into one of these lists via a function which take in Relay `CallNodes` and returns either "ALLOW", "DENY", or "FOLLOW. For example, we might have a function which colors only a convolution as "ALLOW" if it has a large enough kernel and "FOLLOW" otherwise. @@ -69,9 +70,10 @@ accumulate in FP32 but have an output datatype of FP16. The interface to control the conversion of an operator for the mixed precision pass is therefore as follows: - Write a function in python which given a Relay CallNode, decides whether it should be in the "ALLOW", "FOLLOW", or "DENY" lists of operations. Furthermore, the function should decide the accumulation - and output datatypes of the operation. + and output datatypes of the operation, though these are only used if the operator will be in mixed + precision space. ```python - def color_func(call_node: "relay.Call", mixed_precision_dtype: str) -> Tuple[int, str, str]: + def mixed_precision_func(call_node: "relay.Call", mixed_precision_dtype: str) -> Tuple[int, str, str]: """ Parameters ---------- @@ -86,7 +88,7 @@ The interface to control the conversion of an operator for the mixed precision p result : Tuple[int, str, str] A tuple where the first element (int) represents a code describing the operation as belonging to "ALLOW", "DENY", or "FOLLOW" lists. The second element describes the accumulation datatype of the operation (i.e. usually float32 or mixed_precision_dtype). The third - element describes the output datatype of the operation (i.e. usually mixed_precision_dtype). + element describes the output datatype of the operation (i.e. usually mixed_precision_dtype). """ ``` - Register the function as an operator attribute with a provided function: @@ -115,20 +117,34 @@ The interface to control the conversion of an operator for the mixed precision p The priority level """ ``` -By default, unregistered operators will always be assumed to be in the "FOLLOW" list of operations and accumulate -and output results as the mixed precision dtype. 
A default registry of functions will also be provided and be based + - An example of creating a function which operates on a Conv2D operator and registering it is as follows: + ```python + import math + + MIXED_PRECISION_ALWAYS = 0 + MIXED_PRECISION_FOLLOW = 1 + MIXED_PRECISION_NEVER = 2 + + def conv2d_mixed_precision_func(call_node: "relay.Call", mixed_precision_dtype: str) -> Tuple[int, str, str]: + """Note this won't work for dynamic shaped inputs.""" + accumulation_dtype = "float32" + output_dtype = mixed_precision_dtype + + input_shape_elements = math.prod(call_node.op.data.shape) + + # Always convert to mixed precision if the input is big enough, else move to follow list + if input_shape_elements > 100: + return (MIXED_PRECISION_ALWAYS, accumulation_dtype, output_dtype) + return (MIXED_PRECISION_FOLLOW, accumulation_dtype, output_dtype) + + # Register conversion function for conv2d + register_mixed_precision_conversion("nn.conv2d", conv2d_mixed_precision_func) + ``` +With this interface, every single Relay operator within a model will belong to "ALLOW", "FOLLOW", or "DENY" lists and is +accordingly transformed into a mixed precision form. By default, unregistered operators will always be assumed to be in the +"FOLLOW" list of operations and accumulate and output results as the mixed precision dtype. A default registry of functions will also be provided and be based on TensorFlow's [similar feature](github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h). - - - - - # Reference-level explanation [reference-level-explanation]: #reference-level-explanation From e9777c19741f851e6a2cc2a5161e88ef775fec4c Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Fri, 13 Aug 2021 11:56:44 -0700 Subject: [PATCH 10/21] clean up all sections except reference level explanation --- rfcs/0006-AMP_pass.md | 42 +++++++++++++++++++++++++++++------------- 1 file changed, 29 insertions(+), 13 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index 3bd691f4..d92080a9 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -158,19 +158,31 @@ Part of the associated RFC issue will also be used dedicated to creating a tutor the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking the performance gains from the pass. +## Pass implementation + +## Other changes needed in TVM + +## Codegen issues + +## Plan for benchmarking + # Drawbacks [drawbacks]: #drawbacks If this is not useful, we are just adding an additional pass which will do nothing. Furthermore we -will have to make sure it works on a wide range of models or people will be very mad at TVM. +will have to make sure it works well on a wide range of models or people will be very mad at TVM. +This necessitates good default definitions for automatic mixed precision conversion for operators +which we will have to maintain. Furthermore, additional work needs to be done in order to ensure +good coverage of support for this pass. -This might not be useful if mixed precision training becomes super popular in the future in which -case most models might be in a reduced precision floating point form already. +Furthermore, this pass might not be useful if mixed precision training becomes super popular in the +future in which case most models might be in a reduced precision floating point form already. 
It also might not be useful if integer quantization becomes super popular, though it may be possible -to mix integer quantization and mixed floating precision techniques. Floating point does have -several advantages still over integer quantization including simplicity and the fact that some -operators like `sin` and `erf` are still designed in hardware with floating point in mind. +to mix integer quantization and mixed floating precision techniques. Despite this, automatic mixed +precision has an advantage of having a lesser accuracy loss compared to integer quantization, especially +when models are not retrained. This makes it very useful as a simple flag that can be turned on for +every trained model to essentially get a free speed increases. # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives @@ -178,7 +190,8 @@ operators like `sin` and `erf` are still designed in hardware with floating poin - Why is this design the best in the space of possible designs? Other alternatives require a lot more work and changes and could probably considered future goals of TVM. -This include automatic mixed precision training. +This include automatic mixed precision training. It also operates on the Relay level which is the right +place to make these changes and seems to mostly work mostly well out of the box excepting a few issues. - What other designs have been considered and what is the rationale for not choosing them? @@ -200,21 +213,24 @@ and the initial "ALLOW", "FOLLOW", and "DENY" lists are based [similarly](github - What parts of the design do you expect to resolve through the RFC process before this gets merged? -We still need to make sure that the current design and knobs exposed provide extensibility to every hardware platform out there. +Feedback on the user interface in the pass will be appreciated. The creation of default conversion methods for operators +is another topic discussion. - What parts of the design do you expect to resolve through the implementation of this feature before stabilization? -Probably a lot of edge cases of operations within TVM. +There are likely many misc. TVM changes that must be made in order to better support FP16 execution as discussed above. +We will deal with these issues as we encounter them, though believe that in general these are issues not specific +to the pass in general but rather FP16 support throughout all of TVM. - What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC? -Making accumulation datatypes a standard idea for all operations within TVM. +Making accumulation datatypes a standard idea for all operations within TVM. Furthermore, having good coverage +for the conversion of existing mixed precision models. # Future possibilities [future-possibilities]: #future-possibilities Really this can be used for any floating point datatype. A custom FP24 for FPGA? -BFloat16? Some other weird floating point type? We have an easy way to convert -toward utilizing these weird floating point types with FP32 when appropriate -under this framework. +BFloat16? Some other weird dtype? We have an easy way to convert models +toward utilizing exotic types with FP32 when appropriate under this framework. 
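+
+As a small sketch of that flexibility, assuming the pass exposes its target datatype as an
+argument (shown here as `mixed_precision_type`, a name not fixed by this RFC), retargeting the
+same machinery at bfloat16 could look like:
+
+```python
+from tvm import relay
+
+# Hypothetical: reuse the FP16 conversion flow, but target bfloat16 instead.
+# `mod` is an existing FP32 Relay IRModule.
+mod = relay.transform.InferType()(mod)
+mod = relay.transform.ToMixedPrecision(mixed_precision_type="bfloat16")(mod)
+```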
From ed3acb776a43eda87bb7863df67cb3f301d64f06 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Fri, 13 Aug 2021 12:01:52 -0700 Subject: [PATCH 11/21] add example of how to convert to fp16 --- rfcs/0006-AMP_pass.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index d92080a9..eca2de00 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -140,6 +140,31 @@ The interface to control the conversion of an operator for the mixed precision p # Register conversion function for conv2d register_mixed_precision_conversion("nn.conv2d", conv2d_mixed_precision_func) ``` + - After registering the appropriate operators for the model. We then invoke the mixed precision pass. Note we want to rerun some + other graph optimizations afterwards: + ```python + def convert_to_fp16(mod, params, fast_math=True): + mod = tvm.IRModule.from_expr(mod["main"]) + + # Run safe operations to simplify graph + mod = tvm.relay.transform.EliminateCommonSubexpr()(mod) + mod = tvm.relay.transform.FoldConstant()(mod) + + # Run main mixed precision pass + mod = InferType()(mod) + mod = ToMixedPrecision()(mod) + + # Run more passes to clean up new graph + mod = tvm.relay.transform.EliminateCommonSubexpr()(mod) + mod = tvm.relay.transform.FoldConstant()(mod) + mod = tvm.relay.transform.CombineParallelBatchMatmul()(mod) + mod = tvm.relay.transform.FoldConstant()(mod) + mod = tvm.relay.transform.FastMath()(mod) if fast_math else mod + + return mod, params + ``` + - We then have a Relay model in mixed precision form! + With this interface, every single Relay operator within a model will belong to "ALLOW", "FOLLOW", or "DENY" lists and is accordingly transformed into a mixed precision form. By default, unregistered operators will always be assumed to be in the "FOLLOW" list of operations and accumulate and output results as the mixed precision dtype. A default registry of functions will also be provided and be based From 1c4e595c9c55915d5695d4c2be2824c2153ee97a Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Fri, 13 Aug 2021 12:06:24 -0700 Subject: [PATCH 12/21] correct example --- rfcs/0006-AMP_pass.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index eca2de00..56bff93e 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -130,7 +130,7 @@ The interface to control the conversion of an operator for the mixed precision p accumulation_dtype = "float32" output_dtype = mixed_precision_dtype - input_shape_elements = math.prod(call_node.op.data.shape) + input_shape_elements = math.prod(call_node.args[0].type_annotation.shape) # Always convert to mixed precision if the input is big enough, else move to follow list if input_shape_elements > 100: From 2bc896c2d2d94aa9f09c6e75a021faaac96ae3d2 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Fri, 13 Aug 2021 12:49:09 -0700 Subject: [PATCH 13/21] pass implementation details --- rfcs/0006-AMP_pass.md | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index 56bff93e..0ed5c682 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -184,6 +184,22 @@ the conversion of ops via the Python interface. Furthermore, some work will be d the performance gains from the pass. 
## Pass implementation +The centerpiece of the Relay pass is it's behavior with CallNodes, which are the actual functions and operations which might be converted into +mixed precision space. The key idea is to use the user provided functions above to determine whether the node is part of the "ALLOW", "FOLLOW" +or "DENY" lists. If the CallNode is calling a non-Relay operator (e.g. it is a function call) then nothing is changed. + +In the case of an operator, we cover the cases where the operator belongs to either of the "ALLOW", "FOLLOW" or "DENY" lists. +- If an operator is in the "ALLOW" list, then all floating point inputs not the mixed precision type will be cast into the mixed precision type +- If an operator is in the "FOLLOW" list, then if all floating point inputs are in the mixed precision type, then nothing will be changes and + the operator will operate in mixed precision space. If some floating point inputs are not in the mixed precision space, then all inputs are + case back to FP32. +- If an operator is in the "DENY" list, then all floating point inputs are cast back into FP32. + +At the end, if the operator is operating in mixed precision space, then we will accumulate in the given accumulation datatype and output a result +in the output datatype. Some operators specify this information in their attributes, so we must sometimes construct a new operator node with +the appropriate attributes to share this information. + +For more information, please refer to the initial implementation of the [mixed precision pass](https://github.com/apache/tvm/pull/8069). ## Other changes needed in TVM @@ -258,4 +274,4 @@ for the conversion of existing mixed precision models. Really this can be used for any floating point datatype. A custom FP24 for FPGA? BFloat16? Some other weird dtype? We have an easy way to convert models -toward utilizing exotic types with FP32 when appropriate under this framework. +toward utilizing exotic types with FP32 when appropriate under this framework. \ No newline at end of file From 2749f7cf8967318de7a054cade0b09aa4895f361 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Tue, 17 Aug 2021 10:25:19 -0700 Subject: [PATCH 14/21] flesh out final sections --- rfcs/0006-AMP_pass.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index 0ed5c682..84aa3c9b 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -201,12 +201,19 @@ the appropriate attributes to share this information. For more information, please refer to the initial implementation of the [mixed precision pass](https://github.com/apache/tvm/pull/8069). -## Other changes needed in TVM +## Code-gen issues -## Codegen issues +There are some issues with generating valid CUDA code for FP16 at the moment. Other backends such as Vulkan also +have similar issues. These will need to be fixed to ensure wide coverage of support for this pass and will be +tracked in the linked GitHub issue. ## Plan for benchmarking +At a later date we will come with a comprehensive plan to benchmark this pass on some common models. This includes +documenting speedups from using FP16 on select platforms and determining accuracy loss on some select datasets. For +a comprehensive benchmark, the above issues will need to be tackled first. The GitHub issue will be used for tracking +progress on this. 
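+
+As a rough illustration of the measurement methodology, latency of the original and converted
+modules could be compared along the following lines (a sketch only; the target, the
+`convert_to_fp16` helper from the earlier example, and ResNet-18 standing in for a real model are
+assumptions, and accuracy would additionally be checked on a validation set):
+
+```python
+import numpy as np
+import tvm
+from tvm import relay
+from tvm.contrib import graph_executor
+
+
+def measure_ms(mod, params, input_name="data", input_shape=(1, 3, 224, 224), target="cuda"):
+    dev = tvm.device(target, 0)
+    with tvm.transform.PassContext(opt_level=3):
+        lib = relay.build(mod, target=target, params=params)
+    runtime = graph_executor.GraphModule(lib["default"](dev))
+    runtime.set_input(input_name, np.random.uniform(size=input_shape).astype("float32"))
+    # Run the whole graph several times and report the mean latency in milliseconds.
+    timer = runtime.module.time_evaluator("run", dev, number=10, repeat=3)
+    return np.mean(timer().results) * 1000
+
+
+mod, params = relay.testing.resnet.get_workload(num_layers=18, batch_size=1)
+fp32_ms = measure_ms(mod, params)
+fp16_mod, fp16_params = convert_to_fp16(mod, params)
+fp16_ms = measure_ms(fp16_mod, fp16_params)
+print(f"fp32: {fp32_ms:.2f} ms, mixed precision: {fp16_ms:.2f} ms")
+```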
+ # Drawbacks [drawbacks]: #drawbacks From def9bc53c3bfbb67f78066e08ba5e01fbe595ce8 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Tue, 17 Aug 2021 10:48:30 -0700 Subject: [PATCH 15/21] add sentence --- rfcs/0006-AMP_pass.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index 84aa3c9b..d75b7a3a 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -201,6 +201,18 @@ the appropriate attributes to share this information. For more information, please refer to the initial implementation of the [mixed precision pass](https://github.com/apache/tvm/pull/8069). +## Other changes to TVM + +Other miscellaneous changes must be made to TVM to fully support FP16 operations. For one, many operations and their +schedules make assumptions on the input types they can handle. For example, our CPU sorting operations assume 32 bit +alignment. We will have to deal with these one off adhoc instances in order to have good support for the pass. +Thankfully, these are fairly uncommon based on an initial survey and we can probably manage to tackle them one by one +as they pop up. + +Another issue we must deal with are making sure schedules support accumulation datatypes. Some schedules, do not type +check their TIR for mixed precision due to inadequately placed casts that are needed to operate in one datatype but output in another. We suggest relaxing the TIR type checking constraints by allowing upcasting floating point types. E.g. automatically inserting casts to convert from FP16 to FP32 when appropriate. In addition, other schedules hard code their +accumulation datatypes which need to be changed. + ## Code-gen issues There are some issues with generating valid CUDA code for FP16 at the moment. Other backends such as Vulkan also From 6886d38bc7c5f486f42503ce1f8950d5160ada54 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Tue, 17 Aug 2021 11:03:00 -0700 Subject: [PATCH 16/21] add sentence --- rfcs/0006-AMP_pass.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index d75b7a3a..e7454121 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -213,6 +213,8 @@ Another issue we must deal with are making sure schedules support accumulation d check their TIR for mixed precision due to inadequately placed casts that are needed to operate in one datatype but output in another. We suggest relaxing the TIR type checking constraints by allowing upcasting floating point types. E.g. automatically inserting casts to convert from FP16 to FP32 when appropriate. In addition, other schedules hard code their accumulation datatypes which need to be changed. +We might also anticipate other issues popping up that may require further changes to TVM to support mixed precision. + ## Code-gen issues There are some issues with generating valid CUDA code for FP16 at the moment. Other backends such as Vulkan also From d45a661760d3f637fbf9122d7d8ec5d93f323508 Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Tue, 17 Aug 2021 11:56:50 -0700 Subject: [PATCH 17/21] more touch ups --- rfcs/0006-AMP_pass.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index e7454121..dd30d0a3 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -163,7 +163,7 @@ The interface to control the conversion of an operator for the mixed precision p return mod, params ``` - - We then have a Relay model in mixed precision form! 
+ - We then have a Relay model in mixed precision form! With this interface, every single Relay operator within a model will belong to "ALLOW", "FOLLOW", or "DENY" lists and is accordingly transformed into a mixed precision form. By default, unregistered operators will always be assumed to be in the @@ -177,7 +177,7 @@ See [previous discussion thread](https://discuss.tvm.apache.org/t/rfc-relay-fp32 As some have noticed the design can be simplified to a single pass where casting is determined by running type inference on mutated nodes. With a post-order traversal we can then check if we need to -cast arguments/propagate color. +cast arguments/propagate casting attributes. Part of the associated RFC issue will also be used dedicated to creating a tutorial on how to control the conversion of ops via the Python interface. Furthermore, some work will be done in benchmarking @@ -205,15 +205,14 @@ For more information, please refer to the initial implementation of the [mixed p Other miscellaneous changes must be made to TVM to fully support FP16 operations. For one, many operations and their schedules make assumptions on the input types they can handle. For example, our CPU sorting operations assume 32 bit -alignment. We will have to deal with these one off adhoc instances in order to have good support for the pass. +alignment. We will have to deal with these adhoc problems in order to have good support for the pass. Thankfully, these are fairly uncommon based on an initial survey and we can probably manage to tackle them one by one as they pop up. -Another issue we must deal with are making sure schedules support accumulation datatypes. Some schedules, do not type -check their TIR for mixed precision due to inadequately placed casts that are needed to operate in one datatype but output in another. We suggest relaxing the TIR type checking constraints by allowing upcasting floating point types. E.g. automatically inserting casts to convert from FP16 to FP32 when appropriate. In addition, other schedules hard code their +Another issue we must deal with are making sure schedules support accumulation datatypes. Some schedules, do not have their TIR type check for mixed precision due to inadequately placed casts that are needed to operate in one datatype but output in another. We suggest relaxing the TIR type checking constraints by allowing upcasting floating point types. E.g. automatically inserting casts to convert from FP16 to FP32 when appropriate. In addition, other schedules hard code their accumulation datatypes which need to be changed. -We might also anticipate other issues popping up that may require further changes to TVM to support mixed precision. +We also anticipate other issues popping up that may require further changes to TVM to support mixed precision but believe we can deal with these as they become apparent. ## Code-gen issues From 24c66c68d28c0e013dd370e5e3e94cb9b331126d Mon Sep 17 00:00:00 2001 From: Andrew Luo Date: Tue, 17 Aug 2021 12:48:45 -0700 Subject: [PATCH 18/21] talk about XLA and existing support --- rfcs/0006-AMP_pass.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md index dd30d0a3..4161350b 100644 --- a/rfcs/0006-AMP_pass.md +++ b/rfcs/0006-AMP_pass.md @@ -251,8 +251,11 @@ every trained model to essentially get a free speed increases. - Why is this design the best in the space of possible designs? 
Other alternatives require a lot more work and changes and could probably considered future goals of TVM.
-This include automatic mixed precision training. It also operates on the Relay level which is the right
-place to make these changes and seems to mostly work mostly well out of the box excepting a few issues.
+This include automatic mixed precision training. Existing frameworks like Tensorflow and PyTorch support
+this and is based on work by [NVidia](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/).
+This involves rewriting the graph in a similar fashion to this pass, with some care with the gradient calculations to
+ensure stability. Additionally, XLA can be run on top of this to further use model information to optimize kernels, much
+like TVM. As a lot of prior art has a similar design, we can be confident in this approach.

- What other designs have been considered and what is the rationale for not choosing them?

From 63d1cb0ed33546860525207ce2a2e445541e6210 Mon Sep 17 00:00:00 2001
From: Andrew Luo
Date: Tue, 17 Aug 2021 13:07:29 -0700
Subject: [PATCH 19/21] discussion on possible targets

---
 rfcs/0006-AMP_pass.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md
index 4161350b..2d8f48c6 100644
--- a/rfcs/0006-AMP_pass.md
+++ b/rfcs/0006-AMP_pass.md
@@ -297,4 +297,14 @@ for the conversion of existing mixed precision models.
Really this can be used for any floating point datatype. A custom FP24 for FPGA? BFloat16? Some other weird dtype? We have an easy way to convert models
-toward utilizing exotic types with FP32 when appropriate under this framework.
\ No newline at end of file
+toward utilizing exotic types with FP32 when appropriate under this framework.
+
+Some hardware platforms we are interested in, usually because they support native FP16 instructions:
+- ARM CPUs, ARMv8.4-A+ (e.g. M1 in Apple Macs)
+- NVidia GPUs, especially those with Tensor Cores
+- AMD GPUs
+- AMD APUs
+- Intel CPUs / Intel Integrated Graphics (Skylake+ has FP16 support)
+
+We might need to further improve support for some targets like OpenCL, CUDA, and Metal in order to get the most from this hardware.

From def82a1c0c18b4caf02e9ed4e9e62bff0c9be9a5 Mon Sep 17 00:00:00 2001
From: Andrew Luo
Date: Wed, 18 Aug 2021 10:58:36 -0700
Subject: [PATCH 20/21] address comments on PyTorch vs TF appraoches

---
 rfcs/0006-AMP_pass.md | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md
index 2d8f48c6..70855591 100644
--- a/rfcs/0006-AMP_pass.md
+++ b/rfcs/0006-AMP_pass.md
@@ -245,17 +245,21 @@ precision has an advantage of having a lesser accuracy loss compared to integer
when models are not retrained. This makes it very useful as a simple flag that can be turned on for every trained model to essentially get a free speed increases.
+# Prior art
+[prior-art]: #prior-art
+
+Many of the ideas are taken from Tensorflow's [automatic mixed precision training framework](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf)
+and the initial "ALLOW", "FOLLOW", and "DENY" lists are based [similarly](https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h).
+
+Existing frameworks like Tensorflow and PyTorch support this and is based on work by [NVidia](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/).
This involves rewriting the graph in a similar fashion to this pass, with some care with the gradient calculations to ensure stability. There are differences in implementation between the two
+however. PyTorch's interpreter model of execution poses an issue of where to insert casts when rewriting the graph. The solution PyTorch uses is to have a Tensor cache mechanism similar to the one used in the pass mentioned here for TVM. Both cast FP32 tensors to FP16 when needed for certain operations and then cache this tensor to avoid future extraneous casts from being used. However, where we differ from PyTorch is that for PyTorch this is done (dynamically during execution)[https://github.com/pytorch/pytorch/blob/324673a537fc818527b8375700a9b95a83a00c92/aten/src/ATen/autocast_mode.cpp#L32] of the graph while for TVM, this is simply used to perform analysis when rewritting the Relay graph. In this sense we are more similar to Tensorflow, which can use the XLA compiler and get better graph level optimizations ahead of time. Tensorflow's more compiled approach is more compatible with TVM.
+
# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

- Why is this design the best in the space of possible designs?

-Other alternatives require a lot more work and changes and could probably considered future goals of TVM.
-This include automatic mixed precision training. Existing frameworks like Tensorflow and PyTorch support
-this and is based on work by [NVidia](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/).
-This involves rewriting the graph in a similar fashion to this pass, with some care with the gradient calculations to
-ensure stability. Additionally, XLA can be run on top of this to further use model information to optimize kernels, much
-like TVM. As a lot of prior art has a similar design, we can be confident in this approach.
+As discussed in the prior art section, many of the frameworks above use a similar process to support automatic mixed precision. Furthermore, we lean closer to Tensorflow's compiled approach to execution than PyTorch's interpreter-heavy approach. The compiled approach plays better to TVM's strengths and its ability to do many different optimizations at the graph level and below.

- What other designs have been considered and what is the rationale for not choosing them?

We can support automatic mixed precision retraining though that is a much, much larger future goal. It's good to have this in the meantime.

- What is the impact of not doing this?

TVM is not the best tool for making models go fast as we leave a lot of free speedup on the table.
-# Prior art
-[prior-art]: #prior-art
-
-Many of the ideas are taken from Tensorflow's [automatic mixed precision training framework](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf)
-and the initial "ALLOW", "FOLLOW", and "DENY" lists are based [similarly](github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h).
-
# Unresolved questions
[unresolved-questions]: #unresolved-questions

From 6f65e82a5df91f0c3dee88505b521acad595a451 Mon Sep 17 00:00:00 2001
From: Andrew Luo
Date: Wed, 18 Aug 2021 11:01:27 -0700
Subject: [PATCH 21/21] light edits for grammar

---
 rfcs/0006-AMP_pass.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/rfcs/0006-AMP_pass.md b/rfcs/0006-AMP_pass.md
index 70855591..f77222af 100644
--- a/rfcs/0006-AMP_pass.md
+++ b/rfcs/0006-AMP_pass.md
@@ -251,8 +251,7 @@ every trained model to essentially get a free speed increases.
Many of the ideas are taken from Tensorflow's [automatic mixed precision training framework](https://on-demand.gputechconf.com/gtcdc/2019/pdf/dc91247-automatic-mixed-precision-in-tensorflow.pdf)
and the initial "ALLOW", "FOLLOW", and "DENY" lists are based [similarly](https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/core/grappler/optimizers/auto_mixed_precision_lists.h).
-Existing frameworks like Tensorflow and PyTorch support this and is based on work by [NVidia](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/). This involves rewriting the graph in a similar fashion to this pass, with some care with the gradient calculations to ensure stability. There are differences in implementation between the two
-however. PyTorch's interpreter model of execution poses an issue of where to insert casts when rewriting the graph. The solution PyTorch uses is to have a Tensor cache mechanism similar to the one used in the pass mentioned here for TVM. Both cast FP32 tensors to FP16 when needed for certain operations and then cache this tensor to avoid future extraneous casts from being used. However, where we differ from PyTorch is that for PyTorch this is done (dynamically during execution)[https://github.com/pytorch/pytorch/blob/324673a537fc818527b8375700a9b95a83a00c92/aten/src/ATen/autocast_mode.cpp#L32] of the graph while for TVM, this is simply used to perform analysis when rewritting the Relay graph. In this sense we are more similar to Tensorflow, which can use the XLA compiler and get better graph level optimizations ahead of time. Tensorflow's more compiled approach is more compatible with TVM.
+Existing frameworks like Tensorflow and PyTorch support automatic mixed precision for training and execution, and are based on work by [NVidia](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/). This involves rewriting the graph in a similar fashion to this pass, with some care with the gradient calculations to ensure stability. There are differences in implementation between the two frameworks, however. PyTorch's interpreter model of execution poses an issue of where to insert casts when rewriting the graph. The solution PyTorch uses is a Tensor cache mechanism similar to the one used in the pass described here for TVM. Both cast FP32 tensors to FP16 when needed for certain operations and then cache this tensor to avoid inserting extraneous casts in the future. However, where we differ from PyTorch is that for PyTorch this is done [dynamically during execution](https://github.com/pytorch/pytorch/blob/324673a537fc818527b8375700a9b95a83a00c92/aten/src/ATen/autocast_mode.cpp#L32) on the graph. For TVM meanwhile, this is simply used to perform analysis when rewriting the Relay graph. In this sense we are more similar to Tensorflow, which can use the XLA compiler to get better graph level optimizations. Tensorflow's more compiled approach is more compatible with TVM.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives
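To make the operator-registration interface and the accumulation/output datatype discussion in the patches above concrete, here is a minimal sketch of driving the pass from Python. The names used (`ToMixedPrecision`, `register_mixed_precision_conversion`, the `MIXED_PRECISION_*` constants, and the `[category, accumulation_dtype, output_dtype]` return convention) are assumptions taken from the initial implementation in apache/tvm#8069 linked earlier and may have moved or been renamed since; treat it as an illustration rather than a definitive API reference.

```python
# Hedged sketch based on the initial implementation in apache/tvm#8069.
# Import paths and names follow that PR and may differ in current TVM.
from tvm import relay
from tvm.relay.op import register_mixed_precision_conversion
from tvm.relay.transform.mixed_precision import (
    MIXED_PRECISION_ALWAYS,
    MIXED_PRECISION_NEVER,
)

# "ALLOW" rule for conv2d: always convert, accumulate in FP32, output FP16.
# A conversion function returns [category, accumulation_dtype, output_dtype].
@register_mixed_precision_conversion("nn.conv2d", level=11)
def conv2d_rule(call_node, mixed_precision_type):
    return [MIXED_PRECISION_ALWAYS, "float32", mixed_precision_type]

# "DENY" rule for exp: keep it in FP32, since |exp(x)| >> |x| for large inputs.
@register_mixed_precision_conversion("exp", level=11)
def exp_rule(call_node, mixed_precision_type):
    return [MIXED_PRECISION_NEVER, "float32", "float32"]

def to_fp16(mod, params):
    """Run type inference and then the mixed precision pass on a Relay module."""
    mod = relay.transform.InferType()(mod)
    mod = relay.transform.ToMixedPrecision(mixed_precision_type="float16")(mod)
    return mod, params
```

Re-registering an operator's conversion function at a higher `level` (an assumption carried over from TVM's usual registry behavior) is how a user would move an op between the "ALLOW", "FOLLOW", and "DENY" lists without modifying the pass itself.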
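The tensor-cache mechanism referenced in the prior art comparison can also be sketched on its own. The snippet below is an illustrative, framework-agnostic mock-up, not TVM or PyTorch code: it memoizes casts per (expression, dtype) pair so that a value feeding several reduced-precision operations is cast only once, whether that happens dynamically at execution time (PyTorch) or during an ahead-of-time graph rewrite (this pass).

```python
# Illustrative sketch of the cast-caching idea described above; this is not
# TVM or PyTorch code, just the memoization pattern both rely on.
class CastCache:
    def __init__(self):
        # (id of producing node, target dtype) -> previously created cast node
        self._cache = {}

    def cast(self, node, dtype, make_cast):
        """Return a cast of `node` to `dtype`, reusing an earlier cast if one exists."""
        key = (id(node), dtype)
        if key not in self._cache:
            self._cache[key] = make_cast(node, dtype)
        return self._cache[key]

# Usage with a toy IR where a cast is just a tagging wrapper:
class Cast:
    def __init__(self, value, dtype):
        self.value, self.dtype = value, dtype

cache = CastCache()
x = object()  # stand-in for an FP32 tensor/expression
a = cache.cast(x, "float16", Cast)
b = cache.cast(x, "float16", Cast)
assert a is b  # the second request reuses the first cast
```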