[microTVM] Modernize Arm Cortex-M convolution schedules #13242
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot |
This pull request is ready for review! Would love reviews from @mkatanbaf (who's doing some microTVM + MetaSchedule work), @areusch, and @ekalda. Would also love a look from someone who's more familiar with TVMScript, and can critique my use of it :). That said, there are a few known issues in this PR I still need to fix:
In a following PR, I'll also address:
areusch
left a comment
did a first pass here, thanks @guberti !
"""Addition is commutative, so we could add the bias before, during, or after performing our
multiply-accumulate operations. It "costs" one cycle either way - if done at the beginning we
can't use a SMULXY trick to set sum_i to zero for "free", and if done at the end it doesn't
combine with anything. However, doing it at the beginning frees up a register/prevents needing
what about overflow?
The order of bias addition does not change the overflow behavior. This comment is just stating we could do the additions as:
OR as:
I've changed the wording a bit to make this clearer.
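To illustrate why the position of the bias addition does not change the overflow behavior, here is a minimal sketch. It assumes a 32-bit two's-complement accumulator that wraps on overflow; the helper `wrap_int32`, the function `accumulate`, and the sample values are hypothetical and are not taken from this PR.

```python
def wrap_int32(value: int) -> int:
    """Reduce a Python int to a signed 32-bit two's-complement value."""
    return (value + 2**31) % 2**32 - 2**31


def accumulate(terms, bias, bias_first):
    """Sum int16 products and a bias in a wrapping 32-bit accumulator."""
    acc = wrap_int32(bias) if bias_first else 0
    for x, w in terms:
        acc = wrap_int32(acc + x * w)
    if not bias_first:
        acc = wrap_int32(acc + bias)
    return acc


# Hypothetical inputs, chosen so that the running sum overflows int32.
terms = [(32767, 32767), (32767, 32767), (-32768, 12345)]
bias = 2_000_000_000

bias_at_start = accumulate(terms, bias, bias_first=True)
bias_at_end = accumulate(terms, bias, bias_first=False)

# The intermediate sums wrap at different points, but the final value is identical.
assert bias_at_start == bias_at_end
```

Because wrapping addition is addition modulo 2^32, it stays associative and commutative, so only the choice of registers and cycles differs between the two orderings.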
src/relay/qnn/op/requantize.cc
Outdated
// Check and assign types for scale and zero points.
AssignType(types[1], DataType::Float(32), axis_shape, reporter);  // input_scale
AssignType(types[2], DataType::Int(32), axis_shape, reporter);    // input_zero_pt
// AssignType(types[1], DataType::Float(32), axis_shape, reporter);  // input_scale
uncomment?
Fixed - this PR should not change requantize.cc.
However, it is a bit of a tricky issue. In qnn_alter_op.py, I want to manually choose the int32 requantize scale to improve performance. However, Relay's requantize op only allows the output scale to be a float32.
I get around this by storing the scale data as a float32 array with the correct bytes, and reading it back as an int32 array. I've added a comment to qnn_alter_op.py to better explain what happens here. This is pretty gross.
Longer term, I'd love to add a new Relay op IntegerRequantize that takes int32 scale and shift arguments, which will let us solve this problem in a nice way. Would love your thoughts on the right way to address this!
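The byte-reinterpretation trick described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in qnn_alter_op.py: the scale values are made up, and the `array` module stands in for whatever tensor container is actually used (with NumPy, `ndarray.view` would play the same role).

```python
from array import array

# Hypothetical fixed-point requantize multipliers, one per output channel.
int_scales = array("i", [1_518_500_250, 1_073_741_824, -7])
assert int_scales.itemsize == 4  # the trick relies on int32 and float32 both being 4 bytes

# "Store" step: reinterpret the int32 bytes as float32 storage without converting values.
as_float32 = array("f")
as_float32.frombytes(int_scales.tobytes())

# "Read back" step: reinterpret the float32 storage as int32 again.
recovered = array("i")
recovered.frombytes(as_float32.tobytes())

# The bytes were never converted, so the integers survive exactly.
assert list(recovered) == list(int_scales)
```

The float32 values seen in between are meaningless (some bit patterns are even NaN); only the raw bytes matter, which is why no arithmetic may touch the array while it is typed as float32.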
mkatanbaf
left a comment
Great work @guberti I added a few comments, mostly asking for clarifications.
def _apply_simd_optimizations(instruction_tuples) -> Iterator[Tuple]:
    """When possible, fuses single MACs into SIMD MAC instructions.
    The compiler cannot do this automatically, as calling __builtin_arm_smlaxy forces the SMLAxy
I'm not sure if I understand this correctly, but does this mean that we will unroll the loop and get a long list of instructions instead? Would this significantly increase the code size?
Yes, the inner reduction loops will always be unrolled (this occurs in _get_draft_macs). We will often unroll even more than this, either as another unrolled copy of the inner loops for odd-numbered channels (this happens e.g. for 3x3 depthwise convolutions) or by computing multiple sums at the same time (i.e. when num_sums > 1).
Compared with the naive approach, this does increase code size. However, the increase is very small - for example, unrolling a 3x3 depthwise convolution might take ~10 extra instructions, or 0.01 KB more flash size. This is well worth it, as unrolling dramatically reduces overhead and increases speed by ~2x. The previous tensordot implementation also unrolled these loops for the same reason.
# Arm GCC does not have `__builtin_arm_smlabt`, even though `__builtin_arm_smlatt`,
# `__builtin_arm_smlatb`, `__builtin_arm_smlad` and so on all exist. Perhaps this is a
# choice, since we can just use `smlabt` with the argument order swapped instead? Note that
# `__builtin_arm_smlabt` exists on most compilers (e.g. Clang) - this is just a GCC thing.
if instruction == "smlabt":
    yield f"sum_{index} = __builtin_arm_smlatb({op2}, {op1}, sum_{index});"
else:
    yield f"sum_{index} = __builtin_arm_{instruction}({op1}, {op2}, sum_{index});"
I believe this is because you're using the builtins directly rather than using the ACLE interface (https://arm-software.github.io/acle/main/acle.html#accumulating-multiplications) - unsure how much guarantee you get with built-ins, I would move to the ACLE interface anyway.
Also see: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm_acle.h#L661-L675 😸
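As a sanity check on the operand swap discussed in this thread, here is a small Python emulation. It assumes the usual SMLAxy semantics (the first letter selects the bottom or top halfword of the first operand, the second letter selects the halfword of the second operand, and the signed 16x16 product is added to a wrapping 32-bit accumulator). The function names mirror the instructions, but the code is a hypothetical model, not the builtins themselves.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def bottom(word: int) -> int:
    return to_signed(word, 16)


def top(word: int) -> int:
    return to_signed(word >> 16, 16)


def smlabt(op1: int, op2: int, acc: int) -> int:
    """Model of SMLABT: acc + bottom(op1) * top(op2)."""
    return to_signed(acc + bottom(op1) * top(op2), 32)


def smlatb(op1: int, op2: int, acc: int) -> int:
    """Model of SMLATB: acc + top(op1) * bottom(op2)."""
    return to_signed(acc + top(op1) * bottom(op2), 32)


a, b, acc = 0x1234FFFE, 0x80000007, 10  # arbitrary packed int16 pairs
assert smlabt(a, b, acc) == smlatb(b, a, acc)
```

Since multiplication commutes, `smlatb` with its operands swapped selects the same two halfwords as `smlabt`, which is the substitution the generated code relies on.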
(
f"""
#include <stdint.h>
#include <arm_nnsupportfunctions.h>
Yay! I think this solves the same problem as #13363 😸 !
# under the License.
"""microTVM cares a lot about the convolution + bias + requantize + fused ReLU use case. There have
been some accuracy issues in the past, so this test steps through a model (MobileNetV1) layer by
layer and ensures there is 1-1 correspondence at each step. This test would run way faster if we ran
This is very cool, great idea!
areusch
left a comment
thanks @guberti, did a more fine-grained pass now.
scale = T.match_buffer(scale_handle, scale_shape)
output = T.match_buffer(output_handle, output_shape, dtype="int16")

# This hack prevents TVM from seeing these variables as "unused". I should be using T.reads
can you file a bug for this?
I'm not sure if this is user error on my part, or an issue with TVM. I'll look around a bit and file an issue if it seems to be a bug.
Hi, apologies for bringing up an old PR thread, I just ran into a similar problem, was an issue filed in the end? If so, could you possibly point me to it?
@lhutton1 A bug still needs to be filed here - I meant to write up a small reproducible example, but never got around to it.
Thanks, I'll take a look into it :)
areusch
left a comment
thanks @guberti, this is basically ready, i've highlighted a couple last areas (in particular the doctest). feel free to merge once you've addressed!
including regular conv2d, depthwise conv2d, and grouped conv2d provided the data and kernel layouts
are the optimal ones. When groups=1, the optimal data layout is NHWC and kernel layout is OHWI. When
this is a depthwise convolution, the optimal data layout is NCHW and kernel layout is OIHW."""
"""Generates optimized code to compute a tensor dot product on ARMv7E-M.
does this apply to v8-M also?
Sometimes - this uses the DSP instructions, which are required in v7E-M but optional in v8-M. This code also does not use MVE, which is optional in v8-M but would be really useful for deep learning. I've clarified this in the docstring.
Get QNN strategy running
QNN strategy with operator fusion
Assembly tensordot from other PR
Tensordot offset support
Hand tested tensordot code
Formatting fixes
Don't use automatic AOT building when skipping pass
Assorted tech for scheduling with TIR
Hacky int16 support
Bugged schedule implementation
Passing test!
Works for all 1x1 conv2ds!
External QNN operator altering
Debugging work
Pad with correct constant
Broadly functional conv2d
Reorganize quantize convolution test
Working depthwise convolution for strides=1
Working depthwise convolution!
Support Python 3.7
Clean up code to prepare for review
Second round of code review
Fix tensordot opts test
I've addressed the comments from @areusch, so per his instructions I'm merging this. Thanks for the feedback!
* [microTVM] Fix tvmc tutorial (#14076)
This PR applies appropriate changes to make sure the CI fails if micro_tvmc.sh tutorial fails. This issue was captured in #14074.
This PR also makes changes to avoid this breakage in bash script tutorials in future. In addition, this PR fixes the bug in running TVMC tutorial which happened due to renaming zephyr_board to board.
* [MetaSchedule] Introduce Async Pipeline in MultiLevelTiling (#14009)
This PR introduces an async pipeline into TVM's current MultiLevelTiling rules. It is based on apache/tvm#13966, which is already merged, because some conv2d workloads use `tir.if_then_else` to pad the input to the correct size, and this PR uses async copy in such copy statements.
1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotates async copy for MLT on supported architectures (>= sm80).
On CUDA cores, this PR gives a performance boost of around 1 TFLOP/s in most Conv2d test cases and 1 to 2 TFLOP/s in most GEMM test cases.
All generated code, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark.
Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` with the conv2d NHWC workload on an RTX 3080 GPU.
**Notice: given the stochastic nature of evolutionary search, performance might become worse when this PR is enabled.**
Workload: Conv2d NHWC
|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%|
|N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%|
|N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%|
|N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%|
|N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%|
|N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%|
|N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%|
|N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%|
|N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%|
|N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%|
|N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%|
|N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%|
|N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%|
|N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%|
|N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%|
|N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%|
|N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%|
Workload: GEMM NN
|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%|
|M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%|
|M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%|
|M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%|
|M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%|
|M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%|
|M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%|
|M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%|
|M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%|
|M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%|
|M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%|
|M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%|
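For reference, the "Performance Boost" column in both tables is consistent with the relative change of the async result over the mainline result. A minimal sketch that reproduces two rows; the function name `performance_boost` is made up for illustration:

```python
def performance_boost(mainline: float, with_async: float) -> float:
    """Relative change of the async result over mainline, in percent."""
    return (with_async - mainline) / mainline * 100.0


# First Conv2d row (N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1).
conv_boost = performance_boost(13838.05219, 14687.89452)

# GEMM row M=512_N=3072_K=768, one of the cases where async is slower.
gemm_boost = performance_boost(19709.39, 18351.61)

print(f"{conv_boost:.2f}% {gemm_boost:.2f}%")
```

A negative value means the async variant was slower for that shape, which matches the notice above about the stochastic search.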
* [TVMScript] Use op attribute to control whether to print dtype in TVMScript (#14111)
This PR adds an op attribute `TScriptDtypePrintLocation` and modifies the dtype printing logic of the builtin ops to check this attribute, so that user-defined operators can specify how their dtype arguments are printed by appending attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`.
* [Fix][TVMScript] Fix index of metadata in printed script (#14130)
Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced more than once in an IRModule, each reference gets a different index into the metadata array. For example, this code
```
from tvm.script import ir as I
from tvm.script import tir as T

str_imm = T.StringImm("aaa\nbbb\n")

@I.ir_module
class Module:
    @T.prim_func
    def foo() -> None:
        A = str_imm
        B = str_imm

    @T.prim_func
    def foo1() -> None:
        A = str_imm

Module.show()
```
where `str_imm` is referenced three times, will generate such output:
```
@I.ir_module
class Module:
    @T.prim_func
    def foo():
        A: T.handle = metadata["tir.StringImm"][0]
        B: T.handle = metadata["tir.StringImm"][1]
        T.evaluate(0)

    @T.prim_func
    def foo1():
        A: T.handle = metadata["tir.StringImm"][2]
        T.evaluate(0)
```
Each reference gets a different metadata index.
This PR fixes this problem by detecting duplicate items in `IRDocsifierNode::AddMetadata`.
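The idea of detecting duplicates when adding metadata can be sketched as follows. This is a toy model under assumptions: `MetadataTable` and its `add` method are hypothetical names, and the real logic lives in C++ in `IRDocsifierNode::AddMetadata`.

```python
class MetadataTable:
    """Toy stand-in for the printer's metadata storage (hypothetical, not TVM's API)."""

    def __init__(self):
        self._entries = {}  # type key -> list of stored objects

    def add(self, type_key, obj):
        """Return the index of obj, reusing the slot if the same object was already added."""
        bucket = self._entries.setdefault(type_key, [])
        for index, existing in enumerate(bucket):
            if existing is obj:  # identity check: the same object, not merely an equal one
                return index
        bucket.append(obj)
        return len(bucket) - 1


table = MetadataTable()
str_imm = object()    # stands in for the shared tir.StringImm
other_imm = object()  # a different metadata object of the same type

indices = [table.add("tir.StringImm", str_imm) for _ in range(3)]
other_index = table.add("tir.StringImm", other_imm)
print(indices, other_index)
```

With this lookup in place, all three references in the example above would print the same index instead of 0, 1 and 2.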
* [Pytorch] frontend full_impl fix (#14122)
Minor fix in the PyTorch frontend so that the GPT-2 model compiles. Reproduction script:
torch_version = 1.13.1
transformers_version = 4.26.1
```
from transformers import GPT2LMHeadModel
import torch
import tvm
from tvm import relay
from tvm.contrib import graph_executor

inp = torch.ones((1, 128)).to(torch.int64)
input_shapes = [("input_ids", ((1, 128), "int64"))]
model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False)
trace_model = torch.jit.trace(model, inp, strict=False)
outputs = trace_model(inp)
mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
# Build for the host CPU; the body of the `with` block must be indented.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target='llvm', params=params)
runtime = graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0)))
runtime.set_input("input_ids", inp.numpy())
runtime.run()
out = runtime.get_output(0).numpy()
print(out)
print('Done...')
```
Before the fix, the error message was:
```
Traceback (most recent call last):
  File "gpt2_compile.py", line 13, in <module>
    mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch
    outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators
    relay_out = relay_op(
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full
    return self.full_impl(data, fill_value, dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl
    fill_value = _expr.const(fill_value, dtype=dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const
    raise ValueError("value has to be scalar or NDArray")
ValueError: value has to be scalar or NDArray
```
because `fill_value` is
```
%0 = cast(64, dtype="float32");
power(%0, 0.5f)
```
* [DOCKER] Configurable NDK version support (#14000)
Make the Android NDK version configurable via a command-line argument.
* [Fix][TIR] SampleCategorical apply-to-schedule (#14133)
This PR is another way to fix the issue described in #14118.
Since we have no standard for how JSON files format floating-point
numbers (for example, we cannot require a JSON producer to print
"integer" floats with at least one decimal), and the JSON
parser is not responsible for deciding whether an integer in a JSON file
should be parsed as a float or an int, the most convenient fix for
the SampleCategorical issue is to allow both FloatImms and IntImms
as input and to convert all IntImms to FloatImms accordingly.
This PR fixes the issue in this way.
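The underlying ambiguity is easy to reproduce with Python's own JSON parser. The sketch below is only an analogy for the fix: `normalize_probs` is a hypothetical helper, and plain Python ints and floats stand in for IntImm and FloatImm.

```python
import json


def normalize_probs(raw_probs):
    """Accept both ints and floats, and convert every entry to float."""
    normalized = []
    for prob in raw_probs:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(prob, bool) or not isinstance(prob, (int, float)):
            raise TypeError(f"expected a number, got {type(prob).__name__}")
        normalized.append(float(prob))
    return normalized


# A producer that prints 0.0 and 1.0 without decimals yields ints after parsing.
decoded = json.loads("[0, 1]")
decoded_types = [type(value).__name__ for value in decoded]

probs = normalize_probs(decoded)
print(decoded_types, probs)
```

Converting at the consumer keeps the JSON format unconstrained while still giving the sampling code a uniform float type.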
* [Arith] ConstIntBound was incorrectly assuming bounds were over int64… (#13918)
[Arith] ConstIntBound was incorrectly assuming bounds were over int64_t range
This commit improved the floormod and floordiv conversion check to be
simpler for the negative range and adds a test to cover all integer data types.
* [CMSIS-NN] Reduction in code size of AOT test runner binary (#13815)
* [CMSIS-NN] Reduction in code size of AOT test runner binary
Co-authored-by: Ashutosh Parkhi <ashutosh.parkhi@arm.com>
* [CMSIS-NN] Add a runtime error message (#13643)
[CMSIS-NN] Add a runtime error message
APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN
errors caught in the backend. AOT test runner was improved to observe the contents
of this global variable. A test was added to check for the last set error as part of this
commit.
* [CRT]Cleanup unused macros in crt_config.h.template (#14125)
This PR removes old macros in crt_config.h.template.
* [Fix][Relay] Fix axis transformation in squeeze shape function (#14135)
* fix squeeze shape function issue and add testcase.
* fix lint
* [Unittest] merge test_cp_async_in_if_then_else into test_tir_transform_inject_ptx_async_copy (#14138)
This PR merges two related unittests into one.
* [Frontend][TFLite] Fix conv2d import bug (#14124)
* Fix TFLite frontend bug and add test
* lint
* [ONNX][TORCH] Replace scatter op by scatter_elements (#14019)
* remove scatter attr class
* update pytorch: scatter was replaced by scatter_elements
* remove scatter compute and strategy registration
* remove scatter attrs registration
* update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks
* update oneflow front-end
* update paddlepaddle front-end
* update pytorch utils
* remove front-end scatter definition
* fix scatter strategy for rocm
* small update
* remove scatter definition in back-end
* remove scatter strategy for cuda, gpu. transfer special case to scatter_elements
* fix test
* small fix
* upstream scatter with torch description
* last upstream of scatter in pytorch front-end
* fix reduction attribute in cuda strategy
* set scalar to test instead of tensor. update check for dynamic dim
* skip scalar source check in tests for scatter due to issue on torch side
* remove scatter op implementation from topi/cuda
* remove scatter op implementation from topi. small clean code
---------
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
* [TVMScript][Printer] Remove relax prefix for now (#14140)
Remove relax prefix for now
This PR cleans up the relax prefix in the printer for now.
While these setups are useful and do not cause any
technical debt in the codebase, we remove them given requests.
They can be added back to the unity branch, and later as part
of upstream.
* [microNPU] Sum legalization support (#13997)
Supports legalizing a Relay sum operation to an equivalent series of NPU operations. It supports the case with an int8 output type and the channel axis.
* [Fix][MetaSchedule] Fix redundant stages in async pipeline for mlt (#14143)
This PR fixes redundant stages if visiting `InitializeWithTuneContext`
multiple times.
* [COMMUNITY] Cheng Wen -> Reviewer (#14153)
Please join me in welcoming @chengven027-intellif as a new Reviewer in TVM.
Cheng has contributed to ONNX/PyTorch frontend and Relay passes, making TVM support more input models.
- [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+)
- [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+)
* [Runtime] Fix high RAM usage when saving / loading parameters of big models (#14147)
* add load_params_from_file
* add save_params_to_file
* avoid making another copy in save_params
* black
* add test
* update doc
* [Relay][Frontend] Span Filling PyTorch (#14050)
* [Relay][Frontend] Span Filling PyTorch
- Construct debug name of C graph instruction as the source name of span for pytorch model.
- To get the reference of renamed nodes. Add a function to export the converted C graph after conversion.
- Add structural_equal comparisons with and without set_span to the existing test cases.
- Add span test cases for frequent conversions.
- Add span test case for exporting model parameter.
* [SpanFillingPyTorch]
- Return TupleGetItem expr from TupleWrapper with the span of its Tuple.
- Add None type symbol in set_span for certain conversions.
- Add current_op member variable to PyTorchOpConverter to track which op
is being converted for the pytorch frontend.
* [SpanFillingPyTorch]
- Fix the error caused by the quantized params not found after renaming
the debug name of C graph.
---------
Co-authored-by: Joey Tsai <chunit@qti.qualcomm.com>
* [TRT][BYOC] allow strided_slice ops on selected dimensions (#14142) (#14144)
* [ONNX][TOPI] Add `DFT` operator (#13999)
* init convertor for DFT
* init test for DFT
* init DFT operator in Relay
* update topi implementation for DFT
* clean up
* update ONNX frontend
* support attribute
* fix error: Expected Array[Tensor], but got Array[index 0: Array]
* support inverse, onesided, dft_length
* update tests for DFT
* update TOPI test for DFT
* add documentation
* fix pylint
* fix cpplint
* fix cpplint
* fix threshold for FP16 (ARM)
* add CUDA compute
* fix pylint
* fix doc string
* code review fixes for ONNX front-end
* code review fixes for TOPI
* rename: stft.py -> signal.py
* pass input_shape and output_shape to verify_dft
* [CRT][microTVM] Enable USMP by default for AoTExecutor + CRT runtime (#14107)
This PR enables USMP by default when AoTExecutor and CRT runtime are selected. Check forum discussion about this change: https://discuss.tvm.apache.org/t/enable-usmp-by-default-in-aot-executor-with-runtime-crt/14406
As a result, the workspace memory in mlperftiny project type is removed since memory allocation is not required. If we keep this workspace, the model doesn't fit since some of the memory is allocated twice.
* [Android] Fix using system libraries in Android apps (#14145)
- Starting from API 31, `uses-native-library` is required if we
want to open a system library:
https://developer.android.com/about/versions/12/reference/compat-framework-changes#enforce_native_shared_library_dependencies
We should specify the OpenCL library in `uses-native-library` in all
applications where the OpenCL backend might be used.
- Updated README files to describe how to fix synchronization issues
in Android Studio.
* [microTVM]Enable TVMC micro with AoT Executor (#14077)
This PR enables AoT Executor for tvmc micro compilation.
* [bugfix] Fix the write buffer scope of `mma_store_impl` (#14174)
fix
* [Relay] Enhance EliminateCommonSubexpr to support Tuple argument (#14169)
If an argument of a call is a Tuple, we should check its fields.
Different tuples with the same fields should be treated as same inputs
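A minimal sketch of the comparison described above, using toy expression classes rather than Relay's real ones. All class and function names here are hypothetical; nodes compare by identity by default, which is why distinct tuple objects need the extra field-wise check.

```python
class Var:
    def __init__(self, name):
        self.name = name


class TupleExpr:
    def __init__(self, fields):
        self.fields = list(fields)


class Call:
    def __init__(self, op, args):
        self.op = op
        self.args = list(args)


def same_input(lhs, rhs):
    """Two inputs match if they are the same node, or tuples with matching fields."""
    if lhs is rhs:
        return True
    if isinstance(lhs, TupleExpr) and isinstance(rhs, TupleExpr):
        return len(lhs.fields) == len(rhs.fields) and all(
            same_input(a, b) for a, b in zip(lhs.fields, rhs.fields)
        )
    return False


def same_call(lhs, rhs):
    """Calls are common subexpressions if the op and every argument match."""
    return (
        lhs.op == rhs.op
        and len(lhs.args) == len(rhs.args)
        and all(same_input(a, b) for a, b in zip(lhs.args, rhs.args))
    )


x, y = Var("x"), Var("y")
first = Call("concatenate", [TupleExpr([x, y])])
second = Call("concatenate", [TupleExpr([x, y])])  # distinct tuple object, same fields
third = Call("concatenate", [TupleExpr([y, x])])   # same fields in a different order
print(same_call(first, second), same_call(first, third))
```

Without the field-wise check, `first` and `second` would be kept as separate computations even though they read exactly the same inputs.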
* [TIR] Fix typo in doc (#14178)
* [microTVM] Use QNN schedules to give SOTA performance (#13752)
In #13242, I rewrote microTVM's convolution schedules to give a major improvement in performance. While I demonstrated in tests that my changes worked, they could not be used with relay.build.
This pull request expands the functionality of #13242 and adds new legalize and alter_op passes to take advantage of the quantized schedules. This dramatically improves performance on some models, dramatically cuts RAM usage, and removes the need for autotuning on microTVM. More specifically, for the vww model from MLPerf Tiny running on the nucleo_l4r5zi, this pull request:
- Improves untuned performance from 1741 ms to 137 ms - a 6.8x improvement!
- Improves tuned performance from 337 ms to 137 ms.
- Sets a new state-of-the-art for MLPerf Tiny, beating Plumerai's previous 208 ms record
- Reduces RAM consumption by 73 KB (a large amount on microcontrollers!) by eliminating intermediate buffers.
- Reduces flash consumption for model weights by 5x
- Slightly improves accuracy
@mehrdadh has kindly tested these changes himself, and has confirmed my 137 ms figure.
To enable the schedules that grant these performance improvements, this pull request:
1. Adds out_layout support to the regular and depthwise conv2d schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242.
2. Generalizes the schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242 to be more widely applicable.
3. Adds a layout alteration pass to ensure regular and depthwise conv2d schedules always get their desired input formats.
4. Adds a conv2d -> depthwise conv2d -> unpadded conv2d rewrite step to remove empty channels from conv2d operators.
5. Adds a conv2d -> average pool -> dense rewrite step to remove empty channels from conv2d operators.
6. Adds an alter_op pass to fold padding into a separate Relay operator.
* Add v0.11.0 docs link to site (#14181)
Update the version menu in TVM documentation to add a specific v0.11.0 release docs link.
* [TIR] Allow TransformLayout with non-inversible index map (#14095)
* [TIR] Allow TransformLayout with non-inversible index map
TransformLayout requires the index map to have an inverse map that can be
calculated by the analyzer in order to check whether padding is added.
However, this check doesn't work in all cases because the
affine analysis can only handle a set of
supported patterns. In some cases, even if the index map doesn't
introduce padding, the schedule primitive throws `TransformationIntroducesPaddingError` because it
fails to calculate the inverse index map.
It is safe to allow a buffer to be padded without providing pad_value
because the original loop extent is not changed and the padded region is not accessed.
This PR changes the behavior of `TransformLayout` to allow
non-inversible index maps.
Previous discussion:
https://discuss.tvm.apache.org/t/conflict-free-shared-memory-permutation-in-tensorir/13959/9
* add assume_injective_transform option
* Apply suggestions from code review
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
---------
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
* [TIR][Analysis] Implement IdentifyMemCpy analysis function (#13947)
* [HotFix][MetaSchedule] Turn off database shash check (#14188)
At this moment, the structural hash values of IR in TVM are platform
dependent (e.g., the hash value of a String may differ on different
platforms). In our recent practice, we found this an obstacle to
applying one existing database on different platforms (say we tune
an IRModule with MetaSchedule on Metal, and then apply the database
on CUDA, etc.).
To clear this obstacle, we decide to remove the shash value check. The
purpose of that check is mainly to ensure safety, and thus turning it
off will make no difference in terms of using MetaSchedule in most of
the cases that we can imagine.
Meanwhile, it is equally important that we need to make our structural
hash platform independent. There are plans ongoing for this target.
* [TOPI] Batch Norm Training Mode (#14190)
Prior to this PR, TOPI batch_norm only supported inference.
This PR adds a `training: bool` flag and a `momentum: float` argument to support training mode (update moving_mean / var and return), which aligns with torch.nn.functional.batch_norm.
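A minimal sketch of what training mode adds, written in plain Python for a single channel. The function name `batch_norm_train` is hypothetical, and the use of the biased variance for the moving statistic is an assumption of this sketch rather than something the PR description states.

```python
def batch_norm_train(xs, moving_mean, moving_var, momentum=0.1, eps=1e-5):
    """Normalize one channel with batch statistics and update the moving ones."""
    n = len(xs)
    mean = sum(xs) / n
    # Biased variance; which estimator feeds the moving variance is an assumption.
    var = sum((x - mean) ** 2 for x in xs) / n
    out = [(x - mean) / (var + eps) ** 0.5 for x in xs]
    # Exponential moving average controlled by `momentum`.
    new_mean = (1 - momentum) * moving_mean + momentum * mean
    new_var = (1 - momentum) * moving_var + momentum * var
    return out, new_mean, new_var


out, new_mean, new_var = batch_norm_train([1.0, 2.0, 3.0, 4.0], 0.0, 1.0)
print(new_mean, new_var)
```

In inference mode the stored moving statistics would be used for normalization instead of the batch statistics, and nothing would be updated.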
* [TOPI] Group normalization (#14193)
As more and more ML models nowadays contain the group normalization
computation, we find it beneficial to introduce this op at the TOPI level.
It will enable us to optimize the group normalization operation as a
whole in a more convenient way.
This PR introduces the group normalization op to TOPI. The group norm
operation was introduced in https://arxiv.org/abs/1803.08494. The
implementation uses tuple reduction, same as the implementation of layer
norm. Implemented with tuple reduction, the corresponding generated TIR
function can be optimized by cross-thread reduction or rfactor through
MetaSchedule.
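The tuple-reduction idea can be sketched in plain Python: the sum and the sum of squares are accumulated together in one pass, and mean and variance are derived from the pair. This is an illustration only, not the TOPI implementation; `group_norm` here is a hypothetical helper that treats one sample as a flat list of channel values and omits the affine scale and shift.

```python
def group_norm(x, num_groups, eps=1e-5):
    """Normalize a flat list of channel values group by group."""
    assert len(x) % num_groups == 0
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        # Tuple reduction: accumulate (sum, sum of squares) in a single pass.
        total, total_sq = 0.0, 0.0
        for v in group:
            total, total_sq = total + v, total_sq + v * v
        mean = total / size
        var = total_sq / size - mean * mean
        out.extend((v - mean) / (var + eps) ** 0.5 for v in group)
    return out


y = group_norm([1.0, 3.0, 10.0, 14.0], num_groups=2)
print(y)
```

Because both accumulators are updated in the same loop, the reduction is a single block, which is what makes it a candidate for cross-thread reduction or rfactor.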
Co-authored-by: Bohan Hou <spectrometerh@gmail.com>
* [Fix][TIR] LowerCrossThreadReduction with write-back predicate (#14199)
Prior to this PR, the cross-thread reduction lowering pass did not
add a store predicate to the write-back block. The reasoning was
that for a given write-back buffer position, all values being stored
(by all the threads) in the write-back block are the same. Since all
threads write the same value, we assumed that omitting the
write-back block predicate was fine, because the result cannot be wrong
in any way.
However, we recently noticed that some GPU backend compilers treat
this behavior (multiple threads writing to the same position) as a race
condition and thus throw a compilation error. The compiler does not take
into account that all values being stored are the same, and insists on
complaining.
This means that we still need the write-back block predicate to
make things work, and this PR makes that change. I have done integration
tests locally to make sure that the generated kernels are right and
produce the right results numerically.
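The effect of the predicate can be pictured with a small sequential simulation in plain Python. This is not generated TIR; `cross_thread_sum` is a hypothetical helper, and the concrete predicate `tx == 0` is an assumption made for the sketch.

```python
def cross_thread_sum(values, use_predicate=True):
    """Simulate a cross-thread reduction followed by the write-back block."""
    reduced = sum(values)  # after the reduction every thread holds the same value
    out = [None]
    writers = []
    for tx in range(len(values)):
        # Store predicate (assumed form): only thread 0 performs the write-back.
        if not use_predicate or tx == 0:
            out[0] = reduced
            writers.append(tx)
    return out[0], writers


guarded, guarded_writers = cross_thread_sum([1, 2, 3, 4], use_predicate=True)
unguarded, unguarded_writers = cross_thread_sum([1, 2, 3, 4], use_predicate=False)
print(guarded, guarded_writers, unguarded, unguarded_writers)
```

Both variants produce the same result, but only the guarded one has a single writer per output position, which is what avoids the race-condition diagnostic.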
* [Unity] Relax VM (#13878)
This PR implements a flexible register-based VM to execute relax programs with dynamic shape and control flow. Design: https://github.com/tlc-pack/relax/wiki/Relax-VM-Design.
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
* [Unity] Relax expressions and types (#13901)
* [Unity][IR] First-class StructInfo (#13907)
* [Unity][IR] First-class StructInfo
Relax tracks structural information (such as tensor shape) via `StructInfo` about the values in Relax.
* Fix rust build
---------
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
* [Unity][CI] Unity specific jenkins setup (do not upstream to main) (#13910)
This PR sets up a unity-specific Jenkins with a minimal Jenkinsfile
without sharding and disables most of the tests to reduce overall
cost. We can add tests for the unity branch by configuring the specific
groovy file.
* [Unity] Basic StructInfo Analysis and Expr construction (#13916)
[Unity] Basic StructInfo Analysis and Expr construction.
This PR adds struct info analysis and expr support.
This is the logic to construct the IR nodes and perform
struct info related analysis.
Test cases are added to cover the IR node construction
and related struct info analysis checks.
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Altan Haan <altanh@cs.washington.edu>
Co-authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-authored-by: masahi <masahi129@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Ziheng Jiang <ziheng@apache.org>
* [Unity] Relax BlockBuilder and ExprMutator (#13926)
This PR adds BlockBuilder: the core data structure to construct Relax AST, and ExprMutator: performs AST mutation for implementing transformation passes.
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: masahi <masahi129@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
* [Unity] Relax TVMScript Parser. (#13932)
This PR adds the TVMScript parser/ir_builder support based on the blockbuilder.
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
* [Unity] Relax TVMScript Printer (#13944)
This PR introduces Relax as a dialect supported by the TVMScript
Printer. Some caveats:
- Needs to rebase to mainline before merging.
- Some tests are skipped because some operators are not upstreamed to
the unity branch yet.
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
* [Unity] Relax VM codegen (#13954)
* [Unity] Relax VM shape lowering pass (#13956)
This PR introduces Relax `FunctionPass` and `DataflowBlockPass` API, and the `VMShapeLower` pass to lower the shape expression in Relax to TIR functions and VM shape heap builtin functions.
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
* [Unity] e2e Relax minimum build flow (#13961)
This PR introduces the e2e Relax lowering flow (`relax.vm.build`). Tests for each pass in the flow are added.
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
* [Unity][TVMScript] Use explicit `R.shape` in TVMScript (#13979)
As we've introduced `arg_sinfo` in CallNode, implicit shape constructor
is not widely used in TVMScript. This PR removes the implicit shape since
it may cause confusion between shape and tuple.
* [Unity] Relax op: index (#13987)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the tensor indexing operators.
* [Unity] Relax op: datatype (#13986)
* [Unity] Relax op: set (#13990)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the set operators.
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
* [Unity] Relax op: image (#13994)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the image operators.
* [Unity] Relax op: arithmetic, comparison (#13983)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the unary, binary and ternary arithmetic and
comparison operators.
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Chaofan Lin <1713833595@qq.com>
* [Unity] Relax op: statistical (#13991)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the statistical operators.
* [Unity] Relax op: neural networks (#13993)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the neural network operators.
* [Unity] Relax op: creation (#13984)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the tensor creation operators.
* [Unity] Relax op: linear algebra (#13988)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the linear algebra operators.
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
* [Unity] Relax op: search (#13992)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the search operators.
* [Unity] Relax op: manipulation (#13989)
This PR is about the high-level tensor computation operators in Relax.
This PR includes the tensor manipulation operators.
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
* [Unity] NestedMsg Support utility (#13995)
This PR introduces NestedMsg to robustly handle nested-tuple analysis.
Relax supports nested tuple structures in the IR.
Nested tuple structure is important to support advanced groupings in
cases such as gradient calculation and other scenarios.
The possible presence of nested tuples does mean that we need to
robustly handle analysis that contains nested tuple structures in a dataflow graph.
This PR introduces a NestedMsg<T> class that corresponds to a possibly
nested message tuple for a given leaf message class T.
We also introduce various helper functions to compose and decompose messages.
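To make the idea concrete, here is a plain-Python analogue in which a nested message is either a leaf value or a tuple of nested messages. The helpers `map_nested` and `combine_nested` are hypothetical names for this sketch and are not the C++ helper functions added by the PR.

```python
def map_nested(fn, msg):
    """Apply fn to every leaf of a possibly nested tuple of messages."""
    if isinstance(msg, tuple):
        return tuple(map_nested(fn, m) for m in msg)
    return fn(msg)


def combine_nested(fn, lhs, rhs):
    """Combine two nested messages of identical structure leaf by leaf."""
    if isinstance(lhs, tuple):
        assert isinstance(rhs, tuple) and len(lhs) == len(rhs)
        return tuple(combine_nested(fn, a, b) for a, b in zip(lhs, rhs))
    return fn(lhs, rhs)


grads_a = (1, (2, 3))
grads_b = (10, (20, 30))
doubled = map_nested(lambda v: v * 2, grads_a)
summed = combine_nested(lambda a, b: a + b, grads_a, grads_b)
print(doubled, summed)
```

The gradient-accumulation flavour of the example mirrors the use case named above, where per-leaf messages have to follow the tuple structure of the values they describe.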
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
* [Unity][Pass] Operator Fusion Passes (#14001)
[Unity][Pass] Operator fusion passes
This PR introduces three passes for operator fusion:
1. AnnotateTIROpPattern: analyzes the operator kind from PrimFunc.
2. FuseOps: fuse operators for Relax functions, which adds a new fused
relax primitive function.
3. FuseTIR: fuse corresponding TIR PrimFuncs for the fused relax.
* [Unity][Pass] LambdaLift pass (#14012)
* [Unity][VM] Supporting "compiled" exec mode. (#14015)
[VM] Supporting "compiled" exec mode.
This PR adds support for "compiled" mode to the VM. The compiled mode translates
the relax function into a TIR function and drives it through the TIR function.
It is different from the micro AOT codegen, which generates TIR code that targets
the micro C runtime environment and is useful for resource-limited settings with
a smaller set of features. Both leverage the low-level TIR build that is also shared with TensorIR.
The current implementation targets the full TVM (VM) runtime, which comes with PackedFunc,
object, tuple, closure and all kinds of rich structure support. This also means that
we can leverage the full runtime support to handle things like allocation, dynamic shape,
easy plugins and python interaction, which are not available in a more limited runtime.
The user directly uses the same API to load the generated code regardless of
compiled mode or bytecode, and just needs to change one line:
```python
ex = relax.vm.build(mod, target, exec_mode="compiled")
```
The simplicity is thanks to the TVM runtime architecture that allows us to compose things together in objects.
The only difference is how the PackedFunc of high-level driving is provided. In the case of bytecode
it is normal interpretation and in the case of compiled mode it is TIR.
It is a complete implementation. Unit test cases are added. All codegen build tests are updated to include the two
exec_modes and have passed locally.
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
* [Unity][Pass] BindParams pass, FoldConstant pass (#14016)
This PR introduces FoldConstant/BindParam passes.
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
* [Unity][Pass][TuningAPI] Introduce TuningAPI and MetaSchedule pass (#14014)
Add TuningAPI and MetaSchedule tuning pass
* [Unity] Relay -> Relax translator (#14026)
This PR implements a Relay to Relax translator, which allows us to import Relay workloads to Relax for benchmarking and development purposes (tests and examples are added).
* [Unity][Pass] Normalize Pass (#14031)
This PR implements relax `Normalize` Pass, which allows users to transform Relax IR to normal form, i.e., the expressions are normalized (no nesting and hence the AST is in ANF), and all `checked_type_` and `shape_` of expressions are available. (tests are added).
Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
* [Unity][BlockBuilder] CallTE convert PrimValue args (#14028)
Prior to this PR, the `call_te` of BlockBuilder is not capable of converting PrimValue arguments and directly rejects PrimValues instead. This PR fixes this behavior with PrimValue conversion support and one regression test.
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
* [Unity][Pass] Wellformed Analysis (#14032)
This PR implements relax wellformed analysis, which checks if the IRModule is well-formed. (tests and examples are added).
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
* [Unity][TVMScript] Move tir/relax import in script out of __init__.py (#14033)
Prior to this PR, `python/tvm/script/__init__.py` imported both the tir and relax
submodules. This leads to the phenomenon that when people do
```python
from tvm.script import tir as T
```
, the relax submodule will be implicitly visited by `__init__.py` as well.
Since TIR does not rely on Relax, it is good not to import both of them
at the same time. (This can prevent cyclic imports sometimes.)
This PR does this decoupling by introducing two files
* `python/tvm/script/relax.py`
* `python/tvm/script/tir.py`
and removing the imports from `python/tvm/script/__init__.py` and
`python/tvm/script/parser/__init__.py`. With this change, we force people to
manually do `from tvm.script import tir` and `from tvm.script import relax`
to use the TVMScript parser, which matches our conventional usage.
* [Unity][Pass] Operator legalization (#14029)
This PR is the operator legalization pass, which transforms high-level
operator calls to `call_tir`s of corresponding low-level TIR PrimFuncs.
- The legalization pass provides customizability, which enables people
to pass in a customized legalization map to override the default
legalization method.
- The legalization supports symbolic shape. (At this moment only pooling
does not support symbolic shape, as TOPI pooling does not support it. This
needs to be fixed in followup PRs.)
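The customization point described in the first bullet can be pictured as a lookup table that a caller-supplied map overrides. The sketch below is plain Python with hypothetical names (`DEFAULT_MAP`, `legalize`, the tuple encoding of calls); it does not reproduce the actual pass API.

```python
def default_add(args):
    # Stand-in for the default lowering of an operator to a call_tir.
    return ("call_tir", "add", args)


DEFAULT_MAP = {"relax.add": default_add}


def legalize(call, custom_map=None):
    """Lower (op_name, args) using the default map, overridden by custom_map."""
    op, args = call
    table = dict(DEFAULT_MAP)
    table.update(custom_map or {})  # customized entries take precedence
    fn = table.get(op)
    return fn(args) if fn else call  # unknown ops are left untouched


default_result = legalize(("relax.add", ("x", "y")))
custom_result = legalize(
    ("relax.add", ("x", "y")),
    custom_map={"relax.add": lambda args: ("call_tir", "my_add", args)},
)
print(default_result, custom_result)
```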
Co-authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
* [Unity][Op] Add ShapeExpr Tests for Reshape Op (#14035)
This PR specially checks the relax.reshape operator when the input is a ShapeExpr.
* [Unity] Initial PyTorch Frontend (#14037)
[Unity] Initial PyTorch Frontend
This PR introduces initial pytorch frontend components of Relax, including
- an FX translator that translates a Torch FX graph module to a TVM IRModule,
- a Relax-backend of Torch Dynamo, which brings the mechanism to build PyTorch models using the Relax compilation pipeline,
- a pipeline prototype that contains the collection of pre-defined pipelines that optimize and lower an IRModule before passing it to minimum build.
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
* [Unity][Pass] Block-level static memory planning (#14038)
This PR introduces the static memory planning pass on binding block level, as well as an analysis function that estimates the memory usage after the memory planning pass. It supports the following features: nested-tuples, reuse of the memory of the input of reshape ops, an estimator that returns the total memory size needed to be allocated before and after memory planning, as well as the number of tensors / memory blocks to be allocated before and after memory planning.
The estimation is static -- it does not consider control flows (such as “if” and cross-function calls). It simply accumulates the size of every alloc_tensor and alloc_storage.
We will produce “`relax.memory.alloc_tensor/storage`” as the results produced by memory planning.
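The static estimate described above amounts to summing the byte size of every allocation without looking at control flow. A minimal plain-Python sketch follows; `estimate_bytes`, the `(shape, dtype)` encoding and the `DTYPE_BYTES` table are assumptions made for illustration.

```python
from math import prod

# Bytes per element for a few dtypes (illustrative subset).
DTYPE_BYTES = {"float32": 4, "float16": 2, "int8": 1}


def estimate_bytes(allocs):
    """Sum the size of every allocation, ignoring control flow and reuse."""
    return sum(prod(shape) * DTYPE_BYTES[dtype] for shape, dtype in allocs)


before = estimate_bytes([((2, 3), "float32"), ((2, 3), "float32"), ((4,), "int8")])
# After planning, the two float32 tensors are assumed to share one storage block.
after = estimate_bytes([((2, 3), "float32"), ((4,), "int8")])
print(before, after)
```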
* [Unity] Disallow inline prim_func in relax IR (#14040)
Disallow inline prim_func in relax IR
* [Unity] Update tests to adapt to latest TVMScript syntax (#14039)
Given that some recent changes to TVMScript syntax have been merged,
some test files now contain deprecated uses of TVMScript syntax.
This PR updates the test files with latest TVMScript syntax so that
running the tests will not trigger deprecation warnings.
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
* [Unity] Relax dataflow pattern language (matching) (#14041)
The dataflow pattern language for Relax (originally from https://github.com/tlc-pack/relax/pull/163).
The implementation splits patterns into two parts:
- Match an Expression: match an expression syntactically (MatchExprPattern, i.e., DFPatternMatcher);
- Match a Graph: match a graph (across multiple VarBindings) topologically (MatchGraphPattern).
* [Unity] Statement rewriter for DataflowBlock (#14043)
This PR implements a few APIs to quickly perform statement-level mutation:
`add`/`remove_unused`/`remove_all_unused`/`replace_all_uses`.
It also implements `remove_all_unused` to remove dead statements inside `DataflowBlock`.
* [Unity][Pass] FuseOps FuseTIR fixes (#14044)
This PR fixes two bugs of FuseOps and FuseTIR:
It fixes FuseOps, which only rewrote the "main" function of the
IRModule. After the fix, FuseOps now goes through each non-primitive
Relax function. Test cases for both FuseOps and FuseTIR are added
to ensure that both passes work for cases of multiple Relax
functions.
It also fixes FuseOps and FuseTIR, which did not take "call_dps_packed" style
"call_tir" into account. The previous behavior directly downcast
the first argument of "call_tir" to GlobalVar, which is not right when
the "call_tir" is in "call_dps_packed" style and the first argument is
a PackedFunc. With this fix, FuseOps and FuseTIR skip such
"call_tir"s. Tests for both CallTIR and CallOps are added accordingly.
* [Unity][TVMScript] Overload `__neg__` for relax expr (#14045)
This PR overloads `__neg__` given that `relax.negative` is now supported. Besides, it adds `test_op_misc.py` and brings tests for calling overloaded operators.
* [Unity][VM] Add per-op profiling support (#14053)
Adds per-op profiling support to Relax VM, in a way similar to how Relay VM is instrumented via the common profiling infra in the runtime. Profiling over RPC is supported.
Example output:
```
Name                          Duration (us)  Percent  Device    Count  Argument Shapes
conv2d1                          705,779.00    51.22  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 54, 54]
conv2d                           669,589.00    48.60  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]
relu                                 683.00     0.05  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]
relu1                                679.00     0.05  hexagon0      1  float32[1, 64, 54, 54], float32[1, 64, 54, 54]
vm.builtin.check_tensor_info          28.00     0.00  hexagon0      1  float32[1, 64, 56, 56]
vm.builtin.match_shape                25.00     0.00  hexagon0      1  float32[1, 64, 56, 56]
----------
Sum                            1,376,783.00    99.93                6
Total                                  0.00           cpu0          1
Total                          1,377,809.00           hexagon0      1
Configuration
-------------
Number of threads: 4
Executor: VM
```
The original PR: https://github.com/tlc-pack/relax/pull/422
* [Unity][BYOC] Add pattern-based partitioning pass (#14054)
This adds a new pass, FuseOpsByPattern, which applies pattern matching to each function in the given module, and groups matched expressions into a new function. The end result is similar to FuseOps, but fusion is driven completely by
the provided patterns. The implementation also reuses OperatorFusor used by FuseOps to create grouped functions from partitioned groups, further illustrating the similarity between the two passes.
The new pass will serve the same role the MergeComposite pass plays in Relay BYOC - grouped functions are annotated with the "composite" attribute to denote what operations a given function consists of, and offloaded to external backends. But it can be also useful in non-BYOC settings, for example to support advanced fusion that the op-kind based one doesn't handle (fused MHA, conv2d / gemm + reduction fusion, etc).
The original PR: https://github.com/tlc-pack/relax/pull/366
* [Unity] Relax op: collapse sum (#14059)
This PR brings high-level operators `relax.collapse_sum_like` and `relax.collapse_sum_to`, which are useful when doing AD in Relax. To achieve this, it exposes the interface of `topi.collapse_sum`. Moreover, this PR also implements the legalization of these ops and adds corresponding tests.
* [Unity][Fix][Pass] Fix FuseOps for lack graph edges (#14058)
This PR fixes a mistake of #14044. In #14044, in VisitLeaf of graph
construction of FuseOps, we first check if the input node is a Leaf and
then check if it is a Tuple. This is not right: since Tuple is not
categorized as a leaf node, when the input node is a Tuple, the
function returns early because the input is not a LeafNode, and the check
for Tuple therefore never holds.
It is quite interesting that our existing unit tests failed to catch this
mistake. I added a regression test for this case, which ensures
that the tuple is always visited.
* [Unity][Pass] Remove Unused Function (#14061)
This PR implements a pass to clean up unused functions.
Co-authored-by: masahi <masahi129@gmail.com>
* [Unity][BYOC] Add pass to merge composite functions to offload large subgraphs (#14062)
This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large a subgraph as possible, for example TensorRT. It plays the same role as the MergeCompilerRegion pass in Relay BYOC does, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830.
Original PR
https://github.com/tlc-pack/relax/pull/372
Substantial improvement by @yelite
https://github.com/tlc-pack/relax/pull/411
Related fix PR by @yelite
https://github.com/tlc-pack/relax/pull/406
Co-authored-by: Lite Ye <yelite958@gmail.com>
* [Unity][Frontend] Annotate number of non-static input of FX function (#14067)
* [Unity][Transform] Add LiftTransformParams pass (#14069)
This PR adds a pass `LiftTransformParams`. It allows compiling the
end-to-end model without weights provided. The idea is to annotate the
input parameters that are weights, identify and lift the
transformations applied to weights, and compile them into a separate function
`transform_params` that can be executed at runtime. Users can run
`transform_params` with weights to get the weights for the optimized
model as a prep step before the deployment. In this way, we perform the
same optimizations and defer the weight transformations to the user
side, while the overhead of the deferred weight transformation can be
ignored as it only needs to be run once.
This pass is integrated with the default `vm.build`. It is optional and
only necessary when the parameters are kept as inputs when importing the
model from the frontend.
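The split described above can be pictured with a toy example in plain Python. The doubling of weights is a stand-in for whatever weight transformation the optimizer introduces, and `original`, `optimized` and this `transform_params` are hypothetical functions written for the sketch, not output of the pass.

```python
def original(x, weights):
    # The weight transformation runs on every call.
    packed = [w * 2 for w in weights]
    return sum(x * w for w in packed)


def transform_params(weights):
    # Lifted out: runs once, before deployment.
    return [w * 2 for w in weights]


def optimized(x, transformed):
    # The main function now expects already-transformed weights.
    return sum(x * w for w in transformed)


weights = [1, 2]
prepared = transform_params(weights)  # one-time prep step
print(original(3, weights), optimized(3, prepared))
```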
* [Unity][BYOC][Pass] RunCodegen and TensorRT (#14078)
This PR introduces the fundamental workflow for BYOC and integrates TensorRT as a demonstration.
* [Unity][Pass] Canonicalize Bindings (#14079)
It may be useful for some passes to collapse chains of definitions, particularly after other compiler transformations that may reduce or simplify some expressions.
This pass will take chains of definitions and replace references to later definitions with the original one. It works by checking `LookupBinding` for each var use-site and replacing the var with its definition if the definition was another var. Additionally, `MatchCast` bindings where the LHS and the RHS are guaranteed to match at compile time are canonicalized into ordinary `VarBinding`s.
Example:
```python
y = x
z = y
w = z
o = w
p = o
```
Will be replaced with
```python
y = x
z = x
w = x
o = x
p = x
```
Original PR: https://github.com/tlc-pack/relax/pull/233
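The chain-collapsing step in the example above can be sketched in a few lines of plain Python. Bindings are modelled as `(lhs, rhs)` pairs of variable names; `canonicalize` is a hypothetical helper and does not model `MatchCast` or non-variable right-hand sides.

```python
def canonicalize(bindings):
    """Rewrite var-to-var bindings so each one refers to the original definition."""
    root = {}  # maps an alias to the variable it ultimately stands for
    result = []
    for lhs, rhs in bindings:
        # If rhs is itself an alias, follow it to the original definition.
        resolved = root.get(rhs, rhs)
        root[lhs] = resolved
        result.append((lhs, resolved))
    return result


chain = [("y", "x"), ("z", "y"), ("w", "z"), ("o", "w"), ("p", "o")]
canonical = canonicalize(chain)
print(canonical)
```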
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
* [Unity] Add testcases for `expr_args_converter` (#14080)
This test file was missing when we added the `expr_args_converter`. This
PR adds it.
* [Unity][BYOC] Add CUTLASS backend (#14081)
Co-authored-by: Lite Ye <yelite958@gmail.com>
* [Unity][BYOC] Add DNNL backend (#14082)
This PR adds dnnl backend to the unity flow.
* [Unity][Op] `log_softmax` and `cross_entropy_with_logits` (#14083)
This PR introduces two high-level operators log_softmax and cross_entropy_with_logits, which are important when we are calculating CrossEntropyLoss (in torch).
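For context on why `log_softmax` matters for cross-entropy, here is a numerically stable plain-Python version together with a class-index cross-entropy built on it. This is an illustration only; it does not claim to reproduce the exact semantics of `cross_entropy_with_logits`, and both function names below are local to the sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -log_softmax(logits)[target]


probs_total = sum(math.exp(v) for v in log_softmax([1.0, 2.0, 3.0]))
loss = cross_entropy([0.0, 0.0], target=0)
print(probs_total, loss)
```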
Co-authored-by: Yixin Dong <ubospica@gmail.com>
* [Unity][Analysis] TIR pattern kind analysis for multi-buffer write block (#14075)
This PR supports TIR pattern kind analysis for TIR blocks which write
to multiple buffers, which is helpful for normalization operators like
layernorm, groupnorm, etc.
Prior to this PR, the analyzer did not support blocks which write to
multiple buffers. On seeing such a block, the analyzer simply set the
analysis result to "opaque". With this PR, on seeing a block which
writes multiple buffers, the analyzer will check if all the BufferStores
have the same indices. And it will only set the result to "opaque" when
the BufferStores have different indices.
By doing this, the analysis works for common cases where a block may
write to multiple buffers, like layernorm or groupnorm.
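The rule described above can be sketched in plain Python. A block is modelled as a list of `(buffer_name, indices)` stores; `classify_block` is a hypothetical helper, and the label `"analyzable"` is a placeholder for "continue with the usual pattern-kind analysis".

```python
def classify_block(stores):
    """Return "opaque" only when the stores of one block use different indices."""
    index_sets = {tuple(indices) for _, indices in stores}
    return "opaque" if len(index_sets) > 1 else "analyzable"


# A layernorm-like block writing mean and variance at the same position.
same_indices = classify_block([("mean", ("i",)), ("var", ("i",))])
# A block whose two stores target different positions.
different_indices = classify_block([("a", ("i", "j")), ("b", ("j", "i"))])
print(same_indices, different_indices)
```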
Besides the unit test for the analysis itself, this PR also adds a unit
test for the FuseOps pass, to make sure that a "layernorm + relu" pattern can
be fused together.
* [Unity][Fix][Pass] FoldConstant with DCE in dataflow block (#14087)
The current FoldConstant pass does not support removing unused bindings
in the post-folding function. Therefore, for large real-world models,
the built executable will be excessively large because of the redundant unused
constants.
This PR removes the redundant unused constant bindings in FoldConstant
by using the analysis function "RemoveAllUnused".
Note that "RemoveAllUnused" only works at dataflow block level.
Therefore FoldConstant will not remove unused bindings outside of
a dataflow block either.
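A minimal sketch of the dead-binding removal, in plain Python rather than the C++ analysis. Bindings are `(var, used_vars)` pairs in program order; the function below is a stand-in named after the analysis for readability and is an assumption about its behavior within a single dataflow block.

```python
def remove_all_unused(bindings, outputs):
    """Keep only bindings that the block outputs (transitively) depend on."""
    live = set(outputs)
    kept = []
    # Walk backwards so that a use is seen before its definition.
    for var, uses in reversed(bindings):
        if var in live:
            live.update(uses)
            kept.append((var, uses))
    kept.reverse()
    return kept


block = [
    ("c0", []),         # folded constant that is still used
    ("c1", []),         # folded constant left without users
    ("a", ["c0"]),
    ("out", ["a"]),
]
pruned = remove_all_unused(block, outputs=["out"])
print(pruned)
```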
* [Unity] Refactor Relax Build JIT UX (#14088)
This PR refactors relax build so it gets exposed at the opt-level.
We also introduce an explicit jit functionality to handle
live loading of compiled artifacts from cutlass.
We also move relax vm to runtime so it can be clearly isolated
from the rest of the compiler stack.
* [Unity][Relax] Set Shape Function to Be Host Function (#14090)
Set shape function to be host func.
* [Unity] Fix typo in the comment (#14096)
* [Unity] Lower `shape_of` to a builtin (#14093)
This PR lowers shape_of op to a Relax VM builtin, and changes a utility function to take StructInfo as input.
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
* [Unity] Relax Recursive function (#14092)
This PR adds TVMScript local recursive function support. It also updates the lambda lifting pass. CalledGlobalVars is removed, as it was no longer used. It also updates the well-formed pass to allow undefined vars for recursive calls.
* [Unity][Layout] Add layout transformation analysis for PrimFunc (#14066)
* [Layout] Add layout transformation analysis for PrimFunc.
This change adds a PrimFunc level analysis to suggest layout transformations to blocks and buffers in the PrimFunc based on the layout transformations to PrimFunc outputs.
* Add support for multiple blocks such as split op.
* Add negative tests and increase coverage.
* fix warning message
* fix lint
* remove unused header
* Address comments.
Moved some utility functions to support/array.h
improve doc
* fix deprecation warn T.var("int64") to T.int64()
* address comments
* [Unity] Remove attributes of relax.print, assert and unique (#14101)
Remove the attributes of operators assert, print and unique.
Use PrimValue as substitute.
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
* [Unity][BYOC]Add relax backend pattern registry (#14106)
* Add relax backend pattern registry
* Add doc
* [Unity] Update tests again to adapt to latest TVMScript syntax (#14115)
* finished
* fix
* rollback merge_composite_functions
* [Unity][Fix] Fix bug in MergeCompositeFunctions (#14117)
Currently `MergeCompositeFunctions` modifies the map while iterating over it, which causes tests/python/relax/test_transform_merge_composite_functions.py to fail. This PR fixes this bug.
* [Unity][BlockBuilder] Add `name_hint` argument for `emit` and `emit_output` (#14126)
This PR adds the `name_hint` argument to the `emit` and `emit_output` APIs of the Relax blockbuilder. The argument exists on the C++ side but was not exposed to the Python side (so users of the Python bb.emit get `name_hint` set to `""` by default).
Co-authored-by: Yixin Dong <ubospica@gmail.com>
* [Unity][WEB] Relax vm on web runtime (#14131)
This PR brings initial relax vm support on web runtime
* [Unity] Add Global info (#14132)
* [Unity][BYOC] Add transposed matmul support to Relax CUTLASS BYOC (#14128)
Add transposed matmul support for Relax CUTLASS
* [Unity][TVMScript] emit_te sugar (#14123)
This PR adds R.emit_te meta-programming mechanism to emit a topi operator from TVMScript
* [Unity][BYOC] Assign group to unused bindings and ignore PrimFunc (#14139)
* [Unity][BYOC] Assign group to unused bindings and ignore PrimFunc
* Update fuse_ops.cc
* [Unity] Add callback to FuseOpsByPattern to check match result is accepted (#14109)
* [Unity] Add callback to FuseOpsByPattern to check match result is accepted
* add callnode to callback args
* update pattern registry
* fix
* [Unity][Legalize] Fix Scalar Constant Legalization (#14127)
This PR fixes the loss of data type during legalization. Previously, if a constant scalar was used in operators like `multiply`, it was automatically converted to a Python value, which may lose its original data type. For example, `float16` may become a Python `float` and be interpreted as `float32` later.
This is now fixed by avoiding scalar value conversion. The conversion could be added back once we have better support for scalar prim value.
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
* [Unity][Pass] Enhance constant folding to fold relax ops by evaluating them. (#14146)
* [Unity][Pass] Enhance constant folding to fold relax ops
by evaluating them.
This uses the registered legalization function attached to
the op to lower it to call_tir and uses the existing call_tir
folding mechanism to fold it.
This kind of op folding is only allowed within dataflow block
as ops could have side-effects.
Limitations:
* This currently does not support folding ops
that could lower to multiple call_tir bindings.
* Folding by evaluating ops is not always beneficial.
We need a heuristic to check if it is useful. This is
not implemented yet and folding is always allowed
by evaluating expressions.
* fix ci error
* fix doc
* fix bug
* [Unity][Debugging] AST printer (#14152)
This PR transfers over the AST printer from tlc-pack/relax. The AST printer is a debugging tool that prints out a Relax AST in a precise and human-readable format, which can be helpful for debugging the parser or various passes.
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Ruihang Lai <ruihangl@cd.cmu.edu>
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
* [Unity][Pass] Support Symbolic Shape Deduction during BindParam (#14154)
`BindParam` replaces function params with constant nodes. However, it
drops the shape information of the params. Consider the following case:
```python
@R.function
def main(
x: R.Tensor(("batch", "m"), dtype="float32"),
w0: R.Tensor(("n", "m"), dtype="float32"),
b0: R.Tensor(("n",), dtype="float32"),
w1: R.Tensor(("k", "n"), dtype="float32"),
b1: R.Tensor(("k",), dtype="float32"),
) -> R.Tensor(("batch", "k"), dtype="float32"):
batch = T.Var("batch", "int64")
k = T.Var("k", "int64")
m = T.Var("m", "int64")
n = T.Var("n", "int64")
with R.dataflow():
lv0 = R.call_tir("linear0", (x, w0, b0), out_sinfo=R.Tensor((batch, n), dtype="float32"))
out = R.call_tir("linear1", (lv0, w1, b1), out_sinfo=R.Tensor((batch, k), dtype="float32"))
R.output(out)
return out
```
The current pass simply drops the symbolic vars `n` and `k`, which causes
undefined vars during build:
```python
@R.function
def main(x: R.Tensor((1, "m"), dtype="float32")) -> R.Tensor(dtype="float32", ndim=2):
m = T.Var("m", "int64")
n = T.Var("n", "int64")
k = T.Var("k", "int64")
with R.dataflow():
lv0 = R.call_tir("linear0", (x, metadata["relax.expr.Constant"][0], metadata["relax.expr.Constant"][1]), out_sinfo=R.Tensor((1, n), dtype="float32"))
out = R.call_tir("linear1", (lv0, metadata["relax.expr.Constant"][2], metadata["relax.expr.Constant"][3]), out_sinfo=R.Tensor((1, k), dtype="float32"))
R.output(out)
return out
```
This PR updates the pass to bind the symbolic shape during binding.
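A minimal sketch of the idea, assuming shapes are plain tuples in which strings stand for symbolic dimensions. The helpers `bind_symbolic_dims` and `substitute` are hypothetical and are not the pass's real API; they only show how concrete parameter shapes determine the symbolic variables.

```python
def bind_symbolic_dims(annotated_shapes, concrete_shapes):
    """Map symbolic dim names to integers using the shapes of bound params."""
    bindings = {}
    for name, annotated in annotated_shapes.items():
        if name not in concrete_shapes:
            continue  # unbound params keep their symbolic dims
        concrete = concrete_shapes[name]
        if len(annotated) != len(concrete):
            raise ValueError(f"rank mismatch for {name}")
        for dim, value in zip(annotated, concrete):
            if isinstance(dim, str):
                # First binding wins; later ones must agree with it.
                if bindings.setdefault(dim, value) != value:
                    raise ValueError(f"conflicting values for {dim}")
            elif dim != value:
                raise ValueError(f"shape mismatch for {name}")
    return bindings


def substitute(shape, bindings):
    """Replace every bound symbolic dim in a shape with its integer value."""
    return tuple(bindings.get(d, d) if isinstance(d, str) else d for d in shape)


annotated = {"x": ("batch", "m"), "w0": ("n", "m"), "b0": ("n",),
             "w1": ("k", "n"), "b1": ("k",)}
concrete = {"w0": (4, 8), "b0": (4,), "w1": (2, 4), "b1": (2,)}
bindings = bind_symbolic_dims(annotated, concrete)
```

With these bindings, an intermediate shape such as `("batch", "n")` becomes `("batch", 4)`, so no reference to an undefined `n` or `k` survives the binding.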
* [Unity][Analysis] Checking function return struct info in well-formed check (#14155)
The current well-formed check does not check the function return struct info,
so a function may mistakenly pass the check even if there are undefined vars
in its return struct info.
* [Unity][BYOC] Use Relax legalize + CPU build for reference in tests (#14162)
* clean dnnl test
* clean trt test
* clean cutlass test
* fix gelu legalize for fp16
* use memoize in dnnl and trt tests
* [Unity] Add bind_constants option to FuseOpsByPattern (#14151)
* [Unity] Add lift_constants option to FuseOpsByPattern
* lift_constants -> bind_constants
* [Unity][Analysis] Analysis for detecting recursion in Relax (#14149)
* DFS based attempt to detect mutual recursion
* Use Johnson's circuit-detecting algorithm instead
* Fix control flow test
* Detect all recursion anyway
* Add new test cases for simple recursion
* Fix mistake in test case
* Include missing dependencies
* Remove trailing whitespace
* Dependencies are simply references, not necessarily calls
* More trailing whitespace
* Newline at end of file
* Fix spacing in docstring
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
---------
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
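As a rough illustration of recursion detection over a reference graph, the sketch below marks every function that can reach itself. It is a simple reachability check, not Johnson's circuit-finding algorithm named in the commit (which enumerates the elementary cycles themselves), and the `references` mapping is a hypothetical stand-in for the dependencies collected from a module.

```python
def recursive_functions(references):
    """Return the set of functions that can reach themselves via references."""

    def reaches_itself(start):
        stack = list(references.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(references.get(node, ()))
        return False

    return {func for func in references if reaches_itself(func)}


references = {
    "main": {"f", "g"},
    "f": {"g"},   # f and g are mutually recursive
    "g": {"f"},
    "h": {"h"},   # h is simply recursive
    "leaf": set(),
}
```

Edges here are plain references rather than calls, matching the commit note that dependencies are simply references.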
* [Unity][BYOC] Add batch matmul support to Relax CUTLASS BYOC (#14166)
* Add batch matmul support to Relax CUTLASS BYOC
* Allow more dtypes
* Fix tests
* Revert how to get batch attr
* [Unity][Op] Full support of Relax op `power` (#14171)
This PR provides full support for `R.power`, including op registration, legalization, overloading `__pow__` for Expr, and the torch FX frontend.
* [Unity][Analysis] Restore Python bindings for var analyses (#14180)
Restore Python bindings for var analyses
* [Unity][OP] Add an operator for fused multi head attention (#14150)
* [Unity][OP] Add an operator for fused multi head attention
This PR introduces the new relax operator `R.nn.attention` for fused multi head attention, and adds support for fused multi head attention to relax cutlass BYOC. The inputs of the operator are the query, key and value tensors, in `BSNH` layout, namely `[batch size, sequence length, number of heads, dimension of heads]`. The output has the same layout as the input tensors.
* remove useless codes, remove attrs and add memoize
* add more dispatches
* nit and fix rebase
* fix linter
* add support for bias
* fix lint
* BNSS layout for bias
* update doc
* fix typo
* support bias broadcast
* [Unity][WEBGPU] Codegen improvements and WebRuntime (#14187)
This PR makes various improvements to web codegen and the relax web runtime.
Correct support of shift operators.
Update relax vm to make most use of internal allocators.
Update the webgpu API to the latest spec.
* [Unity][Transform] LiftTransformParams handling multiple functions (#14192)
Previously, the LiftTransformParams pass only worked on the function
`"main"`. This is a bit restrictive, as in our recent practice on stable
diffusion there are cases where multiple Relax functions inside an
IRModule all need to be transformed.
Therefore, this PR enhances the LiftTransformParams pass, so that it
will now transform **all** functions **with attribute `num_input`**. For
functions without this attribute, the pass will simply skip them.
* [Unity][Op] Group normalization (#14194)
* [TOPI] Group normalization
As more and more ML models nowadays contain the group normalization
computation, we find it beneficial to introduce this op to TOPI level.
It will enable us to optimize the group normalization operation as a
whole in a more convenient way.
This PR introduces the group normalization op to TOPI. The group norm
operation was introduced in https://arxiv.org/abs/1803.08494. The
implementation uses tuple reduction, same as the implementation of layer
norm. Implemented with tuple reduction, the corresponding generated TIR
function can be optimized by cross-thread reduction or rfactor through
MetaSchedule.
Prior to this PR, the group normalization operations in frontend models
are translated to a series of operations, which brings inconvenience
when we want to optimize the group norm op as a whole.
With the TOPI implementation of group norm being introduced by #14193,
we can now use it to legalize the high-level group norm op and optimize
it using cross-thread reduction or rfactor via MetaSchedule.
Co-authored-by: Bohan Hou <spectrometerh@gmail.com>
* [Unity][Op] Argmax and argmin (#14195)
This PR introduces full support to the argmax and argmin op to the unity
branch, including the structure info inference, the legalization, and
the translation from Torch FX.
* [Unity][Op] Legalize `round`, `floor`, `ceil`, `sign` (#14198)
This PR implements the legalization for four unary operators:
* round,
* floor,
* ceil,
* sign.
Unit tests are provided accordingly.
* [Unity][Frontend] FX translator supporting more ops (#14196)
This PR improves the torch FX translator in the following ways:
* support unary op `sigmoid` and `round`,
* support in-place `fill`, `triu` and `tril`,
* support `tensor`, `arange`, `empty`,
* support `bmm` (batch matrix multiplication),
* support `astype`,
* support `chunk` and `squeeze`.
This PR also fixes `Embedding`. Previously the translation assumed that
the input to Embedding would only be 1-dimensional, and threw an
exception when the input had more than one dimension (i.e., batched).
This PR adds that support.
* [Unity][Frontend] FX translator returning weights with `keep_params_as_input` (#14197)
PR #14067 introduced the flag `keep_params_as_input` to the FX
translator, in order to handle model weights outside of the
translated Relax function.
This PR takes a further step by returning the model weights as
NDArrays when the flag `keep_params_as_input` is true. With this PR, the
translator can now return the weights upon request. Otherwise,
after the import we would lose the model weights in the given PyTorch
model.
* [Unity][Fix] FX translating dtype (#14201)
This PR fixes a bug of the current FX translator when dealing with
dtype.
Previously, the translator did not take cases such as
```python
dtype = x.getattr("dtype")
```
into consideration. In this case, the dtype will be a fx.Node object,
while the translator assumes that the dtype is either a string or
a torch native datatype (e.g., torch.float32).
This PR fixes this by first doing an environment table lookup for all
dtypes.
* [Unity][TIR][Pass] ForceNarrowIndexToInt32 (#14203)
[TIR][Pass] ForceNarrowIndexToInt32
This PR introduces a pass which forces every index expression in a
PrimFunc to have dtype int32. It also checks that all integer
buffers in the PrimFunc have int32 dtype, and reports an error if some
integer buffer has a dtype other than int32.
In terms of implementation, this pass leverages the
IndexDataTypeNormalizer, with the target dtype being int32.
This PR contains a few basic tests that come from
`test_tir_transform_narrow_datatype.py`, and contains some negative
tests as well.
* [Unity][Frontend] FX translator support torch.baddbmm (#14202)
This PR adds support for translating `torch.baddbmm` into a
combination of operators (matmul, add, multiply). Unit tests
are provided accordingly.
This PR also fixes the kwarg fetching issue of `torch.interpolate`.
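For reference, `torch.baddbmm(input, batch1, batch2, beta, alpha)` computes `beta * input + alpha * (batch1 @ batch2)`, which is why it decomposes into matmul, multiply and add. The sketch below spells out that decomposition on nested Python lists; the helper names are illustrative only, broadcasting of `input` is omitted, and this is not the translator's actual code.

```python
def matmul(a, b):
    """Batched matmul on nested lists: (B, N, M) @ (B, M, P) -> (B, N, P)."""
    return [
        [[sum(x * y for x, y in zip(row, col)) for col in zip(*bm)] for row in am]
        for am, bm in zip(a, b)
    ]


def multiply(t, scalar):
    """Multiply every element of a 3-D nested list by a scalar."""
    return [[[v * scalar for v in row] for row in mat] for mat in t]


def add(a, b):
    """Elementwise add of two 3-D nested lists with identical shapes."""
    return [
        [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ma, mb)]
        for ma, mb in zip(a, b)
    ]


def baddbmm(inp, batch1, batch2, beta=1.0, alpha=1.0):
    """beta * inp + alpha * (batch1 @ batch2), built from the three ops above."""
    return add(multiply(inp, beta), multiply(matmul(batch1, batch2), alpha))
```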
* [CI] Point cpu ci to dep with onnx (#40)
Point cpu ci to dep with onnx
* [Unity] Introduce Default GPU Schedule Pass (#14182)
* Implement default schedule.
* Add test.
* Add tests.
* Fix linting.
* Skip scheduled blocks.
* Address issues.
* Use target current.
* Minor fixes.
* Remove Mutator.
…
* Quantized Corstone300 test draft
* Add QNN strategy with operator fusion for Cortex-M
Get QNN strategy running
QNN strategy with operator fusion
* Add assembly tensordot code from other PR
Assembly tensordot from other PR
Tensordot offset support
Hand tested tensordot code
* Helper work to support microTVM TIR schedules
Formatting fixes
Don't use automatic AOT building when skipping pass
Assorted tech for scheduling with TIR
Hacky int16 support
* TIR schedule for microTVM conv2d
Bugged schedule implementation
Passing test!
Works for all 1x1 conv2ds!
External QNN operator altering
Debugging work
Pad with correct constant
Broadly functional conv2d
Reorganize quantize convolution test
* TIR schedule for microTVM depthwise_conv2d
Working depthwise convolution for strides=1
Working depthwise convolution!
* Clean up code
Support Python 3.7
Clean up code to prepare for review
* Break qnn.py into helper functions
* Finish reorganizing qnn.py
* Fix linting
* Remove residual debug code and fix linting
* Try repairing unit tests
* Run black to fix linting
* Address code review comments
* Second round of code review
Second round of code review
Fix tensordot opts test
* Address @areusch code review
* More code review
* Catch VWW model download with request hook
For a long time, I've been unhappy with TVM's TE-based convolution schedules for Arm Cortex-M. They were a lot slower than the state-of-the-art, and had a lot of strange inefficiencies caused by limitations of TE.
This pull request rewrites regular and depthwise convolution schedules on Arm Cortex-M, using MetaSchedule and TIR to make them much faster. It took some work and ended up being a big PR (as many of these changes depend on the others), but I'm really happy with the result.
High level changes
* Adds a `qnn` operator strategy to TVM for Arm Cortex-M. With this change, we are able to skip the QNN lowering pass, letting us use Cortex-M specific implementations of `qnn_conv2d`, `add`, and `requantize` that perform much better.
* Adds `alter_op_layout` functions for `add` and `requantize`. This reduces the amount of memory loaded during each requantization by over 5x with some snazzy tricks (pre-multiplying the kernel values with the input zero point, skipping the "shift" step in our floating point multiplication approximation, fusing the bias with the pre-multiplied zero point).
* Adds a test that runs the `vww` model using TFLite and ensures our implementation (with all the optimizations above) produces the same outputs. This is done by layer, so if there is ever an accuracy issue, we will know exactly which layer is causing the problem.

TFLite-ground-truth Corstone300 Test
For a while, microTVM has had Corstone300 tests which compare our schedules for regular `nn` ops to implementations elsewhere in TVM, to make sure the schedules are written correctly. Despite this, we've had some accuracy issues (see #13364) when running models end-to-end, and we don't really have tools to debug these.
The way I see it, the existing tests have two key limitations. They:
* […] `nn.conv2d`), while leaving out the bias and re-quantize operations (which are normally fused).

To fix this, I've added `test_quantized_convolution.py` in this PR. This test runs the convolution layers of the `vww` model from TinyML perf using TensorFlow's TFLite Interpreter, while saving all the intermediate layer outputs. Then, one by one, each layer is loaded with TVM and Corstone300, and the full operator (with fused convolution, bias, ReLU, and requantization) is run and compared to TFLite's result.
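The per-layer comparison can be pictured with a small sketch like the one below. It is not the code in `test_quantized_convolution.py`: the function name is hypothetical, and plain Python lists stand in for the saved TFLite layer outputs and the outputs produced by TVM on Corstone300.

```python
def first_mismatched_layer(reference_outputs, candidate_outputs):
    """Return the index of the first layer whose outputs differ, or None."""
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("both runs must cover the same number of layers")
    for index, (expected, actual) in enumerate(
        zip(reference_outputs, candidate_outputs)
    ):
        # Quantized outputs are integers, so exact equality is the criterion.
        if list(expected) != list(actual):
            return index
    return None
```

Because each layer is fed the reference input rather than the previous layer's possibly wrong output, a mismatch at index `i` points at layer `i` itself and not at an error propagated from earlier layers.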
Quantized operators and fusion
TFLite Micro, CMSIS-NN, and (AFAIK) all other microcontroller AI platforms write code for "fused operators" - e.g. a convolution combined with a bias addition, ReLU activation, and requantization. This is good for a few reasons - it prevents us from having to store "intermediate results", it lets us combine steps from different operators, and it makes parts of the code easier to write.
Until recently this wasn't possible with TVM; #12398 enabled it for Hexagon, and I've done the same thing here for Arm. I've also added strategy functions for 2D quantized convolutions on Arm, though (a) only some cases are supported and (b) the `qnn.Legalize` pass must be disabled for these to be used.

TVMScript convolution schedules
For a while, TE has had a known limitation that makes it impossible to fuse certain operators when they follow `reduce` operations. This meant microTVM would generate code like the following:

I previously looked into this limitation, and with the help of Eric L. and others realized it would be really annoying to fix. Instead, our schedule has been replaced with a `T.prim_func`, which lets us do this fusion (and have much more fine-grained control in general).
I hit a few bugs doing this (e.g. #13330), and the limited docs for TVMScript meant I had to make some guesses about the right way to do things. It's totally possible this code is gross - I'll describe these issues more in a comment below. However, the generated code looks much nicer.
New optimized C intrinsic for convolutions
A few weeks ago, I wrote a faster version of microTVM's tensordot kernel. That got folded into this PR, as that schedule was not usable on its own. I've added a unit test `test_topi_conv2d_tensordot_opts` that goes into more detail about what the schedule does and why it is fast, but here's just a taste.
Our previous microTVM-specific schedule for regular `conv2d` was not very good, and was slower than just autotuning a generic implementation (for this reason, OctoML used a generic autotuned schedule to submit microTVM results to MLPerf Tiny). However, there are major limitations for how far an autotuning + C code generation approach can go, as GCC only uses the fast intrinsic functions in super narrow cases.
For example, here is how microTVM would previously generate the inner loop of a 1x1 4-channel convolution:

Arm GCC 12.2 (with flags `-mcpu=cortex-m4 -O3`) compiles this into instructions taking 29 cycles per output generated. That's not good, and the previous microTVM schedule was even worse.
The new implementation in `tensordot.py` instead gets compiled into just 15 cycles (though there is still work to be done to get this even lower):

This is a very simple case, but we also have good support and tests for complex cases. We can work on data where the start pointers aren't word aligned, work on data where one or more of the data, kernel, or output has a width not divisible by the SIMD width, and keep multiple sums running concurrently to reduce the number of memory loads (e.g. for 3x3 depthwise convolutions). The unit test checks all these capabilities, and the `tensordot.py` file itself has comments explaining why doing it this way is faster.

Faster re-quantization algorithm!
The way microTVM handled convolutions before was terrible. Here is an actual implementation from our MLPerf Tiny submission, which I've modified slightly for readability.

There are a bunch of things about this that aren't ideal:
* […] `int64` values, and they are all padded with unnecessary zeros. This means we need to load eight words from memory for each re-quantization operation.
* […] `int64` ops, which are slow because Arm Cortex-M is a 32-bit platform.
* `int8` bounds checking is done with a wacky set of ternary operators. I checked - these do not get compiled down nicely.

I've fixed all these things using QNN `alter_op_layout` functions, and I've implemented a few more complex optimizations:
* […] `bias` by `bias + sum(kernel) * input_zero_point` (i.e. pre-multiplying the kernel values by the input zero point). This prevents us from having to subtract out the zero point every time we do a multiplication by a kernel value (note that the input zero point is `-128` basically every time, because Cortex-M does not have a `uint x int` instruction). The result is stored in an `int32` value.
* […] `32` bits from our `int32 x int32` multiplication. This lets us use zero `int64` memory loads or instructions, without sacrificing accuracy.

Together, this means our requantization code now looks like this:

All in all, requantization now takes ~8x fewer cycles per output than it did before.
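To make the two tricks concrete, here is a minimal Python sketch of the arithmetic. It is an illustration under stated assumptions, not the code from this PR: the function names are hypothetical, rounding is omitted, and it assumes the convention `real = scale * (q - zero_point)`, under which the folded term is subtracted (with the opposite sign convention it would be added).

```python
def fold_zero_point(bias, kernel, input_zero_point):
    """Fold the input zero point into the bias once, ahead of time.

    Assumes sum((x - zp) * k) + bias == sum(x * k) + (bias - zp * sum(k)).
    """
    return bias - input_zero_point * sum(kernel)


def requantize(acc, multiplier, output_zero_point=0):
    """Scale acc by multiplier / 2**32, then clip to the int8 range.

    Keeping only the high 32 bits of the int32 x int32 product is modelled
    here by an arithmetic right shift of Python's unbounded integer.
    """
    high = (acc * multiplier) >> 32
    return max(-128, min(127, high + output_zero_point))


def conv_output(inputs, kernel, folded_bias, multiplier, output_zero_point=0):
    """One fused output: multiply-accumulate, folded bias, requantize."""
    acc = folded_bias + sum(x * k for x, k in zip(inputs, kernel))
    return requantize(acc, multiplier, output_zero_point)


inputs, kernel, bias, zero_point = [-128, 0, 127, 5], [1, -2, 3, 4], 10, -128
folded = fold_zero_point(bias, kernel, zero_point)
naive_acc = bias + sum((x - zero_point) * k for x, k in zip(inputs, kernel))
```

The folded bias is computed once per output channel, so the inner loop only performs `x * k` multiply-accumulates, and the scaling step needs nothing wider than a 32-bit register for its result.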