[microTVM] Modernize Arm Cortex-M convolution schedules #13242

guberti · 2022-10-31T18:23:35Z

For a long time, I've been unhappy with TVM's TE-based convolution schedules for Arm Cortex-M. They were a lot slower than the state-of-the-art, and had a lot of strange inefficiencies caused by limitations of TE.

This pull request rewrites regular and depthwise convolution schedules on Arm Cortex-M, using MetaSchedule and TIR to make them much faster. It took some work and ended up being a big PR (as many of these changes depend on the others), but I'm really happy with the result.

High level changes

Adds a qnn operator strategy to TVM for Arm Cortex-M. With this change, we are able to skip the QNN lowering pass, letting us use Cortex-M specific implementations of qnn_conv2d, add, and requantize that perform much better.
Performs operator fusion on the (very common) convolution + bias + requantize block for Cortex-M. This lets us write one schedule that performs all three of these steps, which improves performance by letting us compute multiple operator outputs concurrently (which reduces the number of times the data and weights are loaded from memory).
Adds a TIR/TVMScript schedule for the convolution + bias + requantize block described above. This fixes a longstanding limitation of TE which required us to write convolution outputs to an intermediate buffer and took a lot of time.
Uses a new, highly optimized C/assembly extern function to actually perform the convolution. I spent a ton of time inspecting the assembly it compiles into and verifying that it is really fast.
Overhauls the way requantization is done on Cortex-M by adding new alter_op_layout functions for add and requantize. This reduces the amount of memory loaded during each requantization by over 5x with some snazzy tricks (pre-multiplying the kernel values with the input zero point, skipping the "shift" step in our floating point multiplication approximation, fusing the bias with the pre-multiplied zero point).
Adds a new, end-to-end Corstone300 test that runs the MLPerf Tiny vww model using TFLite and ensures our implementation (with all the optimizations above) produces the same outputs. This is done by layer, so if there is ever an accuracy issue, we will know exactly which layer is causing the problem.

TFLite-ground-truth Corstone300 Test

For a while, microTVM has had Corstone300 tests which compare our schedules for regular nn ops to implementations elsewhere in TVM, to make sure the schedules are written correctly. Despite this, we've had some accuracy issues (see #13364) when running models end-to-end, and we don't really have tools to debug these.

The way I see it, the existing tests have two key limitations. They:

Compare our results only to other TVM implementations.
Run only the base operators (e.g. just the nn.conv2d), while leaving out the bias and requantize operations (which are normally fused).

To fix this, I've added test_quantized_convolution.py in this PR. This test runs the convolution layers of the vww model from MLPerf Tiny using TensorFlow's TFLite Interpreter, while saving all the intermediate layer outputs.

Then, one by one, each layer is loaded with TVM, the full operator (with fused convolution, bias, ReLU, and requantization) is run on Corstone300, and the output is compared to TFLite's result.
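The layer-by-layer comparison can be sketched as follows. This is a minimal illustration and not the code in test_quantized_convolution.py: the function name `first_mismatched_layer`, the layer names, and the dictionary-of-lists data layout are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def first_mismatched_layer(
    reference: Dict[str, List[int]],
    candidate: Dict[str, List[int]],
) -> Optional[str]:
    """Return the name of the first layer whose outputs differ, or None.

    `reference` holds ground-truth outputs per layer (e.g. saved from a
    reference interpreter run); `candidate` holds outputs for the same layers
    from the implementation under test. Layers are checked in the insertion
    order of `reference`.
    """
    for name, expected in reference.items():
        if candidate.get(name) != expected:
            return name
    return None


# Hypothetical saved outputs for three layers; the second one disagrees.
reference = {"conv_0": [1, -3, 7], "dwconv_1": [0, 5, 5], "conv_2": [9, 9, -128]}
candidate = {"conv_0": [1, -3, 7], "dwconv_1": [0, 5, 4], "conv_2": [9, 9, -128]}
print(first_mismatched_layer(reference, candidate))  # dwconv_1
```

Reporting the first failing layer by name is what makes an accuracy regression easy to localize.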

Quantized operators and fusion

TFLite Micro, CMSIS-NN, and (AFAIK) all other microcontroller AI platforms write code for "fused operators" - e.g. a convolution combined with a bias addition, ReLU activation, and requantization. This is good for a few reasons - it prevents us from having to store "intermediate results", it lets us combine steps from different operators, and it makes parts of the code easier to write.
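To make the "intermediate results" point concrete, here is a small pure-Python sketch of a 1x1 convolution followed by bias, ReLU, and requantization, written once as two passes with an intermediate buffer and once as a single fused loop. All names and the simplified requantization (a fixed-point multiply with a rounding shift) are assumptions for illustration, and ReLU is applied to the accumulator for simplicity; this is not TVM's implementation.

```python
def requantize(acc, multiplier, shift, zero_point):
    """Scale an int32 accumulator down to int8 with a fixed-point multiplier."""
    scaled = (acc * multiplier + (1 << (shift - 1))) >> shift  # round to nearest
    return max(-128, min(127, scaled + zero_point))


def unfused(pixels, weights, bias, multiplier, shift, zero_point):
    # Pass 1: write every raw convolution sum to an intermediate buffer.
    intermediate = [
        [sum(x * w for x, w in zip(pixel, row)) for row in weights]
        for pixel in pixels
    ]
    # Pass 2: read the sums back to apply bias, ReLU, and requantization.
    return [
        [
            requantize(max(0, acc + b), multiplier, shift, zero_point)
            for acc, b in zip(sums, bias)
        ]
        for sums in intermediate
    ]


def fused(pixels, weights, bias, multiplier, shift, zero_point):
    out = []
    for pixel in pixels:
        row_out = []
        for row, b in zip(weights, bias):
            acc = b  # start the accumulator at the bias
            for x, w in zip(pixel, row):
                acc += x * w
            # Finish the output element while the sum is still "in a register".
            row_out.append(requantize(max(0, acc), multiplier, shift, zero_point))
        out.append(row_out)
    return out


pixels = [[1, 2, 3, 4], [-5, 6, -7, 8]]      # two pixels, four input channels
weights = [[1, 0, -1, 2], [3, 3, 3, 3]]      # two output channels
args = (pixels, weights, [10, -100], 3, 2, -128)
print(fused(*args))  # [[-116, -128], [-107, -128]]
```

Both versions produce the same values; the fused one never materializes the int32 sums in memory.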

This wasn't possible with TVM until recently, when #12398 enabled it for Hexagon. I've done the same thing here for Arm. I've also added strategy functions for 2D quantized convolutions on Arm, though (a) only some cases are supported and (b) the qnn.Legalize pass must be disabled for these to be used.

TVMScript convolution schedules

For a while, TE has had a known limitation that makes it impossible to fuse certain operators when they follow reduce operations. This meant microTVM would generate code like the following:

for (int32_t k_outer = 0; k_outer < 2; ++k_outer) {
  for (int32_t i = 0; i < 48; ++i) {
    for (int32_t j = 0; j < 48; ++j) {
      int32_t cse_var_4 = (j * 8);
      int32_t cse_var_3 = (k_outer * 4);
      // Writes data to a buffer in memory
      convolution_helper_function(
        (&(((int32_t*)depthwise_conv2d)[(((i * 384) + cse_var_4) + cse_var_3)])), 
        (&(((int8_t*)padded_data)[(((i * 400) + cse_var_4) + cse_var_3)])), 
        (&(T_reshape[(k_outer * 36)])));
    }
  }
}
for (int32_t ax1_1 = 0; ax1_1 < 48; ++ax1_1) {
  for (int32_t ax2_1 = 0; ax2_1 < 48; ++ax2_1) {
    for (int32_t ax3_1 = 0; ax3_1 < 8; ++ax3_1) {
      int32_t cse_var_5 = (((ax1_1 * 384) + (ax2_1 * 8)) + ax3_1);
      // Then, has to read the data back before doing more operations. Would be way faster
      // to just fuse these loops, but TE won't let us.
      int32_t __1 = ((int32_t)(((((((int64_t)((int32_t*)depthwise_conv2d)[cse_var_5]) + ... 
      // The rest is omitted for brevity
    }
  }

I previously looked into this limitation, and with the help of Eric L. and others realized it would be really annoying to fix. Instead, our schedule has been replaced with a T.prim_func, which lets us do this fusion (and have much more fine-grained control in general).

I hit a few bugs doing this (e.g. #13330), and the limited docs for TVMScript meant I had to make some guesses about the right way to do things. It's totally possible this code is gross - I'll describe these issues more in a comment below. However, the generated code looks much nicer.

New optimized C intrinsic for convolutions

A few weeks ago, I wrote a faster version of microTVM's tensordot kernel. That got folded into this PR, as that schedule was not usable on its own. I've added a unit test test_topi_conv2d_tensordot_opts that goes into more detail about what the schedule does and why it is fast, but here's just a taste.

Our previous microTVM-specific schedule for regular conv2d was not very good, and was slower than just autotuning a generic implementation (for this reason, OctoML used a generic autotuned schedule to submit microTVM results to MLPerf Tiny). However, there are major limitations for how far an autotuning + C code generation approach can go, as GCC only uses the fast intrinsic functions in super narrow cases.

For example, here is how microTVM would previously generate the inner loop of a 1x1 4-channel convolution:

output[oco_1] = 0;
for (int ic_1 = 0; ic_1 < 4; ++ic_1) {
    output[oco_1] = (output[oco_1] + (((int)tensor[ic_1]) * ((int)((short*)kernel)[((oco_1 * 128) + ic_1)])));
}

Arm GCC 12.2 (with flags -mcpu=cortex-m4 -O3) compiles this into instructions taking 29 cycles per output generated. That's not good, and the previous microTVM schedule was even worse.

The new implementation in tensordot.py instead gets compiled into just 15 cycles (though there is still work to be done to get this even lower):

int tensor__y00_x00__y00_x01 = tensor[0];
int tensor__y00_x02__y00_x03 = tensor[1];

int kernel__y00_x00__y00_x01 = kernel[0];
int kernel__y00_x02__y00_x03 = kernel[1];

int sum_0 = __builtin_arm_smuad(tensor__y00_x00__y00_x01, kernel__y00_x00__y00_x01);
sum_0 = __builtin_arm_smlad(tensor__y00_x02__y00_x03, kernel__y00_x02__y00_x03, sum_0);

This is a very simple case, but we also have good support and tests for complex cases. We can work on data where the start pointers aren't word aligned, work on data where one or more of the data, kernel, or output has width not divisible by the SIMD width, have multiple sums running concurrently to reduce the number of memory loads (e.g. for 3x3 depthwise convolutions). The unit test checks all these capabilities, and the tensordot.py file itself has comments explaining why doing it this way is faster.
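For readers unfamiliar with the dual 16-bit multiply instructions, the following Python model shows what SMUAD and SMLAD compute on packed halfwords. It is a sketch for illustration only: the helper names are made up here, and overflow is modelled as plain two's-complement wraparound (the Q flag is ignored).

```python
def pack_halfwords(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word (lo in bits 0-15)."""
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)


def _signed16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def _wrap32(value):
    """Wrap to a signed 32-bit result; the Q (overflow) flag is not modelled."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def smlad(a, b, acc):
    """acc + bottom(a) * bottom(b) + top(a) * top(b)."""
    bottoms = _signed16(a) * _signed16(b)
    tops = _signed16(a >> 16) * _signed16(b >> 16)
    return _wrap32(acc + bottoms + tops)


def smuad(a, b):
    """Same as smlad, but without an accumulator input."""
    return smlad(a, b, 0)


# A 4-channel dot product of int16 values needs only two packed words per operand.
tensor = [1, -2, 3, -4]
kernel = [5, 6, -7, 8]
tensor_words = [pack_halfwords(tensor[0], tensor[1]), pack_halfwords(tensor[2], tensor[3])]
kernel_words = [pack_halfwords(kernel[0], kernel[1]), pack_halfwords(kernel[2], kernel[3])]

sum_0 = smuad(tensor_words[0], kernel_words[0])
sum_0 = smlad(tensor_words[1], kernel_words[1], sum_0)
print(sum_0)  # -60
```

Each instruction performs two multiplies and the accumulation at once, which is why the packed version needs far fewer instructions than the scalar loop.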

Faster re-quantization algorithm!

The way microTVM handled requantization before was terrible. Here is an actual implementation from our MLPerf Tiny submission, which I've modified slightly for readability.

static const int32_t fused_nn_conv2d_constant_1[8] = {
    +0x00000f80, -0x00000180, +0x00007e80, +0x00002880, +0x00010680, -0x00000980, +0x00001380, +0x0000a900
};

static const int32_t fused_nn_conv2d_subtract_constant_2[8] = {
    +0x0000306e, +0x00003092, +0x00008470, +0x00004a13, +0x0000c411, +0x00012da6, +0x00003b70, +0x00015bd8
};

static const int64_t fused_nn_conv2d_subtract_add_cast_constant_3[8] = {
    +0x000000004648e699LL, +0x0000000063c512d1LL, +0x00000000611b0293LL, +0x000000007524d8c7LL, +0x000000007758617fLL, +0x00000000590a119bLL, +0x00000000500f9336LL, +0x0000000040ee5089LL
};

static const int64_t fused_nn_conv2d_subtract_add_cast_multiply_constant_4[8] = {
    +0x0000002000000000LL, +0x0000004000000000LL, +0x0000004000000000LL, +0x0000004000000000LL, +0x0000008000000000LL, +0x0000010000000000LL, +0x0000004000000000LL, +0x0000008000000000LL
};
static const int64_t fused_nn_conv2d_subtract_add_cast_multiply_add_constant_5[8] = {
    +0x0000000000000026LL, +0x0000000000000027LL, +0x0000000000000027LL, +0x0000000000000027LL, +0x0000000000000028LL, +0x0000000000000029LL, +0x0000000000000027LL, +0x0000000000000028LL
};

void requantize(void* compute, int32_t conv[8], int32_t i0_i1_outer_fused, int32_t i2_outer, int32_t i3_outer) {
  // Reorganized by @guberti for readability
  int64_t _0 = ((int64_t)conv[i3_outer]) + ((int64_t)(fused_nn_conv2d_subtract_constant_2)[i3_outer]);
  int64_t _1 = _0 - ((int64_t)(fused_nn_conv2d_constant_1)[i3_outer]);
  int64_t _2 = _1 * fused_nn_conv2d_subtract_add_cast_constant_3[i3_outer];
  int64_t _3 = _2 + fused_nn_conv2d_subtract_add_cast_multiply_constant_4[i3_outer];
  int64_t _4 = _3 >> fused_nn_conv2d_subtract_add_cast_multiply_add_constant_5[i3_outer];
  int32_t __1 = ((int32_t) (_4)) - 128;

  // Code below is untouched
  int32_t __2 = (__1) < (127) ? (__1) : (127);
  int8_t __3 = (int8_t)((__2) > (-128) ? (__2) : (-128));
  int8_t __4 = (int8_t)127;
  int8_t __5 = (__3) < (__4) ? (__3) : (__4);
  int8_t __6 = (int8_t)-128;
  ((int8_t*)compute)[(((i0_i1_outer_fused * 384) + (i2_outer * 8)) + i3_outer)] = ((__5) > (__6) ? (__5) : (__6));
}

There are a bunch of things about this that aren't ideal:

We have FIVE re-quantization constants (not counting the zero point, which is correctly inlined).
3/5 of the constants are int64 values, and they are all padded with unnecessary zeros. This means we need to load eight words from memory for each re-quantization operation.
The first five math operations are int64 ops, which are slow because Arm Cortex-M is a 32-bit platform.
int8 bounds checking is done with a wacky set of ternary operators. I checked - these do not get compiled down nicely.

I've fixed all these things using QNN alter_op_layout functions, and I've implemented a few more complex optimizations:

When applicable, we now replace the bias with bias - sum(kernel) * input_zero_point (i.e. we pre-multiply the kernel values by the input zero point and fold the result into the bias). This saves us from having to subtract the input zero point every time we multiply by a kernel value (note that the input zero point is -128 basically every time, because Cortex-M does not have a uint x int instruction). The result is stored as an int32 value.
We force bitshifts to be >=33 (which in practice they always are), which allows us to only use the top 32 bits from our int32 x int32 multiplication. This lets us use zero int64 memory loads or instructions, without sacrificing accuracy.
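The zero-point folding is a simple algebraic identity, sketched below in Python. The function names are hypothetical, and the sketch assumes the usual convention that a quantized value q represents scale * (q - zero_point); under that convention the folded constant is bias - input_zero_point * sum(kernel), which for a zero point of -128 adds 128 * sum(kernel) to the bias.

```python
def conv_with_zero_point(inputs, kernel, bias, input_zero_point):
    """Reference: subtract the zero point from every input before multiplying."""
    return bias + sum((x - input_zero_point) * k for x, k in zip(inputs, kernel))


def fold_zero_point(kernel, bias, input_zero_point):
    """Precompute the input-independent part once, ahead of time."""
    return bias - input_zero_point * sum(kernel)


def conv_folded(inputs, kernel, folded_bias):
    """Inner loop with no per-element zero-point subtraction."""
    return folded_bias + sum(x * k for x, k in zip(inputs, kernel))


inputs = [-128, 0, 127, 5]
kernel = [3, -2, 7, 1]
bias, input_zero_point = 100, -128

folded_bias = fold_zero_point(kernel, bias, input_zero_point)
print(folded_bias)                                   # 1252
print(conv_folded(inputs, kernel, folded_bias))      # 1762
```

Because the folded bias depends only on constants, it can be computed at compile time, leaving a plain multiply-accumulate loop at run time.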

Together, this means our requantization code now looks like this:

static const int32_t REQUANTIZE_SCALE[8] = {
    +0x067bed1c, +0x05c578b9, +0x03d08ea5, +0x01ed066f, 
    +0x0176b86f, +0x027f3977, +0x054e3783, +0x06d0a442, 
};

static const int32_t BIAS[16] = {
    +0x00006b58, -0x000023aa, +0x00005cf3, +0x00004cc6, 
    +0x0000605e, +0x00006eed, +0x0000512e, -0x00002526, 
};
// Some lines omitted for brevity

// Bias is added before convolution, as doing it this way is faster
int requant_0 = (sum_0 * (long long) REQUANTIZE_SCALE[j]) >> 32;
requant_0 = (requant_0 + 1) >> 1;
requant_0 = __builtin_arm_ssat(requant_0 - 128, 8);
((short*) output)[0] = (short) requant_0;

All in all, requantization now takes ~8x fewer cycles per output than it did before.
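The three requantization steps in the snippet above can be modelled in Python as follows. This is a sketch, not the generated code: it assumes a total shift of exactly 33 bits (the high word of the product, then one rounding shift) and an output zero point of -128, as in the snippet, and `ssat` here is a plain clamp standing in for the saturating instruction.

```python
def ssat(value, bits):
    """Clamp to the signed range representable in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def requantize(acc, scale, output_zero_point=-128):
    high = (acc * scale) >> 32   # keep only the top 32 bits of the 64-bit product
    rounded = (high + 1) >> 1    # one more shift, rounding half up
    return ssat(rounded + output_zero_point, 8)


scale = 0x067BED1C  # first entry of REQUANTIZE_SCALE above
print(requantize(5000, scale))        # -65
print(requantize(1_000_000, scale))   # 127 (saturated)
```

The effective real-valued multiplier is scale / 2**33 (about 0.0127 here), so 5000 maps to roughly 63 before the zero point is added.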

tvm-bot · 2022-10-31T18:23:39Z

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

cc @alanmacd, @gromero, @leandron, @mehrdadh (see #10317 for details)
Built docs for commit e5e37cd can be found here.

Generated by tvm-bot

guberti · 2022-11-17T21:08:13Z

This pull request is ready for review! Would love reviews from @mkatanbaf (who's doing some microTVM + MetaSchedule work), @areusch, and @ekalda. Would also love a look from someone who's more familiar with TVMScript, and can critique my use of it :).

That said, there are a few known issues in this PR I still need to fix:

There is a hack where I read dummy data from a TVMScript buffer to prevent TVM from seeing the buffers as "unused". I think I'm supposed to use T.reads/T.writes in this situation, but I could not make those functions work.
It's kinda gross for me to use alter_op to change the requantize ops to be integers. It would be much better if I could do this in the TIR schedules, but this does not work as I cannot alter the requantize constants. I had to disable a type check to make this hack work, so I would like to find a different solution before merging.
~~The output zero point for requantization in tensordot.py is a fixed value of -128. I need to fix this to be dynamic.~~ Done!

In a following PR, I'll also address:

The out_layout attribute is not supported for my conv2d or depthwise_conv2d schedules. Adding this will let me get some timing results!

python/tvm/testing/aot.py

areusch

did a first pass here, thanks @guberti !

python/tvm/relay/op/nn/_nn.py

python/tvm/relay/qnn/strategy/arm_cpu.py

areusch · 2022-11-18T15:20:20Z

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

+    """Addition is commutative, so we could add the bias before, during, or after performing our
+    multiply-accumulate operations. It "costs" one cycle either way - if done at the beginning we
+    can't use a SMULXY trick to set sum_i to zero for "free", and if done at the end it doesn't
+    combine with anything. However, doing it at the beginning frees up a register/prevents needing


what about overflow?

The order of bias addition does not change the overflow behavior. This comment is just stating we could do the additions as:
$$A_1 B_1 + A_2 B_2 + \cdots + A_n B_n + \text{bias}$$
OR as:
$$\text{bias} + A_1 B_1 + A_2 B_2 + \cdots + A_n B_n$$
I've changed the wording a bit to make this clearer.
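A quick way to convince yourself of this is to model the accumulator as a 32-bit two's-complement register, which is how the accumulate in instructions like SMLAD behaves. The sketch below is an illustration under that assumption (it says nothing about signed overflow in C source code); the values are made up so that some orderings overflow temporarily.

```python
from itertools import permutations


def wrap32(value):
    """Reduce an integer to signed 32-bit two's-complement."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & (1 << 31) else value


def accumulate(terms):
    total = 0
    for term in terms:
        total = wrap32(total + term)  # wrap after every single addition
    return total


products = [2_000_000_000, 1_500_000_000, -1_800_000_000]
bias = 400_000_000

# Try the bias in every position relative to the products.
results = {accumulate(order) for order in permutations(products + [bias])}
print(results)  # {2100000000}
```

Every ordering yields the same final value, even those whose partial sums wrap, because addition modulo 2^32 is commutative and associative.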

areusch · 2022-11-18T16:24:09Z

src/relay/qnn/op/requantize.cc

  // Check and assign types for scale and zero points.
-  AssignType(types[1], DataType::Float(32), axis_shape, reporter);  // input_scale
-  AssignType(types[2], DataType::Int(32), axis_shape, reporter);    // input_zero_pt
+  // AssignType(types[1], DataType::Float(32), axis_shape, reporter);  // input_scale


Fixed - this PR should not change requantize.cc.

However, it is a bit of a tricky issue. In qnn_alter_op.py, I want to manually choose the int32 requantize scale to improve performance. However, Relay's requantize op only allows the output scale to be a float32.

I get around this by storing the scale data as a float32 array with the correct bytes, and reading it back as an int32 array. I've added a comment to qnn_alter_op.py to better explain what happens here. This is pretty gross.
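The bit-reinterpretation trick can be illustrated with Python's `struct` module. This is a sketch of the idea only, not the code in qnn_alter_op.py; the function names are hypothetical, and bit patterns that decode to NaN are not handled (they may not round-trip exactly through a Python float).

```python
import struct


def int32_as_float32_bits(values):
    """Reinterpret int32 bit patterns as float32 values (no numeric conversion)."""
    raw = struct.pack(f"<{len(values)}i", *values)
    return list(struct.unpack(f"<{len(values)}f", raw))


def float32_bits_as_int32(values):
    """Inverse of the above: recover the original int32 bit patterns."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return list(struct.unpack(f"<{len(values)}i", raw))


scales = [0x067BED1C, 0x05C578B9, 0x03D08EA5]
smuggled = int32_as_float32_bits(scales)   # tiny, meaningless-looking floats
recovered = float32_bits_as_int32(smuggled)
print([hex(v) for v in recovered])  # ['0x67bed1c', '0x5c578b9', '0x3d08ea5']
```

The "float" values carry no numeric meaning; they only transport the 32 bits through an interface that insists on float32.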

Longer term, I'd love to add a new Relay op IntegerRequantize that takes int32 scale and shift arguments, which will let us solve this problem in a nice way. Would love your thoughts on the right way to address this!

tests/python/relay/strategy/arm_cpu/test_quantized_convolution.py

mkatanbaf

Great work @guberti! I added a few comments, mostly asking for clarifications.

python/tvm/relay/qnn/strategy/arm_cpu.py

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

mkatanbaf · 2022-11-21T18:54:46Z

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

+def _apply_simd_optimizations(instruction_tuples) -> Iterator[Tuple]:
+    """When possible, fuses single MACs into SIMD MAC instructions.
+
+    The compiler cannot do this automatically, as calling __builtin_arm_smlaxy forces the SMLAxy


I'm not sure if I understand this correctly, but does this mean that we will unroll the loop and get a long list of instructions instead? would this significantly increase the code size?

Yes, the inner reduction loops will always be unrolled (this occurs in _get_draft_macs). We will often unroll even more than this, either as another unrolled copy of the inner loops for odd-numbered channels (this happens e.g. for 3x3 depthwise convolutions) or by computing multiple sums at the same time (i.e. when num_sums > 1).

Compared with the naive approach, this does increase code size. However, the increase is very small - for example, unrolling a 3x3 depthwise convolution might take ~10 extra instructions, or 0.01 KB more flash size. This is well worth it, as unrolling dramatically reduces overhead and increases speed by ~2x. The previous tensordot implementation also unrolled these loops for the same reason.

tests/python/relay/strategy/arm_cpu/test_quantized_convolution.py

src/relay/qnn/op/convolution.cc

Mousius · 2022-11-23T12:05:22Z

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

+        # Arm GCC does not have `__builtin_arm_smlabt`, even though `__builtin_arm_smlatt`,
+        # `__builtin_arm_smlatb`, `__builtin_arm_smlad` and so on all exist. Perhaps this is a
+        # choice, since we can just use `smlabt` with the argument order swapped instead? Note that
+        # `__builtin_arm_smlabt` exists on most compilers (e.g. Clang) - this is just a GCC thing.
+        if instruction == "smlabt":
+            yield f"sum_{index} = __builtin_arm_smlatb({op2}, {op1}, sum_{index});"
+        else:
+            yield f"sum_{index} = __builtin_arm_{instruction}({op1}, {op2}, sum_{index});"


I believe this is because you're using the builtins directly rather than using the ACLE interface (https://arm-software.github.io/acle/main/acle.html#accumulating-multiplications). I'm unsure how much of a guarantee you get with built-ins, so I would move to the ACLE interface anyway.

Also see: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm_acle.h#L661-L675 😸
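The argument swap in the quoted snippet works because SMLABT multiplies the bottom halfword of its first operand by the top halfword of its second, while SMLATB does the reverse. A small Python model (helper names are illustrative; accumulator overflow is not modelled) confirms that the two forms agree once the operands are swapped:

```python
def _signed16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def bottom(word):
    return _signed16(word)


def top(word):
    return _signed16(word >> 16)


def smlabt(a, b, acc):
    """acc + bottom(a) * top(b)"""
    return acc + bottom(a) * top(b)


def smlatb(a, b, acc):
    """acc + top(a) * bottom(b)"""
    return acc + top(a) * bottom(b)


op1 = 0x1234FFFE  # top = 0x1234, bottom = -2
op2 = 0x80000003  # top = -32768, bottom = 3
print(smlabt(op1, op2, 10))  # 65546
print(smlatb(op2, op1, 10))  # 65546
```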

Mousius · 2022-11-23T12:11:26Z

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

-        (
-            f"""
-        #include <stdint.h>
-        #include <arm_nnsupportfunctions.h>


Yay! I think this solves the same problem as #13363 😸 !

guberti · 2022-11-23T17:04:05Z

@Mousius I'm a fan of switching to use ACLE! I originally used the __builtin functions simply because CMSIS-NN used them, but ACLE seems more stable. I've updated this PR to use ACLE.

This PR only affects tensordot.py, though. We still need #13363 to switch the rest of the micro schedules to ACLE.

alanmacd · 2022-11-23T23:15:34Z

tests/python/relay/strategy/arm_cpu/test_quantized_convolution.py

+# under the License.
+"""microTVM cares a lot about the convolution + bias + requantize + fused ReLU use case. There have
+been some accuracy issues in the past, so this test steps through a model (MobileNetV1) layer by
+layer and ensures there is 1-1 correspondance at each step. This test would run way faster if we ran


This is very cool, great idea!

areusch

thanks @guberti, did a more fine-grained pass now.

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

areusch · 2022-11-30T17:25:13Z

python/tvm/topi/arm_cpu/qnn.py

+        scale = T.match_buffer(scale_handle, scale_shape)
+        output = T.match_buffer(output_handle, output_shape, dtype="int16")
+
+        # This hack prevents TVM from seeing these variables as "unused". I should be using T.reads


can you file a bug for this?

I'm not sure if this is user error on my part, or an issue with TVM. I'll look around a bit and file an issue if it seems to be a bug.

Hi, apologies for bringing up an old PR thread, I just ran into a similar problem, was an issue filed in the end? If so, could you possibly point me to it?

@lhutton1 A bug still needs to be filed here - I meant to write up a small reproducible example, but never got around to it.

Thanks, I'll take a look into it :)

python/tvm/topi/arm_cpu/qnn.py

guberti · 2022-12-01T15:48:56Z

Thanks for the detailed review @areusch - your comments should be addressed by 9bd3598.

areusch

thanks @guberti, this is basically ready, i've highlighted a couple last areas (in particular the doctest). feel free to merge once you've addressed!

python/tvm/topi/arm_cpu/qnn.py

areusch · 2022-12-05T21:15:24Z

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

-including regular conv2d, depthwise conv2d, and grouped conv2d provided the data and kernel layouts
-are the optimal ones. When groups=1, the optimal data layout is NHWC and kernel layout is OHWI. When
-this is a depthwise convolution, the optimal data layout is NCHW and kernel layout is OIHW."""
+"""Generates optimized code to compute a tensor dot product on ARMv7E-M.


does this apply to v8-M also?

Sometimes - this uses the DSP instructions, which are required in v7E-M but optional in v8-M. This code also does not use MVE, which is optional in v8-M but would be really useful for deep learning. I've clarified this in the docstring.

python/tvm/topi/arm_cpu/mprofile/dsp/micro_kernel/tensordot.py

Get QNN strategy running
QNN strategy with operator fusion

Assembly tensordot from other PR
Tensordot offset support
Hand tested tensordot code

Formatting fixes
Don't use automatic AOT building when skipping pass
Assorted tech for scheduling with TIR
Hacky int16 support

Bugged schedule implementation
Passing test!
Works for all 1x1 conv2ds!
External QNN operator altering
Debugging work
Pad with correct constant
Broadly functional conv2d
Reorganize quantize convolution test

Working depthwise convolution for strides=1
Working depthwise convolution!

Support Python 3.7
Clean up code to prepare for review

Second round of code review
Fix tensordot opts test

guberti · 2022-12-06T16:27:39Z

I've addressed the comments from @areusch, so per his instructions I'm merging this. Thanks for the feedback!

@mehrdadh

In #13242, I rewrote microTVM's convolution schedules to give a major improvement in performance. While I demonstrated in tests that my changes worked, they could not be used with relay.build.

This pull request expands the functionality of #13242 and adds new legalize and alter_op passes to take advantage of the quantized schedules. This dramatically improves performance on some models, dramatically cuts RAM usage, and removes the need for autotuning on microTVM. More specifically, for the vww model from MLPerf Tiny running on the nucleo_l4r5zi, this pull request:

- Improves untuned performance from 1741 ms to 137 ms - a 6.8x improvement!
- Improves tuned performance from 337 ms to 137 ms.
- Sets a new state-of-the-art for MLPerf Tiny, beating Plumerai's previous 208 ms record.
- Reduces RAM consumption by 73 KB (a large amount on microcontrollers!) by eliminating intermediate buffers.
- Reduces flash consumption for model weights by 5x.
- Slightly improves accuracy.

@mehrdadh has kindly tested these changes himself, and has confirmed my 137 ms figure.

To enable the schedules that grant these performance improvements, this pull request:

1. Adds out_layout support to the regular and depthwise conv2d schedules from #13242.
2. Generalizes the schedules from #13242 to be more widely applicable.
3. Adds a layout alteration pass to ensure regular and depthwise conv2d schedules always get their desired input formats.
4. Adds a conv2d -> depthwise conv2d -> unpadded conv2d rewrite step to remove empty channels from conv2d operators.
5. Adds a conv2d -> average pool -> dense rewrite step to remove empty channels from conv2d operators.
6. Adds an alter_op pass to fold padding into a separate Relay operator.

* [microTVM] Fix tvmc tutorial (#14076) This PR applies appropriate changes to make sure the CI fails if micro_tvmc.sh tutorial fails. This issue was captured in #14074. This PR also makes changes to avoid this breakage in bash script tutorials in future. In addition, this PR fixes the bug in running TVMC tutorial which happened due to renaming zephyr_board to board. * [MetaSchedule] Introduce Async Pipeline in MultiLevelTiling (#14009) This PR introduces async pipeline in the current TVM's MultiLevelTiling Rules. This PR is based on apache/tvm#13966, which is already merged. This is because some conv2d workload will use `tir.if_then_else` to pad the input to the correct size, and this PR uses async copy in such copy statement. 1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotate async copy for mlt in supported arch (>= sm80). In CUDA Core, this PR has a perf boost of around 1T GFLOP/s in most Conv2d test cases and 1T ~ 2T in most GEMM test cases. All generated codes, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark. Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` in conv2d NHWC workload, with a RTX 3080 GPU. 
**Notice: given the stochastic nature of evolutionary search, perfromance might become worse if enable this PR.** Workload: Conv2d NHWC |Shape|Mainline TVM|Mainline TVM with Async|Performance Boost| |-|-|-|-| |N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%| |N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%| |N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%| |N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%| |N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%| |N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%| |N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%| |N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%| |N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%| |N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%| |N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%| |N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%| |N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%| |N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%| |N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%| |N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%| |N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%| Workload: GEMM NN |Shape|Mainline TVM|Mainline TVM with Async|Performance Boost| |-|-|-|-| 
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%| |M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%| |M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%| |M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%| |M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%| |M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%| |M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%| |M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%| |M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%| |M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%| |M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%| |M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%| * [TVMScript] Use op attribute to control whether to print dtype in TVMScript (#14111) This PR adds an op attribute `TScriptDtypePrintLocation`, and modifies the dtype printing logic of the builtin op to check this attribute. So that user defined operators can use it to specify how there dtype argument are printed by appending attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`. * [Fix][TVMScript] Fix index of metadata in printed script (#14130) Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced for more than one times in an IRModule, each reference will have different indices of the metadata array. For example, this code ``` str_imm = T.StringImm("aaa\nbbb\n") @I.ir_module class Module: @T.prim_func def foo() -> None: A = str_imm B = str_imm @T.prim_func def foo1() -> None: A = str_imm Module.show() ``` where `str_imm` is referenced three times, will generate such output: ``` @I.ir_module class Module: @T.prim_func def foo(): A: T.handle = metadata["tir.StringImm"][0] B: T.handle = metadata["tir.StringImm"][1] T.evaluate(0) @T.prim_func def foo1(): A: T.handle = metadata["tir.StringImm"][2] T.evaluate(0) ``` Each time has a different metadata index. 
This PR fixes this problem by detecting duplicate item in `IRDocsifierNode::AddMetadata`. * [Pytorch] frontend full_impl fix (#14122) Minor fix in pytorch frontend to compile gpt2 model, reproduce script. torch_version = 1.13.1 transformers_version = 4.26.1 ``` from transformers import GPT2LMHeadModel import torch import tvm from tvm import relay inp = torch.ones((1, 128)).to(torch.int64) input_shapes = [("input_ids", ((1, 128), "int64"))] model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False) trace_model = torch.jit.trace(model, inp, strict=False) outputs = trace_model(inp) mod, params = relay.frontend.from_pytorch(trace_model, input_shapes) with tvm.transform.PassContext(opt_level=3): lib = relay.build(mod, target='llvm', params=params) runtime = tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0))) runtime.set_input("input_ids", inp.numpy()) runtime.run() out = runtime.get_output(0).numpy() print(out) print('Done...') ``` Before the fix, the error message ``` Traceback (most recent call last): File "gpt2_compile.py", line 13, in <module> mod, params = relay.frontend.from_pytorch(trace_model, input_shapes) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators relay_out = relay_op( File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full return self.full_impl(data, fill_value, dtype) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl fill_value = _expr.const(fill_value, dtype=dtype) File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const raise ValueError("value has to be scalar or NDArray") ValueError: value has to be scalar or NDArray ``` because `fill_value` is ``` %0 = cast(64, 
dtype="float32"); power(%0, 0.5f) ``` * [DOCKER] Configurable NDK version support (#14000) Let the Android NDK version configurable as a command line argument * [Fix][TIR] SampleCategorical apply-to-schedule (#14133) This PR is another way to fix the issue described in #14118. Since we do not have a standard for json file on the format of float numbers (for example, we cannot require a json file producer to print the "integer" float numbers with at least one decimal), and the json parser is not responsible for determining if an integer in a json file should be parsed to a float or an int, the most convenient way of fixing the SampleCategorical issue will be allowing both FloatImms and IntImms as input, and converting all IntImms to FloatImms accordingly. This PR fixes the issue in this way. * [Arith] ConstIntBound was incorrectly assuming bounds were over int64… (#13918) [Arith] ConstIntBound was incorrectly assuming bounds were over int64_t range This commit improved the floormod and floordiv conversion check to be simpler for the negative range and adds a test to cover all integer data types. * [CMSIS-NN] Reduction in code size of AOT test runner binary (#13815) * [CMSIS-NN] Reduction in code size of AOT test runner binary Co-authored-by: Ashutosh Parkhi <ashutosh.parkhi@arm.com> * [CMSIS-NN] Add a runtime error message (#13643) [CMSIS-NN] Add a runtime error message APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN errors caught in the backend. AOT test runner was improved to observe the contents of this global variable. A test was added to check for the last set error as part of this commit. * [CRT]Cleanup unused macros in crt_config.h.template (#14125) This PR removes old macros in crt_config.h.template. * [Fix][Relay] Fix axis transformation in squeeze shape function (#14135) * fix squeeze shape function issue and add testcase. 
  * fix lint

* [Unittest] merge test_cp_async_in_if_then_else into test_tir_transform_inject_ptx_async_copy (#14138)
This PR merges two related unit tests into one.

* [Frontend][TFLite] Fix conv2d import bug (#14124)
  * Fix TFLite frontend bug and add test
  * lint

* [ONNX][TORCH] Replace scatter op by scatter_elements (#14019)
  * remove scatter attr class
  * update pytorch: scatter was replaced by scatter_elements
  * remove scatter compute and strategy registration
  * remove scatter attrs registration
  * update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks
  * update oneflow front-end
  * update paddlepaddle front-end
  * update pytorch utils
  * remove front-end scatter definition
  * fix scatter strategy for rocm
  * small update
  * remove scatter definition in back-end
  * remove scatter strategy for cuda, gpu. transfer special case to scatter_elements
  * fix test
  * small fix
  * upstream scatter with torch description
  * last upstream of scatter in pytorch front-end
  * fix reduction attribute in cuda strategy
  * set scalar to test instead of tensor. update check for dynamic dim
  * skip scalar source check in tests for scatter due to issue on torch side
  * remove scatter op implementation from topi/cuda
  * remove scatter op implementation from topi. small clean code
---------
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>

* [TVMScript][Printer] Remove relax prefix for now (#14140)
This PR removes the relax prefix from the printer for now. These setups are useful and do not cause any technical debt in the codebase, but we remove them given requests. They can be added back to the unity branch, and later as part of upstream.

* [microNPU] Sum legalization support (#13997)
Supports legalizing a Relay sum operation to an equivalent series of NPU operations. It supports the case with an int8 output type and the channel axis.
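To make the `scatter` to `scatter_elements` replacement in #14019 above more concrete, here is a minimal pure-Python sketch of ONNX-style element scattering on 2-D lists. It is an illustration only, not TVM's implementation: the function name `scatter_elements_2d` and the reduction labels `"update"` and `"add"` are assumptions made for this example.

```python
from copy import deepcopy


def scatter_elements_2d(data, indices, updates, axis=0, reduction="update"):
    """Scatter `updates` into a copy of `data` along `axis` (2-D lists only).

    Hypothetical helper for illustration; not part of TVM.
    """
    if axis not in (0, 1):
        raise ValueError("this sketch only handles axis 0 or 1")
    out = deepcopy(data)
    for i, row in enumerate(indices):
        for j, idx in enumerate(row):
            # The index replaces the coordinate along `axis`; the other
            # coordinate comes from the element's position in `indices`.
            r, c = (idx, j) if axis == 0 else (i, idx)
            if reduction == "add":
                out[r][c] += updates[i][j]
            elif reduction == "update":
                out[r][c] = updates[i][j]
            else:
                raise ValueError(f"unsupported reduction: {reduction}")
    return out


data = [[0.0] * 3 for _ in range(3)]
indices = [[1, 0, 2], [0, 2, 1]]
updates = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
result = scatter_elements_2d(data, indices, updates, axis=0)
for row in result:
    print(row)
```

With `axis=0`, each entry of `indices` selects the destination row while the column is kept, which is why one operator with an `axis` argument can cover the cases the frontends previously routed through a separate scatter op.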

* [Fix][MetaSchedule] Fix redundant stages in async pipeline for mlt (#14143)
This PR fixes redundant stages if visiting `InitializeWithTuneContext` multiple times.

* [COMMUNITY] Cheng Wen -> Reviewer (#14153)
Please join me in welcoming @chengven027-intellif as a new Reviewer in TVM. Cheng has contributed to the ONNX/PyTorch frontends and Relay passes, making TVM support more input models.
- [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+)
- [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+)

* [Runtime] Fix high RAM usage when saving / loading parameters of big models (#14147)
  * add load_params_from_file
  * add save_params_to_file
  * avoid making another copy in save_params
  * black
  * add test
  * update doc

* [Relay][Frontend] Span Filling PyTorch (#14050)
  * [Relay][Frontend] Span Filling PyTorch
    - Construct the debug name of the C graph instruction as the source name of the span for a PyTorch model.
    - To get the reference of renamed nodes, add a function to export the converted C graph after conversion.
    - Add structural_equal comparisons with and without set_span to the existing test cases.
    - Add span test cases for frequent conversions.
    - Add a span test case for exporting model parameters.
  * [SpanFillingPyTorch]
    - Return TupleGetItem expr from TupleWrapper with the span of its Tuple.
    - Add None type symbol in set span for certain conversions.
    - Add current_op member variable to PyTorchOpConverter to track which op is being converted for the PyTorch frontend.
  * [SpanFillingPyTorch]
    - Fix the error caused by the quantized params not found after renaming the debug name of C graph.
---------
Co-authored-by: Joey Tsai <chunit@qti.qualcomm.com>

* [TRT][BYOC] allow strided_slice ops on selected dimensions (#14142) (#14144)

* [ONNX][TOPI] Add `DFT` operator (#13999)
  * init converter for DFT
  * init test for DFT
  * init DFT operator in Relay
  * update topi implementation for DFT
  * clean up
  * update ONNX frontend
  * support attribute
  * fix error: Expected Array[Tensor], but got Array[index 0: Array]
  * support inverse, onesided, dft_length
  * update tests for DFT
  * update TOPI test for DFT
  * add documentation
  * fix pylint
  * fix cpplint
  * fix cpplint
  * fix threshold for FP16 (ARM)
  * add CUDA compute
  * fix pylint
  * fix doc string
  * code review fixes for ONNX front-end
  * code review fixes for TOPI
  * rename: stft.py -> signal.py
  * pass input_shape and output_shape to verify_dft

* [CRT][microTVM] Enable USMP by default for AoTExecutor + CRT runtime (#14107)
This PR enables USMP by default when AoTExecutor and CRT runtime are selected. See the forum discussion about this change: https://discuss.tvm.apache.org/t/enable-usmp-by-default-in-aot-executor-with-runtime-crt/14406
As a result, the workspace memory in the mlperftiny project type is removed since memory allocation is not required. If we kept this workspace, the model would not fit, since some of the memory would be allocated twice.

* [Android] Fix using system libraries in Android apps (#14145)
- Starting from API 31, using `uses-native-library` is required if we want to open a system library: https://developer.android.com/about/versions/12/reference/compat-framework-changes#enforce_native_shared_library_dependencies
  We should specify the OpenCL library in `uses-native-library` in all applications where the OpenCL backend might be used.
- Updated README files to describe how to fix synchronization issues in Android Studio.

* [microTVM] Enable TVMC micro with AoT Executor (#14077)
This PR enables AoT Executor for tvmc micro compilation.
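The `DFT` operator entry (#13999) above mentions support for `inverse`, `onesided`, and `dft_length`. As a rough illustration of what those three options mean, here is a naive O(n²) discrete Fourier transform in pure Python. It is a sketch under assumptions, not the TOPI kernel: the function name `dft`, the zero-padding/truncation behavior for `dft_length`, and the 1/N scaling of the inverse are choices made for this example.

```python
import cmath


def dft(signal, inverse=False, onesided=False, dft_length=None):
    """Naive 1-D DFT over a list of real or complex samples (illustrative only)."""
    if inverse and onesided:
        raise ValueError("this sketch does not combine inverse with onesided")
    samples = list(signal)
    if dft_length is not None:
        # Assumed behavior: zero-pad or truncate the signal to dft_length.
        samples = (samples + [0.0] * dft_length)[:dft_length]
    n = len(samples)
    sign = 1.0 if inverse else -1.0
    spectrum = []
    for k in range(n):
        acc = sum(
            x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        # Assumed convention: the inverse transform is scaled by 1/n.
        spectrum.append(acc / n if inverse else acc)
    if onesided:
        # For real input the upper half mirrors the lower half, so keep
        # only the first n // 2 + 1 bins.
        spectrum = spectrum[: n // 2 + 1]
    return spectrum


impulse = [1.0, 0.0, 0.0, 0.0]
spectrum = dft(impulse)
restored = dft(spectrum, inverse=True)
half = dft(impulse, onesided=True)
print(len(spectrum), len(half))
```

A unit impulse has a flat spectrum, so every bin of `spectrum` is (numerically) 1, and applying the inverse recovers the original samples.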

* [bugfix] Fix the write buffer scope of `mma_store_impl` (#14174)
fix

* [Relay] Enhance EliminateCommonSubexpr to support Tuple argument (#14169)
If an argument of a call is a Tuple, we should check its fields. Different tuples with the same fields should be treated as the same input.

* [TIR] Fix typo in doc (#14178)

* [microTVM] Use QNN schedules to give SOTA performance (#13752)
In #13242, I rewrote microTVM's convolution schedules to give a major improvement in performance. While I demonstrated in tests that my changes worked, they could not be used with relay.build. This pull request expands the functionality of #13242 and adds new legalize and alter_op passes to take advantage of the quantized schedules. This dramatically improves performance on some models, dramatically cuts RAM usage, and removes the need for autotuning on microTVM.
More specifically, for the vww model from MLPerf Tiny running on the nucleo_l4r5zi, this pull request:
- Improves untuned performance from 1741 ms to 137 ms, roughly a 12.7x improvement!
- Improves tuned performance from 337 ms to 137 ms.
- Sets a new state-of-the-art for MLPerf Tiny, beating Plumerai's previous 208 ms record.
- Reduces RAM consumption by 73 KB (a large amount on microcontrollers!) by eliminating intermediate buffers.
- Reduces flash consumption for model weights by 5x.
- Slightly improves accuracy.

@mehrdadh has kindly tested these changes himself, and has confirmed my 137 ms figure.
To enable the schedules that grant these performance improvements, this pull request:
1. Adds out_layout support to the regular and depthwise conv2d schedules from #13242.
2. Generalizes the schedules from #13242 to be more widely applicable.
3. Adds a layout alteration pass to ensure regular and depthwise conv2d schedules always get their desired input formats.
4. Adds a conv2d -> depthwise conv2d -> unpadded conv2d rewrite step to remove empty channels from conv2d operators.
5. Adds a conv2d -> average pool -> dense rewrite step to remove empty channels from conv2d operators.
6. Adds an alter_op pass to fold padding into a separate Relay operator.

* Add v0.11.0 docs link to site (#14181)
Update the version menu in the TVM documentation to add a specific v0.11.0 release docs link.

* [TIR] Allow TransformLayout with non-inversible index map (#14095)
  * [TIR] Allow TransformLayout with non-inversible index map
    TransformLayout requires the index map to have an inverse map that can be calculated by the analyzer in order to check whether padding is added. However, this check does not work in all cases, because the affine analysis can only handle a set of supported patterns. In some cases, even if the index map doesn't introduce padding, the schedule primitive throws `TransformationIntroducesPaddingError` because it fails to calculate the inverse index map. It is safe to allow the buffer to be padded without providing pad_value, because the original loop extent is not changed and the padded region is not accessed. This PR changes the behavior of `TransformLayout` to allow a non-inversible index map. Previous discussion: https://discuss.tvm.apache.org/t/conflict-free-shared-memory-permutation-in-tensorir/13959/9
  * add assume_injective_transform option
  * Apply suggestions from code review
    Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
---------
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [TIR][Analysis] Implement IdentifyMemCpy analysis function (#13947)

* [HotFix][MetaSchedule] Turn off database shash check (#14188)
At this moment, the structural hash values of IR in TVM are platform dependent (e.g., the hash values of a String may differ on different platforms).
In our recent practice, we found this an obstacle for us to apply one existing database on different platforms (say we tune an IRModule with MetaSchedule on Metal, and then apply the database on CUDA, etc.) To clear this obstacle, we decide to remove the shash value check. The purpose of that check is mainly to ensure safety, and thus turning it off will make no difference in terms of using MetaSchedule in most of the cases that we can imagine. Meanwhile, it is equally important that we need to make our structural hash platform independent. There are plans ongoing for this target. * [TOPI] Batch Norm Training Mode (#14190) Prior to this PR, TOPI batch_norm only supports inference. This PR adds training: bool flag and momentum: float argument to support training mode (update moving_mean / var and return), which aligns with torch.nn.functional.batch_norm. * [TOPI] Group normalization (#14193) As more and more ML models nowadays contain the group normalization computation, we find it beneficial to introduce this op to TOPI level. It will enable us to optimize the group normalization operation as a whole in a more convenient way. This PR introduces the group normalization op to TOPI. The group norm operation was introduced in https://arxiv.org/abs/1803.08494. The implementation uses tuple reduction, same as the implementation of layer norm. Implemented with tuple reduction, the corresponding generated TIR function can be optimized by cross-thread reduction or rfactor through MetaSchedule. Co-authored-by: Bohan Hou <spectrometerh@gmail.com> * [Fix][TIR] LowerCrossThreadReduction with write-back predicate (#14199) Prior to this PR, the cross-thread reduction lowering pass does not add a store predicate to the write-back block. This is in consideration that for a certain write-back buffer position, all values being stored (by all the threads) in the write-back block are the same. 
Since all threads are writing the same value, we assumed that not having a write-back block predicate was fine, because the result will not be wrong in any way. However, we recently noticed that some GPU backend compilers flag this behavior (multiple threads writing to the same position) as a race condition and thus throw a compilation error. The compiler does not take into account the fact that all values being stored are the same, and insists on complaining. This means that we still need the write-back block predicate to make things work, and this PR makes that change. I have done integration tests locally to make sure that the generated kernels are right and produce the right results numerically.

* [Unity] Relax VM (#13878)
This PR implements a flexible register-based VM to execute Relax programs with dynamic shape and control flow. Design: https://github.com/tlc-pack/relax/wiki/Relax-VM-Design.
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>

* [Unity] Relax expressions and types (#13901)

* [Unity][IR] First-class StructInfo (#13907)
  * [Unity][IR] First-class StructInfo
    Relax tracks structural information (such as tensor shape) about the values in Relax via `StructInfo`.
  * Fix rust build
---------
Co-authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][CI] Unity specific jenkins setup (do not upstream to main) (#13910)
This PR sets up a unity-specific Jenkins with a minimal Jenkinsfile without sharding, and disables most of the tests to reduce overall cost. We can add tests for the unity branch by configuring the specific groovy file.

* [Unity] Basic StructInfo Analysis and Expr construction (#13916)
This PR adds struct info analysis and expr support. This is the logic to construct the IR nodes and perform struct info related analysis. Test cases are added to cover the IR node construction and the related struct info analysis checks.
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Altan Haan <altanh@cs.washington.edu>
Co-authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-authored-by: masahi <masahi129@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity] Relax BlockBuilder and ExprMutator (#13926)
This PR adds BlockBuilder, the core data structure to construct the Relax AST, and ExprMutator, which performs AST mutation for implementing transformation passes.
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: masahi <masahi129@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity] Relax TVMScript Parser. (#13932)
This PR adds the TVMScript parser/ir_builder support based on the blockbuilder.
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>

* [Unity] Relax TVMScript Printer (#13944)
This PR introduces Relax as a dialect supported by the TVMScript Printer. Some caveats:
- Needs to rebase to mainline before merging.
- Some tests are skipped because some operators are not upstreamed to the unity branch yet.
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity] Relax VM codegen (#13954)

* [Unity] Relax VM shape lowering pass (#13956)
This PR introduces the Relax `FunctionPass` and `DataflowBlockPass` APIs, and the `VMShapeLower` pass to lower the shape expressions in Relax to TIR functions and VM shape heap builtin functions.
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>

* [Unity] e2e Relax minimum build flow (#13961)
This PR introduces the e2e Relax lowering flow (`relax.vm.build`). Tests for each pass in the flow are added.
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity][TVMScript] Use explicit `R.shape` in TVMScript (#13979)
As we've introduced `arg_sinfo` in CallNode, the implicit shape constructor is not widely used in TVMScript. This PR removes the implicit shape since it may cause confusion between shape and tuple.

* [Unity] Relax op: index (#13987)
This PR is about the high-level tensor computation operators in Relax. This PR includes the tensor indexing operators.

* [Unity] Relax op: datatype (#13986)

* [Unity] Relax op: set (#13990)
This PR is about the high-level tensor computation operators in Relax. This PR includes the set operators.
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>

* [Unity] Relax op: image (#13994)
This PR is about the high-level tensor computation operators in Relax. This PR includes the image operators.

* [Unity] Relax op: arithmetic, comparison (#13983)
This PR is about the high-level tensor computation operators in Relax. This PR includes the unary, binary and ternary arithmetic and comparison operators.
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Chaofan Lin <1713833595@qq.com>

* [Unity] Relax op: statistical (#13991)
This PR is about the high-level tensor computation operators in Relax. This PR includes the statistical operators.

* [Unity] Relax op: neural networks (#13993)
This PR is about the high-level tensor computation operators in Relax. This PR includes the neural network operators.

* [Unity] Relax op: creation (#13984)
This PR is about the high-level tensor computation operators in Relax. This PR includes the tensor creation operators.

* [Unity] Relax op: linear algebra (#13988)
This PR is about the high-level tensor computation operators in Relax. This PR includes the linear algebra operators.
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity] Relax op: search (#13992)
This PR is about the high-level tensor computation operators in Relax. This PR includes the search operators.

* [Unity] Relax op: manipulation (#13989)
This PR is about the high-level tensor computation operators in Relax. This PR includes the tensor manipulation operators.
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>

* [Unity] NestedMsg Support utility (#13995)
This PR introduces NestedMsg to robustly handle nested-tuple analysis. Relax supports nested tuple structures in the IR. Nested tuple structure is important to support advanced groupings in cases such as gradient calculation and other scenarios. The possible presence of nested tuples means that we need to robustly handle analyses that contain nested tuple structures in a dataflow graph. This PR introduces a NestedMsg<T> class that corresponds to a possibly nested message tuple for a given leaf message class T. We also introduce various helper functions to compose and decompose messages.
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

* [Unity][Pass] Operator Fusion Passes (#14001)
This PR introduces three passes for operator fusion:
1. AnnotateTIROpPattern: analyzes the operator kind from a PrimFunc.
2. FuseOps: fuses operators for Relax functions, which adds a new fused Relax primitive function.
3. FuseTIR: fuses the corresponding TIR PrimFuncs for the fused Relax function.

* [Unity][Pass] LambdaLift pass (#14012)

* [Unity][VM] Supporting "compiled" exec mode. (#14015)
This PR adds support of "compiled" mode to the VM. The compiled mode translates the Relax function into a TIR function and drives it through the TIR function.
It is different from the micro AOT codegen, which generates TIR code that targets the micro C runtime environment and is useful for resource-limited settings with a smaller set of features. Both leverage the low-level TIR build that is also shared with TensorIR. The current implementation targets the full TVM (VM) runtime, which comes with PackedFunc, object, tuple, closure and all kinds of rich structure support. This also means that we can leverage the full runtime support to handle things like allocation, dynamic shape, easy plugins and Python interaction, which are not available in more limited runtimes. The user uses the same API to load the generated code regardless of compiled mode or bytecode, and only needs to change one line:
```python
ex = relax.vm.build(mod, target, exec_mode="compiled")
```
The simplicity is thanks to the TVM runtime architecture that allows us to compose things together in objects. The only difference is how the PackedFunc of high-level driving is provided. In the case of bytecode it is normal interpretation, and in the case of compiled mode it is TIR. It is a complete implementation. Unit test cases are added. All codegen build tests are updated to include the two exec_modes and have passed locally.
Co-authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][Pass] BindParams pass, FoldConstant pass (#14016)
This PR introduces FoldConstant/BindParam passes.
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass][TuningAPI] Introduce TuningAPI and MetaSchedule pass (#14014)
Add TuningAPI and MetaSchedule tuning pass

* [Unity] Relay -> Relax translator (#14026)
This PR implements a Relay to Relax translator, which allows us to import Relay workloads to Relax for benchmarking and development purposes (tests and examples are added).

* [Unity][Pass] Normalize Pass (#14031)
This PR implements the Relax `Normalize` pass, which allows users to transform Relax IR to normal form, i.e., the expressions are normalized (no nesting and hence the AST is in ANF), and all `checked_type_` and `shape_` of expressions are available (tests are added).
Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>

* [Unity][BlockBuilder] CallTE convert PrimValue args (#14028)
Prior to this PR, the `call_te` of BlockBuilder was not capable of converting PrimValue arguments and directly rejected PrimValues instead. This PR fixes this behavior with PrimValue conversion support and one regression test.
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass] Wellformed Analysis (#14032)
This PR implements the Relax wellformed analysis, which checks if the IRModule is well-formed (tests and examples are added).
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][TVMScript] Move tir/relax import in script out of __init__.py (#14033)
Prior to this PR, `python/tvm/script/__init__.py` imports both the tir and relax submodules. As a result, when people write
```python
from tvm.script import tir as T
```
the relax submodule is implicitly imported by `__init__.py` as well. Since TIR does not rely on Relax, it is good not to import both of them at the same time.
(This can prevent cyclic imports sometimes.) This PR does this decoupling by introducing two files * `python/tvm/script/relax.py` * `python/tvm/script/tir.py` and removing the imports from `python/tvm/script/__init__.py` and `python/tvm/script/parser/__init__.py`. With this change, we force people to manually do `from tvm.script import tir` and `from tvm.script import relax` to use TVMScript parser, which is right our conventional way. * [Unity][Pass] Operator legalization (#14029) This PR is the operator legalization pass, which transforms high-level operator calls to `call_tir`s of corresponding low-level TIR PrimFuncs. - The legalization pass provides customizability, which enables people to pass in a customized legalization map to override the default legalization method. - The legalization supports symbolic shape. (At this moment only pooling does not support symbolic shape, as TOPI pooling does not support. This needs to be fixed in followup PRs.) Co-authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn> Co-authored-by: Yixin Dong <ubospica@gmail.com> Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> * [Unity][Op] Add ShapeExpr Tests for Reshape Op (#14035) This PR specially checks the relax.reshape operator when the input is a ShapeExpr. * [Unity] Initial PyTorch Frontend (#14037) [Unity] Initial PyTorch Frontend This PR introduces initial pytorch frontend components of Relax, including - a FX translator that translates a Torch FX graph module to an TVM IRModule, - a Relax-backend of Torch Dynamo, which brings the mechanism to build PyTorch model using Relax compilation pipeline, - a pipeline prototype that contains the collection of pre-defined pipelines that optimizes and lower IRModule before passing to minimum build. 
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com> Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com> Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> * [Unity][Pass] Block-level static memory planning (#14038) This PR introduces the static memory planning pass on binding block level, as well as an analysis function that estimate the memory usage after the memory planning pass. It supports the following features: nested-tuples, reuse memory of the input of reshape ops, an estimator that returns total memory size needed to be allocated before and after memory planning, as well as the number of tensors / memory blocks to be allocated before and after memory planning. The estimation is static -- it does not consider control flows (such as “if” and cross-function calls). It simply accumulates the size of every alloc_tensor and alloc_storage. We will produce “`relax.memory.alloc_tensor/storage`” as the results produced by memory planning. * [Unity] Disallow inline prim_func in relax IR (#14040) Disallow inline prim_func in relax IR * [Unity] Update tests to adapt to latest TVMScript syntax (#14039) Given that some latest changes of TVMScript syntax have been merged, some test files are now containing deprecated uses of TVMScript syntax. This PR updates the test files with latest TVMScript syntax so that running the tests will not trigger deprecation warnings. Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> * [Unity] Relax dataflow pattern language (matching) (#14041) The dataflow pattern language for Relax (originally from https://github.com/tlc-pack/relax/pull/163). 
The implementation splits patterns into two parts:
- Match an Expression: match an expression syntactically (MatchExprPattern, i.e., DFPatternMatcher);
- Match a Graph: match a graph (across multiple VarBindings) topologically (MatchGraphPattern).

* [Unity] Statement rewriter for DataflowBlock (#14043)
This PR implements a few APIs to quickly perform statement-level mutation: `add`/`remove_unused`/`remove_all_unused`/`replace_all_uses`. It also implements `remove_all_unused` to remove dead statements inside `DataflowBlock`.

* [Unity][Pass] FuseOps FuseTIR fixes (#14044)
This PR fixes two bugs of FuseOps and FuseTIR:
- It fixes FuseOps, which only rewrote the "main" function of the IRModule. After the fix, FuseOps goes through each non-primitive Relax function. Test cases for both the FuseOps and FuseTIR sides are added to ensure that both passes work for cases with multiple Relax functions.
- It also fixes FuseOps and FuseTIR, which did not take "call_dps_packed" style "call_tir" into account. The previous behavior directly downcast the first argument of "call_tir" to GlobalVar, which is not right when the "call_tir" is in "call_dps_packed" style and the first argument is a PackedFunc. With this fix, FuseOps and FuseTIR skip such "call_tir"s. Tests for both CallTIR and CallOps are added accordingly.

* [Unity][TVMScript] Overload `__neg__` for relax expr (#14045)
This PR overloads `__neg__` given that `relax.negative` is now supported. Besides, it adds `test_op_misc.py` and brings tests for calling overloaded operators.

* [Unity][VM] Add per-op profiling support (#14053)
Adds per-op profiling support to Relax VM, in a way similar to how Relay VM is instrumented via the common profiling infra in the runtime. Profiling over RPC is supported.
Example output:
```
Name                          Duration (us)  Percent  Device    Count  Argument Shapes
conv2d1                          705,779.00    51.22  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 54, 54]
conv2d                           669,589.00    48.60  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]
relu                                 683.00     0.05  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]
relu1                                679.00     0.05  hexagon0      1  float32[1, 64, 54, 54], float32[1, 64, 54, 54]
vm.builtin.check_tensor_info          28.00     0.00  hexagon0      1  float32[1, 64, 56, 56]
vm.builtin.match_shape                25.00     0.00  hexagon0      1  float32[1, 64, 56, 56]
----------
Sum                            1,376,783.00    99.93                6
Total                                  0.00           cpu0          1
Total                          1,377,809.00           hexagon0      1

Configuration
-------------
Number of threads: 4
Executor: VM
```
The original PR: https://github.com/tlc-pack/relax/pull/422

* [Unity][BYOC] Add pattern-based partitioning pass (#14054)
This adds a new pass, FuseOpsByPattern, which applies pattern matching to each function in the given module, and groups matched expressions into a new function. The end result is similar to FuseOps, but fusion is driven completely by the provided patterns. The implementation also reuses OperatorFusor used by FuseOps to create grouped functions from partitioned groups, further illustrating the similarity between the two passes.
The new pass will serve the same role the MergeComposite pass plays in Relay BYOC - grouped functions are annotated with the "composite" attribute to denote what operations a given function consists of, and offloaded to external backends. But it can be also useful in non-BYOC settings, for example to support advanced fusion that the op-kind based one doesn't handle (fused MHA, conv2d / gemm + reduction fusion, etc).
The original PR: https://github.com/tlc-pack/relax/pull/366

* [Unity] Relax op: collapse sum (#14059)
This PR brings high-level operators `relax.collapse_sum_like` and `relax.collapse_sum_to` which is useful when doing AD in Relax. To achieve this, it exposes the interface of `topi.collapse_sum`.
Moreover, this PR also implements the legalization of these op and adds corresponding tests. * [Unity][Fix][Pass] Fix FuseOps for lack graph edges (#14058) This PR fixes a mistake of #14044. In #14044, in VisitLeaf of graph construction of FuseOps, we first check if the input node is Leaf and then check if it is Tuple. This is not right: as Tuple is not categorized as one leaf node, when the input node is a Tuple, the function will return since the input is not a LeafNode. And the check for Tuple will thereby never holds. It is quite interesting that our existing unit tests fail to filter this mistake out. I add a regression test for this case, which can ensure that the tuple is always visited. * [Unity][Pass] Remove Unused Function (#14061) This PR implements a pass to clean up unused functions. Co-authored-by: masahi <masahi129@gmail.com> * [Unity][BYOC] Add pass to merge composite functions to offload large subgraphs (#14062) This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large subgraph as possible, for example TensorRT. It plays the same role as the MergeCompilerRegion pass in Relay BYOC does, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830. Original PR https://github.com/tlc-pack/relax/pull/372 Substantial improvement by @yelite https://github.com/tlc-pack/relax/pull/411 Related fix PR by @yelite https://github.com/tlc-pack/relax/pull/406 Co-authored-by: Lite Ye <yelite958@gmail.com> * [Unity][Frontend] Annotate number of non-static input of FX function (#14067) * [Unity][Transform] Add LiftTransformParams pass (#14069) This PR added a pass `LiftTransformParams`. It allows to compile the end-to-end model without weights provided. 
The idea is to annotate the input parameters that are weights, identify and lift the transformations applied to weights, and compile them into a separate function `transform_params` that can be executed at runtime. Users can run `transform_params` with weights to get the weights for the optimized model as a prep step before deployment. In this way, we perform the same optimizations and defer the weight transformations to the user side, while the overhead of the deferred weight transformation can be ignored as it only needs to be run once.

This pass is integrated with the default `vm.build`. It is optional and only necessary when the parameters are kept as inputs when importing the model from the frontend.

* [Unity][BYOC][Pass] RunCodegen and TensorRT (#14078)

This PR introduces the fundamental workflow for BYOC and integrates TensorRT as a demonstration.

* [Unity][Pass] Canonicalize Bindings (#14079)

It may be useful for some passes to collapse chains of definitions, particularly after other compiler transformations that may reduce or simplify some expressions.

This pass will take chains of definitions and replace references to later definitions with the original one. It works by checking `LookupBinding` for each var use-site and replacing the var with its definition if the definition was another var. Additionally, `MatchCast` bindings where the LHS and the RHS are guaranteed to match at compile time are canonicalized into ordinary `VarBinding`s.

Example:

```python
y = x
z = y
w = z
o = w
p = o
```

Will be replaced with

```python
y = x
z = x
w = x
o = x
p = x
```

Original PR: https://github.com/tlc-pack/relax/pull/233

Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>

* [Unity] Add testcases for `expr_args_converter` (#14080)

This is a missing test file when we added the `expr_args_converter`. This PR adds it back.
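The chain-collapsing behaviour described for Canonicalize Bindings can be illustrated with a toy model. This is only a sketch in plain Python, not the Relax pass: bindings are modelled as `(lhs, rhs)` pairs of variable names, and `canonicalize_bindings` is a hypothetical helper name that does not exist in TVM.

```python
def canonicalize_bindings(bindings):
    """Collapse chains of var-to-var bindings (toy model, not the TVM pass).

    bindings: list of (lhs, rhs) variable-name pairs in program order,
    where every rhs is itself just a variable.
    """
    root = {}  # maps each defined var to the original var it aliases
    result = []
    for lhs, rhs in bindings:
        # If rhs was itself defined as an alias, look through it.
        resolved = root.get(rhs, rhs)
        root[lhs] = resolved
        result.append((lhs, resolved))
    return result


chain = [("y", "x"), ("z", "y"), ("w", "z"), ("o", "w"), ("p", "o")]
canonical = canonicalize_bindings(chain)
for lhs, rhs in canonical:
    print(f"{lhs} = {rhs}")
```

Because every alias is resolved at definition time, a single forward walk is enough; no chain is ever followed more than one step.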
* [Unity][BYOC] Add CUTLASS backend (#14081) Co-authored-by: Lite Ye <yelite958@gmail.com> * [Unity][BYOC] Add DNNL backend (#14082) This PR adds dnnl backend to the unity flow. * [Unity][Op] `log_softmax` and `cross_entropy_with_logits` (#14083) This PR introduces two high-level operators log_softmax and cross_entropy_with_logits, which are important when we are calculating CrossEntropyLoss (in torch). Co-authored-by: Yixin Dong <ubospica@gmail.com> * [Unity][Analysis] TIR pattern kind analysis for multi-buffer write block (#14075) This PR supports TIR pattern kind analysis for TIR blocks which write to multiple buffers, which is helpful for normalization operators like layernorm, groupnorm, etc. Prior to this PR, the analyzer does not support a blocks which write to multiple buffers. On seeing such a block, the analyzer simply sets the analysis result to "opaque". With this PR, on seeing a block which writes multiple buffers, the analyzer will check if all the BufferStores have the same indices. And it will only set the result to "opaque" when the BufferStores have different indices. By doing this, the analysis works for common cases where a block may write to multiple buffers, like layernorm or groupnorm. Besides the unit test for the analysis itself, this PR also adds a unit test for FuseOps pass, make sure that a "layernorm + relu" pattern can be fused together. * [Unity][Fix][Pass] FoldConstant with DCE in dataflow block (#14087) The current FoldConstant pass does not support removing unused bindings in the post-folding function. Therefore, for large real-world models, the built executable will be overlarge because of the redundant unused constants. This PR removes the redundant unused constant bindings in FoldConstant by using the analysis function "RemoveAllUnused". Note that "RemoveAllUnused" only works at dataflow block level. Therefore FoldConstant will not remove unused bindings outside of dataflow block as well. 
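The effect of removing unused bindings after constant folding can be sketched as a backward liveness walk over one dataflow block. This is a simplified illustration under stated assumptions, not the actual "RemoveAllUnused" implementation: each binding is modelled as `(var, list_of_vars_it_reads)`, and the function name below is a hypothetical stand-in.

```python
def remove_unused_bindings(bindings, outputs):
    """Drop bindings whose result is never read (toy model of one dataflow block).

    bindings: list of (var, deps) in program order; deps are the vars read.
    outputs:  iterable of vars that leave the block and must be kept.
    """
    live = set(outputs)
    kept = []
    # Walk backwards so a binding is visited after all of its users.
    for var, deps in reversed(bindings):
        if var in live:
            kept.append((var, deps))
            live.update(deps)
    kept.reverse()
    return kept


block = [
    ("a", ["x"]),
    ("c0", []),          # folded constant that nothing reads any more
    ("b", ["a"]),
    ("c1", []),          # folded constant that is still used
    ("out", ["b", "c1"]),
]
pruned = remove_unused_bindings(block, outputs=["out"])
print([var for var, _ in pruned])
```

As in the commit description, this only reasons about bindings inside the block it is given; anything outside the block is left untouched.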
* [Unity] Refactor Relax Build JIT UX (#14088) This PR refactors relax build so it get exposed at the opt-level. We also introduces an explicit jit functionality to handle live loading of compiled artifacts from cutlass. We also move relax vm to runtime so it can be clearly isolated from the rest of the compiler stack. * [Unity][Relax] Set Shape Function to Be Host Function (#14090) Set shape function to be host func. * [Unity] Fix typo in the comment (#14096) * [Unity] Lower `shape_of` to a builtin (#14093) This PR lowers shape_of op to a Relax VM builtin, and changes a utility function to take StructInfo as input. Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com> * [Unity] Relax Recursive function (#14092) This PR adds TVMScript local recursive function support. It also update lambda lifting pass. Removed CalledGlobalVars, it was not used anymore. It also updates well-form pass to allow un-defined vars for recursive call * [Unity][Layout] Add layout transformation analysis for PrimFunc (#14066) * [Layout] Add layout transformation analysis for PrimFunc. This change adds a PrimFunc level analysis to suggest layout transformations to block and buffers in the PrimFunc based on the layout transformations to PrimFunc outputs. * Add support for multiple blocks such as split op. * Add negative tests and increase coverage. * fix warning message * fix lint * remove unused header * Address comments. Moved some utility functions to support/array.h improve doc * fix deprecation warn T.var("int64") to T.int64() * address comments * [Unity] Remove attributes of relax.print, assert and unique (#14101) Remove the attributes of operators assert, print and unique. Use PrimValue as substitute. Co-authored-by: Steven S. 
Lyubomirsky [slyubomirsky@gmail.com](mailto:slyubomirsky@gmail.com) Co-authored-by: Prakalp Srivastava [prakalp@octoml.ai](mailto:prakalp@octoml.ai) * [Unity][BYOC]Add relax backend pattern registry (#14106) * Add relax backend pattern registry * Add doc * [Unity] Update tests again to adapt to latest TVMScript syntax (#14115) * finished * fix * rollback merge_composite_functions * [Unity][Fix] Fix bug in MergeCompositeFunctions (#14117) Currently `MergeCompositeFunctions` will modify the map while iterating over it, and that makes tests/python/relax/test_transform_merge_composite_functions.py does not pass. This PR fixes this bug. * [Unity][BlockBuilder] Add `name_hint` argument for `emit` and `emit_output` (#14126) This PR adds `name_hint` argument for `emit` and `emit_output` API of Relax blockbuilder. The argument exists in the C++ side but not exposed to Python side (So user who use the Python bb.emit will let `name_hint` be `""` by default). Co-authored-by: Yixin Dong <ubospica@gmail.com> * [Unity][WEB] Relax vm on web runtime (#14131) This PR brings initial relax vm support on web runtime * [Unity] Add Global info (#14132) * [Unity][BYOC] Add transposed matmul support to Relax CUTLASS BYOC (#14128) Add transposed matmul support for Relax CUTLASS * [Unity][TVMScript] emit_te sugar (#14123) This PR adds R.emit_te meta-programming mechanism to emit a topi operator from TVMScript * [Unity][BYOC] Assign group to unused bindings and ignroe PrimFunc (#14139) * [Unity][BYOC] Assign group to unused bindings and ignroe PrimFunc * Update fuse_ops.cc * [Unity] Add callback to FuseOpsByPattern to check match result is accepted (#14109) * [Unity] Add callback to FuseOpsByPattern to check match result is accepted * add callnode to callback args * update pattern registry * fix * [Unity][Legalize] Fix Scalar Constant Legalization (#14127) This PR fixes the issue of loss of data type during Legalization. 
Previously, if we use a constant scalar in operators like `multiply`, it will automatically be converted to a python data type variable, which may lose its original data type. For example, `float16` may become python `float` and be interpreted as `float32` later. This is now fixed by avoiding scalar value conversion. The conversion could be added back once we have better support for scalar prim value. Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com> Co-authored-by: Wuwei Lin <wuwei@apache.org> * [Unity][Pass] Enhance constant folding to fold relax ops by evaluating them. (#14146) * [Unity][Pass] Enhance constant folding to fold relax ops by evaluating them. This uses the registered legalization function attached to the op to lower it to call_tir and uses the existing call_tir folding mechanism to fold it. This kind of op folding is only allowed within dataflow block as ops could have side-effects. Limitations: * This currently does not support folding ops that could lower to multiple call_tir bindings. * Folding by evaluating ops is not always beneficial. We need a heuristic to check if it is useful. This is not implemented yet and folding is always allowed by evaluating expressions. * fix ci error * fix doc * fix bug * [Unity][Debugging] AST printer (#14152) This PR transfers over the AST printer from tlc-pack/relax. The AST printer is a debugging tool that prints out a Relax AST in a precise and human-readable format, which can be helpful for debugging the parser or various passes. Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu> Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com> Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> Co-authored-by: Ruihang Lai <ruihangl@cd.cmu.edu> Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com> * [Unity][Pass] Support Symbolic Shape Deduction during BindParam (#14154) `BindParam` replace function params to constant nodes. 
However, it will drop the shape information of the params. Consider the following case:

```python
@R.function
def main(
    x: R.Tensor(("batch", "m"), dtype="float32"),
    w0: R.Tensor(("n", "m"), dtype="float32"),
    b0: R.Tensor(("n",), dtype="float32"),
    w1: R.Tensor(("k", "n"), dtype="float32"),
    b1: R.Tensor(("k",), dtype="float32"),
) -> R.Tensor(("batch", "k"), dtype="float32"):
    batch = T.Var("batch", "int64")
    k = T.Var("k", "int64")
    m = T.Var("m", "int64")
    n = T.Var("n", "int64")
    with R.dataflow():
        lv0 = R.call_tir("linear0", (x, w0, b0), out_sinfo=R.Tensor((batch, n), dtype="float32"))
        out = R.call_tir("linear1", (lv0, w1, b1), out_sinfo=R.Tensor((batch, k), dtype="float32"))
        R.output(out)
    return out
```

The current pass will simply drop the symbolic vars `n` and `k`, causing undefined vars during build:

```python
@R.function
def main(x: R.Tensor((1, "m"), dtype="float32")) -> R.Tensor(dtype="float32", ndim=2):
    m = T.Var("m", "int64")
    n = T.Var("n", "int64")
    k = T.Var("k", "int64")
    with R.dataflow():
        lv0 = R.call_tir("linear0", (x, metadata["relax.expr.Constant"][0], metadata["relax.expr.Constant"][1]), out_sinfo=R.Tensor((1, n), dtype="float32"))
        out = R.call_tir("linear1", (lv0, metadata["relax.expr.Constant"][2], metadata["relax.expr.Constant"][3]), out_sinfo=R.Tensor((1, k), dtype="float32"))
        R.output(out)
    return out
```

This PR updates the pass to bind the symbolic shape during binding.

* [Unity][Analysis] Checking function return struct info in well-formed check (#14155)

The current well-formed check does not check the function return struct info, so a function may mistakenly pass the check even if there are undefined vars in its return struct info.
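The symbolic-shape binding described for `BindParam` above can be sketched in plain Python. This is a toy model, not the TVM pass: shapes are tuples whose entries are either integers or symbolic names given as strings, and `infer_symbolic_bindings` and `substitute_shape` are hypothetical helper names. The concrete weight shapes are made up for illustration.

```python
def infer_symbolic_bindings(param_shapes, bound_shapes):
    """Derive values for symbolic dims from the concrete shapes of bound params.

    param_shapes: name -> annotated shape, entries are int or str (symbolic).
    bound_shapes: name -> concrete shape of the constant being bound.
    """
    env = {}
    for name, concrete in bound_shapes.items():
        annotated = param_shapes[name]
        if len(annotated) != len(concrete):
            raise ValueError(f"rank mismatch for {name}")
        for dim, value in zip(annotated, concrete):
            if isinstance(dim, str):
                # First sighting records the value; later ones must agree.
                if env.setdefault(dim, value) != value:
                    raise ValueError(f"conflicting values for {dim}")
            elif dim != value:
                raise ValueError(f"static dim mismatch for {name}")
    return env


def substitute_shape(shape, env):
    """Replace every symbolic dim that has a known value; leave the rest."""
    return tuple(env.get(d, d) if isinstance(d, str) else d for d in shape)


param_shapes = {"w0": ("n", "m"), "b0": ("n",), "w1": ("k", "n"), "b1": ("k",)}
bound_shapes = {"w0": (16, 8), "b0": (16,), "w1": (4, 16), "b1": (4,)}  # assumed sizes

env = infer_symbolic_bindings(param_shapes, bound_shapes)
print(env)
print(substitute_shape(("batch", "n"), env))  # shape of lv0 after binding
print(substitute_shape(("batch", "k"), env))  # shape of out after binding
```

With the mapping in hand, every `out_sinfo` that mentions `n` or `k` can be rewritten to a concrete dimension instead of being left with an undefined variable.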
* [Unity][BYOC] Use Relax legalize + CPU build for reference in tests (#14162) * clean dnnl test * clean trt test * clean cutlass test * fix gelu legalize for fp16 * use memoize in dnnl and trt tests * [Unity] Add bind_constants option to FuseOpsByPattern (#14151) * [Unity] Add lift_constatns option to FuseOpsByPattern * lift_constants -> bind_constants * [Unity][Analysis] Analysis for detecting recursion in Relax (#14149) * DFS based attempt to detect mutual recursion * Use Johnson's circuit-detecting algorithm instead * Fix control flow test * Detect all recursion anyway * Add new test cases for simple recursion * Fix mistake in test case * Include missing dependencies * Remove trailing whitespace * Dependencies are simply references, not necessarily calls * More trailing whitespace * Newline at end of file * Fix spacing in docstring Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> --------- Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn> * [Unity][BYOC] Add batch matmul support to Relax CUTLASS BYOC (#14166) * Add batch matmul support to Relax CUTLASS BYOC * Allow more dtypes * Fix tests * Revert how to get batch attr * [Unity][Op] Full support of Relax op `power` (#14171) This PR provides a full support of `R.power` including op registering, legalization, overloading `__power__` for Expr and torch fx frontend. * [Unity][Analysis] Restore Python bindings for var analyses (#14180) Restore Python bindings for var analyses * [Unity][OP] Add an operator for fused multi head attention (#14150) * [Unity][OP] Add an operator for fused multi head attention This PR introduces the new relax operator `R.nn.attention` for fused multi head attention, and the support of fused multi head attention to relax cutlass BYOC. The input of the operator are query, key and value tensor, with `BSNH` layout, namely `[batch size, sequence length, number of heads, dimension of heads]`. And the output shares the same layout with all input tensor. 
* remove useless codes, remove attrs and add memoize * add more dispatches * nit and fix rebase * fix linter * add support for bias * fix lint * BNSS layout for bias * update doc * fix typo * support bias broadcast * [Unity][WEBGPU] Codegen improvements and WebRuntime (#14187) This PR makes various improvements web codegen in relax web runtime. Correct support of shift operators. Update relax vm to make most use of internal allocators. Update the webgpu API to the latest spec. * [Unity][Transform] LiftTransformParams handling multiple functions (#14192) Previously, the LiftTransformParams pass only works on function `"main"`. This is a bit restrictive as in our recent practice on stable diffusion, there are cases where multiple Relax functions inside an IRModule all need to be transformed. Therefore, this PR enhances the LiftTransformParams pass, so that it will now transform **all** functions **with attribute `num_input`**. For functions without this attribute, the pass will simply skip them. * [Unity][Op] Group normalization (#14194) * [TOPI] Group normalization As more and more ML models nowadays contain the group normalization computation, we find it beneficial to introduce this op to TOPI level. It will enable us to optimize the group normalization operation as a whole in a more convenient way. This PR introduces the group normalization op to TOPI. The group norm operation was introduced in https://arxiv.org/abs/1803.08494. The implementation uses tuple reduction, same as the implementation of layer norm. Implemented with tuple reduction, the corresponding generated TIR function can be optimized by cross-thread reduction or rfactor through MetaSchedule. Prior to this PR, the group normalization operations in frontend models are translated to a series of operations, which brings inconvenience when we want to optimize the group norm op as a whole. 
With the TOPI implementation of group norm being introduced by #14193, we can now use it to legalize the high-level group norm op and optimize it using cross-thread reduction or rfactor via MetaSchedule. Co-authored-by: Bohan Hou <spectrometerh@gmail.com> * [Unity][Op] Argmax and argmin (#14195) This PR introduces full support to the argmax and argmin op to the unity branch, including the structure info inference, the legalization, and the translation from Torch FX. * [Unity][Op] Legalize `round`, `floor`, `ceil`, `sign` (#14198) This PR implements the legalization for four unary operators: * round, * floor, * ceil, * sign. Unit tests are provided accordingly. * [Unity][Frontend] FX translator supporting more ops (#14196) This PR improves the torch FX translator in the following perspectives: * support unary op `sigmoid` and `round`, * support in-place `fill`, `triu` and `tril`, * support `tensor`, `arange`, `empty`, * support `bmm` (batch matrix multiplication), * support `astype`, * support `chunk` and `squeeze`. This PR also fixes `Embedding`. Previously the translation assumes that the input to Embedding will only be 1-dimensional, and will throw exception when the input has more than one dimension (i.e., batched). This PR brings the support. * [Unity][Frontend] FX translator returning weights with `keep_params_as_input` (#14197) PR #14067 introduces the flag `keep_params_as_input` to the FX translator, in the purpose to handle to model weights outside of the translated Relax function. This PR takes a further step, by returning the model weights as NDArrays when the flag `keep_params_as_input` is true. With this PR, the translator now can return back the weights upon requested. Otherwise, after the import we will lose the model weights in the given PyTorch model. * [Unity][Fix] FX translating dtype (#14201) This PR fixes a bug of the current FX translator when dealing with dtype. 
Previously, the translator does not take the cases ```python dtype = x.getattr("dtype") ``` into consideration. In this case, the dtype will be a fx.Node object, while the translator assumes that the dtype is either a string or a torch native datatype (e.g., torch.float32). This PR fixes this by doing an environment table lookup before for all dtypes. * [Unity][TIR][Pass] ForceNarrowIndexToInt32 (#14203) [TIR][Pass] ForceNarrowIndexToInt32 This PR introduces a pass which forces every index expression in a PrimFunc to have dtype int32. Meanwhile, it also checks if all integer buffers in the PrimFunc have int32 dtype, and report error if some integer buffer has dtype other than int32. In terms of implementation, this pass leverages the IndexDataTypeNormalizer, with the target dtype being int32. This PR contains a few basic tests that come from `test_tir_transform_narrow_datatype.py`, and contains some negative tests as well. * [Unity][Frontend] FX translator support torch.baddbmm (#14202) This PR brings the support of translating `torch.baddbmm` into combination of operators (matmul, add, multiply). Unit tests are provided accordingly. This PR also fixes the kwarg fetching issue of `torch.interpolate`. * [CI] Point cpu ci to dep with onnx (#40) Point cpu ci to dep with onnx * [Unity] Introduce Default GPU Schedule Pass (#14182) * Implement default schedule. * Add test. * Add tests. * Fix linting. * Skip scheduled blocks. * Address issues. * Use target current. * Minor fixes. * Remove Mutator. …
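The `torch.baddbmm` translation mentioned in #14202 decomposes the op into matmul, multiply, and add. The arithmetic can be sketched with nested Python lists, using the documented PyTorch semantics `out = beta * input + alpha * (batch1 @ batch2)`. This is only an illustration of the decomposition, not the FX translator code; it assumes `input` already has the full output shape (no broadcasting), and `bmm` and `baddbmm` here are local helper functions.

```python
def bmm(batch1, batch2):
    """Batched matrix multiply on nested lists of shape [B, N, M] and [B, M, P]."""
    return [
        [[sum(a * b for a, b in zip(row, col)) for col in zip(*m2)] for row in m1]
        for m1, m2 in zip(batch1, batch2)
    ]


def baddbmm(inp, batch1, batch2, beta=1.0, alpha=1.0):
    """beta * inp + alpha * (batch1 @ batch2), built from matmul, multiply, add."""
    product = bmm(batch1, batch2)                      # matmul
    return [
        [
            [beta * i + alpha * p for i, p in zip(inp_row, prod_row)]  # multiply + add
            for inp_row, prod_row in zip(inp_mat, prod_mat)
        ]
        for inp_mat, prod_mat in zip(inp, product)
    ]


batch1 = [[[1, 2], [3, 4]]]
batch2 = [[[5, 6], [7, 8]]]
bias = [[[1, 1], [1, 1]]]
result = baddbmm(bias, batch1, batch2, beta=2.0, alpha=0.5)
print(result)
```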

@areusch

guberti force-pushed the arm-qnn-convolution branch 2 times, most recently from 8f0b1a4 to 4fd94e2 on November 7, 2022 12:34

guberti force-pushed the arm-qnn-convolution branch 3 times, most recently from f206531 to 40b5554 on November 11, 2022 12:24

guberti changed the title ~~[microTVM] [WIP] Support and test QNN convolution and fusion on Arm Cortex-M~~ [microTVM] [WIP] Modernize Arm Cortex-M convolution schedules Nov 13, 2022

guberti force-pushed the arm-qnn-convolution branch 3 times, most recently from 39cb5a4 to 7b465c2 on November 17, 2022 14:44

guberti marked this pull request as ready for review November 17, 2022 14:45

ibsidorenko reviewed Nov 18, 2022

python/tvm/testing/aot.py (outdated, resolved)

guberti force-pushed the arm-qnn-convolution branch from a6dfafc to febb861 on November 18, 2022 16:27

areusch reviewed Nov 18, 2022

guberti force-pushed the arm-qnn-convolution branch from febb861 to fae2a12 on November 20, 2022 20:32

mkatanbaf reviewed Nov 21, 2022

ibsidorenko reviewed Nov 22, 2022

src/relay/qnn/op/convolution.cc (outdated, resolved)

Mousius reviewed Nov 23, 2022

Mousius mentioned this pull request Nov 23, 2022

[microTVM] Replace arm_nnsupportfunctions.h with arm_acle.h #13363

Merged

guberti changed the title ~~[microTVM] [WIP] Modernize Arm Cortex-M convolution schedules~~ [microTVM] Modernize Arm Cortex-M convolution schedules Nov 23, 2022

alanmacd reviewed Nov 23, 2022

guberti force-pushed the arm-qnn-convolution branch from 29d97a8 to f11243a on November 24, 2022 13:42

guberti mentioned this pull request Nov 29, 2022

[microTVM] [WIP] Optimized assembly schedules for Cortex-M convolution guberti/tvm#2

Closed

areusch reviewed Nov 30, 2022

guberti mentioned this pull request Dec 1, 2022

[microTVM] Use autotuning to choose num_outputs value #13528

Closed

guberti force-pushed the arm-qnn-convolution branch from f11243a to 9bd3598 on December 1, 2022 15:47

areusch mentioned this pull request Dec 5, 2022

[ci] Run doctest as part of CI #13553

Closed

areusch approved these changes Dec 5, 2022

guberti added 17 commits December 5, 2022 16:27

Quantized Corstone300 test draft

8feb4c3

Add QNN strategy with operator fusion for Cortex-M

24f6204

Get QNN strategy running QNN strategy with operator fusion

Add assembly tensordot code from other PR

e1b5341

Assembly tensordot from other PR Tensordot offset support Hand tested tensordot code

Helper work to support microTVM TIR schedules

07e28ca

Formatting fixes Don't use automatic AOT building when skipping pass Assorted tech for scheduling with TIR Hacky int16 support

TIR schedule for microTVM conv2d

1275265

Bugged schedule implementation Passing test! Works for all 1x1 conv2ds! External QNN operator altering Debugging work Pad with correct constant Broadly functional conv2d Reorganize quantize convolution test

TIR schedule for microTVM depthwise_conv2d

3728cbb

Working depthwise convolution for strides=1 Working depthwise convolution!

Clean up code

2f4717a

Support Python 3.7 Clean up code to prepare for review

Break qnn.py into helper functions

0fdc1f2

Finish reorganizing qnn.py

ad604be

Fix linting

351f719

Remove residual debug code and fix linting

ad27d62

Try repairing unit tests

f08170d

Run black to fix linting

d90b39c

Address code review comments

c760ea3

Second round of code review

7a8f506

Second round of code review Fix tensordot opts test

Address @areusch code review

cde0f47

More code review

431e4e4

guberti force-pushed the arm-qnn-convolution branch from dcd9c17 to 431e4e4 on December 6, 2022 00:27

Catch VWW model download with request hook

e5e37cd

guberti merged commit bbba8d9 into apache:main Dec 6, 2022

guberti mentioned this pull request Jan 10, 2023

[microTVM] Use QNN schedules to give SOTA performance #13752

Merged

leandron mentioned this pull request Feb 1, 2023

TVM v0.11.0 Release Candidate Notes #13899

Closed

guberti commented Oct 31, 2022 (edited)

High level changes

TFLite-ground-truth Corstone300 Test

Quantized operators and fusion

TVMScript convolution schedules

New optimized C intrinsic for convolutions

Faster re-quantization algorithm!

tvm-bot commented Oct 31, 2022 (edited)

guberti commented Nov 17, 2022 (edited)

areusch left a comment

mkatanbaf left a comment

guberti, Nov 23, 2022 (edited)

guberti commented Nov 23, 2022 (edited)

areusch left a comment

guberti commented Dec 1, 2022

guberti, Dec 5, 2022 (edited)