# [RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE) (#94)

- Feature Name: aarch64_backend
- Start Date: 2022-09-26
- RFC PR: [apache/tvm-rfcs#0000](https://github.com/apache/tvm-rfcs/pull/0000)
- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
- Co-Authors: [@manupak](https://github.com/manupak), [@u99127](https://github.com/u99127)

# Summary

This RFC introduces a new TIR backend for AArch64 code generation to support target-specific features, in particular SVE. Currently, AArch64-specific code is generated either through the generic LLVM backend or through tensorize implementations (e.g. the MMLA Arm(R) Neon(TM) instruction), but we would benefit from more fine-grained control over the LLVM IR that targets AArch64.

# Motivation

The main motivation behind this work is to introduce SVE instructions into codegen without changing IRs, scheduling primitives or TVM passes. An AArch64 backend would also be a good place to work around the issues in LLVM SVE code generation that surfaced while adding SVE support to Halide. In addition, the `CodegenAArch64` backend would not be limited to SVE codegen: it could be used to introduce AArch64-specific lowering where required, either for specialised uses of AArch64 intrinsics or to work around limitations of LLVM.

# Guide-level explanation

In comparison to the Arm(R) Neon(TM) instruction set, which uses a fixed vector length, SVE allows the developer to write vectorized code where the exact length of a vector is unknown at compile time. That code can then run on hardware implementations with different vector lengths. The only constraints on the hardware vector length are that it must be at least 128 bits and a multiple of 128 bits. *Vscale* is the number of 128-bit chunks that fit into the SVE vector; e.g. a vscale of 4 corresponds to a vector length of 512 bits.

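As a quick numeric illustration (plain Python, not part of the proposed implementation), the relationship between vscale, the vector length and the number of FP32 lanes looks like this:

```
# vscale is the number of 128-bit chunks in the SVE vector.
for vscale in (1, 2, 4, 16):
    bits = 128 * vscale
    fp32_lanes = bits // 32  # four FP32 elements fit into 128 bits
    print(f"vscale={vscale:2d}: {bits:4d}-bit vectors, {fp32_lanes:2d} FP32 lanes")
```
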
The initial SVE implementation in TVM would focus on two main capabilities:

**1. Vector length agnostic loops**

As an example, consider this vector length agnostic loop that adds two vectors of FP32 elements:

```
for (i = 0; i < n; i += 4 * vscale)
  c[i : i + 4 * vscale] = a[i : i + 4 * vscale] + b[i : i + 4 * vscale]
```

The number 4 in the example comes from the fact that four FP32 elements fit into 128 bits. The number of loop iterations therefore depends on vscale, which is a hardware implementation detail. If the vector length were, for example, 256 bits, we could process 8 FP32 elements per iteration and would need `n / 8` iterations; with a 512-bit vector length we would need only `n / 16` iterations.

**2. Predication**

SVE provides support for predication, which lets us deal efficiently with loop tails, among other things. In the example above, `n` may or may not be a multiple of `4 * vscale`. Predication allows us to handle the loop without any special treatment of the remaining elements, i.e. `c[n - n % (4 * vscale) : n]`. Essentially, every operation on SVE registers takes a predicate register as one of its arguments, which acts as a bit mask indicating which lanes are active. Similarly to the vector length, the length of a predicate depends on the hardware implementation.

```
whilelt p0.s, w17, w12
ld1w    { z0.s }, p0/z, [x2, x17, lsl #2]
ld1w    { z1.s }, p0/z, [x1, x17, lsl #2]
fadd    z2.s, z0.s, z1.s
st1w    { z2.s }, p0, [x0, x17, lsl #2]
```

In this example, `whilelt` constructs the predicate register `p0` from the loop counter and the loop bound held in `w` registers.

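To illustrate the effect of the predicate, here is a small NumPy sketch (purely illustrative, not TVM code): the boolean mask built in each iteration plays the role of `whilelt`, so the final partial vector needs no separate scalar tail loop.

```
import numpy as np

def predicated_add(a, b, vl):
    """Add two 1-D arrays in chunks of vl elements, using a boolean
    mask (the 'predicate') so the tail needs no special handling."""
    n = a.shape[0]
    c = np.zeros_like(a)
    for i in range(0, n, vl):
        lanes = np.arange(i, i + vl)
        predicate = lanes < n          # whilelt-style mask
        active = lanes[predicate]      # only the active lanes
        c[active] = a[active] + b[active]
    return c

a = np.arange(10, dtype="float32")
b = np.ones(10, dtype="float32")
print(predicated_add(a, b, vl=4))  # the last iteration covers only 2 elements
```
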
## How to target the AArch64 backend

Similarly to how we target other LLVM codegen backends, the AArch64 backend would be selected by parsing the `-mtriple` value in the target string:

```
target = "llvm -mtriple=aarch64-gnu-linux -mattr=+sve"
```

The node visitors in the AArch64 backend implementation would generate SVE code when `+sve` is part of `-mattr`.

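As a usage sketch (assuming a TVM build whose LLVM supports the AArch64 target, and with deployment/cross-compilation handled separately), selecting the new backend would only require the extended target string; the schedule itself, which is the same tiled schedule used in the reference-level example below, is unchanged:

```
import tvm
from tvm import te

# Vectorized elementwise add, scheduled exactly as in the reference-level example.
A = te.placeholder((200, 200), name="A")
B = te.placeholder((200, 200), name="B")
T = te.compute((200, 200), lambda i, j: A[i, j] + B[i, j], name="T")

s = te.create_schedule(T.op)
xo, yo, xi, yi = s[T].tile(T.op.axis[0], T.op.axis[1], x_factor=10, y_factor=5)
s[T].vectorize(yi)

# The backend (and SVE codegen) is chosen purely through the target string.
target = "llvm -mtriple=aarch64-gnu-linux -mattr=+sve"
mod = tvm.build(s, [A, B, T], target=target, name="vec_add")
```
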
# Reference-level explanation

The main difference compared to `CodegenLLVM` would be in how we generate LLVM IR and assembly for `Ramp` and `Broadcast` nodes.

Let's take a simple vectorized addition of two-dimensional tensors as an example:

```
from tvm import te

A = te.placeholder((200, 200), name="A")
B = te.placeholder((200, 200), name="B")
T = te.compute((200, 200), lambda i, j: A[i, j] + B[i, j])

s = te.create_schedule(T.op)
xo, yo, xi, yi = s[T].tile(T.op.axis[0], T.op.axis[1], x_factor=10, y_factor=5)
# ^^ y_factor=5 would be the vector length
s[T].vectorize(yi)
```

Currently, loops that are annotated with vectorize are represented as `Ramp` nodes in TIR:

```
@main = primfn(A_1: handle, B_1: handle, m: int32, n: int32) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [200, 200], []),
             B: Buffer(B_2: Pointer(float32), float32, [200, 200], [])}
  buffer_map = {A_1: A, B_1: B} {
  realize(compute: Buffer(compute_1: Pointer(float32), float32, [200, 200], []), [0:200, 0:200], True {
    for (i.outer: int32, 0, 20) {
      for (j.outer: int32, 0, 40) {
        for (i.inner: int32, 0, 10) "unroll" {
          compute[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)] = (A[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)] + B[(i.inner + (i.outer*10)), ramp((j.outer*5), 1, 5)])
        }
      }
    }
  })
}
```

The above TIR segment contains static numbers for the lane count (5) and the inferred loop bound (40) along the `j` axis. With SVE, the AArch64 backend would instead treat the lane count as `llvm.vscale() * 4` and the corresponding loop bound as `ceil(200 / (llvm.vscale() * 4))`, where 200 is the total number of elements along `j` (40 iterations of 5 lanes).

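As a hypothetical worked example of that bound (plain Python, not the proposed implementation):

```
import math

def sve_loop_bound(total_elems, vscale, elems_per_128b=4):
    """Iterations needed when each iteration handles vscale * elems_per_128b
    elements (four FP32 elements fit into 128 bits)."""
    return math.ceil(total_elems / (vscale * elems_per_128b))

# 200 FP32 elements along the j axis (previously 40 iterations of 5 lanes):
for vscale in (1, 2, 4):
    print(vscale, sve_loop_bound(200, vscale))
# vscale=1 (128-bit): 50 iterations
# vscale=2 (256-bit): 25 iterations
# vscale=4 (512-bit): 13 iterations, with the predicate masking the tail
```
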
With SVE enabled, this TIR would then be lowered to LLVM IR:

```
for_body7: ; preds = %for_body7, %for_begin4.preheader
  %indvars.iv = phi i64 [ %indvars.iv.next, %for_body7 ], [ 0, %for_begin4.preheader ]
  %14 = trunc i64 %indvars.iv to i32
  %15 = tail call <vscale x 4 x i1> @llvm.aarch64.sve.whilelt.nxv4i1.i32(i32 %14, i32 4)
  %16 = add i32 %12, %14
  %17 = sext i32 %16 to i64
  %18 = getelementptr inbounds float, float* %4, i64 %17
  %19 = tail call <vscale x 4 x float> @llvm.aarch64.sve.ld1.nxv4f32(<vscale x 4 x i1> %15, float* %18)
  %20 = getelementptr inbounds float, float* %5, i64 %17
  %21 = tail call <vscale x 4 x float> @llvm.aarch64.sve.ld1.nxv4f32(<vscale x 4 x i1> %15, float* %20)
  %22 = fadd <vscale x 4 x float> %19, %21
  %23 = getelementptr inbounds float, float* %6, i64 %17
  tail call void @llvm.aarch64.sve.st1.nxv4f32(<vscale x 4 x float> %22, <vscale x 4 x i1> %15, float* %23)
  %indvars.iv.next = add i64 %indvars.iv, %7
  %24 = icmp slt i64 %indvars.iv, 4
  br i1 %24, label %for_body7, label %for_end8, !prof !
```

That would then be turned into the following assembly when compiled for an SVE-enabled AArch64 target:

```
.LBB1_3:
  add     w17, w14, w15
  whilelt p0.s, w15, w12
  sxtw    x17, w17
  ld1w    { z0.s }, p0/z, [x2, x17, lsl #2]
  ld1w    { z1.s }, p0/z, [x1, x17, lsl #2]
  add     x16, x16, x10
  cmp     x16, #4
  add     w15, w15, w10
  fadd    z0.s, z0.s, z1.s
  st1w    { z0.s }, p0, [x0, x17, lsl #2]
  b.lt    .LBB1_3
  mov     w15, #200
  mov     x16, x11
```

## Pattern matching TIR

We can change fixed vector length loops into scalable vector length loops with the following steps (a rough sketch of the matching is given after the list):

1. Pattern match vectorized `BufferLoad`/`BufferStore` nodes of the form `A[ramp(iter*lanes, 1, lanes)]` and check that they all use the same lane count. Change the lane count to `vscale * 16 / sizeof(dtype)` in the generated LLVM IR.
2. Change the outer loop bound to depend on `vscale`.
3. If the loop doesn't satisfy this pattern, we abort the transformation.

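The following is a rough sketch of step 1, written against the Python `tvm.tir` API for illustration only (the actual pass would live in the C++ backend, and the helper name is hypothetical):

```
from tvm import tir


def collect_fixed_lanes(prim_func):
    """Collect the lane counts of all stride-1 Ramp-indexed loads/stores."""
    lanes = set()

    def visit(node):
        if isinstance(node, (tir.BufferLoad, tir.BufferStore)):
            for index in node.indices:
                if isinstance(index, tir.Ramp):
                    stride = index.stride
                    if isinstance(stride, tir.IntImm) and stride.value == 1:
                        lanes.add(int(index.lanes))

    tir.stmt_functor.post_order_visit(prim_func.body, visit)
    return lanes


# The rewrite is only applied when every vectorized access uses the same
# fixed lane count; otherwise we would abort (step 3):
# lanes = collect_fixed_lanes(mod["main"])
# can_use_sve = len(lanes) == 1
```
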
# Drawbacks

Enabling VLA could possibly be done in `CodeGenLLVM` using target-independent intrinsics; however, it would be good to build confidence in the new features in a specialised backend before introducing them to `CodeGenLLVM`. Whether the generic VLA intrinsics in LLVM are mature enough is still an open question.

# Prior art

There is already plenty of precedent for specialised LLVM backends, e.g. the Hexagon backend.
Regarding SVE support, there has been work to add [128-bit SVE vector support to Halide](https://github.com/halide/Halide/pull/6781).

Based on this description, it seems the proposed approach is that:

One concern is that the code can be simplified by the assumption (lanes = 5) during the lowering phase, but that simplification does not work for the general case.

Edit: after thinking a bit more, I now think the above concern can be addressed by clarifying a strict set of raising rules, so feel free to ignore this.