Conversation

@MeeraN7 MeeraN7 commented Aug 4, 2021

Introducing the addition of the Arm Architecture's Scalable Vector Extension (SVE) to TVM, containing an initial VLA and predication implementation, based on earlier work by Giuseppe Rossini.

@tkonolige tkonolige left a comment


Thanks for the RFC! I've left some comments.

Comment on lines 9 to 10
In this RFC we would like to propose a TIR extension to support scalable
vectorisation. This is an introductory RFC to see if the design of our

Can you define "scalable vectorization" in this paragraph?

Comment on lines 106 to 107
stride _VL_, which stands for Vector Length. _VL_ is only shown for ease of
representation and we don't store _VL_ anywhere inside the TIR data structures.

When is the vector length of the loop determined? Is this decided by the hardware?

a file called codegen_aarch64.cc which handles the creation of SVE intrinsics
in LLVM by visiting the Load, Store and For nodes (to be replaced with a While
node) and generating the relevant LLVM IR.


Please include a drawbacks section.

Expression language, for example:
```
s[C].vectorize_scalable(x)
```

What do you think about adding a new ForKind so that we can express scalable vectorization in TIR?

has changed, it is now scalable. The constructor is:

```
DataType(int code, int bits, int lanes, bool is_scalable = false)
```

Why not use a special value for lanes to indicate that it can be a variable number?

pass visits the TIR nodes and transforms them so that they account for the
unknown vector length. Our prototype was implemented before the addition of
the While node to TVM, so it currently generates a variable for loop. One
change we are planning to make is transforming a for loop into a while loop

Why is a while loop necessary?

Comment on lines 39 to 42
However, most modern vector architectures (e.g. X86 AVX and the Arm
Architecture's MVE and SVE extensions) support predicated vector instructions,
removing the need for such a scalar epilogue and also allowing more code to be
vectorised. Lane predication allows the enabling/disabling of certain lanes in

Is this PR talking about using predicated vector instructions or scalable vector instructions? If you aren't talking about predicated vector instructions, you could probably just remove this paragraph.


We would like to add support for Arm Architecture's Scalable Vector Extension (SVE) in TVM
by introducing features for Vector Length Agnostic (VLA) programs and
predication, i.e. the two main new SVE features. Thus we would like to express

You mention predication here, yet you never talk about how it is being added to TIR.


We can also see that the syntax of the Ramp nodes has now been modified to handle
an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it

We already have a concept of an unknown scalar value at compile time: Any. It seems like you could just have the lanes of Ramp be a PrimExpr instead of an int. Is there a reason not to take this approach?

@MeeraN7

Sorry it has taken a while to get back to you. To answer your question: we feel it would be much simpler to keep the number of lanes as an int, since we know the value will always be an integer; it is only the exact number that is unknown. Also, it would be slightly misleading to use the value Any, as the number of lanes can't be "any" number, but must instead be a multiple of a minimal number (in this case 128).
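To illustrate that arithmetic, a plain-Python aside written for this discussion (the numbers assume int8 elements and SVE's 128-bit minimum width):

```python
# Plain-Python illustration: the lane count is always an int, but which
# multiple of the minimum the hardware picks is unknown at compile time.
min_bits, elem_bits = 128, 8              # SVE minimum width; int8 elements
for multiple in (1, 2, 4):                # e.g. 128-, 256- and 512-bit parts
    lanes = multiple * min_bits // elem_bits
    print(lanes)                          # 16, 32, 64 -- never "any" number
```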


masahi commented Aug 4, 2021

I think it is a good idea to invite people working on RISC-V support for TVM to review/discuss, since the RISC-V vector extension is similar to Arm SVE. I remember some people in Taiwan are working on this. Maybe @comaniac knows who?

comaniac commented Aug 4, 2021

I think it is a good idea to invite people working on RISC-V support for TVM to review/discuss, since the RISC-V vector extension is similar to Arm SVE. I remember some people in Taiwan are working on this. Maybe @comaniac knows who?

Thanks for bringing this up. cc @yrchen


MeeraN7 commented Aug 5, 2021

@sjoerdmeijer @giuseros


tqchen commented Aug 5, 2021

Thank you @MeeraN7 for the RFC. SVE is certainly an interesting topic.

Because we do not yet have SVE support in TIR, it would be useful to think carefully about how SVE can be represented and transformed. Right now the proposal contains one way to do so; however, a bit more context would help us see how the proposed design impacts the general lowering flow.

Specifically, it would be very helpful to also show examples of the code along the transformations, so we can better understand the possible design tradeoffs. It might be helpful to translate some of the examples in the whitepaper to TVMScript form.

Specifically:

  • The TIR before SVE vectorization
  • The TIR after SVE vectorization right before LLVM codegen
  • The corresponding lowered llvm intrinsics

To touch a bit on the design alternatives (disclaimer: I only learnt about VLA/SVE by quickly reading through the manual, so I could be wrong): based on my understanding, the main goal of VLA is to use intrinsics to represent a somewhat restricted loop pattern (in terms of the operations we can perform), whereas the previous fixed-length vectorization pushes these onto vector constructs such as Ramp and Broadcast.

I wonder if we could take a different approach for VLA/SVE, because VLA is by nature closer to a loop pattern. Specifically, I feel we could come up with some form of "loop SVE legalization" that legalizes a loop's body to the patterns SVE supports, then leaves the for loop as it is with a VLA annotation. The code generator can then take that and generate VLA code.

@tqchen tqchen added the status: need review RFC needs review label Aug 5, 2021

tqchen commented Aug 5, 2021

cc @junrushao1994 @vinx13 as it can be related to the future tensorization optimizations

@tqchen tqchen assigned tqchen, masahi and areusch and unassigned tqchen Aug 5, 2021

giuseros commented Aug 5, 2021

Hi @tqchen,
I will try to comment sporadically, since this is a project I prototyped (and enjoyed :) ) when I was at Arm.

If I understand your comment correctly, what @MeeraN7 is doing is closer to what you are proposing than it might seem. Instead of transforming a loop into a Ramp and passing the Ramp "as is" to LLVM (which is what is done for fixed vector lengths, but is not doable for SVE), @MeeraN7 is legalising the loop in TIR and passing the legal loop down to the LLVM code generator. In other words, the following loop:

for i = 1:1:10
A[i] = B[i]+C[i];
end

Once legalised becomes

for i = 1:VL:10
A[VL] = B[VL]+C[VL]
end

And then the LLVM code generator, knowing that this is a variable-length loop, translates it using LLVM intrinsics for:

  • Predicated load/store
  • Loop increment
  • Predication mask calculation

Please note that only load/store needs to be predicated. Other register-to-register operations (e.g., add/sub/mul) won't need predication.

Please also note that while predication is not a TIR concept, we use it to support VLA (cc @tkonolige) in the LLVM codegen. In the future it should be quite straightforward to expose predication in TIR as well (if required).
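To make that concrete, here is a plain-Python model of the legalised loop (an illustration for this discussion, not TVM code; a fixed VL of 4 stands in for the hardware-chosen vector length):

```python
# Model of the legalised VLA loop: masked load/store, plain add,
# scalar induction variable incremented by VL each iteration.
def vla_add(A, B, C, VL):
    n = len(A)
    i = 0
    while i < n:
        # predication mask: which of the VL lanes are still in bounds
        mask = [i + lane < n for lane in range(VL)]
        a = [A[i + lane] if on else 0 for lane, on in enumerate(mask)]  # masked load
        b = [B[i + lane] if on else 0 for lane, on in enumerate(mask)]  # masked load
        c = [x + y for x, y in zip(a, b)]   # register-to-register add: no predication
        for lane, on in enumerate(mask):    # masked store
            if on:
                C[i + lane] = c[lane]
        i += VL                             # loop increment by the vector length

A = list(range(17))
B = list(range(17))
C = [0] * 17
vla_add(A, B, C, VL=4)  # real SVE hardware chooses VL; 4 is just for the demo
assert C == [a + b for a, b in zip(A, B)]
```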

@MeeraN7 feel free to jump in if something I said is not correct (very likely :) )


MeeraN7 commented Aug 6, 2021

Thank you @giuseros for your answer, it looked good to me. Just to add on: we know which loops need to be legalised because in TIR we add a new ForKind called kVectorizedScalable (cc @tkonolige), which marks the loop as vectorizable with an unknown value of VL. These loops are then legalised and transformed to include the unknown value VL, and later passed to the code generator as @giuseros mentioned.
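For reference, a plain-Python sketch of where the proposed kind would sit (the first five kinds mirror tvm.tir.ForKind today; VECTORIZED_SCALABLE is the RFC's proposal and its numeric value here is an assumption):

```python
from enum import IntEnum

class ForKind(IntEnum):
    SERIAL = 0
    PARALLEL = 1
    VECTORIZED = 2            # fixed-width vectorisation, known lane count
    UNROLLED = 3
    THREAD_BINDING = 4
    VECTORIZED_SCALABLE = 5   # proposed: vectorise with unknown VL
```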


MeeraN7 commented Aug 6, 2021

Thank you all for the comments so far, we are currently working on a response to address them.


tqchen commented Aug 6, 2021

Thanks @MeeraN7. I think the overall loop legalization makes sense. I wonder, then, whether it is necessary to update node constructs such as Ramp, or whether we can directly make use of a scalar loop in the body.

@MeeraN7 MeeraN7 force-pushed the initial-sve-addition branch from a1d2cf6 to 125582f Compare August 19, 2021 14:14
We can also see that the syntax of the Ramp nodes has now been modified to handle
an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it
are still the same; the only difference is that the number of lanes is unknown
@masahi masahi Aug 27, 2021


For this specific example, the semantics does change, doesn't it? Because in your example C[ramp(0, 1, 17)] = A[ramp(0, 1, 17)] + B[ramp(0, 1, 17)], 17 is not the length of a vector but the size of an input, while in the SVE example above the input length is explicitly divided by the vector length at the TIR level.

Of course, I understand that the semantics of Ramp node in general does not change.


To avoid confusion, it is probably better to talk about the different senses in which the Ramp node is used in fixed-width vs scalable vectorization: the former treats the entire input as one chunk, while the latter is specifically for a vector-length-wide chunk.


masahi commented Aug 27, 2021

Thanks @MeeraN7 @giuseros, I like the approach of making the vectorized loop explicit with the VL parameter at the TIR level, in contrast to how fixed-width vectorization is done today. It would be great if you could add a section on why a different implementation strategy was chosen. Perhaps to make the LLVM codegen simpler?

If possible, I think it is better not to introduce user facing changes, since as far as an API is concerned, s[C].vectorize(...) is already vector-length agnostic.

@sjoerdmeijer

@masahi, about:

If possible, I think it is better not to introduce user facing changes, since as far as an API is concerned, s[C].vectorize(...) is already vector-length agnostic.

which I think is very closely related to an earlier inline comment:

As I commented above, I'd like to continue using s[C].vectorize(...) and when the feature is available, enable SVE by a target attribute. So I don't expect any user facing work.

I think we do need a user-facing option to toggle fixed/scalable vectorisation. If the vectorisation strategy is selected based on an available target attribute, we lose the ability to choose between fixed and scalable vectorisation. For architectures that do support scalable vectorisation, fixed width might still be preferable in some cases.

I think this is similar to Clang's loop pragmas. For example, the vectorize_width pragma has been extended with an optional second argument fixed|scalable:

vectorize_width(_value_[, fixed|scalable]),

see also the Clang docs here.

So I see two approaches:

  • we extend s[C].vectorize(...) to take an optional fixed/scalable boolean value, similar to Clang's loop pragma, which defaults to fixed if omitted,
  • or introduce s[C].vectorize_scalable(...) as proposed in this RFC.

I personally don't have any preference. But now I am wondering if extending s[C].vectorize(...), the first option, would be better. What do you think?
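To make the two candidate spellings concrete, here is a sketch against a small TE schedule (only the existing vectorize call runs today; the two proposals are shown as comments because neither API exists yet):

```python
import tvm
from tvm import te

# A small elementwise compute to hang the schedule on.
A = te.placeholder((1024,), name="A", dtype="int8")
B = te.placeholder((1024,), name="B", dtype="int8")
C = te.compute((1024,), lambda i: A[i] + B[i], name="C")
s = te.create_schedule(C.op)
(x,) = C.op.axis

s[C].vectorize(x)                    # today: fixed-width vectorisation
# s[C].vectorize(x, scalable=True)   # option 1: extend the existing primitive
# s[C].vectorize_scalable(x)         # option 2: dedicated primitive (this RFC)
```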


masahi commented Sep 1, 2021

Thanks @sjoerdmeijer @giuseros, I didn't imagine that there would be a case where mixing fixed and scalable vectorization is beneficial. I prefer s[C].vectorize(..., scalable=True) to s[C].vectorize_scalable(...) but both seem fine.

Any other comments @tqchen @tkonolige?

```
s[C].vectorize(x)
```

Vectorisation along the x-axis is requested with _vectorize(x)_, and will

Within markdown, code is typically denoted by a pair of backticks surrounding the code. I suggest using that convention to enclose methods, variables, and other code.


tqchen commented Sep 10, 2021

Thanks @MeeraN7 @giuseros, to make the discussion more concrete, right now the IR after legalization looks like

  for (i: int32, 0, 17;i+=VL) {
    C_2[ramp(i, 1, VL)] = ((int8xVL*)A_2[ramp(i, 1, VL)] + (int8xVL*)B_2[ramp(i, 1, VL)])
  }

This would require changes to the Ramp data structure and to the data type to support VL vector types, which can be a bit ad hoc: there is additional information that needs to be encoded (e.g. that this VL and that VL are the same) but is nevertheless not clearly encoded here.

Given we are mostly matching a for-loop pattern, I also wonder if these changes are really necessary, since we could represent a VLA loop as some form of restricted for loop with special annotations. Here is a possible alternative:

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  }

We would then defer the vectorized instruction generation to the codegen phase, by specially handling the patterns in a for loop annotated as VLA. Of course we can only support a limited set of patterns (such as read/write to the same vector index, or limited reduction support), which is why legalization is needed to make sure the body of the VLA for loop satisfies the pattern.

In this way we can likely get a similar set of capabilities without hacking Ramp to carry a VL size.


MeeraN7 commented Sep 10, 2021

Hi @tqchen, thank you for the comment. To clarify a few things: VL is not added to the Ramp node at all; it is simply a string used when printing TIR, for visual representation. The only addition to the Ramp node (and also the Broadcast node) is a boolean called "is_scalable", which should not affect anything else since a separate constructor was added. I don't think any other information needs to be added to these nodes or data types.


tqchen commented Sep 10, 2021

Thanks @MeeraN7. Yes, I get what you mean. Right now we are adding an "is_scalable" field to indicate that the broadcast and ramp are "context dependent" on VL. Additionally, we might need to update DataType to indicate a scalable data type.

This context dependency is the missing information I mentioned here. The set of code is really undefined and should be parameterized by VL. Additionally, consider the case of two loops with different VL1 and VL2 on which we want to do some transformations: we might fall into the trap of thinking of them as the same type (because only "is_scalable" is marked) when in reality they are not, as the implicit dependency on VL can be ignored.

I can understand that the additional flag can be helpful, as we could reuse some of the vectorization logic. However, the "is_scalable" field might introduce the confusion described above, and the additional ramp node may not carry much extra information (apart from the fact that we use a scalar vs a vector type). So my main question is whether we could use a separate normal form to hint the code generator, without changing the current DataType, Ramp and Broadcast.

Specifically, a regular loop as follows would carry the same amount of information: an access through i (the VLA index) would indicate a vector load, while accesses through other indices would become normal loads. The main constraint is that we only allow i to appear in certain locations (say, in the innermost position, to represent a ramp-like pattern), and we defer the generation of SVE code to the codegen phase by pattern matching on the indices:

N1: A possible loop normal form via annotation

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  } 

N1 would indeed put a bit more pressure on the code generator, because it now needs to pattern match loads/stores of the VLA index (i), and possibly perform broadcasts where necessary. However, the additional overhead may not be too large, and this could help us keep the code cleaner. My guess is that this approach would also be easier to generalize to later loop patterns such as scalable matrix instructions, in which case we cannot really reuse Ramp and Broadcast.

@sjoerdmeijer

Thanks for commenting, @tqchen.
Could you further clarify a few things for me, please? See my remarks inline.

Thanks @MeeraN7. Yes, I get what you mean. Right now we are adding an "is_scalable" field to indicate that the broadcast and ramp are "context dependent" on VL. Additionally, we might need to update DataType to indicate a scalable data type.

This context dependency is the missing information I mentioned here.

I don't think I understand what you mean by context dependency. Basically, in my view of the world, the Ramp node means that we can process elements in a data-parallel way. How exactly this is done is up to the backends, depending on the architecture and the code. What we are doing here is annotating the Ramp node with a bit of state: a hint that we want a special kind of vectorisation. From this point of view it is syntactic sugar: if the hint is dropped or ignored, the code could still be vectorised, just not in a vector-length agnostic way.

I don't think we need to encode much more information than one boolean that specifies the vectorisation style. You could indeed argue that this is ad hoc, but if we do it in a different way we would still need to keep a bit of state around, like the annotation={"VLA"} example that you gave earlier. From that point of view, I don't see any difference.

The set of code is really undefined and should be parameterized by VL.

What do you mean by undefined here?

Additionally, consider the case of two loops with different VL1 and VL2 on which we want to do some transformations: we might fall into the trap of thinking of them as the same type (because only "is_scalable" is marked) when in reality they are not, as the implicit dependency on VL can be ignored.

I can understand that the additional flag can be helpful, as we could reuse some of the vectorization logic. However, the "is_scalable" field might introduce the confusion described above, and the additional ramp node may not carry much extra information (apart from the fact that we use a scalar vs a vector type). So my main question is whether we could use a separate normal form to hint the code generator, without changing the current DataType, Ramp and Broadcast.

Correct me if I am wrong, but your assumption seems to be that explicit scalable state is bad. Looking at your example:

N1: A possible loop normal form via annotation

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  } 

This annotation looks equivalent to a loop pragma. In Clang, for example, you can do:

 #pragma clang loop vectorize_width(4, scalable)
  for (i = 0; i < 17; ++i) {
    C_2[i] = A_2[i] + B_2[i];
  }

and thus request scalable vectorisation. If this annotation results in scalable vectorisation, the LLVM vectoriser will lower this to operations using <vscale x 4 x i32> scalable vector IR types.

What I would like to say with these examples is that there are two things going on at different levels. I think:

  • The annotation={"VLA"} corresponds to a loop pragma in Clang,
  • The TIR Ramp node extension corresponds to LLVM IR scalable types, e.g. <vscale x 4 x i32>.

And I think these 2 concepts are different things that both have their value and place.

If we can only annotate loops as being scalable, we lose the finer-grained control to request this at the statement level. I don't know if mixing fixed and scalable will be an important use-case, but I think it is possible.

Summarising, I don't think explicit encoding of scalable in TIR nodes is a bad thing; the opposite, actually: I think we need it, and the annotation on the loop might be a complementary technique to it.

What do you think?

Some small additions to Solution Approach and Next Steps
sections following discussion about representation of VL
and the vectorize_scalable function. Also, modified style
to use backticks for function and variable names
and any other code-related text.
@MeeraN7 MeeraN7 force-pushed the initial-sve-addition branch from 29c3914 to b6bc21a Compare September 16, 2021 15:48

tqchen commented Sep 29, 2021

Thanks @sjoerdmeijer, sorry for getting back to this late. If LLVM also encodes the SVE vector as a special type (without tying down n), it could be a precedent that we can learn from. I do want to point out that the same approach won't work when it comes to scalable matrix instructions, so it would be good to think about a loop-pattern based approach from that angle.

If we conclude that we really want to introduce scalable vectors into the data type and so on, a remaining thing we need to resolve is the encoding. DataType and DLDataType are part of the DLPack standard, used as a standard ABI convention, so ideally we do not want to change the type signature of DataType/DLDataType. What we might be able to do is introduce a magic lane number (say DataType::kScalableVectorLaneMark = -1) that indicates a scalable vector type, and use similar approaches for other language constructs as well. This would alleviate the problems on that side.

This being said, it would still be good to get @MeeraN7 @giuseros @sjoerdmeijer's take on the possibility of using a normalized loop pattern vs embedding scalability into the vector type.

Thank you for all the discussions so far!

@sjoerdmeijer

Hi @tqchen , thanks for the great feedback!

This being said, it would still be good to get @MeeraN7 @giuseros @sjoerdmeijer's take on the possibility of using a normalized loop pattern vs embedding scalability into the vector type.

Yes, I can see that being very useful. If my analogy is correct, the loop annotation corresponds to a loop pragma, and the scalable ramp/vector node to an IR extension. Like I wrote before, I think there's a place for both of these concepts.

Can you please advise how to proceed? I guess we need to document this in the RFC. But do we want to prototype this too? Or do we just list it as an alternative, and discuss here what we would like to do first?

I have not yet commented on your remark about the ABI convention, as I am not familiar enough with TVM to comment on this. It does look like a solvable problem, though.


tqchen commented Oct 4, 2021

The ABI issue is important since it affects DLPack, so I would suggest we agree on the encoding convention before we proceed. I am fine with having a scalable vector representation if we adopt the right encoding.

where the Ramp TIR node has the form 'Ramp(base, stride, lanes)' showing that
these elements are processed in (vector) lanes.

The size of 18 has been chosen to demonstrate the challenges of vectorising


The size is 17. There are other mentions of 18 in this paragraph as well.

example, and importantly no scalar epilogue. But since we do not need to
process 5 * 4 = 20 elements, the last vector operation only needs to write two
elements, which can be achieved by predication as we can enable the first two
lanes and disable the last 2 lanes.


Even though SVE may be adding per-lane predication for ARM, per-lane predication as a concept is unrelated to SVE and should be done independently. This is especially important for non-ARM architectures that do support it.

In addition to predication, and also related to it, some new vector
architectures also allow scalable vectorisation. As opposed to so called fixed
width vectorisation (e.g. AArch Neon), the Arm architecture SVE vector
extension allows implementations to choose a vector register length between 128


Please clarify that "implementations" means "processor implementations".

@kparzysz-quic

DataType and DLDataType are part of the DLPack standard,

The DataType type is defined in TVM; it's not present in DLPack. Or am I missing something?


tqchen commented Oct 11, 2021

@kparzysz-quic DataType is a thin wrapper around DLDataType.

@kparzysz-quic

Right, but users of DLPack will not see DataType, only DLDataType, unless they use TVM as well. Changing DataType will only affect those that use include/tvm/runtime/data_type.h (i.e. the ABI breakage will be limited to users of TVM).


@kparzysz-quic kparzysz-quic left a comment


This RFC is not about SVE. It's about vectorizing (countable) loops with bounds not known at compile time. I think that would be very valuable for all targets and we can determine the list of necessary features that a target should satisfy to implement it.

Even though SVE can be used to implement it, it only applies to ARM, and there is no compelling reason to introduce SVE-specific features to TIR. There may be room for unknown-length vectors in TIR, but that should be done separately from SVE.

Whether SVE is used to implement that or not should be left to the ARM codegen to decide.


tqchen commented Oct 11, 2021

@kparzysz-quic this is right. On the other hand, we still rely on conversion between DLDataType and DataType (e.g. getting a DataType from an NDArray) in many cases, so ideally we want to keep them consistent.

@smeijer1234

A belated thank you for sharing further thoughts on this, @tqchen and @kparzysz-quic. I am on holiday this week (and a little of next), but want to pick this up soon after that. In the meantime, some remarks/questions from my side.

First of all, the ABI discussion goes over my head at the moment, to be honest; I am not familiar enough yet to comment on or address it. My understanding from the discussion so far is that there will be an ABI break, but a rather limited one, so it doesn't require any further attention here. Would that be a correct summary?

Perhaps more fundamental is @kparzysz-quic 's remark:

This RFC is not about SVE. It's about vectorizing (countable) loops with bounds not known at compile time. I think that would be very valuable for all targets and we can determine the list of necessary features that a target should satisfy to implement it.

Even though SVE can be used to implement it, it only applies to ARM, and there is no compelling reason to introduce SVE-specific features to TIR. There may be room for unknown-length vectors in TIR, but that should be done separately from SVE.

Yep, I agree, and you're right about this: this is not really about SVE. The "only" addition to TIR is the is_scalable part, which I would not consider SVE-specific, because it just represents an unknown-length vector, and any back-end can deal with that in any way it likes (and there are quite a few architectures that now support scalable-like vectors). So, just for my understanding, would you mind elaborating on your thoughts here? For example, is it just a naming issue and we need to find another term, or is it more fundamentally about how we would like to do things in TIR? Thanks!


tqchen commented Oct 27, 2021

@smeijer1234, to be more specific: I was recommending an approach that won't break the ABI.

Introduce

DataType::kScalableVectorLaneMark = -1

And in DataType, introduce a method IsScalable() that returns true if lanes equals -1. This way we reuse the data layout of the original data structure without introducing the is_scalable field.
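A minimal sketch of that encoding in plain Python (illustrative only; the real change would live in TVM's C++ DataType):

```python
# Reuse the existing integer `lanes` field; -1 is the reserved mark.
kScalableVectorLaneMark = -1

class DataType:
    def __init__(self, code, bits, lanes):
        self.code, self.bits, self.lanes = code, bits, lanes

    def is_scalable(self):
        # scalable if and only if lanes carries the reserved mark
        return self.lanes == kScalableVectorLaneMark

assert DataType(0, 8, kScalableVectorLaneMark).is_scalable()
assert not DataType(0, 8, 16).is_scalable()
```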


manupak commented Jun 21, 2022

Hi @tqchen @kparzysz-quic @masahi @tkonolige @smeijer1234 ,

We are looking to revive this work. I have gone through the thread.
The summary so far is as follows:

  • We want to introduce/enhance a scheduling vectorization primitive that can be controlled by the user/auto-tuner/auto-scheduler to use scalable vectors in the backend codegen.

    • The conversation has resolved to extending the existing vectorize scheduling primitive, i.e. s[C].vectorize(..., scalable=True).
  • Usage of this scheduling primitive should result in a for loop with Ramp nodes carrying either an additional "is_scalable" argument or a special number for lanes.

    • I think @tqchen was suggesting using the special lane number (-1), as opposed to introducing an additional argument to TIR nodes such as Ramp and Broadcast as well as to DataType (and DLDataType), to avoid ABI breakage.
    • Moreover, VectorizeLoopScalable will be modified to create a While node.
  • Is the name of the RFC confusing, @kparzysz-quic? I suppose that for TIR, what we are adding is vector-length agnostic vectorization support, while demonstrating the codegen of VLA-vectorized TIR using Arm(R) SVE instructions.

Please confirm whether this is the right summary of the current state.
As for next steps, I would like to propose/resolve each of the outstanding discussion points and update the RFC.


areusch commented Jun 30, 2022

@manupa-arm I think your summary is roughly correct.

@tqchen @masahi @kparzysz-quic perhaps you could have a look and see if we can resolve the question of how to represent the scalable concept in DLDataType. Also cc @alter-xp in case they would like to leverage this for RISC-V (I'm aware you are doing a BYOC implementation, but perhaps this could be useful if the approach goes beyond that in the future).

I do tend to agree with @kparzysz-quic that we could make the title a bit more generic, since the DLDataType question is at the center of this RFC.


alter-xp commented Jul 1, 2022

Thanks for bringing this up. This is also very useful for RISC-V; we look forward to progress in this regard.

@kparzysz-quic

To reiterate: my original concern was that the first RFC was proposing changes to the target-independent part of TVM to add support for a very target-specific feature. However, I do think that we can move this forward in a way that would be overall useful.

Here is the outline of my thoughts on this. Let me know what you think.

First, a couple of observations:

  1. Architectures that support vectors can be assumed to also support vector predication. I'm talking specifically about masked operations, and in particular about predicated loads and stores.
  2. For ARM/AArch64, it may be beneficial to distinguish vectorization via fixed-length vectors from one via scalable vectors. If this choice is to be made by auto-scheduling, it should be expressible in TIR.

What this RFC proposes is very close to allowing vectorization of countable loops with variable iteration count, and I insist that we keep this in mind as a goal.

The way that vectorization works right now is that a loop like

for (i : [0, 130)) {
  C[i] = A[i] + B[i]
  D[i] = A[i] * B[i]
}

will be replaced with statements

C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]

The expressions within these statements are all PrimExpr, whose type must be expressible by DataType. All parameters in DataType are compile-time integers, which means that a single statement can only represent vectors with a known number of lanes. In other words, neither VIC (variable iteration count) nor VLA can be implemented without some changes. These changes may be in how types are represented in DataType, or in how vectorization is done (or a combination of the two).

We are already considering a special value for DataType::lanes that would represent the yet-unknown vector length (VL). Following Halide's approach to vectorization, I propose that we change vectorization to take an explicit vector length as a parameter. As a special case for SVE, the scalable VL could be represented by the same constant we chose for DataType::lanes. For compatibility with existing code, stage.vectorize() would be equivalent to stage.vectorize(vector_length=iter_count), since currently only loops with known iteration count can be vectorized. The argument value vector_length=VL would indicate using SVE. With vectorize(vector_length=32), the loop above would be turned into

for (i : [0, (130+31)/32)) {
  // i-th vector is [32*i..32*(i+1))
  C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] = A[Ramp..., pred=...] + ...
  ...
}

If the loop iteration count changed from the known integer 130 to some expression N, the generated code would remain mostly the same: the structure does not depend on the fact that 130 is a compile-time constant. Similarly, the 32 indicating vector length could be replaced with the predefined value for "scalable vector length", with the only potential issue being the calculation of the iteration count of the for loop above. If we were to allow an explicit "stride" in For, the issue would go away (the RFC proposes something like that). The predicate arithmetic is sketched after the summary below.

To summarize:

  1. Introduce kScalableVectorLaneMark (as suggested by @tqchen).
  2. Make vector length a parameter to stage.vectorize.
  3. Introduce "predicate" to BufferLoad and BufferStore.
  4. Allow non-unit strides in For loops (as per the RFC).
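A plain-Python model of the per-lane predicate in the vectorized loop above (illustration only, not TIR):

```python
# Per-lane predicate: Ramp(32*i, 1, 32) < Broadcast(130, 32).
def lane_predicate(i, vector_length=32, iter_count=130):
    ramp = [vector_length * i + lane for lane in range(vector_length)]
    return [idx < iter_count for idx in ramp]

# 130 = 4*32 + 2, so on the fifth (last) vector iteration only the
# first two lanes are active and the store touches just those elements.
assert lane_predicate(4) == [True, True] + [False] * 30
```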


wrongtest-intellif commented Jul 3, 2022

Hi~ here are my two questions :)
cc @kparzysz-quic

  • 2. Make vector length a parameter to stage.vectorize.

    What is the difference between

    • sch[C].vectorize(v, vector_length=32) and
    • vo, vi = sch[C].split(v, 32) then sch[C].vectorize(vi)

    It seems that we could also choose to properly lower the split's predicate to reach the same goal as proposed below. For example, the mechanisms introduced in RFC [RFC] Buffer Layout Padding #77 may help?

  • 3. Introduce "predicate" to BufferLoad and BufferStore.

    Our team was also confused about how to represent predicated loads/stores when, several months ago, upstream upgraded T.load/T.store (which had a 1D predicate field) to BufferLoad/BufferStore. Now that BufferLoad/BufferStore are multi-dimensional, does the predicate also become a multi-dimensional field?

    Another concern is whether embedding a predicate into BufferLoad/BufferStore increases the complexity of (or breaks) buffer-region-related analysis in existing implementations. Could we leverage T.select(pred, A[...], undef) to represent A[..., pred], or just match a predicated memory access pattern like if (pred) C[...] = ...?

Thanks!

@kparzysz-quic

Hi~ here are my two questions :) cc @kparzysz-quic
What is the difference between

  * `sch[C].vectorize(v, vector_length=32)` and
  * `vo, vi = sch[C].split(v, 32)` then `sch[C].vectorize(vi)`
  
  It seems that we could also choose to properly lower the split's predicate to reach the same goal as proposed below. For example, the mechanisms introduced in RFC [[RFC] Buffer Layout Padding #77](https://github.com/apache/tvm-rfcs/pull/77) may help?

Yes. The padding proposed in the padding RFC could be utilized for this. What I wrote didn't take that into account.

  Our team was also confused about how to represent predicated loads/stores when, several months ago, upstream upgraded `T.load`/`T.store` (which had a 1D predicate field) to `BufferLoad`/`BufferStore`. Now that `BufferLoad`/`BufferStore` are multi-dimensional, does the predicate also become a multi-dimensional field?

I suppose it could be multi-dimensional, but effectively it would be the conjunction of all the per-dimension predicates.

  Another concern is whether embedding a predicate into `BufferLoad`/`BufferStore` increases the complexity of (or breaks) buffer-region-related analysis in existing implementations. Could we leverage `T.select(pred, A[...], undef)` to represent `A[..., pred]`, or just match a predicated memory access pattern like `if (pred) C[...] = ...`?

The region analysis could simply ignore the predicate (assume that it's true for all lanes). This is presumably what would happen with the select. For stores, we'd have to have an if statement that accepts a vector of booleans to properly emulate the store predicate. Adding a predicate to BufferLoadNode and BufferStoreNode is a feature that could have some value on its own, so perhaps it deserves its own RFC...
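A tiny plain-Python model of the load/store asymmetry described above (illustration only, not TIR):

```python
pred = [True, True, False, False]       # 4-lane load/store predicate
A = [1, 2, 3, 4]
C = [0, 0, 0, 0]
undef = None

# Load side: select(pred, A[...], undef) -- region analysis can treat the
# whole region as read and simply ignore the predicate.
loaded = [a if on else undef for a, on in zip(A, pred)]

# Store side: a per-lane "if" is needed so disabled lanes leave memory
# untouched, i.e. an if statement taking a vector of booleans.
for lane, on in enumerate(pred):
    if on:
        C[lane] = loaded[lane]
assert C == [1, 2, 0, 0]
```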


tqchen commented Jul 12, 2022

Thanks folks. Just to come back on this: my main comment is with respect to the data structure change. Leverage the special mark so that we don't break the ABI of the runtime type:

DataType::kScalableVectorLaneMark = -1


areusch commented Jul 12, 2022

@tqchen just to clarify: do we then add this to the DLPack repo, or do we consider this a specialized use of lanes internal to TVM?


tqchen commented Aug 10, 2022

@areusch this can be something that is internal to TVM for now, given it is only used as a compiler abstraction, not at runtime.

@lhutton1

Closing as superseded by: #104

@lhutton1 lhutton1 closed this Feb 20, 2024
