Conversation

@MeeraN7 MeeraN7 commented Aug 4, 2021

Introducing the addition of the Arm Architecture's Scalable Vector Extension (SVE) to TVM, containing an initial VLA and predication implementation, based on earlier work by Giuseppe Rossini.

@tkonolige tkonolige left a comment


Thanks for the RFC! I've left some comments.

Comment on lines 9 to 10
In this RFC we would like to propose a TIR extension to support scalable
vectorisation. This is an introductory RFC to see if the design of our

Can you define "scalable vectorization" in this paragraph?

Comment on lines 106 to 107
stride _VL_, which stands for Vector Length. _VL_ is only shown for ease of
representation and we don't store _VL_ anywhere inside the TIR data structures.

When is the vector length of the loop determined? Is this decided by the hardware?

a file called codegen_aarch64.cc which handles the creation of SVE intrinsics
in LLVM by visiting the Load, Store and For nodes (to be replaced with a While
node) and generating the relevant LLVM IR.


Please include a drawbacks section.

Expression language, for example:
```
s[C].vectorize_scalable(x)
```

What do you think about adding a new ForKind so that we can express scalable vectorization in TIR?

has changed, it is now scalable. The constructor is:

```
DataType(int code, int bits, int lanes, bool is_scalable = false)
```

Why not use a special value for lanes to indicate that it can be a variable number?

pass visits the TIR nodes and transforms them so that they account for the
unknown vector length. Our prototype was implemented before the addition of
the While node to TVM, so it currently generates a variable for loop. One
change we are planning to make is transforming a for loop into a while loop

Why is a while loop necessary?

Comment on lines 39 to 42
However, most modern vector architectures (e.g. X86 AVX and the Arm
Architecture's MVE and SVE extensions) support predicated vector instructions,
removing the need for such a scalar epilogue and also allowing more code to be
vectorised. Lane predication allows the enabling/disabling of certain lanes in

Is this PR talking about using predicated vector instructions or scalable vector instructions? If you aren't talking about predicated vector instructions, you could probably just remove this paragraph.


We would like to add support for Arm Architecture's Scalable Vector Extension (SVE) in TVM
by introducing features for Vector Length Agnostic (VLA) programs and
predication, i.e. the two main new SVE features. Thus we would like to express

You mention predication here, yet you never talk about how it is being added to TIR.


We can also see that the syntax of the Ramp nodes has now been modified to handle
an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it

We already have a concept of an unknown scalar value at compile time: Any. It seems like you could just have the lanes of Ramp be a PrimExpr instead of an int. Is there a reason not to take this approach?

@MeeraN7

Sorry it has taken a while to get back to you. To answer your question: we feel it would be much simpler to keep the number of lanes as an int, since we know the value will always be an integer; it is only the exact number that is unknown. Also, it would be slightly misleading to use the value Any, as the number of lanes can't be "any" number, but must instead be a multiple of a minimal number (in this case 128).
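To illustrate that arithmetic, a plain-Python aside written for this discussion (the numbers assume int8 elements and SVE's 128-bit minimum width):

```python
# Plain-Python illustration: the lane count is always an int, but which
# multiple of the minimum the hardware picks is unknown at compile time.
min_bits, elem_bits = 128, 8              # SVE minimum width; int8 elements
for multiple in (1, 2, 4):                # e.g. 128-, 256- and 512-bit parts
    lanes = multiple * min_bits // elem_bits
    print(lanes)                          # 16, 32, 64 -- never "any" number
```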


masahi commented Aug 4, 2021

I think it is a good idea to invite people working on RISC-V support for TVM to review/discuss, since the RISC-V vector extension is similar to Arm SVE. I remember some people in Taiwan are working on this. Maybe @comaniac knows who?

comaniac commented Aug 4, 2021

I think it is a good idea to invite people working on RISC-V support for TVM to review/discuss, since the RISC-V vector extension is similar to Arm SVE. I remember some people in Taiwan are working on this. Maybe @comaniac knows who?

Thanks for bringing this up. cc @yrchen


MeeraN7 commented Aug 5, 2021

@sjoerdmeijer @giuseros


tqchen commented Aug 5, 2021

Thank you @MeeraN7 for the RFC. SVE is certainly an interesting topic.

Because we do not yet have SVE support in TIR, it would be useful to think carefully about how SVE can be represented and transformed. Right now the proposal contains one way to do so; however, a bit more context would help us see how the proposed design impacts the general lowering flow.

Specifically, it would be very helpful to also show examples of the code along the transformations, so we can better understand the possible design tradeoffs. It might be helpful to translate some of the examples in the whitepaper to TVMScript form.

Specifically:

  • The TIR before SVE vectorization
  • The TIR after SVE vectorization right before LLVM codegen
  • The corresponding lowered llvm intrinsics

To touch a bit on the design alternatives (disclaimer: I only learnt about VLA/SVE by quickly reading through the manual, so I could be wrong): based on my understanding, the main goal of VLA is to use intrinsics to represent a somewhat restricted loop pattern (in terms of the operations we can perform), whereas the previous fixed-length vectorization pushes these onto vector constructs such as Ramp and Broadcast.

I wonder if we could take a different approach for VLA/SVE, because VLA is by nature closer to a loop pattern. Specifically, I feel we could come up with some form of "loop SVE legalization" that legalizes a loop's body to the patterns SVE supports, then leaves the for loop as it is with a VLA annotation. The code generator can then take that and generate VLA code.

@tqchen tqchen added the status: need review RFC needs review label Aug 5, 2021

tqchen commented Aug 5, 2021

cc @junrushao1994 @vinx13 as it can be related to the future tensorization optimizations

@tqchen tqchen assigned tqchen, masahi and areusch and unassigned tqchen Aug 5, 2021

giuseros commented Aug 5, 2021

Hi @tqchen,
I will try to comment sporadically, since this is a project I prototyped (and enjoyed :) ) when I was at Arm.

If I understand your comment correctly, what @MeeraN7 is doing is closer to what you are proposing than it might seem. Instead of transforming a loop into a Ramp and passing the Ramp "as is" to LLVM (which is what is done for fixed vector lengths, but is not doable for SVE), @MeeraN7 is legalising the loop in TIR and passing the legal loop down to the LLVM code generator. In other words, the following loop:

for i = 1:1:10
A[i] = B[i]+C[i];
end

Once legalised becomes

for i = 1:VL:10
A[VL] = B[VL]+C[VL]
end

And then the LLVM code generator, knowing that this is a variable-length loop, translates it using LLVM intrinsics for:

  • Predicated load/store
  • Loop increment
  • Predication mask calculation

Please note that only load/store needs to be predicated. Other register-to-register operations (e.g., add/sub/mul) won't need predication.

Please also note that while predication is not a TIR concept, we use it to support VLA (cc @tkonolige) in the LLVM codegen. In the future it should be quite straightforward to expose predication in TIR as well (if required).
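To make that concrete, here is a plain-Python model of the legalised loop (an illustration for this discussion, not TVM code; a fixed VL of 4 stands in for the hardware-chosen vector length):

```python
# Model of the legalised VLA loop: masked load/store, plain add,
# scalar induction variable incremented by VL each iteration.
def vla_add(A, B, C, VL):
    n = len(A)
    i = 0
    while i < n:
        # predication mask: which of the VL lanes are still in bounds
        mask = [i + lane < n for lane in range(VL)]
        a = [A[i + lane] if on else 0 for lane, on in enumerate(mask)]  # masked load
        b = [B[i + lane] if on else 0 for lane, on in enumerate(mask)]  # masked load
        c = [x + y for x, y in zip(a, b)]   # register-to-register add: no predication
        for lane, on in enumerate(mask):    # masked store
            if on:
                C[i + lane] = c[lane]
        i += VL                             # loop increment by the vector length

A = list(range(17))
B = list(range(17))
C = [0] * 17
vla_add(A, B, C, VL=4)  # real SVE hardware chooses VL; 4 is just for the demo
assert C == [a + b for a, b in zip(A, B)]
```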

@MeeraN7 feel free to jump in if something I said is not correct (very likely :) )


MeeraN7 commented Aug 6, 2021

Thank you @giuseros for your answer, it looked good to me. Just to add on: we know which loops need to be legalised because in TIR we add a new ForKind called kVectorizedScalable (cc @tkonolige), which marks the loop as vectorizable with an unknown value of VL. These loops are then legalised and transformed to include the unknown value VL, and later passed to the code generator as @giuseros mentioned.
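For reference, a plain-Python sketch of where the proposed kind would sit (the first five kinds mirror tvm.tir.ForKind today; VECTORIZED_SCALABLE is the RFC's proposal and its numeric value here is an assumption):

```python
from enum import IntEnum

class ForKind(IntEnum):
    SERIAL = 0
    PARALLEL = 1
    VECTORIZED = 2            # fixed-width vectorisation, known lane count
    UNROLLED = 3
    THREAD_BINDING = 4
    VECTORIZED_SCALABLE = 5   # proposed: vectorise with unknown VL
```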


MeeraN7 commented Aug 6, 2021

Thank you all for the comments so far, we are currently working on a response to address them.


tqchen commented Aug 6, 2021

Thanks @MeeraN7. I think the overall loop legalization makes sense. I wonder, then, whether it is necessary to update node constructs such as Ramp, or whether we can directly make use of a scalar loop in the body.

@MeeraN7 MeeraN7 force-pushed the initial-sve-addition branch from a1d2cf6 to 125582f Compare August 19, 2021 14:14
We can also see that the syntax of the Ramp nodes has now been modified to handle
an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it
are still the same; the only difference is that the number of lanes is unknown
@masahi masahi Aug 27, 2021


For this specific example, the semantics does change, doesn't it? Because in your example C[ramp(0, 1, 17)] = A[ramp(0, 1, 17)] + B[ramp(0, 1, 17)], 17 is not the length of a vector but the size of an input, while in the SVE example above the input length is explicitly divided by the vector length at the TIR level.

Of course, I understand that the semantics of Ramp node in general does not change.


To avoid confusion, it is probably better to talk about the different senses in which the Ramp node is used in fixed-width vs scalable vectorization: the former treats the entire input as one chunk, while the latter is specifically for a vector-length-wide chunk.


masahi commented Aug 27, 2021

Thanks @MeeraN7 @giuseros, I like the approach of making the vectorized loop explicit with the VL parameter at the TIR level, in contrast to how fixed-width vectorization is done today. It would be great if you could add a section on why a different implementation strategy was chosen. Perhaps to make the LLVM codegen simpler?

If possible, I think it is better not to introduce user facing changes, since as far as an API is concerned, s[C].vectorize(...) is already vector-length agnostic.

@sjoerdmeijer

@masahi, about:

If possible, I think it is better not to introduce user facing changes, since as far as an API is concerned, s[C].vectorize(...) is already vector-length agnostic.

which I think is very closely related to an earlier inline comment:

As I commented above, I'd like to continue using s[C].vectorize(...) and when the feature is available, enable SVE by a target attribute. So I don't expect any user facing work.

I think we do need a user-facing option to toggle fixed/scalable vectorisation. If the vectorisation strategy is selected based on an available target attribute, we lose the ability to choose between fixed and scalable vectorisation. For architectures that do support scalable vectorisation, fixed width might still be preferable in some cases.

I think this is similar to Clang's loop pragmas. For example, the vectorize_width pragma has been extended with an optional second argument fixed|scalable:

vectorize_width(_value_[, fixed|scalable]),

see also the Clang docs here.

So I see two approaches:

  • we extend s[C].vectorize(...) to take an optional fixed/scalable boolean value, similar to Clang's loop pragma, which defaults to fixed if omitted,
  • or introduce s[C].vectorize_scalable(...) as proposed in this RFC.

I personally don't have any preference. But now I am wondering if extending s[C].vectorize(...), the first option, would be better. What do you think?
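To make the two candidate spellings concrete, here is a sketch against a small TE schedule (only the existing vectorize call runs today; the two proposals are shown as comments because neither API exists yet):

```python
import tvm
from tvm import te

# A small elementwise compute to hang the schedule on.
A = te.placeholder((1024,), name="A", dtype="int8")
B = te.placeholder((1024,), name="B", dtype="int8")
C = te.compute((1024,), lambda i: A[i] + B[i], name="C")
s = te.create_schedule(C.op)
(x,) = C.op.axis

s[C].vectorize(x)                    # today: fixed-width vectorisation
# s[C].vectorize(x, scalable=True)   # option 1: extend the existing primitive
# s[C].vectorize_scalable(x)         # option 2: dedicated primitive (this RFC)
```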


masahi commented Sep 1, 2021

Thanks @sjoerdmeijer @giuseros, I didn't imagine that there would be a case where mixing fixed and scalable vectorization is beneficial. I prefer s[C].vectorize(..., scalable=True) to s[C].vectorize_scalable(...) but both seem fine.

Any other comments @tqchen @tkonolige?

```
s[C].vectorize(x)
```

Vectorisation along the x-axis is requested with _vectorize(x)_, and will

Within markdown, code is typically denoted by a pair of backticks surrounding the code. I suggest using that convention to enclose methods, variables, and other code.


tqchen commented Sep 10, 2021

Thanks @MeeraN7 @giuseros, to make the discussion more concrete, right now the IR after legalization looks like

  for (i: int32, 0, 17;i+=VL) {
    C_2[ramp(i, 1, VL)] = ((int8xVL*)A_2[ramp(i, 1, VL)] + (int8xVL*)B_2[ramp(i, 1, VL)])
  }

This would require changes to the Ramp data structure and to the data type to support VL vector types, which can be a bit ad hoc: there is additional information that needs to be encoded (e.g. that this VL and that VL are the same) but is nevertheless not clearly encoded here.

Given we are mostly matching a for-loop pattern, I also wonder if these changes are really necessary, since we could represent a VLA loop as some form of restricted for loop with special annotations. Here is a possible alternative:

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  }

We would then defer the vectorized instruction generation to the codegen phase, by specially handling the patterns in a for loop annotated as VLA. Of course we can only support a limited set of patterns (such as read/write to the same vector index, or limited reduction support), which is why legalization is needed to make sure the body of the VLA for loop satisfies the pattern.

In this way we can likely get a similar set of capabilities without hacking Ramp to carry a VL size.


MeeraN7 commented Sep 10, 2021

Hi @tqchen, thank you for the comment. To clarify a few things: VL is not added to the Ramp node at all; it is simply a string used when printing TIR, for visual representation. The only addition to the Ramp node (and also the Broadcast node) is a boolean called "is_scalable", which should not affect anything else since a separate constructor was added. I don't think any other information needs to be added to these nodes or data types.


tqchen commented Sep 10, 2021

Thanks @MeeraN7. Yes, I get what you mean. Right now we are adding an "is_scalable" field to indicate that the broadcast and ramp are "context dependent" on VL. Additionally, we might need to update DataType to indicate a scalable data type.

This context dependency is the missing information I mentioned here. The set of code is really undefined and should be parameterized by VL. Additionally, consider the case of two loops with different VL1 and VL2 on which we want to do some transformations: we might fall into the trap of thinking of them as the same type (because only "is_scalable" is marked) when in reality they are not, as the implicit dependency on VL can be ignored.

I can understand that the additional flag can be helpful, as we could reuse some of the vectorization logic. However, the "is_scalable" field might introduce the confusion described above, and the additional ramp node may not carry much extra information (apart from the fact that we use a scalar vs a vector type). So my main question is whether we could use a separate normal form to hint the code generator, without changing the current DataType, Ramp and Broadcast.

Specifically, a regular loop as follows would carry the same amount of information: an access through i (the VLA index) would indicate a vector load, while accesses through other indices would become normal loads. The main constraint is that we only allow i to appear in certain locations (say, in the innermost position, to represent a ramp-like pattern), and we defer the generation of SVE code to the codegen phase by pattern matching on the indices:

N1: A possible loop normal form via annotation

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  } 

N1 would indeed put a bit more pressure on the code generator, because it now needs to pattern match loads/stores of the VLA index (i), and possibly perform broadcasts where necessary. However, the additional overhead may not be too large, and this could help us keep the code cleaner. My guess is that this approach would also be easier to generalize to later loop patterns such as scalable matrix instructions, in which case we cannot really reuse Ramp and Broadcast.

@sjoerdmeijer

Thanks for commenting, @tqchen.
Could you further clarify a few things for me, please? See my remarks inline.

Thanks @MeeraN7. Yes, I get what you mean. Right now we are adding an "is_scalable" field to indicate that the broadcast and ramp are "context dependent" on VL. Additionally, we might need to update DataType to indicate a scalable data type.

This context dependency is the missing information I mentioned here.

I don't think I understand what you mean by context dependency. Basically, in my view of the world, the Ramp node means that we can process elements in a data-parallel way. How exactly this is done is up to the backends, depending on the architecture and the code. What we are doing here is annotating the Ramp node with a bit of state: a hint that we want a special kind of vectorisation. From this point of view it is syntactic sugar: if the hint is dropped or ignored, the code could still be vectorised, just not in a vector-length agnostic way.

I don't think we need to encode much more information than one boolean that specifies the vectorisation style. You could indeed argue that this is ad hoc, but if we do it in a different way we would still need to keep a bit of state around, like the annotation={"VLA"} example that you gave earlier. From that point of view, I don't see any difference.

The set of code is really undefined and should be parameterized by VL.

What do you mean by undefined here?

Additionally, consider the case of two loops with different VL1 and VL2 on which we want to do some transformations: we might fall into the trap of thinking of them as the same type (because only "is_scalable" is marked) when in reality they are not, as the implicit dependency on VL can be ignored.

I can understand that the additional flag can be helpful, as we could reuse some of the vectorization logic. However, the "is_scalable" field might introduce the confusion described above, and the additional ramp node may not carry much extra information (apart from the fact that we use a scalar vs a vector type). So my main question is whether we could use a separate normal form to hint the code generator, without changing the current DataType, Ramp and Broadcast.

Correct me if I am wrong, but your assumption seems to be that explicit scalable state is bad. Looking at your example:

N1: A possible loop normal form via annotation

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  } 

This annotation looks equivalent to a loop pragma. In Clang, for example, you can do:

 #pragma clang loop vectorize_width(4, scalable)
  for (i = 0; i < 17; ++i) {
    C_2[i] = A_2[i] + B_2[i];
  }

and thus request scalable vectorisation. If this annotation results in scalable vectorisation, the LLVM vectoriser will lower this to operations using <vscale x 4 x i32> scalable vector IR types.

What I would like to say with these examples is that there are two things going on at different levels. I think:

  • The annotation={"VLA"} corresponds to a loop pragma in Clang,
  • The TIR Ramp node extension corresponds to LLVM IR scalable types, e.g. <vscale x 4 x i32>.

And I think these 2 concepts are different things that both have their value and place.

If we can only annotate loops as being scalable, we lose the finer-grained control to request this at the statement level. I don't know if mixing fixed and scalable will be an important use-case, but I think it is possible.

Summarising, I don't think explicit encoding of scalable in TIR nodes is a bad thing; the opposite, actually: I think we need it, and the annotation on the loop might be a complementary technique to it.

What do you think?

Some small additions to Solution Approach and Next Steps
sections following discussion about representation of VL
and the vectorize_scalable function. Also, modified style
to use backticks for function and variable names
and any other code-related text.
@MeeraN7 MeeraN7 force-pushed the initial-sve-addition branch from 29c3914 to b6bc21a Compare September 16, 2021 15:48

tqchen commented Sep 29, 2021

Thanks @sjoerdmeijer, sorry for getting back to this late. If LLVM also encodes the SVE vector as a special type (without tying down n), it could be a precedent that we can learn from. I do want to point out that the same approach won't work when it comes to scalable matrix instructions, so it would be good to think about a loop-pattern based approach from that angle.

If we conclude that we really want to introduce scalable vectors into the data type and so on, a remaining thing we need to resolve is the encoding. DataType and DLDataType are part of the DLPack standard, used as a standard ABI convention, so ideally we do not want to change the type signature of DataType/DLDataType. What we might be able to do is introduce a magic lane number (say DataType::kScalableVectorLaneMark = -1) that indicates a scalable vector type, and use similar approaches for other language constructs as well. This would alleviate the problems on that side.

This being said, it would still be good to get @MeeraN7 @giuseros @sjoerdmeijer's take on the possibility of using a normalized loop pattern vs embedding scalability into the vector type.

Thank you for all the discussions so far!

@sjoerdmeijer

Hi @tqchen , thanks for the great feedback!

This being said, it would still be good to get @MeeraN7 @giuseros @sjoerdmeijer's take on the possibility of using a normalized loop pattern vs embedding scalability into the vector type.

Yes, I can see that being very useful. If my analogy is correct, the loop annotation corresponds to a loop pragma, and the scalable ramp/vector node to an IR extension. Like I wrote before, I think there's a place for both of these concepts.

Can you please advise how to proceed? I guess we need to document this in the RFC. But do we want to prototype this too? Or do we just list it as an alternative, and discuss here what we would like to do first?

I have not yet commented on your remark about the ABI convention, as I am not familiar enough with TVM to comment on this. It does look like a solvable problem, though.


tqchen commented Oct 4, 2021

The ABI issue is important since it affects DLPack, so I would suggest we agree on the encoding convention before we proceed. I am fine with having a scalable vector representation if we adopt the right encoding.

where the Ramp TIR node has the form 'Ramp(base, stride, lanes)' showing that
these elements are processed in (vector) lanes.

The size of 18 has been chosen to demonstrate the challenges of vectorising


The size is 17. There are other mentions of 18 in this paragraph as well.

example, and importantly no scalar epilogue. But since we do not need to
process 5 * 4 = 20 elements, the last vector operation only needs to write two
elements, which can be achieved by predication as we can enable the first two
lanes and disable the last 2 lanes.


Even though SVE may be adding per-lane predication for ARM, per-lane predication as a concept is unrelated to SVE and should be done independently. This is especially important for non-ARM architectures that do support it.

In addition to predication, and also related to it, some new vector
architectures also allow scalable vectorisation. As opposed to so called fixed
width vectorisation (e.g. AArch Neon), the Arm architecture SVE vector
extension allows implementations to choose a vector register length between 128


Please clarify that "implementations" means "processor implementations".

@kparzysz-quic

DataType and DLDataType are part of the DLPack standard,

The DataType type is defined in TVM; it's not present in DLPack. Or am I missing something?


tqchen commented Oct 11, 2021

@kparzysz-quic DataType is a thin wrapper around DLDataType.

@kparzysz-quic

Right, but users of DLPack will not see DataType, only DLDataType, unless they use TVM as well. Changing DataType will only affect those that use include/tvm/runtime/data_type.h (i.e. the ABI breakage will be limited to users of TVM).


@kparzysz-quic kparzysz-quic left a comment


This RFC is not about SVE. It's about vectorizing (countable) loops with bounds not known at compile time. I think that would be very valuable for all targets and we can determine the list of necessary features that a target should satisfy to implement it.

Even though SVE can be used to implement it, it only applies to ARM, and there is no compelling reason to introduce SVE-specific features to TIR. There may be room for unknown-length vectors in TIR, but that should be done separately from SVE.

Whether SVE is used to implement that or not should be left to the ARM codegen to decide.


tqchen commented Oct 11, 2021

@kparzysz-quic this is right. On the other hand, we still rely on conversion between DLDataType and DataType (e.g. getting a DataType from an NDArray) in many cases, so ideally we want to keep them consistent.

@smeijer1234

A belated thank you for sharing further thoughts on this, @tqchen and @kparzysz-quic. I am on holiday this week (and a little of next), but want to pick this up soon after that. In the meantime, some remarks/questions from my side.

First of all, the ABI discussion goes over my head at the moment, to be honest; I am not familiar enough yet to comment on or address it. My understanding from the discussion so far is that there will be an ABI break, but a rather limited one, so it doesn't require any further attention here. Would that be a correct summary?

Perhaps more fundamental is @kparzysz-quic 's remark:

This RFC is not about SVE. It's about vectorizing (countable) loops with bounds not known at compile time. I think that would be very valuable for all targets and we can determine the list of necessary features that a target should satisfy to implement it.

Even though SVE can be used to implement it, it only applies to ARM, and there is no compelling reason to introduce SVE-specific features to TIR. There may be room for unknown-length vectors in TIR, but that should be done separately from SVE.

Yep, I agree, and you're right about this: this is not really about SVE. The "only" addition to TIR is the is_scalable part, which I would not consider SVE-specific, because it just represents an unknown-length vector, and any back-end can deal with that in any way it likes (and there are quite a few architectures that now support scalable-like vectors). So, just for my understanding, would you mind elaborating on your thoughts here? For example, is it just a naming issue and we need to find another term, or is it more fundamentally about how we would like to do things in TIR? Thanks!


tqchen commented Oct 27, 2021

@smeijer1234, to be more specific: I was recommending an approach that won't break the ABI.

Introduce

DataType::kScalableVectorLaneMark = -1

And in DataType, introduce a method IsScalable() that returns true if lanes equals -1. This way we reuse the data layout of the original data structure without introducing the is_scalable field.
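A minimal sketch of that encoding in plain Python (illustrative only; the real change would live in TVM's C++ DataType):

```python
# Reuse the existing integer `lanes` field; -1 is the reserved mark.
kScalableVectorLaneMark = -1

class DataType:
    def __init__(self, code, bits, lanes):
        self.code, self.bits, self.lanes = code, bits, lanes

    def is_scalable(self):
        # scalable if and only if lanes carries the reserved mark
        return self.lanes == kScalableVectorLaneMark

assert DataType(0, 8, kScalableVectorLaneMark).is_scalable()
assert not DataType(0, 8, 16).is_scalable()
```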


manupak commented Jun 21, 2022

Hi @tqchen @kparzysz-quic @masahi @tkonolige @smeijer1234 ,

We are looking to revive this work. I have gone through the thread.
The summary so far is as follows:

  • We want to introduce/enhance a scheduling vectorization primitive that can be controlled by the user/auto-tuner/auto-scheduler to use scalable vectors in the backend codegen.

    • The conversation has resolved to extending the existing vectorize scheduling primitive, i.e. s[C].vectorize(..., scalable=True).
  • Usage of this scheduling primitive should result in a for loop with Ramp nodes carrying either an additional "is_scalable" argument or a special number for lanes.

    • I think @tqchen was suggesting using the special lane number (-1), as opposed to introducing an additional argument to TIR nodes such as Ramp and Broadcast as well as to DataType (and DLDataType), to avoid ABI breakage.
    • Moreover, VectorizeLoopScalable will be modified to create a While node.
  • Is the name of the RFC confusing, @kparzysz-quic? I suppose that for TIR, what we are adding is vector-length agnostic vectorization support, while demonstrating the codegen of VLA-vectorized TIR using Arm(R) SVE instructions.

Please confirm whether this is the right summary of the current state.
As for next steps, I would like to propose/resolve each of the outstanding discussion points and update the RFC.


areusch commented Jun 30, 2022

@manupa-arm I think your summary is roughly correct.

@tqchen @masahi @kparzysz-quic perhaps you could have a look and see if we can resolve the question of how to represent the scalable concept in DLDataType. Also cc @alter-xp in case they would like to leverage this for RISC-V (I'm aware you are doing a BYOC implementation, but perhaps this could be useful if the approach goes beyond that in the future).

I do tend to agree with @kparzysz-quic that we could make the title a bit more generic, since the DLDataType question is at the center of this RFC.


alter-xp commented Jul 1, 2022

Thanks for bringing this up. This is also very useful for RISC-V; we look forward to progress in this regard.

@kparzysz-quic

To reiterate: my original concern was that the first RFC was proposing changes to the target-independent part of TVM to add support for a very target-specific feature. However, I do think that we can move this forward in a way that would be overall useful.

Here is the outline of my thoughts on this. Let me know what you think.

First, a couple of observations:

  1. Architectures that support vectors can be assumed to also support vector predication. I'm talking specifically about masked operations, and in particular about predicated loads and stores.
  2. For ARM/AArch64, it may be beneficial to distinguish vectorization via fixed-length vectors from one via scalable vectors. If this choice is to be made by auto-scheduling, it should be expressible in TIR.

What this RFC proposes is very close to allowing vectorization of countable loops with variable iteration count, and I insist that we keep this in mind as a goal.

The way that vectorization works right now is that a loop like

for (i : [0, 130)) {
  C[i] = A[i] + B[i]
  D[i] = A[i] * B[i]
}

will be replaced with statements

C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]

The expressions within these statements are all PrimExpr, whose type must be expressible by DataType. All parameters in DataType are compile-time integers, which means that a single statement can only represent vectors with a known number of lanes. In other words, neither VIC (variable iteration count) nor VLA can be implemented without some changes. These changes may be in how types are represented in DataType, or in how vectorization is done (or a combination of the two).

We are already considering a special value for DataType::lanes that would represent the yet-unknown vector length (VL). Following Halide's approach to vectorization, I propose that we change vectorization to take an explicit vector length as a parameter. As a special case for SVE, the scalable VL could be represented by the same constant we chose for DataType::lanes. For compatibility with existing code, stage.vectorize() would be equivalent to stage.vectorize(vector_length=iter_count), since currently only loops with known iteration count can be vectorized. The argument value vector_length=VL would indicate using SVE. With vectorize(vector_length=32), the loop above would be turned into

for (i : [0, (130+31)/32)) {
  // i-th vector is [32*i..32*(i+1))
  C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] = A[Ramp..., pred=...] + ...
  ...
}

If the loop iteration count changed from the known integer 130 to some expression N, the generated code would remain mostly the same: the structure does not depend on the fact that 130 is a compile-time constant. Similarly, the 32 indicating vector length could be replaced with the predefined value for "scalable vector length", with the only potential issue being the calculation of the iteration count of the for loop above. If we were to allow an explicit "stride" in For, the issue would go away (the RFC proposes something like that). The predicate arithmetic is sketched after the summary below.

To summarize:

  1. Introduce kScalableVectorLaneMark (as suggested by @tqchen).
  2. Make vector length a parameter to stage.vectorize.
  3. Introduce "predicate" to BufferLoad and BufferStore.
  4. Allow non-unit strides in For loops (as per the RFC).
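A plain-Python model of the per-lane predicate in the vectorized loop above (illustration only, not TIR):

```python
# Per-lane predicate: Ramp(32*i, 1, 32) < Broadcast(130, 32).
def lane_predicate(i, vector_length=32, iter_count=130):
    ramp = [vector_length * i + lane for lane in range(vector_length)]
    return [idx < iter_count for idx in ramp]

# 130 = 4*32 + 2, so on the fifth (last) vector iteration only the
# first two lanes are active and the store touches just those elements.
assert lane_predicate(4) == [True, True] + [False] * 30
```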


wrongtest-intellif commented Jul 3, 2022

Hi~ here are my two questions :)
cc @kparzysz-quic

  • 2. Make vector length a parameter to stage.vectorize.

    What is the difference between

    • sch[C].vectorize(v, vector_length=32) and
    • vo, vi = sch[C].split(v, 32) then sch[C].vectorize(vi)

    It seems that we could also choose to properly lower the split's predicate to reach the same goal as proposed below. For example, the mechanisms introduced in RFC [RFC] Buffer Layout Padding #77 may help?

  • 3. Introduce "predicate" to BufferLoad and BufferStore.

    Our team was also confused about how to represent predicated loads/stores when, several months ago, upstream upgraded T.load/T.store (which had a 1D predicate field) to BufferLoad/BufferStore. Now that BufferLoad/BufferStore are multi-dimensional, does the predicate also become a multi-dimensional field?

    Another concern is whether embedding a predicate into BufferLoad/BufferStore increases the complexity of (or breaks) buffer-region-related analysis in existing implementations. Could we leverage T.select(pred, A[...], undef) to represent A[..., pred], or just match a predicated memory access pattern like if (pred) C[...] = ...?

Thanks!

@kparzysz-quic

Hi~ here are my two questions :) cc @kparzysz-quic
What is the difference between

  * `sch[C].vectorize(v, vector_length=32)` and
  * `vo, vi = sch[C].split(v, 32)` then `sch[C].vectorize(vi)`
  
  It seems that we could also choose to properly lower the split's predicate to reach the same goal as proposed below. For example, the mechanisms introduced in RFC [[RFC] Buffer Layout Padding #77](https://github.com/apache/tvm-rfcs/pull/77) may help?

Yes. The padding proposed in the padding RFC could be utilized for this. What I wrote didn't take that into account.

  Our team was also confused about how to represent predicated loads/stores when, several months ago, upstream upgraded `T.load`/`T.store` (which had a 1D predicate field) to `BufferLoad`/`BufferStore`. Now that `BufferLoad`/`BufferStore` are multi-dimensional, does the predicate also become a multi-dimensional field?

I suppose it could be multi-dimensional, but effectively it would be the conjunction of all the per-dimension predicates.

  Another concern is whether embedding a predicate into `BufferLoad`/`BufferStore` increases the complexity of (or breaks) buffer-region-related analysis in existing implementations. Could we leverage `T.select(pred, A[...], undef)` to represent `A[..., pred]`, or just match a predicated memory access pattern like `if (pred) C[...] = ...`?

The region analysis could simply ignore the predicate (assume that it's true for all lanes). This is presumably what would happen with the select. For stores, we'd have to have an if statement that accepts a vector of booleans to properly emulate the store predicate. Adding a predicate to BufferLoadNode and BufferStoreNode is a feature that could have some value on its own, so perhaps it deserves its own RFC...
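A tiny plain-Python model of the load/store asymmetry described above (illustration only, not TIR):

```python
pred = [True, True, False, False]       # 4-lane load/store predicate
A = [1, 2, 3, 4]
C = [0, 0, 0, 0]
undef = None

# Load side: select(pred, A[...], undef) -- region analysis can treat the
# whole region as read and simply ignore the predicate.
loaded = [a if on else undef for a, on in zip(A, pred)]

# Store side: a per-lane "if" is needed so disabled lanes leave memory
# untouched, i.e. an if statement taking a vector of booleans.
for lane, on in enumerate(pred):
    if on:
        C[lane] = loaded[lane]
assert C == [1, 2, 0, 0]
```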


tqchen commented Jul 12, 2022

Thanks folks. Just to come back on this: my main comment is with respect to the data structure change. Leverage the special mark so that we don't break the ABI of the runtime type:

DataType::kScalableVectorLaneMark = -1


areusch commented Jul 12, 2022

@tqchen just to clarify: do we then add this to the DLPack repo, or do we consider this a specialized use of lanes internal to TVM?


tqchen commented Aug 10, 2022

@areusch this can be something that is internal to TVM for now, given it is only used as a compiler abstraction, not at runtime.

@lhutton1

Closing as superseded by: #104

@lhutton1 lhutton1 closed this Feb 20, 2024
