[RFC] Adding initial SVE implementation #18
Conversation
tkonolige left a comment:
Thanks for the RFC! I've left some comments.
rfcs/initial_sve_addition.md (outdated):

> In this RFC we would like to propose a TIR extension to support scalable
> vectorisation. This is an introductory RFC to see if the design of our

Can you define "scalable vectorization" in this paragraph.
rfcs/initial_sve_addition.md (outdated):

> stride _VL_, which stands for Vector Length. _VL_ is only showed for ease of
> representation and we don't store _VL_ anywhere inside the TIR data structures.

When is the vector length of the loop determined? Is this decided by the hardware?
rfcs/initial_sve_addition.md (outdated):

> a file called codegen_aarch64.cc which handles the creation of SVE intrinsics
> in LLVM by visiting the Load, Store and For nodes (to be replaced with a While
> node) and generating the relevant LLVM IR.

Please include a drawbacks section.
rfcs/initial_sve_addition.md (outdated):

> Expression language, for example:
> ```
> s[C].vectorize_scalable(x)
> ```

What do you think about adding a new ForKind so that we can express scalable vectorization in TIR.
rfcs/initial_sve_addition.md (outdated):

> has changed, it is now scalable. The constructor is:
>
> ```
> DataType(int code, int bits, int lanes, bool is_scalable = false)
> ```

Why not use a special value for lanes to indicate that it can be a variable number?
rfcs/initial_sve_addition.md (outdated):

> pass visits the TIR nodes and transforms them so that they account for the
> unknown vector length. Currently, our prototype was implemented before the
> addition of the While node to TVM, and so it generates a variable for loop. One
> change we are planning to make is transforming a for loop into a while loop

Why is a while loop necessary?
rfcs/initial_sve_addition.md (outdated):

> However, most modern vector architectures (e.g. X86 AVX and the Arm
> Architecture's MVE and SVE extensions) support predicated vector instructions,
> removing the need for such a scalar epilogue and also allowing more code to be
> vectorised. Lane predication allows the enabling/disabling of certain lanes in

Is this PR talking about using predicated vector instructions or scalable vector instructions? If you aren't talking about predicated vector instructions, you could probably just remove this paragraph.
rfcs/initial_sve_addition.md (outdated):

> We would like to add support for Arm Architecture's Scalable Vector Extension (SVE) in TVM
> by introducing features for Vector Length Agnostic (VLA) programs and
> predication, i.e. the 2 main new SVE features. Thus we would like to express

You mention predication here, yet you never talk about how it is being added to TIR.
rfcs/initial_sve_addition.md (outdated):

> We can also see the syntax of the Ramp nodes have now been modified to handle
> an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
> integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it

We already have a concept of an unknown scalar value at compile time: Any. It seems like you could just have the lanes of Ramp be a PrimExpr instead of an int. Is there a reason not to take this approach?
Sorry it has taken a while to get back to you. But to answer your question we feel it would be much simpler to keep the number of lanes as an int, since we know that the value will always be an integer but the exact number is what is unknown. Also, it would be slightly misleading to use the value Any as the number of lanes can't be "any" number but instead a multiple of a minimal number (in this case 128).
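The constraint described here, where the lane count is unknown at compile time but is always a multiple of a minimal granule, can be illustrated with a small Python sketch (the function name is made up; the 128-bit granule and 2048-bit ceiling are SVE's architectural bounds):

```python
SVE_GRANULE_BITS = 128   # SVE's minimum vector length
SVE_MAX_BITS = 2048      # architectural maximum

def is_legal_sve_length(bits):
    """A processor implementation may pick any vector length that is a
    multiple of the 128-bit granule, up to 2048 bits; the compiler only
    knows this constraint, not the concrete value."""
    return (SVE_GRANULE_BITS <= bits <= SVE_MAX_BITS
            and bits % SVE_GRANULE_BITS == 0)

# 256-bit and 512-bit registers are legal; 192-bit is not
print([is_legal_sve_length(b) for b in (256, 512, 192)])
```

This is why the number of lanes can't be "Any": it is a definite, but unknown, member of a small constrained set.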
I think it is a good idea to invite people working on RISC-V support for TVM to review/discuss, since the RISC-V vector extension is similar to Arm SVE. I remember some people in Taiwan are working on this. Maybe @comaniac knows who?
Thank you @MeeraN7 for the RFC. SVE is certainly an interesting topic. Because we do not yet have SVE support in TIR, it would be useful to think carefully about how SVE can be represented and transformed. Right now the proposal contains one way to do so. However, we could use a bit more context information, to see how the proposed design impacts the general lowering flow. Specifically, it would be very helpful to also show examples of the code along the transformations, so we can have a better understanding of the possible design tradeoffs. It might be helpful to translate some of the examples in the whitepaper to TVMScript form. Specifically:

To touch a bit on the design alternatives (disclaimer: I only learnt VLA/SVE by quickly reading through the manual, so I could be wrong). Based on my understanding, the main goal of VLA is to use intrinsics to represent a somewhat restricted loop pattern (in terms of the possible operations we can do), while the previous fixed-length vectorization pushes the operations onto vector constructs such as Ramp and Broadcast. I wonder if we could take a different approach for VLA/SVE, because VLA is by nature closer to the loop pattern. Specifically, I feel we could come up with some form of "loop SVE legalization" that legalizes a loop's body to the patterns that SVE supports, then leaves the for loop as it is with a VLA annotation. Then the code generator can take that and generate VLA code.
cc @junrushao1994 @vinx13 as it can be related to the future tensorization optimizations

Hi @tqchen, if I understand your comment correctly, what @MeeraN7 is doing is closer to what you are proposing. Instead of transforming a loop into a Ramp and passing the ramp "as is" to LLVM (which is what is done for fixed vector length, but not doable for SVE), @MeeraN7 is legalising the loop in TIR and passing the legal loop down to the LLVM code generator. In other words, the following loop, once legalised, becomes a form that the LLVM code generator, knowing that this is a variable loop, translates with some LLVM intrinsics for:
Please note that only load/store needs to be predicated; other register-to-register operations (e.g., add/sub/mul) won't need predication. Please also note that while predication is not a TIR concept, we use it to support VLA (cc @tkonolige) in the LLVM codegen. In the future it should be quite straightforward to expose predication also in TIR (if required). @MeeraN7 feel free to jump in if something I said is not correct (very likely :) )
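The legalisation flow described above, a loop stepping by VL with predicated loads/stores and unpredicated register-to-register arithmetic, can be simulated in plain Python (purely illustrative; `vla_add` and the concrete VL value are invented for the sketch, since real hardware fixes VL):

```python
def vla_add(A, B, C, VL):
    """Simulate a legalised vector-length-agnostic loop:
    for (i = 0; i < n; i += VL), with the tail handled by predication
    rather than a scalar epilogue."""
    n = len(A)
    i = 0
    while i < n:
        # whilelt-style predicate: lane j is active iff i + j < n
        mask = [i + j < n for j in range(VL)]
        # predicated loads: inactive lanes are simply not read
        a = [A[i + j] if mask[j] else 0 for j in range(VL)]
        b = [B[i + j] if mask[j] else 0 for j in range(VL)]
        # register-to-register add needs no predication
        c = [x + y for x, y in zip(a, b)]
        # predicated store: only active lanes are written back
        for j in range(VL):
            if mask[j]:
                C[i + j] = c[j]
        i += VL
    return C

out = vla_add(list(range(17)), [1] * 17, [0] * 17, VL=8)
print(out[:4], out[-1])  # [1, 2, 3, 4] 17
```

The same code works unchanged for any VL, which is the point of vector-length agnosticism: only the predicate depends on how far the last iteration runs past the array bound.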
Thank you @giuseros for your answer, it looked good to me. Just to add on, we know which loops need to be legalised as in TIR we do add a new
Thank you all for the comments so far, we are currently working on a response to address them.
Thanks @MeeraN7. I think the overall loop legalization makes sense. I wonder then if it is necessary to update node constructs such as Ramp, or whether we can directly make use of a scalar loop in the body.
> We can also see the syntax of the Ramp nodes have now been modified to handle
> an unknown vector length, as seen by _ramp(i, 1, VL)_, instead of a fixed
> integer. The form is still _Ramp(base, stride, lanes)_ and the semantics of it
> are still the same, the only difference is that the number of lanes is unknown

For this specific example, the semantics does change, doesn't it? Because in your example C[ramp(0, 1, 17)] = A[ramp(0, 1, 17)] + B[ramp(0, 1, 17)], 17 is not the length of a vector but the size of an input, while in the SVE example above the input length is explicitly divided by the vector length at the TIR level.
Of course, I understand that the semantics of the Ramp node in general does not change.

To avoid confusion, it is probably better to talk about the different senses in which the Ramp node is used between fixed-width and scalable vectorization. The former treats the entire input as one chunk while the latter is specifically for a vector-length-wide chunk.
Thanks @MeeraN7 @giuseros, I like the approach of making the vectorized loop explicit. If possible, I think it is better not to introduce user-facing changes, since as far as an API is concerned,
@masahi, about:
which I think is very closely related to an earlier inline comment:
I think we do need a user-facing option to toggle fixed/scalable vectorisation. If the vectorisation strategy is selected based on an available target attribute, we lose control to choose fixed/scalable vectorisation. For architectures that do support scalable vectorisation, fixed width might still be preferable in some cases. I think this is similar to Clang's loop pragmas. For example, the
see also the Clang docs here. So I see two approaches:
I personally don't have any preference. But now I am wondering if extending
Thanks @sjoerdmeijer @giuseros, I didn't imagine that there would be a case where mixing fixed and scalable vectorization is beneficial. I prefer Any other comments @tqchen @tkonolige?
rfcs/0018-initial-sve-addition.md (outdated):

> ```
> s[C].vectorize(x)
> ```
>
> Vectorisation along the x-axis is requested with _vectorize(x)_, and will

Within markdown, code is typically denoted with a pair of backticks surrounding the code. I suggest using that as the convention to enclose methods, variables, and other code.
Thanks @MeeraN7 @giuseros, to make the discussion more concrete, right now the IR after legalization looks like:

    for (i: int32, 0, 17; i += VL) {
      C_2[ramp(i, 1, VL)] = ((int8xVL*)A_2[ramp(i, 1, VL)] + (int8xVL*)B_2[ramp(i, 1, VL)])
    }

This would require changes to structures such as Ramp and DataType to support the VL vector types, which can be a bit ad hoc, because there is also additional information that needs to be encoded (e.g. this VL and that VL are the same) but is nevertheless not clearly encoded here. Given we are mostly matching a for pattern, I also wonder if they are really necessary, since we could represent a VLA loop with some form of restricted for loop with special annotations. Here is a possible alternative way to do so:

    for (i: int32, 0, 17; i, annotation={"VLA"}) {
      C_2[i] = A_2[i] + B_2[i];
    }

And we would be deferring the vectorized instruction generation to the codegen phase, by specially handling the patterns in the for loop that is annotated as a VLA loop. Of course we can only support a limited set of patterns (such as read/write to the same vector index or limited reduction support); that is why legalization is needed to make sure the body of the VLA for loop satisfies the pattern. In this way we can likely get a similar set of things without hacking in a ramp with VL size.
Hi @tqchen, thank you for the comment. To clarify a few things, VL is not added to the Ramp Node at all, it is simply a string that is used when printing TIR for visual representation. The only addition to the Ramp Node (and also Broadcast Node) is a boolean called "is_scalable" which should not affect anything else as a separate constructor was added. I don't think any other information needs to be added to these nodes or data types.
Thanks @MeeraN7. Yes I get what you mean. Right now we are adding an "is_scalable" field to indicate that the broadcast and ramp are "context dependent" on VL. Additionally, we might need to update DataType to indicate a scalable data type. This context dependency is the missing information I mentioned here. The set of code is really undefined and should be parameterized by VL. Additionally, considering the case of two loops with different VL1 and VL2 where we want to do some transformations, we might fall into the trap of thinking of them as the same type (because only "is_scalable" is marked) but in reality they are not, as an implicit dependency on VL can be ignored. I can understand that the additional flag can be helpful as we could reuse some of the vectorization logic. However, the "is_scalable" field might introduce additional confusion as above, and the additional ramp node may not carry too much additional information (apart from the fact that we use a scalar vs a vector type). So my main question is whether or not we could use a separate normal form to hint the code generator without changing the current DataType, ramp and broadcast. Specifically, in a regular loop as follows, the access to i (the VLA index) would indicate a vector load, and access to other indices would become a normal load. The main constraint could be that we can only allow N1, a possible loop normal form via annotation. N1 would indeed push a bit more pressure onto the code generator, because now the code generator needs to pattern match load/store of the VLA index (i), and possibly perform broadcast if necessary. However, the additional overhead may not be too large and could help us to keep the code cleaner. My guess is that this approach is also easier to generalize to later loop patterns such as scalable matrix instructions, in which case we cannot really reuse ramp and broadcast.
Thanks for commenting @tqchen
I don't think I understand what you mean by context dependency. Basically, in my view of the world, the Ramp node means that we can process elements in a data parallel way. How exactly this is done, is up to backends depending on the architecture and the code. What we are doing here is annotating this Ramp node with a bit of state, which is a hint that we want a special kind of vectorisation. From this point of view it is syntactic sugar: if the hint is dropped or ignored, it could still be vectorised, but not in a vector length agnostic way. I don't think we need to encode much more information than one boolean that specifies the vectorisation style. You could indeed argue that this is ad-hoc, but if we are going to do it in a different way, we would still need to keep a bit of state around, like the
What do you mean by undefined here?
Correct me if I am wrong, but your assumption is that explicit scalable state is bad. Looking at your example:
This annotation looks equivalent to a loop pragma. In Clang you can, for example, use a loop pragma and thus request scalable vectorisation. If this annotation results in scalable vectorisation, the LLVM vectoriser will lower this to scalable vector operations. What I would like to say with these examples is that there are 2 things going on at different levels. I think:
And I think these 2 concepts are different things that both have their value and place. If we can only annotate loops as being scalable, we lose the finer-grained control to request this on a statement level. I don't know if mixing fixed and scalable will be an important use-case, but I think it is possible. Summarising, I don't think explicit encoding of scalable in TIR nodes is a bad thing; the opposite actually, I think we need it, and the annotation on the loop might be a complementary technique to this. What do you think?
Some small additions to Solution Approach and Next Steps sections following discussion about representation of VL and the vectorize_scalable function. Also, modified style to use back single quotes for function and variable names and any other code related text.
Thanks @sjoerdmeijer, sorry for getting back to this late. If LLVM also encodes the SVE vector as a special type (and not tying the n) it could be a precedent that we can learn from. I do want to point out that the same approach won't work when it comes to scalable matrix instructions, so it would be good to think about a loop-pattern-based approach from that angle. If we conclude that we really want to introduce the scalable vector into the data type and so on, a remaining thing we need to resolve is the encoding. DataType and DLDataType are part of the DLPack standard, used as a standard ABI convention, so ideally we do not want to change the type signature of DataType/DLDataType. What we might be able to do is to introduce a magic lane number (say, -1). This being said, it would still be good to get @MeeraN7 @giuseros @sjoerdmeijer's take on the possibility of using a normalized loop pattern vs embedding into the vector type. Thank you for all the discussions so far!
Hi @tqchen , thanks for the great feedback!
Yes, I can see that as being very useful. If my analogy is correct, the loop annotation corresponds to a loop pragma, and the scalable ramp/vector node to an IR extension. Like I wrote before, I think there's a place for both of these concepts. Can you please advise how to proceed? I guess we need to document this in the RFC. But do we want to prototype this too? Or do we just list it as an alternative, and discuss here what we would like to do first? I have not yet commented on your remark about the ABI convention, as I am not familiar enough with TVM to comment about this. It looks, though, like that is a solvable problem.
The ABI issue is important since it affects DLPack, so I would suggest we agree on the encoding convention before we proceed. I am fine with having a scalable vector repr if we can take the right encoding.
> where the Ramp TIR node has the form 'Ramp(base, stride, lanes)' showing that
> these elements are processed in (vector) lanes.
>
> The size of 18 has been chosen to demonstrate the challenges of vectorising

The size is 17. There are other mentions of 18 in this paragraph as well.
> example, and importantly no scalar epilogue. But since we do not need to
> process 5 * 4 = 20 elements, the last vector operation only needs to write two
> elements, which can be achieved by predication as we can enable the first two
> lanes and disable the last 2 lanes.

Even though SVE may be adding per-lane predication for ARM, per-lane predication as a concept is unrelated to SVE and should be done independently. This is especially important for non-ARM architectures that do support it.
> In addition to predication, and also related to it, some new vector
> architectures also allow scalable vectorisation. As opposed to so called fixed
> width vectorisation (e.g. AArch Neon), the Arm architecture SVE vector
> extension allows implementations to choose a vector register length between 128

Please clarify that "implementations" means "processor implementations".
@kparzysz-quic DataType is a thin wrapper around DLDataType
Right, but users of DLPack will not see
kparzysz-quic left a comment:
This RFC is not about SVE. It's about vectorizing (countable) loops with bounds not known at compile time. I think that would be very valuable for all targets and we can determine the list of necessary features that a target should satisfy to implement it.
Even though SVE can be used to implement it, it only applies to ARM, and there is no compelling reason to introduce SVE-specific features to TIR. There may be room for unknown-length vectors in TIR, but that should be done separately from SVE.
Whether SVE is used to implement that or not should be left to the ARM codegen to decide.
@kparzysz-quic this is right. On the other hand, we still rely on conversion between DLDataType and DataType (e.g. getting DataType from an ndarray) in many cases. So ideally we want to keep them consistent.
A belated thank you for sharing further thoughts on this @tqchen and @kparzysz-quic . I am on holiday this week (and a little bit of next), but want to pick this up soon after that. In the mean time, some remarks/questions from my side. First of all, the ABI discussion goes over my head at the moment to be honest; I am not familiar enough yet to comment or address this. My understanding from the discussion so far is, that there will be an ABI break, but it is rather limited. So that doesn't require any further attention here, would that be a correct summary? Perhaps more fundamental is @kparzysz-quic 's remark:
Yep, I agree, and you're right about this: this is not really about SVE. The "only" addition to TIR is the
@smeijer1234 to be more specific, I was recommending an approach that won't break the ABI. Introduce a magic lane value, and in DataType introduce a method IsScalable(), which checks whether lanes equals -1 and, if so, returns true. This way we reuse the data layout of the original data structure without introducing the is_scalable field.
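A rough sketch of this encoding (Python, with illustrative names; the real DataType is a C++ wrapper over DLPack's DLDataType, so this only models the idea):

```python
kScalableLanes = -1  # magic value standing in for the unknown VL

class DataType:
    """Mirrors the (type code, bits, lanes) triple of DLDataType.
    No extra field is added, so the ABI layout is unchanged."""
    def __init__(self, code, bits, lanes):
        self.code = code
        self.bits = bits
        self.lanes = lanes

    def is_scalable(self):
        # Scalability is derived from the sentinel, not stored separately
        return self.lanes == kScalableLanes

int8x16 = DataType(0, 8, 16)              # ordinary fixed-width vector
int8xVL = DataType(0, 8, kScalableLanes)  # scalable: lanes unknown until runtime
```

The design choice here is that consumers unaware of the sentinel still see a well-formed (code, bits, lanes) triple, which is why no ABI break occurs.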
Hi @tqchen @kparzysz-quic @masahi @tkonolige @smeijer1234 , We are looking to revive this work. I have gone through the thread.
Please confirm whether this is a correct summary of the current state.
@manupa-arm i think your summary is roughly correct. @tqchen @masahi @kparzysz-quic perhaps you guys could have a look and see if we could resolve the question of how to represent the scalable concept in DLDataType. also cc @alter-xp in case they would like to leverage this for RISC-V (i'm aware you guys are doing a BYOC impl, but perhaps this could be useful if the approach goes beyond that in the future). i do tend to agree with @kparzysz-quic that we could make the title a bit more generic since the DLDataType question is at the center of this RFC.
Thanks for bringing this up. This is also very useful on RISC-V. We look forward to progress in this regard.
To reiterate, my original concern was that the first RFC was proposing changes to the target-independent part of TVM to add support for a very target-specific feature. However, I do think that we can move this forward in a way that would be overall useful. Here is the outline of my thoughts on this. Let me know what you think. First, a couple of observations:

What this RFC proposes is very close to allowing vectorization of countable loops with variable iteration count, and I insist that we keep this in mind as a goal. The way that vectorization works right now is that a loop like

will be replaced with statements

The expressions within these statements are all

We are already considering a special value for

If the loop iteration count changed from a known integer

To summarize:
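The current fixed-width rewrite described here, replacing a counted loop with one vector statement built from Ramp and Broadcast, can be mimicked with Python lists (helper names are illustrative, not TVM's API):

```python
def ramp(base, stride, lanes):
    # Ramp(base, stride, lanes): the index vector [base, base+stride, ...]
    return [base + stride * k for k in range(lanes)]

def broadcast(value, lanes):
    # Broadcast(value, lanes): the same scalar in every lane
    return [value] * lanes

# for (i = 0; i < 4; i++) C[i] = A[i] + 2   becomes, after vectorization,
# one vector statement: C[Ramp(0,1,4)] = A[Ramp(0,1,4)] + Broadcast(2,4)
A = [10, 20, 30, 40]
C = [0, 0, 0, 0]
idx = ramp(0, 1, 4)
result = [A[j] + b for j, b in zip(idx, broadcast(2, 4))]
for j, v in zip(idx, result):
    C[j] = v
print(C)  # [12, 22, 32, 42]
```

Making the trip count variable would mean the `lanes` argument is no longer a compile-time constant, which is exactly where the special lanes value and load/store predicates come in.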
Hi~ here are my two questions :)
Thanks!
Yes. The padding proposed in the padding RFC could be utilized for this. What I wrote didn't take that into account.
I suppose it could be multi-dimensional, but effectively it would be the conjunction of all the per-dimension predicates.
The region analysis could simply ignore the predicate (assume that it's true for all lanes). This is presumably what would happen with the
Thanks folks, just to come back on this: my main comment is with regard to the data structure change, i.e. leverage the special mark so we don't break the ABI of the runtime type.
@tqchen just to clarify, do we then add this to the DLPack repo or do we consider this a specialized use of lanes internal to TVM?
@areusch this can be something that is internal to TVM for now, given this is only used as a compiler abstraction, not at runtime.
Closing as superseded by: #104
Introducing the addition of Arm Architecture's Scalable Vector Extension (SVE) in TVM containing initial VLA and predication implementation, based on earlier work by Giuseppe Rossini.
@mbaret