[RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE) #94

ekalda · 2022-09-28T11:23:21Z

This RFC proposes adding a CodeGenAArch64 backend with SVE support.

ekalda · 2022-09-28T11:28:40Z

There is more context around where this is going in the meta-RFC :)

tqchen

Most comments are summarized in the follow-up conversations.

tqchen · 2022-10-02T13:30:24Z

rfcs/0094-aarch64-backend-with-sve.md

+With SVE enabled, this TIR would further be lowered to LLVM:


Based on this description, the proposed approach seems to be:

1. Pattern match a fixed vectorization (lane = 5).
2. Raise it back to an SVE pattern (with vscale and lane != 5).
3. Run codegen.

One concern is that the code can be simplified under the assumption (lane = 5) during the lowering phase, but that simplification does not hold in the general case.

Edit: After thinking a bit more, I now think the above concern can be addressed by specifying a strict set of raising rules, so feel free to ignore this.

tqchen · 2022-10-02T14:26:41Z

Thanks @ekalda. It is great to see us having conversations about bringing in SVE. The main question we want to resolve is likely what the TIR spec that goes into codegen, and that carries the SVE info, should look like.

Three alternatives have been discussed so far:

A0: Loop with annotation but body as scalar

  for (i: int32, 0, 20; i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  }

A1: Vectorized loop with constant vector factor

  for (i: int32, 0, 20; i) {
    C_2[ramp(i, 0, 5)] = A_2[ramp(i, 0, 5)] + B_2[ramp(i, 0, 5)];
  }

A2: Vectorized loop with some form of TIR repr for sve vector

  for (i: int32, 0, 20; i) {
    C_2[ramp(i, 0, vscale)] = A_2[ramp(i, 0, vscale)] + B_2[ramp(i, 0, vscale)];
  }

This would involve updates to the ramp node in TIR. See the kScalableVectorLaneMark comment in the previous discussion.
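To make the difference between a fixed vector factor and a scalable one concrete, here is a minimal plain-Python simulation of the element-wise add used in the examples above. It is only a sketch: `add_fixed_lanes` and `add_vla` are hypothetical names, the lists stand in for the buffers, and the per-lane predicate is an assumption modelled on how SVE's `whilelt` masks the loop tail. It is not TVM or LLVM code.

```python
def add_fixed_lanes(a, b, lanes=5):
    """Fixed vector factor: the element count must be a multiple of `lanes`."""
    n = len(a)
    assert n % lanes == 0, "fixed-lane form needs an exact multiple"
    c = [0] * n
    for i in range(0, n, lanes):
        for lane in range(lanes):  # stands in for one vector operation
            c[i + lane] = a[i + lane] + b[i + lane]
    return c


def add_vla(a, b, vl):
    """Vector-length-agnostic form: `vl` is only known at run time."""
    n = len(a)
    c = [0] * n
    for i in range(0, n, vl):
        # Per-lane predicate, true while the element index is in bounds
        # (modelled on SVE `whilelt`); inactive lanes do nothing.
        pred = [i + lane < n for lane in range(vl)]
        for lane in range(vl):
            if pred[lane]:
                c[i + lane] = a[i + lane] + b[i + lane]
    return c
```

Both functions compute the same result for 20 elements, whatever value of `vl` the hardware happens to provide; only the fixed-lane version depends on 20 being divisible by 5.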

Discussion

The three perspectives above are meant to set the stage for discussion. We can compare A0, A1 and A2 here to set up context for follow-ups, and that comparison does not need to block this RFC.

This RFC proposes A1, because it is a change to codegen only and does not change TIR. If A1 can be implemented robustly, then I think it is a positive step (close to the S0 type of change we had in other conversations), even if we want to do things in several stages (with follow-up S1 changes).

The main question for discussion is how we can implement A1 robustly.

Turning specialized code into general code is a bit like raising (from a special case to the general one), so it would be good to add a high-level description of the pattern match and conversion rules. For some background, I initially thought there might be traps when the code contains specializations to the lane count, but after thinking a bit more I find that my initial counterexample is actually fine under A1. So I am more convinced of this approach.

It would be good to add some clarification along the following lines:

We would only turn on SVE specialization if the code satisfies the following pattern:

- Pattern match all ramped loads/stores A[ramp(iter*lanes, 0, lanes)] to ensure they have the same lanes, then change the lanes to VL with predication.
- Change the outer loop iter to a vector loop.
- If there is a vector load/store that does not satisfy the pattern, we abort.
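The rules listed above can be sketched over a toy IR. Everything in this snippet is hypothetical and for illustration only: `Ramp`, `Access`, `Loop`, `ScalableLoop` and `raise_to_sve` are made-up names, not TVM classes, and the toy `Ramp` records only the loop variable, the scale applied to it in the base, and the lane count.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Ramp:
    """Toy index whose base is `loop_var * base_scale`, covering `lanes` elements."""
    loop_var: str
    base_scale: int
    lanes: int


@dataclass(frozen=True)
class Access:
    kind: str  # "load" or "store"
    buffer: str
    index: Ramp


@dataclass(frozen=True)
class Loop:
    var: str
    extent: int
    body: List[Access]


@dataclass(frozen=True)
class ScalableLoop:
    var: str
    total_elements: int  # elements covered; the step is the runtime VL
    body: List[Access]   # accesses are read as VL-wide and predicated


def raise_to_sve(loop: Loop) -> Optional[ScalableLoop]:
    """Return a scalable loop if every access fits the pattern, else None."""
    if not loop.body:
        return None
    lanes = loop.body[0].index.lanes
    for acc in loop.body:
        idx = acc.index
        # Rule 1: the base must be iter * lanes and all lane counts must agree.
        if idx.loop_var != loop.var or idx.base_scale != lanes or idx.lanes != lanes:
            return None  # Rule 3: abort on the first access that does not match.
    # Rule 2: the outer loop becomes a vector loop over all the elements.
    return ScalableLoop(loop.var, loop.extent * lanes, loop.body)
```

Returning `None` rather than attempting a partial rewrite keeps the fallback simple: any loop that does not match is left to the existing fixed-width codegen path.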

ekalda · 2022-10-12T10:06:46Z

Thanks for your input and suggestions @tqchen, much appreciated! I added a paragraph about pattern matching TIR, see if it makes sense.

Yes, this RFC proposes the A1 change. An A2-style TIR intrinsic is planned further down the line. It would let us expose SVE capabilities to the core compiler, so we could explore a larger space of optimisations. The decision to enable it initially only at the TIR->LLVM boundary came from the realisation that we can generate perfectly valid SVE just by looking at the TIR, without having to modify it.

I have spent some time playing around with the current LLVM codegen and I think you make a very good point about robustness. I have been looking at simple vectorized loads and stores (simple meaning here that the stride is 1 and that the index expression is a Ramp node, not a complex non-linear calculation with Ramp as a leaf node). The main challenge I currently see is that while the index itself is 1D at the point of code generation, the loop nest isn't necessarily, so I have to work out from the base of the Ramp node which loop bound needs changing. It seems to me that we have to do some sort of analysis pass just before the codegen to collect that info. It would have been nice to generate the SVE LLVM directly "as we go" during the LLVM codegen, but it seems that we generate LLVM with the loop bounds fixed before we visit the loop body (so before we discover the Ramp nodes), and we can't change the bound afterwards. I think an analysis pass would help with robustness, since we can gather as much information from the TIR graph as we need.

I haven't worked a lot with LLVM backends, so I am interested in hearing any thoughts/suggestions.
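As a rough illustration of the kind of pre-codegen analysis pass described here, the following sketch walks a toy loop nest and records, for each Ramp, which enclosing loop's bound would need to change. All names (`For`, `RampAccess`, `collect_vector_loops`) are hypothetical, and the assumption that the Ramp base refers to exactly one loop variable is a simplification; this is not the TVM visitor API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class RampAccess:
    buffer: str
    base_var: str  # loop variable that appears in the Ramp base
    lanes: int


@dataclass
class For:
    var: str
    extent: int
    body: List[Union["For", RampAccess]] = field(default_factory=list)


def collect_vector_loops(root: For) -> Dict[str, int]:
    """Map each loop variable whose bound must change to its ramps' lane count."""
    found: Dict[str, int] = {}

    def visit(node, enclosing):
        if isinstance(node, For):
            # Remember which loops enclose the body before descending.
            for child in node.body:
                visit(child, enclosing | {node.var})
            return
        if node.base_var not in enclosing:
            raise ValueError(f"ramp base uses unknown loop var {node.base_var!r}")
        previous = found.setdefault(node.base_var, node.lanes)
        if previous != node.lanes:
            raise ValueError("conflicting lane counts for one loop")

    visit(root, frozenset())
    return found
```

Because the pass runs before any LLVM is emitted, the codegen could consult the returned mapping when it reaches each loop and choose the scalable bound up front, instead of trying to patch a bound that has already been generated.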

tqchen · 2022-10-12T13:15:52Z

Thanks @ekalda, I don't have further comments at this point.

leandron · 2022-10-14T14:44:11Z

Thanks @tqchen @ekalda. This has been up for a few days with no new questions, so I'm merging it and we'll continue with the work towards what's described in the RFC.

[RFC] CodeGenAArch64 backend with Scalable Vector Extension (SVE)

5f9cc83

This RFC is to add CodeGenAArch64 backend with SVE.

tqchen reviewed Oct 2, 2022


Add a paragraph about matching TIR

35eb1f5

tqchen approved these changes Oct 12, 2022


leandron merged commit 04b9909 into apache:main Oct 14, 2022
