riscv64: Implementing Fixed Width SIMD with the V Vector extension #6118

@afonso360

👋 Hey,

I've been thinking about how to implement SIMD operations on the RISC-V backend. These are sort of my notes / thoughts about this. Hopefully I haven't missed something too big that invalidates all of this.

Feedback would be appreciated!

RISC-V Vector Intro

This is a small introduction to the Vector Extensions in case people aren't familiar with it.

  • In RISC-V we have 32 vector registers.
  • These registers have some uarch specific size.
    • This can be queried at run time
  • Each vector register has a minimum size of 32 bits (the maximum is 65536 bits 👀).
    • The minimum for Application Processors is 128 bits
  • It can process elements of size 8, 16, 32 and 64 bits
    • Element size is part of the architectural state, not explicit in each instruction
    • So we have a single generic add instruction that acts as add.i32 or add.i64 depending on how the hardware is currently configured
  • If necessary we can group multiple registers to get a larger register
    • If we have 32 x 128bit registers, we can use them as 16 x 256bit registers instead
    • This is also part of architectural state and not defined in each instruction
    • We probably won't use this, but it's a cool trick
  • Hardware is configured using the vset{i}vl{i} instruction
    • I mention it here because I use that instruction name a few times in the rest of the document
  • Masking is supported
    • We have one mask register (v0)
    • The mask register has double use as both a regular vector register and a mask register
    • I don't think we will use masking anywhere in this proposal
      • Maybe for some weird SIMD instruction?

I like this blog post that explains how this all works in a much better way.
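To make the configuration state concrete, here is a small Rust sketch (all names are mine, not Cranelift code) of how the vtype CSR value is assembled, based on my reading of the bit layout in the V spec (v1.0):

```rust
// Sketch of the vtype CSR bit layout from the V spec (v1.0):
//   bits [2:0] = vlmul (register grouping)
//   bits [5:3] = vsew  (element width)
//   bit  6     = vta   (tail agnostic)
//   bit  7     = vma   (mask agnostic)
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Sew { E8, E16, E32, E64 }

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Lmul { M1, M2, M4, M8 }

fn encode_vtype(sew: Sew, lmul: Lmul, ta: bool, ma: bool) -> u32 {
    let vsew = match sew { Sew::E8 => 0, Sew::E16 => 1, Sew::E32 => 2, Sew::E64 => 3 };
    let vlmul = match lmul { Lmul::M1 => 0, Lmul::M2 => 1, Lmul::M4 => 2, Lmul::M8 => 3 };
    (ma as u32) << 7 | (ta as u32) << 6 | (vsew << 3) | vlmul
}

fn main() {
    // The `e32, m1, ta, ma` configuration used throughout this proposal:
    println!("{:#010b}", encode_vtype(Sew::E32, Lmul::M1, true, true));
}
```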

1. Planned implementation

With some careful orchestration we can operate the Vector hardware for fixed width SIMD operations.

The general idea is that we can emulate an iadd.i32x4 by emitting the following code:

vsetivli zero, 4, e32, m1, ta, ma
vadd.vv v0, v1, v2

Here's an explanation of that:

vsetivli ;; This instruction configures the vector hardware
    zero, ;; Ignore the number of processed elements (we know we can process them all in one go)
    4,    ;; Process at most 4 elements
    e32,  ;; Each element is 32 bits
    m1,   ;; Do not group registers
    ta,   ;; Tail-agnostic mode (elements past the last processed element are left undefined)
    ma    ;; Mask-agnostic mode (similar to ta, but for masked-off elements)

vadd.vv ;; Vector add (vector-vector)
    v0, ;; Store the results in v0 (unlike x0, v0 is a usable vector register)
    v1, ;; LHS is the v1 register
    v2  ;; RHS is the v2 register

vsetivli writes the number of elements it will actually process to an output register, but since all application processors are guaranteed a minimum of 128 bits per vector register, we know that all elements will be processed by a single instruction and don't need to check that output. So we set the output to the zero register to ignore it. (There are some asterisks here; see Regalloc fun for small vectors implementations for more details!)
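As a sanity check on that guarantee: the spec defines VLMAX (the number of elements one instruction can process) as LMUL × VLEN / SEW. A tiny Rust sketch (hypothetical helper, not real code) shows why every 128-bit SIMD type fits in one go:

```rust
// VLMAX = LMUL * VLEN / SEW (from the V spec): the number of elements a
// single vector instruction can process under a given configuration.
fn vlmax(vlen_bits: u32, sew_bits: u32, lmul: u32) -> u32 {
    lmul * vlen_bits / sew_bits
}

fn main() {
    // With the 128-bit minimum for application processors (Zvl128b),
    // every fixed-width 128-bit SIMD type fits in one instruction at m1:
    assert!(vlmax(128, 32, 1) >= 4); // f32x4 / i32x4
    assert!(vlmax(128, 64, 1) >= 2); // i64x2
    // Wider hardware just leaves the tail elements agnostic:
    assert_eq!(vlmax(256, 32, 1), 8);
    println!("ok");
}
```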

We also only need to reconfigure the vector hardware when we change element-width or element-count. So this CLIF code:

v0 = fadd.f32x4 v1, v2
v3 = iadd.i32x4 v4, v5
v6 = iadd.i64x2 v7, v8

Could be lowered to:

vsetivli zero, 4, e32, m1, ta, ma ;; 4 elements of 32bit size
vfadd.vv v0, v1, v2
;; Here we don't actually need to change the hardware despite it being a different CLIF type!
vadd.vv v3, v4, v5
vsetivli zero, 2, e64, m1, ta, ma ;; 2 elements of 64bit size
vadd.vv v6, v7, v8

Switching vector modes is not done during instruction selection, but in a VCode pass that runs after instruction selection.

Each lowered vector instruction carries the full vector configuration that it needs, and in the VCode Pass we insert vsetvli's as necessary (i.e. between instructions with different vector configurations).

1.1 VCode Instructions

The first step is carrying the full vector configuration in each VCode instruction. Here's how I expect these instructions to look:

vfadd.vv v0, v1, v2 #avl=4 #vtype=(e32, m1, ta, ma)
vadd.vv  v3, v4, v5 #avl=4 #vtype=(e32, m1, ta, ma)
vadd.vv  v6, v7, v8 #avl=2 #vtype=(e64, m1, ta, ma)

I've lifted these names out of the ISA spec.

avl (Application Vector Length) is the maximum number of elements that we want to process. For this SIMD proposal it's always an immediate holding the number of lanes. However, for the purposes of VCode it can also be a register; this is required for interoperability with a future dynamic vector implementation.

vtype is the rest of the configuration data for the vector hardware.

There is additional state that I'm ignoring here:

  • vxrm: Vector fixed-point rounding mode register
  • vxsat: Vector fixed-point saturation flag register

Not sure if we need these, but we can handle them in the same manner as vtype, and insert their respective mode switching instructions in the same pass.

Additionally, each instruction has an optional mask register. When unmasked it does not show up in the assembly; when present it is handled as a normal register input to the instruction.
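A rough Rust sketch of what carrying this configuration could look like (hypothetical stand-in types; the real Cranelift VCode types will differ):

```rust
// Hypothetical stand-ins for VCode types; real Cranelift types will differ.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Avl {
    Imm(u8), // fixed-width SIMD: the lane count as an immediate
    Reg(u8), // future dynamic vectors: a register holding the element count
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct VType { sew: u16, lmul: u8, ta: bool, ma: bool }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct VecConfig { avl: Avl, vtype: VType }

// e.g. `vadd.vv v6, v7, v8 #avl=2 #vtype=(e64, m1, ta, ma)` would carry:
fn i64x2_config() -> VecConfig {
    VecConfig {
        avl: Avl::Imm(2),
        vtype: VType { sew: 64, lmul: 1, ta: true, ma: true },
    }
}

fn main() {
    // Two adjacent instructions need a vsetvli between them iff their
    // configurations differ:
    assert_ne!(i64x2_config().avl, Avl::Imm(4));
    println!("{:?}", i64x2_config());
}
```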

1.2 The VCode Pass

After Instruction Selection (but before Register Allocation!) we need to run a custom VCode pass.

This pass walks the VCode forwards and keeps track of the "current" vector configuration. Whenever an instruction requests a different one, we emit a vsetvli.

The reason this is done pre-regalloc is that for actual dynamic vectors, avl is probably not an immediate but a register holding the number of elements that we want to process, so the pass also needs to interact with regalloc. I don't expect to need that for SIMD yet, but this pass should work for both the current SIMD implementation and a future dynamic vectors implementation.

The current calling convention clobbers the vector configuration on all calls. So we also need to keep track of that and query the ABI layer.

A neat idea to further optimize this is to inherit the vector configuration when all blocks dominating the current block end in the same configuration. This avoids emitting a vsetvli in every basic block when the configuration never changes.

A downside of this pass is that we need to run it to get correct codegen, even if we never emit a vector instruction. I don't know the performance implications of this, but it's something to keep in mind.
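The pass itself can be modeled as a single forward walk. Here's a minimal Rust sketch (again with hypothetical types, and a simplified VecConfig; calls reset the tracked state because the calling convention clobbers vl/vtype):

```rust
// Minimal model of the pass: walk forwards, track the current configuration,
// and emit a vsetivli only when an instruction needs a different one.
#[derive(Clone, Copy, PartialEq, Eq)]
struct VecConfig { avl: u8, sew: u16 } // simplified: just lane count + width

#[allow(dead_code)]
enum Inst {
    Vector(VecConfig), // vector op carrying its required configuration
    Call,              // clobbers vl/vtype per the calling convention
    Scalar,            // doesn't care about the vector state
}

fn insert_vsetvli(insts: &[Inst]) -> Vec<String> {
    let mut out = Vec::new();
    let mut current: Option<VecConfig> = None;
    for inst in insts {
        match inst {
            Inst::Vector(cfg) => {
                if current != Some(*cfg) {
                    out.push(format!("vsetivli zero, {}, e{}", cfg.avl, cfg.sew));
                    current = Some(*cfg);
                }
                out.push("vector-op".into());
            }
            Inst::Call => {
                out.push("call".into());
                current = None; // callee may have changed vl/vtype
            }
            Inst::Scalar => out.push("scalar-op".into()),
        }
    }
    out
}

fn main() {
    // The fadd.f32x4 / iadd.i32x4 / iadd.i64x2 example from section 1:
    let insts = [
        Inst::Vector(VecConfig { avl: 4, sew: 32 }),
        Inst::Vector(VecConfig { avl: 4, sew: 32 }), // same config: no vsetivli
        Inst::Vector(VecConfig { avl: 2, sew: 64 }),
    ];
    let out = insert_vsetvli(&insts);
    // Only two mode switches are needed for the three instructions.
    assert_eq!(out.iter().filter(|s| s.starts_with("vsetivli")).count(), 2);
    println!("{:#?}", out);
}
```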

This approach is quite similar to what LLVM does, see Page 41 of this presentation for more details on that.

Some other ideas are in Alternative ways of emitting vsetvli below.

2. Additional considerations

2.1. Spills and Reloads

We can't do a dynamic sized spill/reload, which is going to be an issue for implementing full dynamic vectors. (See also the discussion here: bytecodealliance/rfcs#19 (comment))

But since that isn't implemented yet, and we don't use vector registers for anything else maybe we can do a fixed size 128b store/load for now?

This is definitely incompatible with a full dynamic vector implementation. But for that to work we'd need to save the full registers anyway, and with full-register saves the scheme above should still work.

2.2. Calling Convention for Vectors

All vector registers are caller-saved; vl and vtype are also caller-saved.

The standard states:

Vector registers are not used for passing arguments or return values; we intend to define a new calling convention variant to allow that as a future software optimization.

Clang does a dynamic stack store and seems to pass everything via the stack. This is the same problem as 2.1. Spills and Reloads.

2.3. Regalloc fun for small vectors implementations

(See §18.1. Zvl*: Minimum Vector Length Standard Extensions of the V Extension Spec)

The minimum vector register width is 32 bits. This means that in the worst case we need to group 4 registers to process a single 128-bit operation. (Register grouping is something RISC-V Vector hardware supports, but hopefully we won't have to use it.)

This affects regalloc since if we compile with a minimum vector register width of 32bits, we effectively only have 8 registers to work with.
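The arithmetic behind that claim, as a quick Rust sketch (hypothetical helper, assuming groups must be LMUL-aligned as the spec requires):

```rust
// With Zvl32b hardware, one 128-bit SIMD value needs LMUL = 128/32 = 4
// grouped registers, and register groups must start at register numbers
// divisible by LMUL, so the 32 architectural vector registers shrink to
// 32/4 = 8 usable groups.
fn usable_register_groups(vlen_bits: u32, simd_bits: u32) -> u32 {
    let lmul = simd_bits / vlen_bits; // registers per group
    32 / lmul
}

fn main() {
    assert_eq!(usable_register_groups(32, 128), 8);   // Zvl32b: only 8 groups
    assert_eq!(usable_register_groups(128, 128), 32); // Zvl128b: all 32 regs
    println!("ok");
}
```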

This is a problem because we have to share our regalloc space with the float registers since we don't have space for an additional register class (see: regalloc2/#47). This means that we need to have the same number of float registers as vector registers. (At least I'm not seeing any clever regalloc tricks that we can pull off here)

My solution for this is to ignore Zvl32b and Zvl64b for now.

Additionally §18.3. V: Vector Extension for Application Processors states:

The V vector extension requires Zvl128b.

So it seems like a reasonable expectation that RISC-V CPUs running Linux will have Zvl128b, and if this turns out to be untrue we can change regalloc to deal with it.

3. Alternative ways of emitting vsetvli

This applies to both the SIMD implementation and future Dynamic Vector implementations so we need to keep that in mind.

3.1 Keeping state during instruction selection

It would be neat if we could query the last set element width during instruction selection; that way we could minimize the number of emitted vsetvli instructions.

Instruction selection is done in backwards order, so this seems infeasible. (If anyone has any ideas here, let me know!)

3.2 Post-Lowering VCode Delete pass

I'm including this for completeness because it doesn't seem like a great option.

To avoid making the VCode pass mandatory we could emit a vsetvli + inst pair for every instruction that we lower. That way the initial lowering would be "correct", just not great.

After that we can have an optional forward VCode pass that removes redundant vsetvli instructions.

The advantage here is that this pass is optional! If running it is slower than emitting (almost) double the instructions, we can skip it when lowering unoptimized code.

I don't like this option very much.

3.3 Using the Register allocator somehow?

While writing this up I kept thinking that all of this seems awfully similar to a small register allocator, where vtype is an implicit register to each instruction. Can we somehow make this interact with regalloc?

We're out of regclasses anyway so I didn't consider this option for too long.
