Cranelift x64 SIMD: some special-cases to make i64x2 packing faster. #8253

Merged
fitzgen merged 1 commit into bytecodealliance:main from cfallin:faster-i64x2-vector-construction
Mar 28, 2024

@cfallin commented Mar 28, 2024

Sometimes, when in the course of silly optimizations to make the most of one's registers, one might want to pack two i64s into one v128, and one might want to do it without any loads or stores.

In clang targeting Wasm at least, building an i64x2 (with wasm_i64x2_make(a, b) from <wasm_simd128.h>) will generate an i64x2.splat to create a new v128 with lane 0's value in both lanes, then an i64x2.replace_lane to put lane 1's value in place. Or, in the case that one of the lanes is zero, it will generate a v128.const 0 and then insert the other lane.
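To make the two sequences concrete, here is a minimal sketch in portable C that models the lane semantics only. It does not use <wasm_simd128.h>; the type i64x2_model and the model_* helpers are hypothetical names invented for this illustration, and the zero-lane case assumes it is lane 1 that is zero.

```c
#include <stdint.h>

/* Hypothetical plain-C stand-in for a v128 viewed as two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } i64x2_model;

/* Models i64x2.splat: the same value lands in both lanes. */
static i64x2_model model_splat(uint64_t x) {
    i64x2_model v = { { x, x } };
    return v;
}

/* Models i64x2.replace_lane: copy the vector and overwrite lane idx (0 or 1). */
static i64x2_model model_replace_lane(i64x2_model v, int idx, uint64_t x) {
    v.lane[idx] = x;
    return v;
}

/* General case: splat lane 0's value, then replace lane 1. */
static i64x2_model model_make(uint64_t a, uint64_t b) {
    return model_replace_lane(model_splat(a), 1, b);
}

/* Zero-lane case (assuming lane 1 is the zero one):
 * start from an all-zero constant, then insert the other lane. */
static i64x2_model model_make_low_only(uint64_t a) {
    i64x2_model zero = { { 0, 0 } };
    return model_replace_lane(zero, 0, a);
}
```

In this model, model_make(a, b) ends up with a in lane 0 and b in lane 1, and the value the splat wrote into lane 1 is never observed.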

Cranelift's lowerings for both of these patterns on x64 are slightly less optimal than they could be.

  • For the former (replace-lane of splat), the 64-bit value is moved over to the XMM register, then the rest of the splat semantics are implemented by a pshufd (shuffle), even though we're just about to overwrite the only other lane. We can omit that shuffle, and everything works fine.

    This optimization is specific to i64x2 (that is, only two lanes): we need to know that the only other lane that the splat is splatting into is overwritten. We could in theory match a chain of replace-lane operators for higher-lane-count types, but let's save that for the case that we actually need it later.

  • For the latter (replace-lane of constant zero), the load of a constant zero from the constant pool is the part that bothers me most. While I like zeroed memory as much as the next person, there is a vector XOR instruction right there under our noses, and we'd be silly not to use it. This applies to any vconst 0, not just ones that occur as a source to replace-lane.
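Both points can be checked against a small lane-level model. The sketch below is plain C, not Cranelift code or real x64 instructions; xmm_model and every function name in it are hypothetical, and the upper parameter stands in for whatever lane 1 happens to hold after the move into the vector register.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register as two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } xmm_model;

/* Move a 64-bit value into lane 0; lane 1 holds some unrelated value `upper`. */
static xmm_model put_low(uint64_t x, uint64_t upper) {
    xmm_model v = { { x, upper } };
    return v;
}

/* The job the shuffle does for a splat: copy lane 0 into lane 1. */
static xmm_model broadcast_low(xmm_model v) {
    v.lane[1] = v.lane[0];
    return v;
}

/* Replace-lane on lane 1. */
static xmm_model insert_high(xmm_model v, uint64_t x) {
    v.lane[1] = x;
    return v;
}

/* Original shape: move, shuffle, insert. */
static xmm_model pack_with_shuffle(uint64_t a, uint64_t b, uint64_t upper) {
    return insert_high(broadcast_low(put_low(a, upper)), b);
}

/* Special-cased shape: move, insert. The shuffle's only output was lane 1,
 * which the insert overwrites, so the result is identical. */
static xmm_model pack_without_shuffle(uint64_t a, uint64_t b, uint64_t upper) {
    return insert_high(put_low(a, upper), b);
}

/* XOR of a register with itself is zero whatever it held before,
 * so no load from a constant pool is needed to produce a zero vector. */
static xmm_model zero_by_xor(uint64_t old_lo, uint64_t old_hi) {
    xmm_model v = { { old_lo, old_hi } };
    v.lane[0] ^= v.lane[0];
    v.lane[1] ^= v.lane[1];
    return v;
}
```

The model also shows why the first special case is tied to two lanes: with more lanes, a single insert would leave other lanes that still depend on the broadcast.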

@cfallin cfallin requested a review from a team as a code owner March 28, 2024 05:21
@cfallin cfallin requested review from abrown and removed request for a team March 28, 2024 05:21
@github-actions github-actions bot added the cranelift (Issues related to the Cranelift code generator) and cranelift:area:x64 (Issues related to x64 codegen) labels Mar 28, 2024

@fitzgen fitzgen left a comment

LGTM!

@fitzgen fitzgen added this pull request to the merge queue Mar 28, 2024
Merged via the queue into bytecodealliance:main with commit 9c92881 Mar 28, 2024