Cranelift x64 SIMD: some special-cases to make i64x2 packing faster. #8253
Merged
fitzgen merged 1 commit into bytecodealliance:main on Mar 28, 2024
Conversation
Sometimes, when in the course of silly optimizations to make the most of one's registers, one might want to pack two `i64`s into one `v128`, and one might want to do it without any loads or stores.

In clang targeting Wasm at least, building an `i64x2` (with `wasm_i64x2_make(a, b)` from `<wasm_simd128.h>`) will generate an `i64x2.splat` to create a new v128 with lane 0's value in both lanes, then an `i64x2.replace_lane` to put lane 1's value in place. Or, in the case that one of the lanes is zero, it will generate a `v128.const 0` then insert the other lane. Cranelift's lowerings for both of these patterns on x64 are slightly less optimal than they could be.

- For the former (replace-lane of splat), the 64-bit value is moved over to the XMM register, then the rest of the `splat` semantics are implemented by a `pshufd` (shuffle), even though we're just about to overwrite the only other lane. We can omit that shuffle, and everything still works. This optimization is specific to `i64x2` (that is, only two lanes): we need to know that the only other lane the `splat` broadcasts into is overwritten. We could in theory match a chain of replace-lane operators for higher-lane-count types, but let's save that for the case that we actually need it later.
- For the latter (replace-lane of constant zero), the load of a constant zero from the constant pool is the part that bothers me most. While I like zeroed memory as much as the next person, there is a vector XOR instruction *right there* under our noses, and we'd be silly not to use it. This applies to any `vconst 0`, not just ones that occur as a source to replace-lane.