Cranelift x64 SIMD: some special-cases to make i64x2 packing faster. #8253
Merged
fitzgen merged 1 commit into bytecodealliance:main on Mar 28, 2024
Conversation
Sometimes, when in the course of silly optimizations to make the most of one's registers, one might want to pack two `i64`s into one `v128`, and one might want to do it without any loads or stores.

In clang targeting Wasm at least, building an `i64x2` (with `wasm_i64x2_make(a, b)` from `<wasm_simd128.h>`) will generate an `i64x2.splat` to create a new v128 with lane 0's value in both lanes, then an `i64x2.replace_lane` to put lane 1's value in place. Or, in the case that one of the lanes is zero, it will generate a `v128.const 0` then insert the other lane. Cranelift's lowerings for both of these patterns on x64 are slightly less optimal than they could be.

- For the former (replace-lane of splat), the 64-bit value is moved over to the XMM register, then the rest of the `splat` semantics are implemented by a `pshufd` (shuffle), even though we're just about to overwrite the only other lane. We can omit that shuffle, and everything still works. This optimization is specific to `i64x2` (that is, only two lanes): we need to know that the only other lane the `splat` broadcasts into is overwritten. We could in theory match a chain of replace-lane operators for higher-lane-count types, but let's save that for the case that we actually need it later.
- For the latter (replace-lane of constant zero), the load of a constant zero from the constant pool is the part that bothers me most. While I like zeroed memory as much as the next person, there is a vector XOR instruction *right there* under our noses, and we'd be silly not to use it. This applies to any `vconst 0`, not just ones that occur as a source to replace-lane.