i64x2.gt_u, i64x2.lt_u, i64x2.ge_u, and i64x2.le_u instructions #414
Introduction
This is a proposal to add 64-bit variants of the existing gt_u, lt_u, ge_u, and le_u instructions. ARM64 and x86-64 with the XOP extension natively support these comparisons, but on other instruction sets they need to be emulated. On SSE4.2 the emulation costs 5-6 instructions, but on older SSE extensions and on ARMv7 NEON the emulation cost is more significant.

Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
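For reference, the intended lane-wise semantics of the lowerings below can be sketched in scalar C. This is an illustrative sketch, not text from the proposal: the struct and function names are hypothetical, and the all-ones/all-zeros lane mask follows the convention of the existing SIMD comparison instructions.

#include <stdint.h>

/* Illustrative scalar model: each 64-bit lane becomes all ones when the
 * unsigned predicate holds and all zeros otherwise. Names are hypothetical. */
typedef struct { uint64_t lane[2]; } i64x2_t;

static i64x2_t i64x2_gt_u(i64x2_t a, i64x2_t b) {
    i64x2_t y;
    for (int i = 0; i < 2; i++)
        y.lane[i] = (a.lane[i] > b.lane[i]) ? UINT64_MAX : 0;
    return y;
}

/* i64x2.lt_u, i64x2.ge_u, and i64x2.le_u are identical except for the
 * comparison operator (<, >=, <=). */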
x86/x86-64 processors with AVX512F, AVX512DQ, and AVX512VL instruction sets
y = i64x2.gt_u(a, b) is lowered to:
  VPCMPUQ k_tmp, xmm_a, xmm_b, 6
  VPMOVM2Q xmm_y, k_tmp

y = i64x2.lt_u(a, b) is lowered to:
  VPCMPUQ k_tmp, xmm_a, xmm_b, 1
  VPMOVM2Q xmm_y, k_tmp

y = i64x2.ge_u(a, b) is lowered to:
  VPCMPUQ k_tmp, xmm_a, xmm_b, 5
  VPMOVM2Q xmm_y, k_tmp

y = i64x2.le_u(a, b) is lowered to:
  VPCMPUQ k_tmp, xmm_a, xmm_b, 2
  VPMOVM2Q xmm_y, k_tmp
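A minimal intrinsics sketch of the same lowering, assuming a compiler targeting AVX512F, AVX512DQ, and AVX512VL (e.g. -mavx512f -mavx512dq -mavx512vl); the wrapper name is illustrative and not part of the proposal.

#include <immintrin.h>

/* i64x2.gt_u via VPCMPUQ (predicate 6 = not-less-or-equal, i.e. unsigned
 * greater-than) followed by VPMOVM2Q to expand the mask register into a
 * per-lane all-ones/all-zeros vector. */
static __m128i i64x2_gt_u_avx512(__m128i a, __m128i b) {
    __mmask8 k = _mm_cmpgt_epu64_mask(a, b);
    return _mm_movm_epi64(k);
}

/* lt_u, ge_u, and le_u use _mm_cmplt_epu64_mask, _mm_cmpge_epu64_mask,
 * and _mm_cmple_epu64_mask with the same _mm_movm_epi64 step. */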
x86/x86-64 processors with XOP instruction set

y = i64x2.gt_u(a, b) is lowered to VPCOMGTUQ xmm_y, xmm_a, xmm_b
y = i64x2.lt_u(a, b) is lowered to VPCOMLTUQ xmm_y, xmm_a, xmm_b
y = i64x2.ge_u(a, b) is lowered to VPCOMGEUQ xmm_y, xmm_a, xmm_b
y = i64x2.le_u(a, b) is lowered to VPCOMLEUQ xmm_y, xmm_a, xmm_b
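An intrinsics-level sketch of the XOP lowering. The _mm_comgt_epu64 family of names below follows the xopintrin.h convention shipped with GCC/Clang (compile with -mxop); the naming is an assumption about the toolchain, not part of the proposal.

#include <x86intrin.h>

/* i64x2.gt_u via the XOP VPCOMGTUQ instruction. */
static __m128i i64x2_gt_u_xop(__m128i a, __m128i b) {
    return _mm_comgt_epu64(a, b);
}

/* _mm_comlt_epu64, _mm_comge_epu64, and _mm_comle_epu64 cover the
 * remaining three comparisons. */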
x86/x86-64 processors with AVX instruction set

y = i64x2.gt_u(a, b) (y is not b) is lowered to:
  VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
  VPXOR xmm_y, xmm_a, xmm_tmp
  VPXOR xmm_tmp, xmm_b, xmm_tmp
  VPCMPGTQ xmm_y, xmm_y, xmm_tmp

y = i64x2.lt_u(a, b) (y is not b) is lowered to:
  VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
  VPXOR xmm_y, xmm_a, xmm_tmp
  VPXOR xmm_tmp, xmm_b, xmm_tmp
  VPCMPGTQ xmm_y, xmm_tmp, xmm_y

y = i64x2.ge_u(a, b) (y is not b) is lowered to:
  VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
  VPXOR xmm_y, xmm_a, xmm_tmp
  VPXOR xmm_tmp, xmm_b, xmm_tmp
  VPCMPGTQ xmm_y, xmm_tmp, xmm_y
  VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]

y = i64x2.le_u(a, b) (y is not b) is lowered to:
  VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
  VPXOR xmm_y, xmm_a, xmm_tmp
  VPXOR xmm_tmp, xmm_b, xmm_tmp
  VPCMPGTQ xmm_y, xmm_y, xmm_tmp
  VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
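The same sign-bias trick expressed with intrinsics; when AVX is enabled (e.g. -mavx) compilers typically emit the three-operand VEX forms shown above. The wrapper name is illustrative.

#include <stdint.h>
#include <immintrin.h>

/* i64x2.gt_u: flip the sign bit of both operands, then reuse the signed
 * 64-bit compare (VPCMPGTQ under AVX, PCMPGTQ under plain SSE4.2). */
static __m128i i64x2_gt_u_biased(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x(INT64_MIN); /* 0x8000000000000000 per lane */
    return _mm_cmpgt_epi64(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* i64x2.lt_u swaps the compare operands; ge_u and le_u additionally
 * invert the result by XOR-ing with all ones. */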
x86/x86-64 processors with SSE4.2 instruction set

y = i64x2.gt_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
  MOVDQA xmm_tmp, xmm_y
  PXOR xmm_y, xmm_a
  PXOR xmm_tmp, xmm_b
  PCMPGTQ xmm_y, xmm_tmp

y = i64x2.lt_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
  MOVDQA xmm_tmp, xmm_y
  PXOR xmm_y, xmm_b
  PXOR xmm_tmp, xmm_a
  PCMPGTQ xmm_y, xmm_tmp

y = i64x2.ge_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
  MOVDQA xmm_tmp, xmm_y
  PXOR xmm_y, xmm_b
  PXOR xmm_tmp, xmm_a
  PCMPGTQ xmm_y, xmm_tmp
  PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]

y = i64x2.le_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
  MOVDQA xmm_tmp, xmm_y
  PXOR xmm_y, xmm_a
  PXOR xmm_tmp, xmm_b
  PCMPGTQ xmm_y, xmm_tmp
  PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
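A sketch of the complemented forms on SSE4.2 (compile with -msse4.2), which is where the extra PXOR with all ones in the ge_u/le_u listings above comes from; the wrapper name is illustrative.

#include <stdint.h>
#include <immintrin.h>

/* i64x2.le_u on SSE4.2: compute gt_u via the sign-bias trick, then
 * complement the mask. */
static __m128i i64x2_le_u_sse42(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x(INT64_MIN);  /* 0x8000000000000000 per lane */
    __m128i gt = _mm_cmpgt_epi64(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
    return _mm_xor_si128(gt, _mm_set1_epi32(-1));     /* NOT(a > b) == (a <= b) */
}

/* i64x2.ge_u complements lt_u in the same way. */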
x86/x86-64 processors with SSE2 instruction set

Based on this answer by user aqrit on Stack Overflow.
y = i64x2.gt_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_tmp, xmm_b
  MOVDQA xmm_y, xmm_b
  PSUBQ xmm_tmp, xmm_a
  PXOR xmm_y, xmm_a
  PANDN xmm_y, xmm_tmp
  MOVDQA xmm_tmp, xmm_b
  PANDN xmm_tmp, xmm_a
  POR xmm_y, xmm_tmp
  PSRAD xmm_y, 31
  PSHUFD xmm_y, xmm_y, 0xF5

y = i64x2.lt_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_tmp, xmm_a
  MOVDQA xmm_y, xmm_a
  PSUBQ xmm_tmp, xmm_b
  PXOR xmm_y, xmm_b
  PANDN xmm_y, xmm_tmp
  MOVDQA xmm_tmp, xmm_a
  PANDN xmm_tmp, xmm_b
  POR xmm_y, xmm_tmp
  PSRAD xmm_y, 31
  PSHUFD xmm_y, xmm_y, 0xF5

y = i64x2.ge_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_tmp, xmm_a
  MOVDQA xmm_y, xmm_a
  PSUBQ xmm_tmp, xmm_b
  PXOR xmm_y, xmm_b
  PANDN xmm_y, xmm_tmp
  MOVDQA xmm_tmp, xmm_a
  PANDN xmm_tmp, xmm_b
  POR xmm_y, xmm_tmp
  PSRAD xmm_y, 31
  PSHUFD xmm_y, xmm_y, 0xF5
  PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]

y = i64x2.le_u(a, b) (y is not a and y is not b) is lowered to:
  MOVDQA xmm_tmp, xmm_b
  MOVDQA xmm_y, xmm_b
  PSUBQ xmm_tmp, xmm_a
  PXOR xmm_y, xmm_a
  PANDN xmm_y, xmm_tmp
  MOVDQA xmm_tmp, xmm_b
  PANDN xmm_tmp, xmm_a
  POR xmm_y, xmm_tmp
  PSRAD xmm_y, 31
  PSHUFD xmm_y, xmm_y, 0xF5
  PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
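The same SSE2 sequence written with intrinsics, following the construction from aqrit's Stack Overflow answer cited above; the wrapper name is illustrative.

#include <emmintrin.h>

/* i64x2.gt_u on plain SSE2: combine the borrow of (b - a), valid where
 * the sign bits of a and b agree, with a direct comparison (~b & a)
 * where they differ. The sign of the high 32-bit half of each lane then
 * equals (a > b) and is smeared across the whole lane. */
static __m128i i64x2_gt_u_sse2(__m128i a, __m128i b) {
    __m128i r = _mm_andnot_si128(_mm_xor_si128(b, a), _mm_sub_epi64(b, a));
    r = _mm_or_si128(r, _mm_andnot_si128(b, a));
    r = _mm_srai_epi32(r, 31);                             /* PSRAD, sign of each half */
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));  /* PSHUFD 0xF5, copy high halves */
}

/* lt_u swaps a and b; ge_u and le_u complement lt_u and gt_u. */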
ARM64 processors

y = i64x2.gt_u(a, b) is lowered to CMHI Vy.2D, Va.2D, Vb.2D
y = i64x2.lt_u(a, b) is lowered to CMHI Vy.2D, Vb.2D, Va.2D
y = i64x2.ge_u(a, b) is lowered to CMHS Vy.2D, Va.2D, Vb.2D
y = i64x2.le_u(a, b) is lowered to CMHS Vy.2D, Vb.2D, Va.2D
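With AArch64 NEON intrinsics the lowering is a single instruction per comparison; the wrapper name below is illustrative.

#include <arm_neon.h>

/* i64x2.gt_u maps directly to CMHI on AArch64. */
static uint64x2_t i64x2_gt_u_arm64(uint64x2_t a, uint64x2_t b) {
    return vcgtq_u64(a, b);   /* CMHI Vy.2D, Va.2D, Vb.2D */
}

/* vcltq_u64, vcgeq_u64, and vcleq_u64 cover lt_u, ge_u, and le_u
 * (CMHI/CMHS with swapped operands). */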
ARMv7 processors with NEON instruction set

y = i64x2.gt_u(a, b) is lowered to:
  VQSUB.U64 Qy, Qa, Qb
  VCEQ.I32 Qy, Qy, 0
  VREV64.32 Qtmp, Qy
  VAND Qy, Qy, Qtmp
  VMVN Qy, Qy

y = i64x2.lt_u(a, b) is lowered to:
  VQSUB.U64 Qy, Qb, Qa
  VCEQ.I32 Qy, Qy, 0
  VREV64.32 Qtmp, Qy
  VAND Qy, Qy, Qtmp
  VMVN Qy, Qy

y = i64x2.ge_u(a, b) is lowered to:
  VQSUB.U64 Qy, Qb, Qa
  VCEQ.I32 Qy, Qy, 0
  VREV64.32 Qtmp, Qy
  VAND Qy, Qy, Qtmp

y = i64x2.le_u(a, b) is lowered to:
  VQSUB.U64 Qy, Qa, Qb
  VCEQ.I32 Qy, Qy, 0
  VREV64.32 Qtmp, Qy
  VAND Qy, Qy, Qtmp
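A sketch of the ARMv7 emulation with NEON intrinsics, assuming a GCC/Clang ARMv7-A NEON target; the helper name is illustrative. It implements the ge_u pattern above, from which the other three comparisons are derived by swapping operands and/or complementing the mask.

#include <arm_neon.h>

/* i64x2.ge_u on ARMv7 NEON: a >= b (unsigned) iff the saturating
 * difference b - a is zero. The 64-bit zero test is done per 32-bit half
 * (VCEQ.I32) and combined across each lane (VREV64.32 + VAND). */
static uint64x2_t i64x2_ge_u_neon(uint64x2_t a, uint64x2_t b) {
    uint32x4_t d = vreinterpretq_u32_u64(vqsubq_u64(b, a));  /* VQSUB.U64   */
    uint32x4_t z = vceqq_u32(d, vdupq_n_u32(0));             /* VCEQ.I32 #0 */
    return vreinterpretq_u64_u32(vandq_u32(z, vrev64q_u32(z)));
}

/* i64x2.le_u swaps a and b; i64x2.lt_u and i64x2.gt_u complement ge_u and
 * le_u respectively, e.g. veorq_u64(mask, vdupq_n_u64(~0ULL)). */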