RyuJIT SIMD: optimize codegen when Op(in)Equality that produces bool result is compared against 1 or 0 #6728

@sivarv

As per Intel's TechEmpower benchmark analysis, Kestrel.Internal.Infrastructure.MemoryPoolIterator.Seek() is one of the hot methods. It has the following code in the innermost loop of a doubly nested while loop.

var data = new Vector<byte>(array, index);

// The below code is repeated 3 times in the method
// against 3 different byte vectors
var byte0Equals = Vector.Equals(data, byte0Vector);
if (!byte0Equals.Equals(Vector<byte>.Zero))
{
    byte0Index = FindFirstEqualByte(ref byte0Equals);
}

The if-condition !byte0Equals.Equals(Vector<byte>.Zero) generates the following IR:

***** BB13, stmt 61
     ( 10,  9) [000371] ------------             *  stmtExpr  void  (IL 0x109...0x115)
N007 ( 10,  9) [000370] ----G-------             \--*  jmpTrue   void  
N005 (  1,  1) [000368] ------------                |  /--*  const     int    0 $40
N006 (  8,  7) [000369] J---G--N----                \--*  !=        int    $21d
N003 (  2,  2) [000366] ------------                   |  /--*  simd      simd32 int init $148
N002 (  1,  1) [000365] ------------                   |  |  \--*  const     int    0 $40
N004 (  6,  5) [000367] ----G-------                   \--*  simd      int    ubyte == $149
N001 (  3,  2) [000363] ----G-------                      \--*  lclVar    simd32(AX) V18 loc13         $349

The SIMD (in)equality node materializes a bool result in a register, and the parent != node then compares that result against 0. Here is the generated code:

// SIMD opEquality produces the below code
IN0058:        vmovaps  ymm2, ymm0
IN0059:        vpcmpeqd ymm2, ymm1
IN005a:        vextractf128 ymm3, ymm2, 1
IN005b:        vandps   ymm2, ymm3
IN005c:        vpshufd  ymm3, ymm2, 78
IN005d:        vandps   ymm2, ymm3
IN005e:        vpshufd  ymm3, ymm2, 1
IN005f:        vpand    ymm2, ymm3
IN0060:        vmovd    r14d, xmm2
IN0061:        cmp      r14d, 0xFFFFFFFF
IN0062:        sete     r14b
IN0063:        movzx    r14, r14b

// (SIMD opEquality != 0) produces the following code
test     r14d, r14d
jne      L_M48761_BB15

There is no need to materialize the result of SIMD opEquality in a register here; setting the flags is sufficient, which makes the != 0 comparison redundant. We should be able to produce the following code instead:

IN0058:        vmovaps  ymm2, ymm0
IN0059:        vpcmpeqd ymm2, ymm1
IN005a:        vextractf128 ymm3, ymm2, 1
IN005b:        vandps   ymm2, ymm3
IN005c:        vpshufd  ymm3, ymm2, 78
IN005d:        vandps   ymm2, ymm3
IN005e:        vpshufd  ymm3, ymm2, 1
IN005f:        vpand    ymm2, ymm3
IN0060:        vmovd    r14d, xmm2
IN0061:        cmp      r14d, 0xFFFFFFFF

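// Branch directly on the flags from cmp. The branch is taken when all lanes
// matched, which is the condition the sete/test/jne sequence above tested.
je       L_M48761_BB15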

Assuming we also optimize the codegen of SIMD opEquality against a zero vector on AVX, as per #6719, the resulting code would be:

// ptest sets ZF when (ymm0 AND ymm0) is all zero, i.e. when the vector equals zero
ptest    ymm0, ymm0
je       L_M48761_BB15
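
The three code shapes discussed above can be compared with a small scalar model. This is only a sketch in C, not JIT code: eight 32-bit lanes stand in for one ymm register, and the names LANES, reduce_equal_mask, branch_current, branch_flags_only, branch_ptest_zero and self_check are hypothetical helpers introduced for illustration. Each branch_* function returns whether the conditional jump would be taken, assuming the jump is taken when the two vectors are equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LANES 8 /* eight 32-bit lanes stand in for one 256-bit ymm register */

/* Models vpcmpeqd followed by the AND-reduction and vmovd:
   the result is 0xFFFFFFFF only when every lane of a equals b. */
uint32_t reduce_equal_mask(const uint32_t a[LANES], const uint32_t b[LANES]) {
    uint32_t r = 0xFFFFFFFFu;
    for (int i = 0; i < LANES; i++) {
        uint32_t lane = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
        r &= lane;
    }
    return r;
}

/* Current shape: cmp + sete + movzx materialize a bool, then test + jump. */
bool branch_current(const uint32_t a[LANES], const uint32_t b[LANES]) {
    uint32_t result = (reduce_equal_mask(a, b) == 0xFFFFFFFFu) ? 1u : 0u;
    return result != 0u;
}

/* Proposed shape: jump directly on the flags set by cmp. */
bool branch_flags_only(const uint32_t a[LANES], const uint32_t b[LANES]) {
    return reduce_equal_mask(a, b) == 0xFFFFFFFFu;
}

/* ptest v, v shape, usable only when the other operand is the zero vector:
   ZF is set when (v AND v) has no bits set. */
bool branch_ptest_zero(const uint32_t v[LANES]) {
    uint32_t any_bits = 0u;
    for (int i = 0; i < LANES; i++) {
        any_bits |= v[i];
    }
    return any_bits == 0u;
}

/* Checks that the three shapes agree on one all-zero and one non-zero input. */
void self_check(void) {
    const uint32_t zero[LANES] = {0};
    const uint32_t hit[LANES] = {0, 0, 0xFF00u, 0, 0, 0, 0, 0};
    assert(branch_current(zero, zero));
    assert(branch_flags_only(zero, zero));
    assert(branch_ptest_zero(zero));
    assert(!branch_current(hit, zero));
    assert(!branch_flags_only(hit, zero));
    assert(!branch_ptest_zero(hit));
}
```

Calling self_check() exercises both outcomes. The model only illustrates that the materialized bool carries no information beyond what the flags already hold; it says nothing about how the JIT would have to track flag liveness to apply the transformation.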

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization
