JIT: Some SIMD intrinsics operations don't reliably fold memory addressing logic #33002

@GrabYourPitchforks

Found while looking at codegen for #32994. From SharpLab:

public unsafe static void WidenAndWrite_Foo(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)(pOutput + offset + 0x08), wideHigh);
}

public unsafe static void WidenAndWrite_Bar(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)pOutput + (offset << 1) + 0x10, wideHigh);
}

WidenAndWrite_Foo(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: lea rax, [rdx+r8*2]
    L0019: vmovdqu [rax], xmm2
    L001d: vmovdqu [rax+0x10], xmm0
    L0022: ret

WidenAndWrite_Bar(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: vmovdqu [rdx+r8*2], xmm2
    L001b: vmovdqu [rdx+r8*2+0x10], xmm0
    L0022: ret

Is there any reason why the first sample uses a lea to compute the destination address, while the second sample forgoes the lea and folds the address computation directly into the vmovdqu instructions? As far as I can tell, the two take about the same amount of time to execute, but the first sample burns a register (rax) unnecessarily.
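
For anyone double-checking that the two samples really target the same memory, here is a minimal sketch that models the two high-half store addresses with plain integers instead of pointers, so it runs without unsafe code. The class and method names (AddressModel, FooHighStoreAddress, BarHighStoreAddress, AgreeOn) are hypothetical and exist only for this illustration; the only thing taken from the samples above is the shape of the two address expressions.

```csharp
using System;

public static class AddressModel
{
    // sizeof(char) is 2 in C#, so arithmetic on a char* scales by 2 bytes.
    private const ulong CharSize = sizeof(char);

    // Models (byte*)(pOutput + offset + 0x08) from WidenAndWrite_Foo:
    // the additions happen in char units, then the result is viewed as bytes.
    public static ulong FooHighStoreAddress(ulong pOutput, ulong offset)
        => unchecked(pOutput + (offset + 0x08) * CharSize);

    // Models (byte*)pOutput + (offset << 1) + 0x10 from WidenAndWrite_Bar:
    // the cast happens first, so every term is already in byte units.
    public static ulong BarHighStoreAddress(ulong pOutput, ulong offset)
        => unchecked(pOutput + (offset << 1) + 0x10);

    // True when both expressions resolve to the same byte address.
    public static bool AgreeOn(ulong pOutput, ulong offset)
        => FooHighStoreAddress(pOutput, offset) == BarHighStoreAddress(pOutput, offset);
}
```

Both expressions reduce to pOutput + 2 * offset + 0x10 in wrapping 64-bit arithmetic, which matches the [rdx+r8*2+0x10] operand in the second listing. So the difference between the two listings would seem to come down to how the address expression is presented to the JIT, not to what is being stored where.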

Related: #10923

category:cq
theme:addressing-modes
skill-level:expert
cost:medium

Labels: JitUntriaged (CLR JIT issues needing additional triage), area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization