JIT: Some SIMD intrinsics operations don't reliably fold memory addressing logic

Found while looking at codegen for https://github.com/dotnet/runtime/pull/32994. From SharpLab:

```cs
public unsafe static void WidenAndWrite_Foo(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)(pOutput + offset + 0x08), wideHigh);
}

public unsafe static void WidenAndWrite_Bar(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);

    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);

    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)pOutput + (offset << 1) + 0x10, wideHigh);
}
```

```asm
WidenAndWrite_Foo(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: lea rax, [rdx+r8*2]
    L0019: vmovdqu [rax], xmm2
    L001d: vmovdqu [rax+0x10], xmm0
    L0022: ret

WidenAndWrite_Bar(Byte*, Char*, UInt64)
    L0000: vzeroupper
    L0003: vxorps xmm0, xmm0, xmm0
    L0007: vmovdqu xmm1, [rcx+r8]
    L000d: vpunpcklbw xmm2, xmm1, xmm0
    L0011: vpunpckhbw xmm0, xmm1, xmm0
    L0015: vmovdqu [rdx+r8*2], xmm2
    L001b: vmovdqu [rdx+r8*2+0x10], xmm0
    L0022: ret
```

Is there any reason why the first sample (`WidenAndWrite_Foo`) emits a `lea` to compute the destination address, while the second sample (`WidenAndWrite_Bar`) forgoes the `lea` and folds the address computation directly into the `vmovdqu` instructions? Both methods store to the same addresses, and as far as I can tell they take around the same amount of time to execute, but the first sample burns a register (`rax`) unnecessarily.
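For anyone who wants to double-check that the two source forms really do target the same address for the high store, here is a minimal sketch. The class and helper names (`AddressEquivalence`, `HighStoreAddress_Foo`, `HighStoreAddress_Bar`) are hypothetical and exist only for this illustration. It only computes and compares pointers, never dereferences them, and assumes the project is compiled with unsafe code enabled.

```cs
using System;

public static unsafe class AddressEquivalence
{
    // Hypothetical helper: the address expression used by the second store
    // in WidenAndWrite_Foo. char* arithmetic scales by sizeof(char) == 2,
    // so "+ offset + 0x08" advances by (offset + 8) * 2 bytes.
    public static byte* HighStoreAddress_Foo(char* pOutput, ulong offset)
        => (byte*)(pOutput + offset + 0x08);

    // Hypothetical helper: the address expression used by the second store
    // in WidenAndWrite_Bar. byte* arithmetic is unscaled, so the scaling
    // (offset << 1) and the 0x10 displacement are written out by hand.
    public static byte* HighStoreAddress_Bar(char* pOutput, ulong offset)
        => (byte*)pOutput + (offset << 1) + 0x10;

    public static void Main()
    {
        // Scratch buffer; large enough that every computed address stays in range.
        char* buffer = stackalloc char[64];

        for (ulong offset = 0; offset < 32; offset++)
        {
            byte* foo = HighStoreAddress_Foo(buffer, offset);
            byte* bar = HighStoreAddress_Bar(buffer, offset);

            if (foo != bar)
            {
                throw new InvalidOperationException($"Address mismatch at offset {offset}");
            }
        }

        Console.WriteLine("Both forms compute the same address for every tested offset.");
    }
}
```

Since both expressions reduce to `pOutput + offset * 2 + 0x10` in bytes, the difference in the generated code appears to come from how the JIT handles the expression shape rather than from any semantic difference between the two methods.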

Related: https://github.com/dotnet/runtime/issues/10923

category:cq
theme:addressing-modes
skill-level:expert
cost:medium
