Closed
Labels: JitUntriaged (CLR JIT issues needing additional triage), area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization
Found while looking at codegen for #32994. From SharpLab:
public unsafe static void WidenAndWrite_Foo(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);
    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);
    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)(pOutput + offset + 0x08), wideHigh);
}

public unsafe static void WidenAndWrite_Bar(byte* pInput, char* pOutput, ulong offset)
{
    Vector128<byte> zero = Vector128<byte>.Zero;
    Vector128<byte> narrow = Sse2.LoadVector128(pInput + offset);
    Vector128<byte> wideLow = Sse2.UnpackLow(narrow, zero);
    Vector128<byte> wideHigh = Sse2.UnpackHigh(narrow, zero);
    Sse2.Store((byte*)(pOutput + offset), wideLow);
    Sse2.Store((byte*)pOutput + (offset << 1) + 0x10, wideHigh);
}

WidenAndWrite_Foo(Byte*, Char*, UInt64)
L0000: vzeroupper
L0003: vxorps xmm0, xmm0, xmm0
L0007: vmovdqu xmm1, [rcx+r8]
L000d: vpunpcklbw xmm2, xmm1, xmm0
L0011: vpunpckhbw xmm0, xmm1, xmm0
L0015: lea rax, [rdx+r8*2]
L0019: vmovdqu [rax], xmm2
L001d: vmovdqu [rax+0x10], xmm0
L0022: ret
WidenAndWrite_Bar(Byte*, Char*, UInt64)
L0000: vzeroupper
L0003: vxorps xmm0, xmm0, xmm0
L0007: vmovdqu xmm1, [rcx+r8]
L000d: vpunpcklbw xmm2, xmm1, xmm0
L0011: vpunpckhbw xmm0, xmm1, xmm0
L0015: vmovdqu [rdx+r8*2], xmm2
L001b: vmovdqu [rdx+r8*2+0x10], xmm0
L0022: ret

Is there any reason why the first sample uses a lea to compute the destination address, while the second sample forgoes the lea and folds the address computation directly into the vmovdqu instructions? As far as I can tell, both take about the same time to execute, but the first sample burns a register (rax) unnecessarily.
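For what it's worth, the two high-half store expressions should resolve to the same byte address, which matches the identical +0x10 displacement in both listings. Here is a minimal sketch of that arithmetic, using plain ulong values in place of real pointers. The AddressCheck class, its method names, and the 0x1000 base address are hypothetical and only for illustration; nothing is dereferenced.

```csharp
using System;

static class AddressCheck
{
    // Models `(byte*)(pOutput + offset + 0x08)` from WidenAndWrite_Foo:
    // arithmetic on a char* is scaled by sizeof(char), which is 2.
    public static ulong FooHighAddress(ulong pOutput, ulong offset)
        => pOutput + (offset + 0x08) * sizeof(char);

    // Models `(byte*)pOutput + (offset << 1) + 0x10` from WidenAndWrite_Bar:
    // arithmetic on a byte* is unscaled, so the scaling is written out by hand.
    public static ulong BarHighAddress(ulong pOutput, ulong offset)
        => pOutput + (offset << 1) + 0x10;

    public static void Main()
    {
        ulong pOutput = 0x1000; // hypothetical base address, never dereferenced

        foreach (ulong offset in new ulong[] { 0, 1, 7, 0x100 })
        {
            ulong foo = FooHighAddress(pOutput, offset);
            ulong bar = BarHighAddress(pOutput, offset);
            Console.WriteLine($"offset={offset}: foo=0x{foo:X}, bar=0x{bar:X}, equal={foo == bar}");
        }
    }
}
```

If the two forms are equivalent as modeled here, the difference in codegen would come from how the JIT sees the expression shape rather than from what the address is.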
Related: #10923
category:cq
theme:addressing-modes
skill-level:expert
cost:medium