
Vector128.Create() codegen: ReadOnlySpan<byte> vs. byte[] overloads #71885

@am11

While modernizing some code in HexConverter to use the C# 11 UTF-8 string literal (""u8) syntax, I changed this:

Vector128<byte> asciiTable = (casing == Casing.Upper) ?
    Vector128.Create((byte)'0', (byte)'1', (byte)'2', (byte)'3',
                     (byte)'4', (byte)'5', (byte)'6', (byte)'7',
                     (byte)'8', (byte)'9', (byte)'A', (byte)'B',
                     (byte)'C', (byte)'D', (byte)'E', (byte)'F') :
    Vector128.Create((byte)'0', (byte)'1', (byte)'2', (byte)'3',
                     (byte)'4', (byte)'5', (byte)'6', (byte)'7',
                     (byte)'8', (byte)'9', (byte)'a', (byte)'b',
                     (byte)'c', (byte)'d', (byte)'e', (byte)'f');

to become:

Vector128<byte> asciiTable = Vector128.Create((casing == Casing.Upper) ?
    "0123456789ABCDEF"u8 : "0123456789abcdef"u8);

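I noticed a significant regression in codegen: the element-wise version compiles to a single load of a 16-byte constant, while the span version sets up arguments and jumps to a non-inlined Vector128.Create that performs a length check before loading. The culprit appears to be Vector128.Create's ReadOnlySpan<T> overload.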

Repro

# bash on linux-x64

$ dotnet7 --version
7.0.100-preview.7.22358.13

$ dotnet7 new classlib -n VectorCreation
$ cat > VectorCreation/Class1.cs << EOF
using System.Runtime.Intrinsics;
public class D
{
    public Vector128<byte> CreateWithROSpan() => Vector128.Create("0123456789ABCDEF"u8);

    public Vector128<byte> CreateWithByteArray() => Vector128.Create(
                                 (byte)'0', (byte)'1', (byte)'2', (byte)'3',
                                 (byte)'4', (byte)'5', (byte)'6', (byte)'7',
                                 (byte)'8', (byte)'9', (byte)'A', (byte)'B',
                                 (byte)'C', (byte)'D', (byte)'E', (byte)'F');
}
EOF

$ dotnet7 publish VectorCreation -c Release -o dist --use-current-runtime -p:PublishAot=true -p:LangVersion=preview

$ gdb dist/VectorCreation.so -batch \
    -ex 'set disassembly-flavor intel' \
    -ex 'disassemble VectorCreation_D__CreateWithByteArray' \
    -ex 'disassemble VectorCreation_D__CreateWithROSpan' \
    -ex 'disassemble S_P_CoreLib_System_Runtime_Intrinsics_Vector128__Create_14<UInt8>'

Codegen

Dump of assembler code for function VectorCreation_D__CreateWithByteArray:
   0x0000000000211e10 <+0>:	movups xmm0,XMMWORD PTR [rip+0x111689]        # 0x3234a0
   0x0000000000211e17 <+7>:	movups XMMWORD PTR [rsi],xmm0
   0x0000000000211e1a <+10>:	mov    rax,rsi
   0x0000000000211e1d <+13>:	ret    
End of assembler dump.

vs.

Dump of assembler code for function VectorCreation_D__CreateWithROSpan:
   0x0000000000211df0 <+0>:	mov    rdi,rsi
   0x0000000000211df3 <+3>:	lea    rsi,[rip+0x10f33e]        # 0x321138
   0x0000000000211dfa <+10>:	mov    edx,0x10
   0x0000000000211dff <+15>:	jmp    0x273a50 <S_P_CoreLib_System_Runtime_Intrinsics_Vector128__Create_14<UInt8>>
End of assembler dump.
Dump of assembler code for function S_P_CoreLib_System_Runtime_Intrinsics_Vector128__Create_14<UInt8>:
   0x0000000000273a50 <+0>:	push   rbp
   0x0000000000273a51 <+1>:	mov    rbp,rsp
   0x0000000000273a54 <+4>:	cmp    edx,0x10
   0x0000000000273a57 <+7>:	jl     0x273a64 <S_P_CoreLib_System_Runtime_Intrinsics_Vector128__Create_14<UInt8>+20>
   0x0000000000273a59 <+9>:	movups xmm0,XMMWORD PTR [rsi]
   0x0000000000273a5c <+12>:	movups XMMWORD PTR [rdi],xmm0
   0x0000000000273a5f <+15>:	mov    rax,rdi
   0x0000000000273a62 <+18>:	pop    rbp
   0x0000000000273a63 <+19>:	ret    
   0x0000000000273a64 <+20>:	mov    edi,0x6
   0x0000000000273a69 <+25>:	call   0x17ff40 <S_P_CoreLib_System_ThrowHelper__ThrowArgumentOutOfRangeException_0>
   0x0000000000273a6e <+30>:	int3   
End of assembler dump.
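
For readers who prefer C# to assembly, here is a rough sketch of what the Create_14<UInt8> dump above appears to do: a length check followed by one unaligned 16-byte load. SpanVectorSketch and CreateFromSpan are hypothetical names used only for illustration. This is not the CoreLib source, and the AggressiveInlining hint is an assumption about what might let the check fold away for a constant-length span, not something verified here.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class SpanVectorSketch
{
    // Hypothetical helper, not the CoreLib implementation: it mirrors the
    // steps visible in the disassembly of Create_14<UInt8>.
    [MethodImpl(MethodImplOptions.AggressiveInlining)] // assumption, see lead-in
    public static Vector128<byte> CreateFromSpan(ReadOnlySpan<byte> values)
    {
        // Corresponds to "cmp edx,0x10 / jl": reject spans shorter than 16 bytes.
        if (values.Length < Vector128<byte>.Count)
        {
            throw new ArgumentOutOfRangeException(nameof(values));
        }

        // Corresponds to "movups xmm0,[rsi]": one unaligned 16-byte load
        // starting at the first element of the span.
        return Unsafe.ReadUnaligned<Vector128<byte>>(
            ref MemoryMarshal.GetReference(values));
    }
}
```

Calling SpanVectorSketch.CreateFromSpan("0123456789ABCDEF"u8) should yield the same vector as the 16-argument Vector128.Create call in the repro. Whether the JIT or ILC then reduces it to the same single constant load is exactly the question this issue raises.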

Side note: with -p:IlcInstructionSet=avx2, additional vzeroupper instructions are emitted (in CreateWithByteArray as well), which seems to be related to #11496.

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
