Suboptimal ASM emitted for Vector256<T>.Zero and Vector128<T>.Zero #76067

@israellot

When using Vector256<T>.Zero, I would expect the zero vector to be kept in a fixed register and reused. Instead, what I see is a vxorps instruction emitted at every use.

AVX2 exposes 16 YMM registers in 64-bit mode, so there is room to keep a zero vector in one of them.

Below is one example. I can get the desired behavior by forcing a zero vector into a local variable instead of using Vector256<T>.Zero at each use site.
Assigning Vector256<byte>.Zero to a variable alone does not do the trick; only the extra xor operation ensures it stays in a fixed register.

var byteVector = Vector256.LoadUnsafe<byte>(ref spanRef);
           
var low = Avx2.UnpackLow(byteVector, Vector256<byte>.Zero);
var high = Avx2.UnpackHigh(byteVector, Vector256<byte>.Zero);

var added = Avx2.Add(low.AsInt16(), high.AsInt16());

added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);

//ASM

mov      rax, bword ptr [rcx]
vmovdqu  ymm0, ymmword ptr[rax]
vxorps   ymm1, ymm1, ymm1
vpunpcklbw ymm1, ymm0, ymm1
vxorps   ymm2, ymm2, ymm2
vpunpckhbw ymm0, ymm0, ymm2
vpaddw   ymm0, ymm1, ymm0
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1

With the workaround:

var byteVector = Vector256.LoadUnsafe<byte>(ref spanRef);

var zero = Vector256<byte>.Zero;
zero = Avx2.Xor(zero, zero); //forces fixed register
            
var low = Avx2.UnpackLow(byteVector, zero);
var high = Avx2.UnpackHigh(byteVector, zero);

var added = Avx2.Add(low.AsInt16(), high.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());


//ASM
mov      rax, bword ptr [rcx]
vxorps   ymm0, ymm0, ymm0
vmovdqu  ymm1, ymmword ptr[rax]
vpunpcklbw ymm2, ymm1, ymm0
vpunpckhbw ymm1, ymm1, ymm0
vpaddw   ymm1, ymm2, ymm1
vphaddw  ymm1, ymm1, ymm0
vphaddw  ymm1, ymm1, ymm0
vphaddw  ymm1, ymm1, ymm0
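
For anyone who wants to compare the two code paths locally, here is a minimal self-contained sketch that wraps both snippets in methods. The class name ZeroRepro, the method names, and the final lane-combining step are my own illustration and are not part of the original report; the intrinsic calls are the same ones used above. It assumes .NET 7 or later and AVX2 hardware.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical harness: names are illustrative, not from the original report.
public static class ZeroRepro
{
    // Variant 1: Vector256<T>.Zero written at every use site.
    public static int SumWithInlineZero(ReadOnlySpan<byte> block)
    {
        Check(block);
        ref byte spanRef = ref MemoryMarshal.GetReference(block);
        var byteVector = Vector256.LoadUnsafe(ref spanRef);

        // Interleaving with zero widens each byte to a 16-bit value (per 128-bit lane).
        var low = Avx2.UnpackLow(byteVector, Vector256<byte>.Zero);
        var high = Avx2.UnpackHigh(byteVector, Vector256<byte>.Zero);

        var added = Avx2.Add(low.AsInt16(), high.AsInt16());
        added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
        added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
        added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);

        return CombineLanes(added);
    }

    // Variant 2: zero held in a local, with the extra xor from the workaround.
    public static int SumWithLocalZero(ReadOnlySpan<byte> block)
    {
        Check(block);
        ref byte spanRef = ref MemoryMarshal.GetReference(block);
        var byteVector = Vector256.LoadUnsafe(ref spanRef);

        var zero = Vector256<byte>.Zero;
        zero = Avx2.Xor(zero, zero); // the workaround described in the issue

        var low = Avx2.UnpackLow(byteVector, zero);
        var high = Avx2.UnpackHigh(byteVector, zero);

        var added = Avx2.Add(low.AsInt16(), high.AsInt16());
        added = Avx2.HorizontalAdd(added, zero.AsInt16());
        added = Avx2.HorizontalAdd(added, zero.AsInt16());
        added = Avx2.HorizontalAdd(added, zero.AsInt16());

        return CombineLanes(added);
    }

    // After three horizontal adds, element 0 of each 128-bit lane holds that
    // lane's byte sum: element 0 for the lower lane, element 8 for the upper.
    private static int CombineLanes(Vector256<short> added)
        => added.GetElement(0) + added.GetElement(8);

    private static void Check(ReadOnlySpan<byte> block)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");
        if (block.Length < Vector256<byte>.Count)
            throw new ArgumentException("Need at least 32 bytes.", nameof(block));
    }
}
```

Both methods should return the same value for the same input (528 for a block holding the bytes 1 through 32), so any difference is only in the generated code. On .NET 7+ the disassembly can be dumped by setting the DOTNET_JitDisasm environment variable to the method name; whether the redundant vxorps still shows up will depend on the runtime version.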

category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:small

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
