102 changes: 68 additions & 34 deletions src/coreclr/src/jit/codegenxarch.cpp
@@ -2989,13 +2989,19 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
// Fill as much as possible using SSE2 stores.
if (size >= XMM_REGSIZE_BYTES)
{
#ifdef FEATURE_SIMD
bool useYmm = (compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES) && (size >= YMM_REGSIZE_BYTES);
Member: Do we want to tie this to the size of Vector<T> rather than to the underlying ISAs that are available (based on the instructions used)?

Contributor (author): Do you mean to replace compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES with compiler->getSIMDSupportLevel() == SIMD_AVX2_Supported? I do not have a preference here; what is your opinion?

Member: No, to replace it with a compSupports(InstructionSet_AVX2) check so we aren't tied to Vector<T> when we aren't actually using Vector<T>. The appropriate check might change slightly with #33274, but I imagine this PR would get merged first. CC @davidwrighton

Contributor (author): Got it, will do that before merge, thanks.

#else
bool useYmm = false;
#endif
regNumber srcXmmReg = node->GetSingleTempReg(RBM_ALLFLOAT);

if (src->gtSkipReloadOrCopy()->IsIntegralConst(0))
{
unsigned regSize = useYmm ? YMM_REGSIZE_BYTES : XMM_REGSIZE_BYTES;
// If the source is constant 0 then always use xorps, it's faster
// than copying the constant from a GPR to a XMM register.
emit->emitIns_R_R(INS_xorps, EA_16BYTE, srcXmmReg, srcXmmReg);
emit->emitIns_R_R(INS_xorps, EA_ATTR(regSize), srcXmmReg, srcXmmReg);
}
else
{
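The zeroed-source fast path above relies on the xor zeroing idiom. A minimal intrinsics sketch of the two paths (hypothetical helper names, not the JIT's emitter API):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cassert>

// Zero fill: _mm_setzero_si128() compiles to a dependency-breaking
// xor idiom (xorps/pxor xmm, xmm), which is cheaper than loading 0
// into a GPR and moving it to the vector unit.
inline void fill_zero_16(uint8_t* dst)
{
    __m128i zero = _mm_setzero_si128();
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), zero);
}

// Non-zero fill: the 32-bit pattern travels GPR -> XMM (movd) and is
// replicated to all four lanes before the 16-byte store.
inline void fill_pattern_16(uint8_t* dst, uint32_t pattern)
{
    __m128i v = _mm_set1_epi32(static_cast<int>(pattern));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
}
```

On x86 the zero path never touches a GPR at all, which is why the PR always prefers it when the source is a constant 0.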
@@ -3005,20 +3011,36 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
// For x86, we need one more to convert it from 8 bytes to 16 bytes.
Member: This else block could be more efficient as just a shuffle (SSE2) or broadcast (AVX2). You can see the pattern we use for setting all elements of a vector to a value here:

Likewise, for V256 it can be a shuffle + insert (AVX) or broadcast (AVX2):

Contributor (author): We start with 2 bytes of data after emit->emitIns_R_R(INS_mov_i2xmm, EA_PTRSIZE, srcXmmReg, srcIntReg); if we want to use a broadcast, we will need to widen it to 4 bytes. It could still be profitable; I think it is worth opening an issue from your comment, probably marked as easy and up-for-grabs.

Member: How do we start with 2 bytes of data? This is mov_i2xmm, which is movd, which only supports reading 32 bits or 64 bits (hence the EA_PTRSIZE).

Contributor (author): Oh yes, we widen from the 1 byte that initblk accepts to int/long/float size in fgMorphPromoteLocalInitBlock.
emit->emitIns_R_R(INS_punpckldq, EA_16BYTE, srcXmmReg, srcXmmReg);
#endif
if (useYmm)
{
emit->emitIns_R_R(INS_punpckldq, EA_32BYTE, srcXmmReg, srcXmmReg);
}
}

instruction simdMov = simdUnalignedMovIns();
for (unsigned regSize = XMM_REGSIZE_BYTES; size >= regSize; size -= regSize, dstOffset += regSize)
{
if (dstLclNum != BAD_VAR_NUM)
{
emit->emitIns_S_R(simdMov, EA_ATTR(regSize), srcXmmReg, dstLclNum, dstOffset);
}
else

auto unrollUsingXMM = [&](unsigned regSize) {
for (; size >= regSize; size -= regSize, dstOffset += regSize)
{
emit->emitIns_ARX_R(simdMov, EA_ATTR(regSize), srcXmmReg, dstAddrBaseReg, dstAddrIndexReg,
dstAddrIndexScale, dstOffset);
if (dstLclNum != BAD_VAR_NUM)
{
emit->emitIns_S_R(simdMov, EA_ATTR(regSize), srcXmmReg, dstLclNum, dstOffset);
}
else
{
emit->emitIns_ARX_R(simdMov, EA_ATTR(regSize), srcXmmReg, dstAddrBaseReg, dstAddrIndexReg,
dstAddrIndexScale, dstOffset);
}
}
};

if (useYmm)
{
unrollUsingXMM(YMM_REGSIZE_BYTES);
}
if (size >= XMM_REGSIZE_BYTES)
{
unrollUsingXMM(XMM_REGSIZE_BYTES);
}

// TODO-CQ-XArch: On x86 we could initialize 8 byte at once by using MOVQ instead of two 4 byte MOV stores.
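The shuffle/broadcast alternative discussed in the review thread above can be sketched with intrinsics. This is a hedged illustration of the instruction pattern (pshufd vs. vpbroadcastd), not the JIT's actual emitter calls; helper names are made up:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cassert>

// SSE2: after movd (GPR -> XMM) the 32-bit value sits in lane 0;
// pshufd with immediate 0x00 replicates lane 0 into all four lanes.
inline __m128i broadcast32_sse2(uint32_t v)
{
    __m128i x = _mm_cvtsi32_si128(static_cast<int>(v)); // movd
    return _mm_shuffle_epi32(x, 0x00);                  // pshufd
}

#ifdef __AVX2__
// AVX2: a single vpbroadcastd fills all eight lanes of a YMM register,
// replacing the punpckldq sequence used in the diff above.
inline __m256i broadcast32_avx2(uint32_t v)
{
    return _mm256_set1_epi32(static_cast<int>(v));      // vpbroadcastd
}
#endif
```

Either way the value must first be widened to 32 bits, which matches the author's point that the 1-byte initblk value is already widened in fgMorphPromoteLocalInitBlock.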
@@ -3206,34 +3228,46 @@ void CodeGen::genCodeForCpBlkUnroll(GenTreeBlk* node)
{
regNumber tempReg = node->GetSingleTempReg(RBM_ALLFLOAT);

instruction simdMov = simdUnalignedMovIns();
for (unsigned regSize = XMM_REGSIZE_BYTES; size >= regSize;
size -= regSize, srcOffset += regSize, dstOffset += regSize)
{
if (srcLclNum != BAD_VAR_NUM)
{
emit->emitIns_R_S(simdMov, EA_ATTR(regSize), tempReg, srcLclNum, srcOffset);
}
else
auto unrollUsingXMM = [&](unsigned regSize, regNumber tempReg) {
Contributor: I'd prefer to make these actual methods rather than lambdas. To me the lambdas make it more difficult to see what's being modified, and it seems that sometimes the debugger(s) don't support them all that well. That said, doing that here would require some out (pointer) arguments. @BruceForstall, do we have any guidance on this for the JIT? I'm interested in others' thoughts on this @dotnet/jit-contrib. Perhaps it's only me that finds them obfuscating.

Contributor (author): Yes, I did not want to make this a method because it would need 7+ arguments, and ~5 of them would be pointers or references.

Contributor: Right, I realize that there are tradeoffs, but I would probably make them in the other direction. I'd still like to hear from other JIT devs on this question, because I think it will continue to come up. Ping @dotnet/jit-contrib.

Member: Given that this is mainly just used to cover the regular + trailing case, and the trailing case should only ever be one iteration, might it be simpler to just copy the logic for the one trailing case needed?

Contributor: For just this case I don't think it's worth debating over, but I think this is a good opportunity to discuss these tradeoffs relative to the JIT coding conventions. It may be that most others feel as I do, but it may also be that most JIT devs would prefer the slight obfuscation and possible debug challenges over the admittedly messy approach of passing in a bunch of pointer arguments. As I say, it's a tradeoff, and it would be nice to get some consensus on where the balance should lie.
instruction simdMov = simdUnalignedMovIns();
for (; size >= regSize; size -= regSize, srcOffset += regSize, dstOffset += regSize)
{
emit->emitIns_R_ARX(simdMov, EA_ATTR(regSize), tempReg, srcAddrBaseReg, srcAddrIndexReg,
srcAddrIndexScale, srcOffset);
}
if (srcLclNum != BAD_VAR_NUM)
{
emit->emitIns_R_S(simdMov, EA_ATTR(regSize), tempReg, srcLclNum, srcOffset);
}
else
{
emit->emitIns_R_ARX(simdMov, EA_ATTR(regSize), tempReg, srcAddrBaseReg, srcAddrIndexReg,
srcAddrIndexScale, srcOffset);
}

if (dstLclNum != BAD_VAR_NUM)
{
emit->emitIns_S_R(simdMov, EA_ATTR(regSize), tempReg, dstLclNum, dstOffset);
}
else
{
emit->emitIns_ARX_R(simdMov, EA_ATTR(regSize), tempReg, dstAddrBaseReg, dstAddrIndexReg,
dstAddrIndexScale, dstOffset);
if (dstLclNum != BAD_VAR_NUM)
{
emit->emitIns_S_R(simdMov, EA_ATTR(regSize), tempReg, dstLclNum, dstOffset);
}
else
{
emit->emitIns_ARX_R(simdMov, EA_ATTR(regSize), tempReg, dstAddrBaseReg, dstAddrIndexReg,
dstAddrIndexScale, dstOffset);
}
}
}

// TODO-CQ-XArch: On x86 we could copy 8 byte at once by using MOVQ instead of four 4 byte MOV stores.
// On x64 it may also be worth copying a 4/8 byte remainder using MOVD/MOVQ, that avoids the need to
// allocate a GPR just for the remainder.
// TODO-CQ-XArch: On x86 we could copy 8 byte at once by using MOVQ instead of four 4 byte MOV stores.
// On x64 it may also be worth copying a 4/8 byte remainder using MOVD/MOVQ, that avoids the need to
// allocate a GPR just for the remainder.
};

#ifdef FEATURE_SIMD
if ((compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES) && (size >= YMM_REGSIZE_BYTES))
{
unrollUsingXMM(YMM_REGSIZE_BYTES, tempReg);
}
#endif
if (size >= XMM_REGSIZE_BYTES)
{
unrollUsingXMM(XMM_REGSIZE_BYTES, tempReg);
}
}

if (size > 0)
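The regular + trailing structure that the unrollUsingXMM lambda implements can be sketched in portable C++, with memcpy standing in for one SIMD load/store pair (sizes and helper names are illustrative, not the JIT's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cassert>

// One copy loop, parameterized by chunk size. Called first with the
// wide (YMM-like, 32-byte) size and then with the narrow (XMM-like,
// 16-byte) size; the second call runs at most one iteration because
// the remainder after the first call is less than 32 bytes.
inline void copy_chunks(uint8_t*& dst, const uint8_t*& src, size_t& size,
                        size_t chunk)
{
    for (; size >= chunk; size -= chunk, src += chunk, dst += chunk) {
        std::memcpy(dst, src, chunk); // stands in for simdMov load + store
    }
}

// Returns the remainder (< 16 bytes) left for GPR-sized stores.
inline size_t unrolled_copy(uint8_t* dst, const uint8_t* src, size_t size)
{
    copy_chunks(dst, src, size, 32); // regular wide chunks
    if (size >= 16) {
        copy_chunks(dst, src, size, 16); // at most one trailing chunk
    }
    return size;
}
```

This mirrors the reviewer's observation: the lambda only exists to share the loop body between the regular case and the single trailing iteration.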
2 changes: 1 addition & 1 deletion src/coreclr/src/jit/compiler.h
@@ -8173,7 +8173,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
{
return false;
}
#endif // FEATURE_SIMD
#endif // !FEATURE_SIMD

public:
//------------------------------------------------------------------------