Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Implement SetAllVector128 SSE2 hardware intrinsic#16758

Closed
4creators wants to merge 3 commits into
dotnet:masterfrom
dotnetrt:SetAllVector128
Closed

Implement SetAllVector128 SSE2 hardware intrinsic#16758
4creators wants to merge 3 commits into
dotnet:masterfrom
dotnetrt:SetAllVector128

Conversation

@4creators
Copy link
Copy Markdown

@4creators 4creators commented Mar 5, 2018

FYI @CarolEidt @fiigii @tannergooding @mikedn

This PR depends on #16736 and contains implementation of Sse2.SetScalarVector128 from that PR.

Here I am facing couple of problems with code gen which are difficult to resolve without some external advice. In particular i do not how to avoid duplication of generated code (subtree) while maintaining use of the same data (preferably same register):

                case TYP_SHORT:
                case TYP_USHORT:
                {
                    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT, simdSize);
                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow, TYP_SHORT, simdSize);
                    retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
                    break;
                }

Use of gtCloneExpr(src) leads to initialization of separate XMM register while I need to use the same register initialized earlier for both operands.

break;
}

case NI_SSE2_SetAllVector128:
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My plan is to have final refactoring of case case NI_SSE2_SetAllVector128 after code gen problems will be solved.

Comment thread src/jit/hwintrinsicxarch.cpp Outdated
src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
simdSize);
GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow,
TYP_SHORT, simdSize);
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First tree duplication.

Comment thread src/jit/hwintrinsicxarch.cpp Outdated
GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow,
TYP_BYTE, simdSize);
GenTree* shu = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtCloneExpr(tmp), NI_SSE2_UnpackLow,
TYP_SHORT, simdSize);
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second set of tree duplications

@tannergooding
Copy link
Copy Markdown
Member

Use of gtCloneExpr(src) leads to initialization of separate XMM register while I need to use the same register initialized earlier for both operands.

I think it was @mikedn who mentioned using gtCloneExpr might have problems with more complex trees.

I would guess you want to create a local for src and then pass the local to NI_SSE2_Shuffle node. However, I am not familiar enough with the JIT yet to know how to do that (or if there is a better way to do it).

@AndyAyersMS
Copy link
Copy Markdown
Member

impCloneExpr is what you need. It copies very simple trees and spills more complex ones to temps that it creates.

@4creators
Copy link
Copy Markdown
Author

I would guess you want to create a local for src and then pass the local to NI_SSE2_Shuffle

Yes, that's the idea just to use single register for all operations.

@mikedn
Copy link
Copy Markdown

mikedn commented Mar 5, 2018

Here I am facing couple of problems with code gen which are difficult to resolve without some external advice. In particular i do not how to avoid duplication of generated code (subtree) while maintaining use of the same data (preferably same register):

It's not obvious to me why do you need to expand "set all" in the importer. Why can't NI_SSE2_SetAllVector128 be imported "as is" and have codegen generate the necessary instruction sequence?

Comment thread src/jit/hwintrinsiccodegenxarch.cpp Outdated
// Ensure we aren't overwriting targetReg
assert(tmpReg != targetReg);

emit->emitIns_R_R(INS_movapd, emitTypeSize(TYP_SIMD16), tmpReg, op1Reg);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ps instructions when they are equivalent to pd instructions. They have shorted encoding when VEX is not available.

Applies to the xorpd below as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've not been doing this for helper intrinsics so far. @CarolEidt, @fiigii, does this seem like something we can/should do for the helper intrinsics specifically?

For the regular intrinsics, we have a contract mapping it to a given instruction (but not encoding). However, for the helper intrinsics, we have a contract of behavior instead, so it might make sense for these to produce the "most optimal" codegen.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You really should print that contract on a piece of paper and burn it.

@tannergooding
Copy link
Copy Markdown
Member

It's not obvious to me why do you need to expand "set all" in the importer. Why can't NI_SSE2_SetAllVector128 be imported "as is"

The general idea is to give more flexibility for these helper intrinsics. If we handle it in the importer by transforming it to the "equivalent" set of intrinsics nodes that we would actually generate, it is easier for containment or other optimizations to be handled appropriately.

It also makes it significantly easier to handle things like SetVector128(byte), which takes 16 args.

@mikedn
Copy link
Copy Markdown

mikedn commented Mar 5, 2018

The general idea is to give more flexibility for these helper intrinsics. If we handle it in the importer by transforming it to the "equivalent" set of intrinsics nodes that we would actually generate, it is easier for containment or other optimizations to be handled appropriately.

The idea of "general idea" should be taken with a grain of salt in the JIT world. It would be great if the JIT IR could handle all the intricacies of assembly instructions but for better or worse that isn't the case (likely for better since it would be quite costly otherwise). As such, choosing how to represent something in the IR is based on a "best idea" and not on a "general idea". The current approach is not exactly best due to the need of temporaries so let's say it has -1 points. How many points does the codegen approach has?

It also makes it significantly easier to handle things like SetVector128(byte), which takes 16 args.

That's such an extreme case that it should be analyzed separately. I suspect that in that case you have no choice but to expand in the importer (or maybe in lowering), otherwise the RA will probably have a hard time dealing with 16 args that may all need to be in GPRs...

@4creators
Copy link
Copy Markdown
Author

For word sized types codegen is almost as desired, except for unnecessary register spill to the stack and later usage in vpunpcklwd and for SSE encoding usage of additional XMM1 register which is not necessary:

       8B8D38FCFFFF         mov      ecx, dword ptr [rbp-3C8H]
       0FB7C9               movzx    rcx, cx
       C4E1796EC1           vmovd    xmm0, xrcx
       C4E179298560F8FFFF   vmovapd  xmmword ptr [rbp-7A0H], xmm0
       C4E179288560F8FFFF   vmovapd  xmm0, xmmword ptr [rbp-7A0H]
       C4E179618560F8FFFF   vpunpcklwd xmm0, xmm0, xmmword ptr [rbp-7A0H]
       C4E17970C000         vpshufd  xmm0, xmm0, 0
       8B8D38FCFFFF         mov      ecx, dword ptr [rbp-3C8H]
       0FB7C9               movzx    rcx, cx
       660F6EC1             movd     xmm0, xrcx
       0F298560F8FFFF       movaps   xmmword ptr [rbp-7A0H], xmm0
       0F288560F8FFFF       movaps   xmm0, xmmword ptr [rbp-7A0H]
       0F288D60F8FFFF       movaps   xmm1, xmmword ptr [rbp-7A0H]
       660F61C1             punpcklwd xmm0, xmm1
       660F70C000           pshufd   xmm0, xmm0, 0

@4creators
Copy link
Copy Markdown
Author

The importer code for the above is:

                case TYP_SHORT:
                case TYP_USHORT:
                {
                    GenTree* srcClone = nullptr;

                    // Initialize XMM register with initial value
                    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
                                                   simdSize);

                    // src is used twice - clone it
                    CORINFO_CLASS_HANDLE vectorHandle = (baseType == TYP_SHORT) ? Vector128ShortHandle : Vector128UShortHandle;
                    src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                        nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));

                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
                                                            TYP_SHORT, simdSize);
                    retNode =
                        gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
                    break;
                }

@4creators
Copy link
Copy Markdown
Author

For byte sized operands codegen is good as well except for repetitive register spills:

       8B8D18FCFFFF         mov      ecx, dword ptr [rbp-3E8H]
       480FBEC9             movsx    rcx, cl
       C4E1796EC1           vmovd    xmm0, xrcx
       C4E179298540F8FFFF   vmovapd  xmmword ptr [rbp-7C0H], xmm0
       C4E179288540F8FFFF   vmovapd  xmm0, xmmword ptr [rbp-7C0H]
       C4E179608540F8FFFF   vpunpcklbw xmm0, xmm0, xmmword ptr [rbp-7C0H]
       C4E179298530F8FFFF   vmovapd  xmmword ptr [rbp-7D0H], xmm0
       C4E179288530F8FFFF   vmovapd  xmm0, xmmword ptr [rbp-7D0H]
       C4E179618530F8FFFF   vpunpcklwd xmm0, xmm0, xmmword ptr [rbp-7D0H]
       C4E17970C000         vpshufd  xmm0, xmm0, 0

for legacy SSE encoding unnecessary additional XMM register is used:

       480FBECE             movsx    rcx, sil
       660F6EC1             movd     xmm0, xrcx
       0F28C8               movaps   xmm1, xmm0
       660F60C8             punpcklbw xmm1, xmm0
       0F28C1               movaps   xmm0, xmm1
       660F61C1             punpcklwd xmm0, xmm1
       660F70C000           pshufd   xmm0, xmm0, 0

C++ code:

                case TYP_BYTE:
                case TYP_UBYTE:
                {
                    GenTree* srcClone = nullptr;
                    GenTree* tmpCloneOne = nullptr;

                    // Initialize XMM register with initial value
                    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
                                                   simdSize);

                    // src is used twice - clone it
                    CORINFO_CLASS_HANDLE vectorHandle = (baseType == TYP_BYTE) ? Vector128ByteHandle : Vector128UByteHandle;
                    src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                        nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));

                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
                                                            TYP_BYTE, simdSize);

                    // tmp is used three times so we clone it as well
                    tmp = impCloneExpr(tmp, &tmpCloneOne, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                        nullptr DEBUGARG("SetAllVector - duplicate tmp tree with initialized XMM register"));

                    GenTree* shu = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, tmpCloneOne, NI_SSE2_UnpackLow,
                                                            TYP_SHORT, simdSize);
                    retNode =
                        gtNewSimdHWIntrinsicNode(TYP_SIMD16, shu, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
                    break;
                }

Comment thread src/jit/hwintrinsicxarch.cpp Outdated
case TYP_DOUBLE:
{
src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_SetScalarVector128, baseType, simdSize);
retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), gtNewIconNode(0),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't be arbitrarily cloning trees.
I think the right thing to use is Compiler::fgMakeMultiUse, which will clone it if it is a local, and otherwise create a temp and modify the existing tree to add the assignment. Either way you get back a GT_LCL_VAR node to use for the next instance. You can see it in use here: https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L15008

Copy link
Copy Markdown

@fiigii fiigii Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, we can change Shuffle intrinsics to also support 1-arg form internally to solve this problem.

Copy link
Copy Markdown
Author

@4creators 4creators Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, we can change Shuffle intrinsics to also support 1-arg form internally to solve this problem.

This not necessary since you can use TYP_INT Shuffle as follows to get identical result.

    retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtNewIconNode(0b01000100), NI_SSE2_Shuffle,
                                       TYP_INT, simdSize);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt Do you think that usage of impCloneExpr for cases TYP_SHORT and TYP_BYTE as suggested by @AndyAyersMS is OK or I should better switch to Compiler::fgMakeMultiUse

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are slightly different, and I think Compiler::fgMakeMultiUse is the better choice. For one, it has a cleaner interface, because you don't need to pass it a GenTree**. The other difference is that impCloneExpr will clone any tree that doesn't have side-effects, which is probably not what you want for SIMD, because it will recompute the value, e.g. if it was a SIMD operation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's really better to go via fgMakeMultiUse it would be good to understand why; like I said the rest of the importer uses impCloneExpr pretty heavily.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

impCloneExpr is used pervasively in the importer for these kinds of decisions. It calls gtClone with default last param false. So it won't clone anything complex.

@AndyAyersMS - impCloneExpr actually passes true to gtClone, but I see that it will only clone simple trees. So that may be the way to go.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any hints how to block second XMM register use and get rid of movaps xmm1, xmm0 operations?

I would have to look at a jitdump to see what's going on.

Copy link
Copy Markdown
Author

@4creators 4creators Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarolEidt

These are dumps comprising only short, ushort, byte and sbyte tests for AVX and SSE codegen:

SetAllVector128RoDump.txt
SetAllVector128RoNoAVXDump.txt

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, yeah, it passes true, so it will clone add/sub/addr etc.

If there is something deficient about it it would be good to know what it is since it is heavily used. On the surface I don't see anything obvious.

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 5, 2018

It's not obvious to me why do you need to expand "set all" in the importer. Why can't NI_SSE2_SetAllVector128 be imported "as is" and have codegen generate the necessary instruction sequence?

@mikedn Another concern is that most of the helpers have different codgen solutions on different hardware. For example NI_SSE2_SetAllVector128 will use unpack on SSE2, insert on SSE4.1, or broadcast on AVX2 machines. They may have different register demands. So, I think expanding the herpler-intrinsics as early as possible can simplify the backend implementation.

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 5, 2018

For byte sized operands codegen is good as well except for repetitive register spills:

@4creators Just fouse on SSE encoding. This implementation will never be used on AVX machines, I will optimize Set* via SSE4.1 and AVX2 instructions later.

Comment thread src/jit/hwintrinsicxarch.cpp Outdated
src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));

GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change NI_SSE2_UnpackLow's numArg field to -1 to let UnpackLow accept 1-arg (internally), which avoids the impCloneExpr here.

BTW, the backend supports 1-arg from codgen for punpack*.

@4creators 4creators force-pushed the SetAllVector128 branch 2 times, most recently from 3709d0c to e8bb858 Compare March 5, 2018 23:35
Comment thread src/jit/compiler.h
// and transform the graph accordingly.
GenTree* fgInsertCommaFormTemp(GenTree** ppTree, CORINFO_CLASS_HANDLE structType = nullptr);
GenTree* fgMakeMultiUse(GenTree** ppTree);
GenTree* fgMakeMultiUse(GenTree** ppTree, CORINFO_CLASS_HANDLE structType = nullptr);
Copy link
Copy Markdown
Author

@4creators 4creators Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was necessary to change Compiler::fgMakeMultiUse to use CORINFO_CLASS_HANDLE structType = nullptr default parameter as nodes which are structures require structType. Otherwise, my code was hitting assertion:

Assert failure(PID 18536 [0x00004868], Thread: 8836 [0x2284]): Assertion failed 'structType != nullptr' in 'IntelHardwareIntrinsicTest.Program:Main(ref):int' (IL size 1093)

    File: e:\src\ms\dotnet\coreclr\src\jit\morph.cpp Line: 2725

Comment thread src/jit/morph.cpp
else
{
GenTree* result = fgInsertCommaFormTemp(pOp);
GenTree* result = fgInsertCommaFormTemp(pOp, structType);
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was arising here due to fgInsertCommaFormTemp logic which requires it when pOp is structure.

case TYP_DOUBLE:
{
src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_SetScalarVector128, baseType, simdSize);
retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtNewIconNode(0b01000100), NI_SSE2_Shuffle,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, we may need a phase to eliminate "floating conversion" intrinsics (like compiler-generated or user-written NI_SSE2_SetScalarVector128).
In most cases, these intrinsics just make the type system happy (match Vector128<float/double> with float/double), but the codegen is unnecessary.

Copy link
Copy Markdown

@fiigii fiigii Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here? Or make a new IR to match the types?
The current CQ of NI_SSE/SSE2_SetScalarVector128 is not good due to the "zero upper" semantics.

@CarolEidt @AndyAyersMS

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since CPU instructions are type agnostic we should have a type free but only operand size constrained internal codegen for intrinsics.

Copy link
Copy Markdown

@mikedn mikedn Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, we may need a phase to eliminate "floating conversion" intrinsics (like compiler-generated or user-written NI_SSE2_SetScalarVector128).

A phase to do that is probably more expensive that it's worth it. Also, it's not clear to what extent you can rely on the upper 64/96 bits of a float/double value to be 0 so in many cases it probably won't work.

Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here?

Probably, HWINTRINSIC nodes anyway accept and produce a variety of types. Most of the JIT doesn't pay attention to these nodes. It may also be possible to use a BITCAST node but I don't think it was ever used so early in the JIT so you may encounter issues.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikedn Looks a new helper-intrinsic is simpler? #16772

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here?

I don't think that is possible without introducing non-deterministic behavior.


In reply to: 172425955 [](ancestors = 172425955)

@4creators
Copy link
Copy Markdown
Author

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

@tannergooding
Copy link
Copy Markdown
Member

@4creators, could you update Sse.SetAllVector128 to no longer use gtCloneExpr as well (since you will need to rebase to resolve the merge conflicts anyways)?

@mikedn
Copy link
Copy Markdown

mikedn commented Mar 6, 2018

Another concern is that most of the helpers have different codgen solutions on different hardware. For example NI_SSE2_SetAllVector128 will use unpack on SSE2, insert on SSE4.1, or broadcast on AVX2 machines. They may have different register demands. So, I think expanding the herpler-intrinsics as early as possible can simplify the backend implementation.

At the cost of complicating the importer implementation. This kind of complexity can't go away, it can only be moved around. At least in codegen it wouldn't require temporary variables.

@tannergooding
Copy link
Copy Markdown
Member

At least in codegen it wouldn't require temporary variables.

But I believe it would also require additional complexity in order to ensure that the "ideal" codegen was generated.

It seems like a much simpler (and possibly better) approach, to just import it as the chain of intrinsics it would have been written as, had it been implemented in managed code. This allows the JIT to do all the appropriate transformations, folding, register allocations, etc later down the line and with the full context of the code we are actually generating.

I would like to hear what @CarolEidt or @AndyAyersMS thinks about the two approaches (transform in importer vs custom codegen later on).

@4creators
Copy link
Copy Markdown
Author

I think this is a good idea if the managed implementation can be imlined.

That should not be a problem due to size of code and on top of that we can use attrubute MethodImplementationOptions.AgressiveInlining

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 6, 2018

That would mean that it would be implemented in managed as:

BTW, I think the managed implementation can be a very good non-const fallback of Avx.Insert/Extract.

@mikedn
Copy link
Copy Markdown

mikedn commented Mar 6, 2018

It is, we are only consuming the lowest bits from the vector so the upper bits "don't matter".

Well, one more reason to do this in codegen. If you do it in codegen you can directly emit a shuffle instruction and be done with it.

Could you elaborate?

See my above comment about SetVector128. It's unlikely that you can get that right with a managed only implementation.

The purpose of doing this in the importer was to help simplify and contain the various logic to a single location (the importer) rather than spread it out between lowering, lsra, and codegen.

What I suggest has nothing to do with lowering. The codegen part would be similar to the importer thing. The only additional complexity is that you need to request an internal register in certain cases from lsra. How hard can it be? Might very well turn out to be simpler than the multi use acrobatics.

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 6, 2018

Well, one more reason to do this in codegen. If you do it in codegen you can directly emit a shuffle instruction and be done with it.

Sounds reasonable to me.

@CarolEidt
Copy link
Copy Markdown

I'm not sure if we're zeroing in on a consensus here, but @tannergooding's suggestion to implement some of these in managed code sounds quite reasonable, and @mikedn's comment is right on target that we don't want to forbid expansion in the importer, but rather that we want to ensure that complexity isn't spread throughout the JIT, and that we don't introduce non-determinism in the implementation.

@mikedn
Copy link
Copy Markdown

mikedn commented Mar 6, 2018

I think the long case on x86 needs a bit investigation before moving forward. Post-lowering a long node will usually end up being split into two separate nodes that are than "packed" together using a GT_LONG node. I have no idea how intrinsic handle that currently, if they handle it at all. In any case, this may have an impact on the exact implementation choice for SetAllVector(long) and any other intrinsics that have long arguments or long returns.

@mikedn
Copy link
Copy Markdown

mikedn commented Mar 6, 2018

I looked a bit around for long/ulong intrinsics, the situation seems a bit complicated:

  • There are "real" intrinsics (e.g. ConvertScalarToVector128Int64) that deal with long scalars and these simply refuse to work on x86. In a way this makes sense as the instructions that these intrinsics are supposed to generate are simply not available on x86. But I'm not sure if "makes sense" and "good idea" are friends here. This seems to be a bad idea, people will write code on x64 and then they'll discover that it fails to work on x86.
  • There are long returning helpers (e.g. long Extract(Vector256<long> value, byte index)). If these are supposed to work on x86 then it's not going to be fun to implement them as they probably need to pass through decomposition. Well, at least there's DecomposeLongs::DecomposeSimdGetItem as example.
  • And there are helpers with multiple long arguments (e.g. Vector256<long> SetVector256(long e3, long e2, long e1, long e0)). SetVector will be rather convoluted even without the need for long support...

@tannergooding
Copy link
Copy Markdown
Member

If these are supposed to work on x86

Everything except for the "helper" intrinsics are contracted to emit a particular instruction. If that instruction has no valid encoding on x86, it is supported to throw an exception.

movq is one of the few instructions that explicitly works with 64-bit values and that is also encodable on x86 (although only the mem encoding is available, the register encodings are non-encodable, obviously).

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 6, 2018

There are "real" intrinsics (e.g. ConvertScalarToVector128Int64) that deal with long scalars and these simply refuse to work on x86.

people will write code on x64 and then they'll discover that it fails to work on x86.

There are long returning helpers (e.g. long Extract(Vector256 value, byte index)).

Currently, for this situation, users are responsible to check the underlying hardware via Environment.Is64BitProcess. At this time, providing 32-bit fallback is to complicated to ship this feature...

And there are helpers with multiple long arguments (e.g. Vector256 SetVector256(long e3, long e2, long e1, long e0)). SetVector will be rather convoluted even without the need for long support...

This is a special case, we should provide 32-bit platform codgen for Vector256/128<long> SetVector256/128.

@4creators
Copy link
Copy Markdown
Author

@mikedn is right that HW intrinsic implementation is vulnerable to x86 long handling and some helpers are a bit problematic and are deserving deeper analysis for all intrinsics.

I would propose to move discussion to separate issue to avoid having multiple topics for discussion here.

@CarolEidt
Copy link
Copy Markdown

At this time, providing 32-bit fallback is to complicated to ship this feature...

I think that the right answer is that 32-bit fallback is not supported.

@mikedn
Copy link
Copy Markdown

mikedn commented Mar 6, 2018

The only additional complexity is that you need to request an internal register in certain cases from lsra. How hard can it be? Might very well turn out to be simpler than the multi use acrobatics.

And it looks to me that even that internal register is not needed. It only appears to be needed because the code generate by the current approach uses an additional register for no good reason:

vmovd    xmm0, ecx
vpunpcklbw xmm1, xmm0, xmm0
vpunpcklwd xmm0, xmm1, xmm1
vpshufd  xmm0, xmm0, 0

is just

vmovd    xmm0, ecx
vpunpcklbw xmm0, xmm0, xmm0
vpunpcklwd xmm0, xmm0, xmm0
vpshufd  xmm0, xmm0, 0

Talk about optimal code generation and additional complexity...

@CarolEidt
Copy link
Copy Markdown

OK, here's my view of guidelines here and going forward:

First, all intrinsics must have deterministic behavior.

For "user helper" intrinsics:

  • All things being equal (complexity-wise, and roughly CQ-wise):
    • First choice would be to implement (expand) them in managed code
    • Second choice would be to expand them in codegen
    • Last choice is to expand them in the importer.
  • Where it is clearly simpler or where it generates better code, we take the best approach for the given situation.

We may find that we also want "JIT helper" intrinsics, which are not exposed in the API (the SIMD support has some of these). Those are generally used only for cases where it is useful to abstract some capability out in the importer.

@4creators
Copy link
Copy Markdown
Author

First choice would be to implement (expand) them in managed code

Well I am doing this with great pleasure 😄

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 6, 2018

Well I am doing this with great pleasure

But I think SetAllVector128 may need "expand them in codegen" because this intrinsic usually could be used in performance critical paths.

@tannergooding
Copy link
Copy Markdown
Member

@fiigii, we should probably do whatever is simplest now, for the 2.1 preview of Intrinsics, and we can measure/tweak this later as needed.

@CarolEidt
Copy link
Copy Markdown

But I think SetAllVector128 may need "expand them in codegen" because this intrinsic usually could be used in performance critical paths.

I would have to be first convinced that we could not get the desired performance by expanding in managed code. I don't see an obvious reason why we would not.

@4creators
Copy link
Copy Markdown
Author

But I think SetAllVector128 may need "expand them in codegen" because this intrinsic usually could be used in performance critical paths.

Let me check how it works if we expand it in managed code first.

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 6, 2018

we should probably do whatever is simplest now, for the 2.1 preview of Intrinsics,

I would have to be first convinced that we could not get the desired performance by expanding in managed code.

Agree.

@fiigii
Copy link
Copy Markdown

fiigii commented Mar 6, 2018

Let me check how it works if we expand it in managed code first.

@4creators Thanks, could you also check the generated code on Linux if possible? I remember that RyuJIT/Linux has a little bit inefficient codgen problems for vector returning/passing functions even if with inlining.

@4creators
Copy link
Copy Markdown
Author

could you also check the generated code on Linux if possible?

Yes will do

@CarolEidt
Copy link
Copy Markdown

RyuJIT/Linux has a little bit inefficient codgen problems for vector returning/passing functions even if with inlining.

I don't want us to make implementation decisions based on known deficiencies in codegen.

@4creators
Copy link
Copy Markdown
Author

@tannergooding In the case #16797 will be accepted I would drop two SetAllVector128 commits and leave "Update implementation of SetAllVector128 SSE HW intrinsic" commit here.

@tannergooding
Copy link
Copy Markdown
Member

@4creators, why not just port Sse.SetAllVector128 to managed code as well?

That would best follow the new guidance (and would help prevent other people from using it as a baseline for future implementation work).

@4creators
Copy link
Copy Markdown
Author

@tannergooding Sure, I prefer managed implementation for helpers. Will submit commit to other PR soon.

@tannergooding
Copy link
Copy Markdown
Member

@4creators, can this PR be closed (I assume it was replaced by #16797)?

@4creators
Copy link
Copy Markdown
Author

Closing due to changed implementation in #16797

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants