Implement SetAllVector128 SSE2 hardware intrinsic by 4creators · Pull Request #16758 · dotnet/coreclr

4creators · 2018-03-05T17:27:18Z

FYI @CarolEidt @fiigii @tannergooding @mikedn

This PR depends on #16736 and contains implementation of Sse2.SetScalarVector128 from that PR.

Here I am facing couple of problems with code gen which are difficult to resolve without some external advice. In particular i do not how to avoid duplication of generated code (subtree) while maintaining use of the same data (preferably same register):

                case TYP_SHORT:
                case TYP_USHORT:
                {
                    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT, simdSize);
                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow, TYP_SHORT, simdSize);
                    retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
                    break;
                }

Use of gtCloneExpr(src) leads to initialization of separate XMM register while I need to use the same register initialized earlier for both operands.

4creators · 2018-03-05T17:34:32Z

            break;
        }

+        case NI_SSE2_SetAllVector128:


My plan is to have final refactoring of case case NI_SSE2_SetAllVector128 after code gen problems will be solved.

4creators · 2018-03-05T17:56:37Z

+                    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
+                                                   simdSize);
+                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow,
+                                                            TYP_SHORT, simdSize);


First tree duplication.

4creators · 2018-03-05T17:57:04Z

+                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow,
+                                                            TYP_BYTE, simdSize);
+                    GenTree* shu = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtCloneExpr(tmp), NI_SSE2_UnpackLow,
+                                                            TYP_SHORT, simdSize);


Second set of tree duplications

tannergooding · 2018-03-05T18:07:36Z

Use of gtCloneExpr(src) leads to initialization of separate XMM register while I need to use the same register initialized earlier for both operands.

I think it was @mikedn who mentioned using gtCloneExpr might have problems with more complex trees.

I would guess you want to create a local for src and then pass the local to NI_SSE2_Shuffle node. However, I am not familiar enough with the JIT yet to know how to do that (or if there is a better way to do it).

AndyAyersMS · 2018-03-05T18:10:37Z

impCloneExpr is what you need. It copies very simple trees and spills more complex ones to temps that it creates.

4creators · 2018-03-05T18:10:40Z

I would guess you want to create a local for src and then pass the local to NI_SSE2_Shuffle

Yes, that's the idea just to use single register for all operations.

mikedn · 2018-03-05T18:39:30Z

Here I am facing couple of problems with code gen which are difficult to resolve without some external advice. In particular i do not how to avoid duplication of generated code (subtree) while maintaining use of the same data (preferably same register):

It's not obvious to me why do you need to expand "set all" in the importer. Why can't NI_SSE2_SetAllVector128 be imported "as is" and have codegen generate the necessary instruction sequence?

mikedn · 2018-03-05T18:41:14Z

+                // Ensure we aren't overwriting targetReg
+                assert(tmpReg != targetReg);
+
+                emit->emitIns_R_R(INS_movapd, emitTypeSize(TYP_SIMD16), tmpReg, op1Reg);


Use ps instructions when they are equivalent to pd instructions. They have shorted encoding when VEX is not available.

Applies to the xorpd below as well.

We've not been doing this for helper intrinsics so far. @CarolEidt, @fiigii, does this seem like something we can/should do for the helper intrinsics specifically?

For the regular intrinsics, we have a contract mapping it to a given instruction (but not encoding). However, for the helper intrinsics, we have a contract of behavior instead, so it might make sense for these to produce the "most optimal" codegen.

You really should print that contract on a piece of paper and burn it.

tannergooding · 2018-03-05T18:50:02Z

It's not obvious to me why do you need to expand "set all" in the importer. Why can't NI_SSE2_SetAllVector128 be imported "as is"

The general idea is to give more flexibility for these helper intrinsics. If we handle it in the importer by transforming it to the "equivalent" set of intrinsics nodes that we would actually generate, it is easier for containment or other optimizations to be handled appropriately.

It also makes it significantly easier to handle things like SetVector128(byte), which takes 16 args.

mikedn · 2018-03-05T19:16:26Z

The general idea is to give more flexibility for these helper intrinsics. If we handle it in the importer by transforming it to the "equivalent" set of intrinsics nodes that we would actually generate, it is easier for containment or other optimizations to be handled appropriately.

The idea of "general idea" should be taken with a grain of salt in the JIT world. It would be great if the JIT IR could handle all the intricacies of assembly instructions but for better or worse that isn't the case (likely for better since it would be quite costly otherwise). As such, choosing how to represent something in the IR is based on a "best idea" and not on a "general idea". The current approach is not exactly best due to the need of temporaries so let's say it has -1 points. How many points does the codegen approach has?

It also makes it significantly easier to handle things like SetVector128(byte), which takes 16 args.

That's such an extreme case that it should be analyzed separately. I suspect that in that case you have no choice but to expand in the importer (or maybe in lowering), otherwise the RA will probably have a hard time dealing with 16 args that may all need to be in GPRs...

4creators · 2018-03-05T20:31:15Z

For word sized types codegen is almost as desired, except for unnecessary register spill to the stack and later usage in vpunpcklwd and for SSE encoding usage of additional XMM1 register which is not necessary:

       8B8D38FCFFFF         mov      ecx, dword ptr [rbp-3C8H]
       0FB7C9               movzx    rcx, cx
       C4E1796EC1           vmovd    xmm0, xrcx
       C4E179298560F8FFFF   vmovapd  xmmword ptr [rbp-7A0H], xmm0
       C4E179288560F8FFFF   vmovapd  xmm0, xmmword ptr [rbp-7A0H]
       C4E179618560F8FFFF   vpunpcklwd xmm0, xmm0, xmmword ptr [rbp-7A0H]
       C4E17970C000         vpshufd  xmm0, xmm0, 0

       8B8D38FCFFFF         mov      ecx, dword ptr [rbp-3C8H]
       0FB7C9               movzx    rcx, cx
       660F6EC1             movd     xmm0, xrcx
       0F298560F8FFFF       movaps   xmmword ptr [rbp-7A0H], xmm0
       0F288560F8FFFF       movaps   xmm0, xmmword ptr [rbp-7A0H]
       0F288D60F8FFFF       movaps   xmm1, xmmword ptr [rbp-7A0H]
       660F61C1             punpcklwd xmm0, xmm1
       660F70C000           pshufd   xmm0, xmm0, 0

4creators · 2018-03-05T20:33:41Z

The importer code for the above is:

                case TYP_SHORT:
                case TYP_USHORT:
                {
                    GenTree* srcClone = nullptr;

                    // Initialize XMM register with initial value
                    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
                                                   simdSize);

                    // src is used twice - clone it
                    CORINFO_CLASS_HANDLE vectorHandle = (baseType == TYP_SHORT) ? Vector128ShortHandle : Vector128UShortHandle;
                    src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                        nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));

                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
                                                            TYP_SHORT, simdSize);
                    retNode =
                        gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
                    break;
                }

4creators · 2018-03-05T20:45:39Z

For byte sized operands codegen is good as well except for repetitive register spills:

       8B8D18FCFFFF         mov      ecx, dword ptr [rbp-3E8H]
       480FBEC9             movsx    rcx, cl
       C4E1796EC1           vmovd    xmm0, xrcx
       C4E179298540F8FFFF   vmovapd  xmmword ptr [rbp-7C0H], xmm0
       C4E179288540F8FFFF   vmovapd  xmm0, xmmword ptr [rbp-7C0H]
       C4E179608540F8FFFF   vpunpcklbw xmm0, xmm0, xmmword ptr [rbp-7C0H]
       C4E179298530F8FFFF   vmovapd  xmmword ptr [rbp-7D0H], xmm0
       C4E179288530F8FFFF   vmovapd  xmm0, xmmword ptr [rbp-7D0H]
       C4E179618530F8FFFF   vpunpcklwd xmm0, xmm0, xmmword ptr [rbp-7D0H]
       C4E17970C000         vpshufd  xmm0, xmm0, 0

for legacy SSE encoding unnecessary additional XMM register is used:

       480FBECE             movsx    rcx, sil
       660F6EC1             movd     xmm0, xrcx
       0F28C8               movaps   xmm1, xmm0
       660F60C8             punpcklbw xmm1, xmm0
       0F28C1               movaps   xmm0, xmm1
       660F61C1             punpcklwd xmm0, xmm1
       660F70C000           pshufd   xmm0, xmm0, 0

C++ code:

                case TYP_BYTE:
                case TYP_UBYTE:
                {
                    GenTree* srcClone = nullptr;
                    GenTree* tmpCloneOne = nullptr;

                    // Initialize XMM register with initial value
                    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
                                                   simdSize);

                    // src is used twice - clone it
                    CORINFO_CLASS_HANDLE vectorHandle = (baseType == TYP_BYTE) ? Vector128ByteHandle : Vector128UByteHandle;
                    src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                        nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));

                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
                                                            TYP_BYTE, simdSize);

                    // tmp is used three times so we clone it as well
                    tmp = impCloneExpr(tmp, &tmpCloneOne, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                        nullptr DEBUGARG("SetAllVector - duplicate tmp tree with initialized XMM register"));

                    GenTree* shu = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, tmpCloneOne, NI_SSE2_UnpackLow,
                                                            TYP_SHORT, simdSize);
                    retNode =
                        gtNewSimdHWIntrinsicNode(TYP_SIMD16, shu, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
                    break;
                }

CarolEidt · 2018-03-05T21:07:48Z

+                case TYP_DOUBLE:
+                {
+                    src     = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_SetScalarVector128, baseType, simdSize);
+                    retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), gtNewIconNode(0),


You shouldn't be arbitrarily cloning trees.
I think the right thing to use is Compiler::fgMakeMultiUse, which will clone it if it is a local, and otherwise create a temp and modify the existing tree to add the assignment. Either way you get back a GT_LCL_VAR node to use for the next instance. You can see it in use here: https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L15008

Perhaps, we can change Shuffle intrinsics to also support 1-arg form internally to solve this problem.

Perhaps, we can change Shuffle intrinsics to also support 1-arg form internally to solve this problem.

This not necessary since you can use TYP_INT Shuffle as follows to get identical result.

retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtNewIconNode(0b01000100), NI_SSE2_Shuffle, TYP_INT, simdSize);

@CarolEidt Do you think that usage of impCloneExpr for cases TYP_SHORT and TYP_BYTE as suggested by @AndyAyersMS is OK or I should better switch to Compiler::fgMakeMultiUse

They are slightly different, and I think Compiler::fgMakeMultiUse is the better choice. For one, it has a cleaner interface, because you don't need to pass it a GenTree**. The other difference is that impCloneExpr will clone any tree that doesn't have side-effects, which is probably not what you want for SIMD, because it will recompute the value, e.g. if it was a SIMD operation.

If it's really better to go via fgMakeMultiUse it would be good to understand why; like I said the rest of the importer uses impCloneExpr pretty heavily.

impCloneExpr is used pervasively in the importer for these kinds of decisions. It calls gtClone with default last param false. So it won't clone anything complex.

@AndyAyersMS - impCloneExpr actually passes true to gtClone, but I see that it will only clone simple trees. So that may be the way to go.

Do you have any hints how to block second XMM register use and get rid of movaps xmm1, xmm0 operations?

I would have to look at a jitdump to see what's going on.

@CarolEidt

These are dumps comprising only short, ushort, byte and sbyte tests for AVX and SSE codegen:

SetAllVector128RoDump.txt
SetAllVector128RoNoAVXDump.txt

Ooops, yeah, it passes true, so it will clone add/sub/addr etc.

If there is something deficient about it it would be good to know what it is since it is heavily used. On the surface I don't see anything obvious.

fiigii · 2018-03-05T21:14:21Z

It's not obvious to me why do you need to expand "set all" in the importer. Why can't NI_SSE2_SetAllVector128 be imported "as is" and have codegen generate the necessary instruction sequence?

@mikedn Another concern is that most of the helpers have different codgen solutions on different hardware. For example NI_SSE2_SetAllVector128 will use unpack on SSE2, insert on SSE4.1, or broadcast on AVX2 machines. They may have different register demands. So, I think expanding the herpler-intrinsics as early as possible can simplify the backend implementation.

fiigii · 2018-03-05T21:18:19Z

For byte sized operands codegen is good as well except for repetitive register spills:

@4creators Just fouse on SSE encoding. This implementation will never be used on AVX machines, I will optimize Set* via SSE4.1 and AVX2 instructions later.

fiigii · 2018-03-05T22:21:11Z

+                    src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
+                        nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));
+
+                    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,


We can change NI_SSE2_UnpackLow's numArg field to -1 to let UnpackLow accept 1-arg (internally), which avoids the impCloneExpr here.

BTW, the backend supports 1-arg from codgen for punpack*.

4creators · 2018-03-05T23:42:13Z

    //                  and transform the graph accordingly.
    GenTree* fgInsertCommaFormTemp(GenTree** ppTree, CORINFO_CLASS_HANDLE structType = nullptr);
-    GenTree* fgMakeMultiUse(GenTree** ppTree);
+    GenTree* fgMakeMultiUse(GenTree** ppTree, CORINFO_CLASS_HANDLE structType = nullptr);


It was necessary to change Compiler::fgMakeMultiUse to use CORINFO_CLASS_HANDLE structType = nullptr default parameter as nodes which are structures require structType. Otherwise, my code was hitting assertion:

Assert failure(PID 18536 [0x00004868], Thread: 8836 [0x2284]): Assertion failed 'structType != nullptr' in 'IntelHardwareIntrinsicTest.Program:Main(ref):int' (IL size 1093) File: e:\src\ms\dotnet\coreclr\src\jit\morph.cpp Line: 2725

4creators · 2018-03-05T23:44:11Z

    else
    {
-        GenTree* result = fgInsertCommaFormTemp(pOp);
+        GenTree* result = fgInsertCommaFormTemp(pOp, structType);


The problem was arising here due to fgInsertCommaFormTemp logic which requires it when pOp is structure.

fiigii · 2018-03-06T00:00:59Z

+                case TYP_DOUBLE:
+                {
+                    src     = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_SetScalarVector128, baseType, simdSize);
+                    retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtNewIconNode(0b01000100), NI_SSE2_Shuffle,


In the future, we may need a phase to eliminate "floating conversion" intrinsics (like compiler-generated or user-written NI_SSE2_SetScalarVector128).
In most cases, these intrinsics just make the type system happy (match Vector128<float/double> with float/double), but the codegen is unnecessary.

Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here? Or make a new IR to match the types?
The current CQ of NI_SSE/SSE2_SetScalarVector128 is not good due to the "zero upper" semantics.

@CarolEidt @AndyAyersMS

Actually, since CPU instructions are type agnostic we should have a type free but only operand size constrained internal codegen for intrinsics.

In the future, we may need a phase to eliminate "floating conversion" intrinsics (like compiler-generated or user-written NI_SSE2_SetScalarVector128).

A phase to do that is probably more expensive that it's worth it. Also, it's not clear to what extent you can rely on the upper 64/96 bits of a float/double value to be 0 so in many cases it probably won't work.

Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here?

Probably, HWINTRINSIC nodes anyway accept and produce a variety of types. Most of the JIT doesn't pay attention to these nodes. It may also be possible to use a BITCAST node but I don't think it was ever used so early in the JIT so you may encounter issues.

@mikedn Looks a new helper-intrinsic is simpler? #16772

Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here?

I don't think that is possible without introducing non-deterministic behavior.

In reply to: 172425955 [](ancestors = 172425955)

4creators · 2018-03-06T01:48:11Z

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

tannergooding · 2018-03-06T03:18:36Z

@4creators, could you update Sse.SetAllVector128 to no longer use gtCloneExpr as well (since you will need to rebase to resolve the merge conflicts anyways)?

mikedn · 2018-03-06T06:36:12Z

Another concern is that most of the helpers have different codgen solutions on different hardware. For example NI_SSE2_SetAllVector128 will use unpack on SSE2, insert on SSE4.1, or broadcast on AVX2 machines. They may have different register demands. So, I think expanding the herpler-intrinsics as early as possible can simplify the backend implementation.

At the cost of complicating the importer implementation. This kind of complexity can't go away, it can only be moved around. At least in codegen it wouldn't require temporary variables.

tannergooding · 2018-03-06T17:46:27Z

At least in codegen it wouldn't require temporary variables.

But I believe it would also require additional complexity in order to ensure that the "ideal" codegen was generated.

It seems like a much simpler (and possibly better) approach, to just import it as the chain of intrinsics it would have been written as, had it been implemented in managed code. This allows the JIT to do all the appropriate transformations, folding, register allocations, etc later down the line and with the full context of the code we are actually generating.

I would like to hear what @CarolEidt or @AndyAyersMS thinks about the two approaches (transform in importer vs custom codegen later on).

4creators · 2018-03-06T20:17:54Z

I think this is a good idea if the managed implementation can be imlined.

That should not be a problem due to size of code and on top of that we can use attrubute MethodImplementationOptions.AgressiveInlining

fiigii · 2018-03-06T20:20:01Z

That would mean that it would be implemented in managed as:

BTW, I think the managed implementation can be a very good non-const fallback of Avx.Insert/Extract.

mikedn · 2018-03-06T20:20:28Z

It is, we are only consuming the lowest bits from the vector so the upper bits "don't matter".

Well, one more reason to do this in codegen. If you do it in codegen you can directly emit a shuffle instruction and be done with it.

Could you elaborate?

See my above comment about SetVector128. It's unlikely that you can get that right with a managed only implementation.

The purpose of doing this in the importer was to help simplify and contain the various logic to a single location (the importer) rather than spread it out between lowering, lsra, and codegen.

What I suggest has nothing to do with lowering. The codegen part would be similar to the importer thing. The only additional complexity is that you need to request an internal register in certain cases from lsra. How hard can it be? Might very well turn out to be simpler than the multi use acrobatics.

fiigii · 2018-03-06T20:24:20Z

Well, one more reason to do this in codegen. If you do it in codegen you can directly emit a shuffle instruction and be done with it.

Sounds reasonable to me.

CarolEidt · 2018-03-06T20:35:03Z

I'm not sure if we're zeroing in on a consensus here, but @tannergooding's suggestion to implement some of these in managed code sounds quite reasonable, and @mikedn's comment is right on target that we don't want to forbid expansion in the importer, but rather that we want to ensure that complexity isn't spread throughout the JIT, and that we don't introduce non-determinism in the implementation.

mikedn · 2018-03-06T20:37:56Z

I think the long case on x86 needs a bit investigation before moving forward. Post-lowering a long node will usually end up being split into two separate nodes that are than "packed" together using a GT_LONG node. I have no idea how intrinsic handle that currently, if they handle it at all. In any case, this may have an impact on the exact implementation choice for SetAllVector(long) and any other intrinsics that have long arguments or long returns.

mikedn · 2018-03-06T21:22:45Z

I looked a bit around for long/ulong intrinsics, the situation seems a bit complicated:

There are "real" intrinsics (e.g. ConvertScalarToVector128Int64) that deal with long scalars and these simply refuse to work on x86. In a way this makes sense as the instructions that these intrinsics are supposed to generate are simply not available on x86. But I'm not sure if "makes sense" and "good idea" are friends here. This seems to be a bad idea, people will write code on x64 and then they'll discover that it fails to work on x86.
There are long returning helpers (e.g. long Extract(Vector256<long> value, byte index)). If these are supposed to work on x86 then it's not going to be fun to implement them as they probably need to pass through decomposition. Well, at least there's DecomposeLongs::DecomposeSimdGetItem as example.
And there are helpers with multiple long arguments (e.g. Vector256<long> SetVector256(long e3, long e2, long e1, long e0)). SetVector will be rather convoluted even without the need for long support...

tannergooding · 2018-03-06T21:32:50Z

If these are supposed to work on x86

Everything except for the "helper" intrinsics are contracted to emit a particular instruction. If that instruction has no valid encoding on x86, it is supported to throw an exception.

movq is one of the few instructions that explicitly works with 64-bit values and that is also encodable on x86 (although only the mem encoding is available, the register encodings are non-encodable, obviously).

fiigii · 2018-03-06T21:35:19Z

There are "real" intrinsics (e.g. ConvertScalarToVector128Int64) that deal with long scalars and these simply refuse to work on x86.

people will write code on x64 and then they'll discover that it fails to work on x86.

There are long returning helpers (e.g. long Extract(Vector256 value, byte index)).

Currently, for this situation, users are responsible to check the underlying hardware via Environment.Is64BitProcess. At this time, providing 32-bit fallback is to complicated to ship this feature...

And there are helpers with multiple long arguments (e.g. Vector256 SetVector256(long e3, long e2, long e1, long e0)). SetVector will be rather convoluted even without the need for long support...

This is a special case, we should provide 32-bit platform codgen for Vector256/128<long> SetVector256/128.

4creators · 2018-03-06T21:55:48Z

@mikedn is right that HW intrinsic implementation is vulnerable to x86 long handling and some helpers are a bit problematic and are deserving deeper analysis for all intrinsics.

I would propose to move discussion to separate issue to avoid having multiple topics for discussion here.

CarolEidt · 2018-03-06T22:05:28Z

At this time, providing 32-bit fallback is to complicated to ship this feature...

I think that the right answer is that 32-bit fallback is not supported.

mikedn · 2018-03-06T22:06:16Z

The only additional complexity is that you need to request an internal register in certain cases from lsra. How hard can it be? Might very well turn out to be simpler than the multi use acrobatics.

And it looks to me that even that internal register is not needed. It only appears to be needed because the code generate by the current approach uses an additional register for no good reason:

vmovd    xmm0, ecx
vpunpcklbw xmm1, xmm0, xmm0
vpunpcklwd xmm0, xmm1, xmm1
vpshufd  xmm0, xmm0, 0

is just

vmovd    xmm0, ecx
vpunpcklbw xmm0, xmm0, xmm0
vpunpcklwd xmm0, xmm0, xmm0
vpshufd  xmm0, xmm0, 0

Talk about optimal code generation and additional complexity...

CarolEidt · 2018-03-06T22:12:04Z

OK, here's my view of guidelines here and going forward:

First, all intrinsics must have deterministic behavior.

For "user helper" intrinsics:

All things being equal (complexity-wise, and roughly CQ-wise):
- First choice would be to implement (expand) them in managed code
- Second choice would be to expand them in codegen
- Last choice is to expand them in the importer.
Where it is clearly simpler or where it generates better code, we take the best approach for the given situation.

We may find that we also want "JIT helper" intrinsics, which are not exposed in the API (the SIMD support has some of these). Those are generally used only for cases where it is useful to abstract some capability out in the importer.

4creators · 2018-03-06T22:54:00Z

First choice would be to implement (expand) them in managed code

Well I am doing this with great pleasure 😄

fiigii · 2018-03-06T23:02:12Z

Well I am doing this with great pleasure

But I think SetAllVector128 may need "expand them in codegen" because this intrinsic usually could be used in performance critical paths.

tannergooding · 2018-03-06T23:16:42Z

@fiigii, we should probably do whatever is simplest now, for the 2.1 preview of Intrinsics, and we can measure/tweak this later as needed.

CarolEidt · 2018-03-06T23:16:42Z

But I think SetAllVector128 may need "expand them in codegen" because this intrinsic usually could be used in performance critical paths.

I would have to be first convinced that we could not get the desired performance by expanding in managed code. I don't see an obvious reason why we would not.

4creators · 2018-03-06T23:17:45Z

But I think SetAllVector128 may need "expand them in codegen" because this intrinsic usually could be used in performance critical paths.

Let me check how it works if we expand it in managed code first.

fiigii · 2018-03-06T23:46:44Z

we should probably do whatever is simplest now, for the 2.1 preview of Intrinsics,

I would have to be first convinced that we could not get the desired performance by expanding in managed code.

Agree.

fiigii · 2018-03-06T23:49:19Z

Let me check how it works if we expand it in managed code first.

@4creators Thanks, could you also check the generated code on Linux if possible? I remember that RyuJIT/Linux has a little bit inefficient codgen problems for vector returning/passing functions even if with inlining.

4creators · 2018-03-07T00:25:40Z

could you also check the generated code on Linux if possible?

Yes will do

CarolEidt · 2018-03-07T00:27:22Z

RyuJIT/Linux has a little bit inefficient codgen problems for vector returning/passing functions even if with inlining.

I don't want us to make implementation decisions based on known deficiencies in codegen.

4creators · 2018-03-07T14:57:03Z

@tannergooding In the case #16797 will be accepted I would drop two SetAllVector128 commits and leave "Update implementation of SetAllVector128 SSE HW intrinsic" commit here.

tannergooding · 2018-03-07T15:16:11Z

@4creators, why not just port Sse.SetAllVector128 to managed code as well?

That would best follow the new guidance (and would help prevent other people from using it as a baseline for future implementation work).

4creators · 2018-03-07T15:33:19Z

@tannergooding Sure, I prefer managed implementation for helpers. Will submit commit to other PR soon.

tannergooding · 2018-03-10T04:05:48Z

@4creators, can this PR be closed (I assume it was replaced by #16797)?

4creators · 2018-03-10T18:04:38Z

Closing due to changed implementation in #16797

4creators commented Mar 5, 2018

View reviewed changes

4creators force-pushed the SetAllVector128 branch from a116be3 to 74b75de Compare March 5, 2018 17:42

4creators commented Mar 5, 2018

View reviewed changes

mikedn reviewed Mar 5, 2018

View reviewed changes

4creators force-pushed the SetAllVector128 branch from 74b75de to 89ccdfc Compare March 5, 2018 21:05

CarolEidt suggested changes Mar 5, 2018

View reviewed changes

fiigii reviewed Mar 5, 2018

View reviewed changes

fiigii mentioned this pull request Mar 5, 2018

Avoid gtCloneExpr in HW helper-intrinsics #16766

Closed

4creators force-pushed the SetAllVector128 branch 2 times, most recently from 3709d0c to e8bb858 Compare March 5, 2018 23:35

4creators commented Mar 5, 2018

View reviewed changes

fiigii reviewed Mar 6, 2018

View reviewed changes

4creators force-pushed the SetAllVector128 branch from e8bb858 to 93c316a Compare March 6, 2018 00:17

4creators closed this Mar 10, 2018

4creators deleted the SetAllVector128 branch August 4, 2018 10:18

4creators mentioned this pull request Jan 31, 2020

Improve SIMD register allocator for HW intrinsics, avoid unnecessary movaps between SIMD registers dotnet/runtime#9890

Closed

Conversation

4creators commented Mar 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tannergooding commented Mar 5, 2018

Uh oh!

AndyAyersMS commented Mar 5, 2018

Uh oh!

4creators commented Mar 5, 2018

Uh oh!

mikedn commented Mar 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tannergooding commented Mar 5, 2018

Uh oh!

mikedn commented Mar 5, 2018

Uh oh!

4creators commented Mar 5, 2018

Uh oh!

4creators commented Mar 5, 2018

Uh oh!

4creators commented Mar 5, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fiigii Mar 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

4creators Mar 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

4creators Mar 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fiigii commented Mar 5, 2018

Uh oh!

fiigii commented Mar 5, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

4creators Mar 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fiigii Mar 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

4creators commented Mar 5, 2018 •

edited

Loading

mikedn commented Mar 5, 2018 •

edited

Loading

fiigii Mar 5, 2018 •

edited

Loading

4creators Mar 5, 2018 •

edited

Loading

4creators Mar 5, 2018 •

edited

Loading

4creators Mar 5, 2018 •

edited

Loading

fiigii Mar 6, 2018 •

edited

Loading

mikedn Mar 6, 2018 •

edited

Loading

mikedn commented Mar 6, 2018 •

edited

Loading

mikedn commented Mar 6, 2018 •

edited

Loading

fiigii commented Mar 6, 2018 •

edited

Loading

mikedn commented Mar 6, 2018 •

edited

Loading