Implement SetAllVector128 SSE2 hardware intrinsic #16758
    break;
}

case NI_SSE2_SetAllVector128:
My plan is to do the final refactoring of case NI_SSE2_SetAllVector128 after the codegen problems are solved.
a116be3 to 74b75de
src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
                               simdSize);
GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow,
                                        TYP_SHORT, simdSize);

GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), NI_SSE2_UnpackLow,
                                        TYP_BYTE, simdSize);
GenTree* shu = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtCloneExpr(tmp), NI_SSE2_UnpackLow,
                                        TYP_SHORT, simdSize);
Second set of tree duplications
I think it was @mikedn who mentioned using I would guess you want to create a local for
Yes, that's the idea: just use a single register for all operations.
It's not obvious to me why you need to expand "set all" in the importer. Why can't
// Ensure we aren't overwriting targetReg
assert(tmpReg != targetReg);

emit->emitIns_R_R(INS_movapd, emitTypeSize(TYP_SIMD16), tmpReg, op1Reg);
Use ps instructions when they are equivalent to pd instructions. They have a shorter encoding when VEX is not available.
Applies to the xorpd below as well.
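To illustrate why the swap is safe, here is a minimal C++ sketch using the standard SSE2 intrinsics from `<emmintrin.h>`. The function names are made up for this example and none of this is JIT code. It only shows that the ps and pd forms of a bitwise operation produce identical bits, so the shorter non-VEX encoding can be chosen freely. A C++ compiler may itself pick either encoding for these intrinsics.

```cpp
#include <emmintrin.h> // SSE2 intrinsics (also pulls in the SSE ps forms)
#include <cassert>
#include <cstring>

// XOR using the pd form (xorpd without VEX).
inline __m128d xor_via_pd(__m128d a, __m128d b)
{
    return _mm_xor_pd(a, b);
}

// Same XOR using the ps form (xorps); the casts only reinterpret bits.
inline __m128d xor_via_ps(__m128d a, __m128d b)
{
    return _mm_castps_pd(_mm_xor_ps(_mm_castpd_ps(a), _mm_castpd_ps(b)));
}

// Bitwise comparison, so that -0.0 and NaN payloads are compared exactly.
inline bool same_bits(__m128d x, __m128d y)
{
    return std::memcmp(&x, &y, sizeof(x)) == 0;
}

// Usage: both forms must agree bit for bit.
inline void check_ps_pd_equivalence()
{
    __m128d a = _mm_set_pd(1.5, -2.25);
    __m128d b = _mm_set_pd(-0.0, 3.0);
    assert(same_bits(xor_via_pd(a, b), xor_via_ps(a, b)));
}
```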
We've not been doing this for helper intrinsics so far. @CarolEidt, @fiigii, does this seem like something we can/should do for the helper intrinsics specifically?
For the regular intrinsics, we have a contract mapping it to a given instruction (but not encoding). However, for the helper intrinsics, we have a contract of behavior instead, so it might make sense for these to produce the "most optimal" codegen.
You really should print that contract on a piece of paper and burn it.
The general idea is to give more flexibility for these helper intrinsics. If we handle it in the importer by transforming it to the "equivalent" set of intrinsics nodes that we would actually generate, it is easier for containment or other optimizations to be handled appropriately. It also makes it significantly easier to handle things like
The idea of "general idea" should be taken with a grain of salt in the JIT world. It would be great if the JIT IR could handle all the intricacies of assembly instructions but for better or worse that isn't the case (likely for better since it would be quite costly otherwise). As such, choosing how to represent something in the IR is based on a "best idea" and not on a "general idea". The current approach is not exactly best due to the need of temporaries so let's say it has -1 points. How many points does the codegen approach have?
That's such an extreme case that it should be analyzed separately. I suspect that in that case you have no choice but to expand in the importer (or maybe in lowering), otherwise the RA will probably have a hard time dealing with 16 args that may all need to be in GPRs...
For word sized types codegen is almost as desired, except for unnecessary register spill to the stack and later usage in

8B8D38FCFFFF         mov        ecx, dword ptr [rbp-3C8H]
0FB7C9               movzx      rcx, cx
C4E1796EC1           vmovd      xmm0, xrcx
C4E179298560F8FFFF   vmovapd    xmmword ptr [rbp-7A0H], xmm0
C4E179288560F8FFFF   vmovapd    xmm0, xmmword ptr [rbp-7A0H]
C4E179618560F8FFFF   vpunpcklwd xmm0, xmm0, xmmword ptr [rbp-7A0H]
C4E17970C000         vpshufd    xmm0, xmm0, 0

8B8D38FCFFFF         mov        ecx, dword ptr [rbp-3C8H]
0FB7C9               movzx      rcx, cx
660F6EC1             movd       xmm0, xrcx
0F298560F8FFFF       movaps     xmmword ptr [rbp-7A0H], xmm0
0F288560F8FFFF       movaps     xmm0, xmmword ptr [rbp-7A0H]
0F288D60F8FFFF       movaps     xmm1, xmmword ptr [rbp-7A0H]
660F61C1             punpcklwd  xmm0, xmm1
660F70C000           pshufd     xmm0, xmm0, 0
The importer code for the above is:

case TYP_SHORT:
case TYP_USHORT:
{
    GenTree* srcClone = nullptr;
    // Initialize XMM register with initial value
    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
                                   simdSize);
    // src is used twice - clone it
    CORINFO_CLASS_HANDLE vectorHandle = (baseType == TYP_SHORT) ? Vector128ShortHandle : Vector128UShortHandle;
    src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                       nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));
    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
                                            TYP_SHORT, simdSize);
    retNode =
        gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
    break;
}
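For comparison, this is the same three-step sequence written by hand with C++ SSE2 intrinsics. It is a sketch, not the JIT's output: `set_all_i16` and `all_lanes_equal_i16` are names invented here, and the comments map each intrinsic to the instruction it usually compiles to.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Broadcast a 16-bit value to all eight lanes using only SSE2.
inline __m128i set_all_i16(int16_t value)
{
    // movd: the low dword receives the zero-extended 16-bit value.
    __m128i v = _mm_cvtsi32_si128(static_cast<uint16_t>(value));
    // punpcklwd v, v: words 0 and 1 now both hold the value.
    v = _mm_unpacklo_epi16(v, v);
    // pshufd v, v, 0: replicate dword 0 into all four dwords.
    return _mm_shuffle_epi32(v, 0);
}

// Store the vector and compare every 16-bit lane with the expected value.
inline bool all_lanes_equal_i16(__m128i v, int16_t expected)
{
    int16_t lanes[8];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v);
    for (int16_t lane : lanes)
    {
        if (lane != expected)
        {
            return false;
        }
    }
    return true;
}

// Usage: a negative value exercises the zero-extension path.
inline void check_set_all_i16()
{
    assert(all_lanes_equal_i16(set_all_i16(-1234), -1234));
}
```

Both unpack operands are the same variable, so nothing in the sequence itself requires a second XMM register or a stack slot.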
For byte sized operands codegen is good as well except for repetitive register spills:

8B8D18FCFFFF         mov        ecx, dword ptr [rbp-3E8H]
480FBEC9             movsx      rcx, cl
C4E1796EC1           vmovd      xmm0, xrcx
C4E179298540F8FFFF   vmovapd    xmmword ptr [rbp-7C0H], xmm0
C4E179288540F8FFFF   vmovapd    xmm0, xmmword ptr [rbp-7C0H]
C4E179608540F8FFFF   vpunpcklbw xmm0, xmm0, xmmword ptr [rbp-7C0H]
C4E179298530F8FFFF   vmovapd    xmmword ptr [rbp-7D0H], xmm0
C4E179288530F8FFFF   vmovapd    xmm0, xmmword ptr [rbp-7D0H]
C4E179618530F8FFFF   vpunpcklwd xmm0, xmm0, xmmword ptr [rbp-7D0H]
C4E17970C000         vpshufd    xmm0, xmm0, 0

for legacy SSE encoding unnecessary additional XMM register is used:

480FBECE             movsx      rcx, sil
660F6EC1             movd       xmm0, xrcx
0F28C8               movaps     xmm1, xmm0
660F60C8             punpcklbw  xmm1, xmm0
0F28C1               movaps     xmm0, xmm1
660F61C1             punpcklwd  xmm0, xmm1
660F70C000           pshufd     xmm0, xmm0, 0

C++ code:

case TYP_BYTE:
case TYP_UBYTE:
{
    GenTree* srcClone    = nullptr;
    GenTree* tmpCloneOne = nullptr;
    // Initialize XMM register with initial value
    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_ConvertScalarToVector128Int32, TYP_INT,
                                   simdSize);
    // src is used twice - clone it
    CORINFO_CLASS_HANDLE vectorHandle = (baseType == TYP_BYTE) ? Vector128ByteHandle : Vector128UByteHandle;
    src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                       nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));
    GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
                                            TYP_BYTE, simdSize);
    // tmp is used three times so we clone it as well
    tmp = impCloneExpr(tmp, &tmpCloneOne, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                       nullptr DEBUGARG("SetAllVector - duplicate tmp tree with initialized XMM register"));
    GenTree* shu = gtNewSimdHWIntrinsicNode(TYP_SIMD16, tmp, tmpCloneOne, NI_SSE2_UnpackLow,
                                            TYP_SHORT, simdSize);
    retNode =
        gtNewSimdHWIntrinsicNode(TYP_SIMD16, shu, gtNewIconNode(0), NI_SSE2_Shuffle, TYP_INT, simdSize);
    break;
}
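The byte case adds one more unpack step. A hand-written C++ SSE2 equivalent is sketched below for reference; `set_all_i8` and `all_lanes_equal_i8` are hypothetical names, and the register assignment a real compiler picks may differ from the comments.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Broadcast an 8-bit value to all sixteen lanes using only SSE2.
inline __m128i set_all_i8(int8_t value)
{
    // movd: the low dword receives the zero-extended byte.
    __m128i v = _mm_cvtsi32_si128(static_cast<uint8_t>(value));
    // punpcklbw v, v: bytes 0..1 hold the value.
    v = _mm_unpacklo_epi8(v, v);
    // punpcklwd v, v: bytes 0..3 hold the value.
    v = _mm_unpacklo_epi16(v, v);
    // pshufd v, v, 0: replicate dword 0 into all four dwords.
    return _mm_shuffle_epi32(v, 0);
}

// Store the vector and compare every byte lane with the expected value.
inline bool all_lanes_equal_i8(__m128i v, int8_t expected)
{
    int8_t lanes[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v);
    for (int8_t lane : lanes)
    {
        if (lane != expected)
        {
            return false;
        }
    }
    return true;
}

// Usage: check a negative and a positive value.
inline void check_set_all_i8()
{
    assert(all_lanes_equal_i8(set_all_i8(-7), -7));
    assert(all_lanes_equal_i8(set_all_i8(100), 100));
}
```

Every step reads and writes the same variable, which is the shape of code the importer expansion is trying to get out of the register allocator.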
74b75de to 89ccdfc
case TYP_DOUBLE:
{
    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_SetScalarVector128, baseType, simdSize);
    retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtCloneExpr(src), gtNewIconNode(0),
You shouldn't be arbitrarily cloning trees.
I think the right thing to use is Compiler::fgMakeMultiUse, which will clone it if it is a local, and otherwise create a temp and modify the existing tree to add the assignment. Either way you get back a GT_LCL_VAR node to use for the next instance. You can see it in use here: https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L15008
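The "clone a local, otherwise spill to a temp" behaviour described above can be sketched on a toy IR. Everything in this block is hypothetical (`Node`, `Builder`, `makeMultiUse`) and is not the RyuJIT `GenTree` API. In particular, the real function rewrites the tree into comma form, whereas this sketch simply records the temp assignment in a side list.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature IR: a node is either a local variable or an operation.
struct Node
{
    enum Kind { Local, Op } kind;
    std::string name; // local name or operator name
    std::vector<std::shared_ptr<Node>> operands;
};
using NodePtr = std::shared_ptr<Node>;

struct Builder
{
    int nextTemp = 0;
    // Recorded "temp = tree" assignments, in evaluation order.
    std::vector<std::pair<std::string, NodePtr>> assignments;

    NodePtr local(const std::string& n)
    {
        return std::make_shared<Node>(Node{Node::Local, n, {}});
    }

    // Returns a second use of *tree. A local is simply duplicated. Anything
    // else is evaluated once into a fresh temp, and *tree is rewritten to
    // read that temp, so the value is never recomputed.
    NodePtr makeMultiUse(NodePtr* tree)
    {
        if ((*tree)->kind != Node::Local)
        {
            std::string temp = "tmp" + std::to_string(nextTemp++);
            assignments.emplace_back(temp, *tree);
            *tree = local(temp);
        }
        return local((*tree)->name);
    }
};

// Usage: an operation node gets spilled once, and both uses read the temp.
inline void check_make_multi_use()
{
    Builder b;
    NodePtr op     = std::make_shared<Node>(Node{Node::Op, "UnpackLow", {b.local("x")}});
    NodePtr second = b.makeMultiUse(&op);
    assert(b.assignments.size() == 1);
    assert(op->kind == Node::Local && second->name == op->name);
}
```

Either way the caller ends up with two nodes that name the same local, which is what lets both operands of the unpack share one value.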
Perhaps, we can change Shuffle intrinsics to also support 1-arg form internally to solve this problem.
Perhaps, we can change Shuffle intrinsics to also support 1-arg form internally to solve this problem.
This is not necessary since you can use a TYP_INT Shuffle as follows to get an identical result:
retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtNewIconNode(0b01000100), NI_SSE2_Shuffle,
                                   TYP_INT, simdSize);
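The control byte works because 0b01000100 (0x44) selects dwords {0, 1, 0, 1}, which copies the low 64 bits into the high 64 bits. A small C++ SSE2 sketch of the same trick follows; `set_all_f64` and `both_lanes_equal` are illustrative names, and the use of `_mm_set_sd` to build the scalar vector is an assumption about what the managed helper corresponds to.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>

// Broadcast a double to both lanes with a single one-operand integer shuffle.
inline __m128d set_all_f64(double value)
{
    // Low lane = value, high lane = 0.0.
    __m128i bits = _mm_castpd_si128(_mm_set_sd(value));
    // pshufd with 0x44 (0b01000100): result dwords = source dwords {0, 1, 0, 1}.
    bits = _mm_shuffle_epi32(bits, 0x44);
    return _mm_castsi128_pd(bits);
}

// Store the vector and compare both lanes with the expected value.
inline bool both_lanes_equal(__m128d v, double expected)
{
    double lanes[2];
    _mm_storeu_pd(lanes, v);
    return lanes[0] == expected && lanes[1] == expected;
}

// Usage: the source vector is read only once, so no clone is needed.
inline void check_set_all_f64()
{
    assert(both_lanes_equal(set_all_f64(2.5), 2.5));
}
```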
@CarolEidt Do you think that using impCloneExpr for the TYP_SHORT and TYP_BYTE cases, as suggested by @AndyAyersMS, is OK, or should I switch to Compiler::fgMakeMultiUse?
They are slightly different, and I think Compiler::fgMakeMultiUse is the better choice. For one, it has a cleaner interface, because you don't need to pass it a GenTree**. The other difference is that impCloneExpr will clone any tree that doesn't have side-effects, which is probably not what you want for SIMD, because it will recompute the value, e.g. if it was a SIMD operation.
If it's really better to go via fgMakeMultiUse it would be good to understand why; like I said the rest of the importer uses impCloneExpr pretty heavily.
impCloneExpr is used pervasively in the importer for these kinds of decisions. It calls gtClone with default last param false. So it won't clone anything complex.
@AndyAyersMS - impCloneExpr actually passes true to gtClone, but I see that it will only clone simple trees. So that may be the way to go.
Do you have any hints on how to avoid the use of a second XMM register and get rid of the movaps xmm1, xmm0 operations?
I would have to look at a jitdump to see what's going on.
These are dumps comprising only short, ushort, byte and sbyte tests for AVX and SSE codegen:
Ooops, yeah, it passes true, so it will clone add/sub/addr etc.
If there is something deficient about it it would be good to know what it is since it is heavily used. On the surface I don't see anything obvious.
@mikedn Another concern is that most of the helpers have different codegen solutions on different hardware. For example
@4creators Just focus on SSE encoding. This implementation will never be used on AVX machines, I will optimize
src = impCloneExpr(src, &srcClone, vectorHandle, (unsigned)CHECK_SPILL_ALL,
                   nullptr DEBUGARG("SetAllVector - duplicate src tree with initialized XMM register"));

GenTree* tmp = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, srcClone, NI_SSE2_UnpackLow,
We can change NI_SSE2_UnpackLow's numArg field to -1 to let UnpackLow accept 1-arg (internally), which avoids the impCloneExpr here.
BTW, the backend supports 1-arg form codegen for punpck*.
3709d0c to e8bb858
  // and transform the graph accordingly.
  GenTree* fgInsertCommaFormTemp(GenTree** ppTree, CORINFO_CLASS_HANDLE structType = nullptr);
- GenTree* fgMakeMultiUse(GenTree** ppTree);
+ GenTree* fgMakeMultiUse(GenTree** ppTree, CORINFO_CLASS_HANDLE structType = nullptr);
It was necessary to change Compiler::fgMakeMultiUse to take a CORINFO_CLASS_HANDLE structType = nullptr default parameter, because nodes that are structures require structType. Otherwise, my code was hitting this assertion:
Assert failure(PID 18536 [0x00004868], Thread: 8836 [0x2284]): Assertion failed 'structType != nullptr' in 'IntelHardwareIntrinsicTest.Program:Main(ref):int' (IL size 1093)
File: e:\src\ms\dotnet\coreclr\src\jit\morph.cpp Line: 2725
  else
  {
-     GenTree* result = fgInsertCommaFormTemp(pOp);
+     GenTree* result = fgInsertCommaFormTemp(pOp, structType);
The problem arose here because the fgInsertCommaFormTemp logic requires structType when pOp is a structure.
case TYP_DOUBLE:
{
    src = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, NI_SSE2_SetScalarVector128, baseType, simdSize);
    retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, src, gtNewIconNode(0b01000100), NI_SSE2_Shuffle,
In the future, we may need a phase to eliminate "floating conversion" intrinsics (like compiler-generated or user-written NI_SSE2_SetScalarVector128).
In most cases, these intrinsics just make the type system happy (match Vector128<float/double> with float/double), but the codegen is unnecessary.
Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here? Or make a new IR to match the types?
The current CQ of NI_SSE/SSE2_SetScalarVector128 is not good due to the "zero upper" semantics.
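For context on the "zero upper" semantics: assuming the managed helper mirrors the C++ `_mm_set_sd` intrinsic, building a vector from a scalar must also clear the upper lane, which costs extra work compared with reusing the register as is. The check below is a sketch with an invented function name.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>

// Returns true if building a vector from a scalar left the upper lane zeroed.
inline bool upper_lane_is_zero(double value)
{
    __m128d v = _mm_set_sd(value); // low lane = value, upper lane forced to 0.0
    double lanes[2];
    _mm_storeu_pd(lanes, v);
    return lanes[0] == value && lanes[1] == 0.0;
}

// Usage: the upper lane is zero regardless of the scalar passed in.
inline void check_zero_upper()
{
    assert(upper_lane_is_zero(3.5));
}
```

For a broadcast the zeroing is wasted effort, since the shuffle overwrites the upper lane anyway.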
Actually, since CPU instructions are type agnostic we should have a type free but only operand size constrained internal codegen for intrinsics.
In the future, we may need a phase to eliminate "floating conversion" intrinsics (like compiler-generated or user-written NI_SSE2_SetScalarVector128).
A phase to do that is probably more expensive than it's worth. Also, it's not clear to what extent you can rely on the upper 64/96 bits of a float/double value to be 0 so in many cases it probably won't work.
Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here?
Probably, HWINTRINSIC nodes anyway accept and produce a variety of types. Most of the JIT doesn't pay attention to these nodes. It may also be possible to use a BITCAST node but I don't think it was ever used so early in the JIT so you may encounter issues.
Can we directly change the return type of op1 to TYP_SIMD16 to avoid NI_SSE2_SetScalarVector128 here?
I don't think that is possible without introducing non-deterministic behavior.
e8bb858 to 93c316a
test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitincompletehwintrinsic
@4creators, could you update
At the cost of complicating the importer implementation. This kind of complexity can't go away, it can only be moved around. At least in codegen it wouldn't require temporary variables.
But I believe it would also require additional complexity in order to ensure that the "ideal" codegen was generated. It seems like a much simpler (and possibly better) approach, to just import it as the chain of intrinsics it would have been written as, had it been implemented in managed code. This allows the JIT to do all the appropriate transformations, folding, register allocations, etc later down the line and with the full context of the code we are actually generating. I would like to hear what @CarolEidt or @AndyAyersMS thinks about the two approaches (transform in importer vs custom codegen later on).
That should not be a problem due to size of code and on top of that we can use attribute
BTW, I think the managed implementation can be a very good non-const fallback of
Well, one more reason to do this in codegen. If you do it in codegen you can directly emit a shuffle instruction and be done with it.
See my above comment about
What I suggest has nothing to do with lowering. The codegen part would be similar to the importer thing. The only additional complexity is that you need to request an internal register in certain cases from lsra. How hard can it be? Might very well turn out to be simpler than the multi use acrobatics.
Sounds reasonable to me.
I'm not sure if we're zeroing in on a consensus here, but @tannergooding's suggestion to implement some of these in managed code sounds quite reasonable, and @mikedn's comment is right on target that we don't want to forbid expansion in the importer, but rather that we want to ensure that complexity isn't spread throughout the JIT, and that we don't introduce non-determinism in the implementation.
I think the
I looked a bit around for
Everything except for the "helper" intrinsics are contracted to emit a particular instruction. If that instruction has no valid encoding on x86, it is supposed to throw an exception.
Currently, for this situation, users are responsible to check the underlying hardware via
This is a special case, we should provide 32-bit platform codegen for
@mikedn is right that HW intrinsic implementation is vulnerable to x86 long handling and some helpers are a bit problematic and are deserving deeper analysis for all intrinsics. I would propose to move discussion to separate issue to avoid having multiple topics for discussion here.
I think that the right answer is that 32-bit fallback is not supported.
And it looks to me that even that internal register is not needed. It only appears to be needed because the code generated by the current approach uses an additional register for no good reason:

vmovd      xmm0, ecx
vpunpcklbw xmm1, xmm0, xmm0
vpunpcklwd xmm0, xmm1, xmm1
vpshufd    xmm0, xmm0, 0

is just

vmovd      xmm0, ecx
vpunpcklbw xmm0, xmm0, xmm0
vpunpcklwd xmm0, xmm0, xmm0
vpshufd    xmm0, xmm0, 0

Talk about optimal code generation and additional complexity...
OK, here's my view of guidelines here and going forward: First, all intrinsics must have deterministic behavior. For "user helper" intrinsics:
We may find that we also want "JIT helper" intrinsics, which are not exposed in the API (the SIMD support has some of these). Those are generally used only for cases where it is useful to abstract some capability out in the importer.
Well I am doing this with great pleasure 😄
But I think
@fiigii, we should probably do whatever is simplest now, for the 2.1 preview of Intrinsics, and we can measure/tweak this later as needed.
I would have to be first convinced that we could not get the desired performance by expanding in managed code. I don't see an obvious reason why we would not.
Let me check how it works if we expand it in managed code first.
Agree.
@4creators Thanks, could you also check the generated code on Linux if possible? I remember that RyuJIT/Linux has somewhat inefficient codegen for functions that return or take vectors, even with inlining.
Yes will do
I don't want us to make implementation decisions based on known deficiencies in codegen.
@tannergooding If #16797 is accepted I would drop the two SetAllVector128 commits and leave the "Update implementation of SetAllVector128 SSE HW intrinsic" commit here.
@4creators, why not just port That would best follow the new guidance (and would help prevent other people from using it as a baseline for future implementation work).
@tannergooding Sure, I prefer a managed implementation for helpers. Will submit a commit to the other PR soon.
@4creators, can this PR be closed (I assume it was replaced by #16797)?
Closing due to changed implementation in #16797
FYI @CarolEidt @fiigii @tannergooding @mikedn
This PR depends on #16736 and contains the implementation of Sse2.SetScalarVector128 from that PR.
Here I am facing a couple of problems with codegen which are difficult to resolve without some external advice. In particular, I do not know how to avoid duplication of generated code (subtree) while maintaining use of the same data (preferably the same register):
Use of gtCloneExpr(src) leads to initialization of a separate XMM register, while I need to use the same register initialized earlier for both operands.