Adding partial support for the SSE41 hardware intrinsics. by tannergooding · Pull Request #16558 · dotnet/coreclr

tannergooding · 2018-02-25T17:30:14Z

This partially resolves https://github.com/dotnet/coreclr/issues/16458

Still to do:

Extract
Insert

tannergooding · 2018-02-25T17:30:30Z

FYI. @fiigii, @CarolEidt, @eerhardt, @RussKeldorph

fiigii · 2018-02-25T17:57:51Z

+            }
+
+            op1     = impSIMDPopStack(TYP_SIMD16);
+            retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, op2, intrinsic, baseType, simdSize);


For CeilingScalar/FloorScalar, I think we can duplicate the source-operand for one-arg overloads in the importer. Then we can set NumArg as 2 in the table, which may simplify the implementation.

I suggest we do that, in general, later and only after we have confirmed the codegen will be as good.

@CarolEidt, do you know if the JIT is smart enough to recognize that op2 is a clone of op1 and only needs to be read from memory once?

I suggest we do that, in general, later and only after we have confirmed the codegen will be as good.

That is fine with me. But numArgOfHWIntrinsic should be modified if we distinguish one-arg and two-arg overloads in the back-end.

op2 is a clone of op1 and only needs to be read from memory once

Not sure what "clone" means here. But the "only read from memory once" typically happens only if CSE kicks in.

Not sure what "clone" means here

@mikedn, gtCloneExpr(op1).

Currently, we only have op1 (op2 is nullptr) and it's all we carry through codegen. op1 will get loaded into a register (via movups) and we will ultimately emit INS op1Reg, op1Reg, imm8.

If we instead do what @fiigii is suggesting, my concern would be that the JIT isn't smart enough to always recognize that op1 and op2 represent the same underlying location. If it isn't smart enough, we might end up emitting two separate movups and INS op1Reg, op2Reg, imm8, or one movups and a folded load (INS [op1], op2Reg, imm8).

I don't think that using gtCloneExpr here is an option; as best I can tell, op1 is an arbitrary tree and you can't quite clone that. If it's a complex tree, the only reasonable way to "clone" it is to spill it to a local variable and make both op1 and op2 GT_LCL_VARs. And since they both use the same variable, they should end up using the same register.

But spilling the tree is not without drawbacks, so keeping op2 as nullptr seems somewhat better.

If you do end up cloning, impCloneExpr may be what you're looking for, it clones "cheap" trees and creates new temps for "expensive" trees.
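As a rough illustration of the "clone cheap trees, spill expensive ones to a temp" idea described above, here is a minimal standalone C++ sketch. It does not use the JIT's real types or the actual impCloneExpr signature; Expr, Spills, and cloneOrSpill are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy expression (hypothetical): either a plain local-variable use
// or an opaque "complex" tree that must not be evaluated twice.
struct Expr
{
    bool        isLocal; // true for a plain local-variable use
    std::string text;    // local name, or a printable form of the tree
};

// Hypothetical spill list: each entry models "temp = <tree>".
using Spills = std::vector<std::pair<std::string, std::string>>;

// Returns two uses of the same value. A cheap tree (a local) is simply
// duplicated; anything else is evaluated once into a new temp and both
// uses refer to that temp.
std::pair<Expr, Expr> cloneOrSpill(const Expr& tree, Spills& spills)
{
    if (tree.isLocal)
    {
        return {tree, tree};
    }
    std::string temp = "tmp" + std::to_string(spills.size());
    spills.emplace_back(temp, tree.text);
    Expr use{true, temp};
    return {use, use};
}

// Small self-check: a complex tree produces one spill and two temp uses.
void cloneOrSpillExample()
{
    Spills spills;
    auto   uses = cloneOrSpill(Expr{false, "load(addr)"}, spills);
    assert(spills.size() == 1);
    assert(uses.first.text == uses.second.text);
}
```

Because both uses name the same temp in the spilled case, a register allocator would be expected to keep them in one register, which is the property discussed above; the cost is the extra temp.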

@mikedn, @AndyAyersMS: Thanks for the tips.

I think I will keep it "as is" for now. The current implementation is producing good results and is trivial to handle in codegen.

I would agree that keeping as-is is best, at least until such time as we come across a good reason for doing otherwise.

fiigii · 2018-02-25T18:00:02Z

+        case NI_SSE41_DotProduct:
+        {
+            instruction ins = Compiler::insOfHWIntrinsic(intrinsicID, node->gtSIMDBaseType);
+            genHWIntrinsic_FullRangeImm8(node, ins);


I suggest waiting for #16183, or merging it soon...

I'm hoping that I can rebase this PR after #16183 goes in.

tannergooding · 2018-02-26T13:56:18Z

Most of the test failures are https://github.com/dotnet/coreclr/issues/16566. Will rerun after the fix is merged.

There was another job that failed due to the "paging file is too small" issue. It has been re-queued.

There was also an issue with MultipleSumAbsoluteDifferences in the x86 No AVX job. I am investigating.

CarolEidt

LGTM

CarolEidt · 2018-02-26T14:58:48Z

+            }
+
+            op1     = impSIMDPopStack(TYP_SIMD16);
+            retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, op2, intrinsic, baseType, simdSize);


I would agree that keeping as-is is best, at least until such time as we come across a good reason for doing otherwise.

tannergooding · 2018-02-26T15:08:30Z

@CarolEidt, Just as an FYI: I am going to rebase this PR onto #16183 after it is merged. The resulting change should be to the genHWIntrinsic_FullRangeImm8 call here: https://github.com/dotnet/coreclr/pull/16558/files#diff-847442c7efdf7a78e78711bea30fd4b2R1071

tannergooding · 2018-02-27T03:54:17Z

Rebased onto dotnet/master. Change is slightly smaller thanks to #16183.

tannergooding · 2018-02-27T15:03:14Z

@CarolEidt, @fiigii, the MultipleSumAbsoluteDifferences failure is actually a more general issue.

The failing case has targetReg != op1Reg but targetReg == op2Reg. On x86, when we inject the intermediate movaps targetReg, op1Reg, we end up overwriting the value in op2Reg before it is read.

I should have a fix up momentarily and am just going through our other code paths to try and ensure we don't have any similar scenarios elsewhere.
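To make the hazard concrete, here is a minimal sketch that models registers as array slots. RegFile, emitRmwAdd, and copyClobbersOp2 are hypothetical names, and the instruction is modeled as integer addition purely for illustration; this is not the JIT's emitter.

```cpp
#include <array>
#include <cassert>

// Toy register file (hypothetical): index = register number, value = contents.
using RegFile = std::array<int, 4>;

// Models a non-VEX read/modify/write emission:
//   movaps targetReg, op1Reg   (only when targetReg != op1Reg)
//   INS    targetReg, op2Reg   (modeled here as integer addition)
// The register file is taken by value so each call starts from a clean state.
int emitRmwAdd(RegFile regs, int targetReg, int op1Reg, int op2Reg)
{
    assert(targetReg >= 0 && targetReg < 4);
    assert(op1Reg >= 0 && op1Reg < 4 && op2Reg >= 0 && op2Reg < 4);
    if (targetReg != op1Reg)
    {
        regs[targetReg] = regs[op1Reg]; // intermediate copy
    }
    regs[targetReg] += regs[op2Reg];    // reads op2Reg after the copy
    return regs[targetReg];
}

// True when the intermediate copy would destroy op2 before it is read.
bool copyClobbersOp2(int targetReg, int op1Reg, int op2Reg)
{
    return (targetReg != op1Reg) && (targetReg == op2Reg);
}
```

With registers {10, 20, 0, 0}, op1 in register 0, and op2 in register 1, choosing register 2 as the target yields 30, while choosing register 1 (the op2 register) yields 20 because the copy overwrites op2 first.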

CarolEidt · 2018-02-27T15:13:08Z

Would it be useful to have a flag indicating whether an intrinsic has RMW semantics on op1 and therefore needs delay free on the second source?

tannergooding · 2018-02-27T15:19:36Z

Would it be useful to have a flag indicating whether an intrinsic has RMW semantics on op1 and therefore needs delay free on the second source?

@CarolEidt, probably long term, yes (although a flag saying it doesn't have RMW semantics is probably better, I believe most intrinsics default to having it).

The fix I am doing right now is to just blanket-mark the second operand of every intrinsic as delay free when UseVEXEncoding() == false, as well as placing asserts in the appropriate locations.

tannergooding · 2018-02-27T15:48:57Z

@CarolEidt, is there any chance that a temporary register (node->GetSingleTempReg()) is the same as one of the normally allocated registers?

Basically wondering if I need to worry about:

assert(baseType == TYP_DOUBLE);
op2Reg             = op2->gtRegNum;
instruction ins    = Compiler::insOfHWIntrinsic(intrinsicID, baseType);
regNumber   tmpReg = node->GetSingleTempReg();

emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD16), op1Reg, op2Reg);
emit->emitIns_R(INS_setpe, EA_1BYTE, targetReg);
emit->emitIns_R(INS_setne, EA_1BYTE, tmpReg);
emit->emitIns_R_R(INS_or, EA_1BYTE, tmpReg, targetReg);
emit->emitIns_R(INS_setne, EA_1BYTE, targetReg);
emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, targetReg);

mikedn · 2018-02-27T17:24:51Z

is there any chance that a temporary register (node->GetSingleTempReg()) is the same as one of the normally allocated registers?

That's why TreeNodeInfo::isInternalRegDelayFree exists.

tannergooding · 2018-02-27T17:30:15Z

@mikedn, thanks. I'll get a separate PR up that sets that (since it isn't actually impacting CI right now and can be handled separately).

mikedn · 2018-02-27T17:32:56Z

Basically wondering if I need to worry about:

I'm not sure if that particular implementation of FP compare check is the best. The JIT uses a different approach for floating point relops, something that looks like:

       0F9BC0               setpo    al
       7A03                 jpe      SHORT G_M55886_IG03
       0F94C0               sete     al
G_M55886_IG03:
       0FB6C0               movzx    rax, al

It does use a branch, but it's likely to be a rarely taken branch because it is only taken for NaNs. At the same time, the code is shorter and uses a single register. Hmm, I need to take a closer look at intrinsics that use flags; they're likely to have problems, especially when used with conditional branches.

tannergooding · 2018-02-27T17:50:11Z

I'm not sure if that particular implementation of FP compare check is the best.

@mikedn, it could probably do with some improvement (and needs to handle the jcc/movcc folding support eventually).

I implemented it this way because Clang and MSVC do it this way for the "trivial" example: https://godbolt.org/g/9xS7zh.

I would assume that the other code would generally be used only if register pressure was a concern. I also don't think it is shorter in all cases; depending on which register is used for the and operation, it should be either the same length or only 1 byte longer.

0F9BC0               setnp  al
0F94C1               sete   cl
20C1                 and    cl,al
0FB6C1               movzx  eax,cl

mikedn · 2018-02-27T18:01:04Z

I implemented it this way because Clang and MSVC do it this way for the "trivial" example:

That one looks better. The JIT code you showed above seems to generate an extra setcc. And it's interesting that gcc does it quite differently...

tannergooding · 2018-02-27T18:16:36Z

That one looks better. The JIT code you showed above seems to generate an extra setcc.

You're right, seems the JIT code should be:

  emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD16), op1Reg, op2Reg);
  emit->emitIns_R(INS_setpo, EA_1BYTE, targetReg);
  emit->emitIns_R(INS_sete, EA_1BYTE, tmpReg);
  emit->emitIns_R_R(INS_and, EA_1BYTE, tmpReg, targetReg);
- emit->emitIns_R(INS_setne, EA_1BYTE, targetReg);
- emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, targetReg);
+ emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, tmpReg);

And it's interesting that gcc does it quite differently...

Yeah, it is doing an unordered compare and just checking ZF, which I'm not sure is "technically" correct (basically the only difference between ucomiss and comiss is whether an exception is raised for QNaN).

Based on the documentation, I would expect that both PF and ZF need to be checked; otherwise it would treat two NaNs as equal:

Compares the single-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

RESULT <- OrderedCompare(DEST[31:0] <> SRC[31:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
    UNORDERED: ZF,PF,CF <- 111;
    GREATER_THAN: ZF,PF,CF <- 000;
    LESS_THAN: ZF,PF,CF <- 001;
    EQUAL: ZF,PF,CF <- 100;
ESAC;
OF, AF, SF <- 0; }
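The flag table above can be turned into a small standalone check of the reasoning. This is only a sketch: CmpFlags, compareFlags, equalZfOnly, and equalOrdered are hypothetical names, and the function models the documented flag results rather than executing comiss or ucomiss.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical model of the three flags written by the scalar compare.
struct CmpFlags
{
    bool zf;
    bool pf;
    bool cf;
};

const float kNaN = std::numeric_limits<float>::quiet_NaN();

// Follows the table: UNORDERED 111, GREATER_THAN 000, LESS_THAN 001, EQUAL 100.
CmpFlags compareFlags(float a, float b)
{
    if (std::isnan(a) || std::isnan(b))
    {
        return {true, true, true};   // UNORDERED
    }
    if (a > b)
    {
        return {false, false, false}; // GREATER_THAN
    }
    if (a < b)
    {
        return {false, false, true};  // LESS_THAN
    }
    return {true, false, false};      // EQUAL
}

// Checking ZF alone cannot tell EQUAL (100) from UNORDERED (111).
bool equalZfOnly(CmpFlags f)
{
    return f.zf;
}

// Checking ZF and PF together accepts only the EQUAL row.
bool equalOrdered(CmpFlags f)
{
    return f.zf && !f.pf;
}

// Small self-check of the NaN case discussed above.
void nanCompareExample()
{
    assert(equalZfOnly(compareFlags(kNaN, kNaN)));
    assert(!equalOrdered(compareFlags(kNaN, kNaN)));
}
```

Under this model, a ZF-only test reports two NaNs as equal, while the ZF-and-not-PF test (what the setnp/sete/and sequence computes) does not.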

tannergooding · 2018-02-27T18:47:47Z

Test failures are caused by #16571. CC @jkotas

CarolEidt · 2018-02-27T22:38:11Z

is there any chance that a temporary register (node->GetSingleTempReg()) is the same as one of the normally allocated registers?

That's why TreeNodeInfo::isInternalRegDelayFree exists.

It's only needed if the tempReg needs to be different from the target. The internal registers never conflict with incoming sources.

tannergooding · 2018-02-27T22:40:05Z

It's only needed if the tempReg needs to be different from the target. The internal registers never conflict with incoming sources.

In this particular case the temp register does need to be different from the target, so I will need to update. I will have that PR up later tonight.
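A small simulation of the corrected setpo/sete/and/movzx sequence from earlier in the thread shows why the temp has to differ from the target. This is a sketch under assumptions: CompareFlags and orderedEqualSequence are hypothetical names, and registers are modeled as byte slots in an array.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of the two flags the sequence reads.
struct CompareFlags
{
    bool zf;
    bool pf;
};

// Simulates:
//   setpo targetReg
//   sete  tmpReg
//   and   tmpReg, targetReg
//   movzx targetReg, tmpReg
// and returns the final value of targetReg.
int orderedEqualSequence(CompareFlags flags, int targetReg, int tmpReg)
{
    assert(targetReg >= 0 && targetReg < 4 && tmpReg >= 0 && tmpReg < 4);
    std::array<std::uint8_t, 4> regs{};
    regs[targetReg] = flags.pf ? 0 : 1;    // setpo: 1 when PF is clear
    regs[tmpReg]    = flags.zf ? 1 : 0;    // sete: 1 when ZF is set
    regs[tmpReg]   &= regs[targetReg];     // and tmpReg, targetReg
    regs[targetReg] = regs[tmpReg];        // movzx targetReg, tmpReg
    return regs[targetReg];
}
```

For the unordered result (ZF = 1, PF = 1), distinct registers give 0 as expected, but if tmpReg and targetReg are the same register the sete overwrites the setpo result and the sequence returns 1. Keeping the internal register distinct from the target is what avoids that.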

tannergooding · 2018-02-27T23:01:45Z

@CarolEidt, do the new changes look good to you as well? Just want to confirm so I can get this merged when the tests finish here shortly.

tannergooding · 2018-02-28T00:53:24Z

Updated to have an actual flag (HW_Flag_NoRMWSemantics) and ensured it was set on the various intrinsics, as appropriate.

tannergooding · 2018-02-28T06:02:34Z

Had a wrong check against op2 != nullptr instead of info->srcCount >= 2.

tannergooding · 2018-02-28T06:02:58Z

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

test OSX10.12 x64 Checked jitincompletehwintrinsic
test OSX10.12 x64 Checked jitx86hwintrinsicnoavx
test OSX10.12 x64 Checked jitx86hwintrinsicnoavx2
test OSX10.12 x64 Checked jitx86hwintrinsicnosimd
test OSX10.12 x64 Checked jitnox86hwintrinsic

CarolEidt

LGTM, but in future I'd like us to (re)consider the usability/understandability of the negative flags in namedintrinsiclist.h

CarolEidt · 2018-02-28T11:29:52Z

+
+    // No Read/Modify/Write Semantics
+    // the intrinsic does not have read/modify/write semantics and doesn't need
+    HW_Flag_NoRMWSemantics = 0x4000,


I am not a fan of all these negative flags, as it is always confusing when you are checking for the positive case by checking the negation of the negative case. But I see there's a lot of precedent for that in these flags already.

I'll log an issue tracking an investigation towards fixing this.
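To illustrate the readability concern, here is a minimal sketch of how a negative flag might be wrapped in a positively named helper. Only the HW_Flag_NoRMWSemantics name and its value 0x4000 come from the diff above; HW_Flag_NoFlag, hasRMWSemantics, and op2NeedsDelayFree are assumptions made for this example and are not taken from the JIT sources.

```cpp
#include <cassert>
#include <cstdint>

enum HWIntrinsicFlag : std::uint32_t
{
    HW_Flag_NoFlag         = 0,      // assumed placeholder for "no flags set"
    HW_Flag_NoRMWSemantics = 0x4000, // name and value from the diff above
};

// Hypothetical helper: asks the positive question, so callers do not
// have to negate a negative flag at every use site.
constexpr bool hasRMWSemantics(std::uint32_t flags)
{
    return (flags & HW_Flag_NoRMWSemantics) == 0;
}

// Hypothetical helper combining the conditions mentioned in the thread:
// non-VEX encoding, RMW semantics on op1, and at least two sources.
constexpr bool op2NeedsDelayFree(std::uint32_t flags, bool useVexEncoding, int srcCount)
{
    return !useVexEncoding && hasRMWSemantics(flags) && (srcCount >= 2);
}

static_assert(hasRMWSemantics(HW_Flag_NoFlag), "RMW is the default");
static_assert(!hasRMWSemantics(HW_Flag_NoRMWSemantics), "flag opts out of RMW");

// Small self-check of the delay-free decision.
void delayFreeExample()
{
    assert(op2NeedsDelayFree(HW_Flag_NoFlag, false, 2));
    assert(!op2NeedsDelayFree(HW_Flag_NoRMWSemantics, false, 2));
}
```

The flag itself stays negative, so the table default (no flag set) still means "has RMW semantics", but the call sites read as a positive check.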

fiigii reviewed Feb 25, 2018

CarolEidt approved these changes Feb 26, 2018

tannergooding mentioned this pull request Feb 27, 2018

Delete left-over globalization CoreCLR tests #16571

Merged

tannergooding added 2 commits February 27, 2018 16:14

Adding partial support for the SSE41 hardware intrinsics

f85b8a1

Adding tests for the implemented SSE41 hardware intrinsics

315f8e9

This was referenced Feb 28, 2018

Set isInternalRegDelayFree for several of the x86 hwintrinsics #16649

Merged

Updating the CompareEqual{Ordered|Unordered}Scalar intrinsics to have slightly better codegen #16651

Merged

Adding some asserts that we won't overwrite one of the hwintrinsic operand registers.

ed361f0

CarolEidt approved these changes Feb 28, 2018

tannergooding merged commit ed43fdd into dotnet:master Feb 28, 2018

tannergooding deleted the hardwareintrin-sse41 branch May 30, 2018 04:14
