Adding partial support for the SSE41 hardware intrinsics.#16558
Conversation
|
FYI. @fiigii, @CarolEidt, @eerhardt, @RussKeldorph |
| } | ||
|
|
||
| op1 = impSIMDPopStack(TYP_SIMD16); | ||
| retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, op2, intrinsic, baseType, simdSize); |
There was a problem hiding this comment.
For CeilingScalar/FloorScalar, I think we can duplicate the source-operand for one-arg overloads in the importer. Then we can set NumArg as 2 in the table, which may simplify the implementation.
There was a problem hiding this comment.
I suggest we do that, in general, later and only after we have confirmed the codegen will be as good.
@CarolEidt, do you know if the JIT is smart enough to recognize that op2 is a clone of op1 and only needs to be read from memory once?
There was a problem hiding this comment.
I suggest we do that, in general, later and only after we have confirmed the codegen will be as good.
It is okay to me. But numArgOfHWIntrinsic should be modified if we distinguish one-arg and two-arg overloads in the back-end.
There was a problem hiding this comment.
op2 is a clone of op1 and only needs to be read from memory once
Not sure what "clone" means here. But the "only read from memory once" typically happens only if CSE kicks in.
There was a problem hiding this comment.
Not sure what "clone" means here
@mikedn, gtCloneExpr(op1).
Currently, we only have op1 (op2 is nullptr) and it's all we carry through codegen. op1 will get loaded into register (via movups) and we will ultimately emit INS op1Reg, op1Reg, imm8.
If we instead do what @fiigii is suggesting my concern would be that the JIT isn't smart enough to always recognize that op1 and op2 represent the same underlying location. If it isn't smart enough we might end up emitting two separate movups and INS op1Reg, op2Reg, imm8 or one movups and a folded load (INS [op1], op2Reg, imm8).
There was a problem hiding this comment.
I don't think that using gtCloneExpr here is an option, best I can tell op1 is an arbitrary tree and you can't quite clone that. If it's a complex tree the only reasonable way to "clone" it is to spill it to a local variable and make both op1 and op2 GT_LCL_VARs. And since they both use the same variable they should end up using the same register.
But spilling the tree is not without drawbacks so keeping op2 as nullptr seems somehow better.
There was a problem hiding this comment.
If you do end up cloning, impCloneExpr may be what you're looking for, it clones "cheap" trees and creates new temps for "expensive" trees.
There was a problem hiding this comment.
@mikden, @AndyAyersMS: Thanks for the tips.
I think I will keep it "as is" for now. The current implementation is producing good results and is trivial to handle in codegen.
There was a problem hiding this comment.
I would agree that keeping as-is is best, at least until such time as we come across a good reason for doing otherwise.
| case NI_SSE41_DotProduct: | ||
| { | ||
| instruction ins = Compiler::insOfHWIntrinsic(intrinsicID, node->gtSIMDBaseType); | ||
| genHWIntrinsic_FullRangeImm8(node, ins); |
|
Most of the test failures are https://github.com/dotnet/coreclr/issues/16566. Will rerun after the fix is merged. There was another job that failed due to the "paging file is too small" issue. It has been re-queued. There was also an issue with |
| } | ||
|
|
||
| op1 = impSIMDPopStack(TYP_SIMD16); | ||
| retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, op2, intrinsic, baseType, simdSize); |
There was a problem hiding this comment.
I would agree that keeping as-is is best, at least until such time as we come across a good reason for doing otherwise.
|
@CarolEidt, Just as an FYI: I am going to rebase this PR onto #16183 after it is merged. The resulting change should be to the |
|
Rebased onto dotnet/master. Change is slightly smaller thanks to #16183. |
|
@CarolEidt, @fiigii, the
I should have a fix up momentarily and am just going through our other code paths to try and ensure we don't have any similar scenarios elsewhere. |
|
Would it be useful to have a flag indicating whether an intrinsic has RMW semantics on op1 and therefore needs delay free on the second source? |
@CarolEidt, probably long term, yes (although a flag saying it doesn't have RMW semantics is probably better, I believe most intrinsics default to having it). The fix I am doing right now is to just blanket the intrinsics second op as delay free for |
|
@CarolEidt, is there any chance that a temporary register ( Basically wondering if I need to worry about: |
That's why |
|
@mikedn, thanks. I'll get a separate PR up that sets that (since it isn't actually impacting CI right now and can be handled separately). |
I'm not sure if that particular implementation of FP compare check is the best. The JIT uses a different approach for floating point relops, something that looks like: It does use a branch but it's likely to be a rarely taken/not taken branch due it being used only for NaNs. At the same time the code is shorter and uses a single register. Hmm, I need to take a closer look at intrinsics that use flags, they're likely to have problems. Especially when used with conditional branches. |
@mikedn, it could probably do with some improvement (and needs to handle the jcc/movcc folding support eventually). I implemented this way because Clang and MSVC do it this way for the "trivial" example: https://godbolt.org/g/9xS7zh. I would assume that the other code would generally be used only if register pressure was a concern. I also don't think it is shorter in all cases, depending on which register is used for the |
That one looks better. The JIT code you shown above seems to generate an extra setcc. And it's interesting that gcc does it quite differently... |
You're right, seems the JIT code should be: emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD16), op1Reg, op2Reg);
emit->emitIns_R(INS_setpo, EA_1BYTE, targetReg);
emit->emitIns_R(INS_sete, EA_1BYTE, tmpReg);
emit->emitIns_R_R(INS_and, EA_1BYTE, tmpReg, targetReg);
- emit->emitIns_R(INS_setne, EA_1BYTE, targetReg);
- emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, targetReg);
+ emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, tmpReg);
Yeah, it is doing an unordered compare and just checking ZF, which I'm not sure is "technically" correct (basically the only difference between Based on the documentation, I would expect that both PF and ZF need to be checked, otherwise it would treat two NaN as equal:
|
It's only needed if the tempReg needs to be different from the target. The internal registers never conflict with incoming sources. |
In this particular case it will, so I will need to update. I will have that PR up later tonight. |
|
@CarolEidt, do the new changes look good to you as well? Just want to confirm so I can get this merged when the tests finish here shortly. |
|
Updated to have an actual flag ( |
|
Had a wrong check against |
|
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic test OSX10.12 x64 Checked jitincompletehwintrinsic |
CarolEidt
left a comment
There was a problem hiding this comment.
LGTM, but in future I'd like us to (re)consider the usability/understandability of the negative flags in namedintrinsiclist.h
|
|
||
| // No Read/Modify/Write Semantics | ||
| // the intrinsic does not have read/modify/write semantics and doesn't need | ||
| HW_Flag_NoRMWSemantics = 0x4000, |
There was a problem hiding this comment.
I am not a fan of all these negative flags, as it is always confusing when you are checking for the positive case by checking the negation of negative case. But I see there's a lot of precedent for that in these flags already.
There was a problem hiding this comment.
I'll log an issue tracking an investigation towards fixing this.
This partially resolves https://github.com/dotnet/coreclr/issues/16458
Still todo are: