This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Adding partial support for the SSE41 hardware intrinsics. #16558

Merged
tannergooding merged 3 commits into dotnet:master from tannergooding:hardwareintrin-sse41
Feb 28, 2018

Conversation

@tannergooding
Member

@tannergooding tannergooding commented Feb 25, 2018

This partially resolves https://github.com/dotnet/coreclr/issues/16458

Still to do:

  • Extract
  • Insert

@tannergooding
Member Author

FYI. @fiigii, @CarolEidt, @eerhardt, @RussKeldorph

}

op1 = impSIMDPopStack(TYP_SIMD16);
retNode = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op1, op2, intrinsic, baseType, simdSize);

For CeilingScalar/FloorScalar, I think we can duplicate the source-operand for one-arg overloads in the importer. Then we can set NumArg as 2 in the table, which may simplify the implementation.

Member Author

I suggest we do that, in general, later and only after we have confirmed the codegen will be as good.

@CarolEidt, do you know if the JIT is smart enough to recognize that op2 is a clone of op1 and only needs to be read from memory once?

> I suggest we do that, in general, later and only after we have confirmed the codegen will be as good.

That is fine with me. But numArgOfHWIntrinsic should be modified if we distinguish one-arg and two-arg overloads in the back-end.

> op2 is a clone of op1 and only needs to be read from memory once

Not sure what "clone" means here. But the "only read from memory once" typically happens only if CSE kicks in.

Member Author

> Not sure what "clone" means here

@mikedn, gtCloneExpr(op1).

Currently, we only have op1 (op2 is nullptr) and it's all we carry through codegen. op1 will get loaded into a register (via movups) and we will ultimately emit INS op1Reg, op1Reg, imm8.

If we instead do what @fiigii is suggesting my concern would be that the JIT isn't smart enough to always recognize that op1 and op2 represent the same underlying location. If it isn't smart enough we might end up emitting two separate movups and INS op1Reg, op2Reg, imm8 or one movups and a folded load (INS [op1], op2Reg, imm8).

I don't think that using gtCloneExpr here is an option. As best I can tell, op1 is an arbitrary tree and you can't quite clone that. If it's a complex tree, the only reasonable way to "clone" it is to spill it to a local variable and make both op1 and op2 GT_LCL_VARs. And since they both use the same variable, they should end up using the same register.

But spilling the tree is not without drawbacks, so keeping op2 as nullptr seems somewhat better.

Member

If you do end up cloning, impCloneExpr may be what you're looking for, it clones "cheap" trees and creates new temps for "expensive" trees.
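
The "clone cheap trees, spill expensive ones to a temp" idea described in the last two comments can be sketched with a toy model. This is only an illustration, not the JIT's impCloneExpr: the `Expr`, `TwoUses`, and `cloneOrSpill` names are hypothetical, and a real tree carries far more state than a single local number.

```cpp
#include <cassert>

// Toy stand-in for a JIT tree (hypothetical): either a cheap leaf that reads
// local variable `lclNum`, or some arbitrary expensive expression.
struct Expr
{
    bool isCheapLocal;
    int  lclNum; // meaningful only when isCheapLocal is true
};

// Two usable operands plus a note on whether a new temp was introduced.
struct TwoUses
{
    Expr first;
    Expr second;
    bool spilled;
};

// Returns two operands that denote the same value. A cheap local is simply
// duplicated; anything else is (conceptually) stored to a fresh temp and both
// operands become reads of that temp.
inline TwoUses cloneOrSpill(const Expr& tree, int& nextTempNum)
{
    assert(nextTempNum >= 0);
    if (tree.isCheapLocal)
    {
        return {tree, tree, false};
    }
    Expr temp{true, nextTempNum++}; // new temp local holding the spilled value
    return {temp, temp, true};
}
```

In this model both operands always read the same local, which is the property that would let a register allocator give them the same register; the cost is the extra temp in the expensive case.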

Member Author

@mikedn, @AndyAyersMS: Thanks for the tips.

I think I will keep it "as is" for now. The current implementation is producing good results and is trivial to handle in codegen.

I would agree that keeping as-is is best, at least until such time as we come across a good reason for doing otherwise.

Comment thread src/jit/hwintrinsiccodegenxarch.cpp Outdated
case NI_SSE41_DotProduct:
{
instruction ins = Compiler::insOfHWIntrinsic(intrinsicID, node->gtSIMDBaseType);
genHWIntrinsic_FullRangeImm8(node, ins);

I suggest waiting for #16183, or merging it soon...

Member Author

I'm hoping that I can rebase this PR after #16183 goes in.

@tannergooding
Member Author

Most of the test failures are https://github.com/dotnet/coreclr/issues/16566. Will rerun after the fix is merged.

There was another job that failed due to the "paging file is too small" issue. It has been re-queued.

There was also an issue with MultipleSumAbsoluteDifferences in the x86 No AVX job. I am investigating.

@CarolEidt CarolEidt left a comment

LGTM

@tannergooding
Member Author

@CarolEidt, Just as an FYI: I am going to rebase this PR onto #16183 after it is merged. The resulting change should be to the genHWIntrinsic_FullRangeImm8 call here: https://github.com/dotnet/coreclr/pull/16558/files#diff-847442c7efdf7a78e78711bea30fd4b2R1071

@tannergooding
Member Author

Rebased onto dotnet/master. Change is slightly smaller thanks to #16183.

@tannergooding
Member Author

@CarolEidt, @fiigii, the MultipleSumAbsoluteDifferences failure is actually a more general issue.

targetReg != op1Reg, but targetReg == op2Reg. On x86, when we inject the intermediate movaps targetReg, op1Reg we end up overwriting the value in op2Reg.

I should have a fix up momentarily and am just going through our other code paths to try and ensure we don't have any similar scenarios elsewhere.
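
The hazard described above can be reproduced with a toy register file. This is a sketch under stated assumptions: `RegFile`, `emitNaive`, and `emitChecked` are hypothetical names, subtraction stands in for any non-commutative two-operand instruction, and the assert only approximates the invariant that marking op2 as delay free is meant to guarantee.

```cpp
#include <array>
#include <cassert>

// Toy register file: four integer "registers". Purely illustrative.
using RegFile = std::array<int, 4>;

// Lowers "target = op1 - op2" onto a two-operand (read-modify-write) form:
//   mov target, op1   (only if target != op1)
//   sub target, op2
// When target and op2 are the same register, the mov destroys op2 before
// it is read, so the result is wrong.
inline int emitNaive(RegFile regs, int target, int op1, int op2)
{
    if (target != op1)
    {
        regs[target] = regs[op1]; // movaps targetReg, op1Reg
    }
    regs[target] -= regs[op2]; // INS targetReg, op2Reg
    return regs[target];
}

// Same lowering, but it refuses register assignments that would clobber op2.
inline int emitChecked(RegFile regs, int target, int op1, int op2)
{
    assert(target == op1 || target != op2);
    return emitNaive(regs, target, op1, op2);
}
```

With registers `{10, 3, 0, 0}`, computing r0 - r1 into r2 yields 7, while asking for the result in r1 (the same register as op2) yields 0, because r1 is overwritten with 10 before the subtraction reads it.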

@CarolEidt

Would it be useful to have a flag indicating whether an intrinsic has RMW semantics on op1 and therefore needs delay free on the second source?

@tannergooding
Member Author

> Would it be useful to have a flag indicating whether an intrinsic has RMW semantics on op1 and therefore needs delay free on the second source?

@CarolEidt, probably long term, yes (although a flag saying it doesn't have RMW semantics is probably better, I believe most intrinsics default to having it).

The fix I am doing right now is to blanket-mark every intrinsic's second operand as delay free when UseVEXEncoding() == false, and to place asserts in the appropriate locations.

@tannergooding
Member Author

@CarolEidt, is there any chance that a temporary register (node->GetSingleTempReg()) is the same as one of the normally allocated registers?

Basically wondering if I need to worry about:

assert(baseType == TYP_DOUBLE);
op2Reg             = op2->gtRegNum;
instruction ins    = Compiler::insOfHWIntrinsic(intrinsicID, baseType);
regNumber   tmpReg = node->GetSingleTempReg();

emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD16), op1Reg, op2Reg);
emit->emitIns_R(INS_setpe, EA_1BYTE, targetReg);
emit->emitIns_R(INS_setne, EA_1BYTE, tmpReg);
emit->emitIns_R_R(INS_or, EA_1BYTE, tmpReg, targetReg);
emit->emitIns_R(INS_setne, EA_1BYTE, targetReg);
emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, targetReg);

@mikedn

mikedn commented Feb 27, 2018

> is there any chance that a temporary register (node->GetSingleTempReg()) is the same as one of the normally allocated registers?

That's why TreeNodeInfo::isInternalRegDelayFree exists.

@tannergooding
Member Author

@mikedn, thanks. I'll get a separate PR up that sets that (since it isn't actually impacting CI right now and can be handled separately).

@mikedn

mikedn commented Feb 27, 2018

> Basically wondering if I need to worry about:

I'm not sure if that particular implementation of FP compare check is the best. The JIT uses a different approach for floating point relops, something that looks like:

       0F9BC0               setpo    al
       7A03                 jpe      SHORT G_M55886_IG03
       0F94C0               sete     al
G_M55886_IG03:
       0FB6C0               movzx    rax, al

It does use a branch, but it's likely to be a rarely taken/not taken branch due to it being used only for NaNs. At the same time, the code is shorter and uses a single register. Hmm, I need to take a closer look at intrinsics that use flags; they're likely to have problems, especially when used with conditional branches.

@tannergooding
Member Author

> I'm not sure if that particular implementation of FP compare check is the best.

@mikedn, it could probably do with some improvement (and needs to handle the jcc/movcc folding support eventually).

I implemented it this way because Clang and MSVC do it this way for the "trivial" example: https://godbolt.org/g/9xS7zh.

I would assume that the other code would generally be used only if register pressure was a concern. I also don't think it is shorter in all cases: depending on which register is used for the `and` operation, it should be either the same length or only 1 byte longer.

0F9BC0               setnp  al
0F94C1               sete   cl
20C1                 and    cl,al
0FB6C1               movzx  eax,cl

@mikedn

mikedn commented Feb 27, 2018

> I implemented this way because Clang and MSVC do it this way for the "trivial" example:

That one looks better. The JIT code you showed above seems to generate an extra setcc. And it's interesting that gcc does it quite differently...

@tannergooding
Member Author

> That one looks better. The JIT code you shown above seems to generate an extra setcc.

You're right, seems the JIT code should be:

  emit->emitIns_R_R(ins, emitTypeSize(TYP_SIMD16), op1Reg, op2Reg);
  emit->emitIns_R(INS_setpo, EA_1BYTE, targetReg);
  emit->emitIns_R(INS_sete, EA_1BYTE, tmpReg);
  emit->emitIns_R_R(INS_and, EA_1BYTE, tmpReg, targetReg);
- emit->emitIns_R(INS_setne, EA_1BYTE, targetReg);
- emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, targetReg);
+ emit->emitIns_R_R(INS_movzx, EA_1BYTE, targetReg, tmpReg);

> And it's interesting that gcc does it quite differently...

Yeah, it is doing an unordered compare and just checking ZF, which I'm not sure is "technically" correct (basically the only difference between ucomiss and comiss is whether an exception is raised for QNaN)

Based on the documentation, I would expect that both PF and ZF need to be checked, otherwise it would treat two NaN as equal:

Compares the single-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

RESULT <- OrderedCompare(DEST[31:0] <> SRC[31:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
    UNORDERED: ZF,PF,CF <- 111;
    GREATER_THAN: ZF,PF,CF <- 000;
    LESS_THAN: ZF,PF,CF <- 001;
    EQUAL: ZF,PF,CF <- 100;
ESAC;
OF, AF, SF <- 0; }
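
To make the PF/ZF point concrete, here is a small model of the flag table quoted above and of the three equality strategies discussed in this thread. It is a sketch, not JIT code: `Flags`, `compareFlags`, `orderedEqual`, `branchEqual`, and `zfOnlyEqual` are hypothetical names, and the flags are modeled in plain C++ rather than read from EFLAGS.

```cpp
#include <cassert>
#include <cmath>

// The three flags written by the compare, as in the quoted pseudo-code.
struct Flags
{
    bool zf;
    bool pf;
    bool cf;
};

// Models the ZF,PF,CF results from the table above.
inline Flags compareFlags(float a, float b)
{
    if (std::isnan(a) || std::isnan(b)) return {true, true, true}; // UNORDERED
    if (a > b) return {false, false, false};                       // GREATER_THAN
    if (a < b) return {false, false, true};                        // LESS_THAN
    return {true, false, false};                                   // EQUAL
}

// setnp + sete + and: equal only when ZF is set and PF is clear.
inline bool orderedEqual(float a, float b)
{
    const Flags f = compareFlags(a, b);
    return !f.pf && f.zf;
}

// Branching form: setpo al; jpe skip; sete al.
inline bool branchEqual(float a, float b)
{
    const Flags f = compareFlags(a, b);
    bool al = !f.pf;           // setpo al
    if (!f.pf) { al = f.zf; }  // jpe skips the sete when PF is set
    return al;
}

// Checking ZF alone: reports "equal" for any unordered pair.
inline bool zfOnlyEqual(float a, float b)
{
    return compareFlags(a, b).zf;
}

// The PF-aware form should agree with the language's own == on floats.
inline void checkAgainstLanguageEquality(float a, float b)
{
    assert(orderedEqual(a, b) == (a == b));
    assert(branchEqual(a, b) == (a == b));
}
```

In this model, `zfOnlyEqual(NaN, NaN)` is true while the two PF-aware forms return false, which matches the concern that a ZF-only check would treat two NaNs as equal.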

@tannergooding
Member Author

Test failures are caused by #16571. CC @jkotas

@CarolEidt

> is there any chance that a temporary register (node->GetSingleTempReg()) is the same as one of the normally allocated registers?

> That's why TreeNodeInfo::isInternalRegDelayFree exists.

It's only needed if the tempReg needs to be different from the target. The internal registers never conflict with incoming sources.

@tannergooding
Member Author

> It's only needed if the tempReg needs to be different from the target. The internal registers never conflict with incoming sources.

In this particular case it will, so I will need to update. I will have that PR up later tonight.

@tannergooding
Member Author

@CarolEidt, do the new changes look good to you as well? Just want to confirm so I can get this merged when the tests finish here shortly.

@tannergooding
Member Author

Updated to have an actual flag (HW_Flag_NoRMWSemantics) and ensured it was set on the various intrinsics, as appropriate.

@tannergooding
Member Author

Fixed a check that incorrectly tested op2 != nullptr instead of info->srcCount >= 2.

@tannergooding
Member Author

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

test OSX10.12 x64 Checked jitincompletehwintrinsic
test OSX10.12 x64 Checked jitx86hwintrinsicnoavx
test OSX10.12 x64 Checked jitx86hwintrinsicnoavx2
test OSX10.12 x64 Checked jitx86hwintrinsicnosimd
test OSX10.12 x64 Checked jitnox86hwintrinsic

@CarolEidt CarolEidt left a comment

LGTM, but in future I'd like us to (re)consider the usability/understandability of the negative flags in namedintrinsiclist.h


// No Read/Modify/Write Semantics
// the intrinsic does not have read/modify/write semantics and doesn't need
HW_Flag_NoRMWSemantics = 0x4000,

I am not a fan of all these negative flags, as it is always confusing when you are checking for the positive case by checking the negation of negative case. But I see there's a lot of precedent for that in these flags already.
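
One way to soften the double negative is a positive-sense helper wrapped around the negative flag. The sketch below is an assumption-laden illustration, not code from this PR: only the name HW_Flag_NoRMWSemantics and its value 0x4000 come from the snippet above, while `HW_Flag_NoFlag`, `hasRMWSemantics`, and `needsDelayFreeOp2` are hypothetical, and the delay-free rule is pieced together from the comments in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Trimmed-down flag set. HW_Flag_NoFlag is assumed here for illustration.
enum HWIntrinsicFlag : std::uint32_t
{
    HW_Flag_NoFlag         = 0,
    HW_Flag_NoRMWSemantics = 0x4000,
};

// Positive-sense wrapper so call sites read "has RMW semantics" instead of
// "does not have the no-RMW flag".
inline bool hasRMWSemantics(std::uint32_t flags)
{
    return (flags & HW_Flag_NoRMWSemantics) == 0;
}

// Rule as described in the thread (an interpretation): the second source is
// marked delay free only without VEX encoding, with at least two sources,
// and when the intrinsic has read/modify/write semantics.
inline bool needsDelayFreeOp2(std::uint32_t flags, int srcCount, bool useVexEncoding)
{
    assert(srcCount >= 0);
    return !useVexEncoding && srcCount >= 2 && hasRMWSemantics(flags);
}
```

The table itself can keep storing the negative flag, which suits a default where most intrinsics have RMW semantics, while readers of the code only ever see the positive question.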

Member Author

I'll log an issue tracking an investigation towards fixing this.

@tannergooding tannergooding merged commit ed43fdd into dotnet:master Feb 28, 2018
@tannergooding tannergooding deleted the hardwareintrin-sse41 branch May 30, 2018 04:14
Development

Successfully merging this pull request may close these issues.

Implement SSE4.1 hardware intrinsics

5 participants