Implement LoadHigh, LoadLow, and SetScalarVector128 SSE2 HW intrinsics#16736
Conversation
|
FYI @CarolEidt @tannergooding @fiigii Implementation of |
| // number of arguments | ||
| // | ||
| int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node) | ||
| int Compiler::numArgsOfHWIntrinsicSlow(GenTreeHWIntrinsic* node) |
There was a problem hiding this comment.
I do not think we need two member functions of numArgsOfHWIntrinsic, please combine them to one.
There was a problem hiding this comment.
Yes, I thought there was a general consensus that we should have a single method.
| // number of arguments | ||
| // | ||
| int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node) | ||
| int Compiler::numArgsOfHWIntrinsicSlow(GenTreeHWIntrinsic* node) |
There was a problem hiding this comment.
Yes, I thought there was a general consensus that we should have a single method.
|
Yes, the consensus was reached that we should have one function but it was based on the assumption that we will have usages where function may be called with node with explicit discussion link: #15777 (comment) Unfortunately, function in the current code will be called only with node argument which is not
The last arguments in favor of proposed design are:
Finally I have introduced signature change by removing |
|
On Ubuntu we are hitting failures due to renamed
|
Not really, our consensus is that the function always needs the node as passed argument.
Interesting, how did you get the data?
I am not a fan of early optimization, but if you can make sure that the function cannot be in-lined in the release build, this approach is reasonable to me. But the slower path can be a static local function instead of private member. |
|
test Windows_NT x64 Checked jitx86hwintrinsicnosimd |
|
test Windows_NT x86 Checked jitx86hwintrinsicnosimd |
|
I agree with @fiigii, if we have some actual data indicating this is a good perf scenario, we should make this change. However, the |
|
Reverted to single function due to problems with getting @tannergooding @CarolEidt Could you assign it to me? |
|
The problem we are hitting now is that Disassembly of release build of .text:0000000180129C88 CodeGen::genHWIntrinsic proc near
.text:0000000180129C88 var_B0 = dword ptr -0B0h
.text:0000000180129C88 var_AC = dword ptr -0ACh
.text:0000000180129C88 var_A8 = qword ptr -0A8h
.text:0000000180129C88 var_A0 = dword ptr -0A0h
.text:0000000180129C88 var_98 = qword ptr -98h
.text:0000000180129C88 var_90 = xmmword ptr -90h
.text:0000000180129C88 var_80 = xmmword ptr -80h
.text:0000000180129C88 var_70 = xmmword ptr -70h
.text:0000000180129C88 var_60 = xmmword ptr -60h
.text:0000000180129C88 var_50 = xmmword ptr -50h
.text:0000000180129C88 var_40 = xmmword ptr -40h
.text:0000000180129C88 var_s18 = dword ptr 18h
.text:0000000180129C88 var_s20 = dword ptr 20h
.text:0000000180129C88 var_s28 = dword ptr 28h
.text:0000000180129C88 arg_0 = qword ptr 40h
.text:0000000180129C88
.text:0000000180129C88 mov [rsp-38h+arg_0], rbx
.text:0000000180129C8D push rbp
.text:0000000180129C8E push rsi
.text:0000000180129C8F push rdi
.text:0000000180129C90 push r12
.text:0000000180129C92 push r13
.text:0000000180129C94 push r14
.text:0000000180129C96 push r15
.text:0000000180129C98 lea rbp, [rsp-27h]
.text:0000000180129C9D sub rsp, 0C0h
.text:0000000180129CA4 mov r12d, [rdx+48h]
.text:0000000180129CA8 mov r14, rcx
.text:0000000180129CAB lea rcx, hwIntrinsicInfoArray
.text:0000000180129CB2 mov rsi, rdx
.text:0000000180129CB5 lea eax, [r12-6]
.text:0000000180129CBA mov r11d, eax
.text:0000000180129CBD lea rax, [rax+rax*4]
.text:0000000180129CC1 add rax, rax
.text:0000000180129CC4 mov r10d, [rcx+rax*8+4Ch]
.text:0000000180129CC9 mov r8d, [rcx+rax*8+10h]
.text:0000000180129CCE mov r15d, [rcx+rax*8+48h]
.text:0000000180129CD3 mov edi, [rcx+rax*8+14h]
.text:0000000180129CD7 mov rcx, rdx
.text:0000000180129CDA mov [rbp+57h+var_A0], r10d
.text:0000000180129CDE call Compiler__numArgsOfHWIntrinsic
.text:0000000180129CE3 lea ecx, [r15-3]
.text:0000000180129CE7 mov r9d, eax
.text:0000000180129CEA mov eax, 4
.text:0000000180129CEF test ecx, 0FFFFFFFAh
.text:0000000180129CF5 jnz short loc_180129D00
|
That's fine, I do not believe that calling |
|
As I said, if a helper function can make |
| HARDWARE_INTRINSIC(SSE2_SetZeroVector128, "SetZeroVector128", SSE2, -1, 16, 0, {INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_invalid, INS_xorpd}, HW_Category_Helper, HW_Flag_NoRMWSemantics) | ||
| HARDWARE_INTRINSIC(SSE2_SetAllVector128, "SetAllVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_NoRMWSemantics) | ||
| HARDWARE_INTRINSIC(SSE2_SetScalarVector128, "SetScalarVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movsdsse2}, HW_Category_Helper, HW_Flag_MultiIns|HW_Flag_NoRMWSemantics) | ||
| HARDWARE_INTRINSIC(SSE2_SetVector128, "SetVector128", SSE2, -1, 16, -1, {INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_invalid, INS_xorpd}, HW_Category_Helper, HW_Flag_NoFlag) |
There was a problem hiding this comment.
Please remove the intrinsic definitions from this table, if you do not implement in the PR.
|
@fiigii Addressed code review feedback |
|
@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test |
|
On Ubuntu there are failures of Windows sub jobs due to:
https://ci.dot.net/job/dotnet_coreclr/job/master/job/checked_windows_nt_prtest/3399/ |
|
test Ubuntu x64 Checked jitincompletehwintrinsic |
|
Again hitting timeouts after 4 hours runs |
|
test Ubuntu x64 Checked jitincompletehwintrinsic |
| assert(baseType == TYP_DOUBLE); | ||
| assert(op2 == nullptr); | ||
|
|
||
| instruction ins = Compiler::insOfHWIntrinsic(intrinsicID, node->gtSIMDBaseType); |
There was a problem hiding this comment.
nit: It might be useful to extract this up-top, like we do in genSSEIntrinsic
There was a problem hiding this comment.
@tannergooding Checked the pattern and it seems that in Sse instructions are hard coded:
case NI_SSE_SetScalarVector128:
{
assert(baseType == TYP_FLOAT);
assert(op2 == nullptr);
if (op1Reg == targetReg)
{
regNumber tmpReg = node->GetSingleTempReg();
// Ensure we aren't overwriting targetReg
assert(tmpReg != targetReg);
emit->emitIns_R_R(INS_movaps, emitTypeSize(TYP_SIMD16), tmpReg, op1Reg);
op1Reg = tmpReg;
}
emit->emitIns_SIMD_R_R_R(INS_xorps, emitTypeSize(TYP_SIMD16), targetReg, targetReg, targetReg);
emit->emitIns_SIMD_R_R_R(INS_movss, emitTypeSize(TYP_SIMD16), targetReg, targetReg, op1Reg);
break;
}I would prefer to not follow that pattern.
There was a problem hiding this comment.
Hmmm, seems I was thinking of another method.
I was suggesting that it not be hard-coded, but you move the instruction ins = Compiler::insOfHWIntrinsic line to the top of genSSE2Intrinsic, so it can be more generally shared.
There was a problem hiding this comment.
Actually, I moved it inside case switches due to explicit code review feedback from @CarolEidt
There was a problem hiding this comment.
Hmmm - I don't recall giving that feedback (though I don't doubt that I did). It makes sense to pull it out in this case, as nearly all the cases depend upon it. If a large number had a one-to-one mapping, it might make sense to push it into the cases that require it.
In reply to: 172245582 [](ancestors = 172245582)
There was a problem hiding this comment.
The comment is here: #16262 (comment)
@CarolEidt Pls confirm if I should refactor
There was a problem hiding this comment.
Thanks for the pointer. At that time, there weren't that many cases, so it seemed better to push it down into the cases that needed it, but now there are many cases with the majority requiring it, so it would make sense to pull it up. However, I don't think it's critical.
There was a problem hiding this comment.
I am keen to get better implementation. May I do it together with next PRs?
There was a problem hiding this comment.
I am keen to get better implementation. May I do it together with next PRs?
That would be fine, though we will want to slow the churn in the code as we move closer to release.
| { | ||
| regNumber tmpReg = node->GetSingleTempReg(); | ||
|
|
||
| assert(tmpReg != targetReg); |
There was a problem hiding this comment.
Please ensure you copy over comments as well when copying code (missing // Ensure we aren't overwriting targetReg)
| // node unless it is nullptr | ||
| // numArgsOfHWIntrinsic: gets the number of arguments for the hardware intrinsic. | ||
| // This attempts to do a table based lookup but will fallback to the number | ||
| // of operands in 'node' if the table entry is - 1. |
There was a problem hiding this comment.
nit: Remove the space in - 1
| { | ||
| numArgs = 0; | ||
| GenTreeArgList* list = op1->AsArgList(); | ||
| int numArgs = 0; |
a8d4288 to
a72b412
Compare
tannergooding
left a comment
There was a problem hiding this comment.
Overall LGTM.
I plan on creating a template for the Load/Store tests shortly, which should help in code coverage.
|
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic |
| } | ||
|
|
||
| // TODO_XARCH-CQ - refactoring of numArgsOfHWIntrinsic fast path into inlinable | ||
| // function and slow local static function may increase performance significantly |
There was a problem hiding this comment.
I don't think we need this comment. As far as I know there is no evidence that this has a measurable impact on throughput.
There was a problem hiding this comment.
Thanks for review. May I correct it in next PR?
Addresses code review feedback from #15777