Update the table-driven framework to support x86 imm-intrinsics#16183
Conversation
|
I will submit test cases later. |
| case INS_psubusb: | ||
| case INS_psubusw: | ||
| case INS_psubw: | ||
| case INS_psllw: |
There was a problem hiding this comment.
nit: it would help if we would keep alphabetical order of instructions
|
|
||
| void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber reg, regNumber reg1, int ival) | ||
| { | ||
| bool isSSEShift = false; |
There was a problem hiding this comment.
Maybe this should be in a separate "isSseShift" function?
| { | ||
| if (op3->IsCnsIntOrI()) | ||
| { | ||
|
|
| { | ||
| if (op2->IsCnsIntOrI()) | ||
| { | ||
| ssize_t ival = op2->AsIntConCommon()->IconValue(); |
There was a problem hiding this comment.
Since you have used IsCnsIntOrI using AsIntCon would make more sense. AsIntConCommon also accepts GT_CNS_LNG which is not needed here.
| assert(op2Reg != REG_NA); | ||
| NamedIntrinsic intrinsicID = node->gtHWIntrinsicId; | ||
| var_types baseType = node->gtSIMDBaseType; | ||
| emitAttr attr = (emitAttr)node->gtSIMDSize; |
There was a problem hiding this comment.
I don't think it makes a significant difference but in many places the EA_ATTR macro is used for this.
|
|
||
| // Emit the jump table | ||
|
|
||
| JITDUMP("\n J_M%03u_DS%02u LABEL DWORD\n", Compiler::s_compMethodsCount, jmpTableBase); |
There was a problem hiding this comment.
I don't think this is really useful. It looks like it was copied from genJumpTable but then it didn't belong there either. This kind of dump messages, if really needed, should be in emitBBTableDataGenBeg and emitDataGenData.
Or better, the data section should be included in the disassembly produced by the JIT. But that's a different story.
| // Return Value: | ||
| // the max imm-value of non-full-range IMM intrinsic | ||
| // | ||
| int Compiler::immUpperBoundOfHWIntrinsic(NamedIntrinsic intrinsic, var_types baseType) |
There was a problem hiding this comment.
I just don't understand why people insist in adding more stuff to Compiler...
There was a problem hiding this comment.
Hmm.. these functions are needed by multiple passes (e.g., importer, codegen, lowering, etc). Is there other better approach to organize these functions?
There was a problem hiding this comment.
One reasonable option seems to be for these functions to be static members of HWIntrinsicInfo. Haven't really checked if this is actually possible or not. Another option might be global functions, similar to varType functions for example. Yet another option would be a HWIntrinsicTable class that basically acts as an abstraction for hwIntrinsicInfoArray.
That said, I'm not sure if it's worth bothering with this ATM.
| case NI_AVX2_ShiftLeftLogical: | ||
| case NI_AVX2_ShiftRightArithmetic: | ||
| case NI_AVX2_ShiftRightLogical: | ||
| return genTypeSize(baseType) * 8 - 1; |
There was a problem hiding this comment.
Something seems fishy here - the proper upper bound of shift is likely 255, not a number depending on the element type size.
| GenTree* Compiler::impOutOfRangeFallback(NamedIntrinsic intrinsic, var_types simdType, var_types baseType) | ||
| { | ||
| assert((flagsOfHWIntrinsic(intrinsic) & HW_Flag_OutOfRangeFallbackIMM) != 0); | ||
| assert(simdType == TYP_SIMD32 || simdType == TYP_SIMD16); |
There was a problem hiding this comment.
I see that the formatter doesn't flag these but the usual practice is to parenthesize this kind of expressions: assert((simdType == TYP_SIMD32) || (simdType == TYP_SIMD16));. Happens in multiple places.
| // - AND, the imm-intrinsic does not have out-of-range fallback | ||
|
|
||
| if (!mustExpand && lastOp->IsCnsIntOrI() && | ||
| lastOp->AsIntConCommon()->IconValue() > immUpperBoundOfHWIntrinsic(intrinsic, baseType)) |
| void genLZCNTIntrinsic(GenTreeHWIntrinsic* node); | ||
| void genPCLMULQDQIntrinsic(GenTreeHWIntrinsic* node); | ||
| void genPOPCNTIntrinsic(GenTreeHWIntrinsic* node); | ||
| void genJumpTableFallback( |
There was a problem hiding this comment.
I think @sdmaclea had a good approach to this in one of his PRs.
Specifically, the codegen is essentially the same except for the actual emitted instruction and could be handled via a lambda.
| case NI_SSE2_ShiftRightLogical128BitLane: | ||
| impPopStack(); | ||
| impPopStack(); | ||
| return gtNewSimdHWIntrinsicNode(simdType, NI_SSE2_SetZeroVector128, baseType, genTypeSize(simdType)); |
There was a problem hiding this comment.
This can be combined with AVX2 with if / else clause for choosing NI_SSE2_SetZeroVector128 or NI_AVX_SetZeroVector256 intrinsic parameter
1ae1ee1 to
b847cb0
Compare
| HARDWARE_INTRINSIC(SSE2_CompareNotLessThanOrEqual, "CompareNotLessThanOrEqual", SSE2, 6, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cmppd}, HW_Category_SimpleSIMD, HW_Flag_NoFlag) | ||
| HARDWARE_INTRINSIC(SSE2_CompareOrdered, "CompareOrdered", SSE2, 7, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cmppd}, HW_Category_SimpleSIMD, HW_Flag_NoFlag) | ||
| HARDWARE_INTRINSIC(SSE2_CompareUnordered, "CompareUnordered", SSE2, 3, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cmppd}, HW_Category_SimpleSIMD, HW_Flag_NoFlag) | ||
| HARDWARE_INTRINSIC(SSE2_ConvertScalarToVector128Int32, "ConvertScalarToVector128Int32", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov_i2xmm, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoFlag) |
There was a problem hiding this comment.
@4creators No worries, I just use this intrinsic as an internal helper. Once you implement it, I will remove here.
| { | ||
| emit->emitIns_R_I(INS_cmp, EA_4BYTE, nonConstImmReg, maxImmValue); | ||
| emitJumpKind jge = genJumpKindForOper(GT_GE, CK_UNSIGNED); | ||
| genJumpToThrowHlpBlk(jge, SCK_ARG_RNG_EXCPN); |
There was a problem hiding this comment.
The runtime intermittently gets "failed to find exception throw block". @CarolEidt @tannergooding @mikedn @jkotas Do you know why?
19:06:15 CLRJIT! CHECK::Trigger + 0x30E (0x7354b789)
19:06:15 CLRJIT! CodeGen::genJumpToThrowHlpBlk + 0x13C (0x73435bd7)
19:06:15 CLRJIT! CodeGen::genJumpTableFallback<<lambda_8c168ac1d3dcdd43425f08b8cdfa80f0> > + 0xF7 (0x7353e539)
19:06:15 CLRJIT! CodeGen::genHWIntrinsic + 0x2AE (0x7353ec79)
19:06:15 CLRJIT! CodeGen::genCodeForTreeNode + 0x43D (0x73516b55)
19:06:15 CLRJIT! CodeGen::genCodeForBBlist + 0x80E (0x7343936e)
19:06:15 CLRJIT! CodeGen::genGenerateCode + 0x283 (0x73434f23)
19:06:15 CLRJIT! Compiler::compCompile + 0x72D (0x73440571)
19:06:15 CLRJIT! Compiler::compCompileHelper + 0x589 (0x73441767)
19:06:15 CLRJIT! Compiler::compCompile + 0x622 (0x73440c53)
19:06:15 File: d:\j\workspace\x86_checked_w---b1af1919\src\jit\codegencommon.cpp Line: 2826
19:06:15 Image: D:\j\workspace\x86_checked_w---b1af1919\bin\tests\Windows_NT.x86.Checked\Tests\Core_Root\CoreRun.exe
{
/* Find the helper-block which raises the exception. */
Compiler::AddCodeDsc* add =
compiler->fgFindExcptnTarget(codeKind, compiler->bbThrowIndex(compiler->compCurBB));
PREFIX_ASSUME_MSG((add != nullptr), ("ERROR: failed to find exception throw block"));
tgtBlk = add->acdDstBlk;
}There was a problem hiding this comment.
Range checks must be exposed in the importer, so that they can be appropriately modeled in the control flow. Check out the places in simd.cpp where it generates GT_SIMD_CHK.
There was a problem hiding this comment.
@CarolEidt Thanks for pointing out. IMO, I still do not like the current solution to non-const (that hardcodes a jump-table fallback in the compiler), which
- needs more IR to represent the range-check because the out-of-range constant argument will become a variable in the recursive function-body.
- the codegen always needs to consider keeping consistent semantics with the front-end.
So I think the software switch-table fallback can significantly simplify the compiler implementation that only deals with constant imm-argument.
There was a problem hiding this comment.
I thought we determined that, we could also do this as:
result Intrinsic(op1, op2)
{
RangeCheck();
return Intrinsic(op1, op2);
}Which only leaves the switch table up to the JIT.
There was a problem hiding this comment.
If we do it all in the JIT however, it still seems like a one time cost in implementing a helper function to do this, and it only needs to happen in the fallback case, which is already on its own codepath.
There was a problem hiding this comment.
I think the software switch-table is not a problem now. In the future, if the compiler determines to optimize away the switch, we can use an attribute to disable the optimization on these intrinsics.
There was a problem hiding this comment.
In the future, if the compiler determines to optimize away the switch, we can use an attribute to disable the optimization on these intrinsics.
There are multiple different compilers, and some (such as the C# compilers) don't have an option to disable optimizations for a given method. This would only work for an AOT or JIT compiler that was performing the optimization, and only if it respected the flag.
I think it would be good to better understand the specific pushback here. We already discussed this at some length in at least one of the other threads and landed on one of these two approaches:
- do the range check in managed and emit the switch table in the JIT
- emit the range check and the switch table in the JIT
There was a problem hiding this comment.
As far as I'm concerned, it is a non-starter to consider introducing a throw in codegen that has not been introduced into the IR earlier. In addition to the technical infeasibility, it would render it impossible to apply range-check optimizations to eliminate it.
I am mildly in favor of emitting both the range check and the switch table in the JIT, primarily because it doesn't rely on the IL implementation and the JIT to both (and separately) specify the appropriate range values.
b157c86 to
f84095e
Compare
|
Rebased and resolved a conflict from stack-level-setter. |
| void CodeGen::genRangeCheck(GenTree* oper) | ||
| { | ||
| #ifdef FEATURE_SIMD | ||
| #if defined(FEATURE_HW_INTRINSICS) && defined(FEATURE_SIMD) |
There was a problem hiding this comment.
Shouldn't the standalone GT_HWIntrinsic_CHK case also be handled (at the very least, based on the ifdefs elsewhere)?
There was a problem hiding this comment.
So far, all the imm-intrinsics are SIMD. Do you think it is necessary?
There was a problem hiding this comment.
I think that all #ifdefs concerning the HWIntrinsic Check should be consistent. So they either all need to be under #if defined(FEATURE_HW_INTRINSICS) && defined(FEATURE_SIMD) or they should all be under #ifdef FEATURE_HW_INTRINSICS. Having a mix of both (when there isn't two actual distinctions of where the code applies) just makes it confusing (and prone to breakage) later down the road.
There was a problem hiding this comment.
Got it, I will add a standalone #ifdef GT_HWIntrinsic_CHK.
| } | ||
| } | ||
|
|
||
| static bool isSSEShift(instruction ins) |
There was a problem hiding this comment.
Is Sse the appropriate term here, since it also looks to cover AVX instructions, etc?
There was a problem hiding this comment.
These special encoding shift instructions were introduced by SSE2, and AVX just gives them the VEX-encoding version. There are some AVX/AVX2-only shift instructions that don't belong to this group.
|
|
||
| void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber reg, regNumber reg1, int ival) | ||
| { | ||
| // TODO-XARCH refactoring emitIns_R_R_I to handle SSE2/AVX2 shift as well as emitIns_R_I |
There was a problem hiding this comment.
What about this isn't working today?
There was a problem hiding this comment.
SSE2 shift instructions require special encoding, we need to improve emitIns_R_R_I to handle https://github.com/dotnet/coreclr/blob/master/src/jit/emitxarch.cpp#L3688-L3700 as well.
There was a problem hiding this comment.
Could we log an actual bug, as well as this TODO?
There was a problem hiding this comment.
|
@CarolEidt Could you please take a look when you get a chance? This PR is blocking certain subsequent work. |
CarolEidt
left a comment
There was a problem hiding this comment.
I am going to approve this PR so that it can be merged and unlbock other work (though I will defer to @tannergooding to do the merge. I hope that you will consider doing a quick follow-up PR to address my comment requests, and respond with clarification on whether there is a known case where you would have only one of HW_Flag_NoJmpTableIMM and HW_Flag_MaybeIMM set.
I filed #16568 to consider merging of the range check operators.
|
|
||
| // NoJmpTable IMM | ||
| // the imm intrinsic does not need jumptable fallback when it gets non-const argument | ||
| HW_Flag_NoJmpTableIMM = 0x2000, |
There was a problem hiding this comment.
I still don't understand why both of these are required. In what case would we require a jump table if there was a non-immediate form available, and conversely, in what case would there be a non-immediate form available but we still require a jump table? Are there cases where the values covered by an immediate are somehow broader or otherwise disjoint from the values covered by a non-immediate form? I note that the existing intrinsics you added always have neither or both of these flags set.
There was a problem hiding this comment.
I still don't understand why both of these are required.
@CarolEidt The new-style intrinsic is based on function names rather than signatures, and some intrinsics have "imm" and simple-SIMD overloads both. For example
public static Vector256<sbyte> Shuffle(Vector256<sbyte> value, Vector256<sbyte> mask);
public static Vector256<int> Shuffle(Vector256<int> value, byte control) ;so a category HW_Category_IMM is not enough to indicate a intrinsic requiring imm8 or not (the above intrinsic should have flag HW_Flag_MaybeIMM). Consequently, we have to use the function bool Compiler::isImmHWIntrinsic(NamedIntrinsic intrinsic, GenTree* lastOp).
In this kind of imm-intrinsic (HW_Flag_MaybeIMM), certain intrinsics are special, which have encoding that can carry the imm8 operand in XMM registers, for example
public static Vector256<ulong> ShiftRightLogical(Vector256<ulong> value, byte count);
public static Vector256<ulong> ShiftRightLogical(Vector256<ulong> value, Vector128<ulong> count);
^^^^^^So, this intrinsic can have flag HW_Flag_NoJmpTableIMM to use XMM version as the non-constant fallback (rather than a jump-table).
I note that the existing intrinsics you added always have neither or both of these flags set.
Consequently, Avx2.Shuffle will need HW_Flag_MaybeIMM but no HW_Flag_NoJmpTableIMM.
There was a problem hiding this comment.
So, this intrinsic can have flag HW_Flag_NoJmpTableIMM to use XMM version as the non-constant fallback (rather than a jump-table).
I believe that the ISA design (carry the imm8 operand in XMM registers) is intentional for this situation, and ARM also has similar instructions.
| return false; | ||
| } | ||
|
|
||
| if ((flagsOfHWIntrinsic(intrinsic) & HW_Flag_MaybeIMM) != 0 && genActualType(lastOp->TypeGet()) != TYP_INT) |
There was a problem hiding this comment.
It is not obvious why this is correct - it needs comments. This relies on the observation that all the HW_Flag_MaybeIMM cases have either an immmediate integer argument, or a non-integer argument, but not a non-constant integer argument.
Perhaps your "Return Value" comment should say:
Return true iff the intrinsics is an imm-intrinsic overload.
Note that some intrinsics, with HW_Flag_MaybeIMM set, have both imm (integer immediate) and vector (i.e. non-TYP_INT) overloads.
| return nullptr; | ||
| } | ||
|
|
||
| // TODO-XARCH-CQ repleace imm-intrinsic by certain lower-latency intrinsic based-on imm-values. |
There was a problem hiding this comment.
I would urge you to just remove this comment.
| #endif // FEATURE_SIMD | ||
|
|
||
| #ifdef FEATURE_HW_INTRINSICS | ||
| GTNODE(HW_INTRINSIC_CHK , GenTreeBoundsChk ,0,GTK_SPECIAL|GTK_NOVALUE)// Compare whether an imm8 argument is in the valid range, and throw ArgumentOutOfRangeException if not. |
There was a problem hiding this comment.
I concede your point. These should probably be merged, though I am not 100% sure that is without risk, and I really don't want to hold this up for something that might lead to complications.
|
@fiigii, we should run the HWIntrinsic tests before merging. NOTE: These jobs will currently fail due to an unrelated issue: https://github.com/dotnet/coreclr/issues/16566. I've held off on kicking these jobs myself until after the PR resolving that is merged: #16567 |
|
@fiigii, I would also like to see bugs logged for the 2-3 pending issues before merging (they all correspond to |
| emitIns_R_R(INS_movaps, attr, reg, reg1); | ||
| } | ||
|
|
||
| if ((ins == INS_psrldq || ins == INS_pslldq) && ival > 15) |
There was a problem hiding this comment.
Why just psrldq and pslldq instead of isSseShift
|
|
||
| const unsigned maxByte = (unsigned)Compiler::immUpperBoundOfHWIntrinsic(intrinsic) + 1; | ||
| assert(maxByte <= 256); | ||
| BasicBlock* jmpTable[256]; |
There was a problem hiding this comment.
nit: BasicBlock* jmpTable[maxByte]?
There was a problem hiding this comment.
maxByte is not a compile-time constant.
| for (unsigned i = 0; i < maxByte; i++) | ||
| { | ||
| jmpTable[i] = genCreateTempLabel(); | ||
| JITDUMP(" DD L_M%03u_BB%02u\n", Compiler::s_compMethodsCount, jmpTable[i]->bbNum); |
There was a problem hiding this comment.
nit: You removed JITDUMP("\n J_M%03u_DS%02u LABEL DWORD\n", Compiler::s_compMethodsCount, jmpTableBase); from above, but not this corresponding line.
There was a problem hiding this comment.
Yeah, I was indicating that if one line is removed, the other should be as well.
| for (unsigned i = 0; i < maxByte; i++) | ||
| { | ||
| genDefineTempLabel(jmpTable[i]); | ||
| emitSwCase((int)i); |
There was a problem hiding this comment.
Why not keep this as a byte?
There was a problem hiding this comment.
The underlying functions all accept int, so it may not be necessary to introduce one more type to mess here...
There was a problem hiding this comment.
But the loop variable is still unsigned :)
| noway_assert(oper->OperGet() == GT_ARR_BOUNDS_CHECK); | ||
| #endif // !FEATURE_SIMD | ||
|
|
||
| assert(oper->OperIsBoundsCheck()); |
There was a problem hiding this comment.
Should this be kept as a noway_assert?
There was a problem hiding this comment.
No, noway_assert is pointless here. All other code gen functions use assert.
| if (!lastOp->IsCnsIntOrI() && !mustExpand) | ||
| // The imm-HWintrinsics that do not accept all imm8 values may throw exception or transform to other | ||
| // intrinsics when the imm argument is not in the valid range | ||
| if ((flags & HW_Flag_FullRangeIMM) == 0) |
There was a problem hiding this comment.
Can you log a bug tracking us adding a helper for checking if a flag is set or unset?
There was a problem hiding this comment.
Will do, and this comment should be updated, thanks.
And add a new range-check IR for x86 imm-intrinsics.
| for (unsigned i = 0; i < maxByte; i++) | ||
| { | ||
| genDefineTempLabel(jmpTable[i]); | ||
| emitSwCase(i); |
There was a problem hiding this comment.
@tannergooding @mikedn Changed emitSwCase to accept unsigned, so that this function only uses unsigned variables.
|
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic test OSX10.12 x64 Checked jitincompletehwintrinsic |
|
@CarolEidt @tannergooding @mikedn Thank you so much for your reviwe. All the feedback should already get fixed. Logged the remaining works at https://github.com/dotnet/coreclr/issues/16594 https://github.com/dotnet/coreclr/issues/16575 https://github.com/dotnet/coreclr/issues/16543 https://github.com/dotnet/coreclr/issues/16568 |
|
@fiigii thanks for responding to the feedback so far. I'll get this merged after all the tests complete (and pass). |
|
Thanks for the changes & responses. I'm happy with where we are on this. |
|
@4creators This PR is going to be merged. Would you like to complete the SSE2 intrinsic implementation? As I know, the code freeze date of .NET Core 2.1 may be coming soon, we probably prefer to enable all the SSE/SSE2/SSE3/SSSE3/SSE4.1/AVX intrinsics in this version of runtime. |
|
@fiigii Thanks for information. I will try to implement remaining SSE2 intrinsics asap. What is a plan on AVX2? Do we intend to finish them for 2.1 release? |
I will try my best to implement as many as AVX2 intrinsics (depends on the deadline). |
|
@4creators, if you could send me an e-mail (tagoo at microsoft dot com) it might be a bit easier to sync up on this without polluting the thread. |
@tannergooding Just did it |
|
OSX tests get build timed out. Other tests all passed. |
This PR updates the table-driven framework to support x86 imm-intrinsics. After this update, implementing imm-intrinsics just needs one-line definition in
hwintrinsiclistxarch.h.This PR implements certain imm-intrinsics which have special semanctics
Avx.Compare\CompareScalarjust accepts imm-value in 0-31. Out of range argument will triggerArgumentOutOfRangeException.Avx2\Sse2.ShiftLeftLogical\ShiftRightLogical\ShiftRightArithmeticnever throw any exceptionshift* ymm, ymm, xmm(orshift* xmm, xmm)form when the imm-arg is not a constant. For example,Avx2.ShiftLeftLogical(v, i)would generatevmovd xmm1, edi; vpsllw ymm0, ymm0, xmm1instead of the jumptable fallback.ShiftLeftLogical\ShiftRightLogicalwill be converted toSetZeroVector256\128on out of range imm-arg.ShiftRightArithmetic's imm-arg will be normalized to the maximum valid value on out of range imm-arg.Avx2\Sse2.ShiftLeftLogical128BitLanewill be converted toSetZeroVector256\128on out of range imm-arg.