[Arm64] Implement Simd.Extract +#16085
sdmaclea
left a comment
There was a problem hiding this comment.
This implements the first intrinsic that takes a constant immediate operand. It handles the non-constant case by generating a switch table.
@CarolEidt @RussKeldorph @tannergooding PTAL
@dotnet/arm64-contrib @dotnet/jit-contrib FYI
| inst_JMP(EJ_jmp, labelBreakTarget); | ||
| } | ||
| genDefineTempLabel(labelBreakTarget); | ||
| } |
There was a problem hiding this comment.
This genHWIntrinsicSwitchTable is intended to be reusable for any intrinsic requiring a switch table. It is designed to work with single instruction intrinsics, so the case spacing is hard coded to two instructions (8 bytes).
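To make the fixed case spacing concrete, here is a minimal sketch of the address arithmetic it implies. It is a model only, not the JIT code: the constant names and `caseAddress` are hypothetical, and the two-instruction layout (one intrinsic instruction plus one branch to the break label) is taken from the comment above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: Arm64 instructions are 4 bytes each, and every case is
// assumed to be two instructions (the intrinsic, then a branch to the break label).
constexpr std::uint64_t kInstructionBytes    = 4;
constexpr std::uint64_t kInstructionsPerCase = 2;
constexpr std::uint64_t kCaseStrideBytes     = kInstructionBytes * kInstructionsPerCase; // 8 bytes

// Address of the code for a given case. No lookup table is needed because
// every case occupies the same number of bytes.
constexpr std::uint64_t caseAddress(std::uint64_t firstCaseAddress, std::uint64_t caseIndex)
{
    return firstCaseAddress + caseIndex * kCaseStrideBytes;
}
```

With a first case at `0x1000`, case 3 would start 24 bytes later, at `0x1018`.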
There was a problem hiding this comment.
Why hardcode it, instead of defining a label per offset, so it can be dynamically sized, if required?
That is what I did for the x86 codegen (https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L655).
I do like the approach to making it reusable 😄
There was a problem hiding this comment.
> Why hardcode it, instead of defining a label per offset, so it can be dynamically sized, if required?
Arm64 instructions are a fixed size, so hardcoding makes sense. A single instruction per case is the expected use case. (I would add a case instruction count parameter if needed.)
A label per offset would require more complexity: either a separate direct branch table so the target address could be calculated, or an if/else compare-and-branch chain. Or ...
There was a problem hiding this comment.
Looks like you chose the separate direct branch table.
There was a problem hiding this comment.
The generated code will be smaller, since the jump table is not needed.
Certainly, if variable-sized cases were needed, we could add the jump table. Right now I expect homogeneous cases of single instructions.
There was a problem hiding this comment.
> Looks like you chose the separate direct branch table
Yes, x86 instructions can potentially range in size from 1 byte to 16+ bytes (depending on prefixes, encoding, etc).
Although, within a given HWIntrinsic jump table, they will likely be fairly consistent.
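For contrast with the fixed-stride approach, a label-per-case table can be modeled as a list of start offsets computed from each case's encoded size. This is a sketch under stated assumptions: `buildCaseOffsets` is a hypothetical helper, not the x86 codegen linked above, and the sizes are plain byte counts supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a "label per case" table: given the encoded size of
// each case body in bytes, record the offset at which each case starts.
std::vector<std::size_t> buildCaseOffsets(const std::vector<std::size_t>& caseSizes)
{
    std::vector<std::size_t> offsets;
    offsets.reserve(caseSizes.size());

    std::size_t next = 0;
    for (std::size_t size : caseSizes)
    {
        offsets.push_back(next); // this case starts where the previous one ended
        next += size;
    }
    return offsets;
}
```

The table costs one entry per case, which is the extra data the fixed-stride scheme avoids when every case has the same size.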
There was a problem hiding this comment.
It seems that for Arm64 this approach is best. If it were found to be necessary, it could be parameterized based on the instruction (though I don't see us needing this for anything that would require more than a single instruction per case).
| int lanes = emitTypeSize(simdType) / baseTypeSize; | ||
| auto emitSwCase = [&](int lane) { |
There was a problem hiding this comment.
Using a lambda to populate instructions in the switch table
| { | ||
| int lane = op2->AsIntConCommon()->IconValue(); | ||
| emitSwCase(lane); |
There was a problem hiding this comment.
Lambda used even when not generating the switch.
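The pattern in the two comments above, one lambda serving both the constant path and the switch-table path, can be sketched as follows. This is an illustration only: `emitExtract` and the string "instructions" are hypothetical stand-ins for the emitter, and the per-case branch is omitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the emitter: records one string per "instruction".
std::vector<std::string> emitExtract(int elements, bool indexIsConstant, int constantIndex)
{
    std::vector<std::string> emitted;

    // One case body, shared by both paths.
    auto emitSwCase = [&](int element) {
        assert(element >= 0 && element < elements);
        emitted.push_back("extract element " + std::to_string(element));
    };

    if (indexIsConstant)
    {
        emitSwCase(constantIndex); // constant index: a single instruction
    }
    else
    {
        for (int i = 0; i < elements; ++i) // non-constant index: one case per element
        {
            emitSwCase(i);
        }
    }
    return emitted;
}
```

For a 16-byte vector of 4-byte elements, `elements` would be 16 / 4 = 4, matching the `emitTypeSize(simdType) / baseTypeSize` computation in the diff.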
Tested with new tests added to #16008
| if (op2->isContainedIntOrIImmed()) | ||
| { | ||
| int lane = op2->AsIntConCommon()->IconValue(); |
There was a problem hiding this comment.
src\jit\codegenarm64.cpp(5147): warning C4244: 'initializing': conversion from 'ssize_t' to 'int', possible loss of data [D:\j\workspace\x64_checked_w---76911707\bin\obj\Windows_NT.x64.Checked\src\jit\protononjit\protononjit.vcxproj]
98f3450 to
e1e2bb4
10ae965 to
0cf806a
Simple rebase to fix merge conflict.

@CarolEidt ping. Can this be reviewed/merged?
CarolEidt
left a comment
There was a problem hiding this comment.
I would like to see an assert verifying the single-instruction requirement, and function headers are missing from even some of the pre-existing methods.
| genProduceReg(node); | ||
| } | ||
| template <typename HWIntrinsicSwitchCaseBody> |
There was a problem hiding this comment.
This needs a function header. See https://github.com/dotnet/coreclr/blob/master/Documentation/coding-guidelines/clr-jit-coding-conventions.md#94-function-header-comment
I would in particular describe the scenario this supports, as it may be confusing at first why we have a HW intrinsic switch.
| genDefineTempLabel(labelFirst); | ||
| for (int i = 0; i < swMax; ++i) | ||
| { | ||
| emitSwCase(i); |
There was a problem hiding this comment.
Since you are assuming that this generates a single instruction, you might do something like:
    // This code assumes that emitSwCase() generates a single instruction.
    unsigned prevInsCount = getEmitter()->emitInsCount;
    for (int i = 0; i < swMax; ++i)
    {
        emitSwCase(i);
        unsigned newInsCount = getEmitter()->emitInsCount;
        assert(newInsCount == (prevInsCount + 1));
        prevInsCount = newInsCount;
    }
I think that emitInsCount is public.
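The suggested check can be tried outside the JIT with a mock emitter. The sketch below is an assumption-laden model: `MockEmitter` and `eachCaseEmitsOneInstruction` are hypothetical names, and the function returns a bool instead of asserting so that the failing case can be observed.

```cpp
#include <cassert>

// Hypothetical mock emitter that only counts emitted instructions.
struct MockEmitter
{
    unsigned emitInsCount = 0;
    void emitIns() { ++emitInsCount; }
};

// Calls emitSwCase for each case and returns true only if every call
// emitted exactly one instruction.
template <typename CaseBody>
bool eachCaseEmitsOneInstruction(MockEmitter& emitter, int swMax, CaseBody emitSwCase)
{
    unsigned prevInsCount = emitter.emitInsCount;
    for (int i = 0; i < swMax; ++i)
    {
        emitSwCase(i);
        unsigned newInsCount = emitter.emitInsCount;
        if (newInsCount != prevInsCount + 1)
        {
            return false; // this case emitted zero or several instructions
        }
        prevInsCount = newInsCount;
    }
    return true;
}
```

A case body that emits two instructions would break the hard-coded spacing, and this check reports it on the first offending case.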
| int lanes = emitTypeSize(simdType) / baseTypeSize; | ||
| auto emitSwCase = [&](int lane) { | ||
| assert(lane >= 0); |
There was a problem hiding this comment.
To me, lane is not very mnemonic, though others may disagree. Something like caseImmediate or caseImm?
There was a problem hiding this comment.
@CarolEidt
The immediate is the vector element lane for this instruction.
caseImm ... feels too generic.
lane and lanes could become:
- element & elements
- vectorElement & vectorElements
- vectorIndex & vectorLength
Preference?
There was a problem hiding this comment.
Right - I hadn't really internalized that this is the specific case (not the general case as in genHWIntrinsicSwitchTable()). I think element and elements would be good.
| bool is16Byte = (node->gtSIMDSize > 8); | ||
| emitAttr attr = is16Byte ? EA_16BYTE : EA_8BYTE; | ||
| // Arm64 has three bit select forms each use three source registers |
There was a problem hiding this comment.
nit: I would add a ';' after "forms":
// Arm64 has three bit select forms; each use three source registers
0cf806a to
340f055
Adding unsigned compare zero lowering. Unsigned compare zero tests are now all passing. PTAL
| //------------------------------------------------------------------------ | ||
| // genHWIntrinsicSimdBinaryOp: | ||
| // | ||
| // Produce code for a GT_HWIntrinsic node with form SimdBinaryOp. |
There was a problem hiding this comment.
Does the SIMD size matter?
There was a problem hiding this comment.
I just wanted to make sure, since there will be Vector64 and Vector128 types, and eventually Vector256+ (if SVE is supported).
| // need to generate functionally correct code when the operand is not constant | ||
| // | ||
| // This is required by the HW Intrinsic design to handle: | ||
| // debugger calls |
There was a problem hiding this comment.
It might be better to list this as: to handle indirect calls, such as:
| // op1 is the first operand | ||
| // op2 is the second operand | ||
| // op3 is the third operand | ||
| op3 = impSIMDPopStack(simdType); |
There was a problem hiding this comment.
It might be good to have a general helper method for popping/validating the types, as was requested/done for x86.
There was a problem hiding this comment.
I am not sure why. Each form has different requirements. Not sure how a helper would help.
Maybe impScalarPopStack()
There was a problem hiding this comment.
x86 has this method: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsicxarch.cpp#L276
Which validates the type of a struct is a SIMD Type and the type of a scalar matches what the signature expects.
There was a problem hiding this comment.
I believe @CarolEidt was the one who requested we validate the type of the scalar values as well, rather than just calling impPopStack().val
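The validation described above can be sketched roughly as follows. This is not the x86 helper itself: `VarType`, `isSimdType`, and `isValidOperand` are hypothetical names, and the type list is a small assumed subset.

```cpp
#include <cassert>

// Hypothetical type tags; the real JIT uses its own type enumeration.
enum class VarType { Int, Long, Float, Simd8, Simd16 };

inline bool isSimdType(VarType t)
{
    return t == VarType::Simd8 || t == VarType::Simd16;
}

// Returns true when a popped operand is acceptable: struct operands must be
// SIMD types, and scalar operands must match the type the signature expects.
inline bool isValidOperand(VarType actual, VarType expected, bool expectStruct)
{
    return expectStruct ? isSimdType(actual) : (actual == expected);
}
```

A caller would run this check on each value as it is popped, rather than accepting whatever is on the stack.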
There was a problem hiding this comment.
I was assuming we relied on the C# (et al.) compiler to validate this.
Perhaps this is to check hand-written IL.
Or perhaps it is just defensive programming.
In any case I think this can safely be a separate PR.
There was a problem hiding this comment.
> Or perhaps it is just defensive programming,
Yes, that's the idea (that's true of many of the asserts and checks in the JIT, but it's surprising how often the "obvious" checks catch a problem).
> In any case I think this can safely be a separate PR.
Me too
| enum Flags | ||
| { | ||
| None | ||
| None = 0, |
There was a problem hiding this comment.
It is inside HWIntrinsicInfo so it is HWIntrinsicInfo::None
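As a small illustration of the scoping point, an unscoped enum nested in a struct is named through the struct, and giving `None` an explicit value of 0 keeps it distinct from any flag bit. The struct below is a cut-down, hypothetical version; the `LowerCmpUZero` name appears elsewhere in this PR, but its value here is assumed.

```cpp
#include <cassert>

// Hypothetical, cut-down version of an info struct with a nested flags enum.
struct HWIntrinsicInfo
{
    enum Flags
    {
        None          = 0,
        LowerCmpUZero = 1 << 0, // value assumed for illustration
    };

    int flags;
};

// A flag is set when its bit survives the mask.
inline bool hasFlag(const HWIntrinsicInfo& info, HWIntrinsicInfo::Flags flag)
{
    return (info.flags & flag) != 0;
}
```

Outside the struct the enumerators are written `HWIntrinsicInfo::None` and `HWIntrinsicInfo::LowerCmpUZero`, so they do not collide with other names called `None`.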
| GenTree* op1 = intrinsicTree->gtOp.gtOp1; | ||
| GenTree* op2 = intrinsicTree->gtOp.gtOp2; | ||
| if (op1->OperIs(GT_LIST)) |
There was a problem hiding this comment.
It is an explicit helper method for checking OperIs(GT_LIST). We have a few of them and they look to be the most prevalent in the codebase (or at least the most prevalent in the code I've touched so far).
There was a problem hiding this comment.
Why? OperIs() is pretty standard in Arm64 lower.
There was a problem hiding this comment.
FWIW I don't think we have guidelines or even a lot of consistency on this. OperIs() is very useful for making checks for multiple oper values more concise and easier to read. For a single value, I think either is fine.
There was a problem hiding this comment.
I don't have a strong preference either, I was just mostly wondering why one over the other was used here.
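The conciseness argument for a multi-value `OperIs()` can be seen in a minimal sketch. The `Node` type and its operator list below are hypothetical and much smaller than the JIT's `GenTree`; only the shape of the variadic check is the point.

```cpp
#include <cassert>

// Hypothetical, minimal node type; the operator names are illustrative.
enum Oper { GT_LIST, GT_ADD, GT_SUB, GT_CNS_INT };

struct Node
{
    Oper oper;

    // Single-value check.
    bool OperIs(Oper o) const { return oper == o; }

    // Variadic overload: true if the node's operator matches any argument.
    template <typename... Rest>
    bool OperIs(Oper o, Rest... rest) const
    {
        return (oper == o) || OperIs(rest...);
    }
};
```

`node.OperIs(GT_ADD, GT_SUB)` replaces a chain of `==` comparisons joined with `||`; for a single value the two spellings are equivalent.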
| } | ||
| else | ||
| { | ||
| info->srcCount += GetOperandInfo(op1); |
There was a problem hiding this comment.
Is ARM not going to have 0 operand nodes or just not yet?
There was a problem hiding this comment.
I do not see a reason. Unless we need to support barriers.
There was a problem hiding this comment.
> Unless we need to support barriers.
x86 added StoreFence, LoadFence, and MemoryFence. It also has a couple of helper methods (such as Sse.SetZeroVector128) which take 0 args.
| auto intrinsicID = node->gtHWIntrinsicId; | ||
| auto intrinsicInfo = comp->getHWIntrinsicInfo(node->gtHWIntrinsicId); | ||
| if ((intrinsicInfo.flags & HWIntrinsicInfo::LowerCmpUZero) && varTypeIsUnsigned(node->gtSIMDBaseType)) |
There was a problem hiding this comment.
This could really use a conspicuous comment above it. This is all about handling unsigned, and it's easy to miss that.
There was a problem hiding this comment.
@sdmaclea - I'd love to see the additional comment, but I'm OK with merging now and you can add with another PR. Let me know.
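The diff does not show how the lowering is done, so the following is only an illustration of why unsigned comparisons against zero are a special case: two of the four orderings are constant, and the other two reduce to an equality test. The `Cmp` enum and `compareUnsignedWithZero` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

enum class Cmp { GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual };

// Evaluate "value <cmp> 0" for an unsigned value using only the forms that
// remain meaningful: always true, always false, not equal to zero, equal to zero.
constexpr bool compareUnsignedWithZero(std::uint32_t value, Cmp cmp)
{
    return (cmp == Cmp::GreaterThanOrEqual) ? true           // unsigned >= 0 always holds
         : (cmp == Cmp::LessThan)           ? false          // unsigned < 0 never holds
         : (cmp == Cmp::GreaterThan)        ? (value != 0)   // > 0 means non-zero
                                            : (value == 0);  // <= 0 means zero
}
```

A signed comparison would treat `0xFFFFFFFF` as negative, which is why the base type has to be checked before choosing the instruction.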
tannergooding
left a comment
There was a problem hiding this comment.
This LGTM as well.
Just had a few questions about the differences between this and the x86 implementation.
@CarolEidt @tannergooding Pushed a final comment patch. Based on the 340f055 test results above, this could be merged once format checks pass.

All checks passed. This can be merged.

@sdmaclea, Thanks!