Handle addressing modes for HW intrinsics#22944
Conversation
| genUpdateLife(tree); | ||
| } | ||
| #ifdef FEATURE_HW_INTRINSICS | ||
| else if (tree->OperIs(GT_HWIntrinsic)) |
There was a problem hiding this comment.
GT_HWIntrinsic is already handled a few lines below.
There was a problem hiding this comment.
Thanks, I missed that. I don't believe that we have (or will have) any Arm64 intrinsics that can be contained, so I'll remove the one further down.
There was a problem hiding this comment.
Did this get addressed? I don't see the matching removal
There was a problem hiding this comment.
It's a bit hard to see, with the way the diffs are shown, but if you look below, this line was deleted:
else if (tree->OperIsHWIntrinsic())
and the genConsumeReg below it is actually the one for the OperIsInitVal case.
| genConsumeAddress(op1); | ||
| // Until we improve the handling of addressing modes in the emitter, we'll create a | ||
| // temporary GT_IND to generate code with. | ||
| GenTreeIndir load = indirForm(node->TypeGet(), op1); |
There was a problem hiding this comment.
Meh, and I just got rid of the last usage of indirForm in XARCH codegen in a recent PR 😞 . I'll clean this up once the 3.0 release storm passes.
There was a problem hiding this comment.
Yes, after looking at your change here: mikedn@af3df85#diff-04ea18d1ee0bf7efd98adb6840a48369 I seriously considered waiting for and/or incorporating those changes, but decided to take the expedient path so that this stands a better chance of getting into 3.0.
| // rmOp -- The second operand, which may be a memory node or a node producing a register | ||
| // ival -- The immediate operand | ||
| // | ||
| void emitter::emitIns_R_RM_I(instruction ins, emitAttr attr, regNumber reg1, GenTree* rmOp, int ival) |
There was a problem hiding this comment.
I wouldn't put this in the emitter. Ideally the emitter interface should not deal with GenTree, only instructions, registers and immediates/displacements.
There was a problem hiding this comment.
You could probably just update genHWIntrinsic_R_RM_I():
coreclr/src/jit/hwintrinsiccodegenxarch.cpp
Line 528 in 407cc7d
There was a problem hiding this comment.
Hmm, it actually looks like most of this logic was pulled from the previous impl of genHWIntrinsic_R_RM_I
There was a problem hiding this comment.
I extracted this from genHWIntrinsic_R_RM_I because it is not specific to HW intrinsics. And historically the emitter has dealt with GenTree nodes for addresses. That said, I take the point that it should be in CodeGen instead of in emitter. I'll make that change.
There was a problem hiding this comment.
I think most of the genHWIntrinsic_R_... methods are probably more "general-purpose" than HWIntrinsic specific.
That is, outside some asserts and the explicit code path for containedOp->OperIsHWIntrinsic(), they are just taking a node and determining which emitter call to make (R, AR, C, A, S, etc).
It might be worth refactoring all of these to be more generally usable?
There was a problem hiding this comment.
The various "RM" handling functions and the relevant emitter interface functions are pretty messed up as is. It will take a significant effort to address all that and it certainly doesn't belong in this PR. For now just avoiding adding more stuff to the emitter interface should be enough. It already suffers from duplication, bad function and parameter names etc.
There was a problem hiding this comment.
It will take a significant effort to address all that and it certainly doesn't belong in this PR.
Completely right, I was more asking as a potential future work item.
There was a problem hiding this comment.
As it turns out, extracting the HWIntrinsic-dependencies from this was a bit problematic, so although I moved to to CodeGen::inst_RV_TT_IV(), it is still under #ifdef FEATURE_HW_INTRINSICS.
| return; | ||
| } | ||
| if (op1->OperIs(GT_LIST)) | ||
| { |
There was a problem hiding this comment.
nit: assert(node->gtGetOp2() == nullptr) when op1 is a list?
| assert(node->gtGetOp2() == nullptr); | ||
| return; | ||
| } | ||
| if (op1->OperIs(GT_LIST)) |
There was a problem hiding this comment.
I don't think we have a convention for this, but the OperIsXXX methods pre-date @mikedn's addition of the OperIs method, which I tend to prefer since it makes it clear that you are checking for one or more specific operators. I prefer the OperIsXXX when it is used to check for a "class" of operators. cc @dotnet/jit-contrib for opinions on usage of these.
There was a problem hiding this comment.
I wouldn't bother too much with this. Hopefully we'll just get rid of GT_LIST. I have ~75% of it done, just waiting for intrinsics to finally stabilize so I don't have to keep fixing conflicts.
| // None. | ||
| // | ||
|
|
||
| void CodeGen::genConsumeHWIntrinsicOperands(GenTreeHWIntrinsic* node) |
There was a problem hiding this comment.
Might be nice to centralize the asserting of numArgs to number of node operands here as well?
| assert((ival >= 0) && (ival <= 127)); | ||
| genHWIntrinsic_R_R_RM_I(node, ins, ival); | ||
| } | ||
| else if (category == HW_Category_MemoryLoad) |
There was a problem hiding this comment.
Why isn't this case genConsumeAddress like the other loads?
There was a problem hiding this comment.
It's sort of arbitrary. genConsumeRegs will correctly handle an address. genConsumeAddress is useful if you already know you have an address, because it does fewer checks. This is distinct from genConsumeReg which is only used if a node is known to produce a single register.
| op1Reg = op1->gtRegNum; | ||
| genConsumeOperands(node); | ||
| } | ||
| genConsumeHWIntrinsicOperands(node); |
There was a problem hiding this comment.
Shouldn't there be some additional removals of genConsumeReg from the below (for this and similar edits)?
There was a problem hiding this comment.
I didn't transform all the methods in this file to use genConsumeHWIntrinsicOperands - only the ones where it was specifically beneficial to abstract out the possibility that they might be contained memory ops/addresses. That may be a useful future change, but I wouldn't necessarily do that with this PR.
|
test OSX10.12 x64 Checked Innerloop Build and Test |
|
@dotnet/jit-contrib PTAL |
| // Until we improve the handling of addressing modes in the emitter, we'll create a | ||
| // temporary GT_IND to generate code with. | ||
| GenTreeIndir load = indirForm(node->TypeGet(), op1); | ||
| emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load); |
There was a problem hiding this comment.
The old code was using emitIns_R_AR, that one has a Is4ByteSSEInstruction check that increases the instruction size by 2. emitInsLoadInd does not seem to have such a check.
There was a problem hiding this comment.
Right - I took a quick look at that when you made that comment earlier, and I wasn't sure it was actually required. I need to take another look.
There was a problem hiding this comment.
So far I believe that this 2-byte increment should actually be 1 byte. That said, there is definitely inconsistency in how the sizes are determined, and the code allows the size to be over-estimated, so those errors go undetected. I'm trying to decide whether to make a more general cleanup here or just go with the 1 byte increment for this case.
There was a problem hiding this comment.
I thought we had a work item tracking the cleanup of emitInsSize, but I can't seem to find it.
From what I remember, these one offs were done because it was non-trivial to update the existing emitInsSize checks due to other complications at the time. I think it is +2 to handle the non-VEX encoded case but only needs to be +1 for the VEX encoding.
There was a problem hiding this comment.
I believe it's the case that for these addressing-mode instructions, the SSE encodings require one additional bit (and the VEX encodings don't require an additional bytes). I've run all the hw intrinsic tests without AVX & AVX2 with the most recent set of changes.
The size analysis is so convoluted that I am mostly relying on empirical results (i.e. testing).
There was a problem hiding this comment.
I added a note to https://github.com/dotnet/coreclr/issues/23006 that this should also be addressed.
| else | ||
| { | ||
| regNumber rmOpReg = rmOp->gtRegNum; | ||
| getEmitter()->emitIns_SIMD_R_R_I(ins, attr, reg1, rmOpReg, ival); |
There was a problem hiding this comment.
This isn't really SIMD-specific (and should probably be moved), but it's the method that will do a copy to the target if needed for the RMW form.
|
@dotnet-bot test Ubuntu x64 Checked CoreFX Test |
|
@dotnet-bot test Ubuntu x64 Checked CoreFX Tests |
|
@dotnet/jit-contrib @tannergooding @fiigii PTAL |
| GenTree* op1 = node->gtGetOp1(); | ||
| if (op1 == nullptr) | ||
| { | ||
| assert((numArgs == 0) && (node->gtGetOp2() == nullptr)); |
There was a problem hiding this comment.
nit: It is generally nice to break && asserts into two separate ones. That makes it much clearer which assert is the problematic one when they are hit.
There was a problem hiding this comment.
I generally disagree. Asserts already clutter the code, and though they are useful, they should impact the readability & flow of the code as little as possible, IMO.
There was a problem hiding this comment.
And this is another bit of code that will go away when I get rid of GT_LIST...
| { | ||
| assert((ival >= 0) && (ival <= 127)); | ||
| genHWIntrinsic_R_RM_I(node, ins, (int8_t)ival); | ||
| genConsumeAddress(op1); |
There was a problem hiding this comment.
I believe you mentioned that genConsumeOperands will also handle the address case, but genConsumeAddress is more efficient.
My question then would be: Is it actually worth doing explicit genConsumeAddress calls here? It looks like it makes the overall logic slightly more complex...
There was a problem hiding this comment.
In what way does it make the logic more complex? It's probably a matter of taste, but this case needs special handling anyway, and it seems clearest to explicitly consume an address.
There was a problem hiding this comment.
It's probably a matter of taste
Possibly. I would have thought that the existing genConsumeOperands(node) call would have handled things, in which case the changes could have been minimized to just changing emit->emitIns_R_AR(ins, simdSize, targetReg, op1Reg, 0); to:
// Until we improve the handling of addressing modes in the emitter, we'll create a
// temporary GT_IND to generate code with.
GenTreeIndir load = indirForm(node->TypeGet(), op1);
emit->emitInsLoadInd(ins, simdSize, node->gtRegNum, &load);
This would have made the delta smaller and kept the nesting of the if blocks smaller. It also gives a single spot that operand consumption happens so future changes don't need to worry about it.
Likewise, down on L144, just the if block could have been added and the existing genConsumeOperands(node) would have handle both the memory store and non-memory store cases (rather than splitting this into 4 separate calls).
There was a problem hiding this comment.
Please avoid genConsumeOperands in SIMD and HWIntrinsic code, it's incompatible with my pending GT_LIST changes.
| // This is the aligned case. | ||
| Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4))); | ||
| // This is the unaligned case. | ||
| Sse2.StoreScalar(p + alignmentOffset + 1, Sse2.Subtract(v, Sse2.LoadVector128(p + offset + alignmentOffset + 1))); |
There was a problem hiding this comment.
The test cases look not enough. This PR also changes several Is4ByteSSEInstruction points, so I suggest adding a Sse41.LoadAlignedVector128NonTemporal test and an Vector256<T> load/store test.
In the future, it would be nice to update the intrinsic test template to cover more address modes.
|
@redknightlois - I plan to get this in for 3.0, though while waiting for the CI issues to resolve I've gotten side-tracked on other things. I should have an updated PR by the end of this week. |
Also, eliminate some places where the code size estimates were over-estimating. Contribute to #19550 Fix #19521
|
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) |
|
@dotnet-bot test Windows_NT arm Cross Checked Innerloop Build and Test |
|
|
||
| // Now do a non-temporal load from the aligned location. | ||
| // That should give us {0, 0, 1, 0}. | ||
| Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle(); |
There was a problem hiding this comment.
Right - of course! Forgot all about the checks!
| { | ||
| try | ||
| { | ||
| Avx.Store(p + 1, Avx.Subtract(v, Avx.LoadVector256(p + offset + 1))); |
|
@dotnet-bot test Windows_NT arm64 Cross Checked Innerloop Build and Test |
|
@dotnet/jit-contrib @tannergooding @fiigii PTAL |
|
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2 @dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2 @dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2 |
|
|
||
| // Now do a non-temporal load from the aligned location. | ||
| // That should give us {0, 0, 1, 0}. | ||
| Vector128<float> v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle(); |
There was a problem hiding this comment.
We need if (Sse41.IsSupported) for this case.
There was a problem hiding this comment.
Excellent, thanks. I split this out so that we do an Sse2 load when Sse41 isn't available.
|
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2 @dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2 @dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2 |
|
@dotnet/jit-contrib @tannergooding ping - I addressed @fiigii 's feedback on the test case, and fixed an error in the guard for it. I think this is ready to merge. |
* Handle addressing modes for HW intrinsics Also, eliminate some places where the code size estimates were over-estimating. Contribute to dotnet/coreclr#19550 Fix dotnet/coreclr#19521 Commit migrated from dotnet/coreclr@da6ed11
Contribute to #19550
Fix #19521