-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Do not mark op2 as delayRegFree if op1==op2 #53964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
With Benchmarks.run.windows.x64.checkedDetail diffsDetail diffsCoreclr_tests.pmi.windows.x64.checked.1Detail diffsDetail diffsLibraries.pmi.windows.x64.checkedDetail diffsDetail diffs |
|
@dotnet/jit-contrib |
sandreenko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/coreclr/jit/lsraxarch.cpp
Outdated
|
|
||
| srcCount += BuildDelayFreeUses(op2); | ||
| // Unless, op1 and op2 are same, in which case we can overwrite op2. | ||
| if (GenTree::NodesAreEquivalentLeaves(op1, op2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there other RMW locations setting BuildDelayFreeUses that could/should also be updated? Is this already handled on the scalar operation (that is int + int or float - float) code path, for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd imagine, for example, the check a few lines down dealing with making BuildDelayFreeUses(op3) would also benefit from this check: https://github.com/dotnet/runtime/pull/53964/files#diff-9626112837daf480c93d401a587012b3e398dd90a953c870797d472cff36839dR2510
Maybe the right fix is to call this overload of BuildDelayFreeUses:
runtime/src/coreclr/jit/lsrabuild.cpp
Line 3059 in c2c92c4
| int LinearScan::BuildDelayFreeUses(GenTree* node, GenTree* rmwNode, regMaskTP candidates) |
It looks to already do some checks on
rmwNode vs the node being set as delayFree and might be a natural place to add this NodesAreEquivalentLeaves check, if its generally applicable?
Edit: It looks like its not an overload, just one method where rmwNode defaults to nullptr and where many places (at least for HWIntrinsics) aren't passing in the rmwNode, likely because it (the rmwNode parameter) was added ~6 months back: #45135
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there other RMW locations setting BuildDelayFreeUses that could/should also be updated?
I am not super familiar with the RMW semantics. I can sync-up offline to understand which other scenarios can be benefitted.
and where many places (at least for HWIntrinsics) aren't passing in the rmwNode,
That's right, at most places, we pass rmwNode as nullptr and I am not 100% sure the impact of refactoring it such that we can fit in NodesAreEquivalent() method inside it. For this bug, I would keep it where it is currently.
a few lines down dealing with making BuildDelayFreeUses(op3) would also benefit from this
Can you give an example of what is the semantics of rmw for this and which 2 nodes I should be checking for equivalence?
Is this already handled on the scalar operation (that is int + int or float - float) code path, for example?
Again, could you give an example for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not super familiar with the RMW semantics.
RMW (Read-Modify-Write) just means one of the sources is also the destination and so is "destructive" (the value isn't preserved).
This means that if the value needs to be preserved (generally because src1 is not "last use") that you need an additional mov to copy the value to the destination:
; Operation is logically `dst = src1 + src2`
mov dst, src1
add dst, src2Because of this, it also means that dst != src2 since, if it was, the move would overwrite src2 and change the logical operation. BuildDelayFreeUses exists to cover this scenario by ensuring that dst != src2:
; If `dst == src2 && src1 != src2`, operation becomes `dst = src1 + src1` and so is problematic
mov dst, src1
add dst, src2All of this works great when src1 is last use and when src1 != src2 because this means the register allocator can make dst == src1 and we can elide the copy generating just:
; Operation is logically `src1 += src2`
add src1, src2However, in certain cases such as when src1 == src2, setting src2 to be "delay free" is problematic because this forces the register allocator to configure it as dst != src2, but since src1 == src2 this also means the register allocator configures dst != src1, thus forcing us to generate a move. In practice however, it is safe for this exact scenario to not be delay free because we would never overwrite either input.
The PR you are providing here "fixes" the issue by avoiding us setting "delay free" when rmwOp == delayFreeOp (this is safe since it means we won't ever overwrite delayFreeOp). This "should be" applicable to all RMW setups whether its specifically this SIMD example or even other examples like integer additions and so I'd think that we want to make the support for this "broader" so we can always generate the more efficient code.
As far as I can tell, the fix you have here should work for any BuildDelayFreeUses call and so I think we could just extend most calls to BuildDelayFreeUses to pass in rmwOp when it exists and for it to do the if (GenTree::NodesAreEquivalentLeaves(delayFreeOp, rmwOp)) { /* dont set delay free */ } else { /* set delay free */ }
- For the
2-operandRMW instructions (likeadd,sub,mul,div, etc) this is what you are already doing for this specific SIMD case. - It should likewise extend to
3-operand(or more) RMW instructions (likefmawhere you havedst = (src1 * src2) + src3). In the three operand case, eliding the move requires:(dst == src1) && (dst != src2) && (dst != src3). However the latter two restrictions (which are achieved by setting "delay free") can be relaxed whensrc1 == src2or whensrc1 == src3, respectively (since whenrmwOp == delayFreeOp, we can't overwrite thedelayFreeOp).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that I'm not super familiar with the register allocator logic myself, just stating logically how I'd expect this to be handled, and so it isn't clear to me if this is already "being handled" by the following check:
runtime/src/coreclr/jit/lsrabuild.cpp
Line 3100 in c2c92c4
| if ((use->getInterval() != rmwInterval) || (!rmwIsLastUse && !use->lastUse)) |
If it is, it might be simple enough to just ensure we pass in rmwOp everywhere and the right things will happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the wonderful explanation @tannergooding . I think I heard from Carol about delayRegFree, but your concrete examples helped me understand it more clearly. Regarding, RMW (last I heard was when working on Arm64), I thought that only certain instructions fall under that category (and that's why was asking which one are those), but looks like it is applicable to even GP register cases like add. I will investigate and try to come up with "broader" solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, from what I understand, rmwOp is essentially the one that is acting as source as well as destination operand, while other operand (e.g. node parameter in BuildDelayFreeUses()) refers to other operand of that operation. I need to look deeper, but what determines what should be node parameter to BuildDelayFreeUses()? Assuming op1 is always considered as potential source and destination for 2/3/4 operand cases (and hence it is rmwNode), but for 3+ operand cases, should node be op2 or op3 or it depends on the GenTree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but looks like it is applicable to even GP register cases like add
Right. On ARM64 most instructions are not RMW, that is they can separately encode dst, src1, and src2. Only a handful of instructions end up being RMW and many of them are more advanced SIMD instructions so its a rarity to deal with.
x86/x64 on the other hand has most instructions being RMW and outside some newer instructions or using the VEX encoding, its something that codegen frequently has to deal with.
Also, from what I understand, rmwOp is essentially the one that is acting as source as well as destination operand, while other operand (e.g. node parameter in BuildDelayFreeUses()) refers to other operand of that operation
Right. rmwOp is the operand that is a source but also the destination.
Assuming op1 is always considered as potential source and destination for 2/3/4 operand cases (and hence it is rmwNode), but for 3+ operand cases, should node be op2 or op3 or it depends on the GenTree?
We have to build uses for all operands and so typically we:
- Set the
rmwOp(which is typically, but not always,op1) to betgtPrefUse:runtime/src/coreclr/jit/lsraxarch.cpp
Line 2455 in c2c92c4
tgtPrefUse = BuildUse(op1); - For each other operand, we call
BuildDelayFreeUses:andruntime/src/coreclr/jit/lsraxarch.cpp
Line 2491 in c2c92c4
srcCount += BuildDelayFreeUses(op2); runtime/src/coreclr/jit/lsraxarch.cpp
Line 2510 in c2c92c4
srcCount += isRMW ? BuildDelayFreeUses(op3) : BuildOperandUses(op3);
There are many examples of this throughput lsraxarch.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the 4 operand examples is here:
runtime/src/coreclr/jit/lsraxarch.cpp
Lines 2422 to 2425 in c2c92c4
| srcCount += BuildOperandUses(op1); | |
| srcCount += BuildDelayFreeUses(op2); | |
| srcCount += BuildDelayFreeUses(op3); | |
| srcCount += BuildDelayFreeUses(op4); |
and it would be nice if we could make this something like:
srcCount += BuildOperandUses(op1);
srcCount += BuildDelayFreeUses(op2, op1);
srcCount += BuildDelayFreeUses(op3, op1);
srcCount += BuildDelayFreeUses(op4, op1);Rather than needing to do something like the following:
srcCount += BuildOperandUses(op1);
srcCount += GenTree::NodesAreEquivalentLeaves(op2, op1) ? BuildOperandUses(op2) : BuildDelayFreeUses(op2, op1);
srcCount += GenTree::NodesAreEquivalentLeaves(op3, op1) ? BuildOperandUses(op3) : BuildDelayFreeUses(op3, op1);
srcCount += GenTree::NodesAreEquivalentLeaves(op4, op1) ? BuildOperandUses(op4) : BuildDelayFreeUses(op4, op1);(assuming there isn't some complexity in the register allocator preventing this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding - once again thanks for the insights. The change was easier than I thought because BuildDelayFreeUses() captures those scenarios.
|
Here are some improvements with Benchmarks.run.windows.x64.checkedDetail diffsLibraries.crossgen2.windows.x64.checkedDetail diffsLibraries.pmi.windows.x64.checkedDetail diffsHere are some improvements with Libraries.crossgen2.windows.x64.checkedDetail diffsLibraries.crossgen2.windows.x86.checkedDetail diffs |
tannergooding
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. It would probably be good for @echesakovMSFT to give a second pair of eyes over the ARM64 code
|
It would probably be also good to run the ISA stress tests |
|
Failures are unrelated. |
src/coreclr/jit/lsraarm64.cpp
Outdated
| varNum3 = intrin.op3->AsLclVar()->GetLclNum(); | ||
| op1LastUse |= ((varNum1 == varNum3) && intrin.op3->HasLastUse()); | ||
| varNum2 = intrin.op2->AsLclVar()->GetLclNum(); | ||
| assert((varNum1 == varNum2) && intrin.op2->HasLastUse()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole block under ifdef DEBUG doesn't make sense to me - why are we asserting that for Vector64.GetElement(vector, index) and Vector128.GetElement(vector, index) their operands correspond the same local? This will never be true - since op1 is a SIMD value and op2 is a int.
In fact, I don't think it will ever execute - neither NI_Vector64_GetElement nor NI_Vector128_GetElement is a RMW intrinsic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I misread the code thinking that only way assert(!op2DelayFree); in previous code would get triggered if (isRMW == true) && (varNum1 == varNum2) but didn't realize that it can be because isRMW == false. I will remove the #ifdef DEBUG block.
echesakov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Can you please publish jit-diff results for arm64?
No diffs, because arm64 was already doing the right thing. I just refactored the code to handle it inside |
Thanks for confirming! |
If
op1 == op2, do not markop2as delayFree so we can reuse the register.Fixes: #9896