Enable unrolling of SIMD_LIMIT loops #8001
Conversation
@briansull, @dotnet/jit-contrib PTAL. This is likely easier to review by looking at the individual commits. In the added benchmark which is a micro benchmark copy of the key loop from aspnet/KestrelHttpServer#1138, the speedups for various indices from a local run are:
Force-pushed from 80c5fd7 to b2df59e
Also: DDR was clean with stress mode forced on.
Force-pushed from b2df59e to 5a54a12
Ping. @AndyAyersMS or @pgavlin maybe?
@dotnet-bot test Windows_NT x64 Debug Build and Test
// Visit loops from highest to lowest number to visit them in innermost
// to outermost order
for (unsigned lnum = optLoopCount - 1; lnum != ~0U; --lnum)
Good thing unsigned integer wraparound is defined in C++...
if (!BasicBlock::CloneBlockState(this, newBlock, block, lvar, lval))
{
    /* Stop if we've reached the end of the loop */
    // cloneExpr doesn't handle everything
Is there any intermediate information left over that we need to restore? I'm thinking in particular about lclVar ref counts.
I don't think so; the IR gets inserted, it's just unreachable.
So it's still represented in the BB list?
Yeah, it's in newBlock which is set to the return from fgNewBBafter a few lines above.
No, I snip it out just below (sorry, this went through a few iterations). I was just following suit from the previous behavior here. Did I do that wrong, or do you think there's a bug before & after? DDR passed with unroll stress... what is the ref count invariant, and are we sure it's expected to hold here? Loop unrolling runs just before lvaMarkLocalVars.
"Loop unrolling runs just before lvaMarkLocalVars."

Okay, then we are probably fine w.r.t. ref counts: I think that phase will recalculate them.
loopList = loopLast = nullptr;
/* Create the unrolled loop statement list */
{
    BlockToBlockMap blockMap(getAllocator());
It may be worth using a SmallHashTable here to avoid heap allocations for the map elements.
The call to optRedirectBlock below wants a BlockToBlockMap.
if (sideEffList == nullptr)
{
    break;
testCopyExpr->gtBashToNOP();
Why not just remove the last statement entirely in this case?
Yeah, that's better, thanks; updated.
}
/* Append the expression to our list */
if (block == bottom)
Nit: this could be folded into the block for the if statement above.
// Now redirect any branches within the newly-cloned iteration
for (block = head->bbNext; block != bottom; block = block->bbNext)
{
    if (block == bottom)
Shouldn't this be dead code due to the loop condition?
AndyAyersMS left a comment:
Still looking but here are some initial thoughts
unrollLimitSz *= 10;
}
#endif
if (optLoopTable[lnum].lpFlags & (LPFLG_DONT_UNROLL | LPFLG_REMOVED))
Why not retest loopFlags here?
Didn't change that line (see 2928 before), didn't notice it. Will update...
int unrollCostSz;
unrollCostSz = (loopCostSz * totalIter) - (loopCostSz + fixedLoopCostSz);
Do we know the first subexpression can't overflow here (realize it was like this before...)?
Added a change to use ClrSafeInt to detect overflow here.
{
    continue;
// Unrolling would require cloning EH regions
goto DONE_LOOP;
Shouldn't this set DONT_UNROLL?
I got rid of the fixpoint iteration so we won't revisit the same loop twice anymore.
// LPFLG_DO_WHILE - required because this transform only handles loops of this form
// LPFLG_CONST - required because this transform only handles full unrolls
// LPFLG_SIMD_LIMIT - included here as a heuristic, not for correctness/structural reasons
requiredFlags = LPFLG_DO_WHILE | LPFLG_CONST | LPFLG_SIMD_LIMIT;
Not sure I understand this part of the change -- the old code looked for ONE_EXIT and now you look for SIMD_LIMIT.
This one makes more sense looking at the individual commits -- one change enables unrolling loops with branches and multiple exits, and hence removes the ONE_EXIT restriction here; the last change enables unrolling for SIMD loops and so adds that flag.
Ok, thanks. I see how removing ONE_EXIT makes sense. But how does adding SIMD_LIMIT make sense? Doesn't having it like this block unrolling for non-SIMD cases?
Yes. We've identified that SIMD loops are good candidates for full unrolls, and we've got this implementation of a full-unroller sitting here bit-rotting (this check that I'm removing currently disables the unroller for all loops), so this change revives that code but (at least for now) as a heuristic does so only for SIMD cases.
Didn't realize this was not enabled already. Makes sense now, thanks.
* (the last value of the iterator in the loop)
* and drop the jump condition since the unrolled loop will always execute */
block->bbFlags &= ~(BBF_NEEDS_GCPOLL | BBF_LOOP_HEAD);
if (BasicBlock* jumpDest = block->bbJumpDest)
If this cannot be if ((BasicBlock* jumpDest = block->bbJumpDest) != nullptr), please move the declaration out of the condition s.t. the condition can test against nullptr.
Turns out that variable is unused in the latest version of this change, so I just removed the declaration.
/* Update bbRefs and bbPreds */
/* Here head->bbNext is bottom !!! - Replace it */
fgRemoveRefPred(head->bbNext, bottom);
Why were this call and the one below removed? Are they now covered by fgUpdateFlowGraph?
Force-pushed from 5a54a12 to 55f491e
@AndyAyersMS, @pgavlin, I think I've addressed your feedback so far.
LGTM too.
Expect instead to see arithmetic nodes that are arguments of separate assign nodes.
There's no need for fixpoint iteration; the loop indices are a pre-order, so walking them in reverse order will visit inner loops before outer ones.
Lift both the single-exit restriction and the no-internal-branching restriction. Share some utilities with the loop cloner to facilitate this (particularly `CloneBlockState` and `fgUpdateChangedFlowGraph`).
Make sure to avoid trying to unroll cases so large as to overflow the cost.
Since the Vector<T> abstraction has a `Count` that is not a C#-compile-time constant, it encourages use of iteration to search/aggregate individual elements using symbolic indexing, which in turn leads to codegen that spills the vector to memory for each element access, and performs bounds checks for each access. These loops will have low trip counts that are jit-compile-time constant, and constant indexing into Vector<T> allows more efficient register-to-register sequences and bounds-check elision. This change enables RyuJIT's loop unroller when such a loop is discovered, and increases the size threshold to target optimizing such loops much more aggressively than the unroller's previous incarnation. Add a test with a motivating loop to the Performance/CodeQuality/SIMD suite. Closes #7843.
Force-pushed from 94d9d70 to e3e7d1a