Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Enable unrolling of SIMD_LIMIT loops#8001

Merged
JosephTremoulet merged 5 commits into
dotnet:masterfrom
JosephTremoulet:UnrollSimd
Nov 11, 2016
Merged

Enable unrolling of SIMD_LIMIT loops#8001
JosephTremoulet merged 5 commits into
dotnet:masterfrom
JosephTremoulet:UnrollSimd

Conversation

@JosephTremoulet
Copy link
Copy Markdown

Since the Vector abstraction has a Count that is not a
C#-compile-time constant, it encourages use of iteration to
search/aggregate individual elements using symbolic indexing, which in
turn leads to codegen that spills the vector to memory for each element
access, and performs bounds checks for each access. These loops will have
low trip counts that are jit-compile-time constant, and constant indexing
into Vector allows more efficient register-to-register sequences and
bounds-check elision. This change enables RyuJit's loop unroller when
such a loop is discovered, and increases the size threshold to target
optimizing such loops much more aggressively than the unroller's previous
incarnation.

Add a test with a motivating loop from aspnet/KestrelHttpServer#1138 to the
Performance/CodeQuality/SIMD suite.

Closes #7843.

@JosephTremoulet
Copy link
Copy Markdown
Author

@briansull, @dotnet/jit-contrib PTAL.

This is likely easier to review by looking at the individual commits.

In the added benchmark which is a micro benchmark copy of the key loop from aspnet/KestrelHttpServer#1138, the speedups for various indices from a local run are:

Index Improvement
1 14%
3 14%
11 52%
19 43%
27 46%

@JosephTremoulet
Copy link
Copy Markdown
Author

Also: DDR was clean with stress mode forced on.

@JosephTremoulet
Copy link
Copy Markdown
Author

Ping. @AndyAyersMS or @pgavlin maybe?

@JosephTremoulet
Copy link
Copy Markdown
Author

@dotnet-bot test Windows_NT x64 Debug Build and Test

Comment thread src/jit/optimizer.cpp

// Visit loops from highest to lowest number to vist them in innermost
// to outermost order
for (unsigned lnum = optLoopCount - 1; lnum != ~0U; --lnum)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thing unsigned integer wraparound is defined in C++...

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep...

Comment thread src/jit/optimizer.cpp
if (!BasicBlock::CloneBlockState(this, newBlock, block, lvar, lval))
{
/* Stop if we've reached the end of the loop */
// cloneExpr doesn't handle everything
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any intermediate information left over that we need to restore? I'm thinking in particular about lclVar ref counts.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so; the IR gets inserted, it's just unreachable.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's still represented in the BB list?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's in newBlock which is set to the return from fgNewBBafter a few lines above.
No, I snip it out just below (sorry, this went through a few iterations). I was just following suit from the previous behavior here. Did I do that wrong, or do you think there's a bug before & after? DDR passed with unroll stress... what is the ref count invariant and are we sure it's expected to hold here? Loop unrolling runs just before lvaMarkLocalVars.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loop unrolling runs just before lvaMarkLocalVars.

Okay, then we are probably fine w.r.t. ref counts: I think that phase will recalculate them.

Comment thread src/jit/optimizer.cpp
loopList = loopLast = nullptr;
/* Create the unrolled loop statement list */
{
BlockToBlockMap blockMap(getAllocator());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth using a SmallHashTable here to avoid heap allocations for the map elements.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to optRedirectBlock below wants a BlockToBlockMap.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, shame.

Comment thread src/jit/optimizer.cpp Outdated
if (sideEffList == nullptr)
{
break;
testCopyExpr->gtBashToNOP();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just remove the last statement entirely in this case?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's better, thanks; updated.

Comment thread src/jit/optimizer.cpp Outdated
}

/* Append the expression to our list */
if (block == bottom)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this could be folded into the block for the if statement above.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread src/jit/optimizer.cpp Outdated
// Now redirect any branches within the newly-cloned iteration
for (block = head->bbNext; block != bottom; block = block->bbNext)
{
if (block == bottom)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be dead code due to the loop condition?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fixed.

Copy link
Copy Markdown
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking but here are some initial thoughts

Comment thread src/jit/optimizer.cpp Outdated
unrollLimitSz *= 10;
}
#endif
if (optLoopTable[lnum].lpFlags & (LPFLG_DONT_UNROLL | LPFLG_REMOVED))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not retest loopFlags here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't change that line (see 2928 before), didn't notice it. Will update...

Comment thread src/jit/optimizer.cpp Outdated
int unrollCostSz;
unrollCostSz = (loopCostSz * totalIter) - (loopCostSz + fixedLoopCostSz);
int unrollCostSz;
unrollCostSz = (loopCostSz * totalIter) - (loopCostSz + fixedLoopCostSz);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know the first subexpression can't overflow here (realize it was like this before...)?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a change to use ClrSafeInt to detect overflow here.

Comment thread src/jit/optimizer.cpp Outdated
{
continue;
// Unrolling would require cloning EH regions
goto DONE_LOOP;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this set DONT_UNROLL?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of the fixpoint iteration so we won't revisit the same loop twice anymore.

Comment thread src/jit/optimizer.cpp
// LPFLG_DO_WHILE - required because this transform only handles loops of this form
// LPFLG_CONST - required because this transform only handles full unrolls
// LPFLG_SIMD_LIMIT - included here as a heuristic, not for correctness/structural reasons
requiredFlags = LPFLG_DO_WHILE | LPFLG_CONST | LPFLG_SIMD_LIMIT;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this part of the change -- the old code looked for ONE_EXIT and now you look for SIMD_LIMIT.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one makes more sense looking at the individual commits -- one change enables unrolling loops with branches and multiple exits, and hence removes the ONE_EXIT restriction here; the last change enables unrolling for SIMD loops and so adds that flag.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. I see how removing ONE_EXIT makes sense. But how does adding SIMD_LIMIT make sense? Doesn't having it like this block unrolling for non-SIMD cases?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We've identified that SIMD loops are good candidates for full unrolls, and we've got this implementation of a full-unroller sitting here bit-rotting (this check that I'm removing currently disables the unroller for all loops), so this change revives that code but (at least for now) as a heuristic does so only for SIMD cases.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize this was not enabled already. Makes sense now, thanks.

Comment thread src/jit/optimizer.cpp Outdated
* (the last value of the iterator in the loop)
* and drop the jump condition since the unrolled loop will always execute */
block->bbFlags &= ~(BBF_NEEDS_GCPOLL | BBF_LOOP_HEAD);
if (BasicBlock* jumpDest = block->bbJumpDest)
Copy link
Copy Markdown

@pgavlin pgavlin Nov 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this cannot be if ((BasicBlock* jumpDest = block->bbJumpDest) != nullptr), please move the declaration out of the condition s.t. the condition can test against nullptr.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out that variable is unused in the latest version of this change, so I just removed the declaration.

Comment thread src/jit/optimizer.cpp
/* Update bbRefs and bbPreds */
/* Here head->bbNext is bottom !!! - Replace it */

fgRemoveRefPred(head->bbNext, bottom);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were this call and the one below removed? Are they now covered by fgUpdateFlowGraph?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@JosephTremoulet
Copy link
Copy Markdown
Author

@AndyAyersMS, @pgavlin, I think I've addressed your feedback so far.

Copy link
Copy Markdown

@pgavlin pgavlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AndyAyersMS
Copy link
Copy Markdown
Member

LGTM too.

Expect instead to see arithmetic nodes that are arguments of separate
assign nodes.
There's no need for fixpoint iteration; the loop indices are a pre-order,
so walking them in reverse order will visit inner loops before outer ones.
Lift both the single-exit restriction and the no-internal-branching
restriction.  Share some utilities with the loop cloner to facilitate
this (particularly `CloneBlockState` and `fgUpdateChangedFlowGraph`).
Make sure to avoid trying to unroll cases so large as to overlow the cost.
Since the Vector<T> abstraction has a `Count` that is not a
C#-compile-time constant, it encourages use of iteration to
search/aggregate individual elements using symbolic indexing, which in
turn leads to codegen that spills the vector to memory for each element
access, and performs bounds checks for each access.  These loops will have
low trip counts that are jit-compile-time constant, and constant indexing
into Vector<T> allows more efficient register-to-register sequences and
bounds-check elision.  This change enables RyuJit's loop unroller when
such a loop is discovered, and increases the size threshold to target
optimizing such loops much more aggressively than the unroller's previous
incarnation.

Add a test with a motivating loop to the Performance/CodeQuality/SIMD
suite.

Closes #7843.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants