Enable unrolling of SIMD_LIMIT loops #8001
Conversation
@briansull, @dotnet/jit-contrib PTAL. This is likely easier to review by looking at the individual commits. In the added benchmark which is a micro benchmark copy of the key loop from aspnet/KestrelHttpServer#1138, the speedups for various indices from a local run are:
Force-pushed from 80c5fd7 to b2df59e
Also: DDR was clean with stress mode forced on.
Force-pushed from b2df59e to 5a54a12
Ping. @AndyAyersMS or @pgavlin maybe?
@dotnet-bot test Windows_NT x64 Debug Build and Test
// Visit loops from highest to lowest number to visit them in innermost
// to outermost order
for (unsigned lnum = optLoopCount - 1; lnum != ~0U; --lnum)
Good thing unsigned integer wraparound is defined in C++...
if (!BasicBlock::CloneBlockState(this, newBlock, block, lvar, lval))
{
    /* Stop if we've reached the end of the loop */
    // cloneExpr doesn't handle everything
Is there any intermediate information left over that we need to restore? I'm thinking in particular about lclVar ref counts.
I don't think so; the IR gets inserted, it's just unreachable.
So it's still represented in the BB list?
Yeah, it's in newBlock which is set to the return from fgNewBBafter a few lines above.
No, I snip it out just below (sorry, this went through a few iterations). I was just following suit from the previous behavior here. Did I do that wrong, or do you think there's a bug before & after? DDR passed with unroll stress... what is the ref count invariant, and are we sure it's expected to hold here? Loop unrolling runs just before lvaMarkLocalVars.
"Loop unrolling runs just before lvaMarkLocalVars."

Okay, then we are probably fine w.r.t. ref counts: I think that phase will recalculate them.
loopList = loopLast = nullptr;
/* Create the unrolled loop statement list */
{
    BlockToBlockMap blockMap(getAllocator());
It may be worth using a SmallHashTable here to avoid heap allocations for the map elements.
The call to optRedirectBlock below wants a BlockToBlockMap.
if (sideEffList == nullptr)
{
    break;
testCopyExpr->gtBashToNOP();
Why not just remove the last statement entirely in this case?
Yeah, that's better, thanks; updated.
}
/* Append the expression to our list */
if (block == bottom)
Nit: this could be folded into the block for the if statement above.
// Now redirect any branches within the newly-cloned iteration
for (block = head->bbNext; block != bottom; block = block->bbNext)
{
    if (block == bottom)
Shouldn't this be dead code due to the loop condition?
AndyAyersMS left a comment:
Still looking but here are some initial thoughts
unrollLimitSz *= 10;
}
#endif
if (optLoopTable[lnum].lpFlags & (LPFLG_DONT_UNROLL | LPFLG_REMOVED))
Why not retest loopFlags here?
Didn't change that line (see 2928 before), didn't notice it. Will update...
int unrollCostSz;
unrollCostSz = (loopCostSz * totalIter) - (loopCostSz + fixedLoopCostSz);
Do we know the first subexpression can't overflow here (realize it was like this before...)?
Added a change to use ClrSafeInt to detect overflow here.
{
    continue;
// Unrolling would require cloning EH regions
goto DONE_LOOP;
Shouldn't this set DONT_UNROLL?
I got rid of the fixpoint iteration so we won't revisit the same loop twice anymore.
// LPFLG_DO_WHILE - required because this transform only handles loops of this form
// LPFLG_CONST - required because this transform only handles full unrolls
// LPFLG_SIMD_LIMIT - included here as a heuristic, not for correctness/structural reasons
requiredFlags = LPFLG_DO_WHILE | LPFLG_CONST | LPFLG_SIMD_LIMIT;
Not sure I understand this part of the change -- the old code looked for ONE_EXIT and now you look for SIMD_LIMIT.
This one makes more sense looking at the individual commits -- one change enables unrolling loops with branches and multiple exits, and hence removes the ONE_EXIT restriction here; the last change enables unrolling for SIMD loops and so adds that flag.
Ok, thanks. I see how removing ONE_EXIT makes sense. But how does adding SIMD_LIMIT make sense? Doesn't having it like this block unrolling for non-SIMD cases?
Yes. We've identified that SIMD loops are good candidates for full unrolls, and we've got this implementation of a full-unroller sitting here bit-rotting (this check that I'm removing currently disables the unroller for all loops), so this change revives that code but (at least for now) as a heuristic does so only for SIMD cases.
Didn't realize this was not enabled already. Makes sense now, thanks.
* (the last value of the iterator in the loop)
* and drop the jump condition since the unrolled loop will always execute */
block->bbFlags &= ~(BBF_NEEDS_GCPOLL | BBF_LOOP_HEAD);
if (BasicBlock* jumpDest = block->bbJumpDest)
If this cannot be if ((BasicBlock* jumpDest = block->bbJumpDest) != nullptr), please move the declaration out of the condition s.t. the condition can test against nullptr.
Turns out that variable is unused in the latest version of this change, so I just removed the declaration.
/* Update bbRefs and bbPreds */
/* Here head->bbNext is bottom !!! - Replace it */
fgRemoveRefPred(head->bbNext, bottom);
Why were this call and the one below removed? Are they now covered by fgUpdateFlowGraph?
Force-pushed from 5a54a12 to 55f491e
@AndyAyersMS, @pgavlin, I think I've addressed your feedback so far.
LGTM too.
Expect instead to see arithmetic nodes that are arguments of separate assign nodes.
There's no need for fixpoint iteration; the loop indices are a pre-order, so walking them in reverse order will visit inner loops before outer ones.
Lift both the single-exit restriction and the no-internal-branching restriction. Share some utilities with the loop cloner to facilitate this (particularly `CloneBlockState` and `fgUpdateChangedFlowGraph`).
Make sure to avoid trying to unroll cases so large as to overflow the cost.
Since the Vector<T> abstraction has a `Count` that is not a C#-compile-time constant, it encourages use of iteration to search/aggregate individual elements using symbolic indexing, which in turn leads to codegen that spills the vector to memory for each element access, and performs bounds checks for each access. These loops will have low trip counts that are jit-compile-time constant, and constant indexing into Vector<T> allows more efficient register-to-register sequences and bounds-check elision. This change enables RyuJIT's loop unroller when such a loop is discovered, and increases the size threshold to target optimizing such loops much more aggressively than the unroller's previous incarnation. Add a test with a motivating loop to the Performance/CodeQuality/SIMD suite. Closes #7843.
Force-pushed from 94d9d70 to e3e7d1a