Skip to content

JIT: Always compute loop iteration estimate in loop inversion if we have PGO data#116104

Merged
amanasifkhalid merged 11 commits intodotnet:mainfrom
amanasifkhalid:loop-inversion-iteration-count
Aug 5, 2025
Merged

JIT: Always compute loop iteration estimate in loop inversion if we have PGO data#116104
amanasifkhalid merged 11 commits intodotnet:mainfrom
amanasifkhalid:loop-inversion-iteration-count

Conversation

@amanasifkhalid
Copy link
Copy Markdown
Contributor

@amanasifkhalid amanasifkhalid commented May 29, 2025

Ensure loop inversion always comes up with a loop iteration estimate better than BB_LOOP_WEIGHT_SCALE if we have PGO data.

Copilot AI review requested due to automatic review settings May 29, 2025 17:37
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 29, 2025
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the loop inversion logic to skip inverting loops that are expected to iterate only a few times, based on profile weight data.

  • Simplifies the iteration count estimation by using the likely weight of the test block and the called count.
  • Removes the previous, more complex handling of profile weights and loop entry estimation.

Comment thread src/coreclr/jit/optimizer.cpp Outdated
@amanasifkhalid
Copy link
Copy Markdown
Contributor Author

amanasifkhalid commented May 29, 2025

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs show large size decreases (with libraries_tests being the outlier), as well as some size increases from RBO being pessimized by less branch duplication. I'm not sure what the cutoff for inversion should be, so if these diffs seem too big, I can reduce it a bit.

It's worth noting that I'm not cutting off any loops when we don't have PGO data. Since inversion currently runs before optSetBlockWeights, I don't think I can make any assumptions about loop iteration counts here.

@AndyAyersMS
Copy link
Copy Markdown
Member

I think this is a tricky one to get right.

  • If a loop has low average iteration count it can still have instances with high iteration counts.
  • If a loop has low iteration counts the method with the loop may be called frequently, or the loop may be inside another loop with high iteration counts, etc.

@amanasifkhalid
Copy link
Copy Markdown
Contributor Author

or the loop may be inside another loop with high iteration counts

In this case, wouldn't we compute a high iteration count for the nested loop too (assuming the parent loop doesn't conditionally execute the child loop)?

I agree that this approach isn't sensitive to the other cases you mentioned. The loop inversion diffs that inspired this change didn't necessarily involve loops with low iteration counts; rather, they were loops that are more likely to fall through than loop, or otherwise weren't likely to run more than once per method call. It feels a bit crude, but I could take a safer approach and skip loops that don't iterate at least twice on average -- in other words, it has to behave like a loop on average to be inverted.

@AndyAyersMS
Copy link
Copy Markdown
Member

or the loop may be inside another loop with high iteration counts

In this case, wouldn't we compute a high iteration count for the nested loop too (assuming the parent loop doesn't conditionally execute the child loop)?

Ah, I should have looked more closely. You are computing a method-entry relative count, not a loop-entry relative count... I just assumed "iteration count" meant the latter.

So yes what you are doing would handle the nested case ok.

I'd like to see what a size-based heuristic looks like. I think that is perhaps less prone to mis-estimating importance or potential benefit from inversion (?).

@amanasifkhalid
Copy link
Copy Markdown
Contributor Author

I'd like to see what a size-based heuristic looks like.

I was thinking of reusing the size heuristic you added for loop cloning: If a loop is too big to likely benefit from cloning, then it's probably not tight enough to benefit from inversion. Does that seem like a reasonable starting point?

I don't think we can easily separate out the size heuristic change from #116017, since we need the loop data structures computed to easily compute the loop size. I can push a change to that PR with the size restriction and see how the diffs change.

@AndyAyersMS
Copy link
Copy Markdown
Member

I'd like to see what a size-based heuristic looks like.

I was thinking of reusing the size heuristic you added for loop cloning: If a loop is too big to likely benefit from cloning, then it's probably not tight enough to benefit from inversion. Does that seem like a reasonable starting point?

Sure, using the same size threshold seems reasonable.

@amanasifkhalid
Copy link
Copy Markdown
Contributor Author

Based on my trial and error with different size limits for loop inversion (comment), I think we're unlikely to pursue a loop iteration heuristic for now. I'm going to remove the heuristic portion and just make this into a refactor of the loop iteration computation, so that we're at least always doing it.

@amanasifkhalid amanasifkhalid changed the title JIT: Don't invert loops with low iterations counts JIT: Always compute loop iteration estimate in loop inversion if we have PGO data May 30, 2025
@amanasifkhalid
Copy link
Copy Markdown
Contributor Author

@AndyAyersMS I thought I'd revive this to cut down on my PR backlog. The only material change in this is we always try to estimate the loop iteration count when we have PGO data, even if the weights into the loop are inconsistent. Because we run profile repair right before loop inversion, we don't encounter inconsistency all that often. From what I've seen, most of the cases where block weights are still inconsistent are under OSR, which is known to trip up profile repair. Under OSR, we can assume the loop is very hot, so even if the loop iteration count loses some imprecision from the lack of profile consistency, I suspect the computed value is always more realistic than BB_LOOP_WEIGHT_SCALE (8).

The diffs are small, and seem to be inflated by duplicate method contexts, according to the disasm summaries. Ex:

Top method regressions (bytes):
          44 (6.667% of base) : 45083.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified(ptr,ptr,nuint):nuint (Tier1)
          44 (6.667% of base) : 73098.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified(ptr,ptr,nuint):nuint (Tier1)
          44 (6.707% of base) : 46611.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified(ptr,ptr,nuint):nuint (Tier1)
          44 (6.667% of base) : 57315.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified(ptr,ptr,nuint):nuint (Tier1)
          44 (6.707% of base) : 41594.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified(ptr,ptr,nuint):nuint (Tier1)
          44 (6.707% of base) : 58719.dasm - System.Text.Ascii:NarrowUtf16ToAscii_Intrinsified(ptr,ptr,nuint):nuint (Tier1)
          44 (4.331% of base) : 41588.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (4.435% of base) : 45023.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (4.418% of base) : 45068.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (4.331% of base) : 58707.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (4.418% of base) : 57303.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (4.331% of base) : 46598.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (4.435% of base) : 67635.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (4.365% of base) : 73086.dasm - System.Text.Ascii:NarrowUtf16ToAscii(ptr,ptr,nuint):nuint (Tier1)
          44 (1.321% of base) : 41587.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(ptr,int,ptr,int,byref,byref):int (Tier1)
          44 (1.327% of base) : 57302.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(ptr,int,ptr,int,byref,byref):int (Tier1)
          44 (1.330% of base) : 45022.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(ptr,int,ptr,int,byref,byref):int (Tier1)
          44 (1.330% of base) : 67634.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(ptr,int,ptr,int,byref,byref):int (Tier1)
          44 (1.319% of base) : 46597.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(ptr,int,ptr,int,byref,byref):int (Tier1)
          44 (1.329% of base) : 45067.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(ptr,int,ptr,int,byref,byref):int (Tier1)

Copy link
Copy Markdown
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's take this one.

@amanasifkhalid amanasifkhalid merged commit 67541ba into dotnet:main Aug 5, 2025
103 of 105 checks passed
@amanasifkhalid amanasifkhalid deleted the loop-inversion-iteration-count branch August 5, 2025 18:19
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants