Skip to content

Conversation

@AndyAyersMS
Copy link
Member

Optimize cases where we copy a struct and then immedately read just one field
to read from the original struct instead.

Optimize cases where we copy a struct and then immedately read just one field
to read from the original struct instead.
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 10, 2022
@ghost ghost assigned AndyAyersMS Aug 10, 2022
@ghost
Copy link

ghost commented Aug 10, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Optimize cases where we copy a struct and then immedately read just one field
to read from the original struct instead.

Author: AndyAyersMS
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@AndyAyersMS
Copy link
Member Author

cc @dotnet/jit-contrib

Partially inspired by #66776. Just a small simple piece of struct copy prop.

BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-rc.1.22374.1
[Host] : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT
Job-ZUQRZK : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-SYRZBQ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 InvocationCount=5000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 MinWarmupIterationCount=6
UnrollFactor=1 WarmupCount=-1

Method Tool Size Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Gen 1 Allocated Alloc Ratio
LinqQuery BASE 512 53.49 us 0.791 us 0.740 us 53.62 us 52.50 us 54.99 us 1.00 0.00 5.6000 0.6000 34.33 KB 1.00
LinqQuery DIFF 512 48.59 us 0.631 us 0.559 us 48.66 us 47.16 us 49.37 us 0.91 0.02 5.6000 0.6000 34.33 KB 1.00

Still a rough. Some things to sort through:

  • various odd restrictions
  • should we bail if we're making a local DNER
  • should we try and handle the promoted case
  • could we do some of this in the inliner
  • what are the rules for when we have Layout

Some nice looking diffs, eg

;;; BEFORE

; Assembly listing for method Microsoft.CodeAnalysis.ExternalAccess.VSTypeScript.Api.VSTypeScriptCodeFixContextExtensions:IsBlocking(Microsoft.CodeAnalysis.CodeFixes.CodeFixContext):bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  6   )   byref  ->  rcx         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )  struct (56) [rsp+00H]   do-not-enreg[SF] class-hnd exact "Single-def Box Helper"
;
; Lcl frame size = 56

G_M40065_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
       sub      rsp, 56
       vzeroupper 
						;; size=7 bbWeight=1    PerfScore 1.25
G_M40065_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000002 {rcx}, byref, nogc
       ; byrRegs +[rcx]
       vmovdqu  ymm0, ymmword ptr[rcx]
       vmovdqu  ymmword ptr[rsp], ymm0
       vmovdqu  xmm0, xmmword ptr [rcx+32]
       vmovdqu  xmmword ptr [rsp+20H], xmm0
       mov      rax, qword ptr [rcx+48]
       mov      qword ptr [rsp+30H], rax
						;; size=29 bbWeight=1    PerfScore 14.00
G_M40065_IG03:        ; , extend
       movzx    rax, byte  ptr [rsp+18H]
						;; size=5 bbWeight=1    PerfScore 1.00
G_M40065_IG04:        ; , epilog, nogc, extend
       add      rsp, 56
       ret      

becomes

;;;; AFTER

; Assembly listing for method Microsoft.CodeAnalysis.ExternalAccess.VSTypeScript.Api.VSTypeScriptCodeFixContextExtensions:IsBlocking(Microsoft.CodeAnalysis.CodeFixes.CodeFixContext):bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  6   )   byref  ->  rcx         single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (56) zero-ref    do-not-enreg[SF] class-hnd exact "Single-def Box Helper"
;
; Lcl frame size = 0

G_M40065_IG01:        ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
						;; size=0 bbWeight=1    PerfScore 0.00
G_M40065_IG02:        ; gcrefRegs=00000000 {}, byrefRegs=00000002 {rcx}, byref
       ; byrRegs +[rcx]
       movzx    rax, byte  ptr [rcx+24]
						;; size=4 bbWeight=1    PerfScore 2.00
G_M40065_IG03:        ; , epilog, nogc, extend
       ret   

@AndyAyersMS
Copy link
Member Author

Failure is from folding

STMT00000 ( 0x000[E-] ... 0x00C )
               [000005] -A-X-------                         *  ASG       simd16 (copy)
               [000003] D------N---                         +--*  LCL_VAR   simd16<System.Numerics.Vector4> V02 tmp1         
               [000002] n--X-------                         \--*  OBJ       simd16<System.Numerics.Vector4>
               [000001] -----------                            \--*  ADDR      byref 
               [000000] -------N---                               \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[System.Single]> V00 arg0         

STMT00001 ( ??? ... ??? )
               [000007] -----------                         *  RETURN    simd16
               [000006] -------N---                         \--*  LCL_VAR   simd16<System.Numerics.Vector4> V02 tmp1 

to

    [000005]:  [000006] is only use of [000003] (V02) fold OBJ(ADDR(X)) [000002] into X [000000], with different handles
 [folding OBJ(ADDR(LCL...))] [marking V00 as multi-reg-ret] -- fwd subbing [000000]; new next stmt is

               [000007] -----------                         *  RETURN    simd16
               [000000] m------N---                         \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[System.Single]> V00 arg0  

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Aug 11, 2022

Probably need some vestige of the old handle check back, layout compat isn't enough.

Looks like in crossgen Vector4 and Vector128 aren't ABI compatible, but we need to wait until after folding obj(addr(lcl)) to spot this.

@JulieLeeMSFT
Copy link
Member

@AndyAyersMS, is this PR for .NET 7 or 8? Please put milestone accordingly.

@AndyAyersMS
Copy link
Member Author

It is a draft PR, would be for 8 if anything.

@AndyAyersMS AndyAyersMS added this to the 8.0.0 milestone Aug 11, 2022
@SingleAccretion
Copy link
Contributor

SingleAccretion commented Aug 11, 2022

Failure is from folding mismatched things under return

Yes, this is one case where we still need the handle check. It will be fixed "soon" (#72887 is the prelude).

could we do some of this in the inliner

I notice in #66776 we have a tree characteristic of struct arguments to inlinees, with an unnecessary copy made due to the OBJ(ADDR(LCL_VAR)) wrapping by impNormStructVal. In principle, this wrapping can now be simply deleted, but it will lead to large diffs with some regressions mixed in due to exposing various downstream deficiencies.

should we try and handle the promoted case

Probably not worth it? If we see a LCL_FLD use of a promoted local (which is currently not possible by the time forward sub runs, though that will not be true in the future), we're likely dealing with some sort of un-analyzeable reinterpretation, otherwise we'd expect the SDSU local to have been promoted and the use replaced with that of its field.

should we bail if we're making a local DNER

Forwarding enregisterable/promoted structs, it would be the conservative option. The worry is of course that we'd pessimize all other defs/uses.

what are the rules for when we have Layout

We only have layout on TYP_STRUCT local fields. TYP_SIMD is intentionally excluded (#69822). In general, we have layout for all TYP_STRUCT nodes except multi-reg HWIs (which should be fixed) and TYP_STRUCT INDs (those should be eliminated), and may or may not have layout for TYP_SIMD.

Overall, I am wondering whether we should slot the LCL_VAR -> LCL_FLD substitution into local copy propagation instead. Seems like that'd be more powerful, and global copy propagation already has this ability.

@AndyAyersMS
Copy link
Member Author

Overall, I am wondering whether we should slot the LCL_VAR -> LCL_FLD substitution into local copy propagation instead. Seems like that'd be more powerful, and global copy propagation already has this ability.

You may be right. I'll probably give this a try.

@AndyAyersMS
Copy link
Member Author

AndyAyersMS commented Aug 22, 2022

Overall, I am wondering whether we should slot the LCL_VAR -> LCL_FLD substitution into local copy propagation instead. Seems like that'd be more powerful, and global copy propagation already has this ability.

You may be right. I'll probably give this a try.

I've been looking into this alternative, and it's not working anywhere near as well as forward sub (despite being potentially more capable). We call assertion gen after morphing the trees, and the expansions we do for many struct copes aren't recognizable as struct copies anymore. One idea I may try is to invoke assertion gen on the pre-morphed tree (at least for struct copies).

For example,

fgMorphTree BB01, STMT00012 (before)
               [000063] -A---------                         *  ASG       struct (copy)
               [000062] D------N---                         +--*  LCL_VAR   struct<System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1[System.__Canon], 24> V07 tmp2         
               [000012] -----------                         \--*  LCL_VAR   struct<System.Threading.Tasks.ValueTask`1[System.__Canon], 24>(P) V06 loc2         
                                                            \--*    ref    V06._obj (offs=0x00) -> V11 tmp6         
                                                            \--*    ref    V06._result (offs=0x08) -> V12 tmp7         
                                                            \--*    short  V06._token (offs=0x10) -> V13 tmp8         
                                                            \--*    bool   V06._continueOnCapturedContext (offs=0x12) -> V14 tmp9  

will not generate a V07 == V14 assertion.

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Aug 22, 2022

Are the issues caused by promotion expansions or something else?

(Edit: I see it is)

Generating assertions in pre-order is "wrong" as the copying hasn't happened at that point yet (thus assertions created would not be valid for subtrees); creating them just before block morphing should be fine though.

@SingleAccretion
Copy link
Contributor

creating them just before block morphing should be fine though

This is assuming we have appropriate heuristics in propagation itself, and will not try to replace something with a struct that has just been eliminated through decomposition. Subtitution also has to have these heuristic (the aforementioned DNER question), but it's easier for it to not get into questionable territories because the use/def count is known.

@AndyAyersMS
Copy link
Member Author

Are the issues caused by promotion expansions or something else?

(Edit: I see it is)

Yeah, seems to be caused by the expansions. It would be impractical to pattern match these later, I think.

Generating assertions in pre-order is "wrong" as the copying hasn't happened at that point yet (thus assertions created would not be valid for subtrees); creating them just before block morphing should be fine though.

Right, was looking for a good place to try this.

@AndyAyersMS
Copy link
Member Author

May be a hack, but fgMorphTreeDone has a copy of the original tree... using that to generate assertions fixes the first case I looked at anyways.

@AndyAyersMS
Copy link
Member Author

Abandoning this in favor of #74384

@ghost ghost locked as resolved and limited conversation to collaborators Sep 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants