@tannergooding
Member

@tannergooding tannergooding commented May 24, 2024

This changes:

private static Vector128<float> Test()
{
    var y = Vector128.Create(3, 2, 1, 0);
    return Vector128.Shuffle(Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f), y);
}

From generating:

; Method Program:Test():System.Runtime.Intrinsics.Vector128`1[float] (FullOpts)
G_M000_IG01:                ;; offset=0x0000
       push     rbx
       sub      rsp, 64
       mov      rbx, rcx

G_M000_IG02:                ;; offset=0x0008
       vmovups  xmm0, xmmword ptr [reloc @RWD00]
       vmovaps  xmmword ptr [rsp+0x30], xmm0
       vmovups  xmm0, xmmword ptr [reloc @RWD16]
       vmovaps  xmmword ptr [rsp+0x20], xmm0
       lea      rdx, [rsp+0x30]
       lea      r8, [rsp+0x20]
       mov      rcx, rbx
       call     [System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]]
       mov      rax, rbx

G_M000_IG03:                ;; offset=0x003A
       add      rsp, 64
       pop      rbx
       ret      
RWD00  	dq	400000003F800000h, 4080000040400000h
RWD16  	dq	0000000200000003h, 0000000000000001h
; Total bytes of code: 64

to instead generate:

; Method Program:Test():System.Runtime.Intrinsics.Vector128`1[float] (FullOpts)
G_M40807_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M40807_IG02:  ;; offset=0x0000
       vpermilps xmm0, xmmword ptr [reloc @RWD00], 27
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
						;; size=17 bbWeight=1 PerfScore 4.25

G_M40807_IG03:  ;; offset=0x0011
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
RWD00  	dq	400000003F800000h, 4080000040400000h
; Total bytes of code: 18
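
For reference, the immediate `27` in the `vpermilps` above is the packed form of the constant indices `(3, 2, 1, 0)`: two bits per destination lane, with lane 0 in the lowest bits. The following is a minimal sketch of that encoding plus a scalar emulation of the permute. The class and method names (`ShuffleImmediateSketch`, `TryEncode`, `Permute`) are hypothetical and exist only for illustration; this is not the JIT's implementation.

```csharp
using System;

// Illustrative sketch only: these helpers are hypothetical and are not the JIT's code.
public static class ShuffleImmediateSketch
{
    // Packs four constant lane indices into an 8-bit permute immediate:
    // two bits per destination lane, lane 0 in the lowest bits.
    // Returns false if any index is outside 0..3, because an immediate
    // permute cannot produce the zeroed lane that Vector128.Shuffle
    // yields for an out-of-range index.
    public static bool TryEncode(int i0, int i1, int i2, int i3, out byte imm8)
    {
        imm8 = 0;
        int[] indices = { i0, i1, i2, i3 };
        int packed = 0;

        for (int lane = 0; lane < 4; lane++)
        {
            if ((uint)indices[lane] > 3)
            {
                return false;
            }
            packed |= indices[lane] << (lane * 2);
        }

        imm8 = (byte)packed;
        return true;
    }

    // Scalar emulation of a 4-lane immediate permute: destination lane N
    // takes the source lane selected by bits [2N+1:2N] of imm8.
    public static float[] Permute(float[] source, byte imm8)
    {
        var result = new float[4];

        for (int lane = 0; lane < 4; lane++)
        {
            result[lane] = source[(imm8 >> (lane * 2)) & 0b11];
        }
        return result;
    }
}
```

With the indices from `Test()`, `TryEncode(3, 2, 1, 0, out imm8)` produces `27` (`0b00_01_10_11`), and `Permute` applied to `{ 1.0f, 2.0f, 3.0f, 4.0f }` with that immediate returns the reversed lanes `{ 4.0f, 3.0f, 2.0f, 1.0f }`.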

Due to limitations in forward sub, this does not handle cases where other statements get in the way of the substitution.

Doing this after global morph is much more difficult, because the call's arguments get rewritten to spilled locals, such as:

fgMorphTree BB01, STMT00002 (after)
               [000016] SACXG+-----                         *  CALL      void   System.Runtime.Intrinsics.Vector128:Shuffle(System.Runtime.Intrinsics.Vector128`1[float],System.Runtime.Intrinsics.Vector128`1[int]):System.Runtime.Intrinsics.Vector128`1[float]
               [000021] DA--------- arg1 setup              +--*  STORE_LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1>(AX) V04 tmp2         
               [000014] -----+-----                         |  \--*  HWINTRINSIC simd16 float Add
               [000012] -----+-----                         |     +--*  LCL_VAR   simd16 V03 tmp1         
               [000013] -----+-----                         |     \--*  LCL_VAR   simd16 V03 tmp1          (last use)
               [000022] ----------- arg1 in rdx             +--*  LCL_ADDR  long   V04 tmp2         [+0]
               [000015] -----+----- arg2 in r8              +--*  LCL_ADDR  long   V01 loc0         [+0]
               [000017] -----+----- retbuf in rcx           \--*  LCL_VAR   byref  V00 RetBuf       

This is basically the same reason we can't do what GT_INTRINSIC does by just carrying a GT_HWINTRINSIC down to rationalization and rewriting it back to a call. The ABI handling around return buffers and parameter passing happens very early today (return buffers around import and parameter passing in global morph). If the ABI handling were moved down, then we could move this logic later (such as post VN) and catch essentially all cases instead.

@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on May 24, 2024
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member Author

Ended up replacing this with #102702, which allows it to be done in rationalization instead and so can cover many more scenarios.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 28, 2024
@tannergooding tannergooding deleted the shuffle-cns-tst branch July 1, 2025 14:40