
Optimize jump stubs on arm64 #62302

@EgorBo

Description


On x64 we emit the following code for jump stubs:

mov rax, 123456789abcdef0h
jmp rax

as I understand from this comment in the runtime source:

// mov rax, 123456789abcdef0h 48 b8 xx xx xx xx xx xx xx xx
// jmp rax ff e0
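Just to spell out the encoding in that comment: the x64 stub is 12 bytes total (2-byte opcode + 8-byte little-endian immediate + 2-byte `jmp rax`). A quick Python sketch (illustrative only; the runtime of course emits this from C++):

```python
import struct

def x64_jump_stub(target: int) -> bytes:
    """Assemble the x64 jump stub bytes: mov rax, imm64 ; jmp rax."""
    code = b"\x48\xb8" + struct.pack("<Q", target)  # 48 B8 + imm64 (little-endian)
    code += b"\xff\xe0"                             # FF E0 = jmp rax
    return code

stub = x64_jump_stub(0x123456789ABCDEF0)
print(len(stub))  # 12
```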

while on arm64 we do a PC-relative memory load (the target address is stored in a data slot right after the code):

ldr x16, [pc, #8]
br  x16
[target address]

// +0: ldr x16, [pc, #8]
// +4: br x16
// +8: [target address]

I'm just wondering whether it would be faster to do what x64 does and materialize the constant directly, even though it takes four instructions to populate it:

mov     x8, #9044
movk    x8, #9268, lsl #16
movk    x8, #61203, lsl #32
movk    x8, #43981, lsl #48
br      x8
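For reference, the four immediates above are just the four 16-bit halves of the target address (here an example address, `0xABCDEF1324342354`, inferred from those immediates). A small sketch of the split:

```python
def movk_chunks(addr: int) -> list[int]:
    """Split a 64-bit address into the four 16-bit immediates for
    mov / movk lsl #16 / movk lsl #32 / movk lsl #48."""
    return [(addr >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]

addr = 0xABCDEF1324342354  # example target, not a real runtime address
print(movk_chunks(addr))   # [9044, 9268, 61203, 43981]
```

Size-wise the two stubs are close: the mov/movk sequence is 5 instructions = 20 bytes, versus 16 bytes (two instructions plus an 8-byte literal) for the `ldr`-based stub, but it avoids the data-dependent load.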

I'm asking because, if I'm reading the TechEmpower traces (Plaintext benchmark) correctly, this could be a bottleneck:

[image: profiler trace from the TE Plaintext benchmark]

cc @dotnet/jit-contrib @jkotas
