Preserve SIMD registers in Windows fastcall functions#1378
Preserve SIMD registers in Windows fastcall functions#1378
Conversation
| if let ValueLoc::Reg(ru) = *value_loc { | ||
| if !used.is_avail(GPR, ru) { | ||
| used.free(GPR, ru); | ||
| if GPR.contains(ru) { |
There was a problem hiding this comment.
not checking that GPR (and FPR) contain the regunit in question could introduce errors if these RegClasses had differing widths. So far as I can see, x86 RegClass all have the same width of 1, so this happens to work out okay, but I think is strictly speaking a bug independent of SIMD register preservation.
|
Also, just aside from reviews, thank you so much for writing this patch! |
|
Okay! This is properly ready for review now, after running into a few quirks. |
c2f4fde to
deefb03
Compare
cranelift-codegen/src/isa/x86/abi.rs
Outdated
| // The calling convention described in | ||
| // https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention only specifically | ||
| // discusses XMM6-XMM15, which are 128-bit registers. It may be sufficient to preserve only | ||
| // the low 128 bits here, but we may find that in practice we should preserve all of |
There was a problem hiding this comment.
You only need to preserve the lower 128 bits.
The upper 128 (or 384, whatever) are volatile.
The rationale is:
- You cannot make registers nonvolatile after the fact.
- The ABI cannot be revised in a compatible way, adding nonvolatile registers.
- There are no unwind codes to describe how to restore anything not in the v1 ABI.
You can save more if you want, but it buys little/nothing.
Except, you know, you can have private calling conventions among known callers/callees.
Callers can assume more volatiles (ymm, zmm) are preserved, and can allow for nonvolatiles to be trashed (that the caller preserves), if he knows the callee and the callee behavior -- and the other way around -- a callee that trashes nonvolatiles has to know its callers.
Calling conventions are largely?entirely about separate compilation.
There was a problem hiding this comment.
Speaking of unwind, can we update unwind.rs to emit the unwind codes for these CSRs as part of this PR? The original unwind codes implementation made some assumptions as to what instructions were present in the prologue.
There was a problem hiding this comment.
I've update Windows fastcall unwind generation to emit XmmSave128 and XmmSave128Far, though I'm suspect of the offset.
From reading https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019#unwind-operation-code it looks like the offset for SaveXmm counts a number of 16-byte stack slots from the base of the call frame, so storing to [rbp - 0x10] would have an offset of 1, [rbp - 0x20] would be 2. In contrast, the callee-save code added here should be [rsp - csr_stack_slot_offset + FPR_offset] where FPR_offset is 0x10, 0x20, ... as appropriate. The offsets recorded are FPR_offset, and my suspicion is that they would be interpreted as offsets from the other side of callee-save space by Windows.
This is well beyond my experience with Windows to actually verify. If this looks right to you, I'd be happy to hear it - threading the size of csr_stack_slot through to UnwindInfo seems kind of annoying. I think it would best be a field on Function?
There was a problem hiding this comment.
I also updated the comment to be more strongly worded on preserving the low 128 bits, thanks for the feedback, @jaykrell.
There was a problem hiding this comment.
To better understand, just look at C++ examples.
Here is a start. Change it up with alloca to get a frame pointer.
But ouch, keep in mind, compiler does some odd things that seem to violate the spec.
Like getting the rsp/rbp before adjustment, and recording offsets from after.
That does occur below unfortunately -- the use of rax.
That is an optimization and you don't have to do it. Please don't, really.
It is just trying to keep the offsets in the instructions to fit in 8 bits instead of 32 I think, to shrink instruction size, and looks unnecessary in this case, and wastes an instruction.
C:\s>type 1.c
__declspec(dllexport)
float (*f1)(float, float);
__declspec(dllexport)
float f2(float a, float b)
{
float c = a + b;
float d = a - b;
float e = a * b;
float f = a / b;
return f1(a, b) + f1(c, d) + f1(c, d) - f1(e, f) + f1(e + f, e - f);
}
C:\s>cl /O2 /LD /Zi 1.c /link /noentry /incremental:no
Microsoft (R) C/C++ Optimizing Compiler Version 19.21.27702.2 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
1.c
Microsoft (R) Incremental Linker Version 14.21.27702.2
Copyright (C) Microsoft Corporation. All rights reserved.
/out:1.dll
/dll
/implib:1.lib
/debug
/noentry
/incremental:no
1.obj
Creating library 1.lib and object 1.exp
C:\s>link /dump /unwindinfo /disasm 1.dll
Microsoft (R) COFF/PE Dumper Version 14.21.27702.2
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file 1.dll
File Type: DLL
f2:
; This instruction is ok but imho an unnecessary and possibly failed optimization.
0000000180001000: 48 8B C4 mov rax,rsp
0000000180001003: 48 81 EC 98 00 00 sub rsp,98h
00
; for example, this could/should be encoded as mov [rsp-80]
000000018000100A: 0F 29 70 E8 movaps xmmword ptr [rax-18h],xmm6
; for example, this could/should be encoded as mov [rsp-70]
000000018000100E: 0F 29 78 D8 movaps xmmword ptr [rax-28h],xmm7
0000000180001012: 0F 28 F9 movaps xmm7,xmm1
0000000180001015: 44 0F 29 40 C8 movaps xmmword ptr [rax-38h],xmm8
000000018000101A: 44 0F 28 C0 movaps xmm8,xmm0
000000018000101E: 44 0F 29 48 B8 movaps xmmword ptr [rax-48h],xmm9
0000000180001023: 44 0F 28 C8 movaps xmm9,xmm0
0000000180001027: 44 0F 29 50 A8 movaps xmmword ptr [rax-58h],xmm10
000000018000102C: F3 44 0F 5C CF subss xmm9,xmm7
0000000180001031: 44 0F 29 58 98 movaps xmmword ptr [rax-68h],xmm11
0000000180001036: 44 0F 28 D0 movaps xmm10,xmm0
000000018000103A: 44 0F 29 60 88 movaps xmmword ptr [rax-78h],xmm12
000000018000103F: F3 44 0F 58 D7 addss xmm10,xmm7
0000000180001044: 44 0F 28 E0 movaps xmm12,xmm0
0000000180001048: 44 0F 28 D8 movaps xmm11,xmm0
000000018000104C: F3 44 0F 59 E7 mulss xmm12,xmm7
0000000180001051: 41 0F 28 C9 movaps xmm1,xmm9
0000000180001055: 41 0F 28 C2 movaps xmm0,xmm10
0000000180001059: F3 44 0F 5E DF divss xmm11,xmm7
000000018000105E: FF 15 A4 1F 00 00 call qword ptr [f1]
0000000180001064: 0F 28 F0 movaps xmm6,xmm0
0000000180001067: 0F 28 CF movaps xmm1,xmm7
000000018000106A: 41 0F 28 C0 movaps xmm0,xmm8
000000018000106E: FF 15 94 1F 00 00 call qword ptr [f1]
0000000180001074: F3 0F 58 F0 addss xmm6,xmm0
0000000180001078: 41 0F 28 C9 movaps xmm1,xmm9
000000018000107C: 41 0F 28 C2 movaps xmm0,xmm10
0000000180001080: FF 15 82 1F 00 00 call qword ptr [f1]
0000000180001086: 0F 28 F8 movaps xmm7,xmm0
0000000180001089: 41 0F 28 CB movaps xmm1,xmm11
000000018000108D: 41 0F 28 C4 movaps xmm0,xmm12
0000000180001091: F3 0F 58 FE addss xmm7,xmm6
0000000180001095: FF 15 6D 1F 00 00 call qword ptr [f1]
000000018000109B: 41 0F 28 CC movaps xmm1,xmm12
000000018000109F: F3 0F 5C F8 subss xmm7,xmm0
00000001800010A3: F3 41 0F 5C CB subss xmm1,xmm11
00000001800010A8: F3 45 0F 58 DC addss xmm11,xmm12
00000001800010AD: 41 0F 28 C3 movaps xmm0,xmm11
00000001800010B1: FF 15 51 1F 00 00 call qword ptr [f1]
; blech, see below
00000001800010B7: 4C 8D 9C 24 98 00 lea r11,[rsp+98h]
00 00
00000001800010BF: 41 0F 28 73 E8 movaps xmm6,xmmword ptr [r11-18h]
00000001800010C4: F3 0F 58 C7 addss xmm0,xmm7
00000001800010C8: 0F 28 7C 24 70 movaps xmm7,xmmword ptr [rsp+70h]
00000001800010CD: 45 0F 28 43 C8 movaps xmm8,xmmword ptr [r11-38h]
00000001800010D2: 45 0F 28 4B B8 movaps xmm9,xmmword ptr [r11-48h]
00000001800010D7: 45 0F 28 53 A8 movaps xmm10,xmmword ptr [r11-58h]
00000001800010DC: 45 0F 28 5B 98 movaps xmm11,xmmword ptr [r11-68h]
00000001800010E1: 45 0F 28 63 88 movaps xmm12,xmmword ptr [r11-78h]
; This violates the letter of the spec, alas.
; It is supposed to be add rsp,n, or lea rsp, n[frame]
00000001800010E6: 49 8B E3 mov rsp,r11
00000001800010E9: C3 ret
Function Table (1)
Begin End Info Function Name
00000000 00001000 000010EA 0000211C f2
Unwind version: 1
Unwind flags: None
Size of prologue: 0x3F
Count of codes: 16
Unwind codes:
3F: SAVE_XMM128, register=xmm12 offset=0x20
36: SAVE_XMM128, register=xmm11 offset=0x30
2C: SAVE_XMM128, register=xmm10 offset=0x40
23: SAVE_XMM128, register=xmm9 offset=0x50
1A: SAVE_XMM128, register=xmm8 offset=0x60
12: SAVE_XMM128, register=xmm7 offset=0x70
0E: SAVE_XMM128, register=xmm6 offset=0x80
0A: ALLOC_LARGE, size=0x98
So, overall, mixed. The C++ examples help to understand, but they also confuse, because the C++ compiler violates the spec in ways that end up meaning the same thing at runtime, if you understand closely how unwinding works.
I would encourage you to follow the spec instead, though I suppose there might be measurable perf benefits otherwise.
You can do the link /dump /unwindinfo on any binary -- no need for symbols.
There was a problem hiding this comment.
The NT unwinder is also open sourced, here:
at least an old maybe subsetted/altered form, but probably useful.
There was a problem hiding this comment.
The offsets in the generated unwind information in the test do seem odd, especially compared to the above output from VC++, but I think that's a result of adjusting the stack pointer for the frame after we're saving the SIMD CSRs. In the test output, an offset of 0, 0x10, and 0x20 make sense given that the SP will be adjusted up during unwinding before it restores the XMM registers.
Perhaps we should adjust the SP first? Either way, the generated unwind info looks ok assuming the register constants are correct, as the docs seem to omit the XMM register values for unwind info.
There was a problem hiding this comment.
If the generated unwind info looks okay, are you good with this change and revisiting in case it turns out not to be? I don't have anything resembling a useful Windows development environment, so I'm not even entirely sure what to look for being wrong in case the unwind information is wrong - I imagine "only" an incorrect value for a callee-save register in some parent frame? Given that overall this PR fixes an ABI error that results in dangerously wrong code, I would like to get this figured out :)
There was a problem hiding this comment.
I'm good with the change and open to revisiting it in the future. I think the order of operations in our prologues is just a little "strange" compared to VC++'s prologues, but I don't necessarily see that as a problem with respect to unwind.
There was a problem hiding this comment.
The VIsual C++ code above is weird.
I do not recommend emulating it.
It is surely correct, and it might be optimized, but do not emulate it.
Visual C++ definitely does some strange things as optimizations.
In particular, it stores to the stack before adjusting rsp, and then has to misstate frame pointer offsets. "Normal" code only pushes prior to storing to the stack, I believe.
Maybe try looking at unoptimized output instead.
Or link /dump /disasm /unwindinfo ntoskrnl.exe to find simple stuff written in assembly.
Or read and follow the spec -- Visual C++ kinda skirts the spec.
…4x2, so use regular f64 instead
eae9359 to
869428b
Compare
peterhuene
left a comment
There was a problem hiding this comment.
Reviewing the Windows unwind changes: 👍. The rest makes sense to me, but we should probably get sign-off from another maintainer regarding those changes.
|
Thanks for reviewing! I'm always happy to get more eyes on ABI details - @fitzgen you took a look earlier, is this something you're comfortable reviewing? And if not, anyone want to volunteer? |
| ; nextln: offset: 42, | ||
| ; nextln: op: SmallStackAlloc, | ||
| ; nextln: info: 3, | ||
| ; nextln: info: 5, |
There was a problem hiding this comment.
Sorry, I just caught this. I would expect a value of 4 originally, indicating a 40 byte SP adjustment: 8 bytes to realign the stack from the even number of GPR pushes and 32 bytes for the callee spill space we never elide (even for leaf functions). So there's a bug currently in how we're aligning the stack even when spilling just the GPRs.
With your changes, I would expect a value of 10, indicating 88 bytes: 8 bytes to realign the stack from the even number of GPR pushes, 48 bytes for the saved XMM registers, and 32 bytes for the callee spill space. Even without the alignment bug mentioned above, the value should at least be 9 to account for the space needed to save the XMM registers.
There was a problem hiding this comment.
For comparison, the VC++ example above allocates 152 bytes for its frame: 8 bytes for stack alignment (upon entry stack is always off by 8 bytes due to call instruction pushing return address), 112 bytes for saved XMM registers, and 32 bytes for callee spill space (the function is not a leaf).
There was a problem hiding this comment.
Yep, there's definitely an error here. I hadn't checked the generated assembly on this case but
Disassembly of 356 bytes: [69/1840]
0: 55 push rbp
1: 48 89 e5 mov rbp, rsp
4: 53 push rbx
5: 56 push rsi
6: 57 push rdi
7: 41 54 push r12
9: 41 55 push r13
b: 41 56 push r14
d: 41 57 push r15
f: 48 8d 84 24 10 00 00 00 lea rax, [rsp + 0x10]
17: f2 0f 11 30 movsd qword ptr [rax], xmm6
1b: f2 0f 11 78 10 movsd qword ptr [rax + 0x10], xmm7
20: f2 44 0f 11 40 20 movsd qword ptr [rax + 0x20], xmm8
is definitely not the prologue I'd want to see. It looks like the callee-save for fprs begins at where the stack was when created, rather than being a non-overlapping region laid out somewhere later on. Which makes sense, since this is so late in compilation. urgh.
There was a problem hiding this comment.
Was the unwind information "correct" in that RSP is adjusted by only 0x28 following saving the XMM registers?
There was a problem hiding this comment.
correct in that it was adjusted by 0x30 (5 * 8 == 40 == 0x28, but there's the implied extra 8 bytes bringing it to 0x30)
There was a problem hiding this comment.
Oh sorry, I meant 0x30 (the 5 in the unwind info indicating (5*8)+8; I seem to always forget the +8). Thanks for confirming.
There was a problem hiding this comment.
I've deleted my last comment since it's probably just related to the overlapping stack slot issues. Once we have the proper prologue generated, I'll verify that the unwind information properly describes it.
There was a problem hiding this comment.
Some notes as it took me a surprising amount of time to figure out what exactly was going on. I didn't really understand StackSlot too well.
The gist of it is that lea rax, [rsp + 0x10] may look like an address of nothing in particular, but that's because it seems StackSlot used alongside adjusting the stack is a recipe for bad times. It's the instruction that would be correct to get the stack slot's start at the entry of the function without intervening changes. That is, the return address and rbp are preserved on the stack by Cranelift, and 0x10 down the stack from the base address of the call frame would be rbx and the start of the rest of the csr region. rsp isn't really "ready for use" until after the prologue though, so stack_addr doesn't seem appropriate to use as it currently exists for this purpose.
Subsequently, the xmm stores would overwrite the return address, rbp, rbx, ..., were the stack slot address right - I misunderstood which direction offsets count from too.
I think what I can do is make a second stack slot that we can directly address for FPR save/lrestore, and requiring 16-byte alignment on that would keep offsets simple as well. I'll give it a try and we'll see how it goes 🤞
There was a problem hiding this comment.
Surprise of surprises, the number is now not 3, 4, 5, or 10, but 12! I can argue the correctness of 12, though. For stack_slot reasons I don't understand, the FPR preservation region is aligned more strictly than necessary. This grows what would be 48 bytes into 64. Plus eight bytes to realign the stack, and 32 bytes of shadow stack space, that makes for a 104-byte allocation. That would be 13 8-byte slots, minus one for the implied slot makes 12!
A similar wider-than-necessary FPR region is also present in the float_callee_save test I added. I've tried to follow through what causes this in layout_stack and my best theory is that treating a stack slot as a collection of smaller units is not quite how it's intended to be used, and instead it tries to align the provided stack slots using their sizes as an alignment requirement. It looks like treating a stack slot as an aggregate of more permissible units might be a bit of a change, so I'm open to recommendations if taking ~16-48 extra bytes per call frame is a nonstarter.
fitzgen
left a comment
There was a problem hiding this comment.
The overall patch looks good to me, but obviously the overlapping stack slot bug needs to be fixed before we land this.
I didn't look too much at the unwind code stuff, since Peter already did and is much more familiar with it than I am.
Once the bug is fixed, I'll give it a final once over, and then it should be good to land!
| // | ||
| // TODO: For now, add just an `F64` rather than `F64X2` because `F64X2` would require | ||
| // encoding a fstDisp8 with REX bits set, and we currently can't encode that. F64 causes a | ||
| // whole XMM register to be preserved anyway. |
There was a problem hiding this comment.
File an issue for this todo and link to it here, so that it doesn't get lost, please.
| let csr_slot = func.create_stack_slot(ir::StackSlotData { | ||
| kind: ir::StackSlotKind::IncomingArg, | ||
| size: csr_stack_size as u32, | ||
| offset: Some(-csr_stack_size), |
There was a problem hiding this comment.
Is it too late in the pipeline to make offset be None here and then let layout_stack assign the offset? This might help with the overlapping stack slot issues you and Peter are discussing elsewhere in this thread (can't figure out how to comment inline there??). Or does the fastcall ABI require that they are saved at a particular location?
There was a problem hiding this comment.
I was just looking at this for the same reason, I think you're on the right track. These offsets are up from the bottom of the stack slot so zero is on the opposite side from where push starts placing registers. My expectation was the callee-save region would look like
size | rbp
size - 8 | rbx
size - 16 | rsi
... other callee-save gprs
0x20 | xmm8
0x10 | xmm7
0x00 | xmm6
which is close to what the actual layout looks like, but the base isn't what I'd expected.
There was a problem hiding this comment.
update: it's not too late, but layout_stack requires slots of type IncomingArg to have a defined offset. That offset is used for the base of the stack layout so a None ends up being a panic.
I ended up making FPR save/restore go through a second stack slot of kind SpillSlot - it seems like either this or ExplicitSlot might make sense on account of stack store/load is a lot like a spill (if they weren't used, they wouldn't have to be "spilled" to the stack). I'm not entirely sure what ExplicitSlot ought to be used for, but it also looks applicable?
There was a problem hiding this comment.
I'm not totally sure, but I think that SpillSlots are intended to be reused/de-duped across spilled values with distinct live ranges. This should be possible to do in theory for calling convention related stack slots (same as with struct return) but I'm not sure that it works as of today nor that it wouldn't be buggy.
I think (again not totally sure, as they're mostly used in tests and pretty much all slot kinds other than incoming args are treated the same way) that explicit slots are intended as catch all slots used for random purposes.
It seems that other csr slots are all IncomingArg tho, so maybe its best to keep that slot kind and figure out what the explicit offset needs to be, rather than rely on layout_stack.
There was a problem hiding this comment.
I think we'd need to make some changes to stack_layout and StackSlot for "correct offset in StackSlotKind::IncomingArg" to be a question we can really answer. stack_addr and friends assume the prologue has already done its stack setup, and, seemingly, that rsp won't move through the body of the function. So for the address of an IncomingArg slot to be used in the prologue, we'd need to defer actual resolution of stack offsets even after the instructions are added.
The other option is tracking the offset ourselves in the prologue, which I expect would look like tracking the amount we've moved SP with pushes/explicit adjustment, then acknowledging that the address from stack_addr(IncomingArgSlot) will be wrong (it'd be off by however much we've pushed) and correcting for that after the fact. Because floating point save/restore is past any GPR pushes, I think the offset will always be valid, just smaller than it "should" be.
I'd be more suspicious that I'm misunderstanding something, but I don't think we have other instances of wanting to do stack loads/stores in prologues/epilogues, so maybe this is actually a rough spot? :(
|
Close this? |
|
Yup, I was going through open conversations to figure out which unresolved conversations needed to move. |
If you don't know who could review this, please indicate so and/or ping
bnjbvr. The list of suggested reviewers on the right can help you.This is a draft PR as there are two points where I think I should just ask for help rather than spin my wheels figuring out the Right Thing, I'll add comments below.
@sstangl if you get a chance, I think you might know the right things for the questions to follow!