Implement stack limit checks for AArch64 #1573
alexcrichton merged 1 commit into bytecodealliance:master from
Conversation
Force-pushed from 0585d6a to 2d7a4ce
This pull request has been labeled: "cranelift", "cranelift:area:aarch64". The following users have been cc'd because of these labels: @bnjbvr
cfallin left a comment
Thanks for this -- you jumped into the new backend guts remarkably quickly!
The main concern is the choice of register to hold the stack limit in the function body, as noted below; spilltmp comes with a bunch of baggage as it's used in other ways (notably register-allocator-inserted spills/reloads, which can happen in lots of places). I think the fix is relatively straightforward though!
let reg = generate_gv(f, abi, gv, &mut insts);
return (reg, insts);

fn generate_gv(
Hmm. I'm not wild about having this GlobalValue expansion implementation local to the stack-limit code, but upon looking at aarch64/lower.rs, it seems that in all other cases the legalization pass transforms actual GlobalValue ops into core ops as here.
Perhaps we could at least move generate_gv to aarch64/lower.rs (passing in the vmctx reg as an arg)? This could eventually be used if we shift from the current encodings-based legalization framework to something simpler/more purpose-built as well.
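The pattern under discussion, recursively expanding a `GlobalValue` chain into loads/adds rooted at the vmctx register, can be sketched with a self-contained toy model (all types, names, and the register encoding here are illustrative stand-ins, not the real cranelift-codegen ones):

```rust
// Toy model of GlobalValue expansion. `GlobalValue`, `Inst`, and the
// u8 register representation are illustrative stand-ins only.
enum GlobalValue {
    VmCtx,                                          // the vmctx parameter itself
    Load { base: Box<GlobalValue>, offset: i64 },   // *(base + offset)
    IAddImm { base: Box<GlobalValue>, offset: i64 },
}

#[derive(Debug, PartialEq)]
enum Inst {
    LoadU64 { rd: u8, rn: u8, offset: i64 },
    AddImm { rd: u8, rn: u8, imm: i64 },
}

// Recursively lower a GlobalValue chain. Intermediate and final results
// land in `into_reg`; the returned register holds the final value (the
// vmctx register itself at the root of the chain).
fn generate_gv(gv: &GlobalValue, vmctx_reg: u8, into_reg: u8, insts: &mut Vec<Inst>) -> u8 {
    match gv {
        GlobalValue::VmCtx => vmctx_reg,
        GlobalValue::Load { base, offset } => {
            let rn = generate_gv(base, vmctx_reg, into_reg, insts);
            insts.push(Inst::LoadU64 { rd: into_reg, rn, offset: *offset });
            into_reg
        }
        GlobalValue::IAddImm { base, offset } => {
            let rn = generate_gv(base, vmctx_reg, into_reg, insts);
            insts.push(Inst::AddImm { rd: into_reg, rn, imm: *offset });
            into_reg
        }
    }
}
```

For example, a `stack_limit` defined as a load at offset 8 from vmctx expands to a single load instruction into the destination register.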
So FWIW this is definitely an "unfortunate" special context. This is "duplicated" with the x86 backend as well. The reason for this is that prologue/epilogue happens so late (way after legalization and even after register allocation) that it's difficult to share code. The hope is that the number of global values we'd need to process wouldn't be that large and/or it'd be easy to extend them over time.
In that sense I suspect that the global value legalization pass is unlikely to go away (it's architecture independent too!). That being said I'm happy to move this code around. Given that do you think it'd still have a better home in lower.rs?
(I can definitely do a better job about documenting all this too)
Hmm, OK. I still think that we should move this to lower.rs, but other refactoring and cleanup can come later -- don't want to hold this up for it!
stackslots.push(off);
}

let stack_limit = match f
Factor this pattern out (also occurs above) -- take f and sig, and return reg for vmctx plus Vec<Inst>?
To allocate a new temp as an into_reg to hold stack_limit, we'll need to finagle the plumbing a bit. Vreg temps are allocated by the LowerCtx, which we've kept separate from the ABI traits, and using this will be tricky because we can't have generic methods on the ABIBody trait. The ordering is slightly tricky too: the ABIBody is created before lowering starts, so that's too early to alloc a temp, but gen_prologue happens late (we lower instructions in reverse order) so we can't do it there either.
Suggestion: perhaps add an init method on the ABIBody trait that takes a "just in case" temp? (vregs are basically free, aside from creating sparsity in the ID space, so this is fine.) Then this could set stack_limit for use by gen_prologue and any calls.
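The `init` suggestion could look roughly like this toy model (trait, type, and method names are illustrative stand-ins, not the real ABIBody trait; a later comment in this thread revises the plan, so treat this as a sketch of the idea only):

```rust
// Hypothetical mini-model of an ABI object receiving a "just in case"
// temp vreg up front, for use later in gen_prologue.
#[derive(Clone, Copy, Debug, PartialEq)]
struct VReg(u32);

trait AbiBody {
    // Called once, early, while the lowering context can still allocate vregs.
    fn init(&mut self, maybe_tmp: Option<VReg>);
    fn gen_prologue(&self) -> Vec<String>;
}

struct Aarch64Abi {
    stack_limit_tmp: Option<VReg>,
}

impl AbiBody for Aarch64Abi {
    fn init(&mut self, maybe_tmp: Option<VReg>) {
        // Stash the temp; vregs are basically free to allocate.
        self.stack_limit_tmp = maybe_tmp;
    }
    fn gen_prologue(&self) -> Vec<String> {
        match self.stack_limit_tmp {
            Some(tmp) => vec![format!("compute stack_limit into v{}", tmp.0)],
            None => vec![],
        }
    }
}
```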
Actually, one more thought: after reading through the rest of the patch, it seems that stack_limit is not used beyond the prologue. Given this, I think it would be fine to use spilltmp within gen_prologue itself, and then carefully validate that we don't keep it live across anything else that could implicitly use it. This includes any virtual registers at all: the register allocator is liable to pop in and do a surprise spill with a big offset somewhere, wiping out spilltmp.
readonly: _,
} => {
let base = generate_gv(f, abi, base, insts);
let into_reg = writable_spilltmp_reg();
Eek, I'm not sure if it's safe to use spilltmp here -- or at least it's not clear it will interact well with anything we might write in the future. The basic invariant is that spilltmp should only be used within a sequence of instructions generated as one unit (the original use was to compute large stackframe offsets for regalloc spills/reloads, hence the name). We should probably take an into_reg as an argument of generate_gv here, and then by implication, to gen_stack_limit too.
(I should've added more documentation about this, sorry -- if you wouldn't mind adding a comment on spilltmp_reg() to this effect, that'd be very helpful, thanks!)
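A sketch of the comment being requested here (the register number is an assumption for illustration; check the real aarch64 backend source for the actual choice):

```rust
// Hypothetical standalone version; the real function lives in the
// AArch64 backend's register definitions.
/// Return the "spill temp" register.
///
/// Invariant: this register may only be used within a single sequence of
/// instructions generated as one unit, e.g. a regalloc-inserted
/// spill/reload computing a large stack-frame offset (hence the name).
/// It must never be kept live across any point where the register
/// allocator or other late-emitted code could clobber it.
fn spilltmp_reg() -> u8 {
    15 // x15 (illustrative choice, an assumption for this sketch)
}
```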
Thanks for taking a look @cfallin! I definitely have some documentation to shore up no matter what :). In any case I wanted to ask some questions about your comments.

It sounds like you're thinking we should use vregs (which I presume is a virtual, not-yet-allocated register, right?) for this stuff. I think that might be difficult, though, because prologue generation happens after register allocation. The saving grace is that the registers used here in the prologue are only used during the prologue and never afterwards. Additionally, I think that all caller-saved registers are available, not just the spilltmp one (that one just had a somewhat convenient name to pull from). Does that make sense? I could also very well be missing something about how to plumb vregs in here.

I also don't mind writing more documentation about the stack limit support in cranelift. I didn't add documentation in my previous PR (boo) and would be happy to add some here with notes about how this is implemented for each architecture's backend and how it runs at a unique time where register allocation/legalization aren't available.
So, one clarification: in the new backend, the prologue generation runs before register allocation. So we are free to allocate temp virtual registers (vregs) and use them if we need to. (I guess the old backend design didn't do it this way; I didn't study the reasoning for that in detail, but the current approach made things a lot simpler for us.) Given that, I think it would be best if we added just a little plumbing: alloc a vreg when calling
(And, as a followup clarification to that, we made sure that the register allocator understands live ranges of both real regs and virtual regs, so it's totally OK to use a vreg and then later in the prologue, refer to a realreg like an argument register; the regalloc considers the real-reg live range when choosing where to allocate the vreg.)
Oh that's news to me! I'm a bit confused though because where
I believe
@cfallin hm, so given that register allocation runs before prologue/epilogue generation, do you think it'd be ok to shore up the documentation around these and say how they can use any caller-saved register, and how we're doing manual register allocation because we generate the prologue so late?
Argh, OK, I was deeply confused yesterday; my apologies. (Many balls in the air now that we're tying up loose ends...) You're correct, prologue generation happens after regalloc. I think that I was thinking about the other bits of ABI-generated instruction sequences (argument/retval copies) which use vregs freely. Indeed, the prologue can use only real regs. So I think the best way forward is to use the spilltmp reg as you have, and then carefully document why it's safe -- between the def (computing the GV) and use (the check itself) no spills could have happened.
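The shape agreed on here can be sketched as a self-contained toy model (instruction variants, register choices, and the trap condition are illustrative assumptions, not the real cranelift-codegen AArch64 `Inst` type):

```rust
// Toy model of the prologue-time stack check.
#[derive(Debug, PartialEq)]
enum Inst {
    ComputeStackLimit { rd: &'static str },                  // GV expansion into spilltmp
    SubImm { rd: &'static str, rn: &'static str, imm: u32 }, // the would-be stack pointer
    CmpRR { rn: &'static str, rm: &'static str },
    TrapIfLo,                                                // unsigned-lower => stack overflow
}

// Emit the check as one unit: spilltmp (assumed to be x15 here) is
// defined and consumed with no intervening instructions, so no
// regalloc-inserted spill can clobber it in between.
fn gen_stack_check(frame_size: u32) -> Vec<Inst> {
    vec![
        Inst::ComputeStackLimit { rd: "x15" },
        Inst::SubImm { rd: "x16", rn: "sp", imm: frame_size },
        Inst::CmpRR { rn: "x16", rm: "x15" },
        Inst::TrapIfLo,
    ]
}
```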
Force-pushed from 2d7a4ce to 55a48b7
Ok, I've tried to add some comments and such, as well as rebasing and uncommenting a few tests which now pass. Mind taking another look @cfallin?
cfallin left a comment
Looks good, thanks! Sorry about the earlier confusion and thanks for your patience!
/// means that we need to do manual register allocation here and also be
/// careful to not clobber any callee-saved or argument registers. For now
/// this routine makes do with the `writable_spilltmp_reg` as one temporary
/// register, and a second register of `x16` which is caller-saved. This
Force-pushed from 55a48b7 to 2bef279
No worries at all! Thanks again for the review :)
This commit implements the stack limit checks in cranelift for the AArch64 backend. This gets the `stack_limit` argument purpose as well as a function's global `stack_limit` directive working for the AArch64 backend. I've tested this locally on some hardware and in an emulator and it looks to be working for basic tests, but I've never really done AArch64 before so some scrutiny on the instructions would be most welcome!
I also suspect that these stack limit checks aren't quite optimal, but they relatively closely match the x86 backend (which also isn't optimal), so for now I'm largely hoping for correctness as opposed to speed.
Closes #1569