
Implement stack limit checks for AArch64 #1573

Merged
alexcrichton merged 1 commit into bytecodealliance:master from alexcrichton:aarch64-stack-limit on Apr 24, 2020

Conversation

@alexcrichton (Member) commented Apr 21, 2020

This commit implements the stack limit checks in cranelift for the
AArch64 backend. This gets the `stack_limit` argument purpose as well as
a function's global `stack_limit` directive working for the AArch64
backend. I've tested this locally on some hardware and in an emulator
and it looks to be working for basic tests, but I've never really done
AArch64 before so some scrutiny on the instructions would be most
welcome!

I also suspect that these stack limit checks aren't quite optimal, but they relatively closely match the x86 backend (which also isn't optimal), so for now I'm largely hoping for correctness as opposed to speed.

Closes #1569

@github-actions bot added the labels cranelift (Issues related to the Cranelift code generator) and cranelift:area:aarch64 (Issues related to AArch64 backend) on Apr 22, 2020

@bnjbvr bnjbvr requested a review from cfallin April 22, 2020 10:12
@cfallin (Member) left a comment

Thanks for this -- you jumped into the new backend guts remarkably quickly!

The main concern is the choice of register to hold the stack limit in the function body, as noted below; spilltmp comes with a bunch of baggage as it's used in other ways (notably register-allocator-inserted spills/reloads, which can happen in lots of places). I think the fix is relatively straightforward though!

let reg = generate_gv(f, abi, gv, &mut insts);
return (reg, insts);

fn generate_gv(
@cfallin (Member):

Hmm. I'm not wild about having this GlobalValue expansion implementation local to the stack-limit code, but upon looking at aarch64/lower.rs, it seems that in all other cases the legalization pass transforms actual GlobalValue ops into core ops as here.

Perhaps we could at least move generate_gv to aarch64/lower.rs (passing in the vmctx reg as an arg)? This could eventually be used if we shift from the current encodings-based legalization framework to something simpler/more purpose-built as well.

@alexcrichton (Member Author):

So FWIW this is definitely an "unfortunate" special context. This is "duplicated" with the x86 backend as well. The reason for this is that prologue/epilogue happens so late (way after legalization and even after register allocation) that it's difficult to share code. The hope is that the number of global values we'd need to process wouldn't be that large and/or it'd be easy to extend them over time.

In that sense I suspect that the global value legalization pass is unlikely to go away (it's architecture-independent too!). That being said, I'm happy to move this code around. Given that, do you think it'd still have a better home in lower.rs?

(I can definitely do a better job about documenting all this too)

@cfallin (Member):

Hmm, OK. I still think that we should move this to lower.rs, but other refactoring and cleanup can come later -- don't want to hold this up for it!

stackslots.push(off);
}

let stack_limit = match f
@cfallin (Member):

Factor this pattern out (also occurs above) -- take f and sig, and return reg for vmctx plus Vec<Inst>?

To allocate a new temp as an into_reg to hold stack_limit, we'll need to finagle the plumbing a bit. Vreg temps are allocated by the LowerCtx, which we've kept separate from the ABI traits, and using this will be tricky because we can't have generic methods on the ABIBody trait. The ordering is slightly tricky too: the ABIBody is created before lowering starts, so that's too early to alloc a temp, but gen_prologue happens late (we lower instructions in reverse order) so we can't do it there either.

Suggestion: perhaps add an init method on the ABIBody trait that takes a "just in case" temp? (vregs are basically free, aside from creating sparsity in the ID space, so this is fine.) Then this could set stack_limit for use by gen_prologue and any calls.

@cfallin (Member):

Actually, one more thought: after reading through the rest of the patch, it seems that stack_limit is not used beyond the prologue. Given this, I think it would be fine to use spilltmp within gen_prologue itself, and then carefully validate that we don't keep it live across anything else that could implicitly use it. This includes any virtual registers at all: the register allocator is liable to pop in and do a surprise spill with a big offset somewhere, wiping out spilltmp.

readonly: _,
} => {
let base = generate_gv(f, abi, base, insts);
let into_reg = writable_spilltmp_reg();
@cfallin (Member):

Eek, I'm not sure if it's safe to use spilltmp here -- or at least it's not clear it will interact well with anything we might write in the future. The basic invariant is that spilltmp should only be used within a sequence of instructions generated as one unit (the original use was to compute large stackframe offsets for regalloc spills/reloads, hence the name). We should probably take an into_reg as an argument of generate_gv here, and then by implication, to gen_stack_limit too.

(I should've added more documentation about this, sorry -- if you wouldn't mind adding a comment on spilltmp_reg() to this effect, that'd be very helpful, thanks!)

@alexcrichton (Member Author):

Thanks for taking a look @cfallin! I definitely have some documentation to shore up no matter what :).

In any case I wanted to ask some questions about your comments. It sounds like you're thinking we should use vregs (which I presume is a virtual not-yet-allocated register, right?) for this stuff. I think that might be difficult though because prologue generation happens after register allocation. The saving grace, though, is that the registers used here in the prologue are only used during the prologue and never afterwards. Additionally I think that all caller-saved registers are available, not just the spilltmp one (that one just had a somewhat convenient name to pull from).

Does that make sense? I could also very well be missing something about how to plumb vregs in here.

I also don't mind writing more documentation about the stack limit support in cranelift. I didn't add documentation in my previous PR (boo) and would be happy to add some here which has notes about how this is implemented for each architecture's backend and how it's running at a unique time where register allocation/legalization aren't available.

@cfallin (Member) commented Apr 22, 2020

So, one clarification: in the new backend, the prologue generation runs before register allocation. So we are free to allocate temp virtual registers (vregs) and use them if we need to. (I guess the old backend design didn't do it this way; I didn't study the reasoning for that in detail, but the current approach made things a lot simpler for us.) Given that, I think it would be best if we added just a little plumbing: alloc a vreg when calling gen_prologue() and pass it in (as a Writable<Reg>). Sorry for the confusion on this point!

@cfallin (Member) commented Apr 22, 2020

(And, as a followup clarification to that, we made sure that the register allocator understands live ranges of both real regs and virtual regs, so it's totally OK to use a vreg and then later in the prologue, refer to a realreg like an argument register; the regalloc considers the real-reg live range when choosing where to allocate the vreg.)

@alexcrichton (Member Author):

Oh, that's news to me! I'm a bit confused, though, because the place where `gen_prologue` is called takes a `RegAllocResult`, which sounds like register allocation has already happened?

@bjorn3 (Contributor) commented Apr 24, 2020

I believe gen_prologue is indeed run after regalloc.

@alexcrichton (Member Author):

@cfallin hm so given that register allocation runs before prologue/epilogue generation, do you think it'd be ok to shore up the documentation around these and say how they can use any caller-saved register and how we're doing manual register allocation because we generate the prologue so late?

@cfallin (Member) commented Apr 24, 2020

Argh, OK, I was deeply confused yesterday; my apologies. (Many balls in the air now that we're tying up loose ends...) You're correct: prologue generation happens after regalloc. I think I was thinking of the other bits of ABI-generated instruction sequences (argument/retval copies), which use vregs freely.

Indeed, the prologue can use only real regs. So I think the best way forward is to use the spilltmp reg as you have, and then carefully document why it's safe -- between the def (computing the GV) and use (the check itself) no spills could have happened.

@alexcrichton (Member Author):

Ok, I've tried to add some comments and such, as well as rebasing and uncommenting a few tests which now pass. Mind taking another look, @cfallin?

@cfallin (Member) left a comment

Looks good, thanks! Sorry about the earlier confusion and thanks for your patience!

/// means that we need to do manual register allocation here and also be
/// careful to not clobber any callee-saved or argument registers. For now
/// this routine makes do with the `writable_spilltmp_reg` as one temporary
/// register, and a second register of `x16` which is caler-saved. This
@cfallin (Member):

nit: s/caler/caller/

@alexcrichton alexcrichton merged commit 74eda80 into bytecodealliance:master Apr 24, 2020
@alexcrichton
Copy link
Member Author

No worries at all! Thanks again for the review :)

@alexcrichton alexcrichton deleted the aarch64-stack-limit branch April 24, 2020 20:02
cfallin added a commit to cfallin/wasmtime that referenced this pull request Apr 24, 2020

Labels: cranelift (Issues related to the Cranelift code generator), cranelift:area:aarch64 (Issues related to AArch64 backend)

Linked issue: Stack limit protections not implemented for AArch64 (#1569)

3 participants