cranelift: Support callee-saved registers with tail calls on x64 #8246
elliottt merged 12 commits into bytecodealliance:main
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
Instead of building and copying the new frame over the old one, make use of the frame shrink/grow pseudo-instructions to move the frame, and then reuse the existing epilogue generation functions to set up the tail call.
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
Force-pushed from 8059780 to 5204d6b.
jameysharp left a comment
I'm proud of the work Trevor and I did on this!
Here are a few quick comment edits after giving this another complete read-through.
Also I see we have only one function in the precise-output compile filetests that exercises
Super excited for this! Going to start digging in. Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves #6759?
Or I guess the comparison we really care about is
This gives us the "wasmtime" vs "tail" calling conventions comparison we want.
fitzgen left a comment
Thanks!! Lots of nitpicks, but generally really like this.
+1 to comment about more filetests and the lack of shrink_frame testing. Seems like it would be good to exercise the following cases, which it seems like we aren't already:
- tail call from a function with multiple stack args to a function with a single stack arg
- tail call from a function with multiple stack args to a function with zero stack args
- tail call from a function with a single stack arg to a function with multiple stack args
- tail call from a function with zero stack args to a function with multiple stack args
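The four cases above differ only in how the caller's stack-argument area compares with the callee's. A minimal Rust sketch of that classification follows. It assumes six integer argument registers (inferred from the `%one_stack_arg` filetest in this PR, where seven `i32` arguments leave one on the stack) and one 8-byte slot per stack argument, and it ignores any alignment padding. `REG_ARGS`, `SLOT_BYTES`, `stack_arg_bytes`, and `adjustment` are hypothetical names for illustration, not Cranelift APIs.

```rust
/// Assumed number of integer arguments passed in registers
/// (hypothetical constant for this sketch).
const REG_ARGS: u32 = 6;
/// Assumed size in bytes of one stack-argument slot.
const SLOT_BYTES: u32 = 8;

/// How the argument area must change before jumping to the tail-callee.
#[derive(Debug, PartialEq)]
enum Adjustment {
    Grow(u32),
    Shrink(u32),
    None,
}

/// Bytes of stack arguments for a signature with `num_args` integer args.
fn stack_arg_bytes(num_args: u32) -> u32 {
    num_args.saturating_sub(REG_ARGS) * SLOT_BYTES
}

/// Compare the caller's (old) and callee's (new) stack-argument sizes.
fn adjustment(caller_args: u32, callee_args: u32) -> Adjustment {
    let old = stack_arg_bytes(caller_args);
    let new = stack_arg_bytes(callee_args);
    if new > old {
        Adjustment::Grow(new - old)
    } else if new < old {
        Adjustment::Shrink(old - new)
    } else {
        Adjustment::None
    }
}

fn main() {
    // multiple stack args -> single stack arg
    println!("{:?}", adjustment(9, 7));
    // multiple stack args -> zero stack args
    println!("{:?}", adjustment(9, 6));
    // single stack arg -> multiple stack args
    println!("{:?}", adjustment(7, 9));
    // zero stack args -> multiple stack args
    println!("{:?}", adjustment(6, 9));
}
```

Under these assumptions the first two cases shrink the argument area and the last two grow it, so a filetest per case would cover both pseudo-instructions in both the "to one" and "to zero" directions.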
// The total size that we're going to copy, including the return address and frame
// pointer that are pushed on the stack already.
This does not include stack arguments right? I think it is worth noting that in the comment.
(Sorry if these nitpicks on comments are too nitpicky. It's just that all this ABI code is so subtle and finicky and has to be Just Right, that I am constantly questioning everything, and if my questions aren't immediately answered in the comments I get really nervous that I am missing/overlooking something)
Diagrams of the stack frame, with labeled sections/slots, could help alleviate some of my fears/questions here. Y'all should know I'm a sucker for ASCII diagrams in comments by now :-p
/// The size of the new stack frame's stack arguments. This is necessary
/// for copying the frame over our current frame. It must already be
/// allocated on the stack.
pub new_stack_arg_size: u32,
/// The size of the current/old stack frame's stack arguments.
pub old_stack_arg_size: u32,
/// The return address. Needs to be written into the correct stack slot
/// after the new stack frame is copied into place.
pub ret_addr: Option<Gpr>,
/// A copy of the frame pointer, because we will overwrite the current
/// `rbp`.
pub fp: Gpr,
/// A temporary register.
pub tmp: WritableGpr,
Force-pushed from 8b7ee11 to 8597592.
Co-authored-by: Jamey Sharp <jamey@minilop.net>
fitzgen left a comment
Looks great! Thanks! 🎉
Excited for the benchmark results!
;; GrowArgumentArea does a memmove of everything in the frame except for
;; the argument area, to make room for more arguments. That includes all
;; the stack slots, the callee-saved registers, and the saved FP and
;; return address. To keep the stack pointers in sync with that change,
;; it also subtracts the given amount from both the FP and SP registers.
(GrowArgumentArea (amount u32)
                  (tmp WritableGpr))

;; ShrinkArgumentArea does a memmove of everything in the frame except
;; for the argument area, to trim space for fewer arguments. That
;; includes all the stack slots, the callee-saved registers, and the
;; saved FP and return address. To keep the stack pointers in sync with
;; that change, it also adds the given amount to both the FP and SP
;; registers.
(ShrinkArgumentArea (amount u32)
                    (tmp WritableGpr))
These comments are fantastic -- thanks!
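The memmove-and-adjust behavior those comments describe can be modeled in a few lines of Rust. This is only a sketch under stated assumptions, not the backend's implementation: memory is a `Vec<u64>` of 8-byte words indexed by word address, the stack grows toward lower indices, the saved FP sits at `fp` with the return address at `fp + 1`, and `amount` is counted in words rather than bytes. `Machine`, `example`, and the method names are hypothetical.

```rust
/// Toy model of a downward-growing stack, addressed in 8-byte words.
struct Machine {
    mem: Vec<u64>,
    sp: usize, // lowest word of the current frame
    fp: usize, // word holding the saved FP; return address is at fp + 1
}

impl Machine {
    /// Everything in the frame except the argument area: stack slots,
    /// callee-saves, saved FP, and return address.
    fn frame_range(&self) -> std::ops::Range<usize> {
        self.sp..self.fp + 2
    }

    /// Move the frame down by `amount` words, making room for more args.
    fn grow_argument_area(&mut self, amount: usize) {
        let frame = self.frame_range();
        // `copy_within` has memmove semantics, so overlap is fine.
        self.mem.copy_within(frame, self.sp - amount);
        self.sp -= amount;
        self.fp -= amount;
    }

    /// Move the frame up by `amount` words, trimming unused arg space.
    fn shrink_argument_area(&mut self, amount: usize) {
        let frame = self.frame_range();
        self.mem.copy_within(frame, self.sp + amount);
        self.sp += amount;
        self.fp += amount;
    }
}

/// A frame with one stack slot, one callee-save, and one stack argument.
fn example() -> Machine {
    let mut mem = vec![0u64; 16];
    mem[8] = 0x51; // stack slot
    mem[9] = 0xC5; // callee-saved register
    mem[10] = 0xF9; // saved FP
    mem[11] = 0x4A; // return address
    mem[12] = 0xA1; // incoming stack argument
    Machine { mem, sp: 8, fp: 10 }
}

fn main() {
    let mut m = example();
    m.grow_argument_area(2);
    println!("after grow:   sp={} fp={} ret={:#x}", m.sp, m.fp, m.mem[m.fp + 1]);
    m.shrink_argument_area(2);
    println!("after shrink: sp={} fp={} ret={:#x}", m.sp, m.fp, m.mem[m.fp + 1]);
}
```

In this model the words vacated by a grow still hold stale copies of the old saved FP and return address; the caller of the pseudo-instruction is expected to overwrite them with the new stack arguments before jumping.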
function %call_one_stack_arg(i32, i32, i32, i32, i32, i32, i32, i32, i32) tail {
    fn0 = colocated %one_stack_arg(i32, i32, i32, i32, i32, i32, i32) tail

block0(v0: i32, v1: i32, v2: i32, v3: i32, v4: i32, v5: i32, v6: i32, v7: i32, v8: i32):
    return_call fn0(v2, v3, v4, v5, v6, v7, v8)
}
Here are the execution benchmarks for

Running the same benchmarks on main without tail calls and this branch with tail calls separately yields the following results:

[benchmark tables for main and for this branch are missing here]

It's exciting to see that the max for the tail-calls branch is smaller than the min of main for spidermonkey, though even if that's spurious they're still in the same ballpark 🎉
Ship it!
The `gen_spill` and `gen_reload` methods on `Callee` are used to emit appropriate moves between registers and the stack, as directed by the register allocator. These moves always apply to a single register at a time, even if that register was originally part of a group of registers. For example, when an I128 is represented using two 64-bit registers, either of those registers may be spilled independently.

As a result, the `load_spillslot`/`store_spillslot` helpers were more general than necessary, which in turn required extra complexity in the `gen_load_stack_multi`/`gen_store_stack_multi` helpers. None of these helpers were used in any other context, so all that complexity was unnecessary. Inlining all four helpers and then simplifying eliminates a lot of code without changing the output of the compiler.

These helpers were also the only uses of `StackAMode::offset`, so I've deleted that. While I was there, I also deleted `StackAMode::get_type`, which was introduced in bytecodealliance#8151 and became unused again in bytecodealliance#8246.
Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:
- `Inst::ReturnCall` in the x64 backend, reuse the `gen_clobber_save` and `gen_epilogue_frame_restore` functions to set up the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 ABI for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.
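The save-in-prologue, restore-before-jump ordering can be illustrated with a toy Rust model. This is a sketch under loud assumptions, not generated code: a single `u64` field stands in for all callee-saved registers, a `Vec<u64>` stands in for the stack, and `Cpu`, `f`, and `g` are hypothetical names. The point it shows is that `f`'s caller sees its callee-saved value intact even though control returns from `g`, not from `f`.

```rust
/// Toy machine state: one callee-saved register and a stack.
struct Cpu {
    callee_saved: u64,
    stack: Vec<u64>,
}

/// `f` clobbers the callee-saved register, then tail-calls `g`.
fn f(cpu: &mut Cpu) {
    let entry_depth = cpu.stack.len();
    // Prologue: save the callee-saved register.
    cpu.stack.push(cpu.callee_saved);
    // Body: free to clobber it.
    cpu.callee_saved = 0xDEAD;
    // Right before the jump: restore it and release f's frame.
    cpu.callee_saved = cpu.stack.pop().expect("saved in prologue");
    assert_eq!(cpu.stack.len(), entry_depth, "f's frame must be gone");
    // Tail call: g now runs as if f's caller had called it directly.
    g(cpu)
}

/// `g` follows the same discipline and returns normally.
fn g(cpu: &mut Cpu) {
    cpu.stack.push(cpu.callee_saved);
    cpu.callee_saved = 0xBEEF;
    cpu.callee_saved = cpu.stack.pop().expect("saved in prologue");
}

fn main() {
    let mut cpu = Cpu { callee_saved: 42, stack: Vec::new() };
    f(&mut cpu);
    // The original caller's value survived both functions.
    assert_eq!(cpu.callee_saved, 42);
    assert!(cpu.stack.is_empty());
    println!("callee-saved register preserved: {}", cpu.callee_saved);
}
```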
TODO
- `GrowFrame` in the stack check
- `r14` for the stack limit in the `Tail` calling convention

Co-authored-by: Jamey Sharp <jsharp@fastly.com>