cranelift: Support callee-saved registers with tail calls on x64 by elliottt · Pull Request #8246 · bytecodealliance/wasmtime

elliottt · 2024-03-26T22:57:59Z

Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:

Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

TODO

Account for the extra space used by GrowFrame in the stack check
Revisit the use of r14 for the stack limit in the Tail calling convention

Co-authored-by: Jamey Sharp jsharp@fastly.com

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Instead of building and copying the new frame over the old one, make use of the frame shrink/grow pseudo-instructions to move the frame, and then reuse the existing epilogue generation functions to setup the tail call. Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

github-actions · 2024-03-26T23:44:43Z

Subscribe to Label Action

cc @cfallin, @fitzgen

Details

This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

jameysharp

I'm proud of the work Trevor and I did on this!

Here's a few quick comments edits after giving this another complete read-through.

cranelift/codegen/src/isa/x64/abi.rs

cranelift/codegen/src/isa/x64/inst/emit.rs

jameysharp · 2024-03-27T01:03:31Z

Also I see we have only one function in the precise-output compile filetests that exercises grow_frame, and none that exercise shrink_frame. I suppose we ought to extend the tests.

fitzgen · 2024-03-27T15:04:52Z

Super excited for this! Going to start digging in.

Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves #6759 ?

fitzgen · 2024-03-27T15:10:35Z

Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves #6759 ?

Or I guess the comparison we really care about is

main with default wasm features, versus
this branch with wasm tail calls enabled

This gives us the "wasmtime" vs "tail" calling conventions comparison we want.

fitzgen

Thanks!! Lots of nitpicks, but generally really like this.

+1 to comment about more filetests and the lack of shrink_frame testing. Seems like it would be good to exercise the following cases, which it seems like we aren't already:

tail call from a function with multiple stack args to a function with a single stack arg
tail call from a function with multiple stack args to a function with zero stack args
tail call from a function with a single stack arg to a function with multiple stack args
tail call from a function with zero stack args to a function with multiple stack args

cranelift/codegen/src/isa/x64/inst.isle

cranelift/codegen/src/isa/x64/inst/emit.rs

fitzgen · 2024-03-27T16:45:27Z

cranelift/codegen/src/isa/x64/inst/emit.rs

+            // The total size that we're going to copy, including the return address and frame
+            // pointer that are pushed on the stack alreadcy.


This does not include stack arguments right? I think it is worth noting that in the comment.

(Sorry if these nitpicks on comments are too nitpicky. It's just that all this ABI code is so subtle and finicky and has to be Just Right, that I am constantly questioning everything, and if my questions aren't immediately answered in the comments I get really nervous that I am missing/overlooking something)

Diagrams of the stack frame, with labeled sections/slots, could help alleviate some of my fears/questions here. Y'all should know I'm a sucker for ASCII diagrams in comments by now :-p

cranelift/codegen/src/isa/x64/inst/emit.rs

fitzgen · 2024-03-27T17:14:14Z

cranelift/codegen/src/isa/x64/inst/mod.rs

-    /// The size of the new stack frame's stack arguments. This is necessary
-    /// for copying the frame over our current frame. It must already be
-    /// allocated on the stack.
-    pub new_stack_arg_size: u32,
-    /// The size of the current/old stack frame's stack arguments.
-    pub old_stack_arg_size: u32,
-    /// The return address. Needs to be written into the correct stack slot
-    /// after the new stack frame is copied into place.
-    pub ret_addr: Option<Gpr>,
-    /// A copy of the frame pointer, because we will overwrite the current
-    /// `rbp`.
-    pub fp: Gpr,
-    /// A temporary register.
-    pub tmp: WritableGpr,


cranelift/codegen/src/machinst/abi.rs

…il cc

Co-authored-by: Jamey Sharp <jamey@minilop.net>

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

fitzgen

Looks great! Thanks! 🎉

Excited for the benchmark results!

fitzgen · 2024-03-27T23:12:03Z

cranelift/codegen/src/isa/x64/inst.isle

+       ;; GrowArgumentArea does a memmove of everything in the frame except for
+       ;; the argument area, to make room for more arguments. That includes all
+       ;; the stack slots, the callee-saved registers, and the saved FP and
+       ;; return address. To keep the stack pointers in sync with that change,
+       ;; it also subtracts the given amount from both the FP and SP registers.
+       (GrowArgumentArea (amount u32)
+                         (tmp WritableGpr))
+
+       ;; ShrinkArgumentArea does a memmove of everything in the frame except
+       ;; for the argument area, to trim space for fewer arguments. That
+       ;; includes all the stack slots, the callee-saved registers, and the
+       ;; saved FP and return address. To keep the stack pointers in sync with
+       ;; that change, it also adds the given amount to both the FP and SP
+       ;; registers.
+       (ShrinkArgumentArea (amount u32)
+                           (tmp WritableGpr))


These comments are fantastic -- thanks!

fitzgen · 2024-03-27T23:15:53Z

cranelift/filetests/filetests/isa/x64/return-call.clif

+function %call_one_stack_arg(i32, i32, i32, i32, i32, i32, i32, i32, i32) tail {
+    fn0 = colocated %one_stack_arg(i32, i32, i32, i32, i32, i32, i32) tail
+
+block0(v0: i32, v1: i32, v2: i32, v3: i32, v4: i32, v5: i32, v6: i32, v7: i32, v8: i32):
+    return_call fn0(v2, v3, v4, v5, v6, v7, v8)
+}


Nice, thanks for these tests.

elliottt · 2024-03-27T23:18:10Z

Here are the execution benchmarks for spidermonkey, bz2, and pulldown-cmark relative to the current tail calls implementation on main:

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 204255214.50 ± 513219.80 (confidence = 99%)

  tail-calls.so is 1.08x to 1.08x faster than main.so!

  [2862185961 2862312218.75 2862598037] main.so
  [2657985419 2658057004.25 2658128352] tail-calls.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 13264568.75 ± 1.43 (confidence = 99%)

  tail-calls.so is 1.06x to 1.06x faster than main.so!

  [239534828 239534828.25 239534829] main.so
  [226270259 226270259.50 226270260] tail-calls.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 391358.50 ± 894.30 (confidence = 99%)

  tail-calls.so is 1.02x to 1.02x faster than main.so!

  [20465937 20466114.00 20466584] main.so
  [20074442 20074755.50 20075106] tail-calls.so

Running the same benchmarks on main without tail calls and this branch with tail calls separately, yields the following results:

main

execution
  benchmarks/bz2/benchmark.wasm
    instructions-retired
      [226286674 226286674.00 226286674] /home/trevor/src/main.so
  benchmarks/pulldown-cmark/benchmark.wasm
    instructions-retired
      [20080325 20080512.75 20080999] /home/trevor/src/main.so
  benchmarks/spidermonkey/benchmark.wasm
    instructions-retired
      [2659583881 2659624337.75 2659672231] /home/trevor/src/main.so

this branch

execution
  benchmarks/bz2/benchmark.wasm
    instructions-retired
      [226270259 226270259.50 226270260] /home/trevor/src/tail-calls.so
  benchmarks/pulldown-cmark/benchmark.wasm
    instructions-retired
      [20074405 20074419.00 20074445] /home/trevor/src/tail-calls.so
  benchmarks/spidermonkey/benchmark.wasm
    instructions-retired
      [2657979471 2658099662.50 2658220755] /home/trevor/src/tail-calls.so

It's exciting to see that the max for the tail-calls branch is smaller than the min of main for spidermonkey, though even if that's spurious they're still in the same ballpark 🎉

fitzgen · 2024-03-27T23:22:04Z

Ship it!

The `gen_spill` and `gen_reload` methods on `Callee` are used to emit appropriate moves between registers and the stack, as directed by the register allocator. These moves always apply to a single register at a time, even if that register was originally part of a group of registers. For example, when an I128 is represented using two 64-bit registers, either of those registers may be spilled independently. As a result, the `load_spillslot`/`store_spillslot` helpers were more general than necessary, which in turn required extra complexity in the `gen_load_stack_multi`/`gen_store_stack_multi` helpers. None of these helpers were used in any other context, so all that complexity was unnecessary. Inlining all four helpers and then simplifying eliminates a lot of code without changing the output of the compiler. These helpers were also the only uses of `StackAMode::offset`, so I've deleted that. While I was there, I also deleted `StackAMode::get_type`, which was introduced in bytecodealliance#8151 and became unused again in bytecodealliance#8246.

The `gen_spill` and `gen_reload` methods on `Callee` are used to emit appropriate moves between registers and the stack, as directed by the register allocator. These moves always apply to a single register at a time, even if that register was originally part of a group of registers. For example, when an I128 is represented using two 64-bit registers, either of those registers may be spilled independently. As a result, the `load_spillslot`/`store_spillslot` helpers were more general than necessary, which in turn required extra complexity in the `gen_load_stack_multi`/`gen_store_stack_multi` helpers. None of these helpers were used in any other context, so all that complexity was unnecessary. Inlining all four helpers and then simplifying eliminates a lot of code without changing the output of the compiler. These helpers were also the only uses of `StackAMode::offset`, so I've deleted that. While I was there, I also deleted `StackAMode::get_type`, which was introduced in #8151 and became unused again in #8246.

elliottt and others added 2 commits March 26, 2024 15:52

Add GrowFrame and ShrinkFrame instructions for moving the frame

eff3018

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Experimentally emit grow/shrink frame instructions for x64 tail calls

fe39506

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

elliottt requested a review from a team as a code owner March 26, 2024 22:57

elliottt requested review from fitzgen and removed request for a team March 26, 2024 22:58

elliottt and others added 2 commits March 26, 2024 16:25

Enable callee saves with the tail calling convention on x64

5204d6b

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

elliottt force-pushed the trevor/copy-callee-saves branch from 8059780 to 5204d6b Compare March 26, 2024 23:25

jameysharp reviewed Mar 27, 2024

View reviewed changes

cranelift/codegen/src/isa/x64/abi.rs Outdated Show resolved Hide resolved

cranelift/codegen/src/isa/x64/inst/emit.rs Outdated Show resolved Hide resolved

cranelift/codegen/src/isa/x64/inst/emit.rs Outdated Show resolved Hide resolved

fitzgen reviewed Mar 27, 2024

View reviewed changes

elliottt added 2 commits March 27, 2024 10:28

Remove the requirement that indirect calls go through r15 with the ta…

2f6131a

…il cc

Stop using r14 for a temporary during the stack check with the tail cc

8597592

elliottt force-pushed the trevor/copy-callee-saves branch from 8b7ee11 to 8597592 Compare March 27, 2024 18:58

elliottt and others added 6 commits March 27, 2024 13:35

Apply suggestions from code review

ff1c5e1

Co-authored-by: Jamey Sharp <jamey@minilop.net>

Remove constants in favor of reusing values computed for FrameLayout

90545c5

Co-authored-by: Jamey Sharp <jsharp@fastly.com>

Suggestions from review

dd4b4c3

Rename the grow/shrink frame instructions, and adjust their comments

20702a0

Comments on ArgLoc

16b9655

Add more tests for return_call, and fix grow/shrink arg area printing

5917714

fitzgen approved these changes Mar 27, 2024

View reviewed changes

elliottt added this pull request to the merge queue Mar 27, 2024

Merged via the queue into bytecodealliance:main with commit a461382 Mar 27, 2024

elliottt deleted the trevor/copy-callee-saves branch March 27, 2024 23:50

jameysharp mentioned this pull request Apr 3, 2024

cranelift: Simplify spill/reload emission #8296

Merged

		// The total size that we're going to copy, including the return address and frame
		// pointer that are pushed on the stack alreadcy.

Conversation

elliottt commented Mar 26, 2024 • edited by jameysharp Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TODO

Uh oh!

github-actions bot commented Mar 26, 2024

Subscribe to Label Action

Uh oh!

jameysharp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jameysharp commented Mar 27, 2024

Uh oh!

fitzgen commented Mar 27, 2024

Uh oh!

fitzgen commented Mar 27, 2024

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fitzgen Mar 27, 2024

Choose a reason for hiding this comment

Uh oh!

fitzgen Mar 27, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fitzgen Mar 27, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

fitzgen left a comment

Choose a reason for hiding this comment

Uh oh!

fitzgen Mar 27, 2024

Choose a reason for hiding this comment

Uh oh!

fitzgen Mar 27, 2024

Choose a reason for hiding this comment

Uh oh!

elliottt commented Mar 27, 2024

main

this branch

Uh oh!

fitzgen commented Mar 27, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

elliottt commented Mar 26, 2024 •

edited by jameysharp

Loading