Skip to content

cranelift: Support callee-saved registers with tail calls on x64#8246

Merged
elliottt merged 12 commits intobytecodealliance:mainfrom
elliottt:trevor/copy-callee-saves
Mar 27, 2024
Merged

cranelift: Support callee-saved registers with tail calls on x64#8246
elliottt merged 12 commits intobytecodealliance:mainfrom
elliottt:trevor/copy-callee-saves

Conversation

@elliottt
Copy link
Member

@elliottt elliottt commented Mar 26, 2024

Rework the tail calling convention on x64 to support tail calls by changing the compilation strategy in the following ways:

  1. Instead of making a shadow frame that we copy over the old one, instead determine if we need to grow or shrink the stack argument area and emit a special-purpose instruction to handle that case.
  2. As we're now growing and shrinking the frame in a separate instruction sequence, emit arguments as moving to FP offsets instead of SP offsets, placing them where they'll need to go for the tail call being setup.
  3. Finally, when emitting a Inst::ReturnCall in the x64 backend, reuse the gen_clobber_save and gen_epilogue_frame_restore functions to setup the frame and any callee-saved registers to the state expected by the callee we're jumping to.

With these three changes in place, we modified the x64 abi for tail calls to include a list of callee-saved registers, and observed that we now save them in the prologue, and restore them right before jumping to the tail-callee.

TODO

  • Account for the extra space used by GrowFrame in the stack check
  • Revisit the use of r14 for the stack limit in the Tail calling convention

Co-authored-by: Jamey Sharp jsharp@fastly.com

elliottt and others added 2 commits March 26, 2024 15:52
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
@elliottt elliottt requested a review from a team as a code owner March 26, 2024 22:57
@elliottt elliottt requested review from fitzgen and removed request for a team March 26, 2024 22:58
elliottt and others added 2 commits March 26, 2024 16:25
Instead of building and copying the new frame over the old one, make use
of the frame shrink/grow pseudo-instructions to move the frame, and then
reuse the existing epilogue generation functions to setup the tail call.

Co-authored-by: Jamey Sharp <jsharp@fastly.com>
Co-authored-by: Jamey Sharp <jsharp@fastly.com>
@elliottt elliottt force-pushed the trevor/copy-callee-saves branch from 8059780 to 5204d6b Compare March 26, 2024 23:25
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen isle Related to the ISLE domain-specific language labels Mar 26, 2024
@github-actions
Copy link

Subscribe to Label Action

cc @cfallin, @fitzgen

Details This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

Copy link
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm proud of the work Trevor and I did on this!

Here's a few quick comments edits after giving this another complete read-through.

@jameysharp
Copy link
Contributor

Also I see we have only one function in the precise-output compile filetests that exercises grow_frame, and none that exercise shrink_frame. I suppose we ought to extend the tests.

@fitzgen
Copy link
Member

fitzgen commented Mar 27, 2024

Super excited for this! Going to start digging in.

Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves #6759 ?

@fitzgen
Copy link
Member

fitzgen commented Mar 27, 2024

Concurrently, would you mind running the default sightglass suite with vs without this change (when enabling the wasm tail calls proposal for both) to determine if this fully resolves #6759 ?

Or I guess the comparison we really care about is

  • main with default wasm features, versus
  • this branch with wasm tail calls enabled

This gives us the "wasmtime" vs "tail" calling conventions comparison we want.

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!! Lots of nitpicks, but generally really like this.

+1 to comment about more filetests and the lack of shrink_frame testing. Seems like it would be good to exercise the following cases, which it seems like we aren't already:

  • tail call from a function with multiple stack args to a function with a single stack arg
  • tail call from a function with multiple stack args to a function with zero stack args
  • tail call from a function with a single stack arg to a function with multiple stack args
  • tail call from a function with zero stack args to a function with multiple stack args

Comment on lines +1749 to +1750
// The total size that we're going to copy, including the return address and frame
// pointer that are pushed on the stack alreadcy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not include stack arguments right? I think it is worth noting that in the comment.

(Sorry if these nitpicks on comments are too nitpicky. It's just that all this ABI code is so subtle and finicky and has to be Just Right, that I am constantly questioning everything, and if my questions aren't immediately answered in the comments I get really nervous that I am missing/overlooking something)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diagrams of the stack frame, with labeled sections/slots, could help alleviate some of my fears/questions here. Y'all should know I'm a sucker for ASCII diagrams in comments by now :-p

Comment on lines -55 to -68
/// The size of the new stack frame's stack arguments. This is necessary
/// for copying the frame over our current frame. It must already be
/// allocated on the stack.
pub new_stack_arg_size: u32,
/// The size of the current/old stack frame's stack arguments.
pub old_stack_arg_size: u32,
/// The return address. Needs to be written into the correct stack slot
/// after the new stack frame is copied into place.
pub ret_addr: Option<Gpr>,
/// A copy of the frame pointer, because we will overwrite the current
/// `rbp`.
pub fp: Gpr,
/// A temporary register.
pub tmp: WritableGpr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@elliottt elliottt force-pushed the trevor/copy-callee-saves branch from 8b7ee11 to 8597592 Compare March 27, 2024 18:58
Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks! 🎉

Excited for the benchmark results!

Comment on lines +544 to +559
;; GrowArgumentArea does a memmove of everything in the frame except for
;; the argument area, to make room for more arguments. That includes all
;; the stack slots, the callee-saved registers, and the saved FP and
;; return address. To keep the stack pointers in sync with that change,
;; it also subtracts the given amount from both the FP and SP registers.
(GrowArgumentArea (amount u32)
(tmp WritableGpr))

;; ShrinkArgumentArea does a memmove of everything in the frame except
;; for the argument area, to trim space for fewer arguments. That
;; includes all the stack slots, the callee-saved registers, and the
;; saved FP and return address. To keep the stack pointers in sync with
;; that change, it also adds the given amount to both the FP and SP
;; registers.
(ShrinkArgumentArea (amount u32)
(tmp WritableGpr))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are fantastic -- thanks!

Comment on lines +227 to +232
function %call_one_stack_arg(i32, i32, i32, i32, i32, i32, i32, i32, i32) tail {
fn0 = colocated %one_stack_arg(i32, i32, i32, i32, i32, i32, i32) tail

block0(v0: i32, v1: i32, v2: i32, v3: i32, v4: i32, v5: i32, v6: i32, v7: i32, v8: i32):
return_call fn0(v2, v3, v4, v5, v6, v7, v8)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for these tests.

@elliottt
Copy link
Member Author

Here are the execution benchmarks for spidermonkey, bz2, and pulldown-cmark relative to the current tail calls implementation on main:

execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 204255214.50 ± 513219.80 (confidence = 99%)

  tail-calls.so is 1.08x to 1.08x faster than main.so!

  [2862185961 2862312218.75 2862598037] main.so
  [2657985419 2658057004.25 2658128352] tail-calls.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 13264568.75 ± 1.43 (confidence = 99%)

  tail-calls.so is 1.06x to 1.06x faster than main.so!

  [239534828 239534828.25 239534829] main.so
  [226270259 226270259.50 226270260] tail-calls.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 391358.50 ± 894.30 (confidence = 99%)

  tail-calls.so is 1.02x to 1.02x faster than main.so!

  [20465937 20466114.00 20466584] main.so
  [20074442 20074755.50 20075106] tail-calls.so

Running the same benchmarks on main without tail calls and this branch with tail calls separately, yields the following results:

main

execution
  benchmarks/bz2/benchmark.wasm
    instructions-retired
      [226286674 226286674.00 226286674] /home/trevor/src/main.so
  benchmarks/pulldown-cmark/benchmark.wasm
    instructions-retired
      [20080325 20080512.75 20080999] /home/trevor/src/main.so
  benchmarks/spidermonkey/benchmark.wasm
    instructions-retired
      [2659583881 2659624337.75 2659672231] /home/trevor/src/main.so

this branch

execution
  benchmarks/bz2/benchmark.wasm
    instructions-retired
      [226270259 226270259.50 226270260] /home/trevor/src/tail-calls.so
  benchmarks/pulldown-cmark/benchmark.wasm
    instructions-retired
      [20074405 20074419.00 20074445] /home/trevor/src/tail-calls.so
  benchmarks/spidermonkey/benchmark.wasm
    instructions-retired
      [2657979471 2658099662.50 2658220755] /home/trevor/src/tail-calls.so

It's exciting to see that the max for the tail-calls branch is smaller than the min of main for spidermonkey, though even if that's spurious they're still in the same ballpark 🎉

@elliottt elliottt added this pull request to the merge queue Mar 27, 2024
@fitzgen
Copy link
Member

fitzgen commented Mar 27, 2024

Ship it!

Merged via the queue into bytecodealliance:main with commit a461382 Mar 27, 2024
@elliottt elliottt deleted the trevor/copy-callee-saves branch March 27, 2024 23:50
jameysharp added a commit to jameysharp/wasmtime that referenced this pull request Apr 3, 2024
The `gen_spill` and `gen_reload` methods on `Callee` are used to emit
appropriate moves between registers and the stack, as directed by the
register allocator.

These moves always apply to a single register at a time, even if that
register was originally part of a group of registers. For example, when
an I128 is represented using two 64-bit registers, either of those
registers may be spilled independently.

As a result, the `load_spillslot`/`store_spillslot` helpers were more
general than necessary, which in turn required extra complexity in the
`gen_load_stack_multi`/`gen_store_stack_multi` helpers. None of these
helpers were used in any other context, so all that complexity was
unnecessary.

Inlining all four helpers and then simplifying eliminates a lot of code
without changing the output of the compiler.

These helpers were also the only uses of `StackAMode::offset`, so I've
deleted that. While I was there, I also deleted `StackAMode::get_type`,
which was introduced in bytecodealliance#8151 and became unused again in bytecodealliance#8246.
github-merge-queue bot pushed a commit that referenced this pull request Apr 3, 2024
The `gen_spill` and `gen_reload` methods on `Callee` are used to emit
appropriate moves between registers and the stack, as directed by the
register allocator.

These moves always apply to a single register at a time, even if that
register was originally part of a group of registers. For example, when
an I128 is represented using two 64-bit registers, either of those
registers may be spilled independently.

As a result, the `load_spillslot`/`store_spillslot` helpers were more
general than necessary, which in turn required extra complexity in the
`gen_load_stack_multi`/`gen_store_stack_multi` helpers. None of these
helpers were used in any other context, so all that complexity was
unnecessary.

Inlining all four helpers and then simplifying eliminates a lot of code
without changing the output of the compiler.

These helpers were also the only uses of `StackAMode::offset`, so I've
deleted that. While I was there, I also deleted `StackAMode::get_type`,
which was introduced in #8151 and became unused again in #8246.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator isle Related to the ISLE domain-specific language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants