
Conversation

@folkertdev
Contributor

@folkertdev folkertdev commented Nov 20, 2025

fixes #56891
fixes #72390

Currently the x86 backend miscompiles straightforward tail calls when the stack is used for argument passing. This program segfaults on any optimization level:

https://godbolt.org/z/5xr99jr4v

#include <stdint.h>

typedef struct {
    uint64_t x;
    uint64_t y;
    uint64_t z;
} S;

__attribute__((noinline))
uint64_t callee(S s) {
    return s.x + s.y + s.z;
}

__attribute__((noinline))
uint64_t caller(S s) {
    [[clang::musttail]]
    return callee(s);
}

The immediate issue is that caller decides to shuffle values around on the stack, and in the process writes to *rsp, which contains the return address. With the return address trashed, the ret in callee jumps to an invalid address.

caller:
        mov     rax, qword ptr [rsp + 24]
        mov     qword ptr [rsp + 16], rax
        movaps  xmm0, xmmword ptr [rsp + 8]
        movups  xmmword ptr [rsp], xmm0 ; <-- that is just all kinds of wrong
        movaps  xmmword ptr [rsp + 8], xmm0
        mov     qword ptr [rsp + 24], rax
        jmp     callee

However, I think the actual problem is that the x86 backend never considers musttail calls to be sibcalls. For sibcalls, no stack reshuffling is required at all, circumventing the problem here.
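To make that concrete, here is a minimal IR sketch of the shape involved (my own reduction mirroring the C example above; the type and function names are made up):

%S = type { i64, i64, i64 }

declare i64 @callee(ptr byval(%S))

; With musttail calls eligible as sibcalls, the incoming byval argument is
; already in the right stack slots, so this can lower to a plain `jmp callee`
; with no reshuffling (and no write over the return address).
define i64 @caller(ptr byval(%S) %s) {
  %r = musttail call i64 @callee(ptr byval(%S) %s)
  ret i64 %r
}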

This PR essentially copies https://reviews.llvm.org/D131034 (cc @huangjd), but this time I hope we can actually land this, and solve this problem.

The aarch64 backend also miscompiled this example, but that appears to have been fixed in LLVM 20.

Tail calls simply not working for non-trivial argument types is a blocker for tail call support in Rust; see rust-lang/rust#144855 (comment).

@llvmbot
Member

llvmbot commented Nov 20, 2025

@llvm/pr-subscribers-backend-x86

Author: Folkert de Vries (folkertdev)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/168956.diff

4 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLoweringCall.cpp (+12-8)
  • (added) llvm/test/CodeGen/X86/musttail-struct.ll (+57)
  • (modified) llvm/test/CodeGen/X86/musttail-tailcc.ll (-18)
  • (modified) llvm/test/CodeGen/X86/swifttailcc-store-ret-address-aliasing-stack-slot.ll (+2-4)
diff --git a/llvm/lib/Target/X86/X86ISelLoweringCall.cpp b/llvm/lib/Target/X86/X86ISelLoweringCall.cpp
index 37d77728882b1..e53e5a2a02cb4 100644
--- a/llvm/lib/Target/X86/X86ISelLoweringCall.cpp
+++ b/llvm/lib/Target/X86/X86ISelLoweringCall.cpp
@@ -2098,15 +2098,18 @@ X86TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
       isTailCall = false;
   }
 
-  if (isTailCall && !IsMustTail) {
+  if (isTailCall) {
     // Check if it's really possible to do a tail call.
-    isTailCall = IsEligibleForTailCallOptimization(CLI, CCInfo, ArgLocs,
-                                                   IsCalleePopSRet);
+    IsSibcall = IsEligibleForTailCallOptimization(CLI, CCInfo, ArgLocs,
+                                                  IsCalleePopSRet);
+
+    if (!IsMustTail) {
+      isTailCall = IsSibcall;
 
-    // Sibcalls are automatically detected tailcalls which do not require
-    // ABI changes.
-    if (!IsGuaranteeTCO && isTailCall)
-      IsSibcall = true;
+      // Sibcalls are automatically detected tailcalls which do not require
+      // ABI changes.
+      IsSibcall = IsSibcall && !IsGuaranteeTCO;
+    }
 
     if (isTailCall)
       ++NumTailCalls;
@@ -2128,8 +2131,9 @@ X86TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
   else if (IsGuaranteeTCO && canGuaranteeTCO(CallConv))
     NumBytes = GetAlignedArgumentStackSize(NumBytes, DAG);
 
+  // A sibcall is ABI-compatible and does not need to adjust the stack pointer.
   int FPDiff = 0;
-  if (isTailCall &&
+  if (isTailCall && !IsSibcall &&
       shouldGuaranteeTCO(CallConv,
                          MF.getTarget().Options.GuaranteedTailCallOpt)) {
     // Lower arguments at fp - stackoffset + fpdiff.
diff --git a/llvm/test/CodeGen/X86/musttail-struct.ll b/llvm/test/CodeGen/X86/musttail-struct.ll
new file mode 100644
index 0000000000000..0ae6b91d4fc02
--- /dev/null
+++ b/llvm/test/CodeGen/X86/musttail-struct.ll
@@ -0,0 +1,57 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
+; RUN: llc < %s -mtriple=i686-unknown-unknown | FileCheck %s
+
+; Test correct handling of a musttail call with a byval struct argument.
+
+%struct.1xi32 = type { [1 x i32] }
+%struct.3xi32 = type { [3 x i32] }
+%struct.5xi32 = type { [5 x i32] }
+
+declare dso_local i32 @Func1(ptr byval(%struct.1xi32) %0)
+declare dso_local i32 @Func3(ptr byval(%struct.3xi32) %0)
+declare dso_local i32 @Func5(ptr byval(%struct.5xi32) %0)
+declare dso_local i32 @FuncManyArgs(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i8 %6, ptr byval(%struct.5xi32) %7)
+
+define dso_local i32 @test1(ptr byval(%struct.1xi32) %0) {
+; CHECK-LABEL: test1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    jmp Func1 # TAILCALL
+  %r = musttail call i32 @Func1(ptr byval(%struct.1xi32) %0)
+  ret i32 %r
+}
+
+define dso_local i32 @test3(ptr byval(%struct.3xi32) %0) {
+; CHECK-LABEL: test3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    jmp Func3 # TAILCALL
+  %r = musttail call i32 @Func3(ptr byval(%struct.3xi32) %0)
+  ret i32 %r
+}
+
+; sizeof(%struct.5xi32) > 16, in x64 this is passed on stack.
+define dso_local i32 @test5(ptr byval(%struct.5xi32) %0) {
+; CHECK-LABEL: test5:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    jmp Func5 # TAILCALL
+  %r = musttail call i32 @Func5(ptr byval(%struct.5xi32) %0)
+  ret i32 %r
+}
+
+; Test passing multiple arguments with different sizes on stack. In x64 Linux
+; the first 6 are passed by register.
+define dso_local i32 @testManyArgs(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i8 %6, ptr byval(%struct.5xi32) %7) {
+; CHECK-LABEL: testManyArgs:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    jmp FuncManyArgs # TAILCALL
+  %r = musttail call i32 @FuncManyArgs(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i8 %6, ptr byval(%struct.5xi32) %7)
+  ret i32 %r
+}
+
+define dso_local i32 @testRecursion(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i8 %6, ptr byval(%struct.5xi32) %7) {
+; CHECK-LABEL: testRecursion:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    jmp testRecursion # TAILCALL
+  %r = musttail call i32 @testRecursion(i32 %0, i32 %1, i32 %2, i32 %3, i32 %4, i32 %5, i8 %6, ptr byval(%struct.5xi32) %7)
+  ret i32 %r
+}
diff --git a/llvm/test/CodeGen/X86/musttail-tailcc.ll b/llvm/test/CodeGen/X86/musttail-tailcc.ll
index fae698d53b927..f1ffbcb1142c5 100644
--- a/llvm/test/CodeGen/X86/musttail-tailcc.ll
+++ b/llvm/test/CodeGen/X86/musttail-tailcc.ll
@@ -55,15 +55,6 @@ define dso_local tailcc void @void_test(i32, i32, i32, i32) {
 ;
 ; X86-LABEL: void_test:
 ; X86:       # %bb.0: # %entry
-; X86-NEXT:    pushl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 8
-; X86-NEXT:    .cfi_offset %esi, -8
-; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
-; X86-NEXT:    movl {{[0-9]+}}(%esp), %esi
-; X86-NEXT:    movl %esi, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl %eax, {{[0-9]+}}(%esp)
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
 ; X86-NEXT:    jmp void_test # TAILCALL
   entry:
    musttail call tailcc void @void_test( i32 %0, i32 %1, i32 %2, i32 %3)
@@ -77,15 +68,6 @@ define dso_local tailcc i1 @i1test(i32, i32, i32, i32) {
 ;
 ; X86-LABEL: i1test:
 ; X86:       # %bb.0: # %entry
-; X86-NEXT:    pushl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 8
-; X86-NEXT:    .cfi_offset %esi, -8
-; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
-; X86-NEXT:    movl {{[0-9]+}}(%esp), %esi
-; X86-NEXT:    movl %esi, {{[0-9]+}}(%esp)
-; X86-NEXT:    movl %eax, {{[0-9]+}}(%esp)
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
 ; X86-NEXT:    jmp i1test # TAILCALL
   entry:
   %4 = musttail call tailcc i1 @i1test( i32 %0, i32 %1, i32 %2, i32 %3)
diff --git a/llvm/test/CodeGen/X86/swifttailcc-store-ret-address-aliasing-stack-slot.ll b/llvm/test/CodeGen/X86/swifttailcc-store-ret-address-aliasing-stack-slot.ll
index cd669768705e5..b901d22f66392 100644
--- a/llvm/test/CodeGen/X86/swifttailcc-store-ret-address-aliasing-stack-slot.ll
+++ b/llvm/test/CodeGen/X86/swifttailcc-store-ret-address-aliasing-stack-slot.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
 ; RUN: llc %s -o - | FileCheck %s
 
 target triple = "x86_64-apple-macosx"
@@ -24,9 +25,7 @@ define swifttailcc void @test(ptr %0, ptr swiftasync %1, i64 %2, i64 %3, ptr %4,
 ; CHECK-NEXT:    movq {{[0-9]+}}(%rsp), %r15
 ; CHECK-NEXT:    callq _foo
 ; CHECK-NEXT:    movq %r14, (%rax)
-; CHECK-NEXT:    movl [[OFF:[0-9]+]](%rsp), %edx
-; CHECK-NEXT:    movq {{[0-9]+}}(%rsp), %rcx
-; CHECK-NEXT:    movq %rcx, [[OFF]](%rsp)
+; CHECK-NEXT:    movl {{[0-9]+}}(%rsp), %edx
 ; CHECK-NEXT:    movq %rax, %r14
 ; CHECK-NEXT:    movq %r13, %rdi
 ; CHECK-NEXT:    movq %r15, %rsi
@@ -34,7 +33,6 @@ define swifttailcc void @test(ptr %0, ptr swiftasync %1, i64 %2, i64 %3, ptr %4,
 ; CHECK-NEXT:    addq $8, %rsp
 ; CHECK-NEXT:    popq %rbx
 ; CHECK-NEXT:    popq %r15
-; CHECK-NEXT:    addq $16, %rsp
 ; CHECK-NEXT:    jmp _tc_fn ## TAILCALL
 entry:
   %res = tail call ptr @foo()

@folkertdev folkertdev changed the title from "fix x86 musttail sibcall miscompilation" to "x86: fix musttail sibcall miscompilation" on Nov 20, 2025
@github-actions

github-actions bot commented Nov 20, 2025

🐧 Linux x64 Test Results

  • 187202 tests passed
  • 4932 tests skipped

✅ The build succeeded and all tests passed.

@efriedma-quic
Collaborator

As discussed in https://reviews.llvm.org/D131034, this is still broken in cases where the argument lists don't match. If you want an example of how to handle this case, see #109943.
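As an illustration of that kind of mismatch (a hypothetical shape, not taken from the linked review): with tailcc, musttail permits mismatched prototypes, so the callee below needs stack slots for its trailing arguments while the caller's incoming argument area has none to reuse.

declare tailcc i64 @many(i64, i64, i64, i64, i64, i64, i64, i64)

; On x86-64 the first six integer arguments travel in registers; the last
; two spill to the stack, and the caller's own frame has no incoming stack
; argument slots, so a plain jmp reusing the caller's argument area cannot
; work here.
define tailcc i64 @thunk(i64 %x) {
  %r = musttail call tailcc i64 @many(i64 %x, i64 %x, i64 %x, i64 %x,
                                      i64 %x, i64 %x, i64 %x, i64 %x)
  ret i64 %r
}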

@folkertdev
Contributor Author

As discussed in https://reviews.llvm.org/D131034, this is still broken in cases where the argument lists don't match. If you want an example of how to handle this case, see #109943.

Interesting, I had been looking for something like that. However, #168506 appears to use a different, simpler approach to order the reads and writes. Does that look right (and perhaps preferable) to you?

@efriedma-quic
Collaborator

As far as I can tell, LoongArchTargetLowering::LowerCall always copies byval arguments to a temporary, then copies them into the argument list, which is a legal implementation, but not really optimized.
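In IR terms, that strategy looks roughly like this (my own sketch of the idea; the real lowering happens in SelectionDAG, and the names here are invented):

%struct.5xi32 = type { [5 x i32] }

declare i32 @callee(ptr byval(%struct.5xi32))
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

define i32 @caller(ptr byval(%struct.5xi32) %p) {
  ; Copy the incoming byval argument to a fresh temporary first, so that
  ; writing the outgoing argument area can never clobber data that is still
  ; being read from the incoming area.
  %tmp = alloca %struct.5xi32
  call void @llvm.memcpy.p0.p0.i64(ptr %tmp, ptr %p, i64 20, i1 false)
  %r = musttail call i32 @callee(ptr byval(%struct.5xi32) %tmp)
  ret i32 %r
}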

@folkertdev
Contributor Author

Alright, so generally you'd like to see an x86 implementation that resembles the arm one more? Because currently the x86 implementation is quite different (and, as established, very broken).

This will be tough (for me) but I'm willing to give that a go.

@efriedma-quic
Collaborator

I'd prefer the more optimized implementation... but really, I'd be okay with any complete implementation that doesn't regress non-musttail cases. I just don't want to leave a lurking miscompile.

@folkertdev folkertdev force-pushed the x86-musttail-sibcalls branch from 839564d to 4636011 on November 22, 2025, 19:52
@folkertdev
Contributor Author

Alright, I gave this a go, but I'll need some help polishing this.

Contributor Author

@folkertdev folkertdev left a comment

That was very helpful, thanks!

@folkertdev folkertdev force-pushed the x86-musttail-sibcalls branch 3 times, most recently from 05af31d to 238805f on November 25, 2025, 12:44
@folkertdev folkertdev force-pushed the x86-musttail-sibcalls branch from 238805f to 346a911 on November 25, 2025, 19:08
Currently this logic is only actually used for musttail calls.
In particular, we don't need to bother with shuffling arguments around
on the stack, because in the x86 backend, only functions that do not
need to move arguments around on the stack are considered sibcalls.
@folkertdev folkertdev force-pushed the x86-musttail-sibcalls branch from 346a911 to e5dbe7b on November 25, 2025, 21:49
Collaborator

@efriedma-quic efriedma-quic left a comment

Let me know when you're ready for another round of review.

@folkertdev
Contributor Author

Ah yeah sorry for kind of thinking out loud here. Barring CI failures on some test I missed, I think this should be good.

So to summarize: I think we could now make IsEligibleForTailCallOptimization behave more like arm, but I'd prefer to do that in a separate PR. At least in my quick experiments it touches a lot of files, because that change can influence codegen for any tail call, not just musttail.

@folkertdev
Contributor Author

CI got further than it has so far, so I think the codegen tests are good. I've now added the x32 tests; nothing jumps out at me as incorrect.

Collaborator

The usage of IsGuaranteeTCO seems confusing at best.

I suspect we should just replace bool IsGuaranteeTCO with bool ShouldGuaranteeTCO = shouldGuaranteeTCO(CallConv, MF.getTarget().Options.GuaranteedTailCallOpt); at the beginning, and use that instead.

Collaborator

This looks very suspicious.

Contributor Author

Hmm, the cause is that the call is now marked as a sibcall. I think it properly qualifies as one?

--- <unnamed>
+++ <unnamed>
@@ -17,8 +17,6 @@
 	callq	_foo
 	movq	%r14, (%rax)
 	movl	40(%rsp), %edx
-	movq	24(%rsp), %rcx
-	movq	%rcx, 40(%rsp)
 	movq	%rax, %r14
 	movq	%r13, %rdi
 	movq	%r15, %rsi
@@ -26,7 +24,6 @@
 	addq	$8, %rsp
 	popq	%rbx
 	popq	%r15
-	addq	$16, %rsp
 	jmp	_tc_fn                          ## TAILCALL
 	.cfi_endproc
                                         ## -- End function

So previously this moved the value at rsp + 24 to rsp + 40, but then shifted the stack pointer by 16. Now we leave it at rsp + 40 and don't move the stack pointer either. I think that cancels out?

Collaborator

Are we triggering the allocation of a stack object here?

Contributor Author

We are, because the auto *SrcFrameIdxNode = dyn_cast<FrameIndexSDNode>(Src); produces a NULL pointer in ByValNeedsCopyForTailCall. The Src value is apparently a t7: i32,ch = CopyFromReg t0, Register:i32 %2.

I don't really know what that means; is there a straightforward way to make that check more permissive?

Also, I see how MFI.CreateStackObject would create a stack object (and causes the subl and addl), but apparently the actual copy is optimized out?

Collaborator

I guess this could be detected as a sibcall, but you're not implementing that for now?

Contributor Author

Right, the x86 backend's notion of a sibcall is currently a little stronger than what I think the original term intended. In the x86 backend it means that a sibling can be called without moving arguments around on the stack at all.

That is a restriction that could now be lifted, but it created a lot of test changes earlier, so I'd prefer to take this on in smaller steps.
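As an illustration of that stronger notion (my own example, not taken from the patch): forwarding the incoming stack arguments unchanged qualifies, but permuting them does not, because their slots would have to be rewritten in place.

declare i32 @callee2(i32, i32, i32, i32, i32, i32, i32, i32)

; On x86-64 SysV, %g and %h live on the stack. Swapping them in the tail
; call would require rewriting the incoming argument area, so under the
; current x86 definition this call is not a sibcall.
define i32 @swap(i32 %a, i32 %b, i32 %c, i32 %d, i32 %e, i32 %f,
                 i32 %g, i32 %h) {
  %r = musttail call i32 @callee2(i32 %a, i32 %b, i32 %c, i32 %d,
                                  i32 %e, i32 %f, i32 %h, i32 %g)
  ret i32 %r
}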

Contributor Author

@folkertdev folkertdev left a comment

Bit of a procedural question: how should I handle rebasing and force pushes? For me, switching branches between different PRs kind of trashes the cache (ccache helps, but it's still a couple of minutes). So instead I rebase onto whatever latest upstream I've pulled in, but then every push is a force push, and reviewing those is difficult (and sometimes buggy).

Over in the Rust repository we have a bot that generates nice range diffs, e.g. since the last review:

rust-lang/rust#148302 (comment)

So, how does LLVM deal with this in a way that is convenient for both the author and the reviewer?

Comment on lines 24 to 30
Contributor Author

This is due to using canGuaranteeTCO in the definition of ShouldGuaranteeTCO:

/// Return true if the calling convention is one that we can guarantee TCO for.
static bool canGuaranteeTCO(CallingConv::ID CC) {
  return (CC == CallingConv::Fast || CC == CallingConv::GHC ||
          CC == CallingConv::X86_RegCall || CC == CallingConv::HiPE ||
          CC == CallingConv::Tail || CC == CallingConv::SwiftTail);
}
