Unwind info for Windows may not be correct for entire function prologue #3859

@alexcrichton

During a debugging session last Friday on Windows-specific CI failures, we concluded that the failures were tied to GitHub Actions' recent upgrade of the windows-latest image from windows-2019 to windows-2022. The PR to temporarily pin to windows-2019 has landed and CI is fixed for now, but we also investigated further to figure out what about windows-2022 was causing the breakage.

The investigation on #3853 showed that one of the failing tests was wast::Cranelift::misc::stack_overflow, and further investigation showed that all stack overflow tests were failing. Digging further, our first finding was #3857, which is separate from this issue. We also realized that reproducing this issue requires running tests on separate threads, at least with --test-threads 2.

The previous CI failures we were seeing were out-of-memory errors to the tune of 5GiB allocations. It turns out that Backtrace::new_unresolved was the culprit: the stack itself wasn't infinite, but the stack unwinder was looping over frames indefinitely. We attempted to collect (#3858) all these frames, which resulted in OOM.
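To illustrate why collecting every frame runs out of memory, and how a hard cap avoids it, here is a minimal sketch. It is not what Wasmtime or the backtrace crate does, and it makes no claim about what #3858 changes. `Frame`, `WalkEnd`, and `collect_bounded` are hypothetical names, and the iterator is a stand-in for whatever the real unwinder exposes.

```rust
/// One unwound frame: instruction pointer and stack pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Frame {
    ip: u64,
    sp: u64,
}

/// Why the walk stopped.
#[derive(Debug, PartialEq, Eq)]
enum WalkEnd {
    /// The unwinder ran out of frames on its own.
    Finished,
    /// The hard cap on collected frames was reached.
    HitLimit,
}

/// Collect at most `max_frames` frames from `frames`.
/// `frames` is a hypothetical stand-in for the real unwinder's output.
fn collect_bounded<I>(frames: I, max_frames: usize) -> (Vec<Frame>, WalkEnd)
where
    I: IntoIterator<Item = Frame>,
{
    let mut out = Vec::new();
    for frame in frames {
        if out.len() == max_frames {
            // Another frame was available, but the cap is already full.
            return (out, WalkEnd::HitLimit);
        }
        out.push(frame);
    }
    (out, WalkEnd::Finished)
}

fn main() {
    // Simulate a runaway unwinder that never stops yielding frames.
    let runaway = (0u64..).map(|i| Frame { ip: 0x1000, sp: 0x7000 + 8 * i });
    let (frames, end) = collect_bounded(runaway, 1024);
    println!("collected {} frames, ended with {:?}", frames.len(), end);
}
```

A cap like this only bounds memory use; it does not make the reported frames correct.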

That's all basically a long-winded way of saying that the system-generated stack trace on windows-2019 was correct while the stack trace on windows-2022 was an "infinite" stack trace (despite it not actually being infinite). For a simple WebAssembly module:

(module
  (func call 0)
)

the Cranelift machine IR emitted for this is:

VCode_ShowWithRRU {{
  Entry block: 0
Block 0:
  (original IR block: block0)
  (successor: Block 1)
  (instruction range: 0 .. 13)
  Inst 0:   pushq   %rbp
  Inst 1:   unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
  Inst 2:   movq    %rsp, %rbp
  Inst 3:   movq    0(%rdi), %r10
  Inst 4:   movq    0(%r10), %r10
  Inst 5:   cmpq    %rsp, %r10
  Inst 6:   jbe ; ud2 stk_ovf ;
  Inst 7:   unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
  Inst 8:   movq    %rdi, %rax
  Inst 9:   movq    %rax, %rdi
  Inst 10:   movq    %rax, %rsi
  Inst 11:   call    User { namespace: 0, index: 0 }
  Inst 12:   jmp     label1
Block 1:
  (original IR block: block1)
  (instruction range: 13 .. 16)
  Inst 13:   movq    %rbp, %rsp
  Inst 14:   popq    %rbp
  Inst 15:   ret
}}

When wasm hits the fault at "Inst 6" above due to stack overflow, execution has not yet reached the DefineNewFrame unwind pseudo-instruction. That pseudo-instruction should probably come just after "Inst 2" instead of at the end of the prologue, after the out-of-stack check. The current suspicion is that this probably-incorrect placement of DefineNewFrame is why windows-2022 stack unwinding hits an infinite loop.
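The ordering concern can be expressed as a small check over a toy model of the prologue. This is only a sketch under assumptions: the `Inst` enum and `traps_before_frame_defined` are hypothetical and are not Cranelift's real data structures or API. For simplicity, only the stack-limit check ("Inst 6") is modeled as able to trap.

```rust
/// Simplified model of the prologue entries shown in the VCode dump.
/// These variants are illustrative, not Cranelift's real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Inst {
    /// `unwind PushFrameRegs { .. }`
    PushFrameRegs,
    /// `unwind DefineNewFrame { .. }`
    DefineNewFrame,
    /// An ordinary instruction, assumed here not to trap.
    Plain,
    /// An instruction that may trap, such as the stack-limit check.
    MayTrap,
}

/// Return the indices of instructions that may trap after `PushFrameRegs`
/// but before `DefineNewFrame`, i.e. while the frame is only partly described.
fn traps_before_frame_defined(insts: &[Inst]) -> Vec<usize> {
    let mut pushed = false;
    let mut flagged = Vec::new();
    for (i, inst) in insts.iter().enumerate() {
        match inst {
            Inst::PushFrameRegs => pushed = true,
            Inst::DefineNewFrame => break,
            Inst::MayTrap if pushed => flagged.push(i),
            _ => {}
        }
    }
    flagged
}

fn main() {
    use Inst::*;
    // Layout as in the dump: DefineNewFrame comes after the trapping check.
    let current = [Plain, PushFrameRegs, Plain, Plain, Plain, Plain, MayTrap, DefineNewFrame];
    // Suggested layout: DefineNewFrame directly after `movq %rsp, %rbp`.
    let suggested = [Plain, PushFrameRegs, Plain, DefineNewFrame, Plain, Plain, Plain, MayTrap];
    println!("current layout flags: {:?}", traps_before_frame_defined(&current));
    println!("suggested layout flags: {:?}", traps_before_frame_defined(&suggested));
}
```

In this model the current layout flags the stack-limit check and the suggested layout flags nothing. Whether moving the pseudo-instruction actually fixes the windows-2022 unwinder is still the open question of this issue.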

It's worth noting that one of the debugging runs on #3853 printed out the ip/sp of each frame on the stack and it looked like:

0: ip=0x7ff62df77637 sp=0x7c55af8d20
1: ip=0x7ff62df77637 sp=0x7c55af8d20
2: ip=0x7ff62df77527 sp=0x7c55afabd0
3: ip=0x7ff62df54c33 sp=0x7c55afac40
4: ip=0x7ff62df55303 sp=0x7c55afae30
5: ip=0x7ff62df635a7 sp=0x7c55afaea0
6: ip=0x7ff62df55238 sp=0x7c55afaf10
7: ip=0x7fff1896b592 sp=0x7c55afaf60
8: ip=0x7fff18922022 sp=0x7c55afb000
9: ip=0x7fff18992e1e sp=0x7c55afb240
10: ip=0x1d76a8f1013 sp=0x7c55afc010
11: ip=0x7c55afc020 sp=0x7c55afc018
12: ip=0x1d76a8f1023 sp=0x7c55afc020
13: ip=0x7c55afc030 sp=0x7c55afc028
14: ip=0x1d76a8f1023 sp=0x7c55afc030
15: ip=0x7c55afc040 sp=0x7c55afc038
16: ip=0x1d76a8f1023 sp=0x7c55afc040
...
22446: ip=0x1d76a8f1023 sp=0x7c55b27d30
22447: ip=0x7c55b27d40 sp=0x7c55b27d38
22448: ip=0x1d76a8f1023 sp=0x7c55b27d40
22449: ip=0x7c55b27d50 sp=0x7c55b27d48
22450: ip=0x1d76a8f1023 sp=0x7c55b27d50
22451: ip=0x7c55b27d60 sp=0x7c55b27d58
22452: ip=0x1d76a8f1023 sp=0x7c55b27d60
22453: ip=0x7c55b27d70 sp=0x7c55b27d68
22454: ip=0x1d76a8f1023 sp=0x7c55b27d70
22455: ip=0x7c55b27d80 sp=0x7c55b27d78
22456: ip=0x1d76a8f1023 sp=0x7c55b27d80
22457: ip=0x7c55b27d90 sp=0x7c55b27d88
22458: ip=0x1d76a8f1023 sp=0x7c55b27d90
22459: ip=0x7c55b27da0 sp=0x7c55b27d98

where the ip is seemingly correct, in that the test case is a mutually recursive set of functions, and the sp is indeed increasing. However, there seems to be no limit to how far sp increases, and apparently no reads/writes of memory are being done, because otherwise it would presumably have segfaulted by this point!
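For anyone poking at dumps like the one above, a small helper can summarize the pattern. This is a sketch with hypothetical helper names (`parse_frame`, `sp_deltas`, `ips_in_stack_range`), not part of Wasmtime. It computes the sp step between consecutive frames and lists frames whose ip falls inside the range of observed sp values. One possible reading, offered tentatively, is that such an ip is a stack address that was read as a return address.

```rust
/// Parse a line such as `10: ip=0x1d76a8f1013 sp=0x7c55afc010` into (ip, sp).
fn parse_frame(line: &str) -> Option<(u64, u64)> {
    let (_, rest) = line.split_once(": ")?;
    let (ip, sp) = rest.trim().split_once(' ')?;
    let ip = u64::from_str_radix(ip.strip_prefix("ip=0x")?, 16).ok()?;
    let sp = u64::from_str_radix(sp.strip_prefix("sp=0x")?, 16).ok()?;
    Some((ip, sp))
}

/// Differences between consecutive stack pointers.
fn sp_deltas(frames: &[(u64, u64)]) -> Vec<u64> {
    frames.windows(2).map(|w| w[1].1.wrapping_sub(w[0].1)).collect()
}

/// Indices (into `frames`) whose ip lies within the range of observed sp values.
fn ips_in_stack_range(frames: &[(u64, u64)]) -> Vec<usize> {
    let lo = frames.iter().map(|f| f.1).min().unwrap_or(0);
    let hi = frames.iter().map(|f| f.1).max().unwrap_or(0);
    frames
        .iter()
        .enumerate()
        .filter(|(_, f)| f.0 >= lo && f.0 <= hi)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Frames 10 through 16, copied from the dump above.
    let lines = [
        "10: ip=0x1d76a8f1013 sp=0x7c55afc010",
        "11: ip=0x7c55afc020 sp=0x7c55afc018",
        "12: ip=0x1d76a8f1023 sp=0x7c55afc020",
        "13: ip=0x7c55afc030 sp=0x7c55afc028",
        "14: ip=0x1d76a8f1023 sp=0x7c55afc030",
        "15: ip=0x7c55afc040 sp=0x7c55afc038",
        "16: ip=0x1d76a8f1023 sp=0x7c55afc040",
    ];
    let frames: Vec<(u64, u64)> = lines.iter().filter_map(|l| parse_frame(l)).collect();
    println!("sp deltas: {:?}", sp_deltas(&frames));
    println!("frames with ip in stack range: {:?}", ips_in_stack_range(&frames));
}
```

The indices returned are positions within the slice passed in, so index 1 here corresponds to frame 11 of the dump.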

In any case it appears that bad unwinding information is to blame here on Windows, so this issue is intended to track fixing that, and a fix for this issue should likely be accompanied with a revert of #3854 as well.

Labels: bug (Incorrect behavior in the current implementation that needs fixing), cranelift (Issues related to the Cranelift code generator)