-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
With a debugging session last Friday about Windows-specific CI failures we ended up concluding that CI failures were tied to Github Actions' recent upgrade of the windows-latest image from windows-2019 to windows-2022. The PR to temporarily pin to windows-2019 has landed and CI is fixed for now but we also did some further investigation to figure out what about windows-2022 was causing breakage.
With the investigation on #3853 the conclusion was that one of the failing tests was wast::Cranelift::misc::stack_overflow. Some further investigation showed that all stack overflow tests were failing. Digging further into this our first conclusion was #3857 which is separate from this issue, but we realized that reproducing this issue required running tests on separate threads, at least with --test-threads 2.
The previous CI failures we were seeing were out-of-memory errors on the tune of 5GiB allocations. It turns out that Backtrace::new_unresolved was the culprit as the stack trace was hitting an "infinite loop" where the stack wasn't actually infinite but the stack unwinder was infinitely looping over frames. We attempted to collect (#3858) all these frames which resulted in OOM.
That's all basically a long-winded way of saying that the system-generated stack trace on windows-2019 was correct while the stack trace on windows-2022 was an "infinite" stack trace (despite it not actually being infinite). For a simple WebAssembly module:
(module
(func call 0)
)
the cranelift machine IR emitted for this is:
VCode_ShowWithRRU {{
Entry block: 0
Block 0:
(original IR block: block0)
(successor: Block 1)
(instruction range: 0 .. 13)
Inst 0: pushq %rbp
Inst 1: unwind PushFrameRegs { offset_upward_to_caller_sp: 16 }
Inst 2: movq %rsp, %rbp
Inst 3: movq 0(%rdi), %r10
Inst 4: movq 0(%r10), %r10
Inst 5: cmpq %rsp, %r10
Inst 6: jbe ; ud2 stk_ovf ;
Inst 7: unwind DefineNewFrame { offset_upward_to_caller_sp: 16, offset_downward_to_clobbers: 0 }
Inst 8: movq %rdi, %rax
Inst 9: movq %rax, %rdi
Inst 10: movq %rax, %rsi
Inst 11: call User { namespace: 0, index: 0 }
Inst 12: jmp label1
Block 1:
(original IR block: block1)
(instruction range: 13 .. 16)
Inst 13: movq %rbp, %rsp
Inst 14: popq %rbp
Inst 15: ret
}}
When wasm hits the fault at "Inst 6" above due to stack overflow this is before the DefineNewFrame unwind pseudo-instruction, but that instruction should probably be just after "Inst 2" instead of at the end of the prologue that checks the out-of-stack condition. The current suspicion is that this probably-incorrect location of DefineNewFrame is why the windows-2022 stack unwinding hits an infinite loop.
It's worth noting that one of the debugging runs on #3853 printed out the ip/sp of each frame on the stack and it looked like:
0: ip=0x7ff62df77637 sp=0x7c55af8d20
1: ip=0x7ff62df77637 sp=0x7c55af8d20
2: ip=0x7ff62df77527 sp=0x7c55afabd0
3: ip=0x7ff62df54c33 sp=0x7c55afac40
4: ip=0x7ff62df55303 sp=0x7c55afae30
5: ip=0x7ff62df635a7 sp=0x7c55afaea0
6: ip=0x7ff62df55238 sp=0x7c55afaf10
7: ip=0x7fff1896b592 sp=0x7c55afaf60
8: ip=0x7fff18922022 sp=0x7c55afb000
9: ip=0x7fff18992e1e sp=0x7c55afb240
10: ip=0x1d76a8f1013 sp=0x7c55afc010
11: ip=0x7c55afc020 sp=0x7c55afc018
12: ip=0x1d76a8f1023 sp=0x7c55afc020
13: ip=0x7c55afc030 sp=0x7c55afc028
14: ip=0x1d76a8f1023 sp=0x7c55afc030
15: ip=0x7c55afc040 sp=0x7c55afc038
16: ip=0x1d76a8f1023 sp=0x7c55afc040
...
22446: ip=0x1d76a8f1023 sp=0x7c55b27d30
22447: ip=0x7c55b27d40 sp=0x7c55b27d38
22448: ip=0x1d76a8f1023 sp=0x7c55b27d40
22449: ip=0x7c55b27d50 sp=0x7c55b27d48
22450: ip=0x1d76a8f1023 sp=0x7c55b27d50
22451: ip=0x7c55b27d60 sp=0x7c55b27d58
22452: ip=0x1d76a8f1023 sp=0x7c55b27d60
22453: ip=0x7c55b27d70 sp=0x7c55b27d68
22454: ip=0x1d76a8f1023 sp=0x7c55b27d70
22455: ip=0x7c55b27d80 sp=0x7c55b27d78
22456: ip=0x1d76a8f1023 sp=0x7c55b27d80
22457: ip=0x7c55b27d90 sp=0x7c55b27d88
22458: ip=0x1d76a8f1023 sp=0x7c55b27d90
22459: ip=0x7c55b27da0 sp=0x7c55b27d98
where the ip is seemingly correct in that this was a mutally recursive set of functions as part of the test case and the sp is indeed increasing, but there seems to be no limit to sp increasing and apparently no reads/writes of memory are being done because presumably it would have otherwise segfaulted at this point!
In any case it appears that bad unwinding information is to blame here on Windows, so this issue is intended to track fixing that, and a fix for this issue should likely be accompanied with a revert of #3854 as well.