Recover properly from stack overflow on Windows#265
Recover properly from stack overflow on Windows#265mrowqa wants to merge 1 commit intobytecodealliance:masterfrom
Conversation
|
@mrowqa great! I had a suspicion that the remaining two test failures are unrelated, and we should look into them independently. It looks like tests are failing for release builds, unfortunately. Can you look into that? |
|
This is fascinating! So the premise here is that setjmp/longjmp have actually been working this whole time, but we exhausted the stack twice and the fact that we didn't call this function meant that the second failure was an unhandled exception where the first one actually was correctly handled? |
|
@tschneidereit I noticed the tests fail yesterday afternoon. I'm going to look into this. I and @sunfishcode looked quickly at the remaining tests and it looks like problem with signaling nans. What's interesting, these tests were passing a couple of weeks ago. I'll investigate them, too. @alexcrichton yeah. Windows puts guard page on the page below stack, and when stack overflow happens, the guard is removed. If you don't restore it, next time you'll get access violation which is not handled by our handler for some reason. |
|
FWIW I tried to debug this a bit locally and some things I seem to see are:
Not sure why it's working in debug mode... |
|
Quick update: after rebasing on master my code stopped working. I stepped into setjmp/longjmp with debugger. It turns out that calls from C++ and Rust does different things: Rust simply restores context from jmp_buf, and C++ actually calls PS since we're not checking the PC in our exception handler, it's pretty messy. In our code it was called during unwinding, and caused next unwinding. |
|
Interesting! I think calling For checking the PC I wonder if that's obscuring some original exceptions when I've been looking at this. We could probably get by with a simple thread-local boolean for now indicating "are you in wasm" and clear that whenever an exception happens so a second exception doesn't try to get re-handled. I'm not sure what you mean by OS native unwinding though, can you elaborate? Do you mean like a wasm module calls an imported function which calls |
|
Oh, I like the idea with boolean for checking the PC. It's a simple workaround. I have talked with @sunfishcode earlier about the PC check and I might implement it properly later. By "OS native unwinding" I meant using OS specific unwinding mechanism (i.e. calling |
|
Ah ok, makes sense! I think it's definitely true that longjmp/setjmp don't run destructors, but I think that's ok as well because that unwinding only ever happens between wasm frames, right? Now handling |
Well, if I remember correctly, the wasm code can call back to Rust code (@sunfishcode ?). We would probably want to unwind just the top wasm call. But we could unwind from Rust code in case of some error - in this case we'll need to have a proper function entry for native unwinding mechanism anyway. The other thing is, if we can call back to Rust code, the boolean flag for PC won't work.
Do you mean the imported functions in wasm modules?
You mean just for the panics through Rust->JIT code->Rust frames? And not for general unwinding implementation? Edit:
Testing quickly #258 - looks like the |
|
Oh right yeah sorry I'm talking about when wasm calls and imported function (defined in Rust) and then that Rust function panics. (note that I haven't tested this on Unix yet either, and I suspect it dies there as well since we can't backtrace through the JIT code yet) |
7872fbf to
98a647e
Compare
|
It's ugly, but it works for me locally on native Windows and on WSL. I'm going to open a new issue about proper implementation of unwinding with some references that might be helpful in the future. There are still two tests failing on Windows (related to nans / floats). I'll look into them later. |
|
Tests update:
|
alexcrichton
left a comment
There was a problem hiding this comment.
For PATH you'll need to do something like this depending on where the LLVM is installed to (I'm not entirely sure where though)
Unfortunately debugging the failure here is pretty hard. It looks like if you use lldb to debug the executable it just stops on segfaults (the ones we handle) and I can't figure out how to continue past it (aka let the runtime handle the segfault and proceed). I was able to get a core dump but they're pretty inscrutable and I'm not entirely sure why. It looks like some thread somewhere is just dying in malloc on my system, not hitting an illegal instruction on CI. In that sense I'm not entirely sure what's going on with OSX on CI.
| pub extern "C" fn LeaveScope(ptr: *const u8) { | ||
| JMP_BUF.with(|buf| buf.set(ptr)) | ||
| /// A simple guard to ensure that `JMP_BUFS` is reset when we're done. | ||
| struct ScopeGuard { |
There was a problem hiding this comment.
I don't think this is necessary to bring back, I think we can stay in the same semantics this currenty has where there's just one pointer, and we store that pointer into a thread local (swap it out) and restore it when we return from the trampoline
There was a problem hiding this comment.
@sunfishcode ? I haven't fully understood your argument why we want to keep ScopeGuard.
There was a problem hiding this comment.
I think we still want ScopeGuard so that we can be correct in case there's a Rust panic which triggers an unwind through wasm frames and back into Rust. That would bypass our setjmp, and we want to properly clean up our data structures in that case.
There was a problem hiding this comment.
It's technically undefined behavior for Rust panics to unwind through JIT code in the sense that we're not really guaranteeing how a Rust panic is implemented. It's also the case that at least on Windows 64-bit MSVC Rust panics are done with SEH which requires careful coordination with the JIT compiler to figure out how to unwind the stack, so it doesn't work there anyway. (I suspect unwinding doesn't actually work on any platform due to the lack of ability to unwind through the JIT code).
I think the best long-term solution for panicking through jit code is to always catch_unwind at the boundaries, trigger a wasm trap unwind, and then resume_unwind when we get back to Rust.
98a647e to
f3d8e76
Compare
Thanks. Looks like LLVM uses NSIS. I have explicitly specified the installation directory which was the default one in the installer. It seems that Chocolatey simply assumes it's the default, too (look at tools\chocolateyinstall.ps1).
I see. Maybe it's related to the unwinding inside the signal handler. Nice trick is an inline assembly after |
f3d8e76 to
974ad8a
Compare
1) Recover properly from stack overflow. 2) Use non-unwinding setjmp/longjmp (clang builtins). Regular setjmp/longjmp unwinds on Windows, but we can't provide unwind info for the OS at the moment.
974ad8a to
89b9e52
Compare
|
We decided with @sunfishcode to focus on my main project, so we're postponing this change. As mentioned before, it works on Windows and WSL. We need to somehow add clang to PATH on Windows CI (I tried a couple of ways and failed) and debug Mac builds. |
|
@peterhuene looks like you've covered everything (and it's not hacky as this PR). I'm really excited by your changes! |
Co-authored-by: Alexander Theißen <alex.theissen@me.com>
fcvt_from_uint and fcvt_from_sint Updates bytecodealliance#224 Co-authored-by: Michael McLoughlin <mcloughlin@cmu.edu>
It fixes most of the tests failures on Windows.
It looks like we don't need the
RtlInstallFunctionTableCallback(#258) yet since we're usingsetjmp/longjmpand the vectored exception handlers with first flag set. It will most likely change after implementing OS native stack unwinding method.cc @alexcrichton @sunfishcode