Skip to content

Recover properly from stack overflow on Windows#265

Closed
mrowqa wants to merge 1 commit intobytecodealliance:masterfrom
mrowqa:windows-stackoverflow
Closed

Recover properly from stack overflow on Windows#265
mrowqa wants to merge 1 commit intobytecodealliance:masterfrom
mrowqa:windows-stackoverflow

Conversation

@mrowqa
Copy link
Contributor

@mrowqa mrowqa commented Aug 9, 2019

It fixes most of the tests failures on Windows.

It looks like we don't need the RtlInstallFunctionTableCallback (#258) yet since we're using setjmp/longjmp and the vectored exception handlers with first flag set. It will most likely change after implementing OS native stack unwinding method.

cc @alexcrichton @sunfishcode

@tschneidereit
Copy link
Member

@mrowqa great! I had a suspicion that the remaining two test failures are unrelated, and we should look into them independently.

It looks like tests are failing for release builds, unfortunately. Can you look into that?

@alexcrichton
Copy link
Member

This is fascinating! So the premise here is that setjmp/longjmp have actually been working this whole time, but we exhausted the stack twice and the fact that we didn't call this function meant that the second failure was an unhandled exception where the first one actually was correctly handled?

@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 9, 2019

@tschneidereit I noticed the tests fail yesterday afternoon. I'm going to look into this. I and @sunfishcode looked quickly at the remaining tests and it looks like problem with signaling nans. What's interesting, these tests were passing a couple of weeks ago. I'll investigate them, too.

@alexcrichton yeah. Windows puts guard page on the page below stack, and when stack overflow happens, the guard is removed. If you don't restore it, next time you'll get access violation which is not handled by our handler for some reason.

@alexcrichton
Copy link
Member

FWIW I tried to debug this a bit locally and some things I seem to see are:

  • For whatever reason the divide-by-zero exception can be caught once, but not twice. On the second call it faults
  • For stack overflow it looks like something is causing a "stack cookie check" to fail while calling longjmp
  • The next failing test (an index out of bounds for loads) has the same stack cookie check failing

Not sure why it's working in debug mode...

@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 9, 2019

Quick update: after rebasing on master my code stopped working. I stepped into setjmp/longjmp with debugger. It turns out that calls from C++ and Rust does different things: Rust simply restores context from jmp_buf, and C++ actually calls RtlUnwindEx. And they are linked against different functions even if you use them in same program from both Rust and C++. So #253 was a good move towards better unwinding.
By the way, do we want _resetstkoflw to be called from C++ or calling from Rust is fine? I'm not sure whether it'll needed after properly handling the OS native unwinding. Seems like we'll need #258 quicker than I thought.

PS since we're not checking the PC in our exception handler, it's pretty messy. In our code it was called during unwinding, and caused next unwinding.

@alexcrichton
Copy link
Member

Interesting! I think calling _resetstkoflw, if necessary, is fine to do from C++ (no need to bring Rust into the picture if we're unwinding via C++ anyway).

For checking the PC I wonder if that's obscuring some original exceptions when I've been looking at this. We could probably get by with a simple thread-local boolean for now indicating "are you in wasm" and clear that whenever an exception happens so a second exception doesn't try to get re-handled.

I'm not sure what you mean by OS native unwinding though, can you elaborate? Do you mean like a wasm module calls an imported function which calls panic!() which triggers a "native unwind"?

@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 9, 2019

Oh, I like the idea with boolean for checking the PC. It's a simple workaround. I have talked with @sunfishcode earlier about the PC check and I might implement it properly later.

By "OS native unwinding" I meant using OS specific unwinding mechanism (i.e. calling RtlUnwindEx). @sunfishcode mentioned that using "regular" setjmp/longjmp might not unwind properly Rust stack frames between wasm frames (i.e. calling Rust destructors).

@alexcrichton
Copy link
Member

Ah ok, makes sense!

I think it's definitely true that longjmp/setjmp don't run destructors, but I think that's ok as well because that unwinding only ever happens between wasm frames, right?

Now handling RtlUnwindEx and friends definitely makes sense as well. That'll be necessary for imported functions to be able to panic!() without aborting the process. That may be handle-able though with catch_unwind perhaps for the Rust side of things at least combined with resume_unwind

@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 9, 2019

that unwinding only ever happens between wasm frames, right?

Well, if I remember correctly, the wasm code can call back to Rust code (@sunfishcode ?). We would probably want to unwind just the top wasm call. But we could unwind from Rust code in case of some error - in this case we'll need to have a proper function entry for native unwinding mechanism anyway. The other thing is, if we can call back to Rust code, the boolean flag for PC won't work.

That'll be necessary for imported functions to be able to panic!() without aborting the process.

Do you mean the imported functions in wasm modules?

That may be handle-able though with catch_unwind perhaps for the Rust side of things at least combined with resume_unwind

You mean just for the panics through Rust->JIT code->Rust frames? And not for general unwinding implementation?

Edit:

I think it's definitely true that longjmp/setjmp don't run destructors
Looks like it might not be true on Windows if it runs RtlUnwindEx (I'm not sure what it does exactly though).

Testing quickly #258 - looks like the runtime_function_callback is being called, but not exception_handler. Anyway, I'll look into that later.

@alexcrichton
Copy link
Member

Oh right yeah sorry I'm talking about when wasm calls and imported function (defined in Rust) and then that Rust function panics.

(note that I haven't tested this on Unix yet either, and I suspect it dies there as well since we can't backtrace through the JIT code yet)

@mrowqa mrowqa force-pushed the windows-stackoverflow branch from 7872fbf to 98a647e Compare August 20, 2019 22:38
@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 20, 2019

It's ugly, but it works for me locally on native Windows and on WSL. I'm going to open a new issue about proper implementation of unwinding with some references that might be helpful in the future. There are still two tests failing on Windows (related to nans / floats). I'll look into them later.

@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 20, 2019

Tests update:

  • build fails on Windows, because the installer most likely doesn't add clang to the PATH environment variable, and wasmtime-runtime/build.rs assumes it's in the PATH. I can't find flags for LLVM installer.
  • I don't have Mac. The error might be related to the usage of __builtin_setjmp & __builtin_longjmp (compiler maybe emits illegal instructions for them?).

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PATH you'll need to do something like this depending on where the LLVM is installed to (I'm not entirely sure where though)

Unfortunately debugging the failure here is pretty hard. It looks like if you use lldb to debug the executable it just stops on segfaults (the ones we handle) and I can't figure out how to continue past it (aka let the runtime handle the segfault and proceed). I was able to get a core dump but they're pretty inscrutable and I'm not entirely sure why. It looks like some thread somewhere is just dying in malloc on my system, not hitting an illegal instruction on CI. In that sense I'm not entirely sure what's going on with OSX on CI.

pub extern "C" fn LeaveScope(ptr: *const u8) {
JMP_BUF.with(|buf| buf.set(ptr))
/// A simple guard to ensure that `JMP_BUFS` is reset when we're done.
struct ScopeGuard {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary to bring back, I think we can stay in the same semantics this currenty has where there's just one pointer, and we store that pointer into a thread local (swap it out) and restore it when we return from the trampoline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunfishcode ? I haven't fully understood your argument why we want to keep ScopeGuard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still want ScopeGuard so that we can be correct in case there's a Rust panic which triggers an unwind through wasm frames and back into Rust. That would bypass our setjmp, and we want to properly clean up our data structures in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's technically undefined behavior for Rust panics to unwind through JIT code in the sense that we're not really guaranteeing how a Rust panic is implemented. It's also the case that at least on Windows 64-bit MSVC Rust panics are done with SEH which requires careful coordination with the JIT compiler to figure out how to unwind the stack, so it doesn't work there anyway. (I suspect unwinding doesn't actually work on any platform due to the lack of ability to unwind through the JIT code).

I think the best long-term solution for panicking through jit code is to always catch_unwind at the boundaries, trigger a wasm trap unwind, and then resume_unwind when we get back to Rust.

@mrowqa mrowqa force-pushed the windows-stackoverflow branch from 98a647e to f3d8e76 Compare August 26, 2019 19:17
@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 26, 2019

For PATH you'll need to do something like this depending on where the LLVM is installed to (I'm not entirely sure where though)

Thanks.

Looks like LLVM uses NSIS. I have explicitly specified the installation directory which was the default one in the installer. It seems that Chocolatey simply assumes it's the default, too (look at tools\chocolateyinstall.ps1).

Unfortunately debugging the failure here is pretty hard.

I see. Maybe it's related to the unwinding inside the signal handler. Nice trick is an inline assembly after setjmp with a software breakpoint, the 0xCC byte. I'll talk with @sunfishcode what we're planning to do with MacOS.

@mrowqa mrowqa force-pushed the windows-stackoverflow branch from f3d8e76 to 974ad8a Compare August 26, 2019 19:57
1) Recover properly from stack overflow.

2) Use non-unwinding setjmp/longjmp (clang builtins).
   Regular setjmp/longjmp unwinds on Windows, but we can't provide
   unwind info for the OS at the moment.
@mrowqa mrowqa force-pushed the windows-stackoverflow branch from 974ad8a to 89b9e52 Compare August 26, 2019 20:13
@mrowqa
Copy link
Contributor Author

mrowqa commented Aug 26, 2019

We decided with @sunfishcode to focus on my main project, so we're postponing this change.

As mentioned before, it works on Windows and WSL. We need to somehow add clang to PATH on Windows CI (I tried a couple of ways and failed) and debug Mac builds.

@peterhuene
Copy link
Member

peterhuene commented Nov 7, 2019

@mrowqa thanks for this PR! As I incorporated your reset guard page fix into #462 (along with implementing unwind information so that we can continue to use an unwinding longjmp), I'm going to close this PR in favor of that one.

Please reactivate if you think I've overlooked something. Thanks!

@peterhuene peterhuene closed this Nov 7, 2019
@mrowqa
Copy link
Contributor Author

mrowqa commented Nov 7, 2019

@peterhuene looks like you've covered everything (and it's not hacky as this PR). I'm really excited by your changes!

@mrowqa mrowqa deleted the windows-stackoverflow branch November 7, 2019 08:47
grishasobol pushed a commit to grishasobol/wasmtime that referenced this pull request Nov 29, 2021
Co-authored-by: Alexander Theißen <alex.theissen@me.com>
frank-emrich pushed a commit to frank-emrich/wasmtime that referenced this pull request Dec 3, 2024
avanhatt added a commit to wellesley-prog-sys/wasmtime that referenced this pull request Apr 9, 2025
fcvt_from_uint and fcvt_from_sint

Updates bytecodealliance#224

Co-authored-by: Michael McLoughlin <mcloughlin@cmu.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants