Recover properly from stack overflow on Windows by mrowqa · Pull Request #265 · bytecodealliance/wasmtime

mrowqa · 2019-08-09T00:40:11Z

It fixes most of the tests failures on Windows.

It looks like we don't need the RtlInstallFunctionTableCallback (#258) yet since we're using setjmp/longjmp and the vectored exception handlers with first flag set. It will most likely change after implementing OS native stack unwinding method.

cc @alexcrichton @sunfishcode

tschneidereit · 2019-08-09T08:08:24Z

@mrowqa great! I had a suspicion that the remaining two test failures are unrelated, and we should look into them independently.

It looks like tests are failing for release builds, unfortunately. Can you look into that?

wasmtime-runtime/src/traphandlers.rs

alexcrichton · 2019-08-09T16:22:53Z

This is fascinating! So the premise here is that setjmp/longjmp have actually been working this whole time, but we exhausted the stack twice and the fact that we didn't call this function meant that the second failure was an unhandled exception where the first one actually was correctly handled?

mrowqa · 2019-08-09T16:38:20Z

@tschneidereit I noticed the tests fail yesterday afternoon. I'm going to look into this. I and @sunfishcode looked quickly at the remaining tests and it looks like problem with signaling nans. What's interesting, these tests were passing a couple of weeks ago. I'll investigate them, too.

@alexcrichton yeah. Windows puts guard page on the page below stack, and when stack overflow happens, the guard is removed. If you don't restore it, next time you'll get access violation which is not handled by our handler for some reason.

alexcrichton · 2019-08-09T21:51:32Z

FWIW I tried to debug this a bit locally and some things I seem to see are:

For whatever reason the divide-by-zero exception can be caught once, but not twice. On the second call it faults
For stack overflow it looks like something is causing a "stack cookie check" to fail while calling longjmp
The next failing test (an index out of bounds for loads) has the same stack cookie check failing

Not sure why it's working in debug mode...

mrowqa · 2019-08-09T21:54:20Z

Quick update: after rebasing on master my code stopped working. I stepped into setjmp/longjmp with debugger. It turns out that calls from C++ and Rust does different things: Rust simply restores context from jmp_buf, and C++ actually calls RtlUnwindEx. And they are linked against different functions even if you use them in same program from both Rust and C++. So #253 was a good move towards better unwinding.
By the way, do we want _resetstkoflw to be called from C++ or calling from Rust is fine? I'm not sure whether it'll needed after properly handling the OS native unwinding. Seems like we'll need #258 quicker than I thought.

PS since we're not checking the PC in our exception handler, it's pretty messy. In our code it was called during unwinding, and caused next unwinding.

alexcrichton · 2019-08-09T21:59:08Z

Interesting! I think calling _resetstkoflw, if necessary, is fine to do from C++ (no need to bring Rust into the picture if we're unwinding via C++ anyway).

For checking the PC I wonder if that's obscuring some original exceptions when I've been looking at this. We could probably get by with a simple thread-local boolean for now indicating "are you in wasm" and clear that whenever an exception happens so a second exception doesn't try to get re-handled.

I'm not sure what you mean by OS native unwinding though, can you elaborate? Do you mean like a wasm module calls an imported function which calls panic!() which triggers a "native unwind"?

mrowqa · 2019-08-09T22:07:48Z

Oh, I like the idea with boolean for checking the PC. It's a simple workaround. I have talked with @sunfishcode earlier about the PC check and I might implement it properly later.

By "OS native unwinding" I meant using OS specific unwinding mechanism (i.e. calling RtlUnwindEx). @sunfishcode mentioned that using "regular" setjmp/longjmp might not unwind properly Rust stack frames between wasm frames (i.e. calling Rust destructors).

alexcrichton · 2019-08-09T22:14:23Z

Ah ok, makes sense!

I think it's definitely true that longjmp/setjmp don't run destructors, but I think that's ok as well because that unwinding only ever happens between wasm frames, right?

Now handling RtlUnwindEx and friends definitely makes sense as well. That'll be necessary for imported functions to be able to panic!() without aborting the process. That may be handle-able though with catch_unwind perhaps for the Rust side of things at least combined with resume_unwind

mrowqa · 2019-08-09T22:42:09Z

that unwinding only ever happens between wasm frames, right?

Well, if I remember correctly, the wasm code can call back to Rust code (@sunfishcode ?). We would probably want to unwind just the top wasm call. But we could unwind from Rust code in case of some error - in this case we'll need to have a proper function entry for native unwinding mechanism anyway. The other thing is, if we can call back to Rust code, the boolean flag for PC won't work.

That'll be necessary for imported functions to be able to panic!() without aborting the process.

Do you mean the imported functions in wasm modules?

That may be handle-able though with catch_unwind perhaps for the Rust side of things at least combined with resume_unwind

You mean just for the panics through Rust->JIT code->Rust frames? And not for general unwinding implementation?

Edit:

I think it's definitely true that longjmp/setjmp don't run destructors
Looks like it might not be true on Windows if it runs RtlUnwindEx (I'm not sure what it does exactly though).

Testing quickly #258 - looks like the runtime_function_callback is being called, but not exception_handler. Anyway, I'll look into that later.

alexcrichton · 2019-08-12T15:06:27Z

Oh right yeah sorry I'm talking about when wasm calls and imported function (defined in Rust) and then that Rust function panics.

(note that I haven't tested this on Unix yet either, and I suspect it dies there as well since we can't backtrace through the JIT code yet)

mrowqa · 2019-08-20T22:43:29Z

It's ugly, but it works for me locally on native Windows and on WSL. I'm going to open a new issue about proper implementation of unwinding with some references that might be helpful in the future. There are still two tests failing on Windows (related to nans / floats). I'll look into them later.

mrowqa · 2019-08-20T23:28:17Z

Tests update:

build fails on Windows, because the installer most likely doesn't add clang to the PATH environment variable, and wasmtime-runtime/build.rs assumes it's in the PATH. I can't find flags for LLVM installer.
I don't have Mac. The error might be related to the usage of __builtin_setjmp & __builtin_longjmp (compiler maybe emits illegal instructions for them?).

alexcrichton

For PATH you'll need to do something like this depending on where the LLVM is installed to (I'm not entirely sure where though)

Unfortunately debugging the failure here is pretty hard. It looks like if you use lldb to debug the executable it just stops on segfaults (the ones we handle) and I can't figure out how to continue past it (aka let the runtime handle the segfault and proceed). I was able to get a core dump but they're pretty inscrutable and I'm not entirely sure why. It looks like some thread somewhere is just dying in malloc on my system, not hitting an illegal instruction on CI. In that sense I'm not entirely sure what's going on with OSX on CI.

wasmtime-runtime/build.rs

alexcrichton · 2019-08-22T00:01:33Z

wasmtime-runtime/src/traphandlers.rs

-pub extern "C" fn LeaveScope(ptr: *const u8) {
-    JMP_BUF.with(|buf| buf.set(ptr))
+/// A simple guard to ensure that `JMP_BUFS` is reset when we're done.
+struct ScopeGuard {


I don't think this is necessary to bring back, I think we can stay in the same semantics this currenty has where there's just one pointer, and we store that pointer into a thread local (swap it out) and restore it when we return from the trampoline

@sunfishcode ? I haven't fully understood your argument why we want to keep ScopeGuard.

I think we still want ScopeGuard so that we can be correct in case there's a Rust panic which triggers an unwind through wasm frames and back into Rust. That would bypass our setjmp, and we want to properly clean up our data structures in that case.

It's technically undefined behavior for Rust panics to unwind through JIT code in the sense that we're not really guaranteeing how a Rust panic is implemented. It's also the case that at least on Windows 64-bit MSVC Rust panics are done with SEH which requires careful coordination with the JIT compiler to figure out how to unwind the stack, so it doesn't work there anyway. (I suspect unwinding doesn't actually work on any platform due to the lack of ability to unwind through the JIT code).

I think the best long-term solution for panicking through jit code is to always catch_unwind at the boundaries, trigger a wasm trap unwind, and then resume_unwind when we get back to Rust.

wasmtime-runtime/src/traphandlers.rs

mrowqa · 2019-08-26T19:28:41Z

For PATH you'll need to do something like this depending on where the LLVM is installed to (I'm not entirely sure where though)

Thanks.

Looks like LLVM uses NSIS. I have explicitly specified the installation directory which was the default one in the installer. It seems that Chocolatey simply assumes it's the default, too (look at tools\chocolateyinstall.ps1).

Unfortunately debugging the failure here is pretty hard.

I see. Maybe it's related to the unwinding inside the signal handler. Nice trick is an inline assembly after setjmp with a software breakpoint, the 0xCC byte. I'll talk with @sunfishcode what we're planning to do with MacOS.

1) Recover properly from stack overflow. 2) Use non-unwinding setjmp/longjmp (clang builtins). Regular setjmp/longjmp unwinds on Windows, but we can't provide unwind info for the OS at the moment.

mrowqa · 2019-08-26T23:03:29Z

We decided with @sunfishcode to focus on my main project, so we're postponing this change.

As mentioned before, it works on Windows and WSL. We need to somehow add clang to PATH on Windows CI (I tried a couple of ways and failed) and debug Mac builds.

peterhuene · 2019-11-07T00:18:34Z

@mrowqa thanks for this PR! As I incorporated your reset guard page fix into #462 (along with implementing unwind information so that we can continue to use an unwinding longjmp), I'm going to close this PR in favor of that one.

Please reactivate if you think I've overlooked something. Thanks!

mrowqa · 2019-11-07T08:47:24Z

@peterhuene looks like you've covered everything (and it's not hacky as this PR). I'm really excited by your changes!

Co-authored-by: Alexander Theißen <alex.theissen@me.com>

fcvt_from_uint and fcvt_from_sint Updates bytecodealliance#224 Co-authored-by: Michael McLoughlin <mcloughlin@cmu.edu>

alexcrichton reviewed Aug 9, 2019

View reviewed changes

wasmtime-runtime/src/traphandlers.rs Outdated Show resolved Hide resolved

mrowqa force-pushed the windows-stackoverflow branch from 7872fbf to 98a647e Compare August 20, 2019 22:38

mrowqa mentioned this pull request Aug 21, 2019

Stack unwinding on Windows (SEH) #291

Closed

alexcrichton reviewed Aug 26, 2019

View reviewed changes

mrowqa force-pushed the windows-stackoverflow branch from 98a647e to f3d8e76 Compare August 26, 2019 19:17

mrowqa force-pushed the windows-stackoverflow branch from f3d8e76 to 974ad8a Compare August 26, 2019 19:57

Fix tests on Windows.

89b9e52

1) Recover properly from stack overflow. 2) Use non-unwinding setjmp/longjmp (clang builtins). Regular setjmp/longjmp unwinds on Windows, but we can't provide unwind info for the OS at the moment.

mrowqa force-pushed the windows-stackoverflow branch from 974ad8a to 89b9e52 Compare August 26, 2019 20:13

peterhuene closed this Nov 7, 2019

mrowqa deleted the windows-stackoverflow branch November 7, 2019 08:47

grishasobol pushed a commit to grishasobol/wasmtime that referenced this pull request Nov 29, 2021

Fix Cargo.toml no_std example in readme (bytecodealliance#265)

f899fe9

Co-authored-by: Alexander Theißen <alex.theissen@me.com>

frank-emrich pushed a commit to frank-emrich/wasmtime that referenced this pull request Dec 3, 2024

Merge with upstream (bytecodealliance#265)

7a10550

avanhatt added a commit to wellesley-prog-sys/wasmtime that referenced this pull request Apr 9, 2025

aarch64: conversions int to fp (bytecodealliance#265)

cdbe7b1

fcvt_from_uint and fcvt_from_sint Updates bytecodealliance#224 Co-authored-by: Michael McLoughlin <mcloughlin@cmu.edu>

Conversation

mrowqa commented Aug 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tschneidereit commented Aug 9, 2019

Uh oh!

Uh oh!

alexcrichton commented Aug 9, 2019

Uh oh!

mrowqa commented Aug 9, 2019

Uh oh!

alexcrichton commented Aug 9, 2019

Uh oh!

mrowqa commented Aug 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented Aug 9, 2019

Uh oh!

mrowqa commented Aug 9, 2019

Uh oh!

alexcrichton commented Aug 9, 2019

Uh oh!

mrowqa commented Aug 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcrichton commented Aug 12, 2019

Uh oh!

mrowqa commented Aug 20, 2019

Uh oh!

mrowqa commented Aug 20, 2019

Uh oh!

alexcrichton left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alexcrichton Aug 22, 2019

Choose a reason for hiding this comment

Uh oh!

mrowqa Aug 26, 2019

Choose a reason for hiding this comment

Uh oh!

sunfishcode Aug 26, 2019

Choose a reason for hiding this comment

Uh oh!

alexcrichton Aug 26, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mrowqa commented Aug 26, 2019

Uh oh!

mrowqa commented Aug 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

peterhuene commented Nov 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mrowqa commented Nov 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mrowqa commented Aug 9, 2019 •

edited

Loading

mrowqa commented Aug 9, 2019 •

edited

Loading

mrowqa commented Aug 9, 2019 •

edited

Loading

mrowqa commented Aug 26, 2019 •

edited

Loading

peterhuene commented Nov 7, 2019 •

edited

Loading

mrowqa commented Nov 7, 2019 •

edited

Loading