This repository was archived by the owner on Mar 24, 2022. It is now read-only.

Refactor run_async to use custom Future implementation #626

Merged
cfallin merged 7 commits into bytecodealliance:main from benaubin:custom-run-async-future
Feb 19, 2021

Conversation

@benaubin
Contributor

@benaubin benaubin commented Jan 15, 2021

This PR replaces the async fn run_async_internal() with a custom implementation of Future.

Advantages:

  • Significantly less unsafe code than the original implementation of run_async (just one line, used for pinning the future)
  • A much sounder execution model: the future is executed from within the guest! When host execution is desired, tokio::spawn or similar works from within block_on
  • Support for defining async fn hostcalls without ever manually calling block_on
  • Near-zero overhead for polling an immediately-ready future (only the cost of an Arc.clone()). For example, a future like async { 5 } would immediately resolve to 5, without requiring a yield to the host.
  • Support for regular yielding from an async execution, even from within block_on
  • A defined type for the future returned by run_async, which makes it easier for embedders to manage the execution, such as by changing the execution bound between polls.

Semantically, guests behave exactly as normal async functions would, where block_on is equivalent to calling await.

It's now possible to declare an async hostcall like an async function:

#[lucet_hostcall]
#[no_mangle]
pub async fn hostcall_async(vmctx: &Vmctx) -> u32 {
    for _ in 0..6 {
        YieldingFuture { times: 2 }.await
    }

    return 0;
}

I also added a new try_block_on method, which allows an embedder to define fallback behavior for when an instance is not run from within an async execution context.

Additionally, I removed the inst_count_bound argument from run_async and run_async_start (added in #612; that API change was not yet released) and replaced it with a method and field on the RunAsync future. This is more flexible for users who need to configure a CPU bound (they can change the bound between any two polls) while being less verbose (and backward compatible with the old API) for users without that need.
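As a rough illustration of that flexibility, here is a toy model (all names hypothetical, using only std — no Lucet types) of a future that exposes its per-poll bound as a public field, so the embedder can retune it between any two polls:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Toy model of a RunAsync-style future: `bound` plays the role of the
// per-poll instruction bound and may be changed between any two polls.
struct BoundedRun {
    pub bound: u64,
    remaining: u64,
}

impl BoundedRun {
    fn new(total_work: u64, bound: u64) -> Self {
        BoundedRun { bound, remaining: total_work }
    }
}

impl Future for BoundedRun {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        // Run at most `bound` units of "work" this poll, then yield.
        let step = self.bound.min(self.remaining);
        self.remaining -= step;
        if self.remaining == 0 {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

// A no-op waker so the model can be polled by hand, without an executor.
unsafe fn vt_clone(_: *const ()) -> RawWaker { RAW }
unsafe fn vt_noop(_: *const ()) {}
const VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
const RAW: RawWaker = RawWaker::new(std::ptr::null(), &VTABLE);

fn noop_waker() -> Waker {
    unsafe { Waker::from_raw(RAW) }
}
```

A caller can poll once under a tight bound, observe Pending, raise `bound`, and poll again; nothing about the type needs to be rebuilt between polls.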

@benaubin
Contributor Author

benaubin commented Jan 15, 2021

In case it's useful, here's one way to implement a soft CPU time limit with this refactor. I'm not sure if it's worth adding an extra dependency to lucet, so I'm not including it in the PR (I'd be happy to, though!).

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Duration;

use cpu_time::ThreadTime;
// RunAsync, UntypedRetVal, and Error come from the lucet-runtime crate.

pub struct RunAsyncWithCPULimit<'a> {
    run_async: RunAsync<'a>,
    pub cpu_time_limit: Duration,
    pub cpu_time_used: Duration,
}

struct CPULimitExceeded;

impl<'a> Future for RunAsyncWithCPULimit<'a> {
    type Output = (
        Result<Result<UntypedRetVal, Error>, CPULimitExceeded>,
        Duration,
    );

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let start = ThreadTime::now();

        let poll = Pin::new(&mut self.run_async).poll(cx);

        self.cpu_time_used += start.elapsed();

        match poll {
            Poll::Ready(res) => Poll::Ready((Ok(res), self.cpu_time_used)),
            Poll::Pending if self.cpu_time_used > self.cpu_time_limit => {
                Poll::Ready((Err(CPULimitExceeded), self.cpu_time_used))
            }
            Poll::Pending => Poll::Pending,
        }
    }
}

Feel free to use this example however you'd like (I'm hereby releasing this example under the CC-0 license)

Member

@cfallin cfallin left a comment


Hi @benaubin,

Thanks for this PR; however, I'm not sure I understand the reasoning behind some of the changes below. If I'm understanding correctly, the changes here create an explicit named type for the future, moving the existing async fn into a custom trait impl with an explicit state machine.

It might help me to understand the need for these if you could summarize why one can't implement the wall-clock bound example above by wrapping the future returned by the current API. For example, I think one could write an add_timeout<F: Future...>(inner: F) -> TimeoutFuture<F> and define TimeoutFuture to store an F, without requiring a named future-state struct on the Lucet side. This is largely like your example above, but with the struct type param to make it work without Lucet changes. Is there a reason this wouldn't work?
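Such a wrapper might look roughly like this (an editor's sketch using only std, not code from either implementation; `TimeoutFuture`, `add_timeout`, and `TimedOut` are hypothetical names, and the `Unpin` bound is a simplification — futures produced by async blocks are !Unpin, so a real version would need pin-projection):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::time::{Duration, Instant};

// Generic wall-clock timeout over any future: no named future-state
// struct on the Lucet side is required.
struct TimeoutFuture<F> {
    inner: F,
    deadline: Instant,
}

#[derive(Debug)]
struct TimedOut;

fn add_timeout<F: Future>(inner: F, limit: Duration) -> TimeoutFuture<F> {
    TimeoutFuture { inner, deadline: Instant::now() + limit }
}

// `F: Unpin` keeps the sketch simple; async-block futures are !Unpin,
// so a real version needs pin-projection (e.g. the pin-project crate).
impl<F: Future + Unpin> Future for TimeoutFuture<F> {
    type Output = Result<F::Output, TimedOut>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if Instant::now() >= self.deadline {
            return Poll::Ready(Err(TimedOut));
        }
        match Pin::new(&mut self.inner).poll(cx) {
            Poll::Ready(v) => Poll::Ready(Ok(v)),
            Poll::Pending => Poll::Pending,
        }
    }
}

// A no-op waker so the sketch can be polled by hand, without an executor.
unsafe fn vt_clone(_: *const ()) -> RawWaker { RAW }
unsafe fn vt_noop(_: *const ()) {}
const VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
const RAW: RawWaker = RawWaker::new(std::ptr::null(), &VTABLE);

fn noop_waker() -> Waker {
    unsafe { Waker::from_raw(RAW) }
}
```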

@benaubin
Contributor Author

benaubin commented Jan 19, 2021

It might help me to understand the need for these if you could summarize why one can't implement the wall-clock bound example above by wrapping the future returned by the current API. For example, I think one could write an add_timeout<F: Future...>(inner: F) -> TimeoutFuture and define TimeoutFuture to store an F, without requiring a named future-state struct on the Lucet side. This is largely like your example above, but with the struct type param to make it work without Lucet changes. Is there a reason this wouldn't work?

To be honest, I completely forgot that you could use generics for this. However, futures created by async/await are not Unpin - so it would be more difficult to write (although theoretically possible).

The primary thing that is possible to do with this API but not an async fn is adjusting the instruction bound between polls.

Additionally, the custom Future implementation reduces major sharp edges by making it much clearer how Lucet and the future executor interact.

  • If an instance polls an immediately-ready future, the old API would reset the instruction bound and continue running without yielding. While possible to fix within the async block, having a loop makes this the default behavior, whereas the custom implementation guarantees an instance is never called twice for a single poll.
  • If you run_async on an instance where bound_expired(), the old implementation would resume the instance rather than fail when attempting to run the instance from the given entry point. Solving this requires some sort of explicit state machine for the run loop, rather than depending on the instance's state.

In the future, this API makes other features possible:

  • It would be possible to add an API that allows yielding from a run_async context.
  • A fully synchronous API that can drive the RunAsync future without requiring a std::task::Waker, except when blocking. This would likely be a much better way to implement timeouts (no need to yield to the executor if the instruction bound expired but the CPU-time bound did not, improving performance/cache utilization for non-adversarial guests)

@benaubin benaubin requested a review from cfallin February 9, 2021 20:35
@benaubin
Contributor Author

benaubin commented Feb 9, 2021

@cfallin Hey Chris! I just pushed a new commit that takes advantage of the custom Future implementation to do the first poll of a future without yielding. For futures that are immediately ready, this implementation requires no context switch and no heap allocations - similar to using .await.

@benaubin benaubin force-pushed the custom-run-async-future branch 2 times, most recently from f8fcc38 to c6e83a2 on February 9, 2021 21:38
@benaubin benaubin force-pushed the custom-run-async-future branch from cba158f to 8579d6b on February 10, 2021 20:25
@cfallin
Member

cfallin commented Feb 10, 2021

@benaubin Sorry, I haven't gotten back to this as I've been somewhat short on time. To be honest, I still don't quite understand some of the details and reasoning here. For example, could you explain what you mean by: "takes advantage of the custom Future implementation to do the first poll of a future without yielding." -- first poll of which future? AFAICT in the current code, if the called Wasm function returns soon enough (before first bound expiration), we don't yield and require another poll, we just return a completed value. Or do you mean in some other situation?

From above, the only desired behavior I see that is not yet possible is to set the bound to a different amount on every poll. Everything else is possible with the current API. Is that correct?

Overall, I am hesitant to accept a full rewrite of the core async code, especially one that replaces an async fn with a custom future implementation and especially one that uses unsafe code, without a really really compelling reason. I'd be happy to look at a much smaller-scoped patch that allows for adjusting the bound on each poll, though, if you'd like. Sorry and thanks for the patience here!

@benaubin
Contributor Author

benaubin commented Feb 11, 2021

No problem!

The change I made yesterday makes it so there is no yield if the future passed to block_on is immediately ready. For example, a call like vmctx.block_on(async { 5 }), would return 5 without needing a yield to the host.
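The mechanism can be sketched generically (an editor's sketch, not the PR's code; `poll_once` is a hypothetical helper): poll the future once on the spot, and only fall back to yielding if it reports Pending.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Poll a future exactly once. Some(v) means it was immediately ready,
// so no yield back to the host executor is needed.
fn poll_once<F: Future + Unpin>(fut: &mut F, cx: &mut Context<'_>) -> Option<F::Output> {
    match Pin::new(fut).poll(cx) {
        Poll::Ready(v) => Some(v),
        Poll::Pending => None,
    }
}

// A no-op waker so the sketch can be polled by hand, without an executor.
unsafe fn vt_clone(_: *const ()) -> RawWaker { RAW }
unsafe fn vt_noop(_: *const ()) {}
const VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
const RAW: RawWaker = RawWaker::new(std::ptr::null(), &VTABLE);

fn noop_waker() -> Waker {
    unsafe { Waker::from_raw(RAW) }
}
```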

I'm currently working on reducing the amount of unsafe code necessary to make this work.

@benaubin
Contributor Author

benaubin commented Feb 11, 2021

@cfallin I just pushed a commit that removes nearly all of the unsafe code from run_async, including the transmute that the trampoline-style execution required. There's one unsafe block left related to run_async, which is solely used for pinning the future (which is required in order to poll).

@benaubin
Contributor Author

benaubin commented Feb 11, 2021

@cfallin Just pushed a new commit that adds support for defining an async hostcall just like an async function:

#[lucet_hostcall]
#[no_mangle]
pub async fn hostcall_async(vmctx: &Vmctx, times: u32, times_double: u32) -> u32 {
    for _ in 0..times {
        YieldingFuture { times: 2 }.await
    }

    return times * 2;
}

Like Rust's .await, there is no overhead for a future that is immediately ready, and all execution of the future takes place from within the guest. To make this work, if the future passed to block_on returns Poll::Pending, block_on yields to RunAsync, which itself returns Poll::Pending.

Later, the async executor polls the RunAsync future, which resumes the guest, which polls the future from within the guest execution context.

@benaubin benaubin force-pushed the custom-run-async-future branch 5 times, most recently from 9e4c495 to a7a775b on February 12, 2021 14:52
@benaubin
Contributor Author

benaubin commented Feb 12, 2021

Rebased onto main, consolidated commits (happy to squash into a single commit if requested, but I'm separating out a few of them in case some of the changes are rejected).

@cfallin the changes should be ready for review. I know that my original changes weren't massive improvements, but these more recent ones bring much larger advantages, while nearly eliminating the unsafe code in run_async:

  • significantly less unsafe code than the original implementation of run_async (just one line, used for pinning the future)
  • a much sounder execution model: the future is executed from within the guest! When host execution is desired tokio::spawn or similar works from within block_on
  • support for defining async fn hostcalls without ever manually calling block_on
  • near-zero overhead for polling an immediately-ready future (only the cost of an Arc.clone())
  • support for yielding from an async execution, even from within block_on

Semantically, guests behave exactly as normal async functions would, where block_on is equivalent to calling await.

Member

@cfallin cfallin left a comment


@benaubin, thanks for the updates and sorry for the delay on the further review!

I think this is definitely going in the right direction, and given the reductions in unsafe code and after reading and understanding further the async-hostcall implementation, I'm inclined toward making sure this gets in.

I do have one pending uncertainty, though, below, regarding a possible race (it's possible you've thought through this already and if so I'd appreciate a clarifying comment :-) ).

I think that w.r.t. the async hostcall implementation we should also get any input that @pchickey might have. It looks reasonable to me but I may be missing some requirement here.

@benaubin
Contributor Author

Awesome! Thanks, @cfallin. I'll make the changes tomorrow.

@pchickey
Contributor

Thanks for working on this! Many of the futures bits of this are over my head, but the change to the macro to accept async fn is welcome. I had planned to do something just like that for compatibility with wasmtime :)

@benaubin
Contributor Author

benaubin commented Feb 18, 2021

The changes have been made and the test for yielding from async has been added!

@benaubin benaubin force-pushed the custom-run-async-future branch from 07b2c7f to 2b80350 on February 18, 2021 18:50
@benaubin benaubin requested a review from cfallin February 18, 2021 18:50
cfallin
cfallin previously approved these changes Feb 19, 2021
Member

@cfallin cfallin left a comment


Thanks @benaubin -- this LGTM. I think it might be good to have one more sanity-check on the core future/waker logic -- @alexcrichton, would you mind deploying your async expertise here in a final check?

We'll also need to trigger the CircleCI results; usually this occurs when we push to a branch in this repo, but in this case you won't have permissions to do that. @pchickey / @iximeow / @acfoltzer, is there a way around this or will one of us need to push a branch and create a new PR in place of @benaubin?

@pchickey
Contributor

I am not aware of a workaround to the CircleCI problem, unfortunately - I can push a branch, and once it goes green I'll link to the status here and we can merge this PR.

@pchickey
Contributor

You'll need to merge the latest main to fix the cargo audit error.

@benaubin
Contributor Author

@pchickey Done.

@benaubin benaubin requested a review from cfallin February 19, 2021 01:03
By changing the expected yield value from a Box<PhantomData<T>> to a
TypeId, it becomes possible to resume instance execution when the size
of a boxed type is unknown and cannot be passed as an argument.

This makes it possible to yield and resume from an async context by
passing the resumption value through RunAsyncState.

Finally, we have to check that the context didn't change between
resumption within try_block_on, in order to make sure the right Waker
is scheduled and the future will be resumed when it is ready.
@benaubin benaubin force-pushed the custom-run-async-future branch from 0090233 to 4773acb on February 19, 2021 01:05
cfallin
cfallin previously approved these changes Feb 19, 2021
Member

@cfallin cfallin left a comment


re-r+'ing for my part after updates.

Member

@alexcrichton alexcrichton left a comment


As a baseline I'm pretty unfamiliar with the internals of Lucet, and I'm unfortunately finding it difficult to disentangle "just the futures bits" from the rest of the Lucet embedding/runtime. In that sense I can say that after reading this nothing looks obviously wrong but I can't say with conviction that this all looks good to me.

I'm basically trying to find parallels to the async implementation in wasmtime because I believe at a high-level the two async strategies are the same but I was finding it pretty difficult to draw the parallels because I don't understand enough of Lucet.

That being said it looks like this has gotten reviews from other folks, so that's probably good enough? If needed, though, I can set aside more time to learning more of Lucet first and then digging more into the changes here.

@benaubin
Contributor Author

benaubin commented Feb 19, 2021

@alexcrichton It's indeed extremely similar to your wasmtime PR. Lucet runs every instance within a context switch, so it's very similar to wasmtime's fibers for async instances.

The primary difference in logic is that, in this PR, I avoid having to transmute the std::task::Context by cloning the Waker and creating a new context from inside the instance. This is partly because an instance could yield and last for longer than the lifetime of the Context (although it should be possible to use a stricter lifetime bound to soundly avoid the clone of context, it's probably not worth it).

@alexcrichton
Member

Ah ok, that makes sense. While I believe that works well today, the intention of Context was to abstract over possible extensions in the future, so in the limit of time that may not be an equivalent operation. For example, std may one day expand the abilities of Context with new functionality, and certain features of tokio may take advantage of that; but that means hostcalls using tokio futures may unexpectedly not have access to these features, since we're recreating Context.

I don't think that there's any plans at this time to actually do this so it's a pretty meta-level concern though.

@benaubin
Contributor Author

benaubin commented Feb 19, 2021

@alexcrichton That's what I figured. My original implementation passed the context to the instance with a transmute, but also required a lot of unsafe code blocks propagated throughout lucet.

It wouldn't be too hard to change this if necessary, although to do so soundly, RunAsync would need to prevent "yielding" (Lucet instances can yield without blocking on a future and later be resumed from the host). Prior to this PR, that was the case - instances were not allowed to yield when in an async context. After switching the implementation to clone a waker rather than transmute the lifetime of Context, I also relaxed the restriction on yielding from an async context.

For future compatibility's sake, it might be worth reinstating the restriction, or somehow marking yielding as "unstable".

@alexcrichton
Member

I'm not really sure what it means to prevent yielding or be in or not be an async context myself, but I think it's a reasonable state of affairs to just create a new Context for now to prevent a lot of unsafe code. I suspect that any change to Context is years out at the earliest, if ever, so we'd have plenty of runway to update things if needed.

Member

@cfallin cfallin left a comment


r+ on latest commit.

pchickey's branch in #633 shows green CircleCI results for 4773acb, one commit back from this branch's HEAD; the last commit is just an early-return out-denting refactor and passes all other CI.

Given that and given the discussion above with alexcrichton, I think we should be good to merge this. I'll go ahead and merge given the manually-verified CI above!

@cfallin cfallin merged commit bf24022 into bytecodealliance:main Feb 19, 2021
@benaubin benaubin deleted the custom-run-async-future branch February 19, 2021 23:51
pchickey added a commit that referenced this pull request Feb 26, 2021
pchickey pushed a commit that referenced this pull request Feb 26, 2021