PVF: Remove `rayon` and some uses of `tokio` #7153
Conversation
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
4. The order of thread selection was flipped to make (3) sound (see the note in the code).

I left some TODOs related to panics which I'm going to address soon as part of #7045.

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.
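Since `rayon`'s thread stack size setting has no per-thread equivalent in `tokio`, the workers now spawn plain OS threads. A minimal sketch of spawning a job thread with an explicit stack size via `std::thread::Builder` (the constant and function names here are illustrative, not the actual worker code):

```rust
use std::{io, thread};

// Hypothetical stack size; the real value is defined in the worker crate.
const JOB_THREAD_STACK_SIZE: usize = 2 * 1024 * 1024;

/// Spawn a job thread with an explicit stack size, which `std::thread::Builder`
/// allows per thread (unlike tokio's runtime builder, which is per runtime).
fn spawn_job_thread<F, T>(job: F) -> io::Result<thread::JoinHandle<T>>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .name("pvf-job".into())
        .stack_size(JOB_THREAD_STACK_SIZE)
        .spawn(job)
}
```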
Addresses a couple of follow-up TODOs from #7153.
```rust
// Run the main worker loop.
let rt = Runtime::new().expect("Creates tokio runtime. If this panics the worker will die and the host will detect that and deal with it.");
let handle = rt.handle();
let err = rt
```
Why do we need tokio and futures if everything async-related is already purged? Filesystem interactions are sync in nature, and for the worker, reading from the socket is blocking too.
You're right! I just didn't remove the rest of async to keep this PR focused. But I can do it here if you want.
Note that we still need to remove the dependencies on polkadot-node-core-pvf and tracing-gum to fully remove the dependency on tokio.
How's tracing-gum related to tokio?
I just ran cargo tree -e normal in the crate and saw tokio several times in the output, e.g. under sc-network and libp2p crates. I have no idea how tracing-gum works though.
It's jaeger, even though gum only uses hashing from it.
> You're right! I just didn't remove the rest of async to keep this PR focused. But I can do it here if you want.
Better to polish the rest of the code and properly synchronize threads, and take care of removing tokio later (if it's possible).
- Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread.
- Measure the full CPU time in the execute thread.
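For illustration, a minimal sketch of measuring CPU time inside the job thread itself, assuming the `cpu_time` crate; `run_and_measure` is a hypothetical helper, not the PR's actual code:

```rust
use cpu_time::ProcessTime;
use std::time::Duration;

/// Run a job and return its result together with the CPU time it consumed,
/// measured inside the same thread so that delays in joining on the thread
/// do not inflate the observed time.
fn run_and_measure<T>(job: impl FnOnce() -> T) -> (T, Duration) {
    let start = ProcessTime::now();
    let result = job();
    let cpu_time_elapsed = start.elapsed();
    (result, cpu_time_elapsed)
}
```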
Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std docs. Considered also using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is.
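For readers unfamiliar with the pattern, here is a minimal sketch of the condvar-based synchronization described above. It uses the `WaitOutcome` enum that a later commit introduces in place of the bool; the exact names and signatures are illustrative rather than the PR's final code:

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Copy, PartialEq, Eq)]
enum WaitOutcome {
    Finished,
    TimedOut,
    Pending,
}

type Cond = Arc<(Mutex<WaitOutcome>, Condvar)>;

fn new_cond() -> Cond {
    Arc::new((Mutex::new(WaitOutcome::Pending), Condvar::new()))
}

/// Called by a job thread when it is done: record the outcome and wake the waiter.
fn notify(cond: &Cond, outcome: WaitOutcome) {
    let (lock, cvar) = &**cond;
    *lock.lock().unwrap() = outcome;
    cvar.notify_all();
}

/// Called by the main worker thread: block until some thread reports an outcome.
fn wait_for_threads(cond: &Cond) -> WaitOutcome {
    let (lock, cvar) = &**cond;
    let guard = cvar
        .wait_while(lock.lock().unwrap(), |outcome| *outcome == WaitOutcome::Pending)
        .unwrap();
    *guard
}
```

The `Pending` state is what the waiter blocks on; whichever job thread finishes first flips it and notifies, so the waiter only joins on a thread that has actually reported.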
@slumber I pushed another change, can you please take a look? Thank you for helping!
node/core/pvf/worker/src/common.rs
```rust
#[derive(Clone, Copy)]
pub enum WaitOutcome {
    JobFinished,
    CpuTimedOut,
```
nit: Wouldn't JobPending and JobTimedOut sound better?
Or remove the prefix completely: Finished, TimedOut, Pending.
```rust
WaitOutcome::JobFinished => {
    let _ = cpu_time_monitor_tx.send(());
    execute_thread.join().unwrap_or_else(|e| {
        // TODO: Use `Panic` error once that is implemented.
```
Maybe have an issue for this, rather than a TODO in the code?
I already addressed it here, it's approved so I'll merge it right after this PR. (I had planned to do these changes right after one another, so I left the TODO as a marker for myself, did the change, and set #7155's merge target to this branch. Will do issues instead in the future. 👍)
```rust
        None
    },
}
// Join on the thread handle.
```
nit: comment seems superfluous
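Regarding the `Panic` TODO in the snippet above: joining a panicked thread yields the panic payload as a `Box<dyn Any + Send>`, which can be turned into a message along these lines (a hedged sketch of the general technique, not the error handling ultimately merged in #7155):

```rust
use std::any::Any;
use std::thread;

/// Try to extract a human-readable message from a thread's panic payload.
fn stringify_panic_payload(payload: Box<dyn Any + Send + 'static>) -> String {
    match payload.downcast::<&'static str>() {
        Ok(msg) => msg.to_string(),
        Err(payload) => match payload.downcast::<String>() {
            Ok(msg) => *msg,
            // At least report that the thread panicked, even without a message.
            Err(_) => "unknown panic payload".to_string(),
        },
    }
}

/// Join on a job thread, converting a panic into an error string instead of
/// propagating the panic into the worker's main thread.
fn join_job_thread<T>(handle: thread::JoinHandle<T>) -> Result<T, String> {
    handle.join().map_err(stringify_panic_payload)
}
```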
node/core/pvf/worker/src/prepare.rs
```rust
    Err(PrepareError::TimedOut)
},
Ok(None) => Err(PrepareError::IoErr(
    "error communicating over finished channel".into(),
```
| "error communicating over finished channel".into(), | |
| "error communicating over closed channel".into(), |
```rust
/// Block the thread while it waits on the condvar or on a timeout. If the timeout is hit,
/// returns `None`.
#[cfg_attr(not(any(target_os = "linux", feature = "jemalloc-allocator")), allow(dead_code))]
pub fn wait_for_threads_with_timeout(cond: &Cond, dur: Duration) -> Option<WaitOutcome> {
```
Could we have an extra variant in `WaitOutcome` instead of the `None`? It would then have the same return type as `wait_for_threads`.
I think it is better the way it is: the variant wouldn't be applicable in a couple of places, so we'd need extra `unreachable!`s. And if we set it to a different variant here, it would make the condvar read as no longer pending.
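To make the trade-off concrete, a sketch of the timeout variant under the same assumed `Cond`/`WaitOutcome` types as the earlier sketch: on a timeout it returns `None` and leaves the shared state untouched, so the condvar still reads as pending and the job threads can notify it later (illustrative only, not the PR's exact code):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

#[derive(Clone, Copy, PartialEq, Eq)]
enum WaitOutcome {
    Finished,
    TimedOut,
    Pending,
}

type Cond = Arc<(Mutex<WaitOutcome>, Condvar)>;

/// Wait on the condvar with a timeout. On timeout, return `None` without
/// modifying the shared state, which stays `Pending`.
fn wait_for_threads_with_timeout(cond: &Cond, dur: Duration) -> Option<WaitOutcome> {
    let (lock, cvar) = &**cond;
    let (guard, timeout_result) = cvar
        .wait_timeout_while(lock.lock().unwrap(), dur, |outcome| {
            *outcome == WaitOutcome::Pending
        })
        .unwrap();
    if timeout_result.timed_out() {
        None
    } else {
        Some(*guard)
    }
}
```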
* PVF: Remove `rayon` and some uses of `tokio`
  1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
  2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`. (NOTE: This PR does not totally remove the `tokio` dependency just yet.)
  3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
  4. The order of thread selection was flipped to make (3) sound (see the note in the code).

  I left some TODOs related to panics which I'm going to address soon as part of #7045.
* PVF: Vote invalid on panics in execution thread (after a retry)

  Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate.
* Address a couple of TODOs

  Addresses a couple of follow-up TODOs from #7153.
* Add some documentation to implementer's guide
* Fix compile error
* Fix compile errors
* Fix compile error
* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

  Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
* Address comments + couple other changes (see message)

  - Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread.
  - Measure the full CPU time in the execute thread.
* Implement proper thread synchronization

  Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std docs. Considered also using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is.
* Catch panics in threads so we always notify condvar
* Use `WaitOutcome` enum instead of bool condition variable
* Fix retry timeouts to depend on exec timeout kind
* Address review comments
* Make the API for condvars in workers nicer
* Add a doc
* Use condvar for memory stats thread
* Small refactor
* Enumerate internal validation errors in an enum
* Fix comment
* Add a log
* Fix test
* Update variant naming
* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
PULL REQUEST
Overview
- We were using `rayon` to spawn a superfluous threadpool and thread to do execution, so it was removed.
- We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible per-runtime but not per-thread). Since we want to remove `tokio` from the workers anyway, I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
- Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was removed in favor of non-async coordination using sync primitives.

I left some TODOs related to panics which I'm going to address soon as part of #7045.
Related issues
Gets us closer to paritytech/polkadot-sdk#649.
Footnotes
[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.