[Merged by Bors] - TaskPool Panic Handling #6443
james7132 wants to merge 9 commits into bevyengine:main
Conversation
crates/bevy_tasks/src/task_pool.rs
```rust
        local_executor.tick().await;
    }
};
// Use unwrap_err because we expect a Closed error
```
This comment is no longer true.
It is accurate once again, haha.
The error bubbling approach and the log-and-ignore patterns both seem useful for different use cases. Abort on panic seems... interesting, but really feels like something that should be opt-in (and I'm not sure it's worth the pervasive complexity maintaining both paths would take). What would it take to be able to restart tasks if we discover that they failed?
I would use all of them for different use cases:
The executor will drop the task, as it may be in an unrecoverable state after the panic. IMO users should handle the panic and reschedule the task(s) if they want to retry the same computation.
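The handle-and-reschedule pattern suggested above can be sketched with plain `std::panic::catch_unwind`. This is a minimal std-only sketch, not Bevy's API; `run_with_retry` and its closure argument are hypothetical names:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical helper: retry a computation a fixed number of times,
// treating a panic as a failed attempt that gets rescheduled.
fn run_with_retry<T>(mut attempts: u32, work: impl Fn() -> T) -> Option<T> {
    while attempts > 0 {
        // catch_unwind turns the panic into an Err instead of letting it
        // unwind through (and kill) the calling thread.
        match catch_unwind(AssertUnwindSafe(&work)) {
            Ok(value) => return Some(value),
            Err(_) => attempts -= 1, // the work panicked; try again
        }
    }
    None // give up after exhausting all attempts
}
```

`AssertUnwindSafe` is needed because the closure is borrowed across the unwind boundary; the caller takes responsibility for any state the panicking attempt may have left behind.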
Some further investigation:
These two combined with the current implementation suggests that we can just straight up ignore any errors from The Abort-on-Panic approach can be taken when we start diving deeper into making our own async executor. |
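The ignore-the-error behavior relies on the same catch-and-keep-going pattern smol uses around `Executor::run`. Below is a std-only sketch of that idea using a plain worker thread and a channel of jobs instead of an async executor; `worker` and `Job` are hypothetical names. A panic in one job is caught, the job is dropped, and the thread keeps running:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

type Job = Box<dyn FnOnce() + Send>;

// Worker loop: a panic in one job is caught and discarded,
// so the thread survives and keeps processing the remaining jobs.
fn worker(jobs: mpsc::Receiver<Job>) -> (usize, usize) {
    let (mut ok, mut panicked) = (0, 0);
    for job in jobs {
        match catch_unwind(AssertUnwindSafe(job)) {
            Ok(()) => ok += 1,
            Err(_) => panicked += 1, // job died; the thread lives on
        }
    }
    (ok, panicked)
}
```

Without the `catch_unwind`, the first panicking job would unwind through the loop and permanently kill the worker thread, which is exactly the failure mode this PR fixes.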
Setting this to
alice-i-cecile
left a comment
Yep, based on your investigations this seems like a significant improvement already.
I'd much rather merge this and then add more sophisticated strategies later as needed than block on designing them right now.
PROMETHIA-27
left a comment
Looks good to me, although this isn't my area of expertise.
@hymm if you disable cargo test's output capture, do you see any panic messages?
I think the hierarchy test is failing due to some weird interaction between the 2 scopes. I tried to reproduce tasks disappearing with just the task pools and was unable to:

```rust
AsyncComputeTaskPool::get()
    .spawn(async {
        loop {
            info!("this is a task");
            sleep(Duration::from_millis(1000));
            yield_now().await;
        }
    })
    .detach();

AsyncComputeTaskPool::get()
    .spawn(async {
        sleep(Duration::from_millis(100));
        panic!("this is a panic");
    })
    .detach();
```

If we're trying to get this in before 0.9, I'd be happy to approve this with the hierarchy test changes reverted, as I can confirm that this PR prevents the thread from getting killed and is an improvement over the status quo. In the meantime I'm going to investigate some more and see if I can get a reproduction using 2 scopes.
Done. Can you file a separate issue to track this?

bors r+
# Objective

Right now, the `TaskPool` implementation allows panics to permanently kill worker threads. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task, which is poor ergonomics and an even poorer developer experience. This is exacerbated by #2250, as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind` and discard the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation.

~~However, this is not entirely ideal as:~~

- ~~the panic is not signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below.)
- ~~no error is logged of any kind~~ (See below.)
- ~~it's unclear if it drops other tasks in the executor~~ (it does not)
- ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in `ParallelExecutor`, this will naturally kill the main thread.

### Alternatives

A final solution will likely incorporate elements of any or all of the following.

#### ~~Log and Ignore~~

~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detached tasks operate in this fashion.~~ Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~

~~Add another layer catching system-level panics in the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~

~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>`, like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### Abort on Panic

The nuclear option. Log the error, abort the entire process when any thread in the task pool panics. Definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code, as any unwinding is optimized out. However, it gives the developer zero options for dealing with the issue, is a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
    fn drop(&mut self) {
        std::process::abort();
    }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(guard);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads will no longer terminate permanently when a task scheduled onto them panics.

Changed: `bevy_tasks::Task` and `bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
Pull request successfully merged into main. Build succeeded:
FYI I'm not sure #5054 should have been closed as I think you'll still get different panic messages on the main thread depending on if a system was run on the main thread or not.
will do.