[tasks] Handle TaskPool panicking threads by NathanSWard · Pull Request #2307 · bevyengine/bevy

NathanSWard · 2021-06-06T06:21:47Z

Objective

Fixes: Panics while loading assets kill IoThreads preventing assets to be further loaded #2302
Currently, when a TaskPool's inner threads panic, there is no way to act on the errors.
The main thread continues executing, now with one fewer thread in one of its task pools.

Solution

Implement a TaskPoolThreadPanicPolicy with two behaviors:
- Restart (default) -> join the panicked thread and then spawn a new thread in its place.
- Propagate -> propagate the panic from the inner thread to the spawning thread (usually the main thread).
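
The two behaviors can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `PanicPolicy`, `handle_finished_worker`, and the `respawn` callback are hypothetical names, and the real policy type lives in `bevy_tasks`.

```rust
use std::panic;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the PR's `TaskPoolThreadPanicPolicy`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
enum PanicPolicy {
    /// Join the panicked thread and spawn a replacement.
    #[default]
    Restart,
    /// Re-raise the worker's panic on the thread that joins it.
    Propagate,
}

/// Joins `handle`. If the worker panicked, applies `policy`.
/// Returns a replacement handle only when a restart happened.
fn handle_finished_worker(
    handle: JoinHandle<()>,
    policy: PanicPolicy,
    respawn: impl FnOnce() -> JoinHandle<()>,
) -> Option<JoinHandle<()>> {
    match handle.join() {
        // The worker exited normally, nothing to do.
        Ok(()) => None,
        Err(payload) => match policy {
            PanicPolicy::Restart => Some(respawn()),
            // Continue the worker's panic on the current thread.
            PanicPolicy::Propagate => panic::resume_unwind(payload),
        },
    }
}

fn main() {
    let crashed = thread::spawn(|| {
        panic!("asset loader failed");
    });
    let replacement = handle_finished_worker(crashed, PanicPolicy::Restart, || {
        thread::spawn(|| println!("replacement worker running"))
    });
    replacement
        .expect("worker should have been restarted")
        .join()
        .unwrap();
}
```

With `PanicPolicy::Propagate`, the same call would unwind the calling thread with the worker's original panic payload instead of returning.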

TODO

- Testing
- Some more docs
- An example

crates/bevy_tasks/src/task_pool.rs

crates/bevy_tasks/Cargo.toml

crates/bevy_core/src/lib.rs

mockersf

looks good to me, but threading is not my forte

tiagolam

Thanks, @NathanSWard! The approach here does seem sensible to me.

One thing does stand out to me, though. Why is state being kept in ThreadStates (which then requires a RwLock to sync the Vec in TaskPoolInner)? Given we already have the async_channel dependency for the shutdown channel, which already offers a multi-producer / multi-consumer channel (both sides Sync + Send), I would have thought we could simply use a channel to flag to the TaskPool which threads need attention. This means we wouldn't have to loop through all ThreadStates even when there's nothing to act on (and drop the RwLock and AtomicBool).
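
The channel idea could look roughly like the following. This is a sketch under assumptions: it uses `std::sync::mpsc` instead of `async_channel` so that it is self-contained, and `PanicNotifier` and `spawn_workers` are hypothetical names that do not appear in the PR.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical guard: reports the worker's index if it is dropped
/// while the thread is unwinding from a panic.
struct PanicNotifier {
    index: usize,
    tx: Sender<usize>,
}

impl Drop for PanicNotifier {
    fn drop(&mut self) {
        if thread::panicking() {
            // The receiver may already be gone, so ignore a failed send.
            let _ = self.tx.send(self.index);
        }
    }
}

/// Spawns `count` workers. The worker whose index equals `faulty` panics.
fn spawn_workers(count: usize, faulty: usize) -> (Vec<JoinHandle<()>>, Receiver<usize>) {
    let (tx, rx) = mpsc::channel();
    let handles = (0..count)
        .map(|index| {
            let tx = tx.clone();
            thread::spawn(move || {
                let _notifier = PanicNotifier { index, tx };
                if index == faulty {
                    panic!("worker {} failed", index);
                }
            })
        })
        .collect();
    (handles, rx)
}

fn main() {
    let (handles, rx) = spawn_workers(4, 2);
    for handle in handles {
        let _ = handle.join();
    }
    // Only panicked workers appear on the channel, so nothing has to be
    // scanned when every worker is healthy.
    let needs_attention: Vec<usize> = rx.try_iter().collect();
    println!("workers needing attention: {:?}", needs_attention);
}
```

The pool only drains the channel, rather than polling a flag per thread.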

NathanSWard · 2021-06-15T03:04:54Z

> Thanks, @NathanSWard! The approach here does seem sensible to me.

Um, to be frank, I didn't even consider the async_channel, and now thinking about it, I think I prefer that over the manual AtomicBool/ThreadState thing I implemented 😄

Thanks for the comment, I'll see what I can come up with!

NathanSWard · 2021-06-15T03:32:02Z

> (and drop the RwLock and AtomicBool).

Well, after looking at this again, we have to keep the RwLock regardless. We need mutable access to the Vec<JoinHandle<()>>, since on PanicPolicy::Restart we have to replace the thread with a new one.
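
A small sketch of why the lock is needed, assuming a pool that is only reachable through `&self`. `Pool` and `restart` are hypothetical names for illustration. `JoinHandle::join` consumes the handle, so the old entry has to be swapped out of the `Vec` before it can be joined.

```rust
use std::sync::RwLock;
use std::thread::{self, JoinHandle};

/// Hypothetical pool: callers only hold `&Pool`, so replacing a handle
/// requires interior mutability.
struct Pool {
    threads: RwLock<Vec<JoinHandle<()>>>,
}

impl Pool {
    fn new(handles: Vec<JoinHandle<()>>) -> Self {
        Pool {
            threads: RwLock::new(handles),
        }
    }

    /// Swaps the handle at `index` for a fresh worker and joins the old one.
    /// Returns `true` if the old worker had panicked.
    fn restart(&self, index: usize, work: impl FnOnce() + Send + 'static) -> bool {
        let mut threads = self.threads.write().unwrap();
        let old = std::mem::replace(&mut threads[index], thread::spawn(work));
        drop(threads); // release the lock before blocking on `join`
        old.join().is_err()
    }

    fn len(&self) -> usize {
        self.threads.read().unwrap().len()
    }
}

fn main() {
    let pool = Pool::new(vec![
        thread::spawn(|| {
            panic!("worker 0 failed");
        }),
        thread::spawn(|| ()),
    ]);
    let had_panicked = pool.restart(0, || ());
    println!("worker 0 had panicked: {}", had_panicked);
    println!("pool still has {} threads", pool.len());
}
```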

tiagolam · 2021-06-15T09:05:18Z

> > (and drop the RwLock and AtomicBool).
>
> Well, after looking at this again, we have to keep the RwLock regardless. We need mutable access to the Vec<JoinHandle<()>>, since on PanicPolicy::Restart we have to replace the thread with a new one.

Hmm, yeah, that's true. My initial thinking was that we could drop the need for the Vec<JoinHandle<()>> as well, by sending the JoinHandle<()> over the channel to be received in handle_panicking_threads, but on a closer look it seems a thread can't get access to its own handle. This makes it less convenient, so I think the current approach holds up well.
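
To illustrate the limitation: inside a worker, `std::thread::current()` returns a `Thread`, not a `JoinHandle`, so the most a worker can send about itself is its `ThreadId`. The pool then still needs the stored handles to look the thread up. A minimal sketch, where `find_handle` is a hypothetical helper:

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle, ThreadId};

/// Finds the position of the handle whose thread has the given id.
fn find_handle(handles: &[JoinHandle<()>], id: ThreadId) -> Option<usize> {
    handles.iter().position(|handle| handle.thread().id() == id)
}

fn main() {
    let (tx, rx) = mpsc::channel::<ThreadId>();
    let handles: Vec<JoinHandle<()>> = (0..3)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || {
                if i == 1 {
                    // A worker can report its id, but not its own JoinHandle.
                    tx.send(thread::current().id()).unwrap();
                }
            })
        })
        .collect();

    let id = rx.recv().unwrap();
    println!("reporting worker is at index {:?}", find_handle(&handles, id));
    for handle in handles {
        handle.join().unwrap();
    }
}
```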

@NathanSWard

Built on: bevyengine#2307 from @NathanSWard

# Objective

Right now, the `TaskPool` implementation allows panics to permanently kill worker threads. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind` and discard the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation.

~~However, this is not entirely ideal as:~~

- ~~the panic is not signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below).
- ~~no error is logged of any kind~~ (See below)
- ~~it's unclear if it drops other tasks in the executor~~ (it does not)
- ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in ParallelExecutor, this will naturally kill the main thread.

### Alternatives

A final solution likely will incorporate elements of any or all of the following.

#### ~~Log and Ignore~~

~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detached tasks operate in this fashion.~~

Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~

~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~

`async_task::Task` bubbles up panics already, so this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~

~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>` like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~

`async_task::Task` bubbles up panics already, so this will transitively push panics all the way to the main thread.

#### Abort on Panic

The nuclear option. Log the error, then abort the entire process when any thread in the task pool panics. This definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However, it gives the developer zero options for dealing with the issue, is a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
    fn drop(&mut self) {
        // Only reached if the guard is dropped during unwinding,
        // i.e. the task panicked.
        std::process::abort();
    }
}

let guard = AbortOnPanic;
// Run task
// The task finished normally: disarm the guard so `drop` never runs.
std::mem::forget(guard);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads will no longer terminate permanently when a task scheduled onto them panics.

Changed: `bevy_tasks::Task` and `bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
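
The `catch_unwind` approach described above can be illustrated without an async executor. The sketch below is an assumption-laden stand-in, not the code from #6443: it replaces `Executor::run` with a plain job loop over `std::sync::mpsc`, and `Job` and `spawn_resilient_worker` are hypothetical names.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Spawns one worker that survives panicking jobs. Returns the job sender and
/// a handle that resolves to the number of jobs that ran to completion.
fn spawn_resilient_worker() -> (Sender<Job>, JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let handle = thread::spawn(move || {
        let mut completed = 0;
        for job in rx {
            // Catch the unwind so that one bad job does not end the loop.
            // The default panic hook still prints the panic message.
            if panic::catch_unwind(AssertUnwindSafe(job)).is_ok() {
                completed += 1;
            }
        }
        completed
    });
    (tx, handle)
}

fn main() {
    let (tx, worker) = spawn_resilient_worker();
    tx.send(Box::new(|| println!("job 1"))).unwrap();
    tx.send(Box::new(|| {
        panic!("job 2 failed");
    }))
    .unwrap();
    tx.send(Box::new(|| println!("job 3"))).unwrap();
    drop(tx); // closing the channel lets the worker loop finish
    println!("jobs completed: {}", worker.join().unwrap());
}
```

Unlike this sketch, which simply discards the panic, the description above relies on `async_task::Task` to re-raise the panic on whoever awaits the task.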

james7132 · 2022-11-03T00:05:34Z

Superseded by #6443.

Handle TaskPool panicking threads

271d700

NathanSWard added the core and C-Feature (A new feature, making something new possible) labels Jun 6, 2021

NathanSWard commented Jun 6, 2021

crates/bevy_tasks/src/task_pool.rs (outdated, resolved)

appease the wasm gods

8ad299f

NathanSWard changed the title ~~Handle TaskPool panicking threads~~ [tasks] Handle TaskPool panicking threads Jun 6, 2021

NathanSWard added 2 commits June 7, 2021 13:24

refactor PanicPolicy to task_pool_common

58cec2e

rename panic policy

036b321

NathanSWard commented Jun 7, 2021

crates/bevy_tasks/Cargo.toml (outdated, resolved)

move conditional check outside of match

8ee3ec0

NathanSWard requested review from alice-i-cecile, bjorn3 and mockersf June 7, 2021 19:32

NathanSWard added 2 commits June 7, 2021 13:37

format errors

8f74d10

more wasm erros...

fcfc0c8

bjorn3 reviewed Jun 7, 2021

crates/bevy_core/src/lib.rs (resolved)

bjorn3 reviewed Jun 7, 2021

crates/bevy_core/src/lib.rs (resolved)

NathanSWard added 3 commits June 7, 2021 15:04

replace newline

83500c5

wasm stuffffzzz

5974cac

add docs

b8ee81a

github-actions bot added the S-Needs-Triage (This issue needs to be labelled) label Jun 7, 2021

NathanSWard removed the S-Needs-Triage (This issue needs to be labelled) label Jun 7, 2021

use bevy_utils for tracing

7e840cf

github-actions bot added the S-Needs-Triage (This issue needs to be labelled) label Jun 7, 2021

rename task pool trait

f9e4ed3

NathanSWard removed the S-Needs-Triage (This issue needs to be labelled) label Jun 7, 2021

github-actions bot added the S-Needs-Triage (This issue needs to be labelled) label Jun 7, 2021

NathanSWard removed the S-Needs-Triage (This issue needs to be labelled) label Jun 8, 2021

NathanSWard added 2 commits June 8, 2021 16:20

specify bevy_utils version

534cda4

add testing

57565f3

NathanSWard added 2 commits June 8, 2021 16:59

use unwind_resume in propagate path. Update testing

3358af8

moving restart warning to use old_state's threa name

53753dd

NathanSWard requested a review from bjorn3 June 8, 2021 23:03

fix enumeration to match thread's index

4e8d5a1

mockersf approved these changes Jun 9, 2021

tiagolam reviewed Jun 15, 2021

alice-i-cecile approved these changes Jun 15, 2021

cart added the S-Pre-Relicense (This PR was made before Bevy added the Apache license. Cannot be merged or used for other work) label Jul 23, 2021

mockersf removed the S-Pre-Relicense (This PR was made before Bevy added the Apache license. Cannot be merged or used for other work) label Jul 24, 2021

alice-i-cecile added the S-Needs-Review label Sep 22, 2021

cart removed the S-Needs-Review label Dec 16, 2021

james7132 mentioned this pull request Jun 12, 2022

[Merged by Bors] - change panicking test to not run on global task pool #4998

Closed

alice-i-cecile added the P-High (This is particularly urgent, and deserves immediate attention) and S-Adopt-Me (The original PR author has no intent to complete this work. Pick me up!) labels Jun 12, 2022

SarthakSingh31 added a commit to SarthakSingh31/bevy that referenced this pull request Jun 15, 2022

Handle TaskPool panicking threads

fba1605

Built on: bevyengine#2307 from @NathanSWard

SarthakSingh31 added a commit to SarthakSingh31/bevy that referenced this pull request Jun 15, 2022

Handle TaskPool panicking threads

22168c4

Built on: bevyengine#2307 from @NathanSWard

mockersf mentioned this pull request Jun 20, 2022

#[should_panic(expected = ...)] non-functional. #5054

Closed

james7132 mentioned this pull request Nov 2, 2022

[Merged by Bors] - TaskPool Panic Handling #6443

Closed

james7132 closed this Nov 3, 2022

NathanSWard wants to merge 17 commits into bevyengine:main from NathanSWard:nward/task-panic-handling

7 participants