create a new scope executor for every scope #7564
hymm wants to merge 1 commit into bevyengine:main from
Conversation
|
This test works on main. But if my mental model of what is causing the deadlock was right, this should deadlock.

```rust
fn can_run_nested_multithreaded_schedules() {
    let mut world = World::default();
    world.init_resource::<MainThreadExecutor>();
    world.init_resource::<SystemOrder>();

    let mut inner_schedule = Schedule::default();
    inner_schedule.set_executor_kind(ExecutorKind::MultiThreaded);
    inner_schedule.add_system(make_function_system(0));

    let mut outer_schedule = Schedule::default();
    outer_schedule.set_executor_kind(ExecutorKind::MultiThreaded);
    outer_schedule.add_system(move |world: &mut World| {
        inner_schedule.run(world);
    });

    outer_schedule.run(&mut world);

    assert_eq!(world.resource::<SystemOrder>().0, vec![0]);
}
```
|
|
There are definitely some timing issues associated with the bug. I eventually added enough |
|
Are there any downsides to this PR? The debug asset server is very important for rendering development. The ability to iterate on shaders live saves a ton of time, given that bevy_render, bevy_pbr, etc. take a long time to compile. |
|
This looks like it removes reuse of a per-thread executor and instead creates a new thread executor every time a scope is used. That sounds like it has the potential for a performance regression, since a new thread executor has to be created for the duration of every scope. |
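For illustration only, here is a minimal sketch of the two strategies being compared; the names and the helper function are hypothetical, not the actual bevy_tasks internals:

```rust
use std::sync::Arc;
use async_executor::Executor;

thread_local! {
    // Reuse strategy: one executor per thread, shared by every scope on that thread.
    static REUSED_EXECUTOR: Arc<Executor<'static>> = Arc::new(Executor::new());
}

// Per-scope strategy (what this PR does conceptually): build a fresh executor
// each time a scope starts, and drop it when the scope ends.
fn executor_for_scope(reuse_thread_local: bool) -> Arc<Executor<'static>> {
    if reuse_thread_local {
        REUSED_EXECUTOR.with(Arc::clone)
    } else {
        Arc::new(Executor::new())
    }
}
```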
|
I won't have time to look into this issue more for at least a week. I do plan on investigating more when I do have time. But in case no one else figures anything out, I consider this change to be low risk. The change to reuse a thread-local executor instead of creating a new one was made during this release cycle in #7087, so this is basically just reverting that change. |
|
The following code can reproduce the deadlock.

```rust
use bevy_app::App;
use bevy_ecs::prelude::*;

fn run_sub_app(mut sub_app: NonSendMut<DebugApp>) {
    sub_app.app.update();
}

struct DebugApp {
    app: App,
}

fn main() {
    let mut app = bevy_app::App::new();
    let sub_app = bevy_app::App::new();

    app.insert_non_send_resource(DebugApp { app: sub_app });
    app.add_system(run_sub_app);
    app.update();
}
```
|
|
I think the reason is, |
|
I wondered how the LocalExecutor on the main thread gets ticked, until I saw the code.
There are two executors here with one running inside the other one. The exclusivity is per executor, and so running 2 systems that want exclusive access in different executors is allowed.
They're also ticked inside the bevy_tasks::scope which the multithreaded executor runs inside. |
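As a rough, self-contained illustration of that arrangement (the wiring here is simplified and is not the actual bevy_tasks/executor code): one of the futures running in a task-pool scope drives a separate executor, which is how tasks spawned on that executor make progress at all.

```rust
use async_executor::Executor;
use bevy_tasks::TaskPool;

fn main() {
    // A second executor, analogous to the main-thread executor described above.
    let thread_executor = Executor::new();
    let task = thread_executor.spawn(async { 7 });

    let pool = TaskPool::new();
    let results = pool.scope(|scope| {
        // The scope polls this future, and the future in turn ticks the second
        // executor until `task` completes.
        scope.spawn(thread_executor.run(task));
    });
    assert_eq!(results, vec![7]);
}
```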
|
The test code above is a little out of date. The inner schedule should have some apply_system_buffers in it. My investigations showed that the deadlock was happening during the startup schedule of the debug app. Typically it happened during the 2nd or 3rd apply_system_buffers in that schedule. The schedule doesn't otherwise have any systems in it. But even after trying to add some apply_system_buffers to the test, it doesn't deadlock (see the sketch below). |
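For reference, a hedged sketch of that variant, reusing make_function_system from the test above and assuming the 0.10-era apply_system_buffers system; per the comment, this still does not deadlock:

```rust
// Fragment meant to slot into the earlier test in place of the inner schedule.
let mut inner_schedule = Schedule::default();
inner_schedule.set_executor_kind(ExecutorKind::MultiThreaded);
inner_schedule.add_system(make_function_system(0));
// apply_system_buffers is an exclusive system (it needs &mut World), which is
// the kind of system the observed deadlock involves.
inner_schedule.add_system(apply_system_buffers);
```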
|
I have run into issues with integration tests of |
james7132 left a comment
Comparing this to the code prior to pipelined rendering's merge, this is indeed just a revert. If this fixes the usage of the debug asset server, we shouldn't ship 0.10 without this.
|
I think I found something interesting. The problem is actually that the async-executor entered a state where it cannot be triggered just by a spawn. Normally, when spawn_exclusive_system_task runs, it triggers/notifies the main thread:

```text
spawn thread: ThreadId(4) -> Executor { id: 9, active: 1, global_tasks: 0, local_runners: [], sleepers: 2 } 0 f: bevy_ecs::schedule::executor::multi_threaded::MultiThreadedExecutor::spawn_exclusive_system_task::{{closure}}
executor[9] notify wake 2 Waker { data: 0x600001da4850, vtable: 0x107eb9698 } now: "count:2 free_ids:[] wakers:1 <wakers: 1 Waker { data: 0x600001dd2fb0, vtable: 0x107eb9698 }>"
```

But when the problem happens, the spawn can't trigger the executor:

```text
spawn thread: ThreadId(4) -> Executor { id: 9, active: 1, global_tasks: 0, local_runners: [], sleepers: 2 } 0 f: bevy_ecs::schedule::executor::multi_threaded::MultiThreadedExecutor::spawn_exclusive_system_task::{{closure}}
executor[9] notify no effort
```

Why? That's something I am still figuring out. It appears the executor's

Actually, when the ticker/sleeper has just been notified and is running, this is not a problem. In our case, the main thread is already parked, and then this is a problem: the executor believes the ticker is running or has already been notified, so it won't notify it again. Then the main thread stays parked forever. |
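To make that failure mode concrete, here is a minimal, self-contained sketch of this kind of lost wakeup, using a plain parked thread and a "notified" flag; this is an assumed simplification, not async-executor's actual bookkeeping:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // "The ticker has already been notified" flag. The real executor keeps
    // similar state so it can skip redundant wakeups.
    let notified = Arc::new(AtomicBool::new(true)); // stale: set, but nobody consumed it
    let notified_for_ticker = Arc::clone(&notified);

    let ticker = thread::spawn(move || {
        // The ticker parks, expecting an unpark when new work is spawned.
        // The timeout is only here so the demo terminates; without it,
        // this thread would stay parked forever.
        thread::park_timeout(Duration::from_secs(1));
        notified_for_ticker.load(Ordering::SeqCst)
    });

    // A spawn arrives. Because the flag already reads `true`, the wakeup is
    // skipped -- the executor believes the ticker is running or already
    // notified, so the parked thread is never unparked.
    if !notified.swap(true, Ordering::SeqCst) {
        ticker.thread().unpark(); // never reached in this scenario
    }

    assert!(ticker.join().unwrap());
}
```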
|
These are the other executors' states. Only Executor [9]'s count is |
|
I think I found the root cause: there are two ThreadExecutors on the main thread.

```rust
// the thread-local one
thread_local! {
    static LOCAL_EXECUTOR: async_executor::LocalExecutor<'static> = async_executor::LocalExecutor::new("local executor");
    static THREAD_EXECUTOR: Arc<ThreadExecutor<'static>> = Arc::new(ThreadExecutor::new());
}

// and also the MainThreadExecutor resource
#[derive(Resource, Default, Clone)]
pub struct MainThreadExecutor(pub Arc<ThreadExecutor<'static>>);
```

And the troublesome task is created by this code:

```rust
external_ticker.tick().or(scope_ticker.tick()).await;
```
|
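As a rough illustration of that pattern (an assumed simplification using async_executor and futures_lite directly, not the actual bevy_tasks tickers): a single future keeps two executors alive by driving whichever one has work ready, so if the wakeup for either executor is lost while that future is asleep, both of them stall.

```rust
use async_executor::Executor;
use futures_lite::future;

fn main() {
    // Stand-ins for the external (main-thread) executor and the scope's executor.
    let external = Executor::new();
    let scoped = Executor::new();

    let task = scoped.spawn(async { 1 + 1 });

    // `run` ticks an executor while polling the given future, so nesting the
    // two calls drives both executors until `task` completes.
    let result = future::block_on(external.run(scoped.run(task)));
    assert_eq!(result, 2);
}
```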
|
Just opened #7825, check it out? |
|
Closing in favor of #7825. |
Objective

Solution

Create a new thread executor for every `scope` and remove the reused thread-local one. `cargo run --example load_gltf --features debug_asset_server` deadlocks without this PR, but works with it. But I'm unsure of the root cause of the deadlock, so this is not a guaranteed fix.

Changelog