Cache parallel iteration spans #9950
Merged: superdump merged 1 commit into bevyengine:main on Sep 30, 2023
Conversation
superdump approved these changes (Sep 29, 2023)
hymm approved these changes (Sep 30, 2023)
mockersf reviewed (Sep 30, 2023)
```rust
    count = len,
);
#[cfg(feature = "trace")]
let task = task.instrument(span);
```
mockersf (Member): Calling `instrument` is supposed to make the span work across async boundaries; is it not needed anymore?
PR author (Member): The task itself never yields to the async executor, so there is no async boundary to work across. It's best to avoid the overhead and just enter the span inside the task instead.
github-merge-queue bot pushed a commit that referenced this pull request (Oct 1, 2023):
# Objective

`extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it!

## Solution

Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by #7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide.

## Performance

Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR, red is main. For `extract_meshes` (Tracy capture not reproduced here), an average reduction from 1.2ms to 770us is seen, a 41.6% improvement. Note: this is still not including #9950's changes, so this may actually result in even faster speedups once that's merged in.
regnarock pushed a commit to regnarock/bevy that referenced this pull request (Oct 13, 2023)
regnarock pushed a commit to regnarock/bevy that referenced this pull request (Oct 13, 2023)
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request (Jan 9, 2024)
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request (Jan 9, 2024)
# Objective

We cached system spans in #9390, but other common spans seen in most Bevy apps when tracing is enabled are the `Query::par_iter(_mut)`-related spans.

## Solution

Cache them in `QueryState`. The one downside to this is that we pay the memory cost for every `Query(State)` instantiated, not just those used for parallel iteration, but this shouldn't be a significant cost unless the app is creating hundreds of thousands of `Query(State)`s regularly.
## Metrics

Tested against `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR, red is main. Tracy captures (images not reproduced here) compare `sync_simple_transforms`, `check_visibility`, and the full frame.