Cache parallel iteration spans #9950
Merged: superdump merged 1 commit into bevyengine:main on Sep 30, 2023
Conversation
superdump approved these changes (Sep 29, 2023)
hymm approved these changes (Sep 30, 2023)
mockersf reviewed (Sep 30, 2023)
```rust
    count = len,
);
#[cfg(feature = "trace")]
let task = task.instrument(span);
```
mockersf (Member): Calling `instrument` is supposed to make the span work across async boundaries; is it not needed anymore?
PR author (Member): The task itself never yields to the async executor, so there is no async boundary to work across. It's best to avoid the overhead and just enter the span inside the task instead.
github-merge-queue bot pushed a commit that referenced this pull request (Oct 1, 2023):
# Objective

`extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it!

## Solution

Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by #7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide.

## Performance

Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR, red is main. For `extract_meshes` (Tracy capture not reproduced here), an average reduction from 1.2ms to 770us is seen, a 41.6% improvement. Note: this is still not including #9950's changes, so this may actually result in even faster speedups once that's merged in.
regnarock pushed a commit to regnarock/bevy that referenced this pull request (Oct 13, 2023)
regnarock pushed a commit to regnarock/bevy that referenced this pull request (Oct 13, 2023)
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request (Jan 9, 2024)
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request (Jan 9, 2024)
# Objective

We cached system spans in #9390, but other common spans seen in most Bevy apps when tracing is enabled are the `Query::par_iter(_mut)`-related spans.

## Solution

Cache them in `QueryState`. The one downside to this is that we pay the memory cost for every `Query(State)` instantiated, not just those used for parallel iteration, but this shouldn't be a significant cost unless the app is creating hundreds of thousands of `Query(State)`s regularly.
## Metrics

Tested against `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR, red is main. Tracy captures (images not reproduced here) compare `sync_simple_transforms`, `check_visibility`, and the full frame.