Conversation
superdump left a comment:
I guess copying the data around the heap is cheaper than the parallelism overhead. Well, nice. :)
Ideally the HashMap would be closer to a Vec, so we could just append to the end of a larger Vec; batch memcpys are exceptionally fast due to being heavily vectorized. In fact, it may be better to send a Vec to the render world and construct the HashMap from it in the render world, to avoid blocking on extraction. Once we remove the need for entities, as the FIXME states, it should be even faster, since we would only need to construct the HashMap.
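A minimal std-only sketch of the idea above (the `u64` key and the `ExtractedMesh` type are hypothetical stand-ins for the real render-world types): extraction only pays for cheap batch appends into a flat Vec, and the HashMap is constructed in one pass afterwards, in the render world.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real per-mesh extracted data.
#[derive(Clone, Debug, PartialEq)]
struct ExtractedMesh {
    translation: [f32; 3],
}

// Extraction phase: merge per-thread queues with batch appends only.
// `Vec::append` moves the tail in bulk, which is essentially a memcpy.
fn extract(batches: Vec<Vec<(u64, ExtractedMesh)>>) -> Vec<(u64, ExtractedMesh)> {
    let total: usize = batches.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(total);
    for mut batch in batches {
        out.append(&mut batch);
    }
    out
}

// Render-world phase: build the lookup map in a single pass, off the
// blocking extract path.
fn build_render_map(extracted: Vec<(u64, ExtractedMesh)>) -> HashMap<u64, ExtractedMesh> {
    extracted.into_iter().collect()
}
```

The design point is that the serialized portion of extraction degenerates to a handful of `append` calls, while the hashing work moves to where it no longer blocks the main world.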
Now that the tracing overhead is almost all gone due to our span caching, I've noted that the overhead from parallelism is very low so long as we don't need to repeatedly park and unpark the task pool threads. As we parallelize more of the engine, there will be significantly less downtime as we increase CPU utilization. It may very well be worth it to copy more if we can go wide more readily.
#9950 has been merged, and the difference grows even larger: it's now closer to a 51.6% decrease in time spent. Though again, the bulk of it is still in collection into the final output Vec and HashMap. If we can remove either of those, it should be even faster.
This should be done for 2D meshes as well, and maybe also UI and sprites if possible. The only downside is that the parallelisation might add overhead that is negative for power consumption, i.e. if it takes more energy to do the extraction in parallel than in serial on one core.
# Objective

`extract_meshes` can easily be one of the most expensive operations in the blocking extract schedule for 3D apps. It also has no fundamentally serialized parts and can easily be run across multiple threads. Let's speed it up by parallelizing it!

## Solution

Use the `ThreadLocal<Cell<Vec<T>>>` approach utilized by bevyengine#7348 in conjunction with `Query::par_iter` to build a set of thread-local queues, and collect them after going wide.

## Performance

Using `cargo run --profile stress-test --features trace_tracy --example many_cubes`. Yellow is this PR, red is main. `extract_meshes`:

An average reduction from 1.2ms to 770µs is seen, a 41.6% improvement.

Note: this is still not including bevyengine#9950's changes, so this may actually result in even faster speedups once that's merged in.
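The shape of the solution can be sketched with std only (the PR itself uses the `thread_local` crate's `ThreadLocal<Cell<Vec<T>>>` together with Bevy's `Query::par_iter`; this simplified version uses scoped threads and explicit chunks, and the `x * 2` work item is a placeholder for real per-mesh extraction):

```rust
use std::thread;

// Sketch of the thread-local-queue pattern: each worker fills its own Vec
// with no synchronization, then the per-thread queues are collected into
// one output Vec after going wide. Only the final appends are serialized.
fn parallel_collect(items: &[u32], workers: usize) -> Vec<u32> {
    let chunk = (items.len() + workers.max(1) - 1) / workers.max(1);
    let locals: Vec<Vec<u32>> = thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk.max(1))
            .map(|c| s.spawn(move || c.iter().map(|x| x * 2).collect::<Vec<u32>>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Serial collection phase: batch appends are cheap bulk moves.
    let mut out = Vec::with_capacity(items.len());
    for mut local in locals {
        out.append(&mut local);
    }
    out
}
```

Because each thread owns its queue outright, the parallel phase needs no locks or atomics; the cost moves to the final collection, which is exactly the part the comments above discuss shrinking further.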
