These two getters are not cheap. Caching the previous handle and result allows skipping pricey lookups when the same handle is requested multiple times in a row. This makes the getters disappear from synthetic benchmarks such as bevymark, making it a pretty large win there, although in a more realistic setup the wins will likely be more nuanced. In some ways it's a form of batching, but at the handle-querying level. I don't know whether this would be better done at a higher level in the Assets logic directly or via some sort of generic helper; I suspect that other parts of bevy may benefit from similar tricks.
ExtractedSprite is a rather large struct, large enough to be moved via memcpy. Pushing into a vector by default forces the value to exist on the stack before it is moved into the vector's storage. This is because Vec::push may have to attempt a memory allocation, which could panic before the data is written into the storage. The initialization of the extracted sprite cannot be reordered across the potentially panicking operation, preventing rustc/llvm from initializing the ExtractedSprite struct directly into the vector's storage. Fortunately there is a very simple crate to work around this: copyless. It splits the push operation into the allocation and the write, so that the payload can be initialized after the allocation. This lets llvm optimize away the move. With this change, ExtractedSprite is initialized directly into the vector's storage and the memcpy disappears from profiles.
Author: It appears that both optimizations were obsoleted by recent changes.
Weasy666 added a commit to Weasy666/bevy_svg that referenced this pull request on Jan 9, 2022.
Details here: bevyengine/bevy#3590
Objective
Speed up `extract_sprite` on sprite-heavy workloads.
Solution
I stumbled upon some optimization discussions focusing on the cost of `extract_sprite` in sprite-heavy workloads such as bevymark. It got me curious, so I had a look at bevymark specifically on Linux using perf. Here is the base profile (bevymark): https://share.firefox.dev/3f160Wb
`Assets::get`
Note that since it was recorded with perf, threads are sampled only while they are running. This means that if a thread calls a single function for a millisecond and then sleeps for several seconds, that function will be assigned 100% (of the active time) even though it ran for only a small portion of wall-clock time. Keep that in mind to avoid getting confused when interpreting the results.
In that profile you can see `extract_sprite` occupying about 14% of the active time, so it's a visible slice of CPU time. Inside `extract_sprite`, a bit more than half of the time is spent in `Assets::get`, which is visibly not cheap. The first commit in this PR addresses this by caching the previous atlas and sprite query and reusing it when multiple consecutive queries operate on the same handle. On a synthetic benchmark such as bevymark this pretty much makes `Assets::get` disappear from the profile and halves the cost of `extract_sprite`, since we probably end up executing the expensive getter only once per frame. In a real game the wins would likely not be as dramatic; however, since reusing the same textures is important for rendering performance, I expect consecutive accesses to the same handle to be frequent. I'd like to have a more real-world sprite-heavy workload to validate this assumption.
I think that this optimization could be lifted into a generic helper in the assets code, and it could benefit other areas where consecutive accesses to the same handles are known to be commonplace.
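To make the caching idea concrete, here is a minimal std-only sketch of "remember the last handle and its result in front of an expensive lookup". The `Assets`, `CachedGet`, and `u64` handle types below are simplified stand-ins for illustration, not Bevy's real types:

```rust
use std::collections::HashMap;

// Simplified stand-in for an asset store with an expensive getter.
struct Assets {
    map: HashMap<u64, String>,
}

impl Assets {
    fn get(&self, handle: u64) -> Option<&String> {
        // Imagine this lookup being costly (hashing, indirection, ...).
        self.map.get(&handle)
    }
}

// Wraps the store and caches the last (handle, result) pair, so that
// consecutive queries for the same handle skip the real lookup.
struct CachedGet<'a> {
    assets: &'a Assets,
    last: Option<(u64, Option<&'a String>)>,
}

impl<'a> CachedGet<'a> {
    fn get(&mut self, handle: u64) -> Option<&'a String> {
        match self.last {
            // Fast path: same handle as the previous call, reuse the result.
            Some((h, r)) if h == handle => r,
            // Slow path: do the real lookup and remember it.
            _ => {
                let r = self.assets.get(handle);
                self.last = Some((handle, r));
                r
            }
        }
    }
}
```

In an extraction loop over sprites sorted (or clustered) by texture, the fast path hits on almost every iteration, which is what makes the getter vanish from the profile on bevymark.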
Here is a profile after optimizing away the getters: https://share.firefox.dev/3zCUNEF
`__memcpy*`
Another visible slice of CPU time is spent moving memory via `__memcpy_avx_unaligned_erms` in `Vec::push`. I have seen this pattern a lot in various rust projects. It happens when pushing large structs (here `ExtractedSprite`) into vectors or other allocating containers. The code looks like `vector.push(Structure { .. })`, with the structure initialized directly as an argument of `push`, and what we would want is for the struct to be initialized directly into the vector's storage. However, `push` may first have to allocate, which can panic, and operations can't be reordered across potentially panicking ones (the reality is probably more nuanced than that, but it's a pretty good mental model). So the struct is first initialized on the stack, then the vector ensures it has room for it, and then a memcpy is used to move the data. Calling memcpy ends up being expensive enough to show up as the majority of the remaining CPU time spent in `extract_sprite`.

Thankfully there is a very simple and effective workaround for this: the `copyless` crate. It provides helper methods on `Vec` that separate allocating a slot in the vector from writing data into it. `vector.push(Structure { .. })` becomes `vector.alloc().init(Structure { .. })`, which, while looking very similar, has the particularity of letting us initialize the struct after ensuring its spot is allocated. This lets llvm reliably optimize away the move. Sorry, this is a lot of chatter for a small thing, but the trick is very useful and can help in many areas of a typical code base. I've seen a few other places in the profile where this would help.
Note that it's not useful when pushing large values that already have to exist on the stack for other reasons. It's also not necessary for very small values moved with memcpy.
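The mechanics of the split are easy to show with only std. The sketch below reimplements the idea behind `copyless` (whose actual API is `vec.alloc().init(value)`) as a hypothetical free function, so you can see where the allocation and the write happen:

```rust
// Std-only sketch of the copyless idea: perform the potentially panicking
// allocation first, then construct the value directly into the vector's
// storage, so no stack copy plus memcpy is needed.
fn push_in_place<T>(v: &mut Vec<T>, make: impl FnOnce() -> T) {
    // Step 1: the allocation (and any panic) happens here, before the
    // value exists, so the compiler is free to build it in place.
    v.reserve(1);
    unsafe {
        let len = v.len();
        // Step 2: build the value and write it straight into the
        // reserved slot. If `make` panics, `set_len` is never called
        // and the vector stays consistent.
        std::ptr::write(v.as_mut_ptr().add(len), make());
        v.set_len(len + 1);
    }
}
```

With this shape, `sprites.push(ExtractedSprite { .. })` would become `push_in_place(&mut sprites, || ExtractedSprite { .. })`, which is the same transformation `copyless` performs via its `alloc().init(..)` methods.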
I think that the remaining costs in `extract_sprite`, like clearing the vector when items have a `Drop` impl, were already discussed and have PRs open or on the way from other people.