These two getters are not cheap. Caching the previous handle and result allows skipping pricey lookups when the same handle is requested multiple times in a row. This makes the getters disappear from synthetic benchmarks such as bevymark, making it a pretty large win there, although in a more realistic setup the wins will likely be more nuanced. In some ways it's a form of batching, but at the handle-querying level. I don't know whether this would be better done at a higher level in the Assets logic directly or via some sort of generic helper; I suspect that other parts of bevy may benefit from similar tricks.
ExtractedSprite is a rather large struct, large enough to be moved via memcpy. Pushing into a vector by default forces the value to exist on the stack before it is moved into the vector's storage. This is because Vec::push may have to attempt a memory allocation, which could panic before the data is written into the storage. The initialization of the extracted sprite cannot be reordered across the potentially panicking operation, preventing rustc/llvm from initializing the ExtractedSprite struct directly into the vector's storage. Fortunately there is a very simple crate to work around this: copyless. It splits the push operation into the allocation and the write, so that the payload can be initialized after the allocation. This lets llvm optimize away the move. With this change, ExtractedSprite is initialized directly into the vector's storage and the memcpy disappears from profiles.
Author: It appears that both optimizations were obsoleted by recent changes.
Weasy666 added a commit to Weasy666/bevy_svg that referenced this pull request on Jan 9, 2022.
Details here: bevyengine/bevy#3590
Objective
Speed up `extract_sprite` on sprite-heavy workloads.
Solution
I stumbled upon some optimization discussions focusing on the cost of `extract_sprite` in sprite-heavy workloads such as bevymark. It got me curious, so I had a look at bevymark specifically on Linux using perf. Here is the base profile (bevymark): https://share.firefox.dev/3f160Wb
`Assets::get`
Note that since it was recorded with perf, threads are sampled only while they are running. This means that if a thread calls a single function for a millisecond and then sleeps for several seconds, that function will be assigned 100% (of the active time) even though it ran for only a small portion of wall-clock time. Keep that in mind to avoid getting confused when interpreting the results.
In that profile you can see `extract_sprite` occupying about 14% of the active time, so it's a visible slice of CPU time. Inside `extract_sprite`, a bit more than half of the time is spent in `Assets::get`, which is visibly not cheap. The first commit in this PR addresses this by caching the previous atlas and sprite query and reusing it when multiple consecutive queries operate on the same handle. On a synthetic benchmark such as bevymark this pretty much makes `Assets::get` disappear from the profile and halves the cost of `extract_sprite`, since we probably end up executing the expensive getter only once per frame. In a real game the wins would likely not be as dramatic; however, since reusing the same textures is important for rendering performance, I expect consecutive accesses to the same handle to be frequent. I'd like to have a more real-world sprite-heavy workload to validate this assumption.
I think that this optimization could be lifted into a generic helper in the assets code, and it could benefit other areas where consecutive accesses to the same handles are known to be commonplace.
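To make the caching idea concrete, here is a minimal std-only sketch of "remember the last handle and its result in front of an expensive lookup". The `Assets`, `CachedGet`, and `u64` handle types below are simplified stand-ins for illustration, not Bevy's real types:

```rust
use std::collections::HashMap;

// Simplified stand-in for an asset store with an expensive getter.
struct Assets {
    map: HashMap<u64, String>,
}

impl Assets {
    fn get(&self, handle: u64) -> Option<&String> {
        // Imagine this lookup being costly (hashing, indirection, ...).
        self.map.get(&handle)
    }
}

// Wraps the store and caches the last (handle, result) pair, so that
// consecutive queries for the same handle skip the real lookup.
struct CachedGet<'a> {
    assets: &'a Assets,
    last: Option<(u64, Option<&'a String>)>,
}

impl<'a> CachedGet<'a> {
    fn get(&mut self, handle: u64) -> Option<&'a String> {
        match self.last {
            // Fast path: same handle as the previous call, reuse the result.
            Some((h, r)) if h == handle => r,
            // Slow path: do the real lookup and remember it.
            _ => {
                let r = self.assets.get(handle);
                self.last = Some((handle, r));
                r
            }
        }
    }
}
```

In an extraction loop over sprites sorted (or clustered) by texture, the fast path hits on almost every iteration, which is what makes the getter vanish from the profile on bevymark.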
Here is a profile after optimizing away the getters: https://share.firefox.dev/3zCUNEF
`__memcpy*`
Another visible slice of CPU time is spent moving memory via `__memcpy_avx_unaligned_erms` in `Vec::push`. I have seen this pattern a lot in various rust projects. It happens when pushing large structs (here `ExtractedSprite`) into vectors or other allocating containers. The code looks like `vector.push(Structure { .. })`, with the structure initialized directly as an argument of `push`, and what we would want is for the struct to be initialized directly into the vector's storage. However, `push` may first have to allocate, which can panic, and operations can't be reordered across potentially panicking ones (the reality is probably more nuanced than that, but it's a pretty good mental model). So the struct is first initialized on the stack, then the vector ensures it has room for it, and then a memcpy is used to move the data. Calling memcpy ends up being expensive enough to show up as the majority of the remaining CPU time spent in `extract_sprite`.

Thankfully there is a very simple and effective workaround for this: the `copyless` crate. It provides helper methods on `Vec` that separate allocating a slot in the vector from writing data into it. `vector.push(Structure { .. })` becomes `vector.alloc().init(Structure { .. })`, which, while looking very similar, has the particularity of letting us initialize the struct after ensuring its spot is allocated. This lets llvm reliably optimize away the move. Sorry, this is a lot of chatter for a small thing, but the trick is very useful and can help in many areas of a typical code base. I've seen a few other places in the profile where this would help.
Note that it's not useful when pushing large values that already have to exist on the stack for other reasons. It's also not necessary for very small values moved with memcpy.
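The mechanics of the split are easy to show with only std. The sketch below reimplements the idea behind `copyless` (whose actual API is `vec.alloc().init(value)`) as a hypothetical free function, so you can see where the allocation and the write happen:

```rust
// Std-only sketch of the copyless idea: perform the potentially panicking
// allocation first, then construct the value directly into the vector's
// storage, so no stack copy plus memcpy is needed.
fn push_in_place<T>(v: &mut Vec<T>, make: impl FnOnce() -> T) {
    // Step 1: the allocation (and any panic) happens here, before the
    // value exists, so the compiler is free to build it in place.
    v.reserve(1);
    unsafe {
        let len = v.len();
        // Step 2: build the value and write it straight into the
        // reserved slot. If `make` panics, `set_len` is never called
        // and the vector stays consistent.
        std::ptr::write(v.as_mut_ptr().add(len), make());
        v.set_len(len + 1);
    }
}
```

With this shape, `sprites.push(ExtractedSprite { .. })` would become `push_in_place(&mut sprites, || ExtractedSprite { .. })`, which is the same transformation `copyless` performs via its `alloc().init(..)` methods.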
I think that the remaining costs in `extract_sprite`, like clearing the vector when items have a `Drop` impl, were already discussed and have PRs open or on the way from other people.