
[Linq] Lower LargeArrayBuilder's ResizeLimit #14738

Merged
VSadov merged 1 commit into dotnet:master from jamesqo:smaller-resize-limit on Jan 26, 2017

Conversation

@jamesqo (Contributor) commented Dec 28, 2016

Background: LargeArrayBuilder has a "resize limit" after which it stops resizing its buffer and switches to using chunked arrays. I initially chose this limit to be 32 elements, because I didn't want it to be too small in case sizeof(T) was small, but also didn't want it to be too large in case sizeof(T) was large.

Description: Since writing #14020, I've come to realize that the 90% use case for Linq is with business objects / classes, and not with low-level stuff like bytes. Since reference types are 4/8 bytes wide, it makes sense to optimize for T being wider. This PR lowers the resize limit of LargeArrayBuilder from 32 -> 8 elements.
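
For readers unfamiliar with the builder, here is a minimal sketch of the resize-limit idea. This is not the actual LargeArrayBuilder source; the starting capacity, the chunk-growth policy, and all type/member names are illustrative assumptions only.

```csharp
// Minimal sketch of the resize-limit idea; NOT the real LargeArrayBuilder.
// StartingCapacity, the chunk-growth policy, and all names here are illustrative.
using System;
using System.Collections.Generic;

internal sealed class SimpleLargeArrayBuilder<T>
{
    private const int StartingCapacity = 4;
    private const int ResizeLimit = 8; // this PR lowers the real limit from 32 to 8

    private T[] _first = new T[StartingCapacity];          // single buffer, grown by doubling until ResizeLimit
    private readonly List<T[]> _chunks = new List<T[]>();  // completed buffers once the limit is reached
    private T[] _current;                                   // buffer currently being written to
    private int _index;                                     // next free slot in _current
    private int _count;                                     // total items added

    public SimpleLargeArrayBuilder() { _current = _first; }

    public void Add(T item)
    {
        if (_index == _current.Length)
        {
            if (_chunks.Count == 0 && _current.Length < ResizeLimit)
            {
                // Below the limit: resize (copy) the single buffer, discarding the old one.
                Array.Resize(ref _current, _current.Length * 2);
                _first = _current;
            }
            else
            {
                // At or past the limit: keep the full buffer as a chunk and start a new,
                // larger one. No copying of previously added elements happens here.
                _chunks.Add(_current);
                _current = new T[Math.Min(_count, 8192)];
                _index = 0;
            }
        }

        _current[_index++] = item;
        _count++;
    }

    public T[] ToArray()
    {
        // Fast path: the items exactly fill the single buffer, so it can be returned directly.
        // (Sizes that previously hit this path, such as 16 and 32, lose it with the lower limit.)
        if (_chunks.Count == 0 && _count == _first.Length)
            return _first;

        // Otherwise copy every chunk plus the partially filled current buffer into a fresh array.
        var result = new T[_count];
        int offset = 0;
        foreach (T[] chunk in _chunks)
        {
            Array.Copy(chunk, 0, result, offset, chunk.Length);
            offset += chunk.Length;
        }
        Array.Copy(_current, 0, result, offset, _index);
        return result;
    }
}
```

In this simplified picture, a limit of 32 keeps every size up to 32 on the single resizable buffer, while lowering it to 8 caps the discarded intermediate buffers at 8 elements, at the cost of a final copy for sizes like 16 and 32 that previously matched the buffer exactly.
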

Performance improvements: see the gist. For reference types, there is a decrease of ~5 GCs for sizes 9-15, and ~10 GCs for sizes 17-31 and 33+. Sizes 16 and 32 regress, which is expected, since at those sizes the array is now chunked and the builder can't return it directly.

cc @stephentoub, @JonHanna, @VSadov

jamesqo changed the title from "Lower LargeArrayBuilder's ResizeLimit" to "[Linq] Lower LargeArrayBuilder's ResizeLimit" on Dec 28, 2016
@VSadov (Member) commented Dec 28, 2016

I wonder if the number of GCs is a meaningful metric of performance.

It seems that the overall GC count would depend on the size of the machine, the GC implementation/mode, or its learning heuristics. Some collections could be cheaper than others. Bigger is not necessarily worse.

CPU time or bytes allocated would be more meaningful.

@karelz (Member) commented Jan 10, 2017

@jamesqo any comment on the above feedback? (I agree with @VSadov here)
Do you plan to measure allocated bytes instead?

@jamesqo (Contributor, Author) commented Jan 11, 2017

@VSadov, @karelz Sorry for the delayed reply. I will try to measure other stats for this when I have the time. It will be tricky to measure CPU time because there are so many virtual method calls here that they completely blot out the overhead of the extra allocations, even for larger enumerables. As for bytes allocated, the only way I am aware of to do that is this new API, so I will have to adjust my benchmark app later to build against the corefx MyGet feed, which has that API.
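
(The link behind "this new API" was lost in the copy and is left as-is. Purely as a hedged illustration of measuring allocated bytes, the sketch below uses GC.GetAllocatedBytesForCurrentThread(), assuming that API is available in the build under test; the AllocationProbe/Benchmark names and the workload are placeholders, not the PR's actual gist.)

```csharp
// Hedged sketch of measuring allocated bytes around a benchmark body.
// Assumes GC.GetAllocatedBytesForCurrentThread() is available in the build under test;
// AllocationProbe/Benchmark and the workload are placeholders, not the PR's actual gist.
using System;
using System.Linq;

internal static class AllocationProbe
{
    private static void Benchmark()
    {
        // Placeholder workload: the Where breaks the known count, so ToArray cannot
        // preallocate exactly and has to grow a builder instead.
        object[] result = Enumerable.Range(0, 31)
                                    .Where(i => i >= 0)
                                    .Select(i => (object)i)
                                    .ToArray();
        GC.KeepAlive(result);
    }

    private static void Main()
    {
        Benchmark(); // warm-up so one-time (JIT/startup) allocations are excluded

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < 10000; i++)
        {
            Benchmark();
        }
        long after = GC.GetAllocatedBytesForCurrentThread();

        Console.WriteLine($"Allocated: {after - before:N0} bytes over 10000 iterations");
    }
}
```

Running such a probe once with the resize limit at 32 and once at 8 would give the same before/after allocated-bytes comparison that is eventually done with PerfView below.
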

@karelz (Member) commented Jan 11, 2017

there are so many virtual method calls here that they completely blot out the overhead of extra allocations, even for larger enumerables

If you don't expect CPU time to show a measurable difference, then focus on a metric which does show it -- allocated bytes in this case.
The key point is that we should have a reasonable metric which shows some improvement on real-world-ish cases. Otherwise we really can't say that the change makes things better, just different :)

As for bytes allocated the only way I am aware of how to do that is this new API

Did you try using PerfView? It has great video tutorials.
You might want to check whether xunit-performance measures allocated bytes / GC frequency by default (I remember we had those plans two years ago, but I am not sure if it is implemented yet). Or maybe BenchmarkDotNet supports it? (disclaimer: I never tried it myself)

cc: @brianrob @vancem @valenis for additional perf guidance on allocated bytes measurements

@vancem commented Jan 11, 2017

@jamesqo, doing an ad-hoc performance analysis with PerfView is very easy.

There are videos at https://channel9.msdn.com/Series/PerfView-Tutorial, and the very first one shows that you are just a couple of clicks away from having the tool. From there, if you run

PerfView collect

while your benchmark runs, it will collect both CPU stacks and samples of GC allocation stacks for every 100K of GC bytes allocated. Thus you can quickly get a read on how much allocation happened over the whole benchmark (or any sub-part of it). Running this with and without your change will show the effect you have had on both allocation and CPU time.

@jamesqo (Contributor, Author) commented Jan 14, 2017

@vancem Thank you for the advice, very useful. I actually have used PerfView before for measuring CPU time, but wasn't aware that it could be used to measure allocated bytes as well.

@VSadov Using PerfView, I have just made a comparison between setting the resize limit to 8 and setting it to 32, with the same gist I posted above.

32 elements: (control screenshot)

8 elements: (experimental screenshot)

There are ~6MB fewer allocated bytes (out of 26MB, so roughly 20%) when changing the limit to 8. Note that this is somewhat offset because we are also allocating additional object[][]s to hold the arrays themselves, but that only adds ~1MB, and the net difference is still ~5MB.

@jamesqo (Contributor, Author) commented Jan 20, 2017

@VSadov Maybe you could merge this when you have time? I have posted perf results.

@OmarTawfik (Contributor) commented
LGTM. @VSadov?

@karelz (Member) commented Jan 25, 2017

@VSadov can we merge? We got 2 reviews ...

@VSadov (Member) commented Jan 26, 2017

LGTM

VSadov merged commit 64bd0a4 into dotnet:master on Jan 26, 2017
jamesqo deleted the smaller-resize-limit branch on January 26, 2017 22:17
karelz modified the milestone: 2.0.0 on Feb 4, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
[Linq] Lower LargeArrayBuilder's ResizeLimit

Commit migrated from dotnet/corefx@64bd0a4