GeneralPurposeAllocator: Considerably improve worst case performance#17383
Merged
andrewrk merged 3 commits intoziglang:masterfrom Oct 3, 2023
Merged
GeneralPurposeAllocator: Considerably improve worst case performance#17383andrewrk merged 3 commits intoziglang:masterfrom
andrewrk merged 3 commits intoziglang:masterfrom
Conversation
Before this commit, GeneralPurposeAllocator could run into incredibly degraded performance in scenarios where the bucket count for a particular size class grew to be large. For example, if exactly `slot_count` allocations of a single size class were performed and then all of them were freed except one, then the bucket for those allocations would have to be kept around indefinitely. If that pattern of allocation were done over and over, then the bucket list for that size class could grow incredibly large. This allocation pattern has been seen in the wild: Vexu/arocc#508 (comment) In that case, the length of the bucket list for the `128` size class would grow to tens of thousands of buckets and cause Debug runtime to balloon to ~8 minutes whereas with the c_allocator the Debug runtime would be ~3 seconds. To address this, there are three different changes happening here: 1. std.Treap is used instead of a doubly linked list for the lists of buckets. This takes the time complexity of searchBucket [used in resize and free] from O(n) to O(log n), but increases the time complexity of insert from O(1) to O(log n) [before, all new buckets would get added to the head of the list]. Note: Any data structure with O(log n) or better search/insert/delete would also work for this use-case. 2. If the 'current' bucket for a size class is full, the list of buckets is never traversed and instead a new bucket is allocated. Previously, traversing the bucket list could only find a non-full bucket in specific circumstances, and only because of a separate optimization that is no longer needed (before, after any resize/free, the affected bucket would be moved to the head of the bucket list to allow searchBucket to perform better on average). Now, the current_bucket for each size class only changes when either (1) the current bucket is emptied/freed, or (2) a new bucket is allocated (due to the current bucket being full or null). Because each bucket's alloc_cursor only moves forward (i.e. slots within a bucket are never re-used), we can therefore always know that any bucket besides the current_bucket will be full, so traversing the list in the hopes of finding an existing non-full bucket is entirely pointless. 3. Size + alignment information for small allocations has been moved into the Bucket data instead of keeping it in a separate HashMap. This offers an improvement over the HashMap since whenever we need to get/modify the length/alignment of an allocation it's extremely likely we will already have calculated any bucket-related information necessary to get the data. The first change is the most relevant and accounts for most of the benefit here. Also note that the overall functionality of GeneralPurposeAllocator is unchanged. In the degraded `arocc` case, these changes bring Debug performance from ~8 minutes to ~20 seconds. Benchmark 1: test-master.bat Time (mean ± σ): 481.263 s ± 5.440 s [User: 479.159 s, System: 1.937 s] Range (min … max): 477.416 s … 485.109 s 2 runs Benchmark 2: test-optim-treap.bat Time (mean ± σ): 19.639 s ± 0.037 s [User: 18.183 s, System: 1.452 s] Range (min … max): 19.613 s … 19.665 s 2 runs Summary 'test-optim-treap.bat' ran 24.51 ± 0.28 times faster than 'test-master.bat' Note: Much of the time taken on Windows in this particular case is related to gathering stack traces. With `.stack_trace_frames = 0` the runtime goes down to 6.7 seconds, which is a little more than 2.5x slower compared to when the c_allocator is used. These changes may or mat not introduce a slight performance regression in the average case: Here's the standard library tests on Windows in Debug mode: Benchmark 1 (10 runs): std-tests-master.exe measurement mean ± σ min … max outliers delta wall_time 16.0s ± 30.8ms 15.9s … 16.1s 1 (10%) 0% peak_rss 42.8MB ± 8.24KB 42.8MB … 42.8MB 0 ( 0%) 0% Benchmark 2 (10 runs): std-tests-optim-treap.exe measurement mean ± σ min … max outliers delta wall_time 16.2s ± 37.6ms 16.1s … 16.3s 0 ( 0%) 💩+ 1.3% ± 0.2% peak_rss 42.8MB ± 5.18KB 42.8MB … 42.8MB 0 ( 0%) + 0.1% ± 0.0% And on Linux: Benchmark 1: ./test-master Time (mean ± σ): 16.091 s ± 0.088 s [User: 15.856 s, System: 0.453 s] Range (min … max): 15.870 s … 16.166 s 10 runs Benchmark 2: ./test-optim-treap Time (mean ± σ): 16.028 s ± 0.325 s [User: 15.755 s, System: 0.492 s] Range (min … max): 15.735 s … 16.709 s 10 runs Summary './test-optim-treap' ran 1.00 ± 0.02 times faster than './test-master'
42ae395 to
95f4c15
Compare
andrewrk
approved these changes
Oct 3, 2023
Member
andrewrk
left a comment
There was a problem hiding this comment.
Fantastic work!
- Because the GPA doesn't have an
initfunction and is directly instantiated instead, the memory pool can't usebacking_allocatorand instead always uses the page_allocator
I was already thinking before this that GPA should lose the feature of using backing_allocator and become hard-coded to always use OS calls directly. An allocator that wraps an existing one and tries to provide some kind of features on top of it could be interesting, but it's a bit of a different concern than what GPA aims to provide. So, this is a step in the right direction IMO.
squeek502
added a commit
to squeek502/zig
that referenced
this pull request
Oct 4, 2023
…rching the list Follow up to ziglang#17383. This is a minor optimization that only matters when a small allocation is resized/free'd soon after it is allocated. The only real difference I was able to observe with this was via a synthetic benchmark that allocates a full bucket and then frees all but one of the slots, over and over in a loop: Debug build: Benchmark 1 (9 runs): gpa-degen-master.exe measurement mean ± σ min … max outliers delta wall_time 575ms ± 5.19ms 569ms … 583ms 0 ( 0%) 0% peak_rss 43.8MB ± 1.37KB 43.8MB … 43.8MB 1 (11%) 0% Benchmark 2 (10 runs): gpa-degen-search-cur.exe measurement mean ± σ min … max outliers delta wall_time 532ms ± 5.55ms 520ms … 539ms 0 ( 0%) ⚡- 7.5% ± 0.9% peak_rss 43.8MB ± 65.2KB 43.8MB … 44.0MB 1 (10%) + 0.0% ± 0.1% ReleaseFast build: Benchmark 1 (129 runs): gpa-degen-master-release.exe measurement mean ± σ min … max outliers delta wall_time 38.9ms ± 1.12ms 36.7ms … 42.4ms 8 ( 6%) 0% peak_rss 23.2MB ± 2.39KB 23.2MB … 23.2MB 0 ( 0%) 0% Benchmark 2 (151 runs): gpa-degen-search-cur-release.exe measurement mean ± σ min … max outliers delta wall_time 33.2ms ± 999us 31.9ms … 36.3ms 20 (13%) ⚡- 14.7% ± 0.6% peak_rss 23.2MB ± 2.26KB 23.2MB … 23.2MB 0 ( 0%) + 0.0% ± 0.0%
andrewrk
pushed a commit
that referenced
this pull request
Oct 4, 2023
…rching the list Follow up to #17383. This is a minor optimization that only matters when a small allocation is resized/free'd soon after it is allocated. The only real difference I was able to observe with this was via a synthetic benchmark that allocates a full bucket and then frees all but one of the slots, over and over in a loop: Debug build: Benchmark 1 (9 runs): gpa-degen-master.exe measurement mean ± σ min … max outliers delta wall_time 575ms ± 5.19ms 569ms … 583ms 0 ( 0%) 0% peak_rss 43.8MB ± 1.37KB 43.8MB … 43.8MB 1 (11%) 0% Benchmark 2 (10 runs): gpa-degen-search-cur.exe measurement mean ± σ min … max outliers delta wall_time 532ms ± 5.55ms 520ms … 539ms 0 ( 0%) ⚡- 7.5% ± 0.9% peak_rss 43.8MB ± 65.2KB 43.8MB … 44.0MB 1 (10%) + 0.0% ± 0.1% ReleaseFast build: Benchmark 1 (129 runs): gpa-degen-master-release.exe measurement mean ± σ min … max outliers delta wall_time 38.9ms ± 1.12ms 36.7ms … 42.4ms 8 ( 6%) 0% peak_rss 23.2MB ± 2.39KB 23.2MB … 23.2MB 0 ( 0%) 0% Benchmark 2 (151 runs): gpa-degen-search-cur-release.exe measurement mean ± σ min … max outliers delta wall_time 33.2ms ± 999us 31.9ms … 36.3ms 20 (13%) ⚡- 14.7% ± 0.6% peak_rss 23.2MB ± 2.26KB 23.2MB … 23.2MB 0 ( 0%) + 0.0% ± 0.0%
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Before this PR, GeneralPurposeAllocator could run into incredibly degraded performance in scenarios where the bucket count for a particular size class grew to be large. For example, if exactly
slot_countallocations of a single size class were performed and then all of them were freed except one, then the bucket for those allocations would have to be kept around indefinitely. If that pattern of allocation were done over and over, then the bucket list for that size class could grow incredibly large, and to find a particular bucket, the entire (doubly linked) list would have to be scanned linearly.This allocation pattern has been seen in the wild: Vexu/arocc#508 (comment)
In that case, the length of the bucket list for the
128size class would grow to tens of thousands of buckets and cause Debug runtime to balloon to ~8 minutes whereas with the c_allocator the Debug runtime would be ~3 seconds.To address this, there are three different changes happening here:
std.Treapis used instead of a doubly linked list for the lists of buckets. This takes the time complexity ofsearchBucket[used in resize and free] fromO(n)toO(log n), but increases the time complexity of insert fromO(1)toO(log n)[before, all new buckets would get added to the head of the list]. This is still a huge win because search happens way more often than insertion of new buckets. Note: Any data structure withO(log n)or better search/insert/delete would also work for this use-case.searchBucketto perform better on average). Now, the current_bucket for each size class only changes when either (1) the current bucket is emptied/freed, or (2) a new bucket is allocated (due to the current bucket being full or null). Because each bucket'salloc_cursoronly moves forward (i.e. slots within a bucket are never re-used), we can therefore always know that any bucket besides the current_bucket will be full, so traversing the list in the hopes of finding an existing non-full bucket is entirely pointless.The first change is the most relevant and accounts for most of the benefit here. Also note that the overall functionality of
GeneralPurposeAllocatoris unchanged.In the degraded
arocccase, these changes bring Debug performance from ~8 minutes to ~20 seconds.Note: Much of the time taken on Windows in this particular case is related to gathering stack traces. With
.stack_trace_frames = 0the runtime goes down to 6.7 seconds, which is a little more than 2.5x slower compared to when the c_allocator is used.These changes may or mat not introduce a slight performance regression in the average case:
Here's the standard library tests on Windows in Debug mode:
And on Linux:
Here are some more benchmark results using a very targeted benchmark that intentionally only does worst-case allocation patterns:
Benchmark code
On Linux:
Debug:
ReleaseFast:
On Windows:
Debug:
ReleaseFast:
Various notes:
Treap.Nodes. This has two slightly weird things:initfunction and is directly instantiated instead, the memory pool can't usebacking_allocatorand instead always uses the page_allocatorNodes will always stay at the peak number ofNodes necessary, meaning that e.g. if a program needs 5000 buckets at one point, then all 5000 of those nodes will live for the rest of the program even if all memory in the buckets is freed (but those 5000 nodes will also be re-used whenever a new node is needed).std.Treap, butstd.Treapslightly outperformed it in my benchmarks and provides all the same benefits.