
Eliminates the use of ConcurrentLinkedQueue.size() in PooledBlockAllocator, improving performance when the queue gets large. #389

Merged
tgregg merged 1 commit into master from block-pool-size on Oct 13, 2021
Conversation

@tgregg (Contributor) commented Oct 11, 2021

Issue #, if available:
Fixes #371

Description of changes:
When the size of the binary writer's block pool queue gets large, ConcurrentLinkedQueue.size() starts dominating profiles because it's not a constant time operation.

One way of fixing this is to use a concurrent queue implementation that does have a constant-time size(). Two such implementations are ArrayBlockingQueue and LinkedBlockingQueue. Unlike ConcurrentLinkedQueue, which is a lock-free, non-blocking implementation, both BlockingQueue implementations use locks. In addition, ArrayBlockingQueue is fixed-size and requires allocating a backing array of that size up front. I tried both implementations and found that neither performed as well as the proposed solution, which retains ConcurrentLinkedQueue.

This proposed solution simply tracks the approximate size of the queue externally using an AtomicInteger (which is also lock-free).
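The mechanism can be sketched as follows. This is a minimal illustration of the strategy described above, not the actual ion-java source; the class and member names (BlockPool, freeBlocks, blockLimit) are stand-ins:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: track the approximate queue size externally with an
// AtomicInteger so that ConcurrentLinkedQueue.size() (an O(n) traversal)
// is never called.
class BlockPool {
    private final ConcurrentLinkedQueue<byte[]> freeBlocks = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger(0); // approximate pool size
    private final int blockLimit;
    private final int blockSize;

    BlockPool(int blockLimit, int blockSize) {
        this.blockLimit = blockLimit;
        this.blockSize = blockSize;
    }

    byte[] allocate() {
        byte[] block = freeBlocks.poll();
        if (block != null) {
            size.decrementAndGet();
            return block;
        }
        return new byte[blockSize]; // pool empty: allocate a fresh block
    }

    void free(byte[] block) {
        // Optimistically increment: a single atomic operation on the common
        // (pool-not-full) path. Deliberate race: concurrent callers may all
        // observe the pool as full and drop blocks that could have fit.
        if (size.incrementAndGet() <= blockLimit) {
            freeBlocks.add(block);
        } else {
            size.decrementAndGet(); // pool full: drop the block
        }
    }
}
```

On the common path, free() costs one atomic increment; the compensating decrement only runs in the uncommon full-pool case.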

Let's talk about race conditions

First, the existing implementation has a race condition. The freeBlocks.size() < blockLimit condition could be satisfied by multiple threads before the following freeBlocks.add, resulting in the pool growing beyond its capacity under high contention. This isn't a big deal; keeping an extra block or two around for what is likely a short amount of time isn't going to cause a major headache.
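The pre-change check-then-add pattern can be sketched like this (again with illustrative names, not the actual source). Note that ConcurrentLinkedQueue.size() traverses the entire queue, which is the O(n) cost this change removes:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch of the pre-change behavior: check-then-add on the
// queue itself. The check and the add are not atomic, so under contention
// several threads can pass the size check before any add() lands, letting
// the pool briefly exceed blockLimit.
class CheckThenAddPool {
    final ConcurrentLinkedQueue<byte[]> freeBlocks = new ConcurrentLinkedQueue<>();
    final int blockLimit;

    CheckThenAddPool(int blockLimit) {
        this.blockLimit = blockLimit;
    }

    void free(byte[] block) {
        if (freeBlocks.size() < blockLimit) { // O(n) traversal of the queue
            freeBlocks.add(block);
        }
        // else: pool full, drop the block
    }
}
```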

The proposal actually solves that race condition by atomically incrementing the size before adding the block. However, because the size is optimistically incremented, there is a race condition in the uncommon case where the pool ends up being full. Looking at the proposed diff, multiple threads could get to line 71 before the "first" one completes it. In this case, a few blocks that could have fit in the pool would get dropped. They'd be re-allocated if the pool ever needed to grow to that size again.

We could make this change such that it has the same race condition behavior as the existing solution; namely, that it may allow the pool to exceed capacity rather than unnecessarily freeing blocks. I like the proposed behavior slightly better because it's more conservative with heap size and it only requires one operation (increment) on the common path (pool not full) instead of two (check then increment). However, I'm open to other opinions.
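For comparison, the variant that preserves the old race behavior would check the external counter before incrementing it. A sketch under the same illustrative naming:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the alternative: keep the external AtomicInteger
// but check before incrementing. This mirrors the old race behavior (the
// pool may exceed capacity under contention rather than dropping blocks),
// at the cost of two operations on the common path instead of one.
class CheckThenIncrementPool {
    final ConcurrentLinkedQueue<byte[]> freeBlocks = new ConcurrentLinkedQueue<>();
    final AtomicInteger size = new AtomicInteger(0);
    final int blockLimit;

    CheckThenIncrementPool(int blockLimit) {
        this.blockLimit = blockLimit;
    }

    void free(byte[] block) {
        if (size.get() < blockLimit) {   // operation 1: check
            size.incrementAndGet();      // operation 2: increment
            freeBlocks.add(block);       // two threads can both pass the check
        }
    }
}
```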

Performance

I tested a variety of different conditions to make sure there wouldn't be unintended side effects. For the sake of brevity, I'm only including the results for the case that targets a large queue size under high contention, because it illustrates the benefits of the solution. The full results for all of the conditions I tried can be found here.

For the following test, I made a temporary modification to ion-java-benchmark-cli to write the same binary Ion stream with 10 different threads, 10 times each. I used the --ion-writer-block-size option to reduce the block size to 1K from the default 32K, resulting in an increase in the number of blocks in the pool under high contention. Here's the ion-java-benchmark-cli command:

ion-java-benchmark write --io-type buffer --format ion_binary --iterations 2 --warmups 2 --ion-writer-block-size 1024 log.ion

Before:

Benchmark                                   (input)                                          (options)  Mode  Cnt           Score   Error   Units
Bench.run                                   log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2       39735.502           ms/op
Bench.run:Heap usage                        log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2        1557.834              MB
Bench.run:Serialized size                   log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          23.545              MB
Bench.run:·gc.alloc.rate                    log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          ≈ 10⁻⁴          MB/sec
Bench.run:·gc.alloc.rate.norm               log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2        9128.000            B/op
Bench.run:·gc.churn.PS_Eden_Space           log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2         174.076          MB/sec
Bench.run:·gc.churn.PS_Eden_Space.norm      log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2  7050337912.000            B/op
Bench.run:·gc.churn.PS_Survivor_Space       log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          18.090          MB/sec
Bench.run:·gc.churn.PS_Survivor_Space.norm  log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2   747206192.000            B/op
Bench.run:·gc.count                         log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          30.000          counts
Bench.run:·gc.time                          log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2        2186.000              ms

After:

Benchmark                                   (input)                                          (options)  Mode  Cnt           Score   Error   Units
Bench.run                                   log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2       11345.630           ms/op
Bench.run:Heap usage                        log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2        1579.797              MB
Bench.run:Serialized size                   log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          23.545              MB
Bench.run:·gc.alloc.rate                    log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2           0.001          MB/sec
Bench.run:·gc.alloc.rate.norm               log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2        9140.000            B/op
Bench.run:·gc.churn.PS_Eden_Space           log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2         632.363          MB/sec
Bench.run:·gc.churn.PS_Eden_Space.norm      log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2  7792570168.000            B/op
Bench.run:·gc.churn.PS_Old_Gen              log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          49.544          MB/sec
Bench.run:·gc.churn.PS_Old_Gen.norm         log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2   681872076.000            B/op
Bench.run:·gc.churn.PS_Survivor_Space       log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          73.401          MB/sec
Bench.run:·gc.churn.PS_Survivor_Space.norm  log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2   942625592.000            B/op
Bench.run:·gc.count                         log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2          34.000          counts
Bench.run:·gc.time                          log.ion  write::{f:ION_BINARY,t:BUFFER,a:STREAMING,b:1024}    ss    2        4155.000              ms

That's a 71% improvement (39.735s -> 11.345s).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov bot commented Oct 11, 2021

Codecov Report

Merging #389 (88f2c95) into master (2518f31) will decrease coverage by 0.01%.
The diff coverage is 50.00%.

❗ Current head 88f2c95 differs from pull request most recent head 9a74a9d. Consider uploading reports for the commit 9a74a9d to get more accurate results

@@             Coverage Diff              @@
##             master     #389      +/-   ##
============================================
- Coverage     66.36%   66.35%   -0.02%     
+ Complexity     5362     5359       -3     
============================================
  Files           154      154              
  Lines         22619    22622       +3     
  Branches       4083     4083              
============================================
- Hits          15011    15010       -1     
- Misses         6249     6251       +2     
- Partials       1359     1361       +2     
Impacted Files Coverage Δ
...zon/ion/impl/bin/PooledBlockAllocatorProvider.java 81.81% <50.00%> (-1.52%) ⬇️
src/com/amazon/ion/impl/BlockedBuffer.java 50.97% <0.00%> (-0.37%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2518f31...9a74a9d.

tgregg changed the title from "Eliminates the use of ConcurrentLinkedQueue.size() is PooledBlockAllocator, improving performance when the queue gets large." to "Eliminates the use of ConcurrentLinkedQueue.size() in PooledBlockAllocator, improving performance when the queue gets large." on Oct 11, 2021
jobarr-amzn
jobarr-amzn previously approved these changes Oct 12, 2021
@zslayton (Contributor) commented:

> The proposal actually solves that race condition by atomically incrementing the size before adding the block. However, because the size is optimistically incremented, there is a race condition in the uncommon case where the pool ends up being full. Looking at the proposed diff, multiple threads could get to line 71 before the "first" one completes it. In this case, a few blocks that could have fit in the pool would get dropped. They'd be re-allocated if the pool ever needed to grow to that size again.
>
> We could make this change such that it has the same race condition behavior as the existing solution; namely, that it may allow the pool to exceed capacity rather than unnecessarily freeing blocks. I like the proposed behavior slightly better because it's more conservative with heap size and it only requires one operation (increment) on the common path (pool not full) instead of two (check then increment). However, I'm open to other opinions.

This trade-off seems fine to me. Could you add a comment that says "there's a race condition here that we allow deliberately as an optimization" so no one tries to fix it without performance testing down the road?

zslayton
zslayton previously approved these changes Oct 12, 2021
@tgregg tgregg dismissed stale reviews from zslayton and jobarr-amzn via 9a74a9d October 12, 2021 22:10
@tgregg (Contributor, Author) commented Oct 12, 2021

@zslayton Done.



Development

Successfully merging this pull request may close these issues.

Switch PooledBlockAllocatorProvider's ConcurrentLinkedQueue to an ArrayBlockingQueue

3 participants