Adds a pool of UTF8 String encoders by zslayton · Pull Request #369 · amazon-ion/ion-java

zslayton · 2021-06-23T15:04:19Z

Most of the expense involved in constructing new binary writers
comes from allocating/initializing the buffers needed to encode
Java's UTF-16 Strings to UTF-8.

This change refactors the UTF-8 encoding logic into its own
class (Utf8StringEncoder) and introduces a singleton
Utf8StringEncoderPool that allows these encoders to be reused
across instantiations of binary writers.

Benchmark

In a tight loop, this test initializes a new binary writer, writes a small string ("foo"), and then closes the writer. The source can be found here.

Before

Benchmark                             Score           Error   Units
time                                 37.664 ±         3.661   ms/op
·gc.alloc.rate                      773.095 ±         3.882  MB/sec
·gc.alloc.rate.norm           438323910.400 ±    366718.519    B/op
·gc.churn.G1_Eden_Space             739.958 ±       238.206  MB/sec
·gc.churn.G1_Eden_Space.norm  419640115.200 ± 135676093.820    B/op
·gc.churn.G1_Old_Gen                  0.017 ±         0.033  MB/sec
·gc.churn.G1_Old_Gen.norm          9425.600 ±     18474.404    B/op
·gc.count                            20.000                  counts
·gc.time                             13.000                      ms

After

Benchmark                             Score           Error   Units
time                                 17.788 ±         3.631   ms/op
·gc.alloc.rate                       43.899 ±         0.850  MB/sec
·gc.alloc.rate.norm            24007104.000 ±    489259.500    B/op
·gc.churn.G1_Eden_Space              57.248 ±       182.466  MB/sec
·gc.churn.G1_Eden_Space.norm   31457280.000 ± 100263003.914    B/op
·gc.count                             2.000                  counts
·gc.time                              4.000                      ms

Following this change, the benchmark:

Took 52.77% less time to run.
Reduced its allocation rate from ~773MB/sec to ~44MB/sec (94.32% less)
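
For anyone checking the summary figures, both percentages follow from the Score columns of the Before and After tables. A minimal sketch of the arithmetic; the class and method names are illustrative and not part of this PR:

```java
import java.util.Locale;

// Illustrative helper; not part of ion-java.
final class BenchmarkDelta {
    // Percentage by which `after` is lower than `before`.
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the Before/After tables in this description.
        System.out.println(String.format(Locale.ROOT, "time: %.2f%% less",
                percentReduction(37.664, 17.788)));
        System.out.println(String.format(Locale.ROOT, "gc.alloc.rate: %.2f%% less",
                percentReduction(773.095, 43.899)));
    }
}
```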

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov · 2021-06-23T15:08:06Z

Codecov Report

Merging #369 (1ebb53d) into master (ca85095) will increase coverage by 0.02%.
The diff coverage is 94.73%.

@@             Coverage Diff              @@
##             master     #369      +/-   ##
============================================
+ Coverage     64.05%   64.08%   +0.02%     
- Complexity     4837     4844       +7     
============================================
  Files           136      138       +2     
  Lines         21108    21142      +34     
  Branches       3821     3822       +1     
============================================
+ Hits          13521    13548      +27     
- Misses         6251     6256       +5     
- Partials       1336     1338       +2

Impacted Files	Coverage Δ
...om/amazon/ion/impl/bin/utf8/Utf8StringEncoder.java	`91.89% <91.89%> (ø)`
...rc/com/amazon/ion/impl/bin/IonRawBinaryWriter.java	`91.32% <100.00%> (-0.20%)`	⬇️
...mazon/ion/impl/bin/utf8/Utf8StringEncoderPool.java	`100.00% <100.00%> (ø)`
src/com/amazon/ion/impl/BlockedBuffer.java	`50.72% <0.00%> (-0.49%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

zslayton · 2021-06-23T15:06:31Z

src/com/amazon/ion/impl/bin/IonRawBinaryWriter.java

    private static final byte VARINT_NEG_ZERO   = (byte) 0xC0;

-    // See IonRawBinaryWriter#writeString(String) for usage information.
-    static final int SMALL_STRING_SIZE = 4 * 1024;


All of the code removed from this file (IonRawBinaryWriter.java) was moved to the new Utf8StringEncoder class.

zslayton · 2021-06-23T15:07:39Z

src/com/amazon/ion/impl/bin/IonRawBinaryWriter.java

+    final Utf8StringEncoder utf8StringEncoder = Utf8StringEncoderPool
+            .getInstance()
+            .getOrCreateUtf8Encoder();


Rather than allocating several new arrays for each binary writer we construct, we simply pull a Utf8StringEncoder from the pool.
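
As a rough illustration of the borrow/return idea (not the code in this PR), a pool like this can be sketched with an ArrayBlockingQueue. SketchEncoder and SketchEncoderPool are hypothetical stand-ins; the 4 KB buffer and the cap of 32 are assumptions borrowed from constants quoted elsewhere in this review:

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical stand-in for Utf8StringEncoder: owns a buffer that is costly to allocate.
final class SketchEncoder {
    final byte[] buffer = new byte[4 * 1024]; // assumed size, for illustration only
}

// Hypothetical stand-in for Utf8StringEncoderPool.
final class SketchEncoderPool {
    private static final int MAX_QUEUE_SIZE = 32;
    private final ArrayBlockingQueue<SketchEncoder> queue =
            new ArrayBlockingQueue<>(MAX_QUEUE_SIZE);

    // Reuse an idle encoder if one is waiting; otherwise allocate a new one.
    SketchEncoder getOrCreate() {
        SketchEncoder encoder = queue.poll(); // non-blocking; returns null when empty
        return encoder != null ? encoder : new SketchEncoder();
    }

    // Hand the encoder back. If the queue is already full, offer() returns false
    // and the encoder is simply left for the garbage collector.
    void returnToPool(SketchEncoder encoder) {
        queue.offer(encoder);
    }
}
```

Because poll() and offer() never block, a writer is never stalled waiting on the pool; the worst case is one extra allocation.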

zslayton · 2021-06-23T15:11:20Z

src/com/amazon/ion/impl/bin/utf8/Utf8StringEncoder.java

+     * @return  A {@link Result} containing a byte array of UTF-8 bytes and encoded length.
+     * @throws IllegalArgumentException if the String cannot be encoded as UTF-8.
+     */
+    public Result encode(String text) {


The encoding logic in this method was migrated without changes.
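
For readers unfamiliar with why an encoder instance is worth reusing, here is a hedged sketch of encoding into a reusable buffer with java.nio. It is not the migrated logic itself: ReusableUtf8Encoder, its Result type, and the 4 KB buffer size are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the real Utf8StringEncoder may be structured differently.
final class ReusableUtf8Encoder {
    private static final int REUSABLE_BUFFER_SIZE = 4 * 1024; // assumed size in bytes

    // Allocated once and reused by every call to encode().
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private final ByteBuffer reusableBuffer = ByteBuffer.allocate(REUSABLE_BUFFER_SIZE);

    // The backing array plus the number of valid bytes at its start.
    static final class Result {
        final byte[] bytes;
        final int length;
        Result(byte[] bytes, int length) { this.bytes = bytes; this.length = length; }
    }

    Result encode(String text) {
        // One Java char needs at most 3 UTF-8 bytes (a surrogate pair is 2 chars -> 4 bytes).
        int worstCase = text.length() * 3;
        ByteBuffer out = worstCase <= REUSABLE_BUFFER_SIZE
                ? reusableBuffer
                : ByteBuffer.allocate(worstCase); // one-off buffer for large strings
        out.clear();
        encoder.reset();
        CoderResult result = encoder.encode(CharBuffer.wrap(text), out, true);
        if (result.isError()) {
            // For example, an unpaired surrogate in the input.
            throw new IllegalArgumentException("Cannot encode text as UTF-8: " + result);
        }
        encoder.flush(out);
        return new Result(out.array(), out.position());
    }
}
```

Note that the returned array is shared between calls in this sketch, so a caller has to consume the bytes before encoding the next string.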

zslayton · 2021-06-23T15:20:44Z

src/com/amazon/ion/impl/bin/IonRawBinaryWriter.java

            patchBuffer.close();
            allocator.close();
+            // We cannot use `utf8StringEncoder` again after returning it to the pool.
+            Utf8StringEncoderPool.getInstance().returnEncoderToPool(utf8StringEncoder);


When the writer is close()d, return the Utf8StringEncoder to the pool.

tgregg

Nice. Can you also do a benchmark where instead of creating many writers that each write a single string, you create one writer that writes many strings? That way we can verify no impact to that use case.

tgregg · 2021-06-23T17:53:46Z

src/com/amazon/ion/impl/bin/utf8/Utf8StringEncoderPool.java

+    // A singleton instance.
+    private static final Utf8StringEncoderPool INSTANCE = new Utf8StringEncoderPool();


Consider making Utf8StringEncoderPool an enum with a single value: INSTANCE.

Huh! TIL. Will do.
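
For context, the single-value enum idiom looks roughly like the sketch below. StringBuilderPool is a hypothetical example chosen to keep the snippet self-contained; it is not the class in this PR.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical example: a singleton pool of StringBuilders expressed as an enum.
enum StringBuilderPool {
    INSTANCE; // the only value, created once when the enum class is initialized

    private final ArrayBlockingQueue<StringBuilder> queue = new ArrayBlockingQueue<>(32);

    StringBuilder getOrCreate() {
        StringBuilder sb = queue.poll();
        return sb != null ? sb : new StringBuilder();
    }

    void returnToPool(StringBuilder sb) {
        sb.setLength(0); // clear leftover contents before sharing it again
        queue.offer(sb);
    }
}
```

Callers write StringBuilderPool.INSTANCE.getOrCreate(), so no getInstance() method or private constructor is needed, and the language itself prevents a second instance from being created.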

tgregg · 2021-06-23T17:54:48Z

src/com/amazon/ion/impl/bin/utf8/Utf8StringEncoderPool.java

+    private static final Utf8StringEncoderPool INSTANCE = new Utf8StringEncoderPool();
+
+    // A queue of previously initialized encoders that can be loaned out.
+    ArrayBlockingQueue<Utf8StringEncoder> bufferQueue;


private final ?

Good catch!

almann

Minor points/question below.

almann · 2021-06-23T16:27:15Z

src/com/amazon/ion/impl/bin/IonRawBinaryWriter.java

            patchBuffer.close();
            allocator.close();
+            // We cannot use `utf8StringEncoder` again after returning it to the pool.
+            Utf8StringEncoderPool.getInstance().returnEncoderToPool(utf8StringEncoder);


Something to consider: if the encoder knows the pool that it came from, it can implement a close method and return itself. The caller could then inject it, so the implementation would not need to know about the pool singleton. It is marginally cleaner (e.g. it allows turning off the pool if there is some weird multi-threaded thrashing issue), but given how internal all of this stuff is, that may or may not be worthwhile.

Another thing to consider is to make the pool injected rather than hard-coded as a singleton. That would give the caller the flexibility to potentially no-op the "construction" and/or "return".

Again, this is minor considering how this code is used, but we have seen issues with global singletons, and threaded applications sometimes need to turn off these shared concurrent structures.
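
To make the two suggestions concrete, here is a rough sketch of a self-returning encoder combined with an injected pool. Every name in it (EncoderPool, PooledEncoder, QueueBackedPool, NoOpPool, SketchWriter) is hypothetical and only meant to show the shape of the idea:

```java
import java.io.Closeable;
import java.util.concurrent.ArrayBlockingQueue;

// All names here are hypothetical; this is not the ion-java API.
interface EncoderPool {
    PooledEncoder getOrCreate();
    void giveBack(PooledEncoder encoder);
}

// An encoder that remembers its pool and returns itself when closed.
final class PooledEncoder implements Closeable {
    private final EncoderPool owner;
    PooledEncoder(EncoderPool owner) { this.owner = owner; }
    @Override public void close() { owner.giveBack(this); }
}

// A pool that actually recycles encoders.
final class QueueBackedPool implements EncoderPool {
    private final ArrayBlockingQueue<PooledEncoder> queue = new ArrayBlockingQueue<>(32);
    public PooledEncoder getOrCreate() {
        PooledEncoder encoder = queue.poll();
        return encoder != null ? encoder : new PooledEncoder(this);
    }
    public void giveBack(PooledEncoder encoder) { queue.offer(encoder); }
}

// A "pool" that never shares anything, for callers that want pooling turned off.
final class NoOpPool implements EncoderPool {
    public PooledEncoder getOrCreate() { return new PooledEncoder(this); }
    public void giveBack(PooledEncoder encoder) { /* discard */ }
}

// Stand-in for a writer: the pool is injected rather than looked up via a singleton.
final class SketchWriter implements Closeable {
    private final PooledEncoder encoder;
    SketchWriter(EncoderPool pool) { this.encoder = pool.getOrCreate(); }
    PooledEncoder encoder() { return encoder; }
    @Override public void close() { encoder.close(); }
}
```

With this shape the writer never names a singleton, and swapping QueueBackedPool for NoOpPool disables sharing without touching the writer. A real version would also need to guard against close() being called twice, which in this sketch would put the same encoder into the queue twice.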

almann · 2021-06-23T17:30:07Z

src/com/amazon/ion/impl/bin/utf8/Utf8StringEncoderPool.java

+    // The maximum number of Utf8Encoders that can be waiting in the queue before new ones will be discarded.
+    private static final int MAX_QUEUE_SIZE = 32;


Out of curiosity, was this just a small number that seemed reasonable or did you get this number from something?

Just a number that seemed reasonable. In the worst case, an application that had more than 32 binary writers in existence at the same time would be allocating fresh Utf8StringEncoders for the surplus, which seemed low stakes.

That said, raising the ceiling is pretty cheap. Each Utf8StringEncoder is roughly 20KB on the heap and the queue only allocates them as needed, so I might bump this up to 128 while I'm making tweaks.
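
As a back-of-the-envelope check on that cost, the worst case is a full queue of idle encoders. A small sketch, taking the approximate 20KB figure above at face value (the class and method names are illustrative):

```java
// Illustrative arithmetic only; the per-encoder size is the rough estimate quoted above.
final class PoolFootprint {
    // Worst-case bytes held by a completely full pool of idle encoders.
    static long maxIdleBytes(int maxQueueSize, long bytesPerEncoder) {
        return maxQueueSize * bytesPerEncoder;
    }

    public static void main(String[] args) {
        long perEncoder = 20 * 1024; // "~20KB", approximate
        System.out.println(maxIdleBytes(32, perEncoder) / 1024 + " KB at a cap of 32");
        System.out.println(maxIdleBytes(128, perEncoder) / 1024 + " KB at a cap of 128");
    }
}
```

Under that assumption the ceiling goes from about 640 KB to about 2.5 MB of idle encoders, and only if that many writers were ever open at once.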

jobarr-amzn · 2021-06-23T18:14:04Z

src/com/amazon/ion/impl/bin/IonRawBinaryWriter.java

            patchBuffer.close();
            allocator.close();
+            // We cannot use `utf8StringEncoder` again after returning it to the pool.
+            Utf8StringEncoderPool.getInstance().returnEncoderToPool(utf8StringEncoder);


I agree with both of the considerations raised here.

jobarr-amzn · 2021-06-23T18:34:29Z

src/com/amazon/ion/impl/bin/utf8/Utf8StringEncoderPool.java

+/**
+ * A thread-safe shared pool of {@link Utf8StringEncoder}s that can be used for UTF8 encoding and decoding.
+ */
+public class Utf8StringEncoderPool {


Why not make this generic? Nothing about this class seems to be specific to the UTF-8 encoder use case (assuming that you drop the singleton pattern, but even then you can have a singleton UTF-8 encoder pool that is an instantiation of a generic pool).

This pattern may be useful for more cases than just UTF-8 string encoding. Do we have any other object pools in ion-java?

> This pattern may be useful for more cases than just UTF-8 string encoding. Do we have any other object pools in ion-java?

There's the PooledBlockAllocatorProvider, but I believe that's it. (Depending on your perspective, the RecyclingStack might qualify?) At any rate, I plan to tackle #370 next, at which point we'll definitely have an opportunity for reuse.

> Why not make this generic? Nothing about this class seems to be specific to the UTF-8 encoder use case (assuming that you drop the singleton pattern, but even then you can have a singleton UTF-8 encoder pool that is an instantiation of a generic pool).

Agreed. I'll do this as part of the PR for #370 if you don't mind.
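
One possible shape for such a generic pool, sketched under the assumption that Java 8's Supplier is available to the project; BoundedPool and its methods are hypothetical names, not something that exists in ion-java:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

// Hypothetical generic pool; the name and API are illustrative only.
final class BoundedPool<T> {
    private final ArrayBlockingQueue<T> queue;
    private final Supplier<T> factory;

    BoundedPool(int maxIdle, Supplier<T> factory) {
        this.queue = new ArrayBlockingQueue<>(maxIdle);
        this.factory = factory;
    }

    // Reuse an idle item if there is one; otherwise ask the factory for a new one.
    T getOrCreate() {
        T item = queue.poll();
        return item != null ? item : factory.get();
    }

    // Returns true if the item was retained, false if the pool was full and it was dropped.
    boolean giveBack(T item) {
        return queue.offer(item);
    }
}
```

A UTF-8 encoder pool and a decoding-buffer pool would then just be two instantiations with different factories.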

tgregg · 2021-06-23T18:55:49Z

Note: the binary reader also keeps per-instance buffers for string decoding, and could probably benefit similarly from using a pool. https://github.com/amzn/ion-java/blob/master/src/com/amazon/ion/impl/IonReaderBinaryRawX.java#L118

I created #370 for this.

zslayton · 2021-06-23T19:35:24Z

@tgregg said:

Can you also do a benchmark where instead of creating many writers that each write a single string, you create one writer that writes many strings? That way we can verify no impact to that use case.

Test data: a text Ion file with 10,000 repetitions of the top level string "brevity is the soul of wit" created with:

yes '"brevity is the soul of wit"' | head -n 10000 > /tmp/brevity.ion

ion-java-benchmark-cli command:

java -jar ion-java-benchmark-cli_VERSION.jar write --forks 2 /tmp/brevity.ion

Before (v1.8.2)

Benchmark                 Score      Error   Units
time                      2.418 ±    0.588   ms/op
:Heap usage               4.114 ±    0.005      MB
:Serialized size          0.280                 MB
:·gc.alloc.rate           0.106 ±    0.005  MB/sec
:·gc.alloc.rate.norm  56423.200 ± 2853.160    B/op
:·gc.count                  ≈ 0             counts

After (this branch)

Benchmark                 Score      Error   Units
time                      2.426 ±    0.458   ms/op
:Heap usage               4.156 ±    0.005      MB
:Serialized size          0.280                 MB
:·gc.alloc.rate           0.028 ±    0.005  MB/sec
:·gc.alloc.rate.norm  15007.200 ± 2810.004    B/op
:·gc.count                  ≈ 0             counts

Fixed typo.

1612642

tgregg reviewed Jun 23, 2021

almann previously approved these changes Jun 23, 2021

marcbowes previously approved these changes Jun 23, 2021

jobarr-amzn previously approved these changes Jun 23, 2021

enum singleton, pool size 32->128, close()

1ebb53d

tgregg mentioned this pull request Jun 23, 2021

Use a memory pool for binary reader string decoding buffers #370

Closed

tgregg previously approved these changes Jun 23, 2021

zslayton dismissed stale reviews from tgregg, jobarr-amzn, marcbowes, and almann via 1ebb53d June 23, 2021 19:38

tgregg approved these changes Jun 23, 2021

zslayton merged commit 4132c76 into master Jun 23, 2021

zslayton deleted the utf8-encoder-pool branch June 23, 2021 21:21

jobarr-amzn mentioned this pull request Aug 3, 2021

Adds a binary IonReader implementation capable of incremental reads. #355

Merged

linlin-s mentioned this pull request Aug 10, 2021

Adds GitHub workflow to detect the performance regression of ion-java automatically. #373

Merged
