Conversation
Most of the expense involved in constructing new binary writers comes from allocating/initializing the buffers needed to encode Java's UTF-16 Strings to UTF-8. This change refactors the UTF-8 encoding logic into its own class (Utf8StringEncoder) and introduces a singleton Utf8StringEncoderPool that allows these encoders to be reused across instantiations of binary writers.
Codecov Report
@@ Coverage Diff @@
## master #369 +/- ##
============================================
+ Coverage 64.05% 64.08% +0.02%
- Complexity 4837 4844 +7
============================================
Files 136 138 +2
Lines 21108 21142 +34
Branches 3821 3822 +1
============================================
+ Hits 13521 13548 +27
- Misses 6251 6256 +5
- Partials 1336 1338 +2
Continue to review full report at Codecov.
|
| private static final byte VARINT_NEG_ZERO = (byte) 0xC0; | ||
|
|
||
| // See IonRawBinaryWriter#writeString(String) for usage information. | ||
| static final int SMALL_STRING_SIZE = 4 * 1024; |
There was a problem hiding this comment.
All of the code removed from this file (IonRawBinaryWriter.java) was moved to the new Utf8StringEncoder class.
| final Utf8StringEncoder utf8StringEncoder = Utf8StringEncoderPool | ||
| .getInstance() | ||
| .getOrCreateUtf8Encoder(); |
There was a problem hiding this comment.
Rather than allocating several new arrays for each binary writer we construct, we simply pull a Utf8StringEncoder from the pool.
| * @return A {@link Result} containing a byte array of UTF-8 bytes and encoded length. | ||
| * @throws IllegalArgumentException if the String cannot be encoded as UTF-8. | ||
| */ | ||
| public Result encode(String text) { |
There was a problem hiding this comment.
The encoding logic in this method was migrated without changes.
| patchBuffer.close(); | ||
| allocator.close(); | ||
| // We cannot use `utf8StringEncoder` again after returning it to the pool. | ||
| Utf8StringEncoderPool.getInstance().returnEncoderToPool(utf8StringEncoder); |
There was a problem hiding this comment.
When the writer is close()d, return the Utf8StringEncoder to the pool.
There was a problem hiding this comment.
Something to consider, is if the encoder knows the pool that it came from it can implement a close method and return itself. This way you could just have the caller inject it and avoid having the singleton of the pool be known in the implementation. It is marginally cleaner (e.g. allows for the ability to turn off the pool, if there is some weird multi-threaded thrashing issue), but given how internal all of this stuff is that may or may not be worthwhile.
Another thing to consider is to make the pool injected versus hard coded as a singleton--that would give the caller the flexibility to potentially no-op the "construction" and/or "return".
Again, minor considering how this code is used, but we have seen issues with global singletons and threaded applications sometimes require turning off these concurrent shared things.
There was a problem hiding this comment.
I agree with both of the considerations raised here.
tgregg
left a comment
There was a problem hiding this comment.
Nice. Can you also do a benchmark where instead of creating many writers that each write a single string, you create one writer that writes many strings? That way we can verify no impact to that use case.
| // A singleton instance. | ||
| private static final Utf8StringEncoderPool INSTANCE = new Utf8StringEncoderPool(); |
There was a problem hiding this comment.
Consider making Utf8StringEncoderPool an enum with a single value: INSTANCE.
There was a problem hiding this comment.
Huh! TIL. Will do.
| private static final Utf8StringEncoderPool INSTANCE = new Utf8StringEncoderPool(); | ||
|
|
||
| // A queue of previously initialized encoders that can be loaned out. | ||
| ArrayBlockingQueue<Utf8StringEncoder> bufferQueue; |
almann
left a comment
There was a problem hiding this comment.
Minor points/question below.
| patchBuffer.close(); | ||
| allocator.close(); | ||
| // We cannot use `utf8StringEncoder` again after returning it to the pool. | ||
| Utf8StringEncoderPool.getInstance().returnEncoderToPool(utf8StringEncoder); |
There was a problem hiding this comment.
Something to consider, is if the encoder knows the pool that it came from it can implement a close method and return itself. This way you could just have the caller inject it and avoid having the singleton of the pool be known in the implementation. It is marginally cleaner (e.g. allows for the ability to turn off the pool, if there is some weird multi-threaded thrashing issue), but given how internal all of this stuff is that may or may not be worthwhile.
Another thing to consider is to make the pool injected versus hard coded as a singleton--that would give the caller the flexibility to potentially no-op the "construction" and/or "return".
Again, minor considering how this code is used, but we have seen issues with global singletons and threaded applications sometimes require turning off these concurrent shared things.
| // The maximum number of Utf8Encoders that can be waiting in the queue before new ones will be discarded. | ||
| private static final int MAX_QUEUE_SIZE = 32; |
There was a problem hiding this comment.
Out of curiosity, was this just a small number that seemed reasonable or did you get this number from something?
There was a problem hiding this comment.
Just a number that seemed reasonable. In the worst case, an application that had more than 32 binary writers in existence at the same time would be allocating fresh Utf8StringEncoders for the surplus, which seemed low stakes.
That said, raising the ceiling is pretty cheap, though. Each Utf8StringEncoder is something like ~20KB on the heap and the queue only allocates them as needed, so I might bump this up to 128 while I'm making tweaks.
| patchBuffer.close(); | ||
| allocator.close(); | ||
| // We cannot use `utf8StringEncoder` again after returning it to the pool. | ||
| Utf8StringEncoderPool.getInstance().returnEncoderToPool(utf8StringEncoder); |
There was a problem hiding this comment.
I agree with both of the considerations raised here.
| /** | ||
| * A thread-safe shared pool of {@link Utf8StringEncoder}s that can be used for UTF8 encoding and decoding. | ||
| */ | ||
| public class Utf8StringEncoderPool { |
There was a problem hiding this comment.
Why not make this generic? Nothing about this class seems to be specific to the UTF-8 encoder use case (assuming that you drop the singleton pattern, but even then you can have a singleton UTF-8 encoder pool that is an instantiation of a generic pool).
There was a problem hiding this comment.
This pattern may be useful for more cases than just UTF-8 string encoding- do we have any other object pools in ion-java?
There was a problem hiding this comment.
This pattern may be useful for more cases than just UTF-8 string encoding- do we have any other object pools in ion-java?
There's the PooledBlockAllocatorProvider, but I believe that's it. (Depending on your perspective, the RecyclingStack might qualify?) At any rate, I plan to tackle #370 next, at which point we'll definitely have an opportunity for reuse.
Why not make this generic? Nothing about this class seems to be specific to the UTF-8 encoder use case (assuming that you drop the singleton pattern, but even then you can have a singleton UTF-8 encoder pool that is an instantiation of a generic pool).
Agreed. I'll do this as part of the PR for #370 if you don't mind.
|
Note: the binary reader also keeps per-instance buffers for string decoding, and could probably benefit similarly from using a pool. https://github.com/amzn/ion-java/blob/master/src/com/amazon/ion/impl/IonReaderBinaryRawX.java#L118 I created #370 for this. |
|
@tgregg said:
Test data: a text Ion file with 10,000 repetitions of the top level string
Before (v1.8.2) After (this branch) |
1ebb53d
Most of the expense involved in constructing new binary writers
comes from allocating/initializing the buffers needed to encode
Java's UTF-16 Strings to UTF-8.
This change refactors the UTF-8 encoding logic into its own
class (Utf8StringEncoder) and introduces a singleton
Utf8StringEncoderPool that allows these encoders to be reused
across instantiations of binary writers.
Benchmark
This test initializes a new binary writer, writes a small string (
"foo"), then closes the writer in a tight loop. The source can be found here.Before
After
Following this change, the benchmark:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.