Adds a binary IonReader implementation capable of incremental reads.#355
Conversation
Codecov Report
@@ Coverage Diff @@
## master #355 +/- ##
============================================
+ Coverage 64.09% 66.25% +2.16%
- Complexity 4845 5329 +484
============================================
Files 138 146 +8
Lines 21142 22545 +1403
Branches 3822 4072 +250
============================================
+ Hits 13550 14937 +1387
+ Misses 6255 6250 -5
- Partials 1337 1358 +21
Continue to review full report at Codecov.
zslayton
left a comment
As discussed offline, I'm going to be reviewing this in chunks. These comments are related to the ResizingPipedInputStream class, which was the first file in the recommended reading order.
It can be inferred from reading the code, but the meaning of 'size' isn't obvious in this comment alone.
Consider changing 'size' to a doclink to the size() method; its doc comment explains the meaning in detail.
buffer is not visible to the user, right? That's an implementation detail. So this:
/**
* Moves all buffered (but not yet read) bytes to `destinationBuffer`. In total, {@link #size()}
* bytes will be moved.
 */
@jobarr-amzn Did part of your comment get cut off? You are correct that buffer is not visible to the user, but this is a private method.
 * the boundary must be manually extended (see {@link #extendBoundary(int)}) to make these bytes
 * available. When false, all buffered bytes will be available to read.
 */
private final boolean useBoundary;
Based solely on a read through of this class, it's not obvious to me what the use case for this feature is. If there's a high-level example use case you could add to this comment, that would be helpful.
The best example is IonReaderLookaheadBuffer, which fills the ResizingPipedInputStream in large chunks, but only wants to make the bytes available for consumption from ResizingPipedInputStream after reading the bytes to determine that they are part of the current top-level Ion value.
I don't really want to refer to IonReaderLookaheadBuffer (or even Ion) in ResizingPipedInputStream since this is a standalone class, so I hesitate to add this example in the comment. For now I'm actually content with the existing comment: "This can be used to buffer arbitrarily-sized chunks of bytes without making them available for consumption."
To help me understand the constraints we're working with: why is this behavior in ResizingPipedInputStream rather than a wrapper?
@jobarr-amzn There isn't really a constraint that requires it to be implemented this way (that I can think of). A wrapper adds an extra layer (it would have to override most of the methods in this class to handle the boundary logic), but the boundary logic itself is pretty simple.
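For what it's worth, the boundary behavior under discussion can be illustrated with a tiny hypothetical buffer. This is a sketch, not the real ResizingPipedInputStream; only `extendBoundary` mirrors a name from the class under review, and `write`, `available`, and `read` are made-up stand-ins:

```java
import java.io.ByteArrayOutputStream;

// Minimal sketch of the "boundary" idea: bytes written to the buffer are not
// readable until the boundary is explicitly extended past them.
class BoundedBuffer {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int boundary = 0;   // index one past the last readable byte
    private int readIndex = 0;  // index of the next byte to read

    void write(byte[] bytes) {
        buffer.write(bytes, 0, bytes.length); // buffered, but not yet readable
    }

    void extendBoundary(int numberOfBytes) {
        boundary += numberOfBytes; // expose the next numberOfBytes for reading
    }

    int available() {
        return boundary - readIndex; // only bytes behind the boundary count
    }

    int read() {
        if (readIndex >= boundary) {
            return -1; // nothing readable yet, even if more bytes are buffered
        }
        return buffer.toByteArray()[readIndex++] & 0xFF;
    }
}
```

This matches the quoted javadoc: a filler can buffer an arbitrary chunk, inspect it, and only then extend the boundary to make it consumable.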
zslayton
left a comment
These comments are related to the IonReaderLookaheadBuffer. This class is well written and I appreciate the lengths you went to to document each method and keep it readable.
There's a lot of complexity around saving and testing state that aims to avoid ever having to re-process an input byte if moreDataRequired() returns true. For example, if the buffered data ends in the middle of an annotation ID, the lookahead buffer can resume parsing the rest of the annotation ID on the next call.
I would wager that for most typical use cases the "insufficient data" case is comparatively rare, and that most calls to next() will be satisfied by the buffer. I suspect that a lookahead buffer implementation that was willing to re-parse bytes in the (uncommon) insufficient data case would not only be much simpler, it would probably perform better in general because it wouldn't need to save/test state for each byte.
Does saving/testing state byte-by-byte buy us other functionality that I've glossed over?
This is an interesting suggestion that I've now had time to think about for a while. I agree that "insufficient data" happens much less frequently than "sufficient data". However, I wasn't able to find much complexity in the code that I'd be able to eliminate by re-parsing bytes in that case.
I'm happy to continue discussing this if you can see something I've missed, though, in case we can find a technique that significantly simplifies the code and provides better performance.
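For context, here is a minimal illustration (not the PR's code) of the kind of per-byte resumable state being discussed, applied to Ion's VarUInt encoding, where each octet carries 7 bits of magnitude and the final octet has its high bit set. The class name and `feed` method are hypothetical:

```java
// Illustrative sketch: a VarUInt decoder that saves its partial state so it
// can resume when the input ends mid-field, instead of re-parsing from the
// start of the field on the next call.
class ResumableVarUInt {
    private long value = 0;      // partial magnitude accumulated so far
    private boolean done = false;

    // Feed bytes starting at `offset`; returns the number of bytes consumed.
    int feed(byte[] input, int offset, int length) {
        int i = offset;
        while (i < offset + length && !done) {
            int b = input[i++] & 0xFF;
            value = (value << 7) | (b & 0x7F); // append the low 7 bits
            done = (b & 0x80) != 0;            // high bit marks the last octet
        }
        return i - offset;
    }

    boolean moreDataRequired() {
        return !done;
    }

    long value() {
        if (!done) throw new IllegalStateException("incomplete VarUInt");
        return value;
    }
}
```

The re-parsing alternative under discussion would drop the saved `value`/`done` fields and simply restart from the field's first byte whenever more data arrives.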
jobarr-amzn
left a comment
Shipping a few comments, will be reviewing incrementally. Most of these I've chatted about offline already.
 * @param initialBufferSize the initial size of the buffer. When full, the buffer will grow by this
 *                          many bytes. The buffer always stores bytes contiguously, so growth requires
 *                          allocation of a new buffer capable of holding the new capacity and copying of the
 *                          existing bytes into the new buffer. As such, a size should be chosen carefully
 *                          such that growth is expected to occur rarely, if ever.
I asked offline what causes the constraint that the buffer always stores bytes contiguously.
A desire for simplicity, mostly. One of the things that makes this fast is that the parser can parse directly from that buffer once it’s filled. That gets significantly more complicated (and possibly slower) if we have to worry about addressing pages and page offsets.
The idea being that the size of your buffer will eventually stabilize and the cost of finding the right size contiguous buffer will be amortized.
This javadoc convention makes ragged blocks whose indentation is dependent on the length of a variable name. What do you think of this convention instead?
/**
* @param initialBufferSize the initial size of the buffer. When full, the buffer will grow by
* this many bytes. The buffer always stores bytes contiguously, so growth requires
* allocation of a new buffer capable of holding the new capacity and copying of the
* existing bytes into the new buffer. As such, a size should be chosen carefully such that
* growth is expected to occur rarely, if ever.
 */
The javadoc style doesn't really matter to me either way. The current style is what IntelliJ does for me when I go to a new line within a Javadoc parameter description, which is why it's here. I like the way you've proposed slightly more. Since there are many other comments on this PR I'm going to focus on those, and we can follow up with a change to the Javadoc style in a future PR.
if (shortfall > 0) {
    // There is not enough space in the buffer even though all available bytes have already been
    // moved to the start of the buffer. Growth is required.
    int amountToGrow = Math.max(initialBufferSize, shortfall);
This ought to scale proportionately: a constant growth factor rather than a constant growth amount. Here, when minimumNumberOfBytesRequired > initialBufferSize, unbalanced write/read sequences will lead to allocation and copying on every write.
Using a growth factor instead of growing by initialBufferSize each time is something I'd like to experiment with, and I'd prefer to do it as a fast-follow to this PR as it requires all new performance/memory benchmarks.
Here when minimumNumberOfBytesRequired > initialBufferSize, unbalanced write/read sequences will lead to allocation and copying on every write.
That's true if you're writing significantly more than you're reading. In that case, users should heed the warning on the Javadoc for the initialBufferSize parameter: "should be chosen carefully such that growth is expected to occur rarely, if ever." Each read frees bytes in the buffer, so for normal cases that read the data that is written fairly frequently, the size of the buffer should stabilize. For the Ion use case, for example, we know that the buffer will be filled with enough chunks of bytes to complete a single top-level value (plus preceding system value(s)), then that data will be read. The size of the buffer will stabilize to approximately the size of the largest top-level value in the stream.
Using a growth factor requires fewer allocations/copies to stabilize, but will usually be more wasteful of memory. That's why I want to study the difference in-depth after resolving the other feedback on this PR.
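To make the trade-off concrete, here is a hypothetical comparison of reallocation counts under the two strategies; the numbers in the usage below are examples, not benchmarks from this PR:

```java
// Count how many reallocations are needed to grow a contiguous buffer from
// `initial` bytes to at least `target` bytes under two strategies.
class GrowthComparison {
    // Grow by a fixed increment each time, as the current code does.
    static int constantIncrementGrowths(int initial, int increment, int target) {
        int size = initial;
        int growths = 0;
        while (size < target) {
            size += increment;
            growths++;
        }
        return growths;
    }

    // Grow proportionally each time (the suggested alternative).
    static int geometricGrowths(int initial, double factor, int target) {
        int size = initial;
        int growths = 0;
        while (size < target) {
            size = (int) Math.ceil(size * factor);
            growths++;
        }
        return growths;
    }
}
```

For example, reaching 1MB from a 32KB initial size takes 31 reallocations with a constant 32KB increment but only 5 with a doubling factor, at the cost of potentially overshooting the needed capacity.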
zslayton
left a comment
Comments for BufferConfiguration, BufferEventHandler, and the Ion-specific implementations.
zslayton
left a comment
Review for IonReaderBinaryIncremental. As a high level comment: this file is huge. Is there a reason not to break some of the nested classes into their own files?
// The text representations of the symbol table that is currently in scope, indexed by symbol ID. If the element at
// a particular index is null, that symbol has unknown text.
private final List<String> symbols;

// The catalog used by the reader to resolve shared symbol table imports.
private final IonCatalog catalog;

// The shared symbol tables imported by the local symbol table that is currently in scope. The key is the highest
// local symbol ID that resolves to a symbol contained in the value's symbol table.
private final TreeMap<Integer, SymbolTable> imports;

// A map of symbol ID to SymbolToken representation. Because most use cases only require symbol text, this
// is used only if necessary to avoid imposing the extra expense on all symbol lookups.
private List<SymbolToken> symbolTokensById = null;

// The number of symbols in the local symbol table that is currently in scope.
private int numberOfSymbols;

// The highest local symbol ID that resolves to a symbol contained in a shared symbol table imported by the
// current local symbol table.
private int importMaxId;
At this point in my read-through, these member fields seem like they should be grouped together into a class.
I have a TODO to "split this implementation into a user-level reader and a system-level reader", where all of the symbol-related stuff would be in the user-level reader. I think that could help to organize some of this better, in addition to reducing the size of the file, which you pointed out in your top-level comment. I'd prefer to wait on reorganization of these variables until implementing the user/system separation, which I'm not prepared to do as part of this initial PR.
Non-blocking concern: I would expect a load factor of 1 to waste less memory but provide somewhat slower lookup times. We should run some ion-java-benchmark-cli tests to see what the performance difference between the default (.75?) and 1 is.
The main thing I want to avoid is a guaranteed growth. As an alternative to a load factor of 1, we can use the default but inflate the initial size beyond what we need. Here's what I have:
mapView = new HashMap<String, Integer>((int) Math.ceil(numberOfSymbols / 0.75), 0.75f);
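The arithmetic behind this idiom: a HashMap with load factor f resizes once its size exceeds capacity × f, so an initial capacity of ceil(n / f) guarantees n entries fit without triggering a resize (HashMap additionally rounds the capacity up to a power of two). A small sketch of the computation, with a hypothetical helper name:

```java
// With load factor f, a HashMap resizes when its size exceeds capacity * f.
// Choosing capacity = ceil(n / f) therefore avoids any resize while inserting
// n entries.
class Presize {
    static int capacityFor(int numberOfEntries, float loadFactor) {
        return (int) Math.ceil(numberOfEntries / loadFactor);
    }
}
```

For example, 12 entries at the default 0.75 load factor want a capacity of 16; 100 entries want 134 (which HashMap rounds up to 256 internally).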
zslayton
left a comment
Comments for IonReaderBuilder.
/**
 * If possible, fills the input pipe with enough bytes to enable one more successful call to
 * {@link IonReader#next()} on the non-incremental reader attached to the pipe. If not enough bytes were available
 * in the raw input stream to complete the next top-level user value, calling {@link #moreDataRequired()} after this
 * method returns will return `true`. In this case, this method must be called again before calling
 * `IonReader.next()` to position the reader on this value. Otherwise, `moreDataRequired()` will return `false` and
 * a call to `IonReader.next()` may be made to position the reader on this value. Implementations may throw
 * `IonException` if invalid Ion data is detected. Any exceptions thrown when invoking {@link BufferEventHandler}
 * methods will be propagated. Implementations may define additional exceptional cases.
 * @throws Exception if thrown by a handler method or if an IOException is thrown by the underlying InputStream.
 */
References to IonReader methods and BufferEventHandler seem out of place at this layer of abstraction, or we have opportunities to reduce abstraction since ReaderLookaheadBufferBase is the only class that implements this interface.
I agree that the comment about BufferEventHandler should be moved to ReaderLookaheadBufferBase. Done.
I think the IonReader references should remain, however; this interface is inextricably linked to IonReader (hence the newIonReader method).
Regarding why both the ReaderLookaheadBuffer interface and the ReaderLookaheadBufferBase abstract class exist, that's a good callout. It's because I separately implemented some incremental reader support for text that I'm not proposing in this PR, but may later. I ended up defining a class that extends ReaderLookaheadBufferBase capable of incrementally reading newline-delimited Ion text. I also defined a "no-op lookahead buffer" that trivially implements ReaderLookaheadBuffer and allows the user to use the same underlying machinery for blocking use cases. That satisfied me that the current factoring probably has some value, but I'm open to suggestions.
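The calling pattern the quoted javadoc prescribes (fill, check moreDataRequired(), only then call next()) can be sketched with a toy harness. The names fillInput and moreDataRequired come from the interface under review; everything else here (ToyLookahead, FillDriver, the 10-byte chunks) is hypothetical:

```java
// Hypothetical stand-in for the lookahead contract: fillInput() buffers more
// raw input, and moreDataRequired() reports whether the buffered bytes
// complete the next top-level value.
interface Lookahead {
    void fillInput();           // the real interface may declare `throws Exception`
    boolean moreDataRequired();
}

class ToyLookahead implements Lookahead {
    private final int bytesNeeded; // bytes required to complete the next value
    private int bytesBuffered = 0;

    ToyLookahead(int bytesNeeded) {
        this.bytesNeeded = bytesNeeded;
    }

    @Override
    public void fillInput() {
        bytesBuffered += 10; // pretend a 10-byte chunk arrived from the stream
    }

    @Override
    public boolean moreDataRequired() {
        return bytesBuffered < bytesNeeded;
    }
}

class FillDriver {
    // The documented calling pattern: keep filling while more data is
    // required; only after that is it safe to call the reader's next().
    static int fillUntilReady(Lookahead lookahead) {
        int calls = 0;
        do {
            lookahead.fillInput();
            calls++;
        } while (lookahead.moreDataRequired());
        return calls;
    }
}
```

In a real (non-blocking) caller, the loop would yield between fills and retry when more input arrives rather than spinning.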
jobarr-amzn
left a comment
Commenting on the state machine in IonReaderLookaheadBuffer: I also would like to see this more willing to re-parse, because my intuition is that it would be simpler to read and not have a meaningful performance cost.
At the extreme, if each call to moreDataRequired() were completely independent (i.e. if a previous call had to return true, the next call would have to re-read every byte from the value starting from the type ID), would that allow us to simplify the state machine? And I came to the conclusion that I don't think it would. We'd still have a loop where we progress through states, with certain states sometimes being skipped based on the outcome of previous states (e.g. if the value is unannotated).
Conceptually we're moving through a state machine, but any computational process can be modeled that way. I think that explicit modeling is getting in the way here more than it is helping, partially because the relevant logic isn't isolated by state. Instead we have a different sort of cross section that cuts across them. At least that's how it feels to me.
More generally I'm trying to generate a positive recommendation for the IonReaderBinaryIncremental, IonReaderLookaheadBuffer, and ResizingPipedInputStream set of classes. These three don't feel well-defined, they bleed into each other and don't make as much sense separately. You see that from the way documentation and comments reference the internals of other classes, or super types reference properties or uses of their subtypes. At present I don't have a helpful clarification to suggest, rather I'm saying I'd be happy if there was one.
protected int expectedLocalNullSlotSymbolId() {
    // The spec allows for implementations to treat these malformed symbols as "null slots", and all "null slots"
    // in local symbol tables as equivalent to symbol zero.
    return getStreamingMode() == StreamingMode.NEW_STREAMING_INCREMENTAL ? 0 : 10;
Why is this? I did a little poking around and don't get it yet. Is this a test case convention? I see this in more places.
The spec says "Any SIDs that refer to null slots in a local symbol table are equivalent to symbol zero." However, unlike the new implementation, the existing binary reader implementation does not actually assign null slots symbol ID 0. It's a larger effort to fix this in the old implementation, and is left out of scope for this PR (see #151 ).
10 is hardcoded here because the method is only called in places where the null slot is at symbol ID 10; there’s nothing special about 10 with respect to null slots.
there’s nothing special about 10 with respect to null slots.
That's more what I was getting at. I had a hard time finding the significance of 10, in particular.
These tests and their data have behavior and specification split up in ways that make them difficult for a reader to assemble. I'd like to see a different factoring that lowers the barrier to entry.
Nothing to do here, necessarily, except that while raising 10 as a special value in these tests does exercise a behavior of the new reader that performs to spec, I'd rather see that behavior demonstrated in a specific and isolated test so that I don't have to drag the knowledge of "10" across boundaries.
jobarr-amzn
left a comment
I've now reviewed everything except the tests for IonReaderBinaryIncremental, IonReaderLookaheadBuffer*, and ResizingPipedInputStream.
@Test
public void testReadingDecimalAsInteger()
{
    // decimal value <space> int conversion
    read(
        "0. 0 " +
        "-2147483648. -2147483648 " + // Min int
        "2147483647. 2147483647 " // Max int
    );
    while (in.next() != null)
    {
        int actual = in.intValue();

        in.next();
        int expected = in.intValue();

        assertEquals(expected, actual);
    }
}
I know it didn't originate with you and you are copying the conventions of this file, but this is a bizarre pattern. To justify it, the sequencing of values in a single stream would have to be important, or reading a new document would have to be so expensive that it justified strange tests to make runtime friendlier. I don't think either of those are true, so it would be better to have a few methods factored out:
- readInt(String text)
- readLong(String text)
- etc.
Then you'd be able to write this as e.g.
int readInt(String text) {
    read(text);
    if (in.next() == null) {
        fail("Expected to read an integer");
    }
    return in.intValue();
}
@Test
public void testReadingDecimalAsInteger()
{
    // decimal value <space> int conversion
    assertEquals(readInt("0."), readInt("0"));
    assertEquals(readInt("-2147483648."), readInt("-2147483648")); // Min int
    assertEquals(readInt("2147483647."), readInt("2147483647")); // Max int
}
This can get even cleaner in Java 8, where I'd advocate writing a custom assertion e.g. assertEquals(IonReader::intValue, "0.", "0") so that you can reduce and reuse further. But as it stands, the pattern above at least avoids globbing test cases together.
This needn't be part of this review, but even one example would give a better template for future expansions here.
Here are all of these XasY tests rewritten to avoid this while idiom: https://gist.github.com/jobarr-amzn/0dd822b2fd7fd6ef9258bb95e43bd3c2
I agree that it's an awkward pattern, and I like your suggestion. I've opened #380 for this.
In case you missed it, this is completely implemented in that gist. I'll add a comment in #380, maybe just fire out another PR after this one is shipped then.
I didn't miss it; I just don't want to further inflate this PR. A follow-on PR would be great.
import static org.junit.Assert.assertNull;
import static org.junit.Assert.assertTrue;

public class IonReaderBinaryIncrementalTest {
Helper methods and classes would reduce a lot of the repetition here and make it read more easily.
That could be custom assertions like assertString(IonReader r, String s):
void assertString(IonReader r, String s) {
    assertEquals(IonType.STRING, r.next());
    assertEquals(s, r.stringValue());
}
Or composable Matchers, helper methods for common setup, etc. Lambdas etc. would also help if they were available 🤷🏻♂️.
Returning to this comment: the IonValue-based assertion helpers I proposed for IonReaderLookAheadBufferTestBase would help here.
I've added some helper methods to reduce repetition in both IonReaderBinaryIncrementalTest and IonReaderLookaheadBufferTest.
new IonReaderAssertion() {
    @Override
    public void evaluate(IonReader reader) {
        assertEquals(IonType.LIST, reader.next());
        reader.stepIn();
        assertEquals(IonType.STRUCT, reader.next());
        reader.stepIn();
        assertEquals(IonType.SYMBOL, reader.next());
        assertEquals("foo", reader.getFieldName());
        assertEquals("bar", reader.stringValue());
        reader.stepOut();
        assertEquals(IonType.SEXP, reader.next());
        assertEquals(IonType.INT, reader.next());
        assertEquals(123, reader.intValue());
        assertNull(reader.next());
        reader.stepOut();
    }
}
Given some helper methods, you could write it like this:
a_list(
    a_struct(a_field("foo", a_symbol("bar"))),
    a_sexp(a_symbol("baz"), a_symbol("bar")),
    an_integer(123)
)
The whole test file written this way is reduced by ~200 lines and is easier to read and write at the expense of slightly increased difficulty of debugging failures. Here's that refactor: https://gist.github.com/jobarr-amzn/7e4fc2fbc91cb31524ff241458849f72
It needs this infrastructure (~200 LOC ;) ) to make an IonReaderAssertion that compares an IonValue to an IonReader: https://gist.github.com/jobarr-amzn/26f5f58f71715a34c50e4cc1101b3ab5
Given that this file is not the only place we make structural/content assertions around IonReaders, I think infrastructure like the above could be useful.
I appreciate how succinct this proposal makes the tests themselves. However, it goes to great lengths to hide the verbosity of the IonReader API, and that's the API we're testing. We should expect some verbosity in tests that test a verbose API.
I used to try to make my tests as DRY and concise as possible. After having to actually use those tests enough times (and with enough time elapsed since writing them), I found that this almost always made them a pain to debug and required more of a time investment to re-familiarize myself with because I had to re-learn what all the helper methods did.
Now, in tests, I always value debuggability above all else. Especially in a code base with as many tests and different authors as ion-java, most developers who encounter a unit test failure will be unfamiliar with that particular test. If such a developer gets a failure in one of these tests, where do they put a breakpoint? Under the proposal, it's not even in the same file; it's in a recursive assertion helper method. As the code is currently factored, the assertions are right there in the test and a breakpoint can quickly be placed before any of them without having to do a bunch of digging.
I know there are different philosophies for how to write tests, and I'm not necessarily asserting that mine is the correct one; I'm open to being convinced that it isn't. In the meantime, I'd like to move forward with the current factoring and explore ways to reduce boilerplate without sacrificing debuggability.
FWIW I agree about the value of debugging. The proposed recursive helper method definitely makes debugging more difficult. I do think though that this is not a necessary trade-off. More investment in IonValue-based structural assertions could make debugging straightforward while keeping tests easy to read.
I don't expect that anyone should use the IonReader API the way that the tests do, though I'm ready to be wrong on that one.
In the meantime, I'd like to move forward with the current factoring and explore ways to reduce boilerplate without sacrificing debuggability.
Sounds good to me. I'll keep this general approach in my pocket and see what I can do to improve it at a later date. Better failure reporting would make all the difference here.
I don't expect that anyone should use the IonReader API the way that the tests do, though I'm ready to be wrong on that one.
I think the tests do some things that a user probably wouldn't often do (e.g. call next() multiple times at the end of a container), but generally I think they fairly accurately depict how IonReader would be used to traverse values. Maybe I'm not fully understanding this statement. Can you give an example?
Sure: in these tests we're making a series of calls to verify arbitrary data, including things like the specific ordering of fields in a struct. That sort of specificity is appropriate when verifying the functioning of a parser, but as a user of ion-java I wouldn't expect to do this. Instead I'd be using higher-level abstractions to guide my use of the library.
That might be ISL, it might be native objects with some other serialization constraints, it might be some other framework, but whatever it is I don't expect to do or see this kind of unrolled/decompiled sequence of calls.
(Quotes from @jobarr-amzn's comments)
I've laid out my reasoning on this topic in this comment. It would be great to have some of those points refuted specifically so that I can identify where our understanding diverges. In particular, I would like an answer to the following question: if the lookahead buffer were implemented such that it would re-parse data on each call, would the result actually be simpler?
I'm going to push on this one a little bit; maybe that will help us arrive at something actionable. First, can you please give examples of "super types reference properties or uses of their subtypes"? Now let's take the classes one at a time.
…remental, IonTypeID, and ResizingPipedInputStream.
…ed for all readers except the incremental reader.
This was a massive undertaking, and it's all functional code. Inactionable comments don't fit here. Please ship it, and in the event that anyone ever has a concrete and actionable improvement to anything in the future we can talk about it then.
I misspoke and said something incorrect, I apologize. Let me try again. I was thinking of the way IonReaderBinaryIncremental used the ResizingPipedInputStream belonging to its IonReaderLookaheadBuffer, but I was tired and said subtype/supertype (IS-A when I was thinking HAS-A). This is ground we've already covered, so no need to get into it again.

I have a friend who, when building a gaming PC, knew his tolerances well enough to strip the heat sink from one component and share the larger heat sink of another, and to remove the casing from the power supply to make another component placement possible. It worked: the rig outperforms anything I've owned and fits more power per dollar per unit of space in the case than the individual components suggested ought to be possible.

The abstraction violation in the three classes above is intentional but not gratuitous; my discomfort is not an indicator that the solution seen here is not achieving what it sets out to do. I do not think that additional interfaces on the lookahead buffer or resizing pipe would be more clear than the current state, so please keep it as is.
…hat the option may apply to text readers if support is added in the future.
Description of Changes
This PR adds a new binary IonReader implementation that is enabled and configured via optional IonReaderBuilder options.
Motivation
The motivation is twofold:
1. The new reader outperforms the existing implementation in most situations (see the Performance section below).
2. The existing reader always expects a complete value to be available when the user calls next(). If only part of a value is available, the existing reader will raise an UnexpectedEOFException and terminate processing, even if data completing the value would have become available later. The new implementation supports incremental (streaming) use cases by allowing the user to wait for additional data to become available and call next() again to proceed.
Code reuse was not a goal of this project. I wanted to build a new implementation from scratch to avoid being influenced or constrained by any potentially performance-limiting code used by the existing implementation. As such, there is intentionally some logic that probably looks similar between the implementations.
Testing
The classes added as part of this change are covered by new unit tests. Additionally, an enum value for the incremental reader implementation was added so that any existing reader tests (including the tests generated from the ion-tests files) exercise both the old and new implementations.
Performance
The ion-java-benchmark-cli was updated to support the new incremental implementation. It was then used to compare latency, heap usage, and garbage collection stats for the existing and new implementations for various data. Below are the results, which are for fully-materialized traversals using default settings unless otherwise stated.
Medium-size log stream
A 22MB stream of struct values descending to depth ~5, containing timestamps, strings, symbol
values, annotations, nested lists and structs, ints, 32-bit floats, and decimals. Each top-level
value is up to a couple kilobytes in size. Local symbol table appends are interspersed with user
values.
Full traversal
The new implementation is 52% faster (515ms -> 246ms) and uses 33% less heap space (30MB -> 20MB).
Full results here.
Sparse read
This run navigated to and materialized a single depth 1 string value from each top level struct.
The new implementation is 31% faster (96ms -> 66ms) and uses 56% less heap space (54MB -> 24MB).
Full results here.
Large log stream
A 7.7GB stream of values with similar shape to the values in the medium-size log stream (see
above), but with significantly different values, field names, and annotations, and with
significantly more symbols and larger top-level values.
The new implementation is 35% faster (110s -> 71s), uses similar heap space (123MB -> 135MB),
and spends 49% less time performing garbage collections (506ms -> 260ms).
Full results here.
Single small value
A single 5KB value from the medium-size log stream (see above).
The new implementation is 24% faster (267us -> 202us).
Full results here.
Single large list
A single 1MB list of random float values. The code used to generate the list can be found
here.
Default initial buffer size (32K)
With the default buffer size of 32K, the incremental reader's internal buffer must grow several
times in order to fit the 1MB value, consuming a lot of heap space and leading to GC churn.
Even so, the new implementation is 14% faster (9342us -> 8019us), despite using much more heap
space (3MB -> 107MB) and spending much more time in GC (67ms -> 603ms).
Full results here.
Sufficient buffer size to fit the value (1.01 MB)
When the data to parse is known to contain very few values that exceed the default buffer size,
and their maximum size is known, users may configure the incremental reader with an initial
buffer size large enough to avoid the need for growth.
Doing this made the new implementation 56% faster (9277us -> 4068us), even though heap usage
(3MB -> 34MB) and GC time (71ms -> 159ms) still exceeded the existing implementation.
Full results here.
Many large lists
100MB stream of 1MB lists of random float values. The code used to generate the lists can be found
here.
Default initial buffer size (32K)
The new implementation is 54% faster (818ms -> 372ms), but uses more heap space (3MB -> 13MB).
Full results here.
Sufficient buffer size to fit a value (1.01 MB)
Because the stream contains 100 values, setting an appropriate initial buffer size matters less:
buffer growth occurs only during processing of the first value and is amortized over all the
values.
The new implementation is 57% faster (853ms -> 367ms) and uses similar heap space (2.7MB vs 3.8MB).
Full results here.
Random blobs/clobs
100MB stream of random blobs or clobs of random size between 0 and 512 bytes. The code used to
generate the lobs can be found
here.
The two implementations have almost identical latency (69ms -> 71ms), heap usage (50MB -> 50MB),
and GC time (2ms -> 5ms).
Full results here
and here.
Random decimals
108MB stream of random decimals with varying precision. The code used to generate the decimals
can be found
here.
Read via decimalValue()
The new implementation is 24% faster (1669ms -> 1260ms) and uses similar heap space (27MB -> 32MB).
Full results here.
Read via bigDecimalValue()
The new implementation is 36% faster (1748ms -> 1119ms) and uses similar heap space (28MB -> 26MB).
Full results here.
Random floats
100MB stream of random 32- and 64-bit floats. The code used to generate the floats can be found
here.
The new implementation is 4% faster (934ms -> 899ms) and uses almost identical heap space (3.8MB
-> 3.8MB), with no time spent in GC in either case.
Full results here.
Random annotations on random floats
100MB stream of random floats annotated with up to 3 annotations each from a pool of 500
symbols of various lengths between 0 and 20.
The code used to generate the annotated floats can be found
here.
Annotations read as strings
The new implementation is 31% faster (1487ms -> 1021ms) and uses similar heap space (30MB -> 31MB).
Full results here.
Annotations read as SymbolTokens
The new implementation is 20% faster (1380ms -> 1106ms) and uses less heap space (26MB -> 18MB).
Full results here.
Random ints
110MB stream of random ints of various sizes. The code used to generate the ints can be found
here.
The new implementation is 24% faster (1474ms -> 1115ms) and uses similar heap space (20MB -> 17MB).
Full results here.
Random strings
174MB stream of random strings of various sizes between 0 and 20. The code used to generate the
strings can be found
here.
The new implementation is 12% faster (1262ms -> 1113ms) and uses identical heap space (16MB) and
GC time (28ms).
Full results here.
Random symbol values
123MB stream of symbol values pulled from a pool of 500 random symbols of various sizes between
0 and 20. The code used to generate the symbol values can be found
here.
Read to strings using stringValue()
The new implementation is 9% faster (3397ms -> 3095ms) and uses identical heap space (4MB) and
zero GC time.
Full results here.
Read to SymbolTokens using symbolValue()
The new implementation is 12% faster (3582ms -> 3130ms), uses much less heap space (28MB -> 4MB),
and less GC time (96ms -> 0ms).
Full results here.
Random timestamps
113MB stream of random timestamps with various precisions. The code used to generate the timestamps
can be found
here.
The new implementation is 20% faster (1818ms -> 1449ms) and uses slightly more heap space (11MB ->
18MB).
Full results here.
Caveats
The new implementation only supports incremental reading at the top level, meaning that it must
buffer an entire top-level value (and any system values that precede it) before allowing the
user to start consuming the value. This means the internal buffer must grow to the size of the
largest top-level value (plus preceding system values) in the stream. The old implementation
buffers data using fixed-size pages and only allocates additional pages if a single leaf value
exceeds the size of a page, which is rare.
As a result, the new implementation is not suitable for use with data that contains values
that approach or exceed the amount of memory allocated to the JVM, and it currently cannot
support values larger than 2 GB in any case because the underlying buffer is a byte array
indexed by an int. This should not be a common use case.
In order to avoid inadvertently attempting to process a humongous top-level value that would
result in an OutOfMemoryError with the new implementation, an option is provided to allow the
user to configure the maximum size of the internal buffer. Under no circumstances will the
buffer grow beyond this size, even if that means a large value must be dropped. A callback must
be implemented by users to handle this case if it occurs.
As described previously, the new implementation is a normal IonReader implementation except
that it supports incremental reads, which necessitates slightly different behavior when calling
next() at the top level. With the new implementation, receiving null from next() at the top
level means that there is not enough data to complete a value. The user may decide either to
wait and call next() again in the future, at which point next() will return a non-null IonType
if a complete value has become available; or, the user may decide that the stream is complete
and close() the reader. If close() is called when the reader holds a partially-buffered value,
UnexpectedEOFException will be raised, mimicking the behavior of the old implementation. If any
existing users rely on being able to partially read an incomplete top-level value before
handling an UnexpectedEOFException, the new implementation will not be suitable. This is also
unlikely to be common.
Example
The following example creates an incremental reader over a growing binary Ion file and limits
the size of the internal buffer to 128K.
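The example's code block did not survive extraction. Below is a hedged reconstruction based on
the classes named elsewhere in this PR (IonReaderLookaheadBuffer) and its description of the
maximum-buffer-size option and oversized-value callback; the configuration builder names and
method signatures here are assumptions and may differ from the merged API, and the file name is
hypothetical.

```java
import com.amazon.ion.IonBufferConfiguration;
import com.amazon.ion.IonReader;
import com.amazon.ion.IonType;
import com.amazon.ion.impl.IonReaderLookaheadBuffer;
import com.amazon.ion.system.IonReaderBuilder;

import java.io.FileInputStream;
import java.io.InputStream;

public class IncrementalReadExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical growing binary Ion file.
        try (InputStream in = new FileInputStream("log.10n")) {
            // Assumed configuration API: cap the internal buffer at 128K and
            // register a callback invoked when a value is too large to buffer.
            IonBufferConfiguration config = IonBufferConfiguration.Builder.standard()
                .withMaximumBufferSize(128 * 1024)
                .onOversizedValue(() -> System.err.println("dropped oversized value"))
                .build();
            IonReaderLookaheadBuffer lookahead = new IonReaderLookaheadBuffer(config, in);
            // Buffer bytes until a complete top-level value is available.
            lookahead.fillInput();
            if (lookahead.moreDataRequired()) {
                // The file has not yet grown enough to contain a full value;
                // call fillInput() again after more bytes have been appended.
                return;
            }
            try (IonReader reader = lookahead.newIonReader(IonReaderBuilder.standard())) {
                IonType type;
                while ((type = reader.next()) != null) {
                    System.out.println(type);
                }
            }
        }
    }
}
```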
Suggested Review Order
- IonReaderLookaheadBuffer and ResizingPipedInputStream.
- The incremental reader itself, which uses IonReaderLookaheadBuffer to ensure a complete
  top-level value is buffered, then navigates through the stream and parses values at the
  direction of the user.
- The builder used to instantiate the reader and configure its buffers.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.