Adds a binary IonReader implementation capable of incremental reads. #355

Merged: tgregg merged 21 commits into master from incremental-reader-merge on Sep 8, 2021

Conversation

@tgregg (Contributor) commented Apr 10, 2021

Description of Changes

This PR adds a new binary IonReader implementation that is enabled and configured via optional
IonReaderBuilder options.

Motivation

The motivation is twofold:

  1. Improve performance. The new implementation provides improved performance over the existing
    implementation in most situations (see the Performance section below).
  2. Support incremental reading. The existing implementation is purely blocking, meaning that it
    always expects a complete value to be available when the user calls next(). If only part of
    a value is available, the existing reader will raise an UnexpectedEOFException and terminate
    processing, even if data completing the value would have become available later. The new
    implementation supports incremental (streaming) use cases by allowing the user to wait for
    additional data to become available and call next() again to proceed.

Code reuse was not a goal of this project. I wanted to build a new implementation from scratch
to avoid being influenced or constrained by any potentially performance-limiting code in the
existing implementation. As a result, some logic is intentionally duplicated and probably looks
similar between the two implementations.

Testing

The classes added as part of this change are covered by new unit tests. Additionally, an enum
value for the incremental reader implementation was added so that any existing reader tests
(including the tests generated from the ion-tests files) exercise both the old and new
implementations.

Performance

The ion-java-benchmark-cli
was updated to support the new incremental implementation. It was then used to compare
latency, heap usage, and garbage collection statistics between the existing and new implementations
across various data sets. Below are the results, which reflect fully-materialized traversals using
default settings unless otherwise stated.

Medium-size log stream

A 22MB stream of struct values descending to depth ~5, containing timestamps, strings, symbol
values, annotations, nested lists and structs, ints, 32-bit floats, and decimals. Each top-level
value is up to a couple kilobytes in size. Local symbol table appends are interspersed with user
values.

Full traversal

The new implementation is 52% faster (515ms -> 246ms) and uses 33% less heap space (30MB -> 20MB).

Full results here.

Sparse read

This run navigated to and materialized a single depth 1 string value from each top level struct.

The new implementation is 31% faster (96ms -> 66ms) and uses 56% less heap space (54MB -> 24MB).

Full results here.

Large log stream

A 7.7GB stream of values with similar shape to the values in the medium-size log stream (see
above), but with significantly different values, field names, and annotations, and with
significantly more symbols and larger top-level values.

The new implementation is 35% faster (110s -> 71s), uses similar heap space (123MB -> 135MB),
and spends 49% less time performing garbage collections (506ms -> 260ms).

Full results here.

Single small value

A single 5KB value from the medium-size log stream (see above).

The new implementation is 24% faster (267us -> 202us).

Full results here.

Single large list

A single 1MB list of random float values. The code used to generate the list can be found
here.

Default initial buffer size (32K)

With the default buffer size of 32K, the incremental reader's internal buffer must grow several
times in order to fit the 1MB value, consuming a lot of heap space and leading to GC churn.

Nevertheless, the new implementation is 14% faster (9342us -> 8019us) despite the fact that
it uses much more heap space (3MB -> 107MB) and spends much more time in GC (67ms -> 603ms).

Full results here.

Sufficient buffer size to fit the value (1.01 MB)

When the data to parse is known to contain only a few values whose maximum size exceeds the default
buffer size, users may configure the incremental reader with an initial buffer size large enough to
hold the largest value, avoiding the need for buffer growth.
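
As a rough sketch of such a configuration (the withInitialBufferSize builder option and the omission of the oversized-value handlers shown in the later example are assumptions; the other calls mirror that example):

    // Hypothetical setup; "largeList.10n" and the exact size are illustrative.
    InputStream in = new FileInputStream("largeList.10n");
    IonReader reader = IonReaderBuilder.standard()
        .withIncrementalReadingEnabled(true)
        .withBufferConfiguration(
            IonBufferConfiguration.Builder.standard()
                .withInitialBufferSize(1024 * 1024 + 10 * 1024) // ~1.01MB; assumed builder option
                .build()
        )
        .build(in);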

Doing this made the new implementation 56% faster (9277us -> 4068us), even though heap usage
(3MB -> 34MB) and GC time (71ms -> 159ms) still exceeded the existing implementation.

Full results here.

Many large lists

100MB stream of 1MB lists of random float values. The code used to generate the lists can be found
here.

Default initial buffer size (32K)

The new implementation is 54% faster (818ms -> 372ms), but uses more heap space (3MB -> 13MB).

Full results here.

Sufficient buffer size to fit a value (1.01 MB)

Because the stream contains 100 values, setting an appropriate initial buffer size matters less here:
buffer growth occurs only while processing the first value, and its cost is amortized over all of the
values.

The new implementation is 57% faster (853ms -> 367ms) and uses similar heap space (2.7MB vs 3.8MB).

Full results here.

Random blobs/clobs

100MB stream of random blobs or clobs of random size between 0 and 512 bytes. The code used to
generate the lobs can be found
here.

Both implementations show nearly identical time (69ms -> 71ms), heap usage (50MB -> 50MB), and GC
time (2ms -> 5ms).

Full results here
and here.

Random decimals

108MB stream of random decimals with varying precision. The code used to generate the decimals
can be found
here.

Read via decimalValue()

The new implementation is 24% faster (1669ms -> 1260ms) and uses similar heap space (27MB -> 32MB).

Full results here.

Read via bigDecimalValue()

The new implementation is 36% faster (1748ms -> 1119ms) and uses similar heap space (28MB -> 26MB).

Full results here.

Random floats

100MB stream of random 32- and 64-bit floats. The code used to generate the floats can be found
here.

The new implementation is 4% faster (934ms -> 899ms) and uses almost identical heap space (3.8MB
-> 3.8MB), with no time spent in GC in either case.

Full results here.

Random annotations on random floats

100MB stream of random floats annotated with up to 3 annotations each from a pool of 500
symbols of various lengths between 0 and 20.

The code used to generate the annotated floats can be found
here.

Annotations read as strings

The new implementation is 31% faster (1487ms -> 1021ms) and uses similar heap space (30MB -> 31MB).

Full results here.

Annotations read as SymbolTokens

The new implementation is 20% faster (1380ms -> 1106ms) and uses less heap space (26MB -> 18MB).

Full results here.

Random ints

110MB stream of random ints of various sizes. The code used to generate the ints can be found
here.

The new implementation is 24% faster (1474ms -> 1115ms) and uses similar heap space (20MB -> 17MB).

Full results here.

Random strings

174MB stream of random strings of various sizes between 0 and 20. The code used to generate the
strings can be found
here.

The new implementation is 12% faster (1262ms -> 1113ms) and uses identical heap space (16MB) and
GC time (28ms).

Full results here.

Random symbol values

123MB stream of symbol values pulled from a pool of 500 random symbols of various sizes between
0 and 20. The code used to generate the symbol values can be found
here.

Read to strings using stringValue()

The new implementation is 9% faster (3397ms -> 3095ms) and uses identical heap space (4MB) and
zero GC time.

Full results here.

Read to SymbolTokens using symbolValue()

The new implementation is 12% faster (3582ms -> 3130ms), uses much less heap space (28MB -> 4MB),
and less GC time (96ms -> 0ms).

Full results here.

Random timestamps

113MB stream of random timestamps with various precisions. The code used to generate the timestamps
can be found
here.

The new implementation is 20% faster (1818ms -> 1449ms) and uses slightly more heap space (11MB ->
18MB).

Full results here.

Caveats

The new implementation only supports incremental reading at the top level, meaning that it must
buffer an entire top-level value (and any system values that precede it) before allowing the
user to start consuming the value. This means the internal buffer must grow to the size of the
largest top-level value (plus preceding system values) in the stream. The old implementation
buffers data using fixed-size pages and only allocates additional pages if a single leaf value
exceeds the size of a page, which is rare.

As a result, the new implementation is not suitable for use with data that contains values
that approach or exceed the amount of memory allocated to the JVM, and currently cannot support
values larger than 2 GB in any case because the underlying buffer is a byte array indexed by an
int. This should not be a common use case.

In order to avoid inadvertently attempting to process a humongous top-level value that would
result in an OutOfMemoryError with the new implementation, an option is provided
to allow the user to configure the maximum size of the internal buffer. Under no circumstances
will the buffer grow beyond this size, even if that means a large value must be dropped. A
callback must be implemented by users to handle this case if it occurs.

As described previously, the new implementation is a normal IonReader implementation except that
it supports incremental reads, which necessitates slightly different behavior when calling
next() at the top level. With the new implementation, receiving null from next() at the
top level means that there is not enough data to complete a value. The user may decide either to
wait and call next() again in the future, at which point next() will return a non-null
IonType if a complete value has become available; or, the user may decide that the stream is
complete and close() the reader. If close() is called when the reader holds a
partially-buffered value, UnexpectedEOFException will be raised, mimicking the behavior of the
old implementation. If any existing users rely on being able to partially read an incomplete
top-level value before handling an UnexpectedEOFException, the new implementation will not be
suitable. This is also unlikely to be common.

Example

The following example creates an incremental reader over a growing binary Ion file and limits
the size of the internal buffer to 128K.

InputStream in = new FileInputStream("growingFile.10n");
IonReader reader = IonReaderBuilder.standard()
    .withIncrementalReadingEnabled(true)
    .withBufferConfiguration(
        IonBufferConfiguration.Builder.standard()
            .withMaximumBufferSize(128 * 1024)
            .onOversizedSymbolTable(() -> {
                // A symbol table alone exceeded the maximum buffer size. This is not
                // recoverable because any values following this symbol table may rely on
                // it. Returning normally causes the reader to cleanly exit, but some
                // implementations may wish to throw an exception here.
            })
            .onOversizedValue(() -> {
                // Return normally, causing the oversize value to be skipped and processing
                // to continue. Some implementations may wish to throw an exception here,
                // causing processing to abort.
            })
            .onData(numberOfBytes -> {
                // This implementation does not care about byte count; do nothing. Some
                // implementations may wish to use metrics to track the number of bytes
                // processed.
            })
            .build()
    )
    .build(in);
// Data in the file (text-equivalent): "foo" [123,
reader.next(); // Returns IonType.STRING
reader.next(); // Returns null
// Data is appended to the file: 4.56]
reader.next(); // Returns IonType.LIST
reader.stepIn();
reader.next(); // Returns IonType.INT
reader.next(); // Returns IonType.DECIMAL
reader.next(); // Returns null (end of container)
reader.stepOut();
// The file has stopped growing.
reader.next(); // Returns null
reader.close(); // Succeeds.

Suggested Review Order

  1. ResizingPipedInputStream - manages the growable buffer that holds top-level values.
  2. IonReaderLookaheadBuffer - fills the ResizingPipedInputStream with complete top-level values.
  3. *BufferConfiguration, *BufferEventHandler - allow the user to configure the
    IonReaderLookaheadBuffer and ResizingPipedInputStream.
  4. IonReaderBinaryIncremental - the core incremental IonReader implementation. Uses an
    IonReaderLookaheadBuffer to ensure a complete top-level value is buffered. Navigates through
    the stream and parses values at the direction of the user.
  5. IonReaderBuilder - provides the options that allow users to enable the incremental reader
    and configure its buffers.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov bot commented Apr 10, 2021

Codecov Report

Merging #355 (9c94bee) into master (b183f25) will increase coverage by 2.16%.
The diff coverage is 96.45%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #355      +/-   ##
============================================
+ Coverage     64.09%   66.25%   +2.16%     
- Complexity     4845     5329     +484     
============================================
  Files           138      146       +8     
  Lines         21142    22545    +1403     
  Branches       3822     4072     +250     
============================================
+ Hits          13550    14937    +1387     
+ Misses         6255     6250       -5     
- Partials       1337     1358      +21     
Impacted Files Coverage Δ
src/com/amazon/ion/impl/SymbolTokenImpl.java 66.66% <50.00%> (ø)
src/com/amazon/ion/impl/lite/IonDatagramLite.java 50.12% <50.00%> (ø)
src/com/amazon/ion/IonBufferConfiguration.java 92.30% <92.30%> (ø)
...om/amazon/ion/impl/IonReaderBinaryIncremental.java 95.01% <95.01%> (ø)
src/com/amazon/ion/system/IonReaderBuilder.java 96.59% <96.55%> (-0.29%) ⬇️
...com/amazon/ion/impl/ReaderLookaheadBufferBase.java 97.43% <97.43%> (ø)
.../com/amazon/ion/impl/IonReaderLookaheadBuffer.java 98.60% <98.60%> (ø)
src/com/amazon/ion/BufferConfiguration.java 100.00% <100.00%> (ø)
src/com/amazon/ion/impl/IonTypeID.java 100.00% <100.00%> (ø)
.../com/amazon/ion/impl/ResizingPipedInputStream.java 100.00% <100.00%> (ø)
... and 27 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@zslayton (Contributor) left a comment:
As discussed offline, I'm going to be reviewing this in chunks. These comments are related to the ResizingPipedInputStream class, which was the first file in the recommended reading order.

Comment on lines 167 to 168

Contributor:
It can be inferred from reading the code, but the meaning of 'size' isn't obvious in this comment alone.

Consider changing 'size' to a doclink to the size() method; its doc comment explains the meaning in detail.

Contributor Author:
Done.

Contributor:
buffer is not visible to the user, right? That's an implementation detail. So this:

    /**
     * Moves all buffered (but not yet read) bytes to `destinationBuffer`. In total, {@link #size()}
     * bytes will be moved.
     */

Contributor Author:
@jobarr-amzn Did part of your comment get cut off? You are correct that buffer is not visible to the user, but this is a private method.

* the boundary must be manually extended (see {@link #extendBoundary(int)} to make these bytes
* available. When false, all buffered bytes will be available to read.
*/
private final boolean useBoundary;
Contributor:
Based solely on a read through of this class, it's not obvious to me what the use case for this feature is. If there's a high-level example use case you could add to this comment, that would be helpful.

Contributor Author:
The best example is IonReaderLookaheadBuffer, which fills the ResizingPipedInputStream in large chunks, but only wants to make the bytes available for consumption from ResizingPipedInputStream after reading the bytes to determine that they are part of the current top-level Ion value.

I don't really want to refer to IonReaderLookaheadBuffer (or even Ion) in ResizingPipedInputStream since this is a standalone class, so I hesitate to add this example in the comment. For now I'm actually content with the existing comment: "This can be used to buffer arbitrarily-sized chunks of bytes without making them available for consumption."

Contributor:
To help me understand the constraints we're working with- why is this behavior in ResizingPipedInputStream, rather than a wrapper?

Contributor Author:
@jobarr-amzn There isn't really a constraint that requires it to be implemented this way (that I can think of). A wrapper adds an extra layer (it would have to override most of the methods in this class to handle the boundary logic), but the boundary logic itself is pretty simple.

@zslayton (Contributor) left a comment:
These comments are related to the IonReaderLookaheadBuffer. This class is well written and I appreciate the lengths you went to to document each method and keep it readable.

There's a lot of complexity around saving and testing state that aims to avoid ever having to re-process an input byte if moreDataRequired() returns true. For example, if the buffered data ends in the middle of an annotation ID, the lookahead buffer can resume parsing the rest of the annotation ID on the next call.

I would wager that for most typical use cases the "insufficient data" case is comparatively rare, and that most calls to next() will be satisfied by the buffer. I suspect that a lookahead buffer implementation that was willing to re-parse bytes in the (uncommon) insufficient data case would not only be much simpler, it would probably perform better in general because it wouldn't need to save/test state for each byte.

Does saving/testing state byte-by-byte buy us other functionality that I've glossed over?

@tgregg (Contributor Author) commented May 7, 2021

There's a lot of complexity around saving and testing state that aims to avoid ever having to re-process an input byte if moreDataRequired() returns true. For example, if the buffered data ends in the middle of an annotation ID, the lookahead buffer can resume parsing the rest of the annotation ID on the next call.

I would wager that for most typical use cases the "insufficient data" case is comparatively rare, and that most calls to next() will be satisfied by the buffer. I suspect that a lookahead buffer implementation that was willing to re-parse bytes in the (uncommon) insufficient data case would not only be much simpler, it would probably perform better in general because it wouldn't need to save/test state for each byte.

Does saving/testing state byte-by-byte buy us other functionality that I've glossed over?

This is an interesting suggestion that I've now had time to think about for a while. I agree that "insufficient data" happens much less frequently than "sufficient data" when calling moreDataRequired().

However, I wasn't able to find much complexity in the code that I'd be able to eliminate by re-parsing bytes after previously returning true from moreDataRequired(). Here is some of my thought process.

  1. It doesn't apply to readTypeId(), which only reads a single byte. Either the byte is there or it isn't; there's nothing to re-parse.
  2. It does apply to readVarUInt(), which is the only other place where bytes are actually parsed. At first, I was thinking simplifying this might allow us to get rid of the VarUInt class and inProgressVarUInt variable. However, we would then have the problem that we still need to return multiple pieces of information from the readVarUInt() method, namely:
    • Is the VarUInt complete? If not, moreDataRequired() must return true.
    • If the VarUInt is complete, what is its value and how many bytes composed the value? (The latter is needed when reading annotation wrappers.)
    Therefore, I conclude that a reusable object like VarUInt is still the best way to convey this information, and keeping the in-progress value doesn't cost us anything. In fact, it is actually necessary in the case where the current value is being skipped due to exceeding the maximum size, because in that case bytes are read directly from the underlying InputStream and are never buffered. The value could not be re-parsed in that case because the bytes that composed it are gone.
  3. At the extreme, if each call to moreDataRequired() were completely independent (i.e. if a previous call had to return true, the next call would have to re-read every byte from the value starting from the type ID), would that allow us to simplify the state machine? And I came to the conclusion that I don't think it would. We'd still have a loop where we progress through states, with certain states sometimes being skipped based on the outcome of previous states (e.g. if the value is unannotated). We'd also have the same problem as in 2) above, where values being skipped are never buffered.

I'm happy to continue discussing this if you can see something I've missed, though. If we can find a technique that significantly simplifies the code and provides better performance for the moreDataRequired() -> false case at the expense of the moreDataRequired() -> true case, I'm all for it.
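
For context, a minimal sketch of the kind of resumable VarUInt decoding described in point 2 above; this is hypothetical and not the PR's actual VarUInt class. Per the Ion binary format, each byte of a VarUInt contributes 7 bits of the value, and the final byte has its high bit set:

    // Hypothetical resumable decoder; field and method names are illustrative only.
    final class InProgressVarUInt {
        long value = 0;        // bits accumulated so far
        int numberOfBytes = 0; // bytes consumed so far (needed, e.g., when sizing annotation wrappers)
        boolean complete = false;

        // Consumes bytes from buffer[offset, limit) until the VarUInt ends (high bit set) or the
        // available bytes run out. Returns the new offset; if 'complete' is still false, the caller
        // reports moreDataRequired() == true and resumes from this same object on the next call.
        int readFrom(byte[] buffer, int offset, int limit) {
            while (!complete && offset < limit) {
                int b = buffer[offset++] & 0xFF;
                value = (value << 7) | (b & 0x7F);
                numberOfBytes++;
                complete = (b & 0x80) != 0;
            }
            return offset;
        }
    }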

@jobarr-amzn (Contributor) left a comment:
Shipping a few comments, will be reviewing incrementally. Most of these I've chatted about offline already.

Comment on lines +137 to +141
* @param initialBufferSize the initial size of the buffer. When full, the buffer will grow by this
* many bytes. The buffer always stores bytes contiguously, so growth requires
* allocation of a new buffer capable of holding the new capacity and copying of the
* existing bytes into the new buffer. As such, a size should be chosen carefully
* such that growth is expected to occur rarely, if ever.
Contributor:
I asked offline what causes the constraint that the buffer always stores bytes contiguously.

A desire for simplicity, mostly. One of the things that makes this fast is that the parser can parse directly from that buffer once it’s filled. That gets significantly more complicated (and possibly slower) if we have to worry about addressing pages and page offsets.
The idea being that the size of your buffer will eventually stabilize and the cost of finding the right size contiguous buffer will be amortized.

Contributor:
This javadoc convention makes ragged blocks whose indentation is dependent on the length of a variable name. What do you think of this convention instead?

    /**
     * @param initialBufferSize the initial size of the buffer. When full, the buffer will grow by
     *     this many bytes. The buffer always stores bytes contiguously, so growth requires 
     *     allocation of a new buffer capable of holding the new capacity and copying of the
     *     existing bytes into the new buffer. As such, a size should be chosen carefully such that
     *     growth is expected to occur rarely, if ever.
     */

Contributor Author:
The javadoc style doesn't really matter to me either way. The current style is what IntelliJ does for me when I go to a new line within a Javadoc parameter description, which is why it's here. I like the way you've proposed slightly more. Since there are many other comments on this PR I'm going to focus on those, and we can follow up with a change to the Javadoc style in a future PR.

if (shortfall > 0) {
    // There is not enough space in the buffer even though all available bytes have already been
    // moved to the start of the buffer. Growth is required.
    int amountToGrow = Math.max(initialBufferSize, shortfall);
Contributor:
This ought to scale proportionately, a constant growth factor rather than a constant growth amount. Here when minimumNumberOfBytesRequired > initialBufferSize unbalanced write/read sequences will lead to allocation and copying on every write.

@tgregg (Contributor Author) Aug 10, 2021:
Using a growth factor instead of growing by initialBufferSize each time is something I'd like to experiment with, and I'd prefer to do it as a fast-follow to this PR as it requires all new performance/memory benchmarks.

Here when minimumNumberOfBytesRequired > initialBufferSize unbalanced write/read sequences will lead to allocation and copying on every write.

That's true if you're writing significantly more than you're reading. In that case, users should heed the warning on the Javadoc for the initialBufferSize parameter: "should be chosen carefully such that growth is expected to occur rarely, if ever." Each read frees bytes in the buffer, so for normal cases that read the data that is written fairly frequently, the size of the buffer should stabilize. For the Ion use case, for example, we know that the buffer will be filled with enough chunks of bytes to complete a single top-level value (plus preceding system value(s)), then that data will be read. The size of the buffer will stabilize to approximately the size of the largest top-level value in the stream.

Using a growth factor requires fewer allocations/copies to stabilize, but will usually be more wasteful of memory. That's why I want to study the difference in-depth after resolving the other feedback on this PR.
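
For comparison, a sketch of the geometric growth policy being discussed next to the fixed-increment policy in the quoted hunk; the method names and the factor of 2 are illustrative, and int overflow is ignored for brevity:

    // Current policy (mirrors the quoted hunk): grow by a constant amount.
    static int grownCapacityFixed(int capacity, int shortfall, int initialBufferSize) {
        return capacity + Math.max(initialBufferSize, shortfall);
    }

    // Alternative: grow by a constant factor (doubling here), which amortizes copying across
    // growths at the cost of up to ~2x slack memory after the final growth.
    static int grownCapacityGeometric(int capacity, int shortfall) {
        return Math.max(capacity * 2, capacity + shortfall);
    }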

@zslayton (Contributor) left a comment:
Comments for BufferConfiguration, BufferEventHandler, and the Ion-specific implementations.

@zslayton (Contributor) left a comment:
Review for IonReaderBinaryIncremental. As a high level comment: this file is huge. Is there a reason not to break some of the nested classes into their own files?

Comment on lines 230 to 250
// The text representations of the symbol table that is currently in scope, indexed by symbol ID. If the element at
// a particular index is null, that symbol has unknown text.
private final List<String> symbols;

// The catalog used by the reader to resolve shared symbol table imports.
private final IonCatalog catalog;

// The shared symbol tables imported by the local symbol table that is currently in scope. The key is the highest
// local symbol ID that resolves to a symbol contained in the value's symbol table.
private final TreeMap<Integer, SymbolTable> imports;

// A map of symbol ID to SymbolToken representation. Because most use cases only require symbol text, this
// is used only if necessary to avoid imposing the extra expense on all symbol lookups.
private List<SymbolToken> symbolTokensById = null;

// The number of symbols in the local symbol table that is currently in scope.
private int numberOfSymbols;

// The highest local symbol ID that resolves to a symbol contained in a shared symbol table imported by the
// current local symbol table.
private int importMaxId;
Contributor:
At this point in my read-through, these member fields seem like they should be grouped together into a class.

Contributor Author:
I have a TODO to "split this implementation into a user-level reader and a system-level reader", where all of the symbol-related stuff would be in the user-level reader. I think that could help to organize some of this better, in addition to reducing the size of the file, which you pointed out in your top-level comment. I'd prefer to wait on reorganization of these variables until implementing the user/system separation, which I'm not prepared to do as part of this initial PR.

Contributor:
Non-blocking concern: I would expect a load factor of 1 to waste less memory but provide somewhat slower lookup times. We should run some ion-java-benchmark-cli tests to see what the performance difference between the default (.75?) and 1 is.

Contributor Author:
The main thing I want to avoid is a guaranteed growth. As an alternative to a load factor of 1, we can use the default but inflate the initial size beyond what we need. Here's what I have:

mapView = new HashMap<String, Integer>((int) Math.ceil(numberOfSymbols / 0.75), 0.75f);

Contributor:
That works for me. 👍

@zslayton (Contributor) left a comment:
Comments for IonReaderBuilder.

Comment on lines 12 to 22
/**
* If possible, fills the input pipe with enough bytes to enable one more successful call to
* {@link IonReader#next()} on the non-incremental reader attached to the pipe. If not enough bytes were available
* in the raw input stream to complete the next top-level user value, calling {@link #moreDataRequired()} after this
* method returns will return `true`. In this case, this method must be called again before calling
* `IonReader.next()` to position the reader on this value. Otherwise, `moreDataRequired()` will return `false` and
* a call to `IonReader.next()` may be made to position the reader on this value. Implementations may throw
* `IonException` if invalid Ion data is detected. Any exceptions thrown when invoking {@link BufferEventHandler}
* methods will be propagated. Implementations may define additional exceptional cases.
* @throws Exception if thrown by a handler method or if an IOException is thrown by the underlying InputStream.
*/
Contributor:
References to IonReader methods and BufferEventHandler seem out of place at this layer of abstraction, or we have opportunities to reduce abstraction since ReaderLookaheadBufferBase is the only class that implements this interface.

Contributor Author:
I agree that the comment about BufferEventHandler should be moved to ReaderLookaheadBufferBase. Done.

I think the IonReader references should remain, however; this interface is inextricably linked to IonReader (hence the newIonReader method).

Regarding why the ReaderLookaheadBuffer interface and the ReaderLookaheadBufferBase abstract class both exist, that's a good callout. It's because I separately implemented some incremental reader support for text that I'm not proposing in this PR, but may later. I ended up defining a class that extends ReaderLookaheadBufferBase capable of incrementally reading newline-delimited Ion text. I also defined a "no-op lookahead buffer" that trivially implements the ReaderLookaheadBuffer and allows the user to use the same underlying machinery for blocking use cases. That satisfied me that the current factoring probably has some value, but I'm open to suggestions.

@jobarr-amzn (Contributor) left a comment:
Commenting on the state machine in IonReaderLookaheadBuffer- I also would like to see this more willing to re-parse because my intuition is that it would be simpler to read and not have a meaningful performance cost.

At the extreme, if each call to moreDataRequired() were completely independent (i.e. if a previous call had to return true, the next call would have to re-read every byte from the value starting from the type ID), would that allow us to simplify the state machine? And I came to the conclusion that I don't think it would. We'd still have a loop where we progress through states, with certain states sometimes being skipped based on the outcome of previous states (e.g. if the value is unannotated)

Conceptually we're moving through a state machine, but any computational process can be modeled that way. I think that explicit modeling is getting in the way here more than it is helping, partially because the relevant logic isn't isolated by state. Instead we have a different sort of cross section that cuts across them. At least that's how it feels to me.

More generally I'm trying to generate a positive recommendation for the IonReaderBinaryIncremental, IonReaderLookaheadBuffer, and ResizingPipedInputStream set of classes. These three don't feel well-defined, they bleed into each other and don't make as much sense separately. You see that from the way documentation and comments reference the internals of other classes, or super types reference properties or uses of their subtypes. At present I don't have a helpful clarification to suggest, rather I'm saying I'd be happy if there was one.

protected int expectedLocalNullSlotSymbolId() {
    // The spec allows for implementations to treat these malformed symbols as "null slots", and all "null slots"
    // in local symbol tables as equivalent to symbol zero.
    return getStreamingMode() == StreamingMode.NEW_STREAMING_INCREMENTAL ? 0 : 10;
Contributor:
Why is this? I did a little poking around and don't get it yet. Is this a test case convention? I see this in more places.

Contributor Author:
The spec says "Any SIDs that refer to null slots in a local symbol table are equivalent to symbol zero." However, unlike the new implementation, the existing binary reader implementation does not actually assign null slots symbol ID 0. It's a larger effort to fix this in the old implementation, and is left out of scope for this PR (see #151 ).

10 is hardcoded here because the method is only called in places where the null slot is at symbol ID 10; there’s nothing special about 10 with respect to null slots.

Contributor:
there’s nothing special about 10 with respect to null slots.

That's more what I was getting at. I had a hard time finding the significance of 10, in particular.

These tests and their data have behavior and specification split up in ways that make it difficult as a reader to assemble. I'd like to see a different factoring that lowers the barrier to entry.

Nothing to do here, necessarily, except that while raising 10 as a special value in these tests does exercise a behavior of the new reader that performs to spec, I'd rather see that behavior demonstrated in a specific and isolated test so that I don't have to drag the knowledge of "10" across boundaries.

@jobarr-amzn (Contributor) left a comment:
I've now reviewed everything except I need to look through tests for IonReaderBinaryIncremental, IonReaderLookaheadBuffer*, and ResizingPipedInputStream.

Comment on lines +181 to +199
@Test
public void testReadingDecimalAsInteger()
{
    // decimal value <space> int conversion
    read(
        "0. 0 " +
        "-2147483648. -2147483648 " + // Min int
        "2147483647. 2147483647 " // Max int
    );
    while (in.next() != null)
    {
        int actual = in.intValue();

        in.next();
        int expected = in.intValue();

        assertEquals(expected, actual);
    }
}
Contributor:
I know it didn't originate with you and you are copying the conventions of this file, but this is a bizarre pattern. To justify it, the sequencing of values in a single stream would have to be important, or reading a new document would have to be so expensive that it justified strange tests to make the runtime friendlier. I don't think either of those is true, so it would be better to have a few methods factored out:

  • readInt(String text)
  • readLong(String text)
  • etc.

Then you'd be able to write this as e.g.

    int readInt(String text) {
        read(text);
        if (in.next() == null) {
            fail("Expected to read an integer");
        }
        return in.intValue();
    }

    @Test
    public void testReadingDecimalAsInteger()
    {
        // decimal value <space> int conversion
        assertEquals(readInt("0."), readInt("0"));
        assertEquals(readInt("-2147483648."), readInt("-2147483648")); // Min int
        assertEquals(readInt("2147483647."), readInt("2147483647")); // Max int
    }

This can get even cleaner in Java 8, where I'd advocate writing a custom assertion e.g. assertEquals(IonReader::intValue, "0.", "0") so that you can reduce and reuse further. But as it stands, the pattern above at least avoids globbing test cases together.

This needn't be part of this review, but even one example would give a better template for future expansions here.

Contributor:
Here are all of these XasY tests rewritten to avoid this while idiom: https://gist.github.com/jobarr-amzn/0dd822b2fd7fd6ef9258bb95e43bd3c2

Contributor Author:
I agree that it's an awkward pattern, and I like your suggestion. I've opened #380 for this.

Contributor:
In case you missed it, this is completely implemented in that gist. I'll add a comment in #380, maybe just fire out another PR after this one is shipped then.

Contributor Author:
I didn't miss it; I just don't want to further inflate this PR. A follow-on PR would be great.

import static org.junit.Assert.assertNull;
import static org.junit.Assert.assertTrue;

public class IonReaderBinaryIncrementalTest {
Contributor:
Helper methods and classes would reduce a lot of the repetition here and make it read more easily.

That could be custom assertions like assertString(IonReader r, String s):

void assertString(IonReader r, String s) {
    assertEquals(IonType.STRING, r.next());
    assertEquals(s, r.stringValue());
}

Or composable Matchers, helper methods for common setup, etc. Lambdas etc. would also help if they were available 🤷🏻‍♂️.

Contributor:
Returning to this comment- the IonValue-based assertion helpers I proposed for IonReaderLookAheadBufferTestBase would help here.

Contributor Author:
I've added some helper methods to reduce repetition in both IonReaderBinaryIncrementalTest and IonReaderLookaheadBufferTest.

Comment on lines +630 to +647
new IonReaderAssertion() {
    @Override
    public void evaluate(IonReader reader) {
        assertEquals(IonType.LIST, reader.next());
        reader.stepIn();
        assertEquals(IonType.STRUCT, reader.next());
        reader.stepIn();
        assertEquals(IonType.SYMBOL, reader.next());
        assertEquals("foo", reader.getFieldName());
        assertEquals("bar", reader.stringValue());
        reader.stepOut();
        assertEquals(IonType.SEXP, reader.next());
        assertEquals(IonType.INT, reader.next());
        assertEquals(123, reader.intValue());
        assertNull(reader.next());
        reader.stepOut();
    }
}
Contributor:
Given some helper methods, you could write this like this:

                a_list(
                        a_struct(a_field("foo", a_symbol("bar"))),
                        a_sexp(a_symbol("baz"), a_symbol("bar")),
                        an_integer(123)
                )

The whole test file written this way reduces it by ~200 lines and is easier to read and write at the expense of slightly increased difficulty of debugging failures. Here's that refactor: https://gist.github.com/jobarr-amzn/7e4fc2fbc91cb31524ff241458849f72

It needs this infrastructure (~200 LOC ;) ) to make an IonReaderAssertion that compares an IonValue to an IonReader: https://gist.github.com/jobarr-amzn/26f5f58f71715a34c50e4cc1101b3ab5

Given that this file is not the only place we make structural/content assertions around IonReaders, I think infrastructure like the above could be useful.

Contributor Author:
I appreciate how succinct this proposal makes the tests themselves. However, it goes to great lengths to hide the verbosity of the IonReader API--but that's the API we're testing. We should expect some verbosity in tests that test a verbose API.

I used to try to make my tests as DRY and concise as possible. After having to actually use those tests enough times (and with enough time elapsed since writing them), I found that this almost always made them a pain to debug and required more of a time investment to re-familiarize myself with because I had to re-learn what all the helper methods did.

Now, in tests, I always value debuggability above all else. Especially in a code base with as many tests and different authors as ion-java, most developers who encounter a unit test failure will be unfamiliar with that particular test. If such a developer gets a failure in one of these tests, where do they put a breakpoint? Under the proposal, it's not even in the same file; it's in a recursive assertion helper method. As the code is currently factored, the assertions are right there in the test and a breakpoint can quickly be placed before any of them without having to do a bunch of digging.

I know there are different philosophies for how to write tests, and I'm not necessarily asserting that mine is the correct one; I'm open to being convinced that it isn't. In the meantime, I'd like to move forward with the current factoring and explore ways to reduce boilerplate without sacrificing debuggability.

Contributor:
FWIW I agree about the value of debugging. The proposed recursive helper method definitely makes debugging more difficult. I do think though that this is not a necessary trade-off. More investment in IonValue-based structural assertions could make debugging straightforward while keeping tests easy to read.

I don't expect that anyone should use the IonReader API the way that the tests do- though I'm ready to be wrong on that one.

In the meantime, I'd like to move forward with the current factoring and explore ways to reduce boilerplate without sacrificing debuggability.

Sounds good to me. I'll keep this general approach in my pocket and see what I can do to improve it at a later date. Better failure reporting would make all the difference here.

Contributor Author:
I don't expect that anyone should use the IonReader API the way that the tests do- though I'm ready to be wrong on that one.

I think the tests do some things that a user probably wouldn't often do (e.g. call next() multiple times at the end of a container), but generally I think they fairly accurately depict how IonReader would be used to traverse values. Maybe I'm not fully understanding this statement. Can you give an example?

Contributor:
Sure- in these tests we're making a series of calls to verify arbitrary data, including things like the specific ordering of fields in a struct. That sort of specificity is appropriate when verifying the functioning of a parser, but as a user of ion-java I wouldn't expect to do this. Instead I'd be using higher-level abstractions to guide my use of the library.

That might be ISL, it might be native objects with some other serialization constraints, it might be some other framework, but whatever it is I don't expect to do or see this kind of unrolled/decompiled sequence of calls.

@tgregg (Contributor Author) commented Aug 20, 2021

(Quotes from @jobarr-amzn's comments)

Commenting on the state machine in IonReaderLookaheadBuffer- I also would like to see this more willing to re-parse because my intuition is that it would be simpler to read and not have a meaningful performance cost.

At the extreme, if each call to moreDataRequired() were completely independent (i.e. if a previous call had to return true, the next call would have to re-read every byte from the value starting from the type ID), would that allow us to simplify the state machine? And I came to the conclusion that I don't think it would. We'd still have a loop where we progress through states, with certain states sometimes being skipped based on the outcome of previous states (e.g. if the value is unannotated)

Conceptually we're moving through a state machine, but any computational process can be modeled that way. I think that explicit modeling is getting in the way here more than it is helping, partially because the relevant logic isn't isolated by state. Instead we have a different sort of cross section that cuts across them. At least that's how it feels to me.

I've laid out my reasoning on this topic in this comment. It would be great to have some of those points refuted specifically so that I can identify where our understanding diverges. In particular, I would like an answer to the following question: if the lookahead buffer were implemented such that it would re-parse data on each call to moreDataRequired(), how would that work for oversized values, whose data may have been consumed from the input but never buffered?

More generally I'm trying to generate a positive recommendation for the IonReaderBinaryIncremental, IonReaderLookaheadBuffer, and ResizingPipedInputStream set of classes. These three don't feel well-defined, they bleed into each other and don't make as much sense separately. You see that from the way documentation and comments reference the internals of other classes, or super types reference properties or uses of their subtypes. At present I don't have a helpful clarification to suggest, rather I'm saying I'd be happy if there was one.

I'm going to push on this one a little bit; maybe that will help us arrive at something actionable.

First, can you please give examples of "super types reference properties or uses of their subtypes"?

Now let's take the classes one at a time.

ResizingPipedInputStream

  • I'm not seeing any documentation/comments that reference the other classes listed.
  • There are examples in the tests of this class being used completely independently of the other two for making data incrementally available via the InputStream interface. See, e.g., IonReaderBinaryIncrementalTest.incrementalRead, where the ResizingPipedInputStream pipe is the InputStream that supplies data to the reader. Only the public interface is required.
  • There are some non-public methods available that allow more direct operation on the buffered data. Would you prefer that these methods be defined by an interface? Right now ResizingPipedInputStream only extends InputStream, but if we think it's necessary we could define up to two new interfaces that define the methods for 1) adding data to the buffer (the receive methods) and 2) retrieving data outside of the InputStream interface. In the other classes, a reference to only the most specific interface(s) required would be stored and used.

IonReaderLookaheadBuffer

  • This class is expected to be used with an IonReader, but that reader need not be IonReaderBinaryIncremental. There are two ways to use it: from within an incremental reader implementation like IonReaderBinaryIncremental, or as a direct monitor over an InputStream that provides Ion data. In the latter case, any IonReader implementation that can be created by IonReaderBuilder may be used to traverse the data buffered by IonReaderLookaheadBuffer, and only the public interface methods are needed. In the former case, there are non-public accessors available that allow the reader implementation to retrieve information about the top-level values that the lookahead buffer has already parsed. This is a key performance optimization, but it's not strictly required; the public interface is sufficient for building an IonReader on top of the lookahead buffer. Would you prefer there to be an additional interface that defines all of the non-public accessors available to internal reader implementations? This is a pattern we've followed elsewhere in the library (e.g. _Private_IonWriter, which extends the public IonWriter interface with methods for internal use).
  • Are there specific cases where you think the documentation/comments in this class violate the abstraction? Nothing stuck out to me.

IonReaderBinaryIncremental

  • This class uses the non-public methods provided by the other two classes, as discussed above. If we were to add new interfaces that define these non-public methods, then we could achieve the following: 1) only use the public methods of the lookahead buffer via the lookahead field. 2) Keep a separate reference to the lookahead buffer's new interface that allows access to information about top-level values. 3) Change the buffer field from ResizingPipedInputStream to the new interface that defines the methods that can be performed on the underlying buffer.

@jobarr-amzn (Contributor) commented:
This was a massive undertaking, and it's all functional code. Inactionable comments don't fit here. Please ship it, and in the event that anyone ever has a concrete and actionable improvement to anything in the future we can talk about it then.


First, can you please give examples of "super types reference properties or uses of their subtypes"?

I misspoke and said something incorrect, I apologize. Let me try again.

I was thinking of the way IonReaderBinaryIncremental used the ResizingPipedInputStream belonging to its IonReaderLookaheadBuffer, but I was tired and said subtype/supertype (IS A when I was thinking HAS A). This is ground we've already covered, so no need to get into it again.

I have a friend who when building a gaming PC knew his tolerances well enough to strip the heat sink from one component and share the larger heat sink of another, and to remove the casing from the power supply to make another component placement possible. It worked- the rig outperforms anything I've owned and fits more power per dollar per space in the case than the individual components suggested ought to be possible. The abstraction violation in the three classes above is intentional but not gratuitous- my discomfort is not an indicator that the solution seen here is not achieving what it sets out to do.

I do not think that additional interfaces on the lookahead buffer or resizing pipe would be more clear than the current state, so please keep it as is.

jobarr-amzn previously approved these changes Sep 6, 2021
zslayton previously approved these changes Sep 7, 2021
…hat the option may apply to text readers if support is added in the future.
@tgregg tgregg dismissed stale reviews from zslayton and jobarr-amzn via a1a270d September 7, 2021 23:53
@tgregg tgregg merged commit a398b15 into master Sep 8, 2021
@tgregg tgregg deleted the incremental-reader-merge branch September 8, 2021 00:04