-
Notifications
You must be signed in to change notification settings - Fork 3k
Parquet: Add row position reader #1254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java
Outdated
Show resolved
Hide resolved
spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java
Outdated
Show resolved
Hide resolved
df20b0b to
9b131c3
Compare
|
This looks good to me! Would be great if someone familiar with Parquet could take a second pass. |
|
|
||
| void setPageSource(PageReadStore pageStore); | ||
|
|
||
| default void setRowOffsetForRowGroup(long position) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add the position to the page source? Then the two operations are tied together: the row offset is the start offset for the new pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change to that. Just one thing that do we mind to change the function signature in the public API?
|
|
||
| public static ParquetValueReader<Long> position() { | ||
| return new PositionReader(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to the top of the file with the other factory methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| return offsetToStartRowPosMap; | ||
| } | ||
|
|
||
| long[] getRowGroupsStartRowPos() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about naming this startPositions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startPositions may confuse with rowGroup.startingPosition, how about startRowPosititions?
| return shouldSkip; | ||
| } | ||
|
|
||
| private Map<Long, Long> generateRowGroupsStartRowPos() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this separately read the Parquet file to create a map that is used to initialize an array, when the starting position could be set for the array in the existing loop? I don't think this method is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing loop of row groups is based on the row groups that had been filtered with options. So we need to read the Parquet file without any filter to get each starting row position of row group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Good catch!
Can you add some comments to explain why this is needed for later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| static class PositionReader implements ParquetValueReader<Long> { | ||
| private long rowOffsetInCurrentRowGroup = -1; | ||
| private long rowGroupRowOffsetInFile; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, try to be specific with names, but avoid unnecessary context. In this case, these names can be simpler: rowGroupStart and rowOffset would work fine. Extra context like InFile and InCurrent aren't adding clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| Assert.assertFalse("Should not have extra rows", actualRows.hasNext()); | ||
| } finally { | ||
| if (reader != null) { | ||
| reader.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use try-with-resources instead of a finally block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| } | ||
|
|
||
| long[] startRowPositions() { | ||
| return rowGroupsStartRowPos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the same name for the variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NP, Done.
| this.rowGroups = reader.getRowGroups(); | ||
| this.shouldSkip = new boolean[rowGroups.size()]; | ||
|
|
||
| Map<Long, Long> offsetToStartRowPosMap = generateRowGroupsStartRowPos(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about naming this offsetToStartPos and similarly updating the method name? There's no need to include a type in the variable name, usually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done.
| return triplesRead < triplesCount; | ||
| } | ||
|
|
||
| public void setRowPosition(long rowPosition) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding this, can you update setPageSource like the other interface that changed?
| protected long triplesRead = 0L; | ||
| protected long advanceNextPageCount = 0L; | ||
| protected Dictionary dictionary; | ||
| protected long rowPosition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? I don't see any uses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this and setRowPosition are no longer needed.
9ee93a0 to
07bb9fb
Compare
|
+1 Thanks @chenjunjiedada, it looks good now. Nice work catching that the metadata was already filtered using the file range, too. |
| this.shouldSkip = new boolean[rowGroups.size()]; | ||
|
|
||
| // Fetch all row groups starting positions to compute the row offsets of the filtered row groups | ||
| Map<Long, Long> offsetToStartPos = generateOffsetToStartPos(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just occurred to me (after merging this) that we may want to make this lazy, like we do in Avro. That way if the row positions are never used, we don't incur the cost of reading the footer another time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used to think to apply Caffeine cache this. Let me think about this again and also check what Avro does. I will update this in follow up vectorization code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenjunjiedada yes we should make this lazy , do you have issue to track improvements to existing logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds position reader for parquet readers.
TODO: