Parquet: Add row position reader #1254

chenjunjiedada · 2020-07-27T12:32:13Z

This adds a row position reader for the Parquet readers.

TODO:

Add row position reader for the vectorized reader.

spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java

parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java

shardulm94 · 2020-07-28T07:36:35Z

This looks good to me! Would be great if someone familiar with Parquet could take a second pass.

rdblue · 2020-07-28T21:29:24Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReader.java


  void setPageSource(PageReadStore pageStore);
+
+  default void setRowOffsetForRowGroup(long position) {}


Why not add the position to the page source? Then the two operations are tied together: the row offset is the start offset for the new pages.

I can change it to that. One question: do we mind changing a function signature in the public API?
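A minimal sketch of the suggestion above, using stand-in types rather than the real Parquet classes. `PageStore`, `ValueReader`, and `PositionValueReader` are hypothetical names for illustration only, and this is not necessarily the signature that was merged. Passing the row position together with the page source means a reader cannot receive new pages without also learning where they start.

```java
// Hypothetical stand-in for Parquet's PageReadStore, reduced to what the sketch needs.
interface PageStore {
  long rowCount();
}

// Hypothetical reader interface: the page source and its starting row position
// are always supplied in the same call.
interface ValueReader<T> {
  T read();

  void setPageSource(PageStore pageStore, long rowPosition);
}

// Returns the file-level position of each row: row group start plus offset in the group.
class PositionValueReader implements ValueReader<Long> {
  private long rowGroupStart = 0L;
  private long rowOffset = -1L;

  @Override
  public Long read() {
    rowOffset += 1;
    return rowGroupStart + rowOffset;
  }

  @Override
  public void setPageSource(PageStore pageStore, long rowPosition) {
    // A new row group resets the offset and records where the group starts in the file.
    this.rowGroupStart = rowPosition;
    this.rowOffset = -1L;
  }
}
```

With this shape, a caller advancing to the next row group makes one call, and the position reader resets itself as part of it.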

rdblue · 2020-07-28T21:31:25Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java

+
+  public static ParquetValueReader<Long> position() {
+    return new PositionReader();
+  }


Can you move this to the top of the file with the other factory methods?

rdblue · 2020-07-28T21:32:44Z

parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java

+    return offsetToStartRowPosMap;
+  }
+
+  long[] getRowGroupsStartRowPos() {


How about naming this startPositions?

startPositions may be confused with rowGroup.startingPosition. How about startRowPositions?

rdblue · 2020-07-28T21:35:09Z

parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java

    return shouldSkip;
  }

+  private Map<Long, Long> generateRowGroupsStartRowPos() {


Why does this separately read the Parquet file to create a map that is used to initialize an array, when the starting position could be set for the array in the existing loop? I don't think this method is needed.

The existing loop iterates over row groups that have already been filtered by the read options, so we need to read the Parquet file without any filter to get the starting row position of each row group.

You're right. Good catch!

Can you add some comments to explain why this is needed for later?
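To illustrate why the unfiltered row groups are needed, here is a hedged sketch with hypothetical stand-in types (`RowGroupInfo` and `StartRowPositions` are not Iceberg or Parquet classes). The starting row position of a row group is the sum of the row counts of all row groups before it in the file, so accumulating over a filtered list alone would give wrong positions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for row group metadata: the byte offset where the row
// group starts in the file and the number of rows it contains.
class RowGroupInfo {
  final long startingPos;
  final long rowCount;

  RowGroupInfo(long startingPos, long rowCount) {
    this.startingPos = startingPos;
    this.rowCount = rowCount;
  }
}

class StartRowPositions {
  // Walks ALL row groups in file order and maps byte offset -> first row position.
  static Map<Long, Long> offsetToStartPos(List<RowGroupInfo> allRowGroups) {
    Map<Long, Long> result = new HashMap<>();
    long rowsSoFar = 0L;
    for (RowGroupInfo rowGroup : allRowGroups) {
      result.put(rowGroup.startingPos, rowsSoFar);
      rowsSoFar += rowGroup.rowCount;
    }
    return result;
  }

  // Looks up the start row position of each row group that survived filtering.
  // Assumes every filtered row group is present in the map.
  static long[] startRowPositions(
      List<RowGroupInfo> filteredRowGroups, Map<Long, Long> offsetToStartPos) {
    long[] positions = new long[filteredRowGroups.size()];
    for (int i = 0; i < positions.length; i++) {
      positions[i] = offsetToStartPos.get(filteredRowGroups.get(i).startingPos);
    }
    return positions;
  }
}
```

For example, if a file has row groups of 100, 50, and 25 rows and the filter drops the first one, the remaining two must start at positions 100 and 150, not 0 and 50.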

rdblue · 2020-07-28T21:39:34Z

parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java


+  static class PositionReader implements ParquetValueReader<Long> {
+    private long rowOffsetInCurrentRowGroup = -1;
+    private long rowGroupRowOffsetInFile;


In general, try to be specific with names, but avoid unnecessary context. In this case, these names can be simpler: rowGroupStart and rowOffset would work fine. Extra context like InFile and InCurrent isn't adding clarity.

rdblue · 2020-07-28T21:45:23Z

spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java

+      Assert.assertFalse("Should not have extra rows", actualRows.hasNext());
+    } finally {
+      if (reader != null) {
+        reader.close();


Why not use try-with-resources instead of a finally block?
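As a sketch of what the try-with-resources form could look like: `RowSource` and `ReadAllRows` below are hypothetical stand-ins for the closeable reader used in the test, not the test's actual types. The resource is closed automatically when the block exits, whether normally or through an exception, so the null check and `finally` block are no longer needed.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical closeable row source standing in for the reader in the test.
class RowSource implements Closeable, Iterable<Long> {
  private final List<Long> rows;
  private boolean closed = false;

  RowSource(Long... rows) {
    this.rows = Arrays.asList(rows);
  }

  @Override
  public Iterator<Long> iterator() {
    return rows.iterator();
  }

  @Override
  public void close() {
    closed = true;
  }

  boolean isClosed() {
    return closed;
  }
}

class ReadAllRows {
  // Sums all rows; the source is closed when the try block exits.
  static long sum(RowSource source) {
    try (RowSource reader = source) {
      long total = 0L;
      for (long row : reader) {
        total += row;
      }
      return total;
    }
  }
}
```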

rdblue · 2020-07-30T00:19:26Z

parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java

+  }
+
+  long[] startRowPositions() {
+    return rowGroupsStartRowPos;


Can we use the same name for the variable?

rdblue · 2020-07-30T00:20:21Z

parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java

    this.rowGroups = reader.getRowGroups();
    this.shouldSkip = new boolean[rowGroups.size()];

+    Map<Long, Long> offsetToStartRowPosMap = generateRowGroupsStartRowPos();


How about naming this offsetToStartPos and similarly updating the method name? There's no need to include a type in the variable name, usually.

rdblue · 2020-07-30T00:27:56Z

parquet/src/main/java/org/apache/iceberg/parquet/BaseColumnIterator.java

    return triplesRead < triplesCount;
  }

+  public void setRowPosition(long rowPosition) {


Instead of adding this, can you update setPageSource like the other interface that changed?

rdblue · 2020-07-30T00:28:19Z

parquet/src/main/java/org/apache/iceberg/parquet/BaseColumnIterator.java

  protected long triplesRead = 0L;
  protected long advanceNextPageCount = 0L;
  protected Dictionary dictionary;
+  protected long rowPosition;


Is this needed? I don't see any uses.

Right, this and setRowPosition are no longer needed.

rdblue · 2020-07-31T01:13:37Z

+1

Thanks @chenjunjiedada, it looks good now. Nice work catching that the metadata was already filtered using the file range, too.

rdblue · 2020-07-31T01:15:08Z

parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java

    this.shouldSkip = new boolean[rowGroups.size()];

+    // Fetch all row groups starting positions to compute the row offsets of the filtered row groups
+    Map<Long, Long> offsetToStartPos = generateOffsetToStartPos();


It just occurred to me (after merging this) that we may want to make this lazy, like we do in Avro. That way if the row positions are never used, we don't incur the cost of reading the footer another time.

I had considered applying a Caffeine cache to this. Let me think about it again and also check what Avro does. I will update this in the follow-up for the vectorized code path.

@chenjunjiedada yes, we should make this lazy. Do you have an issue to track improvements to the existing logic?

@sudssf, yes, I have a PR: #1356. It reads the file only when the required schema contains the position column.

if (expectedSchema.findField(MetadataColumns.ROW_POSITION.fieldId()) != null) {
  // Only read footer when needed
  offsetToStartPos = generateOffsetToStartPos();
}
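Another way to get the laziness discussed in this thread is to defer the footer read until the positions are first requested. This is only a sketch under assumptions: `LazyStartPositions` is a hypothetical helper, the `Supplier` stands in for whatever reads the footer, and the class is not thread-safe.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical holder that runs the (expensive) footer read at most once,
// and only if the start positions are actually requested.
class LazyStartPositions {
  private final Supplier<Map<Long, Long>> footerReader;
  private Map<Long, Long> offsetToStartPos = null;

  LazyStartPositions(Supplier<Map<Long, Long>> footerReader) {
    this.footerReader = footerReader;
  }

  Map<Long, Long> get() {
    if (offsetToStartPos == null) {
      // First use: read the footer and remember the result.
      offsetToStartPos = footerReader.get();
    }
    return offsetToStartPos;
  }
}
```

If no reader ever asks for row positions, the supplier is never invoked and the extra footer read is avoided.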

Parquet: Add row position reader

8bb39c3

shardulm94 reviewed Jul 27, 2020

shardulm94 reviewed Jul 28, 2020

address comments

9b131c3

chenjunjiedada force-pushed the add-row-position-reader branch from df20b0b to 9b131c3 on July 28, 2020 05:20

rdblue reviewed Jul 28, 2020

address comments

7629344

rdblue reviewed Jul 30, 2020

chenjunjiedada force-pushed the add-row-position-reader branch from 9ee93a0 to 07bb9fb on July 30, 2020 03:32

fix naming and remove useless variable

07bb9fb

rdblue merged commit 7060c92 into apache:master Jul 31, 2020

rdblue reviewed Jul 31, 2020

rdblue mentioned this pull request Aug 5, 2020

Add _file and _pos metadata columns to Parquet readers #1020

Closed

cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020

Parquet: Add row position reader (apache#1254)

92e9241

chenjunjiedada mentioned this pull request Jun 29, 2022

PARQUET-2161: Fix row index generation in combination with range filtering apache/parquet-java#978

Merged

szehon-ho pushed a commit to szehon-ho/iceberg that referenced this pull request Sep 16, 2024

Internal: Follow-up for Comet Iceberg integration (apache#1254)

2466ea1

