Skip to content

Conversation

@chenjunjiedada
Copy link
Collaborator

This adds position reader for parquet readers.

TODO:

  1. Add row position reader for the vectorized reader.

@chenjunjiedada chenjunjiedada force-pushed the add-row-position-reader branch from df20b0b to 9b131c3 Compare July 28, 2020 05:20
@shardulm94
Copy link
Contributor

This looks good to me! Would be great if someone familiar with Parquet could take a second pass.


void setPageSource(PageReadStore pageStore);

default void setRowOffsetForRowGroup(long position) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add the position to the page source? Then the two operations are tied together: the row offset is the start offset for the new pages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change to that. Just one thing that do we mind to change the function signature in the public API?


public static ParquetValueReader<Long> position() {
return new PositionReader();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the top of the file with the other factory methods?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return offsetToStartRowPosMap;
}

long[] getRowGroupsStartRowPos() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about naming this startPositions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startPositions may confuse with rowGroup.startingPosition, how about startRowPosititions?

return shouldSkip;
}

private Map<Long, Long> generateRowGroupsStartRowPos() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this separately read the Parquet file to create a map that is used to initialize an array, when the starting position could be set for the array in the existing loop? I don't think this method is needed.

Copy link
Collaborator Author

@chenjunjiedada chenjunjiedada Jul 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing loop of row groups is based on the row groups that had been filtered with options. So we need to read the Parquet file without any filter to get each starting row position of row group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Good catch!

Can you add some comments to explain why this is needed for later?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


static class PositionReader implements ParquetValueReader<Long> {
private long rowOffsetInCurrentRowGroup = -1;
private long rowGroupRowOffsetInFile;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, try to be specific with names, but avoid unnecessary context. In this case, these names can be simpler: rowGroupStart and rowOffset would work fine. Extra context like InFile and InCurrent aren't adding clarity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Assert.assertFalse("Should not have extra rows", actualRows.hasNext());
} finally {
if (reader != null) {
reader.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use try-with-resources instead of a finally block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

long[] startRowPositions() {
return rowGroupsStartRowPos;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the same name for the variable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NP, Done.

this.rowGroups = reader.getRowGroups();
this.shouldSkip = new boolean[rowGroups.size()];

Map<Long, Long> offsetToStartRowPosMap = generateRowGroupsStartRowPos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about naming this offsetToStartPos and similarly updating the method name? There's no need to include a type in the variable name, usually.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

return triplesRead < triplesCount;
}

public void setRowPosition(long rowPosition) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding this, can you update setPageSource like the other interface that changed?

protected long triplesRead = 0L;
protected long advanceNextPageCount = 0L;
protected Dictionary dictionary;
protected long rowPosition;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? I don't see any uses.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this and setRowPosition are no longer needed.

@chenjunjiedada chenjunjiedada force-pushed the add-row-position-reader branch from 9ee93a0 to 07bb9fb Compare July 30, 2020 03:32
@rdblue
Copy link
Contributor

rdblue commented Jul 31, 2020

+1

Thanks @chenjunjiedada, it looks good now. Nice work catching that the metadata was already filtered using the file range, too.

@rdblue rdblue merged commit 7060c92 into apache:master Jul 31, 2020
this.shouldSkip = new boolean[rowGroups.size()];

// Fetch all row groups starting positions to compute the row offsets of the filtered row groups
Map<Long, Long> offsetToStartPos = generateOffsetToStartPos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just occurred to me (after merging this) that we may want to make this lazy, like we do in Avro. That way if the row positions are never used, we don't incur the cost of reading the footer another time.

Copy link
Collaborator Author

@chenjunjiedada chenjunjiedada Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used to think to apply Caffeine cache this. Let me think about this again and also check what Avro does. I will update this in follow up vectorization code path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenjunjiedada yes we should make this lazy , do you have issue to track improvements to existing logic?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sudssf , Yes, I have a PR: #1356. It reads the file when the required schema contains position column.

    if (expectedSchema.findField(MetadataColumns.ROW_POSITION.fieldId()) != null) {
      // Only read footer when needed
      offsetToStartPos = generateOffsetToStartPos();
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants