Read all row group metadata only when position column is projected#1716
Read all row group metadata only when position column is projected#1716rdblue merged 2 commits intoapache:masterfrom
Conversation
| } | ||
|
|
||
| long rowPosition = rowGroupsStartRowPos[nextRowGroup]; | ||
| model.setPageSource(pages, rowGroupsStartRowPos == null ? 0 : rowGroupsStartRowPos[nextRowGroup]); |
There was a problem hiding this comment.
The rowPosition will be ignored if the position column is not projected.
There was a problem hiding this comment.
I don't think this file should change. If the determination not to use correct start positions in made in ReadConf, then it should just set the position to 0 there. This can use whatever rowGroupStartRowPos contains.
| BlockMetaData rowGroup = rowGroups.get(i); | ||
| startRowPositions[i] = offsetToStartPos.get(rowGroup.getStartingPos()); | ||
| if (offsetToStartPos != null) { | ||
| startRowPositions[i] = offsetToStartPos.get(rowGroup.getStartingPos()); |
There was a problem hiding this comment.
I think this should use offsetToStartPos.getOrDefault(0L).
| Map<Long, Long> offsetToStartPos = null; | ||
| if (expectedSchema.findField(MetadataColumns.ROW_POSITION.fieldId()) != null) { | ||
| offsetToStartPos = generateOffsetToStartPos(); | ||
| this.startRowPositions = new long[rowGroups.size()]; |
There was a problem hiding this comment.
I think this should be defined for all cases so that ParquetReader doesn't need to check whether it is defined. In that case, it should still be a final variable.
The only change for this PR should be whether offsetToStartPos is an empty map or the result of generateOffsetToStartPos(). We may also want to do the ROW_POSITION check in that method instead of here, depending on what is cleaner.
There was a problem hiding this comment.
Make sense to me. Updated.
|
Looks good now. Thanks @chenjunjiedada! |
This is used to avoid reading all row group metadata if the position metadata column is not projected.