PARQUET-2117: Expose Row Index via ParquetReader and ParquetRecordReader #945
Conversation
Force-pushed 8fbd061 to 37faea9
parquet-column/src/main/java/org/apache/parquet/column/page/PageReadStore.java
@shangxinli @gszadovszky Please review the changes when you get a chance. Thanks!
Can you squash the commits to make the review easier?
Force-pushed 37faea9 to 0fbc608
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
@Override
public Optional<Long> getRowIndexOffset() {
  return Optional.of(rowIndexOffset);
If the constructor caller cannot have a valid rowIndexOffset, I guess we need to provide an option to return empty.
done.
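For context, a minimal sketch of what was addressed here, assuming a page-read-store-like class with an extra constructor for callers that have no valid offset; the class and field names are illustrative, not the exact code merged in this PR:

```java
import java.util.Optional;

class RowGroupPageStoreSketch {
  // -1 marks "offset unknown", for callers that cannot supply a valid value (assumption).
  private final long rowIndexOffset;

  RowGroupPageStoreSketch() {
    this(-1); // caller has no valid rowIndexOffset
  }

  RowGroupPageStoreSketch(long rowIndexOffset) {
    this.rowIndexOffset = rowIndexOffset;
  }

  public Optional<Long> getRowIndexOffset() {
    // Return empty instead of wrapping a meaningless offset.
    return rowIndexOffset < 0 ? Optional.empty() : Optional.of(rowIndexOffset);
  }
}
```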
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
 * Returns the ROW_INDEX of the current row.
 */
public long getCurrentRowIndex() {
  if (current == 0L) {
What is the reason for not returning -1?
current is an existing variable which tracks the number of rows already processed. It is initialized to 0 at declaration time. So here we are checking whether it is still 0, which means we haven't processed any row yet.
I understand why we are checking 'current == 0L'. I was asking why you chose to throw an exception rather than returning an invalid value. This is a public method. We should have it documented either way you choose.
There was no specific reason for choosing exception over -1.
I have updated it to return -1 and also updated all the public method docs to reflect the same.
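For readers following along, a small sketch of the updated behavior (the `current` counter is the existing field discussed above; the wrapper class and `currentRowIdx` field are illustrative):

```java
class RowIndexTrackerSketch {
  private long current = 0;          // number of rows already processed (starts at 0)
  private long currentRowIdx = -1;   // row index of the last read row

  public long getCurrentRowIndex() {
    if (current == 0L) {
      return -1;                     // no row processed yet, so no valid row index
    }
    return currentRowIdx;
  }
}
```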
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
We need more tests to cover old parquet data that doesn't have column indexes.
@shangxinli Thanks a lot for the review. I have addressed most of the comments.
I couldn't find any existing tests or existing parquet files in the resources directory which don't have column indexes. Could you please give some pointers to a similar existing test, or some way to create a parquet file without column indexes (I don't see any option to disable writing column indexes either)?
Force-pushed 9fe1bd6 to 470998c
I will have another look sometime this week.
@shangxinli I have added a test to cover old parquet files without column indexes. Please review the changes when you get a chance.
public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata,
    InternalFileDecryptor fileDecryptor, boolean encryptedFooter) throws IOException {
  return fromParquetMetadata(parquetMetadata, fileDecryptor, encryptedFooter, generateRowGroupOffsets(parquetMetadata));
As you mentioned above, if parquetMetadata is a filtered one, then generateRowGroupOffsets() won't return accurate offsets, correct?
Yes, that's correct. Fixed this: now we pass an empty Map so that we don't populate incorrect rowIndexOffsets.
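A sketch of the idea behind the fix, assuming a helper that derives each row group's first row index from the unfiltered footer; when the metadata may have been filtered, an empty map is passed instead so no incorrect offsets get populated. The helper signature here is illustrative, not the exact one in ParquetMetadataConverter:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RowGroupOffsetSketch {
  // Derive the first (file-wide) row index of each row group from its row count.
  // This is only valid when the footer has not been filtered.
  static Map<Integer, Long> generateRowGroupOffsets(List<Long> rowGroupRowCounts) {
    Map<Integer, Long> offsets = new HashMap<>();
    long firstRowIndex = 0;
    for (int ordinal = 0; ordinal < rowGroupRowCounts.size(); ordinal++) {
      offsets.put(ordinal, firstRowIndex);
      firstRowIndex += rowGroupRowCounts.get(ordinal);
    }
    return offsets;
  }

  // For filtered metadata the ordinals no longer line up with the file,
  // so pass an empty map and leave rowIndexOffset unpopulated.
  static Map<Integer, Long> offsetsForFilteredMetadata() {
    return Collections.emptyMap();
  }
}
```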
private long totalByteSize;
private String path;
private int ordinal;
private long rowIndexOffset;
In the following toString(), it should be added too.
done.
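A sketch of the toString() addition the comment asks for, reusing the fields from the snippet above; the exact formatting in the merged code may differ:

```java
class BlockMetaDataSketch {
  private long totalByteSize;
  private String path;
  private int ordinal;
  private long rowIndexOffset;

  @Override
  public String toString() {
    // rowIndexOffset is now included alongside the existing fields.
    return "BlockMetaData{path: " + path + ", totalByteSize: " + totalByteSize
        + ", ordinal: " + ordinal + ", rowIndexOffset: " + rowIndexOffset + "}";
  }
}
```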
try {
  currentValue = recordReader.read();
  if (rowIdxInFileItr != null) {
&& hasNext()?
done.
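A sketch of the suggested guard, assuming `rowIdxInFileItr` is a PrimitiveIterator.OfLong over file-wide row indexes; the wrapper class and method are illustrative:

```java
import java.util.PrimitiveIterator;

class RowIndexAdvanceSketch {
  private long currentRowIdx = -1;

  // Per the review suggestion: only advance when the iterator exists AND still has elements.
  void advance(PrimitiveIterator.OfLong rowIdxInFileItr) {
    if (rowIdxInFileItr != null && rowIdxInFileItr.hasNext()) {
      currentRowIdx = rowIdxInFileItr.nextLong();
    }
  }
}
```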
if (pages.getRowIndexes().isPresent()) {
  rowIdxInRowGroupItr = pages.getRowIndexes().get();
} else {
  // If `pages.getRowIndexes()` is empty, this means column indexing has not triggered.
The name 'column index' is already used for the Page Index feature. Can you use something else?
removed this code comment.
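For context, a minimal sketch of one way the else branch can behave, assuming that when the page store exposes no row indexes (i.e. no rows were filtered out of the row group) the indexes are simply the sequential range 0..rowCount-1; this is an assumption about the fallback, not necessarily the merged code:

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

class RowIndexFallbackSketch {
  static PrimitiveIterator.OfLong rowIndexesFor(
      Optional<PrimitiveIterator.OfLong> fromPages, long rowCountInRowGroup) {
    // Use the indexes provided by the page store when present (rows were filtered);
    // otherwise fall back to the full sequential range for this row group.
    return fromPages.orElseGet(() -> LongStream.range(0, rowCountInRowGroup).iterator());
  }
}
```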
@shangxinli Thanks a lot for the review. I have addressed the review comments. Could you please look into the PR again? Also, could you share when we are planning to do the code freeze for the next minor release? It would be great if we can release this change in the next minor/patch release so that Apache Spark and other projects get to use this functionality sooner.
@shangxinli Gentle reminder. Please take a look when you get a chance.
/**
 * Returns the row index of the last read row. If no row has been processed, returns -1.
Given this is a public method, we need to take care of the Javadoc decorations. Please refer to other methods in this class and follow the same.
fixed.
private static final Path FILE_V1 = createTempFile();
private static final Path FILE_V2 = createTempFile();
private static final List<PhoneBookWriter.User> DATA = Collections.unmodifiableList(makeUsers(10000));
private static final Path STATIC_FILE_WITHOUT_COL_INDEXES = createPathFromCP("/test-file-with-no-column-indexes-1.parquet");
I am not sure if it is a good idea to check in a data file. Can you check if it is possible to stop generating offset index in the current version of Parquet?
@shangxinli It looks like column indexes are always written in the current version of parquet and are not configurable.
We are already testing the new row index support with and without column index filtering being triggered (as part of TestColumnIndexFiltering). Also, the new row index feature doesn't rely on column indexes in any way. So we could skip the backward-compatibility testing and remove this parquet file from resources. What do you think?
I just left some comments. Other than that, it looks good to me. Adding @ggershinsky in case you have time to have a look. Beyond this PR, if the work you are doing in Iceberg/Spark can be done in Parquet, please consider adding it to parquet-mr. With that, it can benefit all the applications that need parquet-mr.
Hi guys, I'm OOO (vacation) this week. I can review it next week if that helps, but feel free to go ahead without waiting for me.
@shangxinli Thanks for taking another look. I have addressed all comments other than one. Please advise on the same. Thanks!
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
Force-pushed f9cacb8 to 19caf21
Thanks for this change. The PR looks good to me now; I'll add my approval after it passes the CI tests.
@prakharjain09 After you fix the CI failures, we can merge.
…eader, also expose the row index via ParquetReader or ParquetRecordReader
- Add and populate rowIndexOffset field in BlockMetaData
- Changes to generate row index in InternalParquetRecordReader, also expose the row index via ParquetReader or ParquetRecordReader
- Add new unit tests and extend all the ColumnIndexFiltering and BloomFiltering unit tests to validate row indexes also.
…, document the same, Return -1 when rowIndexOffset info not available in BlockMetadata
Force-pushed 19caf21 to 015f967
Thanks @ggershinsky for the review. I have addressed the comments and fixed the build issue.
@shangxinli @ggershinsky Thanks a lot for reviewing this change. This will unblock SPARK-37980 if it is released as part of the upcoming parquet release. Do we need to cherry-pick this to any release branch for that?
@prakharjain09 the upcoming parquet release will include the current master (plus a couple of WIP PRs, once they are merged), so this patch will be covered.
@shangxinli @ggershinsky Is there any tentative date or rough estimate for when we are planning to do the RC cut for the next release?
@prakharjain09 hopefully, we'll resolve the remaining issues at the community sync tomorrow, and start working on a cut. |
Make sure you have checked all steps below.
Jira
Tests
Added TestParquetReader, which covers rowIndex-related tests for different kinds of filters.
Also extended all the ColumnIndexFiltering and BloomFiltering tests to validate the row index as well. This adds unit test coverage for the following scenarios: Parquet V1/V2, with encryption on/off, with no filter/simple filter/column-index filter/bloom filter.
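As a usage sketch for reviewers: once this PR lands, a consumer could read the row index alongside each record roughly as below. `getCurrentRowIndex()` is the API exposed by this PR; the file path and Group-based reader setup are just illustrative.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class RowIndexReadExample {
  public static void main(String[] args) throws Exception {
    // Path is illustrative; any file readable with the example GroupReadSupport works.
    try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path("/tmp/phonebook.parquet")).build()) {
      Group record;
      while ((record = reader.read()) != null) {
        // The file-wide index of the row just returned by read(), or -1 if nothing was read yet.
        long rowIndex = reader.getCurrentRowIndex();
        System.out.println(rowIndex + ": " + record);
      }
    }
  }
}
```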
Commits
Documentation