Skip to content

Conversation

@prakharjain09
Copy link
Contributor

@prakharjain09 prakharjain09 commented Feb 5, 2022

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests:
    Added TestParquetReader which covers rowIndex related tests for different kind of filters.
    Also Extended all the ColumnIndexFiltering and BloomFiltering tests to validate the "row index" also. This adds unit test coverage for following scenarios for this feature: Parquet V1/V2 with encryption on/off with no-filter/simple-filter/column-index-filter/bloom-filter.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@prakharjain09 prakharjain09 force-pushed the PARQUET-2117-1 branch 3 times, most recently from 8fbd061 to 37faea9 Compare February 5, 2022 11:43
@prakharjain09
Copy link
Contributor Author

@shangxinli @gszadovszky Please review the changes when you get chance. Thanks!

@shangxinli
Copy link
Contributor

Can you squash the commits to make the review easier?

@prakharjain09
Copy link
Contributor Author

Can you squash the commits to make the review easier?

done


@Override
public Optional<Long> getRowIndexOffset() {
return Optional.of(rowIndexOffset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the constructor caller cannot have a valid rowIndexOffset, I guess we need to provide an option to return empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

* Returns the ROW_INDEX of the current row.
*/
public long getCurrentRowIndex() {
if (current == 0L) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason not turning -1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current is an existing variable which tracks number of rows already processed. It is initialized to 0 at declaration time. So here we are trying to see if it is still 0, that means we haven't processed any row yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why we are checking 'current == 0L'. I was asking why you choose throw exception other than returning an invalid value. This is a public method. We should have it documented ether way you choose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no specific reason for choosing exception over -1.
I have updated it to return -1 and also updated all the public method docs to reflect the same.

@shangxinli
Copy link
Contributor

We need more test to cover old parquet data that doesn't have column index.

@prakharjain09
Copy link
Contributor Author

prakharjain09 commented Feb 16, 2022

@shangxinli Thanks a lot or the review. I have addressed most of the comments.

We need more test to cover old parquet data that doesn't have column index.

I couldn't find any existing tests or any existing parquet files in Resource directory which doesn't have column indexes. Could you please give some pointer to similar existing test or some way to create parquet file without column indexes (don't see any options to disable writing column indexes either)?
I have added tests for validating row indexes are correct with "column index filtering" disabled.

@shangxinli
Copy link
Contributor

I will have another look soon sometime this week.

@prakharjain09
Copy link
Contributor Author

@shangxinli I have added test to cover old parquet file without column indexes. Please review the changes when you get chance.


public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata,
InternalFileDecryptor fileDecryptor, boolean encryptedFooter) throws IOException {
return fromParquetMetadata(parquetMetadata, fileDecryptor, encryptedFooter, generateRowGroupOffsets(parquetMetadata));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned above, if parquetMetadata is a filtered one, then generateRowGroupOffsets() won't return accurate offsets, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thats correct. Fixed this - now we are passing empty Map so that we don't populate incorrect rowIndexOffsets.

private long totalByteSize;
private String path;
private int ordinal;
private long rowIndexOffset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the following toString(), it should be added too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


try {
currentValue = recordReader.read();
if (rowIdxInFileItr != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&& hasNext()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

if (pages.getRowIndexes().isPresent()) {
rowIdxInRowGroupItr = pages.getRowIndexes().get();
} else {
// If `pages.getRowIndexes()` is empty, this means column indexing has not triggered.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of 'column index' was already used for Page Index in another feature. Can you use something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this code comment.

@prakharjain09
Copy link
Contributor Author

@shangxinli Thanks a lot or the review. I have addressed the review comments. Could you please look into the PR again?

Also could you share information about when are we planning to do code-freeze for next minor release? It will be great if we can release this change in next minor/patch release so that Apache Spark/other projects get to use this functionality sooner.

@prakharjain09
Copy link
Contributor Author

Could you please look into the PR again?
Also could you share information about when are we planning to do code-freeze for next minor release? It will be great if we can release this change in next minor/patch release so that Apache Spark/other projects get to use this functionality sooner.

@shangxinli Gentle reminder. Please take a look when you get chance.

}

/**
* Returns the row index of the last read row. If no row has been processed, returns -1.
Copy link
Contributor

@shangxinli shangxinli Mar 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is a public method, we need to take care of the Java doc decorations. Please refer to other methods in this class and follow the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

private static final Path FILE_V1 = createTempFile();
private static final Path FILE_V2 = createTempFile();
private static final List<PhoneBookWriter.User> DATA = Collections.unmodifiableList(makeUsers(10000));
private static final Path STATIC_FILE_WITHOUT_COL_INDEXES = createPathFromCP("/test-file-with-no-column-indexes-1.parquet");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if it is a good idea to check in a data file. Can you check if it is possible to stop generating offset index in the current version of Parquet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shangxinli It looks like the column-indexes are always written in the current version of parquet and are not configurable.
We are already testing the new row index support with and without the column index filtering being triggered (as part of TestColumnIndexFiltering). Also the new row index feature doesn't rely on column indexes in any way. So we can skip the backward compatibility testing and remove this parquet file from resources. What do you think about this?

@shangxinli
Copy link
Contributor

shangxinli commented Mar 7, 2022

I just left some comments. Other than that, it looks good to me. Add @ggershinsky in case you have time to have a look.

Beyond this PR, if the work you are doing in Iceberg/Spark can be done in Parquet, please consider adding them to Parquet-mr. With that, it can benefit all the applications that need parquet-mr.

@prakharjain09 prakharjain09 requested a review from shangxinli March 7, 2022 17:21
@ggershinsky
Copy link
Contributor

hi guys, I'm OOO (vacation) this week. Can review it next week if helps, but feel free to go ahead without waiting for me.

@prakharjain09
Copy link
Contributor Author

@shangxinli Thanks for taking another look. I have addressed all comments other than one. Please advice on the same. Thanks!

@ggershinsky
Copy link
Contributor

thanks for this change. The PR looks good to me now, I'll add my approval after it passes the CI tests.

@shangxinli
Copy link
Contributor

@prakharjain09 After you fix the CI failures, we can merge.

…eader, also expose the row index via ParquetReader or ParquetRecordReader

 - Add and populate rowIndexOffset field in BlockMetaData
 - Changes to generate row index in InternalParquetRecordReader, also expose the row index via ParquetReader or ParquetRecordReader
 - Add new unit tests and extend all the ColumnIndexFiltering and BloomFiltering unit tests to validate row indexes also.
…, document the same, Return -1 when rowIndexOffset info not available in BlockMetadata
@prakharjain09
Copy link
Contributor Author

Thanks @ggershinsky for the review. I have addressed the comments and fixed the build issue.

@shangxinli shangxinli merged commit c7bff51 into apache:master Mar 20, 2022
@prakharjain09
Copy link
Contributor Author

@shangxinli @ggershinsky Thanks a lot for reviewing this change.

This will unblock SPARK-37980 if this is released as part of upcoming parquet release. Do we need to cherry-pick this to any release branch for the same?

@ggershinsky
Copy link
Contributor

@prakharjain09 the upcoming parquet release will include the current master (plus a couple of WIP PRs, once they are merged), so this patch will be covered.

@prakharjain09
Copy link
Contributor Author

prakharjain09 commented Apr 25, 2022

@shangxinli @ggershinsky Is there any tentative date / rough estimate for when are we planning to do RC cut for the next release?

@ggershinsky
Copy link
Contributor

@prakharjain09 hopefully, we'll resolve the remaining issues at the community sync tomorrow, and start working on a cut.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants