Conversation

@vdiravka (Member) commented Jan 14, 2021:

DRILL-7825: Unknown logical type <LogicalType UUID:UUIDType()> in Parquet

It adds support for the new parquet-format (starting from version 2.4) UUID logical type:
https://github.com/apache/parquet-format/blob/master/CHANGES.md

Description

parquet-mr supports it starting from version 1.12.0 (should be released soon: https://issues.apache.org/jira/browse/PARQUET-1898).
One issue related to Parquet 1.12.0: it drops the possibility to create empty parquet files. This is fixed in Drill, and a ticket for Parquet is created; see details: apache/parquet-java#852 (comment).
One additional follow-up ticket is created to proceed with converting the byte[] UUID to human-readable form, see: https://issues.apache.org/jira/browse/DRILL-7896
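
For reference, here is a minimal sketch (not Drill's code; the class and column names are illustrative) of how the UUID annotation looks with the parquet-mr 1.12 schema API, together with the kind of byte[]-to-UUID conversion the DRILL-7896 follow-up is about:

```java
import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class UuidTypeSketch {
  public static void main(String[] args) {
    // UUID is annotated on a 16-byte FIXED_LEN_BYTE_ARRAY primitive.
    Type uuidColumn = Types.required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
        .length(16)
        .as(LogicalTypeAnnotation.uuidType())
        .named("id");
    System.out.println(uuidColumn);

    // Converting the raw 16 bytes into a readable java.util.UUID,
    // the conversion tracked by DRILL-7896:
    byte[] raw = new byte[16]; // e.g. the value read from the column
    ByteBuffer bb = ByteBuffer.wrap(raw);
    UUID uuid = new UUID(bb.getLong(), bb.getLong());
    System.out.println(uuid);
  }
}
```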

Documentation

https://drill.apache.org/docs/parquet-format/
https://drill.apache.org/docs/supported-data-types/
should be updated to correspond to:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid

Testing

All test cases pass. Checked manually within drill-embedded.

@cgivre cgivre requested a review from vvysotskyi January 14, 2021 11:49
@cgivre cgivre added the bug and enhancement labels Jan 14, 2021
@cgivre cgivre changed the title DRILL-7825: Error: SYSTEM ERROR: RuntimeException: Unknown logical ty… DRILL-7825: Unknown logical type <LogicalType UUID:UUIDType()> in Parquet Feb 17, 2021
@vdiravka vdiravka force-pushed the DRILL-7825 branch 3 times, most recently from a94a0b8 to 46527b5 on April 13, 2021 16:58
@vdiravka vdiravka marked this pull request as ready for review April 13, 2021 17:06
@vvysotskyi (Member) left a comment:

@vdiravka, thanks for the PR, could you please address several minor comments?

* Writes a number of pages within corresponding column chunk
* Writes a number of pages within corresponding column chunk <br>
* // TODO: the Bloom Filter can be useful in filtering entire row groups,
* see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>
@vvysotskyi (Member):

This class was created as a copy of the ColumnChunkPageWriteStore class from the parquet library (see DRILL-5544 for details).

Since it is a copy, it is better to sync it with the original version instead of adding a TODO about porting specific features from it...

@vdiravka (Member, Author):

Yes, you are right. The best way is to remove Drill's copy of this class; we have a TODO about that in ParquetRecordWriter#256, but we can't do it before PARQUET-1006 is resolved. Porting the latest functionality from the Parquet version of this class requires deeper work (proper instantiation of ParquetColumnChunkPageWriteStore) and is not related to the UUID logical type. That's why DRILL-7895 was created.

@vvysotskyi (Member):

I mean applying the same changes that were made earlier to the copy of this class to the newer version; it shouldn't be too complex...

@vdiravka (Member, Author):

I double-checked Parquet's ColumnChunkPageWriteStore, and it looks like we still use ParquetDirectByteBufferAllocator and allocate DrillBuf, since ParquetProperties is initialized with the proper allocator (see ParquetRecordWriter#258). I also debugged the TestParquetWriter.testTPCHReadWriteRunRepeated test case and found that Drill allocates the same amount of heap memory for byte[] with ColumnChunkPageWriteStore as with the old ParquetColumnChunkPageWriteStore (~50% for my default settings).
So we can update ParquetRecordWriter to use ColumnChunkPageWriteStore.
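
A minimal sketch of the allocator wiring described above, assuming parquet-mr 1.12's ParquetProperties.Builder#withAllocator; the helper class here is illustrative, not the actual ParquetRecordWriter code:

```java
import org.apache.parquet.bytes.ByteBufferAllocator;
import org.apache.parquet.column.ParquetProperties;

class AllocatorWiringSketch {
  // 'allocator' stands in for Drill's ParquetDirectByteBufferAllocator,
  // which implements parquet's ByteBufferAllocator on top of DrillBuf.
  static ParquetProperties directMemoryProperties(ByteBufferAllocator allocator, int pageSize) {
    return ParquetProperties.builder()
        .withPageSize(pageSize)
        .withAllocator(allocator) // page buffers are then backed by this allocator
        .build();
  }
}
```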

@vvysotskyi (Member):

@vdiravka, are you sure that heap memory usage is the same? I assumed the main reason for using ParquetColumnChunkPageWriteStore was to use direct memory instead of heap...
From the code perspective, it looks like nothing was done in this direction for ColumnChunkPageWriteStore: it still uses ConcatenatingByteArrayCollector to collect data before writing it to the file, while our version uses CapacityByteArrayOutputStream, which uses the provided allocator.

@vdiravka (Member, Author):

  1. I checked heap memory after creating ColumnChunkPageWriteStore with VisualVM; the size is the same:
    https://ibb.co/xFYqC0m
    https://ibb.co/fNB7MBq
  2. The allocator is passed to ColumnChunkPageWriteStore and ColumnChunkPageWriter too, and DrillBuf really is used while writing the parquet file.
  3. We then convert that buf to bytes via BytesInput.from(buf) and compressedBytes.writeAllTo(buf), so all data is still placed in heap.
  4. We already have several other places where ColumnChunkPageWriteStore is used indirectly.

So it looks like the updated ColumnChunkPageWriteStore will manage heap memory even better when creating parquet files via Drill, and we are safe to go with the current change.

And the proper way to use direct memory more than we do now is to make improvements in Parquet. One of them is PARQUET-1771, but that one will not help here. So I want to proceed with PARQUET-1006. It looks like we can use a direct memory buf for ColumnChunkPageWriteStore, ParquetFileWriter, and ByteArrayOutputStream. I am planning to ask the community about it.

@vvysotskyi (Member):

@vdiravka, thanks for sharing the screenshots and providing more details.

> We then convert that buf to bytes via BytesInput.from(buf) and compressedBytes.writeAllTo(buf), so all data is still placed in heap.

Please note that calling BytesInput.from(buf) doesn't convert all bytes of the buffer at once; it creates a CapacityBAOSBytesInput that wraps the provided CapacityByteArrayOutputStream and uses it when writing to the OutputStream.
Regarding the compressedBytes.writeAllTo(buf) call: it is fine to have bytes here, since GC will take care of them and there is no risk of leaks; data that should be processed later will be stored in direct memory.

But when using ConcatenatingByteArrayCollector, all bytes are stored in heap (including data that should be processed later), so GC has no power here.

Not sure why the heap usage you measured is similar; perhaps it makes a difference with more data, or GC does its work right before flushing data from the buf...
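
To make the BytesInput.from(buf) point concrete, here is a small self-contained sketch using the parquet-bytes classes (the slab sizes are arbitrary), showing that the wrapper streams from the backing CapacityByteArrayOutputStream only when writeAllTo is called:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.bytes.CapacityByteArrayOutputStream;
import org.apache.parquet.bytes.HeapByteBufferAllocator;

class BytesInputSketch {
  public static void main(String[] args) throws IOException {
    // Slabs for this stream come from the given allocator (heap here,
    // DrillBuf-backed direct memory in Drill's writer).
    CapacityByteArrayOutputStream buf =
        new CapacityByteArrayOutputStream(64, 1 << 20, new HeapByteBufferAllocator());
    buf.write(new byte[] {1, 2, 3});

    // No eager copy here: the returned BytesInput wraps buf and reads
    // from it only when the data is finally written out.
    BytesInput wrapped = BytesInput.from(buf);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    wrapped.writeAllTo(out);
    System.out.println(out.size()); // 3
  }
}
```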

<build>
<plugins>
<plugin>
<plugin> <!-- TODO: this plugin has common things with default profile. Factor out this common things to avoid duplicate code -->
@vvysotskyi (Member):

Could you please implement this TODO? It doesn't look complicated.

@vdiravka (Member, Author):

ok :)

@vdiravka (Member, Author):

maven-enforcer-plugin is removed from the mapr profile because exactly the same plugin exists in the default scope.
There is also a very similar maven-shade-plugin, but with some differences, so I think it is better to check that on a MapR cluster before merging this plugin.

@vvysotskyi (Member):

Ok, thanks!

pom.xml Outdated
<slf4j.version>1.7.26</slf4j.version>
<shaded.guava.version>28.2-jre</shaded.guava.version>
<guava.version>19.0</guava.version>
<guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->
@vvysotskyi (Member):

Let's create a ticket instead of adding the comment. Also, there are newer versions of Guava.

@vdiravka (Member, Author):

ok. Yes, I know about the newer versions, but a newer version can bring new issues of its own.
The plan is to check <guava.version>28.2-jre</guava.version> first; if that is fine, drop <shaded.guava.version> entirely, and then update the guava version to the newest one.

@vvysotskyi (Member):

So the comment can be removed?

pom.xml Outdated
<slf4j.version>1.7.26</slf4j.version>
<shaded.guava.version>28.2-jre</shaded.guava.version>
<guava.version>19.0</guava.version>
<guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->
@cgivre (Contributor):

Can we create a JIRA for this?

@cgivre (Contributor):

Actually, there's a CVE for guava < 29. Could we upgrade to guava 30-jre?

@vdiravka vdiravka force-pushed the DRILL-7825 branch 3 times, most recently from bb1be23 to 6849d7e on April 16, 2021 13:47
@vdiravka (Member, Author) left a comment:

@vvysotskyi Could you please check again?

@vdiravka (Member, Author):

@vvysotskyi Thanks for the code review. I've squashed my changes into one commit and left your CR suggestion as a separate one. Is it fine with you to go with two commits, or do you prefer to squash them?

@vvysotskyi (Member):

@vdiravka, let's have a single commit, and don't hesitate to remove the Co-authored-by line; GitHub inserts it when accepting changes after CR, but I don't think it is needed here.

@vvysotskyi (Member) left a comment:

@vdiravka, thanks for making changes!
LGTM, +1

@cgivre (Contributor) left a comment:

Thanks for the PR. A minor request: could you please create JIRAs for any TODOs and reference them in the comments?
Other than that, LGTM +1.

Thanks!

@vdiravka vdiravka merged commit 5d21637 into apache:master Apr 23, 2021