DRILL-7825: Unknown logical type <LogicalType UUID:UUIDType()> in Parquet #2143
Conversation
Force-push: a94a0b8 → 46527b5
vvysotskyi left a comment:
@vdiravka, thanks for the PR, could you please address several minor comments?
...java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetGroupConverter.java (outdated, resolved)
```diff
- * Writes a number of pages within corresponding column chunk
+ * Writes a number of pages within corresponding column chunk <br>
+ * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+ * see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>
```
This class was created as a copy of the ColumnChunkPageWriteStore class from the parquet library (see DRILL-5544 for details).
Since it is a copy, it is better to sync it with the original version instead of adding a TODO about porting specific features from it...
Yes, you are right. The best approach is to remove Drill's copy of this class; we have a TODO about that in ParquetRecordWriter#256, but we can't do it before PARQUET-1006 is resolved. Adopting the latest functionality from the Parquet version of this class requires deeper involvement (proper instantiation of ParquetColumnChunkPageWriteStore) and isn't related to the UUID logical type. That's why DRILL-7895 was created.
I mean applying the same changes that were made earlier to the copy of this class to the newer version; it shouldn't be too complex...
I double-checked Parquet's ColumnChunkPageWriteStore, and it looks like we still use ParquetDirectByteBufferAllocator and allocate DrillBuf, because ParquetProperties is initialized with the proper allocator (see ParquetRecordWriter#258). I also debugged the TestParquetWriter.testTPCHReadWriteRunRepeated test case and found that Drill allocates the same amount of heap memory for byte[] with ColumnChunkPageWriteStore as with the old ParquetColumnChunkPageWriteStore (~50% with my default settings).
So we can update ParquetRecordWriter to use ColumnChunkPageWriteStore.
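For context, a minimal sketch of how the allocator ends up inside ParquetProperties via parquet-mr's builder API; the `oContext` variable and option values here are placeholders, not the exact code in ParquetRecordWriter:

```java
import org.apache.drill.exec.store.parquet.ParquetDirectByteBufferAllocator;
import org.apache.parquet.column.ParquetProperties;

// Because the properties carry Drill's direct allocator, page writers created
// from them allocate DrillBuf-backed buffers instead of plain heap byte[].
ParquetProperties props = ParquetProperties.builder()
    .withAllocator(new ParquetDirectByteBufferAllocator(oContext.getAllocator())) // oContext: placeholder
    .withPageSize(1024 * 1024)    // placeholder value
    .withDictionaryEncoding(true) // placeholder value
    .build();
```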
@vdiravka, are you sure that heap memory usage is the same? I assumed that the main reason for using ParquetColumnChunkPageWriteStore was to use direct memory instead of heap...
From the code perspective, it looks like nothing was done in this direction for ColumnChunkPageWriteStore: it still uses ConcatenatingByteArrayCollector for collecting data before writing it to the file, whereas our version uses CapacityByteArrayOutputStream, which uses the provided allocator.
- I have checked heap memory after creating ColumnChunkPageWriteStore with VisualVM; the size is the same:
  https://ibb.co/xFYqC0m
  https://ibb.co/fNB7MBq
- The allocator is passed to ColumnChunkPageWriteStore and ColumnChunkPageWriter too, and DrillBuf really is used in the process of writing the parquet file.
- And we convert that buf to bytes via BytesInput.from(buf) and compressedBytes.writeAllTo(buf), so all data is still placed in heap.
- We already have several other places where ColumnChunkPageWriteStore is used indirectly.

So it looks like the updated ColumnChunkPageWriteStore will manage heap memory even better when creating parquet files via Drill, and we are safe to go with the current change.
The proper way to use more direct memory than now is to make improvements in Parquet. One of them is PARQUET-1771, but that one will not help here. So I want to proceed with PARQUET-1006. It looks like we can use a direct-memory buf for ColumnChunkPageWriteStore, ParquetFileWriter and for ByteArrayOutputStream. I am planning to ask the community about it.
@vdiravka, thanks for sharing screenshots and providing more details.
> And we convert that buf to bytes via BytesInput.from(buf) and compressedBytes.writeAllTo(buf), so all data is still placed in heap.
Please note that calling BytesInput.from(buf) doesn't convert all bytes of the buffer at once; it creates a CapacityBAOSBytesInput that wraps the provided CapacityByteArrayOutputStream and uses it when writing to the OutputStream.
Regarding the compressedBytes.writeAllTo(buf) call, it is fine to have bytes here since GC will take care of them and there is no reason for leaks; data that should be processed later will be stored in direct memory.
But when using ConcatenatingByteArrayCollector, all bytes are stored in heap (including data that should be processed later), so GC has no power here.
Not sure why the heap usage you provided is similar; perhaps it will make a difference when we have more data, or GC did its work right before flushing data from the buf...
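A small sketch of the lazy behaviour described here, using the parquet-mr classes mentioned above (the pageBytes array and output stream are hypothetical):

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.bytes.CapacityByteArrayOutputStream;
import org.apache.parquet.bytes.HeapByteBufferAllocator;

// BytesInput.from(buf) does not copy the buffered bytes; it returns a view
// over the CapacityByteArrayOutputStream that is drained only on writeAllTo().
static void writePage(byte[] pageBytes, OutputStream out) throws IOException {
  CapacityByteArrayOutputStream buf =
      new CapacityByteArrayOutputStream(64 * 1024, 1024 * 1024, new HeapByteBufferAllocator());
  buf.write(pageBytes);                   // data accumulates in allocator-provided slabs
  BytesInput lazy = BytesInput.from(buf); // no copy happens here
  lazy.writeAllTo(out);                   // bytes actually stream out here
}
```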
exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java (resolved)
exec/jdbc-all/pom.xml (outdated)

```diff
  <build>
    <plugins>
-     <plugin>
+     <plugin> <!-- TODO: this plugin has common things with default profile. Factor out this common things to avoid duplicate code -->
```
Could you please implement this TODO? It doesn't look complicated.
ok :)
maven-enforcer-plugin is removed from the mapr profile because there is a fully identical plugin in the default scope.
There is also a very similar maven-shade-plugin, but there are some differences, so before merging this plugin it is better to check it on a mapr cluster, I think.
Ok, thanks!
pom.xml (outdated)

```diff
  <slf4j.version>1.7.26</slf4j.version>
  <shaded.guava.version>28.2-jre</shaded.guava.version>
- <guava.version>19.0</guava.version>
+ <guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->
```
Let's create a ticket instead of adding the comment. Also, there are newer versions of Guava.
Ok. Yes, I know about the newer versions, but a newer version can bring other new issues for sure.
The plan is to check <guava.version>28.2-jre</guava.version> first. If it is fine, then drop <shaded.guava.version> altogether, and then update the guava version to the newest one.
So the comment can be removed?
pom.xml (outdated, same snippet as above)
Can we create a JIRA for this?
Actually, there's a CVE for guava < 29. Could we upgrade to guava 30-jre?
Force-push: bb1be23 → 6849d7e
vdiravka left a comment:
@vvysotskyi Could you please check again?
@vvysotskyi Thanks for the code review. I've squashed my changes into one commit and left your CR suggestion as a separate one. Is it fine for you to go with 2 commits, or do you prefer to squash them?

@vdiravka, let's have a single commit and don't hesitate to remove …
vvysotskyi left a comment:
@vdiravka, thanks for making changes!
LGTM, +1
cgivre left a comment:
Thanks for the PR. A minor request: could you please create JIRAs for any TODOs and reference them in the comments?
Other than that, LGTM +1.
Thanks!
DRILL-7825: Unknown logical type <LogicalType UUID:UUIDType()> in Parquet
It adds support for the new parquet-format (starting from version 2.4) UUID logical type: https://github.com/apache/parquet-format/blob/master/CHANGES.md
Description
parquet-mr supports it starting from version 1.12.0 (should be released soon: https://issues.apache.org/jira/browse/PARQUET-1898). One issue related to the Parquet 1.12.0 version: it drops the possibility to create empty parquet files. This is fixed in Drill, and a ticket for Parquet was created; see details: apache/parquet-java#852 (comment)
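For illustration, declaring a UUID column with the parquet-mr schema API looks roughly like this; the field and message names are made up, and LogicalTypeAnnotation.uuidType() is the annotation this change builds on:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Types;
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY;

// UUID is a logical annotation over a 16-byte FIXED_LEN_BYTE_ARRAY column.
MessageType schema = Types.buildMessage()
    .required(FIXED_LEN_BYTE_ARRAY).length(16)
        .as(LogicalTypeAnnotation.uuidType())
        .named("id")            // hypothetical column name
    .named("uuid_example");     // hypothetical message name
```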
One additional follow-up ticket was created for this one, to proceed with converting the byte[] UUID to human-readable output; see: https://issues.apache.org/jira/browse/DRILL-7896
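That conversion is straightforward with plain JDK classes; a minimal sketch (the method name is made up):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Parquet stores a UUID as 16 big-endian bytes; java.util.UUID renders the
// canonical "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" form via toString().
static UUID uuidFromParquetBytes(byte[] raw) {
  ByteBuffer bb = ByteBuffer.wrap(raw);        // big-endian by default, matching the spec
  return new UUID(bb.getLong(), bb.getLong()); // high 8 bytes, then low 8 bytes
}
```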
Documentation
https://drill.apache.org/docs/parquet-format/
https://drill.apache.org/docs/supported-data-types/
should be updated to correspond to:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid
Testing
All test cases pass. Checked manually within drill-embedded.