Conversation

@vdiravka (Member) commented Jan 14, 2021:

DRILL-7825: Unknown logical type <LogicalType UUID:UUIDType()> in Parquet

It adds support for the new parquet-format (starting from version 2.4) UUID logical type:
https://github.com/apache/parquet-format/blob/master/CHANGES.md

Description

parquet-mr supports it starting from version 1.12.0 (should be released soon: https://issues.apache.org/jira/browse/PARQUET-1898).
One issue related to Parquet 1.12.0: it drops the possibility to create empty parquet files. This is fixed in Drill, and a ticket for Parquet is created; see details: apache/parquet-java#852 (comment).
One additional follow-up ticket is created to proceed with converting the byte[] UUID to human-readable form, see: https://issues.apache.org/jira/browse/DRILL-7896
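
For reference, here is a minimal sketch (not Drill's code; the class and column names are illustrative) of how the UUID annotation looks with the parquet-mr 1.12 schema API, together with the kind of byte[]-to-UUID conversion the DRILL-7896 follow-up is about:

```java
import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class UuidTypeSketch {
  public static void main(String[] args) {
    // UUID is annotated on a 16-byte FIXED_LEN_BYTE_ARRAY primitive.
    Type uuidColumn = Types.required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
        .length(16)
        .as(LogicalTypeAnnotation.uuidType())
        .named("id");
    System.out.println(uuidColumn);

    // Converting the raw 16 bytes into a readable java.util.UUID,
    // the conversion tracked by DRILL-7896:
    byte[] raw = new byte[16]; // e.g. the value read from the column
    ByteBuffer bb = ByteBuffer.wrap(raw);
    UUID uuid = new UUID(bb.getLong(), bb.getLong());
    System.out.println(uuid);
  }
}
```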

Documentation

https://drill.apache.org/docs/parquet-format/
https://drill.apache.org/docs/supported-data-types/
should be updated to correspond to:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#uuid

Testing

All test cases pass. Checked manually within drill-embedded.

@cgivre cgivre requested a review from vvysotskyi January 14, 2021 11:49
@cgivre cgivre added the bug and enhancement labels Jan 14, 2021
@cgivre cgivre changed the title DRILL-7825: Error: SYSTEM ERROR: RuntimeException: Unknown logical ty… DRILL-7825: Unknown logical type <LogicalType UUID:UUIDType()> in Parquet Feb 17, 2021
@vdiravka vdiravka force-pushed the DRILL-7825 branch 3 times, most recently from a94a0b8 to 46527b5 on April 13, 2021 16:58
@vdiravka vdiravka marked this pull request as ready for review April 13, 2021 17:06
@vvysotskyi (Member) left a comment:

@vdiravka, thanks for the PR, could you please address several minor comments?

* Writes a number of pages within corresponding column chunk
* Writes a number of pages within corresponding column chunk <br>
* // TODO: the Bloom Filter can be useful in filtering entire row groups,
* see <a href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>
@vvysotskyi (Member):

This class was created as a copy of the ColumnChunkPageWriteStore class from the parquet library (see DRILL-5544 for details).

Since it is a copy, it is better to sync it with the original version instead of adding a TODO about porting specific features from it...

@vdiravka (Member, Author):

Yes, you are right. The best way is to remove Drill's copy of this class; we have a TODO about that in ParquetRecordWriter#256, but we can't do it before PARQUET-1006 is resolved. Porting the latest functionality from the Parquet version of this class requires deeper work (proper instantiation of ParquetColumnChunkPageWriteStore) and is not related to the UUID logical type. That's why DRILL-7895 was created.

@vvysotskyi (Member):

I mean applying the same changes that were made earlier to the copy of this class to the newer version; it shouldn't be too complex...

@vdiravka (Member, Author):

I double-checked Parquet's ColumnChunkPageWriteStore, and it looks like we still use ParquetDirectByteBufferAllocator and allocate DrillBuf, since ParquetProperties is initialized with the proper allocator (see ParquetRecordWriter#258). I also debugged the TestParquetWriter.testTPCHReadWriteRunRepeated test case and found that Drill allocates the same amount of heap memory for byte[] with ColumnChunkPageWriteStore as with the old ParquetColumnChunkPageWriteStore (~50% for my default settings).
So we can update ParquetRecordWriter to use ColumnChunkPageWriteStore.
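
A minimal sketch of the allocator wiring described above, assuming parquet-mr 1.12's ParquetProperties.Builder#withAllocator; the helper class here is illustrative, not the actual ParquetRecordWriter code:

```java
import org.apache.parquet.bytes.ByteBufferAllocator;
import org.apache.parquet.column.ParquetProperties;

class AllocatorWiringSketch {
  // 'allocator' stands in for Drill's ParquetDirectByteBufferAllocator,
  // which implements parquet's ByteBufferAllocator on top of DrillBuf.
  static ParquetProperties directMemoryProperties(ByteBufferAllocator allocator, int pageSize) {
    return ParquetProperties.builder()
        .withPageSize(pageSize)
        .withAllocator(allocator) // page buffers are then backed by this allocator
        .build();
  }
}
```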

@vvysotskyi (Member):

@vdiravka, are you sure that heap memory usage is the same? I assumed the main reason for using ParquetColumnChunkPageWriteStore was to use direct memory instead of heap...
From the code perspective, it looks like nothing was done in this direction for ColumnChunkPageWriteStore: it still uses ConcatenatingByteArrayCollector to collect data before writing it to the file, while our version uses CapacityByteArrayOutputStream, which uses the provided allocator.

@vdiravka (Member, Author):

  1. I checked heap memory after creating ColumnChunkPageWriteStore with VisualVM; the size is the same:
    https://ibb.co/xFYqC0m
    https://ibb.co/fNB7MBq
  2. The allocator is passed to ColumnChunkPageWriteStore and ColumnChunkPageWriter too, and DrillBuf really is used while writing the parquet file.
  3. We then convert that buf to bytes via BytesInput.from(buf) and compressedBytes.writeAllTo(buf), so all data is still placed in heap.
  4. We already have several other places where ColumnChunkPageWriteStore is used indirectly.

So it looks like the updated ColumnChunkPageWriteStore will manage heap memory even better when creating parquet files via Drill, and we are safe to go with the current change.

And the proper way to use direct memory more than we do now is to make improvements in Parquet. One of them is PARQUET-1771, but that one will not help here. So I want to proceed with PARQUET-1006. It looks like we can use a direct memory buf for ColumnChunkPageWriteStore, ParquetFileWriter, and ByteArrayOutputStream. I am planning to ask the community about it.

@vvysotskyi (Member):

@vdiravka, thanks for sharing the screenshots and providing more details.

> We then convert that buf to bytes via BytesInput.from(buf) and compressedBytes.writeAllTo(buf), so all data is still placed in heap.

Please note that calling BytesInput.from(buf) doesn't convert all bytes of the buffer at once; it creates a CapacityBAOSBytesInput that wraps the provided CapacityByteArrayOutputStream and uses it when writing to the OutputStream.
Regarding the compressedBytes.writeAllTo(buf) call: it is fine to have bytes here, since GC will take care of them and there is no risk of leaks; data that should be processed later will be stored in direct memory.

But when using ConcatenatingByteArrayCollector, all bytes are stored in heap (including data that should be processed later), so GC has no power here.

Not sure why the heap usage you measured is similar; perhaps it makes a difference with more data, or GC does its work right before flushing data from the buf...
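
To make the BytesInput.from(buf) point concrete, here is a small self-contained sketch using the parquet-bytes classes (the slab sizes are arbitrary), showing that the wrapper streams from the backing CapacityByteArrayOutputStream only when writeAllTo is called:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.bytes.CapacityByteArrayOutputStream;
import org.apache.parquet.bytes.HeapByteBufferAllocator;

class BytesInputSketch {
  public static void main(String[] args) throws IOException {
    // Slabs for this stream come from the given allocator (heap here,
    // DrillBuf-backed direct memory in Drill's writer).
    CapacityByteArrayOutputStream buf =
        new CapacityByteArrayOutputStream(64, 1 << 20, new HeapByteBufferAllocator());
    buf.write(new byte[] {1, 2, 3});

    // No eager copy here: the returned BytesInput wraps buf and reads
    // from it only when the data is finally written out.
    BytesInput wrapped = BytesInput.from(buf);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    wrapped.writeAllTo(out);
    System.out.println(out.size()); // 3
  }
}
```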

<build>
<plugins>
<plugin>
<plugin> <!-- TODO: this plugin has common things with default profile. Factor out this common things to avoid duplicate code -->
@vvysotskyi (Member):

Could you please implement this TODO? It doesn't look complicated.

@vdiravka (Member, Author):

ok :)

@vdiravka (Member, Author):

maven-enforcer-plugin is removed from the mapr profile because exactly the same plugin exists in the default scope.
There is also a very similar maven-shade-plugin, but with some differences, so I think it is better to check that on a MapR cluster before merging this plugin.

@vvysotskyi (Member):

Ok, thanks!

pom.xml Outdated
<slf4j.version>1.7.26</slf4j.version>
<shaded.guava.version>28.2-jre</shaded.guava.version>
<guava.version>19.0</guava.version>
<guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->
@vvysotskyi (Member):

Let's create a ticket instead of adding the comment. Also, there are newer versions of Guava.

@vdiravka (Member, Author):

ok. Yes, I know about the newer versions, but a newer version can bring new issues of its own.
The plan is to check <guava.version>28.2-jre</guava.version> first; if that is fine, drop <shaded.guava.version> entirely, and then update the guava version to the newest one.

@vvysotskyi (Member):

So the comment can be removed?

pom.xml Outdated
<slf4j.version>1.7.26</slf4j.version>
<shaded.guava.version>28.2-jre</shaded.guava.version>
<guava.version>19.0</guava.version>
<guava.version>19.0</guava.version> <!--todo: 28.2-jre guava can be used here-->
@cgivre (Contributor):

Can we create a JIRA for this?

@cgivre (Contributor):

Actually, there's a CVE for guava < 29. Could we upgrade to guava 30-jre?

@vdiravka vdiravka force-pushed the DRILL-7825 branch 3 times, most recently from bb1be23 to 6849d7e on April 16, 2021 13:47
@vdiravka (Member, Author) left a comment:

@vvysotskyi Could you please check again?

@vdiravka (Member, Author):

@vvysotskyi Thanks for the code review. I've squashed my changes into one commit and left your CR suggestion as a separate one. Is it fine with you to go with two commits, or do you prefer to squash them?

@vvysotskyi (Member):

@vdiravka, let's have a single commit, and don't hesitate to remove the Co-authored-by line; GitHub inserts it when accepting changes after CR, but I don't think it is needed here.

@vvysotskyi (Member) left a comment:

@vdiravka, thanks for making changes!
LGTM, +1

@cgivre (Contributor) left a comment:

Thanks for the PR. A minor request: could you please create JIRAs for any TODOs and reference them in the comments?
Other than that, LGTM +1.

Thanks!

@vdiravka vdiravka merged commit 5d21637 into apache:master Apr 23, 2021