feat: Encapsulate Parquet objects #1920
Conversation
Codecov Report

```
@@ Coverage Diff @@
##               main    #1920      +/-   ##
============================================
+ Coverage     56.12%   58.11%    +1.98%
- Complexity      976     1141      +165
============================================
  Files           119      132       +13
  Lines         11743    12987     +1244
  Branches       2251     2407      +156
============================================
+ Hits           6591     7547      +956
- Misses         4012     4217      +205
- Partials       1140     1223       +83
```
Force-pushed from b0ca78f to 36b4890
```java
/** This method is called from Apache Iceberg. */
public void setRowGroupReader(RowGroupReader rowGroupReader, ParquetColumnSpec columnSpec)
```
It would be good to have some unit tests in Comet for the methods intended to be called from Iceberg, so that we catch any regressions in behavior. This could be added as a separate PR so that we don't slow down progress on the integration work.
I have a draft Iceberg PR.
cc @andygrove @parthchandra @hsiang-c
I am testing this locally now.

Edit: There seems to be a method missing in Comet in this PR:
I added the following method:

```java
/** Sets the projected columns to be read later via {@link #readNextRowGroup()} */
public void setRequestedSchemaFromSpecs(List<ParquetColumnSpec> projection) {
  paths.clear();
  for (ParquetColumnSpec columnSpec : projection) {
    ColumnDescriptor col = Utils.buildColumnDescriptor(columnSpec);
    paths.put(ColumnPath.get(col.getPath()), col);
  }
}
```

I can now compile Iceberg, but I get an exception at runtime, and I do not yet understand why:
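For readers following along, the shape of that method is a simple "clear and rebuild the projection map" pattern. Here is a minimal, self-contained sketch of the same idea using simplified stand-in types (ColumnSpec and ColumnDescriptor here are illustrative placeholders, not the actual Comet/Parquet classes):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProjectionDemo {
    // Simplified stand-ins for Comet's ParquetColumnSpec and Parquet's
    // ColumnDescriptor; field names are assumptions for illustration only.
    record ColumnSpec(String[] path, String physicalType) {}
    record ColumnDescriptor(String[] path, String type) {}

    private final Map<String, ColumnDescriptor> paths = new LinkedHashMap<>();

    /** Mirrors the pattern above: drop the old projection, rebuild from specs. */
    public void setRequestedSchemaFromSpecs(List<ColumnSpec> projection) {
        paths.clear();
        for (ColumnSpec spec : projection) {
            ColumnDescriptor col = new ColumnDescriptor(spec.path(), spec.physicalType());
            paths.put(String.join(".", col.path()), col);
        }
    }

    public List<String> projectedPaths() {
        return new ArrayList<>(paths.keySet());
    }

    public static void main(String[] args) {
        ProjectionDemo reader = new ProjectionDemo();
        reader.setRequestedSchemaFromSpecs(List.of(
            new ColumnSpec(new String[] {"id"}, "INT64"),
            new ColumnSpec(new String[] {"name"}, "BINARY")));
        System.out.println(reader.projectedPaths()); // [id, name]
    }
}
```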
Comet is expecting a scan that implements SupportsComet:

```scala
// Iceberg scan
case s: SupportsComet =>
```

Iceberg does not implement this interface, and I see that there is a related TODO comment:
In my local copy of Iceberg, I updated this. If I just have the Iceberg jar on the classpath then Comet does try to accelerate the scan, but now I am running into Arrow shading issues. Spark 3.4.3 uses Arrow 11.0.0. It is not possible to shade the JNI classes in Arrow because the Java names have to match the function names in the native code, so this is quite challenging to resolve. We may need to look into using class loader isolation.
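To make the class loader isolation idea concrete, here is a minimal stdlib-only sketch. The idea is to load the version-sensitive classes through a URLClassLoader whose parent is the platform loader, so application-classpath classes (such as a conflicting Arrow version) are simply not visible inside it. The jar URLs would be supplied at runtime; this is a sketch of the mechanism, not how Comet actually does it:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class IsolationSketch {
    /**
     * Returns a class loader that sees only the given jars plus the JDK
     * platform classes; application-classpath classes are invisible to it.
     */
    public static ClassLoader isolatedLoader(URL[] jars) {
        return new URLClassLoader(jars, ClassLoader.getPlatformClassLoader());
    }

    public static void main(String[] args) throws Exception {
        ClassLoader isolated = isolatedLoader(new URL[0]);
        // JDK classes still resolve through the platform parent:
        System.out.println(isolated.loadClass("java.util.List").getName());
        // ...but classes from the application classpath do not:
        try {
            isolated.loadClass("IsolationSketch");
        } catch (ClassNotFoundException e) {
            System.out.println("app classpath is hidden, as intended");
        }
    }
}
```

This sidesteps shading entirely, at the cost of having to pass data across the loader boundary via types both sides can see (JDK types, byte arrays, or shared interfaces).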
I'm surprised we don't have a Comet interface which provides the access needed by Comet, in combination with an Iceberg implementation of that interface that wraps the Iceberg details. |
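The suggested shape would look something like the following sketch: Comet defines a small interface exposing only what it needs, and Iceberg ships an adapter that wraps its own (shaded) Parquet objects behind it. All names here are hypothetical, not actual Comet or Iceberg API:

```java
import java.util.List;

// Hypothetical: would live in Comet and express only what Comet needs.
interface CometRowGroupAccess {
    long rowCount();
    List<String> columnPaths();
}

// Hypothetical: would live in Iceberg, wrapping its shaded Parquet metadata
// so the shaded types never cross the boundary.
class IcebergRowGroupAdapter implements CometRowGroupAccess {
    private final long rowCount;
    private final List<String> columnPaths;

    IcebergRowGroupAdapter(long rowCount, List<String> columnPaths) {
        this.rowCount = rowCount;
        this.columnPaths = columnPaths;
    }

    public long rowCount() { return rowCount; }
    public List<String> columnPaths() { return columnPaths; }
}

public class AdapterDemo {
    public static void main(String[] args) {
        CometRowGroupAccess rg = new IcebergRowGroupAdapter(100, List.of("id", "name"));
        System.out.println(rg.rowCount() + " " + rg.columnPaths());
    }
}
```

Because the interface only uses JDK types, the shading of Parquet inside Iceberg never leaks into Comet.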
```diff
  <groupId>org.apache.datafusion</groupId>
  <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
- <version>0.9.0-SNAPSHOT</version>
+ <version>0.9.0.1-SNAPSHOT</version>
```

@huaxingao I'm assuming that the version bump was not intentional?
I am now able to get this working end-to-end 🎉
andygrove left a comment
LGTM. I am now able to integrate Comet and Iceberg locally. Thanks @huaxingao!
@andygrove You can apply the diff to Iceberg 1.8.1 for Comet support. I'll add Iceberg https://github.com/apache/datafusion-comet/blob/main/dev/diffs/iceberg/1.8.1.diff#L236
```java
ParquetReadOptions options =
    buildParquetReadOptions(conf, properties, start, length, fileEncryptionKey, fileAADPrefix);
this.converter = new ParquetMetadataConverter(options);
this.file = HadoopInputFile.fromPath(path, conf);
```
Can we use CometInputFile here? According to the doc:

```java
/**
 * A Parquet {@link InputFile} implementation that's similar to {@link
 * org.apache.parquet.hadoop.util.HadoopInputFile}, but with optimizations introduced in Hadoop 3.x,
 * for S3 specifically.
 */
public class CometInputFile implements InputFile {
  // omitted
}
```
I think we can use CometInputFile. Modified.
```java
Collection<String> readPropertiesToRemove =
    Sets.newHashSet(
        "parquet.read.filter",
```
Could we use these constants since they are not in the method signature?

```java
ParquetInputFormat.UNBOUND_RECORD_FILTER,
ParquetInputFormat.FILTER_PREDICATE,
ParquetInputFormat.READ_SUPPORT_CLASS,
EncryptionPropertiesFactory.CRYPTO_FACTORY_CLASS_PROPERTY_NAME
```

I filed a follow-on issue for this: #1934
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
```

```java
import org.apache.commons.compress.utils.Sets;
```
There was a problem hiding this comment.
(nit) Perhaps we could use built-in Set.
| import org.apache.commons.compress.utils.Sets; | |
| import java.util.Set; |
Iceberg prefers import org.apache.iceberg.relocated.com.google.common.collect.Sets; but Comet doesn't use it.
```java
// Iceberg removes these read properties when building the ParquetReadOptions.
// We want to build the exact same ParquetReadOptions as Iceberg's.
Collection<String> readPropertiesToRemove =
    Sets.newHashSet(
```

```diff
- Sets.newHashSet(
+ Set.of(
```
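One design note on this suggestion: Set.of produces an immutable set, which is fine here because the set of property keys to remove is fixed and only iterated, never modified. A small demonstration (the second key is a placeholder, not one of the actual properties):

```java
import java.util.Set;

public class SetOfDemo {
    public static void main(String[] args) {
        // Immutable set of property keys; fine for a fixed, read-only list.
        Set<String> props = Set.of("parquet.read.filter", "some.placeholder.property");
        System.out.println(props.contains("parquet.read.filter")); // true
        try {
            props.add("another.property");
        } catch (UnsupportedOperationException e) {
            // Set.of sets reject any mutation at runtime.
            System.out.println("immutable");
        }
    }
}
```

Note that Set.of also rejects null elements and duplicate keys at construction time, unlike Sets.newHashSet.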
Perhaps we could simplify this. It shifts the burden of type checking to runtime instead of compile time, but sidesteps the shading issue. We will not need to make any changes related to calling it.
The same tactic might work for ...
On second thoughts, the check ...
OK. Verified that this, along with the corresponding PR in Iceberg, works correctly.
parthchandra left a comment
This looks good to be merged.
I have one question about potentially reading the footer twice, which can be expensive. Can we verify we are not reading it multiple times?
```java
this.cometOptions = cometOptions;
this.metrics = null;
try {
  this.footer = readFooter(file, options, f, converter);
```
The footer should be available already in Iceberg? Can we avoid having to read it twice?
Yes, the footer is already available from Iceberg, but I can't pass ParquetMetadata from Iceberg to Comet. Maybe I can convert ParquetMetadata to a JSON string and pass it to Comet, then construct a ParquetMetadata from the JSON string.
I initially thought I could do something like this:

```java
ParquetMetadata footer = reader.getFooter();
String footerInJson = ParquetMetadata.toJSON(footer);
ParquetMetadata footer2 = ParquetMetadata.fromJSON(footerInJson);
```

but this doesn't work. I got:

```
java.lang.RuntimeException: shaded.parquet.com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Cannot construct instance of `org.apache.parquet.hadoop.metadata.ParquetMetadata` (no Creators, like default constructor, exist): cannot deserialize from Object value (no delegate- or property-based Creator)
 at [Source: (StringReader); line: 1, column: 2]
```

I will see if there are other ways to re-construct ParquetMetadata.
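The root cause of that exception is that ParquetMetadata has no default constructor, so Jackson cannot rebuild it from JSON. One workaround direction (a sketch only, not what this PR does): extract just the fields the other side needs into a plain carrier type that is trivially constructible, and rebuild from that. All names below are illustrative:

```java
import java.io.Serializable;
import java.util.List;

public class FooterCarrierDemo {
    // Hypothetical neutral carriers: records always have a canonical
    // constructor, so any serializer can reconstruct them.
    record RowGroupInfo(long rowCount, long totalByteSize) implements Serializable {}
    record FooterInfo(String createdBy, List<RowGroupInfo> rowGroups) implements Serializable {}

    public static void main(String[] args) {
        // Built on the producer side from the real footer, consumed on the
        // other side without ever touching the shaded ParquetMetadata type.
        FooterInfo footer = new FooterInfo("writer-id", List.of(new RowGroupInfo(1000, 4096)));
        System.out.println(footer.rowGroups().get(0).rowCount()); // 1000
    }
}
```

The trade-off is that the carrier must be kept in sync with whatever subset of footer fields the consumer actually needs.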
@huaxingao Could you merge latest from main to fix the clippy issues?

Wrote up some notes on this -
Thanks @andygrove @hsiang-c @parthchandra |
Which issue does this PR close?
Closes #1833

Rationale for this change
Iceberg shades Parquet, so we can't pass Parquet objects from Iceberg to Comet. To get around this problem, this PR encapsulates the Parquet objects.
Here is a summary of the changes. Iceberg calls these APIs:

- In order to encapsulate ColumnDescriptor and PageReader, this PR adds a ParquetColumnSpec and changes the above two APIs to take it.
- In order to call setRowGroupReader(org.apache.comet.parquet.RowGroupReader rowGroupReader, ParquetColumnSpec columnSpec), the Iceberg side will need to use Comet's FileReader instead of ParquetFileReader, so we will call FileReader.readNextRowGroup() to get an org.apache.comet.parquet.RowGroupReader instead of Parquet's PageReadStore.
- ParquetReadOptions can't be passed directly either, so the related info is passed and ParquetReadOptions is built on the Comet side.

What changes are included in this PR?
How are these changes tested?
I did integration testing locally.
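Returning to the last point in the rationale above ("the related info is passed and ParquetReadOptions is built on the Comet side"), here is a minimal sketch of that pattern: instead of handing a shaded options object across the Iceberg/Comet boundary, pass its constituent plain values and rebuild on the other side. The types and fields below are illustrative, not the actual Comet API:

```java
import java.util.Map;

public class OptionsRebuildDemo {
    // Hypothetical carrier: only JDK types cross the boundary.
    record ReadOptionsInfo(long start, long length, Map<String, String> properties) {}

    // Stand-in for the Comet-side rebuild step.
    static String summarize(ReadOptionsInfo info) {
        return "range=" + info.start() + "-" + (info.start() + info.length())
            + " props=" + info.properties().size();
    }

    public static void main(String[] args) {
        ReadOptionsInfo info = new ReadOptionsInfo(0, 1024, Map.of("parquet.read.filter", "x"));
        System.out.println(summarize(info)); // range=0-1024 props=1
    }
}
```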