
docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683

Merged: andygrove merged 17 commits into apache:main from andygrove:iceberg-docs on Apr 28, 2025

Conversation

@andygrove (Member) commented Apr 26, 2025

Which issue does this PR close?

Part of #1618

Closes #1684

Rationale for this change

Tell users how to integrate Comet with Iceberg.

What changes are included in this PR?

  • Documentation
  • Refactoring to avoid a shading issue

How are these changes tested?

@andygrove (Member Author)

I fixed the issue with CometSchemaImporter being shaded, but now I'm hitting a similar issue with a shaded Parquet class:

org.apache.iceberg.shaded.org.apache.parquet.column.ColumnDescriptor

@huaxingao (Contributor)

Iceberg shades Parquet. In our internal version of Iceberg, we remove the shading. In OSS, when enabling Comet native execution in apache/iceberg#12709, I actually unshaded Parquet, but I don't think that's the correct approach.

Instead of passing a ColumnDescriptor from Iceberg to Comet, should we pass a String[] path, PrimitiveType.PrimitiveTypeName type, int maxRep, and int maxDef, so that we can construct the ColumnDescriptor on the Comet side?
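The suggestion above can be sketched as a small shading-neutral carrier type. This is a hypothetical illustration, not actual Comet or Iceberg API: the class name and fields are invented here to show how the four components could cross the boundary as plain Java types, with the Parquet `ColumnDescriptor` rebuilt only on the Comet side.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: the shading-neutral pieces Iceberg could pass instead
 * of a (possibly shaded) org.apache.parquet.column.ColumnDescriptor.
 */
public class ColumnDescriptorParts {
    final String[] path;            // column path within the Parquet schema
    final String primitiveTypeName; // e.g. "INT32" (PrimitiveTypeName.name())
    final int maxRepetitionLevel;
    final int maxDefinitionLevel;

    ColumnDescriptorParts(String[] path, String primitiveTypeName,
                          int maxRepetitionLevel, int maxDefinitionLevel) {
        this.path = path;
        this.primitiveTypeName = primitiveTypeName;
        this.maxRepetitionLevel = maxRepetitionLevel;
        this.maxDefinitionLevel = maxDefinitionLevel;
    }

    // On the Comet side, these parts could be turned back into an unshaded
    // descriptor, roughly (assuming parquet-column on the classpath):
    //   PrimitiveType type = Types
    //       .optional(PrimitiveTypeName.valueOf(primitiveTypeName))
    //       .named(path[path.length - 1]);
    //   new ColumnDescriptor(path, type, maxRepetitionLevel, maxDefinitionLevel);

    public static void main(String[] args) {
        ColumnDescriptorParts parts = new ColumnDescriptorParts(
            new String[] {"c0"}, "INT32", 0, 1);
        System.out.println(Arrays.toString(parts.path) + " " + parts.primitiveTypeName
            + " rep=" + parts.maxRepetitionLevel + " def=" + parts.maxDefinitionLevel);
    }
}
```

Because only strings and ints cross the boundary, it no longer matters whether Iceberg's copy of Parquet is relocated.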

@andygrove (Member Author)

> Iceberg shades Parquet. In our internal version of Iceberg, we remove the shading. In OSS, when enabling Comet native execution in apache/iceberg#12709, I actually unshaded Parquet, but I don't think that's the correct approach.
>
> Instead of passing a ColumnDescriptor from Iceberg to Comet, should we pass a String[] path, PrimitiveType.PrimitiveTypeName type, int maxRep, and int maxDef, so that we can construct the ColumnDescriptor on the Comet side?

Thanks @huaxingao, I will try that.

@andygrove (Member Author)

> Instead of passing a ColumnDescriptor from Iceberg to Comet, should we pass a String[] path, PrimitiveType.PrimitiveTypeName type, int maxRep, and int maxDef, so that we can construct the ColumnDescriptor on the Comet side?

I did get farther, but now I run into the same problem when Iceberg calls the following method, because it needs to pass the Parquet PageReader class in:

public void setPageReader(PageReader pageReader) throws IOException

@andygrove (Member Author)

Perhaps we need to have Comet use a shaded version of Parquet also. I don't know if the Maven and Gradle approaches to shading will be compatible though.
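For reference, the shaded class name seen earlier (org.apache.iceberg.shaded.org.apache.parquet.column.ColumnDescriptor) implies Iceberg's Gradle build relocates the org.apache.parquet package. A Maven-side equivalent via maven-shade-plugin would look roughly like the fragment below; the shadedPattern shown is illustrative, and for the two builds to interoperate it would have to agree with Iceberg's relocation, which is exactly the compatibility concern raised here.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.parquet</pattern>
            <!-- Illustrative target; would need to match the relocation
                 used by the Iceberg build for the types to line up. -->
            <shadedPattern>org.apache.iceberg.shaded.org.apache.parquet</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```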

@andygrove (Member Author)

Another option is to introduce a higher level of abstraction so that we can avoid directly referencing the Parquet types.
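As a minimal sketch of that idea, Comet could expose its own interface and let an Iceberg-side adapter implement it, so that no Parquet type (shaded or not) ever crosses the boundary. All names below are invented for illustration and are not actual Comet API:

```java
/**
 * Hypothetical abstraction replacing setPageReader(PageReader): Comet depends
 * only on this interface; Iceberg wraps its (possibly shaded) PageReader in
 * an adapter on its own side of the boundary.
 */
interface CometPageSource {
    byte[] nextPageBytes(); // raw page bytes; null when exhausted
}

/** Toy adapter standing in for an Iceberg-side PageReader wrapper. */
class InMemoryPageSource implements CometPageSource {
    private final byte[][] pages;
    private int next = 0;

    InMemoryPageSource(byte[][] pages) { this.pages = pages; }

    public byte[] nextPageBytes() {
        return next < pages.length ? pages[next++] : null;
    }
}

public class AbstractionSketch {
    // Comet-side entry point: references only the neutral interface,
    // never org.apache.parquet types.
    static long countPages(CometPageSource source) {
        long n = 0;
        while (source.nextPageBytes() != null) n++;
        return n;
    }

    public static void main(String[] args) {
        CometPageSource src = new InMemoryPageSource(new byte[][] {{1}, {2}});
        System.out.println(countPages(src)); // prints 2
    }
}
```

The trade-off is that every Parquet type crossing the boundary needs a corresponding neutral interface, which is more API surface to maintain.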

@andygrove (Member Author)

For now, I went with the approach of updating Iceberg to stop shading Parquet. This now seems to be working, but the plan says it is not running natively.

scala> spark.sql(s"SELECT * from t1").show()
25/04/28 06:59:19 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
25/04/28 06:59:19 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
 CollectLimit [COMET: CollectLimit is not supported]
+-  Project [COMET: Project is not native because the following children are not native (BatchScan spark_catalog.default.t1)]
   +-  BatchScan spark_catalog.default.t1 [COMET: Comet Scan only supports Parquet and Iceberg Parquet file formats, BatchScan spark_catalog.default.t1 is not supported]

+---+---+
| c0| c1|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
| 10| 10|
| 11| 11|
| 12| 12|
| 13| 13|
| 14| 14|
| 15| 15|
| 16| 16|
| 17| 17|
| 18| 18|
| 19| 19|
+---+---+
only showing top 20 rows

@andygrove (Member Author)

The scan is a org.apache.iceberg.spark.source.SparkBatchQueryScan which does not implement SupportsComet. In fact, I do not see any references to SupportsComet in Iceberg.

@andygrove changed the title from "[wip] docs: Add documentation for accelerating Iceberg Parquet scans with Comet" to "docs: Add documentation for accelerating Iceberg Parquet scans with Comet" on Apr 28, 2025
@andygrove (Member Author)

After modifying Iceberg to make SparkBatchQueryScan implement SupportsComet, this now appears to be fully working.

scala> spark.sql(s"SELECT * from t1").show()
25/04/28 07:29:37 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
25/04/28 07:29:37 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
 CollectLimit [COMET: CollectLimit is not supported]
+-  Project [COMET: toprettystring is not supported]
   +- CometScanWrapper

+---+---+
| c0| c1|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
| 10| 10|
| 11| 11|
| 12| 12|
| 13| 13|
| 14| 14|
| 15| 15|
| 16| 16|
| 17| 17|
| 18| 18|
| 19| 19|
+---+---+
only showing top 20 rows
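The Iceberg-side change described above can be sketched as follows. The interface shape (a single isCometEnabled() method) and the stub class are assumptions based on this thread, not verified Iceberg or Comet source:

```java
/**
 * Hedged sketch of the opt-in interface Comet checks for when deciding
 * whether a DataSourceV2 scan can run natively.
 */
interface SupportsComet {
    boolean isCometEnabled();
}

/**
 * Illustrative stand-in for org.apache.iceberg.spark.source.SparkBatchQueryScan.
 */
class SparkBatchQueryScanStub implements SupportsComet {
    @Override
    public boolean isCometEnabled() {
        // In real Iceberg this would be gated on the table's file format
        // and configuration; hardcoded here for illustration.
        return true;
    }
}

public class SupportsCometSketch {
    public static void main(String[] args) {
        SupportsComet scan = new SparkBatchQueryScanStub();
        System.out.println(scan.isCometEnabled()); // prints true
    }
}
```

Without this marker interface, Comet has no way to recognize the Iceberg scan, which is why the earlier plan fell back to Spark for the BatchScan node.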

@andygrove andygrove marked this pull request as ready for review April 28, 2025 13:31
import org.apache.arrow.memory.BufferAllocator;

/** This is a simple wrapper around SchemaImporter to make it accessible from Java Arrow. */
public class CometSchemaImporter extends AbstractCometSchemaImporter {
@andygrove (Member Author)

This moves CometSchemaImporter out of the Arrow namespace (which gets shaded in Iceberg).

@parthchandra (Contributor)

Are we sure we do not break anything here? This class existed to overcome some package private restrictions in Arrow FFI (cannot remember what though).
If it is safe to do so (i.e. the build does not break), it would make sense to move all the Comet classes in org.apache.arrow.c together. Maybe to org.apache.comet.arrow?
Can be a followup.

@andygrove (Member Author)

The original class is now an abstract base class and is still in the arrow namespace so that it can access private members. My hack was then to add the concrete class to extend it in the Comet namespace.

I agree that we should do more to refactor this so that only the essential parts are in the Arrow namespace. I can file an issue later today.

@codecov-commenter commented Apr 28, 2025

Codecov Report

❌ Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.82%. Comparing base (f09f8af) to head (3bde994).
⚠️ Report is 1145 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...ain/java/org/apache/comet/CometSchemaImporter.java      0.00%   2 Missing ⚠️
...rg/apache/arrow/c/AbstractCometSchemaImporter.java      0.00%   1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1683      +/-   ##
============================================
+ Coverage     56.12%   58.82%   +2.69%     
- Complexity      976     1090     +114     
============================================
  Files           119      126       +7     
  Lines         11743    12602     +859     
  Branches       2251     2362     +111     
============================================
+ Hits           6591     7413     +822     
- Misses         4012     4018       +6     
- Partials       1140     1171      +31     


@parthchandra (Contributor) left a comment

One comment but otherwise lgtm


@huaxingao (Contributor)

> The scan is a org.apache.iceberg.spark.source.SparkBatchQueryScan which does not implement SupportsComet

Sorry, I forgot to mention that I have a PR to enable native execution. It hasn't been merged yet.

@andygrove andygrove merged commit 50a75f3 into apache:main Apr 28, 2025
78 checks passed
@andygrove andygrove deleted the iceberg-docs branch April 28, 2025 20:11
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request Apr 29, 2025
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025


Development

Successfully merging this pull request may close these issues: Iceberg integration broken.

4 participants