
docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683

Merged: andygrove merged 17 commits into apache:main from andygrove:iceberg-docs on Apr 28, 2025

Conversation

@andygrove (Member) commented Apr 26, 2025

Which issue does this PR close?

Part of #1618

Closes #1684

Rationale for this change

Tell users how to integrate Comet with Iceberg.

What changes are included in this PR?

  • Documentation
  • Refactoring to avoid a shading issue

How are these changes tested?

@andygrove (Member Author)

I fixed the issue with CometSchemaImporter being shaded, but now I'm hitting a similar issue with a shaded Parquet class:

org.apache.iceberg.shaded.org.apache.parquet.column.ColumnDescriptor

@huaxingao (Contributor)

Iceberg shades Parquet. In our internal version of Iceberg, we remove the shading. In OSS, when enabling Comet native execution in apache/iceberg#12709, I actually unshaded Parquet, but I don't think that's the correct approach.

Instead of passing a ColumnDescriptor from Iceberg to Comet, should we pass a String[] path, PrimitiveType.PrimitiveTypeName type, int maxRep, and int maxDef, so that we can construct the ColumnDescriptor on the Comet side?
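The suggestion above can be sketched as a small shading-neutral carrier type. This is a hypothetical illustration, not actual Comet or Iceberg API: the class name and fields are invented here to show how the four components could cross the boundary as plain Java types, with the Parquet `ColumnDescriptor` rebuilt only on the Comet side.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: the shading-neutral pieces Iceberg could pass instead
 * of a (possibly shaded) org.apache.parquet.column.ColumnDescriptor.
 */
public class ColumnDescriptorParts {
    final String[] path;            // column path within the Parquet schema
    final String primitiveTypeName; // e.g. "INT32" (PrimitiveTypeName.name())
    final int maxRepetitionLevel;
    final int maxDefinitionLevel;

    ColumnDescriptorParts(String[] path, String primitiveTypeName,
                          int maxRepetitionLevel, int maxDefinitionLevel) {
        this.path = path;
        this.primitiveTypeName = primitiveTypeName;
        this.maxRepetitionLevel = maxRepetitionLevel;
        this.maxDefinitionLevel = maxDefinitionLevel;
    }

    // On the Comet side, these parts could be turned back into an unshaded
    // descriptor, roughly (assuming parquet-column on the classpath):
    //   PrimitiveType type = Types
    //       .optional(PrimitiveTypeName.valueOf(primitiveTypeName))
    //       .named(path[path.length - 1]);
    //   new ColumnDescriptor(path, type, maxRepetitionLevel, maxDefinitionLevel);

    public static void main(String[] args) {
        ColumnDescriptorParts parts = new ColumnDescriptorParts(
            new String[] {"c0"}, "INT32", 0, 1);
        System.out.println(Arrays.toString(parts.path) + " " + parts.primitiveTypeName
            + " rep=" + parts.maxRepetitionLevel + " def=" + parts.maxDefinitionLevel);
    }
}
```

Because only strings and ints cross the boundary, it no longer matters whether Iceberg's copy of Parquet is relocated.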

@andygrove (Member Author)

> Iceberg shades Parquet. In our internal version of Iceberg, we remove the shading. In OSS, when enabling Comet native execution in apache/iceberg#12709, I actually unshaded Parquet, but I don't think that's the correct approach.
>
> Instead of passing a ColumnDescriptor from Iceberg to Comet, should we pass a String[] path, PrimitiveType.PrimitiveTypeName type, int maxRep, and int maxDef, so that we can construct the ColumnDescriptor on the Comet side?

Thanks @huaxingao, I will try that.

@andygrove (Member Author)

> Instead of passing a ColumnDescriptor from Iceberg to Comet, should we pass a String[] path, PrimitiveType.PrimitiveTypeName type, int maxRep, and int maxDef, so that we can construct the ColumnDescriptor on the Comet side?

I did get farther, but now I run into the same problem when Iceberg calls the following method, because it needs to pass the Parquet PageReader class in:

public void setPageReader(PageReader pageReader) throws IOException

@andygrove (Member Author)

Perhaps we need to have Comet use a shaded version of Parquet also. I don't know if the Maven and Gradle approaches to shading will be compatible though.
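For reference, the shaded class name seen earlier (org.apache.iceberg.shaded.org.apache.parquet.column.ColumnDescriptor) implies Iceberg's Gradle build relocates the org.apache.parquet package. A Maven-side equivalent via maven-shade-plugin would look roughly like the fragment below; the shadedPattern shown is illustrative, and for the two builds to interoperate it would have to agree with Iceberg's relocation, which is exactly the compatibility concern raised here.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.parquet</pattern>
            <!-- Illustrative target; would need to match the relocation
                 used by the Iceberg build for the types to line up. -->
            <shadedPattern>org.apache.iceberg.shaded.org.apache.parquet</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```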

@andygrove (Member Author)

Another option is to introduce a higher level of abstraction so that we can avoid directly referencing the Parquet types.
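As a minimal sketch of that idea, Comet could expose its own interface and let an Iceberg-side adapter implement it, so that no Parquet type (shaded or not) ever crosses the boundary. All names below are invented for illustration and are not actual Comet API:

```java
/**
 * Hypothetical abstraction replacing setPageReader(PageReader): Comet depends
 * only on this interface; Iceberg wraps its (possibly shaded) PageReader in
 * an adapter on its own side of the boundary.
 */
interface CometPageSource {
    byte[] nextPageBytes(); // raw page bytes; null when exhausted
}

/** Toy adapter standing in for an Iceberg-side PageReader wrapper. */
class InMemoryPageSource implements CometPageSource {
    private final byte[][] pages;
    private int next = 0;

    InMemoryPageSource(byte[][] pages) { this.pages = pages; }

    public byte[] nextPageBytes() {
        return next < pages.length ? pages[next++] : null;
    }
}

public class AbstractionSketch {
    // Comet-side entry point: references only the neutral interface,
    // never org.apache.parquet types.
    static long countPages(CometPageSource source) {
        long n = 0;
        while (source.nextPageBytes() != null) n++;
        return n;
    }

    public static void main(String[] args) {
        CometPageSource src = new InMemoryPageSource(new byte[][] {{1}, {2}});
        System.out.println(countPages(src)); // prints 2
    }
}
```

The trade-off is that every Parquet type crossing the boundary needs a corresponding neutral interface, which is more API surface to maintain.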

@andygrove (Member Author)

For now, I went with the approach of updating Iceberg to stop shading Parquet. This now seems to be working, but the plan says it is not running natively.

scala> spark.sql(s"SELECT * from t1").show()
25/04/28 06:59:19 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
25/04/28 06:59:19 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
 CollectLimit [COMET: CollectLimit is not supported]
+-  Project [COMET: Project is not native because the following children are not native (BatchScan spark_catalog.default.t1)]
   +-  BatchScan spark_catalog.default.t1 [COMET: Comet Scan only supports Parquet and Iceberg Parquet file formats, BatchScan spark_catalog.default.t1 is not supported]

+---+---+
| c0| c1|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
| 10| 10|
| 11| 11|
| 12| 12|
| 13| 13|
| 14| 14|
| 15| 15|
| 16| 16|
| 17| 17|
| 18| 18|
| 19| 19|
+---+---+
only showing top 20 rows

@andygrove (Member Author)

The scan is a org.apache.iceberg.spark.source.SparkBatchQueryScan which does not implement SupportsComet. In fact, I do not see any references to SupportsComet in Iceberg.

@andygrove changed the title from "[wip] docs: Add documentation for accelerating Iceberg Parquet scans with Comet" to "docs: Add documentation for accelerating Iceberg Parquet scans with Comet" on Apr 28, 2025
@andygrove (Member Author)

After modifying Iceberg to make SparkBatchQueryScan implement SupportsComet, this now appears to be fully working.

scala> spark.sql(s"SELECT * from t1").show()
25/04/28 07:29:37 INFO core/src/lib.rs: Comet native library version 0.9.0 initialized
25/04/28 07:29:37 WARN CometSparkSessionExtensions$CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging):
 CollectLimit [COMET: CollectLimit is not supported]
+-  Project [COMET: toprettystring is not supported]
   +- CometScanWrapper

+---+---+
| c0| c1|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
| 10| 10|
| 11| 11|
| 12| 12|
| 13| 13|
| 14| 14|
| 15| 15|
| 16| 16|
| 17| 17|
| 18| 18|
| 19| 19|
+---+---+
only showing top 20 rows
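The Iceberg-side change described above can be sketched as follows. The interface shape (a single isCometEnabled() method) and the stub class are assumptions based on this thread, not verified Iceberg or Comet source:

```java
/**
 * Hedged sketch of the opt-in interface Comet checks for when deciding
 * whether a DataSourceV2 scan can run natively.
 */
interface SupportsComet {
    boolean isCometEnabled();
}

/**
 * Illustrative stand-in for org.apache.iceberg.spark.source.SparkBatchQueryScan.
 */
class SparkBatchQueryScanStub implements SupportsComet {
    @Override
    public boolean isCometEnabled() {
        // In real Iceberg this would be gated on the table's file format
        // and configuration; hardcoded here for illustration.
        return true;
    }
}

public class SupportsCometSketch {
    public static void main(String[] args) {
        SupportsComet scan = new SparkBatchQueryScanStub();
        System.out.println(scan.isCometEnabled()); // prints true
    }
}
```

Without this marker interface, Comet has no way to recognize the Iceberg scan, which is why the earlier plan fell back to Spark for the BatchScan node.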

@andygrove andygrove marked this pull request as ready for review April 28, 2025 13:31
import org.apache.arrow.memory.BufferAllocator;

/** This is a simple wrapper around SchemaImporter to make it accessible from Java Arrow. */
public class CometSchemaImporter extends AbstractCometSchemaImporter {
@andygrove (Member Author)

This moves CometSchemaImporter out of the Arrow namespace (which gets shaded in Iceberg).

@parthchandra (Contributor)

Are we sure we do not break anything here? This class existed to overcome some package private restrictions in Arrow FFI (cannot remember what though).
If it is safe to do so (i.e. the build does not break), it would make sense to move all the Comet classes in org.apache.arrow.c together. Maybe to org.apache.comet.arrow?
Can be a followup.

@andygrove (Member Author)

The original class is now an abstract base class and is still in the arrow namespace so that it can access private members. My hack was then to add the concrete class to extend it in the Comet namespace.

I agree that we should do more to refactor this so that only the essential parts are in the Arrow namespace. I can file an issue later today.

@codecov-commenter commented Apr 28, 2025

Codecov Report

❌ Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.82%. Comparing base (f09f8af) to head (3bde994).
⚠️ Report is 1145 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...ain/java/org/apache/comet/CometSchemaImporter.java      0.00%   2 Missing ⚠️
...rg/apache/arrow/c/AbstractCometSchemaImporter.java      0.00%   1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1683      +/-   ##
============================================
+ Coverage     56.12%   58.82%   +2.69%     
- Complexity      976     1090     +114     
============================================
  Files           119      126       +7     
  Lines         11743    12602     +859     
  Branches       2251     2362     +111     
============================================
+ Hits           6591     7413     +822     
- Misses         4012     4018       +6     
- Partials       1140     1171      +31     


@parthchandra (Contributor) left a comment

One comment but otherwise lgtm


@huaxingao (Contributor)

> The scan is a org.apache.iceberg.spark.source.SparkBatchQueryScan which does not implement SupportsComet

Sorry, I forgot to mention that I have a PR to enable native execution. It hasn't been merged yet.

@andygrove andygrove merged commit 50a75f3 into apache:main Apr 28, 2025
78 checks passed
@andygrove andygrove deleted the iceberg-docs branch April 28, 2025 20:11
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request Apr 29, 2025
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025


Development

Successfully merging this pull request may close these issues: Iceberg integration broken.

4 participants