docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683
andygrove merged 17 commits into apache:main
Conversation
I fixed the issue with
Iceberg shades Parquet. In our internal version of Iceberg, we remove the shading. In OSS, when enabling Comet native execution in apache/iceberg#12709, I actually unshaded Parquet, but I don't think that's the correct approach. Instead of passing a
Thanks @huaxingao, I will try that.
I did get farther, but now I run into the same problem when Iceberg calls the following method, because it needs to pass the Parquet PageReader class in: |
Perhaps we need to have Comet use a shaded version of Parquet as well. I don't know whether the Maven and Gradle approaches to shading will be compatible, though.
Another option is to introduce a higher level of abstraction so that we can avoid directly referencing the Parquet types.
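A rough sketch of what such an abstraction might look like, assuming Comet owned a small Parquet-free interface and Iceberg supplied an adapter around its relocated reader. Every name below (`PageSource`, `ShadedPageReader`, `ShadedPageSource`, `ColumnReaderSketch`) is hypothetical and exists only for illustration; none of them is part of Comet, Iceberg, or Parquet.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, Parquet-free interface that the Comet side could own.
interface PageSource {
    /** Returns the next page as raw bytes, or null when no pages remain. */
    byte[] nextPage();
}

// Stand-in for a relocated (shaded) page reader. The real class would live
// under a relocated package, so the Comet side could not name it directly.
class ShadedPageReader {
    private final Deque<byte[]> pages = new ArrayDeque<>();

    ShadedPageReader(byte[]... pages) {
        for (byte[] page : pages) {
            this.pages.add(page);
        }
    }

    byte[] readPage() {
        return pages.poll(); // null once the queue is empty
    }
}

// Adapter that would live on the Iceberg side, where the shaded type is visible.
class ShadedPageSource implements PageSource {
    private final ShadedPageReader delegate;

    ShadedPageSource(ShadedPageReader delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] nextPage() {
        return delegate.readPage();
    }
}

// Stand-in for the Comet side: it only ever sees PageSource.
class ColumnReaderSketch {
    static int countBytes(PageSource source) {
        int total = 0;
        for (byte[] page = source.nextPage(); page != null; page = source.nextPage()) {
            total += page.length;
        }
        return total;
    }
}
```

With this shape, only the adapter mentions the relocated type, so neither project would need to agree on how the other shades Parquet. Whether the real page-reading contract can be reduced to an interface this small is an open question.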
For now, I went with the approach of updating Iceberg to stop shading Parquet. This now seems to be working, but the plan says it is not running natively.
The scan is a
After modifying Iceberg to make
import org.apache.arrow.memory.BufferAllocator;

/** This is a simple wrapper around SchemaImporter to make it accessible from Java Arrow. */
public class CometSchemaImporter extends AbstractCometSchemaImporter {
This moves CometSchemaImporter out of the Arrow namespace (which gets shaded in Iceberg).
Are we sure we do not break anything here? This class existed to overcome some package private restrictions in Arrow FFI (cannot remember what though).
If it is safe to do so (i.e. the build does not break), it would make sense to move all the Comet classes in org.apache.arrow.c together. Maybe to org.apache.comet.arrow?
Can be a followup.
The original class is now an abstract base class and is still in the arrow namespace so that it can access private members. My hack was then to add the concrete class to extend it in the Comet namespace.
I agree that we should do more to refactor this so that only the essential parts are in the Arrow namespace. I can file an issue later today.
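For readers unfamiliar with the pattern described above, here is a minimal sketch of an abstract base that reaches a package-private member and a concrete subclass that would sit in a different package. It is compressed into one file so that it compiles on its own; in a real layout the first two classes would be public and live in the library's package, while the subclass would live in the project's own package. All class and method names are hypothetical and do not reflect the actual Arrow or Comet APIs.

```java
// Stand-in for a library class whose useful member is package-private.
class LibrarySchemaImporter {
    // Package-private: callable only from classes in the same package.
    String importField(String name) {
        return "field:" + name;
    }
}

// Would live in the library's package so it can reach the package-private member.
abstract class AbstractImporterSketch {
    private final LibrarySchemaImporter importer = new LibrarySchemaImporter();

    // Re-exposed as protected so that subclasses in other packages can call it.
    protected String importField(String name) {
        return importer.importField(name);
    }
}

// Would live in the project's own package; it never touches the
// package-private member directly, only the protected bridge method.
class ImporterSketch extends AbstractImporterSketch {
    String describe(String name) {
        return "imported " + importField(name);
    }
}
```

The base class is the only piece that has to stay in the library's package, which is why shrinking it to the essential bridge methods keeps the rest of the code free to move.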
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1683 +/- ##
============================================
+ Coverage 56.12% 58.82% +2.69%
- Complexity 976 1090 +114
============================================
Files 119 126 +7
Lines 11743 12602 +859
Branches 2251 2362 +111
============================================
+ Hits 6591 7413 +822
- Misses 4012 4018 +6
- Partials 1140 1171 +31
parthchandra left a comment
One comment but otherwise lgtm
Sorry, I forgot to mention that I have a PR to enable native execution. It hasn't been merged yet.
Which issue does this PR close?
Part of #1618
Closes #1684
Rationale for this change
Tell users how to integrate Comet with Iceberg.
What changes are included in this PR?
How are these changes tested?