We should use the matrices below to check for any missing implementations that could accelerate the Spark Iceberg pipeline using Comet.
READ PATH
| Feature | Iceberg Java | iceberg-rust | datafusion-comet (via iceberg-rust) |
|---|---|---|---|
| Basic Parquet scan | Yes | Yes | Yes - IcebergScanExec |
| Positional deletes (V2 MoR) | Yes | Yes - ArrowReader + DeleteVector (RoaringTreemap) | Yes - delegates to ArrowReader with row_selection_enabled(true) |
| Equality deletes (V2 MoR) | Yes | Yes - ArrowReader builds equality delete predicates | Yes - delegates to ArrowReader |
| Deletion vectors (V3) | Yes - DVUtil, DVFileWriter, DVIterator | Yes - DeleteVector + Puffin deletion-vector-v1 blob support | Not wired - Comet doesn't pass DV metadata via protobuf |
| Schema evolution | Yes | Yes | Yes - IcebergStreamWrapper adapts batches to target schema |
| Partition pruning (static) | Yes | Yes | Yes - partitions serialized in protobuf |
| Dynamic partition pruning | Yes (Spark) | N/A (engine-level) | Yes - CometIcebergNativeScanExec defers serialization for DPP |
| Row-group filtering (residuals) | Yes | Yes | Yes - residual predicates converted to iceberg::expr::BoundPredicate |
| Identity partition columns | Yes | Yes | Yes |
| Object stores (S3/GCS/OSS) | Yes (Hadoop FS) | Yes (OpenDAL) | Yes (OpenDAL via FileIOBuilder) |
| V1 spec | Yes | Yes | Yes |
| V2 spec | Yes | Yes | Yes |
| V3 spec metadata | Yes | Yes (FormatVersion::V3, next_row_id, row lineage) | Not used - Comet doesn't handle V3-specific metadata |
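To make the positional-delete row above concrete: the sketch below is a simplified, dependency-free model of how deleted row positions for one data file become a row selection at scan time. A `BTreeSet` stands in for the RoaringTreemap-backed `DeleteVector`; the function names are illustrative and are not the actual iceberg-rust `ArrowReader` API.

```rust
use std::collections::BTreeSet;

/// Toy model of positional-delete application: the reader collects the
/// deleted positions for a data file into an ordered set, and the row
/// selection is the complement of that set over the file's row count.
fn apply_positional_deletes(num_rows: u64, deleted: &BTreeSet<u64>) -> Vec<u64> {
    (0..num_rows).filter(|pos| !deleted.contains(pos)).collect()
}

fn main() {
    // Positions 1 and 3 are marked deleted by a positional delete file.
    let deleted: BTreeSet<u64> = [1, 3].into_iter().collect();
    let surviving = apply_positional_deletes(5, &deleted);
    assert_eq!(surviving, vec![0, 2, 4]);
}
```

In the real read path the surviving positions are handed to the Parquet reader as a row selection rather than materialized as a vector, which is what `row_selection_enabled(true)` opts into.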
WRITE PATH
| Feature | Iceberg Java | iceberg-rust | datafusion-comet |
|---|---|---|---|
| Data file writing | Yes - DataWriter | Yes - DataFileWriter | No - uses raw parquet crate, not iceberg-rust |
| Partitioned writes (sorted) | Yes - ClusteredDataWriter | Yes - ClusteredWriter | No - writes single file per Spark partition |
| Partitioned writes (fanout) | Yes - FanoutDataWriter | Yes - FanoutWriter | No |
| Rolling file writer | Yes | Yes - RollingFileWriter | No |
| Equality delete writer | Yes | Yes - EqualityDeleteWriter | No |
| Position delete writer | Yes | Partial | No |
| Deletion vector writer | Yes - DVFileWriter, PartitioningDVWriter | No explicit DV writer | No |
| AppendFiles / FastAppend | Yes - AppendFiles | Yes - FastAppendAction | No - commit done in Java |
| OverwriteFiles | Yes - OverwriteFiles | Missing | No |
| ReplacePartitions | Yes - ReplacePartitions | Missing | No |
| DeleteFiles | Yes - DeleteFiles | Missing | No |
| RowDelta | Yes - RowDelta | Missing | No |
| RewriteFiles | Yes - RewriteFiles | Missing | No |
| Transaction + commit | Yes - full atomic commit | Yes - Transaction::commit() with retry | No - commit is JVM-side |
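The rolling-file-writer row is one of the simpler gaps to picture. The sketch below is a hypothetical, stdlib-only model of the pattern (it is not the iceberg-rust `RollingFileWriter` API): rows accumulate in the current file until a size target would be exceeded, at which point the file is sealed and a new one begins.

```rust
/// Illustrative rolling writer: tracks only byte counts, not real files.
struct RollingWriter {
    target_size: usize,
    current_size: usize,
    sealed_files: Vec<usize>, // sizes of files already rolled
}

impl RollingWriter {
    fn new(target_size: usize) -> Self {
        Self { target_size, current_size: 0, sealed_files: Vec::new() }
    }

    /// Append a row; roll first if it would push the file past the target.
    fn write(&mut self, row_size: usize) {
        if self.current_size > 0 && self.current_size + row_size > self.target_size {
            self.roll();
        }
        self.current_size += row_size;
    }

    /// Seal the current file and start a new one.
    fn roll(&mut self) {
        self.sealed_files.push(self.current_size);
        self.current_size = 0;
    }

    /// Flush the last partial file and return all sealed file sizes.
    fn close(mut self) -> Vec<usize> {
        if self.current_size > 0 {
            self.roll();
        }
        self.sealed_files
    }
}

fn main() {
    let mut w = RollingWriter::new(100);
    for _ in 0..25 {
        w.write(10); // 250 bytes total against a 100-byte target
    }
    assert_eq!(w.close(), vec![100, 100, 50]);
}
```

Because Comet currently writes a single file per Spark partition, a skewed partition produces one oversized file; the rolling pattern above is what bounds file sizes in both the Java and Rust writers.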
ROW-LEVEL OPERATIONS (DELETE/UPDATE/MERGE)
| Feature | Iceberg Java + Spark | iceberg-rust | datafusion-comet |
|---|---|---|---|
| Copy-on-Write (CoW) scan | Yes - SparkCopyOnWriteScan | No CoW scan | No |
| Copy-on-Write write | Yes - rewrite affected data files | Partial (rewrite manually) | No |
| Merge-on-Read (MoR) scan | Yes - buildMergeOnReadScan() | Yes - ArrowReader applies deletes | Yes (read only) |
| MoR position delta write | Yes - SparkPositionDeltaWrite | No | No |
| DELETE FROM | Yes (CoW or MoR) | No action | No |
| UPDATE | Yes (CoW or MoR) | No action | No |
| MERGE INTO | Yes - SparkRowLevelOperationBuilder | No (issue #2201) | No |
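The CoW-vs-MoR distinction that runs through this table can be sketched in a few lines. The model below is a toy (illustrative names, no real Iceberg API): Copy-on-Write eagerly rewrites the data file without the deleted rows, while Merge-on-Read leaves the file untouched and records deleted positions for readers to merge at scan time. Both strategies must produce the same visible rows.

```rust
use std::collections::BTreeSet;

/// Copy-on-Write: rewrite the data file, dropping rows matching the predicate.
fn cow_delete(rows: &[i64], pred: impl Fn(i64) -> bool) -> Vec<i64> {
    rows.iter().copied().filter(|r| !pred(*r)).collect()
}

/// Merge-on-Read: record the positions of deleted rows; data file is untouched.
fn mor_delete(rows: &[i64], pred: impl Fn(i64) -> bool) -> BTreeSet<usize> {
    rows.iter()
        .enumerate()
        .filter(|(_, r)| pred(**r))
        .map(|(i, _)| i)
        .collect()
}

/// MoR scan: merge the positional delete set while reading.
fn mor_scan(rows: &[i64], deletes: &BTreeSet<usize>) -> Vec<i64> {
    rows.iter()
        .enumerate()
        .filter(|(i, _)| !deletes.contains(i))
        .map(|(_, r)| *r)
        .collect()
}

fn main() {
    let rows = [10, 20, 30, 40];
    // DELETE FROM t WHERE value > 25, under each strategy.
    let cow = cow_delete(&rows, |r| r > 25);
    let deletes = mor_delete(&rows, |r| r > 25);
    // Both strategies yield the same visible rows.
    assert_eq!(cow, mor_scan(&rows, &deletes));
    assert_eq!(cow, vec![10, 20]);
}
```

The table reflects this split: Comet (via iceberg-rust's ArrowReader) covers the MoR *scan* half, but the write half of either strategy, and therefore DELETE/UPDATE/MERGE, still runs through Iceberg Java and Spark.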