
[EPIC] Iceberg Feature Matrix: Spark-Iceberg vs iceberg-rust vs datafusion-comet #3756

@Shekharrajak

We should use the matrix below to identify missing implementations that could accelerate Spark Iceberg pipelines with Comet.

READ PATH

| Feature | Iceberg Java | iceberg-rust | datafusion-comet (via iceberg-rust) |
| --- | --- | --- | --- |
| Basic Parquet scan | Yes | Yes | Yes - IcebergScanExec |
| Positional deletes (V2 MoR) | Yes | Yes - ArrowReader + DeleteVector (RoaringTreemap) | Yes - delegates to ArrowReader with row_selection_enabled(true) |
| Equality deletes (V2 MoR) | Yes | Yes - ArrowReader builds equality delete predicates | Yes - delegates to ArrowReader |
| Deletion vectors (V3) | Yes - DVUtil, DVFileWriter, DVIterator | Yes - DeleteVector + Puffin deletion-vector-v1 blob support | Not wired - Comet doesn't pass DV metadata via protobuf |
| Schema evolution | Yes | Yes | Yes - IcebergStreamWrapper adapts batches to target schema |
| Partition pruning (static) | Yes | Yes | Yes - partitions serialized in protobuf |
| Dynamic partition pruning | Yes (Spark) | N/A (engine-level) | Yes - CometIcebergNativeScanExec defers serialization for DPP |
| Row-group filtering (residuals) | Yes | Yes | Yes - residual predicates converted to iceberg::expr::BoundPredicate |
| Identity partition columns | Yes | Yes | Yes |
| Object stores (S3/GCS/OSS) | Yes (Hadoop FS) | Yes (OpenDAL) | Yes (OpenDAL via FileIOBuilder) |
| V1 spec | Yes | Yes | Yes |
| V2 spec | Yes | Yes | Yes |
| V3 spec metadata | Yes | Yes (FormatVersion::V3, next_row_id, row lineage) | Not used - Comet doesn't handle V3-specific metadata |
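
To illustrate the positional-delete row in the read path, here is a minimal, std-only sketch of turning a set of deleted row positions into alternating select/skip runs, which is the general idea behind a row selection. It is not the iceberg-rust or Comet implementation: `Run` and `deletes_to_runs` are hypothetical names, and a `BTreeSet<u64>` stands in for the `RoaringTreemap`-backed `DeleteVector` mentioned in the matrix.

```rust
use std::collections::BTreeSet;

/// One run of consecutive rows that is either kept or skipped.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Run {
    Select(u64),
    Skip(u64),
}

/// Turn deleted row positions into alternating select/skip runs covering
/// rows `0..total_rows`. Positions at or beyond `total_rows` are ignored.
pub fn deletes_to_runs(total_rows: u64, deleted: &BTreeSet<u64>) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut cursor = 0u64; // first row not yet covered by a run
    let mut iter = deleted.range(..total_rows).copied().peekable();

    while let Some(start) = iter.next() {
        // Rows between the cursor and this deleted position are kept.
        if start > cursor {
            runs.push(Run::Select(start - cursor));
        }
        // Extend the skip run across consecutive deleted positions.
        let mut end = start + 1;
        while iter.peek() == Some(&end) {
            iter.next();
            end += 1;
        }
        runs.push(Run::Skip(end - start));
        cursor = end;
    }

    // Any rows after the last deleted position are kept.
    if cursor < total_rows {
        runs.push(Run::Select(total_rows - cursor));
    }
    runs
}
```

For a 10-row file with deleted positions 2, 3 and 7, this yields select 2, skip 2, select 3, skip 1, select 2. Expressing deletes as runs lets a reader skip whole ranges rather than filtering row by row.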

WRITE PATH

| Feature | Iceberg Java | iceberg-rust | datafusion-comet |
| --- | --- | --- | --- |
| Data file writing | Yes - DataWriter | Yes - DataFileWriter | No - uses raw parquet crate, not iceberg-rust |
| Partitioned writes (sorted) | Yes - ClusteredDataWriter | Yes - ClusteredWriter | No - writes single file per Spark partition |
| Partitioned writes (fanout) | Yes - FanoutDataWriter | Yes - FanoutWriter | No |
| Rolling file writer | Yes | Yes - RollingFileWriter | No |
| Equality delete writer | Yes | Yes - EqualityDeleteWriter | No |
| Position delete writer | Yes | Partial | No |
| Deletion vector writer | Yes - DVFileWriter, PartitioningDVWriter | No explicit DV writer | No |
| AppendFiles / FastAppend | Yes - AppendFiles | Yes - FastAppendAction | No - commit done in Java |
| OverwriteFiles | Yes - OverwriteFiles | Missing | No |
| ReplacePartitions | Yes - ReplacePartitions | Missing | No |
| DeleteFiles | Yes - DeleteFiles | Missing | No |
| RowDelta | Yes - RowDelta | Missing | No |
| RewriteFiles | Yes - RewriteFiles | Missing | No |
| Transaction + commit | Yes - full atomic commit | Yes - Transaction::commit() with retry | No - commit is JVM-side |
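
The difference between the sorted (clustered) and fanout partitioned writers in the matrix can be sketched as follows. This is a std-only toy model, not the iceberg-rust `ClusteredWriter` or `FanoutWriter` API: the function names are hypothetical, rows are `(partition, value)` pairs, and a "file" is just a `Vec<i64>`. The assumption modeled here is that a clustered writer keeps one partition open at a time and rejects input that revisits a closed partition, while a fanout writer keeps one open buffer per partition.

```rust
use std::collections::{BTreeMap, HashSet};

/// Fanout strategy: one open buffer per partition, so rows may arrive in
/// any order. Memory grows with the number of distinct partitions.
pub fn fanout_write(rows: &[(&str, i64)]) -> BTreeMap<String, Vec<i64>> {
    let mut open: BTreeMap<String, Vec<i64>> = BTreeMap::new();
    for &(partition, value) in rows {
        open.entry(partition.to_string()).or_default().push(value);
    }
    open
}

/// Clustered strategy: only the most recent partition stays open. Input must
/// be grouped by partition; revisiting a closed partition is an error.
pub fn clustered_write(rows: &[(&str, i64)]) -> Result<Vec<(String, Vec<i64>)>, String> {
    let mut closed: HashSet<String> = HashSet::new();
    let mut files: Vec<(String, Vec<i64>)> = Vec::new();

    for &(partition, value) in rows {
        let same_as_current = files
            .last()
            .map_or(false, |(p, _)| p.as_str() == partition);
        if !same_as_current {
            // Close the previously open partition, if any.
            if let Some((prev, _)) = files.last() {
                closed.insert(prev.clone());
            }
            if closed.contains(partition) {
                return Err(format!(
                    "partition {} seen again after it was closed",
                    partition
                ));
            }
            files.push((partition.to_string(), Vec::new()));
        }
        // Append to the currently open partition's file.
        files.last_mut().unwrap().1.push(value);
    }
    Ok(files)
}
```

With input `[("a", 1), ("b", 2), ("a", 3)]`, the fanout sketch produces two buffers, while the clustered sketch returns an error because partition `a` reappears after `b`. Sorting the same rows by partition first makes the clustered path succeed.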

ROW-LEVEL OPERATIONS (DELETE/UPDATE/MERGE)

| Feature | Iceberg Java + Spark | iceberg-rust | datafusion-comet |
| --- | --- | --- | --- |
| Copy-on-Write (CoW) scan | Yes - SparkCopyOnWriteScan | No CoW scan | No |
| Copy-on-Write write | Yes - rewrite affected data files | Partial (rewrite manually) | No |
| Merge-on-Read (MoR) scan | Yes - buildMergeOnReadScan() | Yes - ArrowReader applies deletes | Yes (read only) |
| MoR position delta write | Yes - SparkPositionDeltaWrite | No | No |
| DELETE FROM | Yes (CoW or MoR) | No action | No |
| UPDATE | Yes (CoW or MoR) | No action | No |
| MERGE INTO | Yes - SparkRowLevelOperationBuilder | No (issue #2201) | No |
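
As a rough illustration of the CoW and MoR rows above, the sketch below models a `DELETE` under both strategies. It is a std-only toy, not code from Iceberg, iceberg-rust, or Comet: `DataFile`, `cow_delete`, `mor_delete`, and `mor_scan` are hypothetical names, and a data file is reduced to a path plus a vector of values.

```rust
/// A data file modeled as a path plus its rows (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub path: String,
    pub rows: Vec<i64>,
}

/// Copy-on-write: rewrite the affected file without the matching rows.
/// Returns `None` when no row matches, meaning the file is left untouched.
pub fn cow_delete(
    file: &DataFile,
    new_path: &str,
    matches: impl Fn(i64) -> bool,
) -> Option<DataFile> {
    if !file.rows.iter().any(|&v| matches(v)) {
        return None;
    }
    Some(DataFile {
        path: new_path.to_string(),
        rows: file.rows.iter().copied().filter(|&v| !matches(v)).collect(),
    })
}

/// Merge-on-read: keep the data file as is and emit
/// `(file path, row position)` delete records instead.
pub fn mor_delete(file: &DataFile, matches: impl Fn(i64) -> bool) -> Vec<(String, u64)> {
    file.rows
        .iter()
        .enumerate()
        .filter(|(_, &v)| matches(v))
        .map(|(pos, _)| (file.path.clone(), pos as u64))
        .collect()
}

/// Reader side of merge-on-read: drop rows whose position has a delete
/// record for this file.
pub fn mor_scan(file: &DataFile, deletes: &[(String, u64)]) -> Vec<i64> {
    file.rows
        .iter()
        .enumerate()
        .filter(|(pos, _)| {
            !deletes
                .iter()
                .any(|(path, p)| path == &file.path && *p == *pos as u64)
        })
        .map(|(_, &v)| v)
        .collect()
}
```

Both strategies expose the same rows to readers. CoW pays the cost at write time by rewriting the file, while MoR defers it to scan time, which is why the MoR scan is the only row-level piece Comet covers today according to the matrix.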
