Is your feature request related to a problem or challenge?
Currently, iceberg-rust doesn't provide a way to see the changes between two snapshots. In Spark, through the Iceberg Java implementation, this is done using the create_changelog_view procedure. This is very useful for doing change data capture on top of Iceberg tables.
Describe the solution you'd like
In default mode, the output of Spark's create_changelog_view shows each row's user-defined columns, plus three metadata columns (_change_type, _change_ordinal, _commit_snapshot_id).
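To make the target shape concrete, here is a minimal sketch of how such a row could be modeled in Rust. The names `ChangeType` and `ChangelogRow` are hypothetical and not part of iceberg-rust today. The sketch assumes that the default mode produces only insert and delete changes, and that the ordinal and snapshot ID are 32-bit and 64-bit integers respectively.

```rust
/// Kind of change a changelog row represents.
/// Assumption: only the two kinds expected in default mode are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeType {
    Insert,
    Delete,
}

impl ChangeType {
    /// String form as it would appear in the `_change_type` column.
    pub fn as_str(&self) -> &'static str {
        match self {
            ChangeType::Insert => "INSERT",
            ChangeType::Delete => "DELETE",
        }
    }
}

/// Hypothetical changelog row: the user-defined columns (`data`)
/// plus the three metadata columns.
#[derive(Debug, Clone, PartialEq)]
pub struct ChangelogRow<T> {
    /// User-defined columns; left generic in this sketch.
    pub data: T,
    /// `_change_type`
    pub change_type: ChangeType,
    /// `_change_ordinal`: order of the change within the scanned range.
    pub change_ordinal: i32,
    /// `_commit_snapshot_id`: snapshot in which the change was committed.
    pub commit_snapshot_id: i64,
}
```

In a real implementation the user columns would more likely stay in Arrow record batches, with the three metadata columns appended as extra arrays. The generic `data` field only keeps the sketch short.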
The Java code does this incrementally, meaning only the data between the optional timestamps (or commit IDs) is processed. Here are some references:
openChangelogScanTask in https://github.com/apache/iceberg/blob/efbfb7ef9addeb33e72208c927936e50b92d3357/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ChangelogRowReader.java
doPlanFiles in https://github.com/apache/iceberg/blob/6ec3de390d3fa6e797c6975b1eaaea41719db0fe/core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java
BaseAddedRowsScanTask and BaseDeletedDataFileScanTask. BaseDeletedRowsScanTask is unused, which means that the changelog scan in Spark supports only copy-on-write deletes, not row-level deletes. It would be good if Rust supported row-level deletes as well; I see no particular reason why Spark doesn't.
create_changelog_view has several options. We probably don't have to support them all in Rust immediately, and could add them over time.
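The incremental planning described above could be sketched roughly as follows. This is not the Java algorithm and not an existing iceberg-rust API: `SnapshotChanges`, `ChangelogScanTask` and `plan_changelog` are hypothetical names, a snapshot is reduced to the data files it added and removed, and the exclusive-start / inclusive-end range semantics and the deletes-before-inserts ordering are assumptions of this sketch. The `DeletedRows` variant is included only to show where row-level deletes would fit; the sketch never produces it.

```rust
/// Simplified stand-in for a snapshot: only the data files it added and removed.
#[derive(Debug, Clone)]
pub struct SnapshotChanges {
    pub snapshot_id: i64,
    pub added_files: Vec<String>,
    pub removed_files: Vec<String>,
}

/// Hypothetical task types, named after the Java classes referenced above.
#[derive(Debug, Clone, PartialEq)]
pub enum ChangelogScanTask {
    /// Every row of a newly added data file is an insert.
    AddedRows { file: String, change_ordinal: i32, commit_snapshot_id: i64 },
    /// Every row of a removed data file is a delete (copy-on-write).
    DeletedDataFile { file: String, change_ordinal: i32, commit_snapshot_id: i64 },
    /// Rows removed by delete files (row-level deletes); not produced here.
    DeletedRows {
        file: String,
        delete_files: Vec<String>,
        change_ordinal: i32,
        commit_snapshot_id: i64,
    },
}

/// Plans tasks for the snapshots after `from_exclusive` up to and including
/// `to_inclusive`. `history` must be in commit order. Returns an empty plan
/// if either ID is unknown or the range is empty.
pub fn plan_changelog(
    history: &[SnapshotChanges],
    from_exclusive: Option<i64>,
    to_inclusive: i64,
) -> Vec<ChangelogScanTask> {
    let start = match from_exclusive {
        Some(id) => match history.iter().position(|s| s.snapshot_id == id) {
            Some(idx) => idx + 1,
            None => return Vec::new(),
        },
        None => 0,
    };
    let end = match history.iter().position(|s| s.snapshot_id == to_inclusive) {
        Some(idx) => idx + 1,
        None => return Vec::new(),
    };
    if start >= end {
        return Vec::new();
    }

    let mut tasks = Vec::new();
    // Only snapshots inside the range are visited; the ordinal is the
    // position of the snapshot within that range.
    for (ordinal, snap) in history[start..end].iter().enumerate() {
        let change_ordinal = ordinal as i32;
        for file in &snap.removed_files {
            tasks.push(ChangelogScanTask::DeletedDataFile {
                file: file.clone(),
                change_ordinal,
                commit_snapshot_id: snap.snapshot_id,
            });
        }
        for file in &snap.added_files {
            tasks.push(ChangelogScanTask::AddedRows {
                file: file.clone(),
                change_ordinal,
                commit_snapshot_id: snap.snapshot_id,
            });
        }
    }
    tasks
}
```

A real planner would read the added and deleted entries from each snapshot's manifests rather than from an in-memory list, and supporting row-level deletes would mean emitting `DeletedRows` tasks for data files that gained new delete files within the range.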
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community.