
publish_changes spark procedure only cherry-picks a single snapshot when there are multiple with the same wap.id #14953

@SamWheating

Apache Iceberg version

1.10.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

I have been experimenting with various write-audit-publish workflows and I noticed this potentially hazardous behaviour when using spark.wap.id to stage changes.

If multiple snapshots are created with the same wap.id, then the publish_changes procedure will only cherry-pick the earliest matching snapshot, silently ignoring any other staged changes.

Reproduction, using the spark-sql quickstart:

CREATE TABLE demo.default.wap_example
    (country string, population bigint)
USING ICEBERG
PARTITIONED BY (country)
TBLPROPERTIES ('write.wap.enabled'='true');

SET spark.wap.id=wap_with_two_snapshots;

-- write two rows into two partitions (will create two snapshots with the same wap.id)
INSERT INTO demo.default.wap_example VALUES ('Canada', 40000000);
INSERT INTO demo.default.wap_example VALUES ('USA', 340000000);

CALL demo.system.publish_changes('demo.default.wap_example', 'wap_with_two_snapshots');

SELECT * FROM demo.default.wap_example;

-- Canada	40000000
-- Time taken: 0.052 seconds, Fetched 1 row(s)

This makes sense given the procedure's definition, but I think it is potentially harmful because it can result in writes being silently lost.
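
To make the failure mode concrete, here is a minimal Python model of the selection logic as described in this report. It is not the Iceberg implementation: the `snapshots` list, its `row` field, and `first_matching_snapshot` are hypothetical stand-ins, and the model assumes the procedure scans snapshots in commit order and stops at the first one whose `wap.id` matches.

```python
from typing import Optional

# Hypothetical stand-in for the table's snapshots: (snapshot_id, summary) in commit order.
# Both snapshots were staged under the same wap.id, as in the reproduction above.
snapshots = [
    (1, {"wap.id": "wap_with_two_snapshots", "row": "Canada"}),
    (2, {"wap.id": "wap_with_two_snapshots", "row": "USA"}),
]


def first_matching_snapshot(snapshots, wap_id) -> Optional[int]:
    """Return the id of the earliest snapshot staged under wap_id, or None."""
    for snapshot_id, summary in snapshots:
        if summary.get("wap.id") == wap_id:
            # Stops at the first hit, so later snapshots with the same wap.id are never seen.
            return snapshot_id
    return None


picked = first_matching_snapshot(snapshots, "wap_with_two_snapshots")
print(f"snapshot selected for cherry-pick: {picked}")
```

Under that assumption only the first snapshot is ever a candidate for publishing, which matches the single `Canada` row returned by the reproduction.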

Based on the validation performed when cherry-picking snapshots, it looks like wap.id is expected to be unique among snapshots. In that case, I think publish_changes should raise an error when multiple snapshots match the given wap.id, to prevent any ambiguity.
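
A rough sketch of the proposed check, again as a standalone Python model rather than a patch against Iceberg. The function name `find_unique_wap_snapshot`, the snapshot representation, and the error messages are all illustrative assumptions; the point is only that every match is collected before anything is published, and ambiguity becomes an error.

```python
# Hypothetical stand-in for the table's snapshots: (snapshot_id, summary) in commit order.
snapshots = [
    (1, {"wap.id": "wap_with_two_snapshots"}),
    (2, {"wap.id": "wap_with_two_snapshots"}),
]


def find_unique_wap_snapshot(snapshots, wap_id):
    """Return the single snapshot id staged under wap_id, or raise if zero or several match."""
    matches = [sid for sid, summary in snapshots if summary.get("wap.id") == wap_id]
    if not matches:
        raise ValueError(f"no snapshot found with wap.id '{wap_id}'")
    if len(matches) > 1:
        # Refuse to guess which staged snapshot the caller meant to publish.
        raise ValueError(
            f"found {len(matches)} snapshots with wap.id '{wap_id}': {matches}"
        )
    return matches[0]


try:
    find_unique_wap_snapshot(snapshots, "wap_with_two_snapshots")
except ValueError as err:
    print(f"publish rejected: {err}")
```

With a check like this, the reproduction above would fail loudly at the `CALL` step instead of publishing only one of the two staged inserts.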

I'd be happy to work on a fix, but I want to ensure that my understanding is correct here.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time


Labels

bug
