Description
Apache Iceberg version
1.10.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
I have been experimenting with various write-audit-publish workflows and I noticed this potentially hazardous behaviour when using spark.wap.id to stage changes.
If multiple snapshots are created with the same wap.id, then the publish_changes procedure will only cherry-pick the earliest matching snapshot, silently ignoring any other staged changes.
Reproduction, using the spark-sql quickstart:
CREATE TABLE demo.default.wap_example
(country string, population bigint)
USING ICEBERG
PARTITIONED BY (country)
TBLPROPERTIES ('write.wap.enabled'='true');
SET spark.wap.id=wap_with_two_snapshots;
-- write two rows into two partitions (will create two snapshots with the same wap.id)
INSERT INTO demo.default.wap_example VALUES ('Canada', 40000000);
INSERT INTO demo.default.wap_example VALUES ('USA', 340000000);
CALL demo.system.publish_changes('demo.default.wap_example', 'wap_with_two_snapshots');
SELECT * from demo.default.wap_example;
-- Canada 40000000
-- Time taken: 0.052 seconds, Fetched 1 row(s)
This makes sense looking at the procedure's definition here, but I think that this is potentially harmful as it can result in writes being silently lost.
Based on the validation performed when cherry-picking snapshots, it looks like it's expected that wap.id will be unique among snapshots. In this case, I think we should raise an error in the publish_changes procedure if there are multiple matching snapshots, to prevent any ambiguity.
I'd be happy to work on a fix, but I want to ensure that my understanding is correct here.
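In the meantime, the ambiguous state described above can be checked for by hand before calling publish_changes. This is a minimal sketch, not something the procedure does today. It assumes the staged id is recorded under the 'wap.id' key of each snapshot's summary map, and that indexing a map with a missing key returns NULL in the Spark version in use (stricter ANSI settings in some versions may raise an error instead):

```sql
-- List every wap.id that is attached to more than one snapshot.
-- Assumption: the staged id lives under the 'wap.id' key of the snapshot summary.
SELECT
  summary['wap.id']         AS wap_id,
  count(*)                  AS snapshot_count,
  collect_list(snapshot_id) AS snapshot_ids
FROM demo.default.wap_example.snapshots
WHERE summary['wap.id'] IS NOT NULL
GROUP BY summary['wap.id']
HAVING count(*) > 1;
```

Any row returned here is a wap.id for which publish_changes would pick only one of the listed snapshots. For the reproduction above I would expect wap_with_two_snapshots to show up with two snapshot ids.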
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time