-
Notifications
You must be signed in to change notification settings - Fork 395
feat: TableMetadata Statistic Files #799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| #[serde(with = "_serde_set_statistics")] | ||
| SetStatistics { | ||
| /// File containing the statistics | ||
| statistics: StatisticsFile, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In IRC and Java SetStatistics has an additional field snapshot_id (link).
This field is redundant with StatisticsFile.snapshot_id and is only used as an assertion in Java.
I removed the redundancy for rust and will start a discussion on the mailing List how to handle this.
As we still need to be spec compliant, we need custom serializer / deserializer.
Slack: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1734109745807119
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this on the dev-list: https://lists.apache.org/thread/6fnrp73wokfqlt5vormpjyjmtvl29ll1
| /// even if the refs map is null. | ||
| pub(crate) refs: HashMap<String, SnapshotReference>, | ||
| /// Mapping of snapshot ids to statistics files. | ||
| pub(crate) statistics: HashMap<i64, StatisticsFile>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we ever have more than one StatisticsFile for a snapshot_id?
In java this is modeled as a mapping snapshot_id : Vec<StatisticsFile>, however I couldn't find a way to get more than one element into the Vec for a snapshot_id other than deserializing.
https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1310-L1320
fc26394 to
f1c166c
Compare
| pub fn set_statistics(mut self, statistics: StatisticsFile) -> Self { | ||
| self.metadata | ||
| .statistics | ||
| .insert(statistics.snapshot_id, statistics.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to check if a snapshot-id exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this as well, but then followed java.
Currently statistics and snapshots are quite separate from each other. If we implement your check (which I like), I think we should eventually also implement:
- Upon deserialization discard statistics that belong to nonexistant snapshots
- When a snapshot is removed delete the statistics for it as well
This would result in snapshots for statistics not missing. It is unclear however what should happen to the puffin files in these cases. We would have coherent metadata, but probably also orphan files.
Do we know why the check is not there in Java?
Xuanwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds
StatisticFileandPartitionStatisticsFileto spec, builder and REST TableUpdate.By far most of the code are tests. I hope that the size of the PR is OK.