-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: add snapshot tags interface #2961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice to see this get traction. It seems we would need to make a small update the spec.
Left some comments.
| private final int defaultSortOrderId; | ||
| private final List<SortOrder> sortOrders; | ||
| private final Map<String, String> properties; | ||
| private final long currentSnapshotId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle all the update snapshot calls? ie, replaceCurrentSnapshot, addStagedSnashot, etc.
|
Thanks for the timely review! Around the documentation, because this will technically be v3 change, we need to add one more column for all tables in the spec doc, I will update that later. For the update snapshot changes like One potential alternative that I would like to just present is to directly have a
The disadvantages are:
Based on these considerations I think adding the tags directly in the snapshot spec sounds more reasonable to me. Please let me know if anyone thinks otherwise or has any better approaches. |
|
Agree, seems simpler having tag on the Snapshot directly than Snapshot map on the table metadata. (if i understand the two choices right). It seems TableMetadata loads all snapshots anyway, so in any case there must be a reverse indexing to support Snapshot.tags() method |
rymurr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This on its own looks ok to me. Would it be possible to have a design doc or a few words added to the spec doc? I think it would be easier to discuss the feature with a bigger picture goal/plans.
Why add List<String> to the Snapshot? I've always viewed this object as immutable at creation time (maybe I am wrong?). Would it be easier to manage a Map<String, Integer> at the metadata level? This makes the add tag ops more like other metadata operations. And CRUD operations on tags clearer (to me at least).
I am curious about the user API, methods of creation, deletion, modification etc do you envision the modifications happening as a transaction or as a catalog mixin or somethign else? Similarly I am curious about the effect this has on the table maintenance operations. Could you say a little more on that (possibly in the design doc)
Finally, this is in effect creating CVS for the data lake in the sense that each table in your data lake is tracked individually. Have you considered doing this at the catalog level instead? Those of us who used CVS would thank you ;-)
| * @return tags associated with the snapshot, or null if no tags are associated with the snapshot. | ||
| */ | ||
| default List<String> tags() { | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why null and not an empty list?
|
@rymurr thanks for the feedback, let me start a design doc around this then. Yes Let's discuss this in more details in design doc then, I will put up something and send to dev list. |
|
I agree with much of what @rymurr said. I think it is good that snapshots are immutable. I think of the tags (and later, branches) feature as a way of referring to snapshots and keeping track of them, just like references in git. For the spec, what I've been thinking is to add a new map that tracks snapshots with names as keys and an object of data. Something like this: Things to note about this example:
|
|
@rdblue thanks for the suggestion! One thing around the immutable nature of snapshot I thought about is that, in encryption we still have the need to add the ability to rotate key. The encryption metadata or its location will likely be a part of snapshot which will change after rotation. This goes to the point I was discussing above, I think snapshot content is of course immutable, but I am not sure if we can continue to keep properties describing the snapshot immutable. For lifecycle, yes I was thinking about adding this as a part of a Tag object, but if we go with placing it in table metadata it will be the way you described. For JSON object instead of map, yes If it is in table metadata it will be an object similar to how we implement other fields. I was just trying to simplify the idea. For the branch example you gave, could you elaborate more about its semantics and usage? I don't see how it would be used without having a complete DAG of commits, and how its relationship would be with the Nessie project. |
| * Return the tags associated with the snapshot, or null if no tags are associated with the snapshot. | ||
| * @return tags associated with the snapshot, or null if no tags are associated with the snapshot. | ||
| */ | ||
| default List<String> tags() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about Set<String> as return type?
We have discussed about this in the past community meeting. Initial PR to support tagging snapshots, so that in the future we can time travel or expire snapshots by tag.
A tag will have a similar definition to the git commit tag, where:
This PR adds the tags interface in
Snapshotand addMap<String, Long> snapshotTagsinTableMetadatawith is indexed based on the snapshots list. Subsequent PRs will be publsihed to add features to update tags, do time travel and snapshot expiration by tag after we agree on the approach here.@rdblue @aokolnychyi @openinx @RussellSpitzer @rymurr @yyanyy