Skip to content

Conversation

@jackye1995
Copy link
Contributor

We have discussed about this in the past community meeting. Initial PR to support tagging snapshots, so that in the future we can time travel or expire snapshots by tag.

A tag will have a similar definition to the git commit tag, where:

  1. tag name is globally unique
  2. a tag can only be associated with one snapshot
  3. a snapshot can have multiple different tags

This PR adds the tags interface in Snapshot and add Map<String, Long> snapshotTags in TableMetadata with is indexed based on the snapshots list. Subsequent PRs will be publsihed to add features to update tags, do time travel and snapshot expiration by tag after we agree on the approach here.

@rdblue @aokolnychyi @openinx @RussellSpitzer @rymurr @yyanyy

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice to see this get traction. It seems we would need to make a small update the spec.

Left some comments.

private final int defaultSortOrderId;
private final List<SortOrder> sortOrders;
private final Map<String, String> properties;
private final long currentSnapshotId;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle all the update snapshot calls? ie, replaceCurrentSnapshot, addStagedSnashot, etc.

@jackye1995
Copy link
Contributor Author

@szehon-ho @RussellSpitzer

Thanks for the timely review!

Around the documentation, because this will technically be v3 change, we need to add one more column for all tables in the spec doc, I will update that later.

For the update snapshot changes like replaceCurrentSnapshot, addStagedSnashot, I am not updating those in this PR and trying to only consolidate on the changes to the spec.

One potential alternative that I would like to just present is to directly have a Map<String, Long> snapshot tags map in the TableMetadata spec. The advantages are:

  1. we do not need to build the index which is beneficial when there are many snapshots in the table.
  2. updates to the tag only updates the tag map itself and do not need to rewrite the snapshot information.

The disadvantages are:

  1. we prefer to update individual components in the spec if possible instead of adding more top-level fields in the table metadata spec. This was the principle we followed when adding row identifier in schema instead of table metadata.
  2. not as easy to update the snapshots system table to add the tags field
  3. the index is already built anyway for snapshot id mapping, so the marginal cost for indexing tags is not too bad
  4. changes like snapshot expiration needs to also update the tags information

Based on these considerations I think adding the tags directly in the snapshot spec sounds more reasonable to me. Please let me know if anyone thinks otherwise or has any better approaches.

@szehon-ho
Copy link
Member

Agree, seems simpler having tag on the Snapshot directly than Snapshot map on the table metadata. (if i understand the two choices right). It seems TableMetadata loads all snapshots anyway, so in any case there must be a reverse indexing to support Snapshot.tags() method

Copy link
Contributor

@rymurr rymurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This on its own looks ok to me. Would it be possible to have a design doc or a few words added to the spec doc? I think it would be easier to discuss the feature with a bigger picture goal/plans.

Why add List<String> to the Snapshot? I've always viewed this object as immutable at creation time (maybe I am wrong?). Would it be easier to manage a Map<String, Integer> at the metadata level? This makes the add tag ops more like other metadata operations. And CRUD operations on tags clearer (to me at least).

I am curious about the user API, methods of creation, deletion, modification etc do you envision the modifications happening as a transaction or as a catalog mixin or somethign else? Similarly I am curious about the effect this has on the table maintenance operations. Could you say a little more on that (possibly in the design doc)

Finally, this is in effect creating CVS for the data lake in the sense that each table in your data lake is tracked individually. Have you considered doing this at the catalog level instead? Those of us who used CVS would thank you ;-)

* @return tags associated with the snapshot, or null if no tags are associated with the snapshot.
*/
default List<String> tags() {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why null and not an empty list?

@jackye1995
Copy link
Contributor Author

@rymurr thanks for the feedback, let me start a design doc around this then.

Yes Snapshot is immutable as of today, this is also something I am not sure if we would like to define in more granularity or not. When we say the snapshot is immutable, it is really the content, especially the manifest list that is immutable. But tagging is more about a description of a snapshot, which I am not sure if it would affect the immutable aspect of it.

Let's discuss this in more details in design doc then, I will put up something and send to dev list.

@rdblue
Copy link
Contributor

rdblue commented Aug 23, 2021

I agree with much of what @rymurr said. I think it is good that snapshots are immutable. I think of the tags (and later, branches) feature as a way of referring to snapshots and keeping track of them, just like references in git.

For the spec, what I've been thinking is to add a new map that tracks snapshots with names as keys and an object of data. Something like this:

  "named-snapshots": {
    "current": { "snapshot-id": 1234..., "type": "branch" },
    "dev": { "snapshot-id": 1235..., "type": "branch" },
    "monthly-2021-06-01": { "snapshot-id": 1233..., "type": "tag", "lifecycle": "snapshot" },
    "audit-2021-q2": { "snapshot-id": 1232..., "type": "tag", "lifecycle": "independent" }
  }

Things to note about this example:

  • The current version is a normal branch. We'd need to keep current-version-id until v3
  • The "type" determines what can be done. Branches can be updated just like "current" and tags are static
  • Tags need information about lifecycle. Some tags are just convenience names that should be removed when a snapshot is expired (lifecycle=snapshot) but others are snapshots that should be retained forever (lifecycle=independent)
  • Using a JSON object for the tag/branch metadata allows us to evolve it more easily, unlike map<string, long>

@jackye1995
Copy link
Contributor Author

@rdblue thanks for the suggestion!

One thing around the immutable nature of snapshot I thought about is that, in encryption we still have the need to add the ability to rotate key. The encryption metadata or its location will likely be a part of snapshot which will change after rotation. This goes to the point I was discussing above, I think snapshot content is of course immutable, but I am not sure if we can continue to keep properties describing the snapshot immutable.

For lifecycle, yes I was thinking about adding this as a part of a Tag object, but if we go with placing it in table metadata it will be the way you described.

For JSON object instead of map, yes If it is in table metadata it will be an object similar to how we implement other fields. I was just trying to simplify the idea.

For the branch example you gave, could you elaborate more about its semantics and usage? I don't see how it would be used without having a complete DAG of commits, and how its relationship would be with the Nessie project.

* Return the tags associated with the snapshot, or null if no tags are associated with the snapshot.
* @return tags associated with the snapshot, or null if no tags are associated with the snapshot.
*/
default List<String> tags() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about Set<String> as return type?

@jackye1995
Copy link
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants