-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: add snapshot reference to table metadata #3104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing the effort, and thanks for renaming the snapshot-annotation values to my suggestion on the doc, I think it sounds more reasonable. I am still not 100% sure it is the way to go (adding this extra state, vs exposing configs/lambda function on expireSnapshot method itself), but I'm ok if everyone agrees. Left some comments in a first pass
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object o) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I missed something, but what is different than default equals/hashcode?
| } | ||
| } | ||
|
|
||
| public static void toJson(SnapshotAnnotation annotation, JsonGenerator generator) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a better name/signature for this, toJson infers it returns a string (was confused when reading the caller code). Maybe writeSnapshotAnnotation(JsonGenerator generator, SnapshotAnnotation annotation) if not too redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also do we need make it public?
| return null; | ||
| } | ||
|
|
||
| Preconditions.checkArgument(node.has(property), "Cannot parse missing long %s", property); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No point anymore to this check anymore , right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct, updated
|
I think that a feature like this should be more understandable and straightforward. If our overall goal is to enable branching and tagging, then I think it makes the most sense to do that directly rather than introducing a similar concept like annotations. |
|
@rdblue @szehon-ho @nastra updated based on the last discussion, could you take a look? |
| // update the snapshot log | ||
| // update the snapshot log and refs | ||
| Set<Long> validIds = Sets.newHashSet(Iterables.transform(filtered, Snapshot::snapshotId)); | ||
| List<SnapshotReference> newRefs = refs.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my thought is that remove snapshot should work as usual without failing for snapshot with ref, and predicate should be formulated based on ref information.
| } | ||
|
|
||
| /** | ||
| * Returns the minimum number of snapshots to keep for a BRANCH, or null for a TAG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe minSnapshotsToKeep can be 0 instead of null for a TAG? Then you could use the primitive int type rather than Integer
| this.type = type; | ||
| } | ||
|
|
||
| public Builder withMinSnapshotsToKeep(Integer value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these two properties for use in "expire snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be used once we added it to spec
| private final Map<Integer, SortOrder> sortOrdersById; | ||
| private final List<HistoryEntry> snapshotLog; | ||
| private final List<MetadataLogEntry> previousFiles; | ||
| private final List<SnapshotReference> refs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where I am a little confused. Does every Metadata.json contain the full branch and snapshot references for all other branches and snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's the current plan, the metadata.json always contains all the snapshots and all references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do so, how are we gonna support snapshot history ? metadata tables per reference should show and use its own snapshots right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I think we should discuss this in Spec PR, (#3425)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking more on this, I think current snapshot will have parent id. so we can still form a history snapshots per branch just by keeping the current snapshot id. So, no need spec change.
But we have to change code for handling metadata tables per reference using above information.
|
Hey @jackye1995 I added a bunch of comments to the design doc. Do you consider this more up to date now? Should I commnet/review here instead? |
|
|
||
| private final TableOperations ops; | ||
| private final TableMetadata base; | ||
| private final Map<String, SnapshotRef> refs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike properties update, I just track the final result of refs map here, and we will not retry during commit phase to avoid concurrent updates of refs.
| * @param currentRefs current table refs list | ||
| * @return updated refs list | ||
| */ | ||
| private static Map<String, SnapshotRef> refsWithMainBranch(long snapshotId, Map<String, SnapshotRef> currentRefs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The snapshot ID here can be -1. I have been thinking if we should keep the main branch ref in case there is no main branch head after a replace operation. But there might be retention policy set at the main branch, and removing the main branch will result in deleting that information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about how awkward this is for a while. We may want to just create an empty snapshot when creating a table. That would clean up using -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes +1 for creating an empty snapshot when creating a table
|
@rdblue Is there any more concern about this PR? |
| private Long maxRefAgeMs; | ||
|
|
||
| Builder(SnapshotRefType type) { | ||
| ValidationException.check(type != null, "Snapshot reference type must not be null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think most builders use IllegalArgumentException to reject arguments that aren't allowed and ValidationException to catch inconsistencies. The ValidationException cases that are thrown in build are good examples of when a ValidationException is a good idea because, for example, minSnapshotsToKeep doesn't make sense with a tag.
| public SnapshotRef build() { | ||
| if (type.equals(SnapshotRefType.TAG)) { | ||
| ValidationException.check(minSnapshotsToKeep == null, | ||
| "TAG type snapshot reference does not support setting minSnapshotsToKeep"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use "Tag" instead of "TAG type snapshot reference"? I think people are going to be more familiar with the friendly names, "tag" and "branch", so we should use those rather than focusing on the generalized "reference".
| } else { | ||
| if (minSnapshotsToKeep != null) { | ||
| ValidationException.check(minSnapshotsToKeep > 0, | ||
| "Min snapshots to keep must be greater than 0"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be ValidationException.check(minSnapshotsToKeep == null || minSnapshotsToKeep > 0, ...) rather than using an if block.
| * @return a set of {@link SnapshotRef snapshot references} of a snapshot. | ||
| * Note that when there is no ref, it returns an empty set but not null. | ||
| */ | ||
| Set<SnapshotRef> refs(long snapshotId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have a method for returning refs by snapshot but not a ref by name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it's useful in subsequent PRs for updating the snapshot expiration logic, but we can remove this for now and just have a method for querying the snapshot of a specific ref name.
| * | ||
| * @return a new {@link UpdateSnapshotRefs} | ||
| */ | ||
| UpdateSnapshotRefs updateRefs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you think the other table mutations should handle refs? For example, an append might look like this:
table.newAppend()
.appendFile(someFile)
.commit();Should we add an optional branch to all of the SnapshotProducer operations?
table.newAppend()
.appendFile(someFile)
.branch("some_branch")
.commit();Curious what you've been thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I do have exactly the same thought, where append will go to the main branch by default and an override is allowed. I think a follow-up PR is needed for this feature.
| return maxRefAgeMs; | ||
| } | ||
|
|
||
| public static Builder builderForTag(long snapshotId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a slightly more direct name scheme, tagBuilder and branchBuilder?
|
|
||
| public static Builder builderFrom(SnapshotRef ref) { | ||
| return new Builder(ref.type()) | ||
| .snapshotId(ref.snapshotId()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use a builder method for snapshot ID when it is passed in through the factory method? What happens when you try to change the snapshot ID or set it to null? I think it may be simpler just to make this a private final field and remove the builder method for it.
|
|
||
| @Override | ||
| public Map<String, SnapshotRef> refs() { | ||
| return lazyTable().refs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to read the table metadata for this. I doubt serialized tables will be accessing refs.
| currentSnapshotId < 0 || snapshotsById.containsKey(currentSnapshotId), | ||
| "Invalid table metadata: Cannot find current version"); | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: unnecessary whitespace change.
| private final List<HistoryEntry> snapshotLog; | ||
| private final List<MetadataLogEntry> previousFiles; | ||
| private final Map<String, SnapshotRef> refs; | ||
| private final SetMultimap<Long, SnapshotRef> refsById; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why these are indexed by snapshot rather than name. Is that a common pattern?
| List<Snapshot> toRemove = snapshots.stream() | ||
| .filter(removeIf) | ||
| .filter(snapshot -> !refsById.containsKey(snapshot.snapshotId())) | ||
| .collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is probably time to remove this method, or at least to stop using it. Instead, I'd probably just produce a list of snapshot IDs to remove. That fits better with the new builder style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added #3900 to track this
| } | ||
|
|
||
| public TableMetadata replaceRefs(Map<String, SnapshotRef> newRefs) { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: unnecessary whitespace starting the function.
| private Map<String, SnapshotRef> validateAndCompleteRefs(Map<String, SnapshotRef> inputRefs) { | ||
| for (SnapshotRef ref : inputRefs.values()) { | ||
| Preconditions.checkArgument(snapshotsById.containsKey(ref.snapshotId()) || ref.snapshotId() == -1, | ||
| "Snapshot reference %s does not exist in the existing snapshots list", ref); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also want to validate that only main can be -1?
|
Hi all, I have been discussing snapshot lifecycle management with @jackye1995 and if possible would like to take over addressing any remaining comments for these changes as well as any other potential, relevant implementation improvements. @jackye1995 would this be possible? Would raise another PR and mark @jackye1995 and @hameizi 1249369293@qq.com as co-authors. |
|
Sure, I am a bit low in capacity for implementation recently, please feel free to open a new PR to push the progress of this PR forward and I will then close this one review it, thank you very much @amogh-jahagirdar ! |
…e metadata in apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id.
…e metadata in apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id.
…e metadata in apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id.
Followed up on comments on apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
|
@jackye1995 So we should focus on @amogh-jahagirdar s PR now? Just want to close out this one if that's the case. |
|
@RussellSpitzer yes, let me close this one in favor of #3883 |
Followed up on comments on apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com> Cleanup
Followed up on comments on apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Followed up on comments on apache#3104 Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove. Added validation that a -1 snapshot id can only happen for the main branch reference. Co-authored-by: wangzeyu <1249369293@qq.com> Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Co-authored-by: @hameizi 1249369293@qq.com
continuation of #2961 after some design discussion, still WIP but I'd like to first publish it to receive some feedback and continue with the effort.
https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit#
@rdblue @RussellSpitzer @szehon-ho @stevenzwu