Skip to content

Conversation

@jackye1995
Copy link
Contributor

@jackye1995 jackye1995 commented Sep 13, 2021

Co-authored-by: @hameizi 1249369293@qq.com

continuation of #2961 after some design discussion, still WIP but I'd like to first publish it to receive some feedback and continue with the effort.

https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit#

@rdblue @RussellSpitzer @szehon-ho @stevenzwu

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing the effort, and thanks for renaming the snapshot-annotation values to my suggestion on the doc, I think it sounds more reasonable. I am still not 100% sure it is the way to go (adding this extra state, vs exposing configs/lambda function on expireSnapshot method itself), but I'm ok if everyone agrees. Left some comments in a first pass

}

@Override
public boolean equals(Object o) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed something, but what is different than default equals/hashcode?

}
}

public static void toJson(SnapshotAnnotation annotation, JsonGenerator generator) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a better name/signature for this, toJson infers it returns a string (was confused when reading the caller code). Maybe writeSnapshotAnnotation(JsonGenerator generator, SnapshotAnnotation annotation) if not too redundant

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also do we need make it public?

return null;
}

Preconditions.checkArgument(node.has(property), "Cannot parse missing long %s", property);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point anymore to this check anymore , right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, updated

@rdblue
Copy link
Contributor

rdblue commented Sep 17, 2021

I think that a feature like this should be more understandable and straightforward. If our overall goal is to enable branching and tagging, then I think it makes the most sense to do that directly rather than introducing a similar concept like annotations.

@jackye1995 jackye1995 changed the title Core: add snapshot annotation to table metadata (WIP) Core: add snapshot reference to table metadata Oct 27, 2021
@jackye1995
Copy link
Contributor Author

@rdblue @szehon-ho @nastra updated based on the last discussion, could you take a look?

// update the snapshot log
// update the snapshot log and refs
Set<Long> validIds = Sets.newHashSet(Iterables.transform(filtered, Snapshot::snapshotId));
List<SnapshotReference> newRefs = refs.stream()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my thought is that remove snapshot should work as usual without failing for snapshot with ref, and predicate should be formulated based on ref information.

}

/**
* Returns the minimum number of snapshots to keep for a BRANCH, or null for a TAG
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe minSnapshotsToKeep can be 0 instead of null for a TAG? Then you could use the primitive int type rather than Integer

this.type = type;
}

public Builder withMinSnapshotsToKeep(Integer value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these two properties for use in "expire snapshots?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be used once we added it to spec

private final Map<Integer, SortOrder> sortOrdersById;
private final List<HistoryEntry> snapshotLog;
private final List<MetadataLogEntry> previousFiles;
private final List<SnapshotReference> refs;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where I am a little confused. Does every Metadata.json contain the full branch and snapshot references for all other branches and snapshots?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the current plan, the metadata.json always contains all the snapshots and all references.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do so, how are we gonna support snapshot history ? metadata tables per reference should show and use its own snapshots right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I think we should discuss this in Spec PR, (#3425)

Copy link
Member

@ajantha-bhat ajantha-bhat Nov 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking more on this, I think current snapshot will have parent id. so we can still form a history snapshots per branch just by keeping the current snapshot id. So, no need spec change.

But we have to change code for handling metadata tables per reference using above information.

@rymurr
Copy link
Contributor

rymurr commented Nov 17, 2021

Hey @jackye1995 I added a bunch of comments to the design doc. Do you consider this more up to date now? Should I commnet/review here instead?

@jackye1995
Copy link
Contributor Author

I have updated this PR based on the TableMetadata builder refactoring. it is quite hard to separate the update logic out of this, so I just included the changes in #3612 and marked @hameizi as coauthor.


private final TableOperations ops;
private final TableMetadata base;
private final Map<String, SnapshotRef> refs;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike properties update, I just track the final result of refs map here, and we will not retry during commit phase to avoid concurrent updates of refs.

* @param currentRefs current table refs list
* @return updated refs list
*/
private static Map<String, SnapshotRef> refsWithMainBranch(long snapshotId, Map<String, SnapshotRef> currentRefs) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot ID here can be -1. I have been thinking if we should keep the main branch ref in case there is no main branch head after a replace operation. But there might be retention policy set at the main branch, and removing the main branch will result in deleting that information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about how awkward this is for a while. We may want to just create an empty snapshot when creating a table. That would clean up using -1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes +1 for creating an empty snapshot when creating a table

@jackye1995 jackye1995 requested a review from rdblue December 13, 2021 19:50
@hameizi
Copy link
Contributor

hameizi commented Dec 21, 2021

@rdblue Is there any more concern about this PR?

private Long maxRefAgeMs;

Builder(SnapshotRefType type) {
ValidationException.check(type != null, "Snapshot reference type must not be null");
Copy link
Contributor

@rdblue rdblue Jan 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most builders use IllegalArgumentException to reject arguments that aren't allowed and ValidationException to catch inconsistencies. The ValidationException cases that are thrown in build are good examples of when a ValidationException is a good idea because, for example, minSnapshotsToKeep doesn't make sense with a tag.

public SnapshotRef build() {
if (type.equals(SnapshotRefType.TAG)) {
ValidationException.check(minSnapshotsToKeep == null,
"TAG type snapshot reference does not support setting minSnapshotsToKeep");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use "Tag" instead of "TAG type snapshot reference"? I think people are going to be more familiar with the friendly names, "tag" and "branch", so we should use those rather than focusing on the generalized "reference".

} else {
if (minSnapshotsToKeep != null) {
ValidationException.check(minSnapshotsToKeep > 0,
"Min snapshots to keep must be greater than 0");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be ValidationException.check(minSnapshotsToKeep == null || minSnapshotsToKeep > 0, ...) rather than using an if block.

* @return a set of {@link SnapshotRef snapshot references} of a snapshot.
* Note that when there is no ref, it returns an empty set but not null.
*/
Set<SnapshotRef> refs(long snapshotId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have a method for returning refs by snapshot but not a ref by name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it's useful in subsequent PRs for updating the snapshot expiration logic, but we can remove this for now and just have a method for querying the snapshot of a specific ref name.

*
* @return a new {@link UpdateSnapshotRefs}
*/
UpdateSnapshotRefs updateRefs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think the other table mutations should handle refs? For example, an append might look like this:

table.newAppend()
    .appendFile(someFile)
    .commit();

Should we add an optional branch to all of the SnapshotProducer operations?

table.newAppend()
    .appendFile(someFile)
    .branch("some_branch")
    .commit();

Curious what you've been thinking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I do have exactly the same thought, where append will go to the main branch by default and an override is allowed. I think a follow-up PR is needed for this feature.

return maxRefAgeMs;
}

public static Builder builderForTag(long snapshotId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a slightly more direct name scheme, tagBuilder and branchBuilder?


public static Builder builderFrom(SnapshotRef ref) {
return new Builder(ref.type())
.snapshotId(ref.snapshotId())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use a builder method for snapshot ID when it is passed in through the factory method? What happens when you try to change the snapshot ID or set it to null? I think it may be simpler just to make this a private final field and remove the builder method for it.


@Override
public Map<String, SnapshotRef> refs() {
return lazyTable().refs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to read the table metadata for this. I doubt serialized tables will be accessing refs.

currentSnapshotId < 0 || snapshotsById.containsKey(currentSnapshotId),
"Invalid table metadata: Cannot find current version");


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary whitespace change.

private final List<HistoryEntry> snapshotLog;
private final List<MetadataLogEntry> previousFiles;
private final Map<String, SnapshotRef> refs;
private final SetMultimap<Long, SnapshotRef> refsById;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why these are indexed by snapshot rather than name. Is that a common pattern?

List<Snapshot> toRemove = snapshots.stream()
.filter(removeIf)
.filter(snapshot -> !refsById.containsKey(snapshot.snapshotId()))
.collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is probably time to remove this method, or at least to stop using it. Instead, I'd probably just produce a list of snapshot IDs to remove. That fits better with the new builder style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added #3900 to track this

}

public TableMetadata replaceRefs(Map<String, SnapshotRef> newRefs) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary whitespace starting the function.

private Map<String, SnapshotRef> validateAndCompleteRefs(Map<String, SnapshotRef> inputRefs) {
for (SnapshotRef ref : inputRefs.values()) {
Preconditions.checkArgument(snapshotsById.containsKey(ref.snapshotId()) || ref.snapshotId() == -1,
"Snapshot reference %s does not exist in the existing snapshots list", ref);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to validate that only main can be -1?

@amogh-jahagirdar
Copy link
Contributor

Hi all, I have been discussing snapshot lifecycle management with @jackye1995 and if possible would like to take over addressing any remaining comments for these changes as well as any other potential, relevant implementation improvements. @jackye1995 would this be possible? Would raise another PR and mark @jackye1995 and @hameizi 1249369293@qq.com as co-authors.

@jackye1995
Copy link
Contributor Author

Sure, I am a bit low in capacity for implementation recently, please feel free to open a new PR to push the progress of this PR forward and I will then close this one review it, thank you very much @amogh-jahagirdar !

amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 12, 2022
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 12, 2022
…e metadata in apache#3104

Remove unused interface method for getting all snapshot refs for a given snapshot id.
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 12, 2022
…e metadata in apache#3104

Remove unused interface method for getting all snapshot refs for a given snapshot id.
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 12, 2022
…e metadata in apache#3104

Remove unused interface method for getting all snapshot refs for a given snapshot id.
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 12, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
@RussellSpitzer
Copy link
Member

@jackye1995 So we should focus on @amogh-jahagirdar s PR now? Just want to close out this one if that's the case.

@jackye1995
Copy link
Contributor Author

@RussellSpitzer yes, let me close this one in favor of #3883

@jackye1995 jackye1995 closed this Jan 13, 2022
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 14, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>

Cleanup
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 19, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 19, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 19, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 19, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 19, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 19, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 22, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 22, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 22, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 22, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
amogh-jahagirdar added a commit to amogh-jahagirdar/iceberg that referenced this pull request Jan 22, 2022
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants