Skip to content

Conversation

@amogh-jahagirdar
Copy link
Contributor

Co-authored-by: @hameizi 1249369293@qq.com
Co-authored-by: @jackye1995 yzhaoqin@amazon.com

This is a continuation of the work already done by Jack and @hameizi in #3104. Some comments addressed include cleaning up the builder for SnapshotRef, adding the interface methods to get a snapshot ref by name, and validation that only the main branch snapshot reference can have a snapshot id of -1.

we will tackle updating snapshot expiration In subsequent PRs as outlined in this doc: https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit# .

The functions for handling table mutations and having a reference to the committed mutation will be in a follow on PR as well.

@jackye1995 jackye1995 changed the title Core: add snapshot reference to table metadata. Follow-up to https://github.com/apache/iceberg/pull/3104 Core: add snapshot reference to table metadata Jan 13, 2022
@amogh-jahagirdar amogh-jahagirdar force-pushed the snapshot-ref branch 6 times, most recently from 636e753 to d5377bb Compare January 19, 2022 03:52
@amogh-jahagirdar
Copy link
Contributor Author

Thanks for taking a look everyone, appreciate it! @rdblue @jackye1995 @RussellSpitzer please let me know if you have any other feedback.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly looks good to me, just nitpicking

}
/**
* Creates a ref builder from the given ref and its properties but the ref will now point to the given snapshotId.
* @param ref Ref to build from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline after javadoc description

* @param refName snapshot reference name
* @return the SnapShot ref with the given name if it exists, null otherwise.
*/
SnapshotRef ref(String refName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer simple variable name, name instead of refName since ref is implied from the method name.

}

@Override
public SnapshotRef ref(String refName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: name as suggested before

}

@Override
public SnapshotRef ref(String refName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: name as suggested before

IllegalArgumentException.class,
"Cannot find ref to remove",
() -> table.updateRefs().remove("tag"));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are missing a test for chaining multiple operations for tag/branch, like

table.updateRefs()
  .branch("b", 123)
  .setMinSnapshotsInBranch("b", 10000)
  .build()

to make sure chaining works as expected when creating new ref and set values at the same time.

@amogh-jahagirdar amogh-jahagirdar force-pushed the snapshot-ref branch 4 times, most recently from 3018410 to dde0ef8 Compare January 22, 2022 00:44
Followed up on comments on apache#3104
Remove unused interface method for getting at all snapshot refs for a given snapshot id. TableMetadata removeSnapshots accepts a list of snapshot IDs. It's up to callers to build a list of the expected snapshot IDs to remove.
Added validation that a -1 snapshot id can only happen for the main branch reference.

Co-authored-by: wangzeyu <1249369293@qq.com>
Co-authored-by: Jack Ye <yzhaoqin@amazon.com>
public UpdateSnapshotRefs tag(String name, long snapshotId) {
Preconditions.checkArgument(name != null, "Tag name must not be null");
Preconditions.checkArgument(base.snapshot(snapshotId) != null, "Cannot find snapshot with ID: %s", snapshotId);
Preconditions.checkArgument(!refs.containsKey(name), "Cannot tag snapshot, ref already exists: %s", name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a replaceTag method for this case? Otherwise, I'd have to remove a tag and re-add it, but I wouldn't really know whether the tag already existed or if it was a tag and not a branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the current API tries to have just create, remove and rename and ref property update APIs, because it's a bit unclear to me if we need a replace given that there is no such operation in schema and partition spec update.

The action replace indicates the tag already exists, and I imagine it should throw exception if the tag does not exist. In that case, user still need to first check existence and then decide to create or replace, and it has little difference from just updating properties of the ref.

If we allow replace to mean put that either create or update, then my concern is that it might cause unintented use that people might override a tag without knowing it already exists. The current API is basically trying to force a check on user side because there is not really a compare-and-swap of the specific ref that we could enforce as of today on server side and we have to replace the entire metadata file.

any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think more about what we want in this API vs in the other APIs. Probably worth thinking through the TableMetadata changes and API as well. It could be that we want to have this style API there instead of here and just keep this to high level operations.

public UpdateSnapshotRefs branch(String name, long snapshotId) {
Preconditions.checkArgument(name != null, "Branch name must not be null");
Preconditions.checkArgument(base.snapshot(snapshotId) != null, "Cannot find snapshot with ID: %s", snapshotId);
Preconditions.checkArgument(!refs.containsKey(name), "Cannot create branch, ref already exists: %s", name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this doesn't support updating a branch? Is the intent that committing to a branch is an operation handled elsewhere instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yes good point. Currently the expectation is that TableMetadata always updates the main branch to be the current snapshot ID so that operation is not needed. But it will cumbersome for a user to do remove and recreate the branch just to update the branch head. Do you think this is something we can add in another PR together with updates to the snapshot producers, or better to be out in the same one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about what the API for the TableMetadata builder should be. Then we can kind of work backwards from there. I'm wondering how we set the operating branch when committing, too. Right now we have a stageOnly method on SnapshotProducer. We could do something similar. Maybe update that to produce tagged snapshots?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the snapshot producer perspective, I was thinking 2 cases:

  1. Append to branch:
table.newRewrite()
            .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_B))
            .atBranch("beta")

In this case, I would say both use cases make sense:

  1. user might want to fail when committing to a non-existing branch, because that branch was deleted by someone else. This could be satisfied by the current API by removing old branch and update it with the same info, a bit cumbersome but still works.
  2. user might want to create the branch with the new snapshot ID, this could also be done by the current API by first checking branch existence and then do a create after snapshot ID is committed.
  1. Add tag:
table.newRewrite()
            .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_B))
            .tag("2022-01-01-compacted")

In this case, tagging with an existing name is likely a system issue as there should only be a single process doing the commit and tag it. This could be identified by the current API.

So from my perspective the current API is enough, although the branch interface is not straight forward for people to just update the branch head. I would imagine it to be a frequent operation, so might worth adding that.

Regarding the builder API, currently it's following the pattern of property update. I think set and remove would be beneficial for performing conditional update for specific refs and avoid full metadata check as we develop the rest catalog, do you think there is any other APIs for the builder that would be more suitable?

Copy link
Contributor

@rdblue rdblue Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you're right. There are use cases for both branching and tagging in the snapshot producer updates. Here's what I'm thinking for the API:

  • SnapshotProducer.stageOnly() - like the current behavior, creates a snapshot without updating a branch
  • SnapshotProducer.forBranch(String name) - commit changes to the given branch
    • Defaults to "main" branch if not called
    • Incompatible with stageOnly()
    • If the branch does not exist, it will be created from main
  • SnapshotProducer.withTag(String name) - tag the committed snapshot as name
    • May be used with stageOnly() or forBranch(...)

This behavior differs from what you were talking about because this will create the branch if it doesn't exist, which means that you wouldn't throw an exception if someone dropped a branch that a job was writing to. I think that's okay because writes are expensive. The best thing to do is to accept the new snapshot and correctly label it -- the user can always drop it easily if it is actually unwanted. Another reason to do this is to make it easy to create branches. You just write to a branch and it forks from master.

Here are a few examples of these:

// WAP commit with identification tag
table.newAppend()
    .appendFile(fileA)
    .stageOnly()
    .withTag(wapId)
    .commit();

// create or update "sql_refactor" branch
table.newReplacePartitions()
    .addFile(fileA)
    .forBranch("sql_refactor")
    .commit();

// tag the main branch and commit
table.newRewrite()
    .rewriteFiles(originals, replacements)
    .withTag("compact_15_of_16")
    .commit();

currentSnapshotId < 0 || snapshotsById.containsKey(currentSnapshotId),
"Invalid table metadata: Cannot find current version");


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary whitespace change.

}
}

class SetSnapshotRefs implements MetadataUpdate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this API should focus on updates to individual refs, rather than updating in bulk. What I added to the OpenAPI REST spec was SetSnapshotRefUpdate, which is a JSON object like this:

{
  "action": "set-snapshot-ref",
  "ref": "ref name",
  "type": "tag|branch",
  "snapshot-id": 50120357190375
}

These can optionally contain max-ref-age-ms, max-snapshot-age-ms, and min-snapshots-to-keep.

}
}

class RemoveSnapshotRefs implements MetadataUpdate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like RemoveSnapshot and SetSnapshotRef, I would probably make this remove a single ref.

return refs;
}

public Set<SnapshotRef> refs(long snapshotId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

public TableMetadata removeSnapshotsIf(Predicate<Snapshot> removeIf) {
List<Snapshot> toRemove = snapshots.stream().filter(removeIf).collect(Collectors.toList());
Set<Long> toRemove = snapshots.stream()
.filter(removeIf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix indentation. Continuing indentation is 4 spaces (2 indents).

return new Builder(this).removeSnapshots(toRemove).build();
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary whitespace change.

.build();
}

public TableMetadata replaceRefs(Map<String, SnapshotRef> newRefs) {
Copy link
Contributor

@rdblue rdblue Jan 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that replacing all refs is a good idea. In fact, I would probably avoid adding new mutation methods to TableMetadata in general and have you go directly to the TableMetadata.Builder API.

That API should allow setting and removing individual refs and it should reject removing the main ref. That way this class doesn't need to merge the main branch back into anything.

}
generator.writeEndArray();

generator.writeObjectFieldStart(REFS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be written before the list of snapshots.

.build();
}

public static void writeIntegerIfExists(String key, Integer value, JsonGenerator generator) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places with a similar pattern, we move the whole boolean to the caller so that this is a bit more flexible:

  JsonUtil.writeIntegerIf(v != null, "min-snapshots-to-keep", v, generator);

@rdblue
Copy link
Contributor

rdblue commented Jan 31, 2022

@amogh-jahagirdar, there are a lot of good changes in here, but I think we should break this up into smaller, more manageable commits:

  • Parser updates to read/write refs
  • Additions to the Table API
  • The ref management API
  • TableMetadata updates
  • Updates to SnapshotProducer API
  • Updates to MergingSnapshotProducer and FastAppend

Can you refactor the parser updates and API interface updates into separate PRs? In the meantime, I'll take a look at the table metadata updates.

@amogh-jahagirdar
Copy link
Contributor Author

@rdblue Sure, makes sense! I will break this up into separate PRs.

@nastra
Copy link
Contributor

nastra commented May 6, 2022

@amogh-jahagirdar would it make sense to link all of those smaller PRs here for a better overview?

@rdblue
Copy link
Contributor

rdblue commented Jun 5, 2022

I think most of this is in, so I'm going to close this one. Thanks, @amogh-jahagirdar!

@rdblue rdblue closed this Jun 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants