-
Notifications
You must be signed in to change notification settings - Fork 3k
Split Snapshot.manifests into dataManifests and deleteManifests #1080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| import org.apache.iceberg.types.Types; | ||
| import org.apache.iceberg.util.CharSequenceWrapper; | ||
|
|
||
| public class FileHistory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed. It was used to find all manifest entries for a file to help debug changes. Now that can be handled in parallel by the all_entries metadata table.
|
|
||
| // when snapshot is not null | ||
| CloseableIterable<ManifestEntry<DataFile>> entries = new ManifestGroup(ops.io(), snapshot.manifests()) | ||
| CloseableIterable<ManifestEntry<DataFile>> entries = new ManifestGroup(ops.io(), snapshot.dataManifests()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contract for this class is to return matching data files, not splits. We use it in some cases to avoid newScan that will return either FileScanTask or CombinedScanTask and will emit a scan event. I think we may be able to remove it, or update it to return both data and delete files. For now, this uses dataManifests to mimic the current behavior.
| Set<ManifestFile> manifests = FluentIterable | ||
| .from(snapshots) | ||
| .transformAndConcat(s -> s.manifests()) | ||
| .transformAndConcat(s -> s.dataManifests()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, scans continue to ignore delete manifests and delete files.
| List<ManifestFile> manifests = Lists.newArrayList(); | ||
| if (current != null) { | ||
| manifests.addAll(current.deleteManifests()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carry forward delete manifests, if present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to the end of the manifest list.
411bd27 to
7f3dfce
Compare
rymurr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. In places I was a bit confused about the choice between reading/using dataManifest vs allManifest but I chalk that up to my level of knowledge of the code base.
|
Thanks for reviewing, @rymurr! Feel free to comment on the places that don't make sense. I might have gotten them wrong. And if not, I'll at least note why I think it should be all or just data. |
rymurr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments @rdblue on places where things weren't entirely clear to me.
| Expression rowFilter = joinFilters(filters); | ||
|
|
||
| Iterable<ManifestFile> manifests = table.currentSnapshot().manifests(); | ||
| Iterable<ManifestFile> manifests = table.currentSnapshot().dataManifests(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this explicitly ignoring the effect of deleted rows on partition metrics or is it just that you are short circuiting any delete files (as we can't use them anyways)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole class will need to be updated to support detecting changes in delete files, so I just did the minimum to update it safely. Since it uses ManifestFiles.read later instead of ManifestFiles.readDeletes, it made sense to only use data manifests. This is one that we should plan to update in a separate PR, if anyone is using it.
core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java
Outdated
Show resolved
Hide resolved
| * | ||
| * @return a list of ManifestFile | ||
| */ | ||
| List<ManifestFile> allManifests(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seem like a breaking change to the public API. Technically, the old description already assumed a list of all manifests. Do we think it is better to change it now to avoid confusion and misuse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is where we want to make the breaking changes. The format can't be read correctly without support for deletes, so to support v2 we will be making some breaking changes. If we do it with API changes then clients will need to update.
If we just return manifest files for deletes through the old manifests method, then older clients will appear to work until they encounter a delete manifest.
| } | ||
|
|
||
| if (dataManifests == null) { | ||
| this.dataManifests = ImmutableList.copyOf(Iterables.filter(allManifests, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list of manifests can contain 1-2k entries, will iterating through that list twice here have an impact on performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this would be pretty quick.
| @Override | ||
| public List<ManifestFile> manifests() { | ||
| if (manifests == null) { | ||
| private void cacheManifests() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to reason about thread safety like we do in PartitionSpec, for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The main problem in #219 was that the lazy value was set before it was completely initialized. Here, allManifests is set to a list that will not change.
In that PR, we also added double checking and a lock to prevent two threads from building the same value. We could do that here as well, but it doesn't affect correctness. There is just a race condition to produce the same sets of manifests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to prevent the race condition later on. I'll submit a PR.
aokolnychyi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
…he#1080) This replaces all calls to Snapshot.manifests with calls to one of 3 new methods: * `Snapshot.allManifests` returns both delete and data manifests * `Snapshot.deleteManifests` returns only delete manifests * `Snapshot.dataManifests` returns only data manifests Existing references mostly use either `allManifests` or `dataManifests`, depending on the context. For example, tests with assertions for the number of manifests use `allManifests` because the test cases should validate there are no new delete manifests, but other tests that validate rewritten manifests are deleted use `dataManifests` because only data manifests are rewritten and deleted. This tries to make minimal changes that preserve the current behavior. Operations are not updated to support delete manifests (rewrite still only rewrites data manifests), but will carry through the list of delete manifests correctly.
This replaces all calls to Snapshot.manifests with calls to one of 3 new methods:
Snapshot.allManifestsreturns both delete and data manifestsSnapshot.deleteManifestsreturns only delete manifestsSnapshot.dataManifestsreturns only data manifestsExisting references mostly use either
allManifestsordataManifests, depending on the context. For example, tests with assertions for the number of manifests useallManifestsbecause the test cases should validate there are no new delete manifests, but other tests that validate rewritten manifests are deleted usedataManifestsbecause only data manifests are rewritten and deleted.This tries to make minimal changes that preserve the current behavior. Operations are not updated to support delete manifests (rewrite still only rewrites data manifests), but will carry through the list of delete manifests correctly.
After this commit, we should be able to work in parallel on:
MergingSnapshotProducerto handle delete manifests