Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented May 30, 2020

This replaces all calls to Snapshot.manifests with calls to one of 3 new methods:

  • Snapshot.allManifests returns both delete and data manifests
  • Snapshot.deleteManifests returns only delete manifests
  • Snapshot.dataManifests returns only data manifests

Existing references mostly use either allManifests or dataManifests, depending on the context. For example, tests with assertions for the number of manifests use allManifests because the test cases should validate there are no new delete manifests, but other tests that validate rewritten manifests are deleted use dataManifests because only data manifests are rewritten and deleted.

This tries to make minimal changes that preserve the current behavior. Operations are not updated to support delete manifests (rewrite still only rewrites data manifests), but will carry through the list of delete manifests correctly.

After this commit, we should be able to work in parallel on:

  • Job planning with delete manifests and delete files
  • Updating MergingSnapshotProducer to handle delete manifests
  • Updating rewrite actions to handle deletes

@rdblue
Copy link
Contributor Author

rdblue commented May 30, 2020

@prodeezy, @rymurr, here's another row-level delete PR.

@rdblue rdblue added this to the Row-level Delete milestone May 30, 2020
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.CharSequenceWrapper;

public class FileHistory {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed. It was used to find all manifest entries for a file to help debug changes. Now that can be handled in parallel by the all_entries metadata table.


// when snapshot is not null
CloseableIterable<ManifestEntry<DataFile>> entries = new ManifestGroup(ops.io(), snapshot.manifests())
CloseableIterable<ManifestEntry<DataFile>> entries = new ManifestGroup(ops.io(), snapshot.dataManifests())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract for this class is to return matching data files, not splits. We use it in some cases to avoid newScan that will return either FileScanTask or CombinedScanTask and will emit a scan event. I think we may be able to remove it, or update it to return both data and delete files. For now, this uses dataManifests to mimic the current behavior.

Set<ManifestFile> manifests = FluentIterable
.from(snapshots)
.transformAndConcat(s -> s.manifests())
.transformAndConcat(s -> s.dataManifests())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, scans continue to ignore delete manifests and delete files.

List<ManifestFile> manifests = Lists.newArrayList();
if (current != null) {
manifests.addAll(current.deleteManifests());
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carry forward delete manifests, if present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to the end of the manifest list.

@rdblue rdblue force-pushed the separate-snapshot-manifests branch from 411bd27 to 7f3dfce Compare May 30, 2020 00:53
Copy link
Contributor

@rymurr rymurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. In places I was a bit confused about the choice between reading/using dataManifest vs allManifest but I chalk that up to my level of knowledge of the code base.

@rdblue
Copy link
Contributor Author

rdblue commented Jun 1, 2020

Thanks for reviewing, @rymurr! Feel free to comment on the places that don't make sense. I might have gotten them wrong. And if not, I'll at least note why I think it should be all or just data.

Copy link
Contributor

@rymurr rymurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments @rdblue on places where things weren't entirely clear to me.

Expression rowFilter = joinFilters(filters);

Iterable<ManifestFile> manifests = table.currentSnapshot().manifests();
Iterable<ManifestFile> manifests = table.currentSnapshot().dataManifests();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this explicitly ignoring the effect of deleted rows on partition metrics or is it just that you are short circuiting any delete files (as we can't use them anyways)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole class will need to be updated to support detecting changes in delete files, so I just did the minimum to update it safely. Since it uses ManifestFiles.read later instead of ManifestFiles.readDeletes, it made sense to only use data manifests. This is one that we should plan to update in a separate PR, if anyone is using it.

*
* @return a list of ManifestFile
*/
List<ManifestFile> allManifests();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seem like a breaking change to the public API. Technically, the old description already assumed a list of all manifests. Do we think it is better to change it now to avoid confusion and misuse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is where we want to make the breaking changes. The format can't be read correctly without support for deletes, so to support v2 we will be making some breaking changes. If we do it with API changes then clients will need to update.

If we just return manifest files for deletes through the old manifests method, then older clients will appear to work until they encounter a delete manifest.

}

if (dataManifests == null) {
this.dataManifests = ImmutableList.copyOf(Iterables.filter(allManifests,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list of manifests can contain 1-2k entries, will iterating through that list twice here have an impact on performance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this would be pretty quick.

@Override
public List<ManifestFile> manifests() {
if (manifests == null) {
private void cacheManifests() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to reason about thread safety like we do in PartitionSpec, for example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The main problem in #219 was that the lazy value was set before it was completely initialized. Here, allManifests is set to a list that will not change.

In that PR, we also added double checking and a lock to prevent two threads from building the same value. We could do that here as well, but it doesn't affect correctness. There is just a race condition to produce the same sets of manifests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to prevent the race condition later on. I'll submit a PR.

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@aokolnychyi aokolnychyi merged commit 482f420 into apache:master Jun 2, 2020
@aokolnychyi
Copy link
Contributor

Thanks for the change, @rdblue! Thanks for reviewing, @rymurr!

cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020
…he#1080)

This replaces all calls to Snapshot.manifests with calls to one of 3 new methods:

* `Snapshot.allManifests` returns both delete and data manifests
* `Snapshot.deleteManifests` returns only delete manifests
* `Snapshot.dataManifests` returns only data manifests

Existing references mostly use either `allManifests` or `dataManifests`, depending on the context. For example, tests with assertions for the number of manifests use `allManifests` because the test cases should validate there are no new delete manifests, but other tests that validate rewritten manifests are deleted use `dataManifests` because only data manifests are rewritten and deleted.

This tries to make minimal changes that preserve the current behavior. Operations are not updated to support delete manifests (rewrite still only rewrites data manifests), but will carry through the list of delete manifests correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants