Spark: Implement an action to remove orphan files #894
Conversation
@Test
public void testAllValidFilesAreKept() throws IOException, InterruptedException {
Can you also add a test for the Write-Audit-Publish (WAP) workflow case, where a snapshot can be staged (using the cherry-pick operation) and is therefore not part of the list of active snapshots? The expected behavior is that this action should not delete those staged files as orphans.
There are tests in TestWapWorkflow that illustrate this case.
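A rough sketch of what such a test could look like (the fixture names — TABLES, SCHEMA, SPEC, tableLocation, spark, ThreeColumnRecord — and the Actions.forTable(...).removeOrphanFiles() entry point returning the deleted file locations are assumptions based on the surrounding test class and this PR's API, so the exact calls may differ):

```java
@Test
public void testWapFilesAreKept() throws InterruptedException {
  Table table = TABLES.create(SCHEMA, SPEC, tableLocation);
  table.updateProperties().set(TableProperties.WRITE_AUDIT_PUBLISH_ENABLED, "true").commit();

  List<ThreeColumnRecord> records = Lists.newArrayList(new ThreeColumnRecord(1, "AAAAAAAAAA", "AAAA"));
  Dataset<Row> df = spark.createDataFrame(records, ThreeColumnRecord.class).coalesce(1);

  // normal write
  df.write().format("iceberg").mode("append").save(tableLocation);

  // staged (WAP) write: the snapshot is cherry-pickable but not part of the active history
  spark.conf().set("spark.wap.id", "1");
  df.write().format("iceberg").mode("append").save(tableLocation);

  // wait so that the staged files are older than the removal threshold
  Thread.sleep(1000);

  List<String> result = Actions.forTable(table)
      .removeOrphanFiles()
      .olderThan(System.currentTimeMillis())
      .execute();

  Assert.assertTrue("Must not delete any files: staged WAP files are not orphans", result.isEmpty());
}
```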
Good idea, will add a case for that.
I agree it would be nice to have a test for it. This should work because the metadata tables used return all files reachable by metadata, not just the ones in a single snapshot. We use the same table for a similar check in our environment.
There is one case when all_data_files won't report WAP files: when there is no snapshot.
@Override
public CloseableIterable<FileScanTask> planFiles() {
  Snapshot snapshot = snapshot();
  if (snapshot != null) {
    LOG.info("Scanning table {} snapshot {} created at {} with filter {}", table,
        snapshot.snapshotId(), formatTimestampMillis(snapshot.timestampMillis()),
        rowFilter);
    Listeners.notifyAll(
        new ScanEvent(table.toString(), snapshot.snapshotId(), rowFilter, schema()));
    return planFiles(ops, snapshot, rowFilter, caseSensitive, colStats);
  } else {
    LOG.info("Scanning empty table {}", table);
    return CloseableIterable.empty();
  }
}
Also, table.currentSnapshot().manifestListLocation() in location() can lead to an NPE.
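For illustration only (this is not the PR's actual location() code), the kind of null guard that would avoid it:

```java
// sketch: only read the manifest list location if the table has a current snapshot
Snapshot currentSnapshot = table.currentSnapshot();
String manifestListLocation = currentSnapshot != null ? currentSnapshot.manifestListLocation() : null;
```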
As long as we have a current snapshot, it does work correctly. I've added a test.
I thought we addressed the metadata table problem when there is no current snapshot in #801. I'll check to see why that doesn't work. My initial guess is that PR refers to static tables and this is a parallel table.
It's just individual tables that were fixed. The solution is to override planFiles:
@Override
public CloseableIterable<FileScanTask> planFiles() {
  // override planFiles to avoid the check for a current snapshot because this metadata table is for all snapshots
  return CloseableIterable.withNoopClose(HistoryTable.this.task(this));
}
this.dataLocation = table.locationProvider().dataLocation();
}

public RemoveOrphanFilesAction allDataFilesTable(String newAllDataFilesTable)
This seems awkward, but I'm not sure of a better way to do it.
Any other ideas are more than welcome :)
In other places in Spark, we detect path tables by checking contains("/"). We could do that here to construct the metadata table names:
public String metadataTableName(Table table, String metaTable) {
  String tableName = table.toString();
  if (tableName.contains("/")) {
    return tableName + "#" + metaTable;
  } else {
    return tableName + "." + metaTable;
  }
}
I think that convention for naming isn't unreasonable considering how we do it by default for Spark.
We could also allow passing in BiFunction<String, String, String> metadataTableName that we just default to the implementation above.
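A quick sketch of that option (the field and setter names here are illustrative, not from the PR):

```java
// default naming convention: path-based tables use '#', catalog tables use '.'
private BiFunction<String, String, String> metadataTableName = (tableName, metaTable) ->
    tableName.contains("/") ? tableName + "#" + metaTable : tableName + "." + metaTable;

public RemoveOrphanFilesAction metadataTableName(BiFunction<String, String, String> newMetadataTableName) {
  this.metadataTableName = newMetadataTableName;
  return this;
}
```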
This almost works. Unfortunately, table.toString() might return anything. For example, it will prepend hive. for friendly names in the Hive catalog and we won't be able to resolve hive.db.table.all_data_files. Exposing a correct TableIdentifier in Table would help but that would mean modifying public BaseTable as well.
Okay, I thought this would work because it is what we do in our Spark 2.4 build. Since we are using DSv2 catalogs, the catalog adds its name to the identifier, so we actually get a working multi-catalog identifier.
Maybe we should add something to remove hive. for now, and take it out for Spark 3.0.
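Something along these lines for the interim workaround (a sketch; the method shape is illustrative):

```java
private String metadataTableName(String metaTable) {
  String tableName = table.toString();
  if (tableName.contains("/")) {
    // path-based tables use the '#' separator
    return tableName + "#" + metaTable;
  } else if (tableName.startsWith("hive.")) {
    // drop the "hive." prefix added for friendly names so that db.table.all_data_files resolves
    return tableName.replaceFirst("hive\\.", "") + "." + metaTable;
  } else {
    return tableName + "." + metaTable;
  }
}
```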
TestIcebergSourceHiveTables.currentIdentifier.name());
  return null;
});

public void dropTable() throws IOException {
We have to clean the location properly.
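For example, something like this in the teardown (a sketch; catalog, tableLocation, and hiveConf are assumed to come from the test setup):

```java
@After
public void dropTable() throws IOException {
  catalog.dropTable(TestIcebergSourceHiveTables.currentIdentifier);

  // also remove the files on disk so the next test starts from a clean location
  Path tablePath = new Path(tableLocation);
  FileSystem fs = tablePath.getFileSystem(hiveConf);
  if (fs.exists(tablePath)) {
    fs.delete(tablePath, true /* recursive */);
  }
}
```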
Thanks @aokolnychyi for the PR. Based on what I see in the PR, you are trying to clean up files that are not referenced by …

@mehtaashish23, this action should remove all orphan files, including those that we failed to delete while expiring snapshots. The …
package org.apache.iceberg;

public interface Action<R> {
Should we create an actions package?
otherMetadataFiles.add(ops.metadataFileLocation("version-hint.text"));
Good catch!
List<String> matchingTopLevelFiles = Lists.newArrayList();

try {
  Path path = new Path(location);
Now that this is the table's location, we expect it to contain just two directories: data and metadata. I think the intent of this method was to parallelize on the first level of partition directories, but that's not what will happen here.
It's a bit trickier because we don't know that the layout actually matches the default structure, but I think it would be reasonable to traverse the first 2 levels of directories to build the top-level set. To do that, it makes sense to add a depth parameter to the recursive traversal so you can use it here and return after 2 levels (or a configurable number). That would also be a good thing for the parallel traversal, to ensure it won't get caught in a symlink loop.
There is one more problem: the initial number of locations might be pretty small. It seems beneficial to list, for example, 3 levels on the driver and then parallelize. The number of top-level partitions might be small. At the same time, we should avoid listing too much on the driver if the data is written to the root table location. That's why I modified the listing logic so that we list 3 levels by default but don't list locations that have more than 10 sub-locations. The latter will be listed in a distributed manner. This should cover cases with a lot of top-level and leaf partitions.
Actually, ignore my previous comment. I'll think more about this.
I think it's fine to list a fixed-number of levels. If all the data is written to the root location, there's nothing we can do anyway because it can't be parallelized.
After thinking about this more, I think it still makes sense to list 2 levels and stop whenever we hit, let's say, 10 sub-locations. It won't solve the problem when the number of top-level partitions is small. However, it should help if the table is partitioned but the data is written to the table location. Consider tables that were migrated from Hive: if they have 1000 top-level partitions and 10 sub-partitions each, we would be listing 10000 locations on the driver just for the first two levels.
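Roughly what I have in mind, as a sketch (the helper name and parameters are illustrative, not the final implementation):

```java
private static void listDirRecursively(
    String dir, Predicate<FileStatus> predicate, Configuration conf, int maxDepth,
    int maxDirectSubDirs, List<String> remainingSubDirs, List<String> matchingFiles) {

  // stop descending once the depth limit is reached; the caller lists the leftovers in parallel
  if (maxDepth <= 0) {
    remainingSubDirs.add(dir);
    return;
  }

  try {
    Path path = new Path(dir);
    FileSystem fs = path.getFileSystem(conf);

    List<String> subDirs = Lists.newArrayList();
    for (FileStatus file : fs.listStatus(path)) {
      if (file.isDirectory()) {
        subDirs.add(file.getPath().toString());
      } else if (file.isFile() && predicate.test(file)) {
        matchingFiles.add(file.getPath().toString());
      }
    }

    // too many direct sub-locations: defer them instead of listing everything on the driver
    if (subDirs.size() > maxDirectSubDirs) {
      remainingSubDirs.addAll(subDirs);
      return;
    }

    for (String subDir : subDirs) {
      listDirRecursively(subDir, predicate, conf, maxDepth - 1, maxDirectSubDirs, remainingSubDirs, matchingFiles);
    }
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
}
```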
@rdblue, what do you think?
return (FlatMapFunction<Iterator<String>, String>) dirs -> {
  List<String> files = Lists.newArrayList();
  Predicate<FileStatus> predicate = file -> file.getModificationTime() < olderThanTimestamp;
Nit: predicate isn't a very descriptive name. Maybe pastOperationTimeLimit instead?
I kept the short name so that the line below stays on one line.
@aokolnychyi, this looks almost ready to me. The parallel file listing looks like it needs to be updated, and we need javadoc. Otherwise I think the other points can be done as follow-ups.
  return manifestDF.union(otherMetadataFileDF);
}

private Dataset<Row> buildActualFileDF() {
A comment here on the criteria for collecting all actual files would be helpful.
@prodeezy, do you mean the actual algorithm or that we select files older than a given timestamp?
private String location = null;
private long olderThanTimestamp = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
private Consumer<String> deleteFunc = new Consumer<String>() {
Nit: we should be able to write this as table.io()::deleteFile.
That's what I tried in the first place. Unfortunately, it complains with:
Variable 'table' might not have been initialized
I think we had the same problem in RemoveSnapshots.
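For reference, the shape of the workaround (a sketch): a method reference in the field initializer would read table (to evaluate table.io()) before the constructor assigns it, while the anonymous class defers that access until accept() runs.

```java
private Consumer<String> deleteFunc = new Consumer<String>() {
  @Override
  public void accept(String file) {
    // table.io() is resolved lazily here, after the constructor has assigned 'table'
    table.io().deleteFile(file);
  }
};
```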
That makes sense. Let's go with this then.
Predicate<FileStatus> predicate = file -> file.getModificationTime() < olderThanTimestamp;

int maxDepth = Integer.MAX_VALUE;
This seems excessive, but not really that dangerous. When listing in executors, the purpose is to exit even if there is a reference cycle in the file system. This would technically do that, but it would recurse 2 billion levels, so the more likely failure is a stack overflow.
That's alright since it's the behavior that was here before, but I think it would be better to set this to 2,000 or something large but reasonable, and then throw an exception if there are remaining directories when it returns.
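A minimal sketch of that guard, reusing the depth-limited listDirRecursively helper sketched earlier in this thread (the constant and the error message are illustrative):

```java
private static final int MAX_EXECUTOR_LIST_DEPTH = 2000;

private static List<String> listFilesOnExecutor(String dir, Predicate<FileStatus> predicate, Configuration conf) {
  List<String> matchingFiles = Lists.newArrayList();
  List<String> remainingSubDirs = Lists.newArrayList();

  listDirRecursively(dir, predicate, conf, MAX_EXECUTOR_LIST_DEPTH,
      Integer.MAX_VALUE /* no direct sub-dir limit on executors */, remainingSubDirs, matchingFiles);

  if (!remainingSubDirs.isEmpty()) {
    // hitting the limit on executors most likely indicates a cycle (e.g. symlinks) in the file system
    throw new RuntimeException("Could not list sub-directories, reached the maximum depth: " + MAX_EXECUTOR_LIST_DEPTH);
  }

  return matchingFiles;
}
```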
I'm merging this since it's large and the remaining comments are minor. That avoids needing to re-read the whole commit for small updates. Thanks for adding this, @aokolnychyi! I think it is going to be really useful.
This PR adds a Spark action that removes orphan data and metadata files that can be left behind in some edge cases, such as executor preemption.