
Conversation

@aokolnychyi (Contributor) commented Apr 6, 2020

This PR adds a Spark action that removes orphan data and metadata files that can be left behind in edge cases such as executor preemption.

}

@Test
public void testAllValidFilesAreKept() throws IOException, InterruptedException {

Contributor:

Can you also add a test for the Write-Audit-Publish (WAP) workflow case where a snapshot can be staged (using the cherry-pick operation) and is not part of the list of active snapshots? The expected behavior is that this action should not delete those staged files as orphan files.
There are tests in TestWapWorkflow that illustrate this case.
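
For illustration, such a test could stage a snapshot with the core API and assert that its files survive the action (a rough sketch; FILE_A, the table fixture, and the action builder are assumed helpers, not code from this PR):

// Sketch only: `table`, FILE_A and removeOrphanFilesAction() are assumed test fixtures.
table.newAppend()
    .appendFile(FILE_A)
    .stageOnly()       // stage the snapshot without making it current, as WAP does before cherry-pick
    .commit();

List<String> orphans = removeOrphanFilesAction()
    .olderThan(System.currentTimeMillis())
    .execute();

Assert.assertTrue("Files of staged snapshots must not be reported as orphans", orphans.isEmpty());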

Contributor Author:

Good idea, will add a case for that.

Contributor:

I agree it would be nice to have a test for it. This should work because the metadata tables used return all files reachable by metadata, not just the ones in a single snapshot. We use the same table for a similar check in our environment.
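
For reference, querying that metadata table from Spark 2.4 looks roughly like this (the table name is a placeholder):

// Sketch: the all_data_files metadata table lists data files reachable from any snapshot's metadata.
Dataset<Row> allDataFiles = spark.read()
    .format("iceberg")
    .load("db.table.all_data_files");
allDataFiles.select("file_path").show(false);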

@aokolnychyi (Contributor Author), Apr 8, 2020:

There is one case where all_data_files won't report WAP files: when the table has no current snapshot.

  @Override
  public CloseableIterable<FileScanTask> planFiles() {
    Snapshot snapshot = snapshot();
    if (snapshot != null) {
      LOG.info("Scanning table {} snapshot {} created at {} with filter {}", table,
          snapshot.snapshotId(), formatTimestampMillis(snapshot.timestampMillis()),
          rowFilter);

      Listeners.notifyAll(
          new ScanEvent(table.toString(), snapshot.snapshotId(), rowFilter, schema()));

      return planFiles(ops, snapshot, rowFilter, caseSensitive, colStats);

    } else {
      LOG.info("Scanning empty table {}", table);
      return CloseableIterable.empty();
    }
  } 

Contributor Author:

Also, table.currentSnapshot().manifestListLocation() in location() can lead to an NPE.
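
A minimal null-safe variant, just to illustrate (the surrounding location() code is assumed):

// Sketch: guard against tables that do not have a current snapshot yet.
Snapshot currentSnapshot = table.currentSnapshot();
String manifestListLocation = currentSnapshot != null ? currentSnapshot.manifestListLocation() : null;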

Contributor Author:

I've created #904 and #905 to address these later.

Contributor Author:

As long as we have a current snapshot, it does work correctly. I've added a test.

Contributor:

I thought we addressed the metadata table problem when there is no current snapshot in #801. I'll check to see why that doesn't work. My initial guess is that PR refers to static tables and this is a parallel table.

Contributor:

It's just individual tables that were fixed. The solution is to override planFiles:

    @Override
    public CloseableIterable<FileScanTask> planFiles() {
      // override planFiles to avoid the check for a current snapshot because this metadata table is for all snapshots
      return CloseableIterable.withNoopClose(HistoryTable.this.task(this));
    }

this.dataLocation = table.locationProvider().dataLocation();
}

public RemoveOrphanFilesAction allDataFilesTable(String newAllDataFilesTable) {

Contributor:

This seems awkward, but I'm not sure of a better way to do it.

Contributor Author:

Any other ideas are more than welcome :)

@rdblue (Contributor), Apr 7, 2020:

In other places in Spark, we detect path tables by checking contains("/"). We could do that here to construct the metadata table names:

public String metadataTableName(Table table, String metaTable) {
  String tableName = table.toString();
  if (tableName.contains("/")) {
    return tableName + "#" + metaTable;
  } else {
    return tableName + "." + metaTable;
  }
}

I think that naming convention isn't unreasonable, considering it's what we do by default for Spark.

We could also allow passing in a BiFunction<String, String, String> metadataTableName that defaults to the implementation above.
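
For example, roughly (field and method names here are hypothetical, not from this PR):

// Hypothetical wiring: default to the '/'-based convention above, but let callers override it.
private BiFunction<String, String, String> metadataTableNameFunc = (tableName, metaTable) ->
    tableName.contains("/") ? tableName + "#" + metaTable : tableName + "." + metaTable;

public RemoveOrphanFilesAction metadataTableName(BiFunction<String, String, String> func) {
  this.metadataTableNameFunc = func;
  return this;
}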

Contributor Author:

This almost works. Unfortunately, table.toString() might return anything. For example, it will prepend hive. to friendly names in the Hive catalog, and we won't be able to resolve hive.db.table.all_data_files. Exposing a correct TableIdentifier in Table would help, but that would mean modifying the public BaseTable as well.

Contributor:

Okay, I thought this would work because it is what we do in our Spark 2.4 build. Since we are using DSv2 catalogs, when the catalog adds its name to the identifier we actually get a working multi-catalog identifier.

Maybe we should add something to strip the hive. prefix for now, and take it out for Spark 3.0.
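
A possible stopgap, purely as a sketch:

// Sketch: drop the catalog prefix that Spark 2.4 can't resolve; take this workaround out for Spark 3.0.
String tableName = table.toString();
if (tableName.startsWith("hive.")) {
  tableName = tableName.replaceFirst("hive\\.", "");
}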

TestIcebergSourceHiveTables.currentIdentifier.name());
return null;
});
public void dropTable() throws IOException {

Contributor Author:

We have to clean up the location properly.

@mehtaashish23 (Contributor):

Thanks @aokolnychyi for the PR. Based on what I see in the PR, you are cleaning up files that are not referenced by the metadata files and the all_data_files table. I am going to verify this as I get to use this work in our project, but are the files that fail to get deleted during the cleanup after a commit also cleaned up by this work? My point is whether a data file that is deleted as part of an expired snapshot still belongs to the all_data_files table or not.

@aokolnychyi (Contributor Author):

@mehtaashish23, this action should remove all orphan files, including those that we failed to delete while expiring snapshots. The all_data_files metadata table gives us the files that are currently referenced in metadata.
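
For context, usage looks roughly like this (a sketch; the exact builder calls and return type may differ from the final API):

// Sketch: remove files under the table location that are not referenced by table metadata and are
// older than the given timestamp (the default is 3 days, to avoid racing with in-flight writes).
List<String> removedFiles = new RemoveOrphanFilesAction(spark, table)
    .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3))
    .execute();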


package org.apache.iceberg;

public interface Action<R> {

Contributor:

Should we create an actions package?

}
}

otherMetadataFiles.add(ops.metadataFileLocation("version-hint.text"));

Contributor:

Good catch!

List<String> matchingTopLevelFiles = Lists.newArrayList();

try {
Path path = new Path(location);

Contributor:

Now that this is the table's location, we expect this to contain just two directories: data and metadata. I think the intent of this method was to parallelize on the first level of partition directories, but that's not what will happen here.

It's a bit more tricky because we don't know that the layout actually matches the default structure, but I think it would be reasonable to traverse the first 2 levels of directories to build the top-level set. To do that, adding a depth parameter to the recursive traversal makes sense, so you can use it here and return after 2 levels (or a configurable number). That would also be good for the parallel traversal, to ensure it won't get caught in a symlink loop.

Contributor Author:

There is one more problem: the initial number of locations might be pretty small because the number of top-level partitions can be small. It seems beneficial to list, for example, 3 levels on the driver and then parallelize. At the same time, we should avoid listing too much on the driver if the data is written to the root table location. That's why I modified the listing logic so that we list 3 levels by default but don't list locations that have more than 10 sub-locations; those will be listed in a distributed manner. This should cover cases with a lot of top-level and leaf partitions.

Contributor Author:

Actually, ignore my previous comment. I'll think more about this.

Contributor:

I think it's fine to list a fixed number of levels. If all the data is written to the root location, there's nothing we can do anyway because it can't be parallelized.

Contributor Author:

After thinking about this more, I think it still makes sense to list 2 levels and stop whenever we hit, let's say, 10 sub-locations. It won't solve the problem when the number of top-level partitions is small. However, it should help if the table is partitioned but the data is written to the table location. Consider tables that were migrated from Hive: if they have 1000 top-level partitions and 10 sub-partitions each, we would be listing 10,000 locations on the driver just for the first two levels.
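
Roughly what I have in mind for the driver-side listing (a sketch; the exact signature and constants are up for discussion):

// Sketch: list up to `maxDepth` levels on the driver, but stop descending into directories that have
// more than `maxDirectSubDirs` children; anything not fully listed here goes into `remainingSubDirs`
// so the executors can list it in a distributed way.
private static void listDirRecursively(
    String dir, Predicate<FileStatus> predicate, Configuration conf,
    int maxDepth, int maxDirectSubDirs,
    List<String> remainingSubDirs, List<String> matchingFiles) {

  if (maxDepth <= 0) {
    remainingSubDirs.add(dir);  // too deep for the driver: defer to the distributed listing
    return;
  }

  try {
    Path path = new Path(dir);
    FileSystem fs = path.getFileSystem(conf);

    List<String> subDirs = Lists.newArrayList();
    for (FileStatus file : fs.listStatus(path)) {
      if (file.isDirectory()) {
        subDirs.add(file.getPath().toString());
      } else if (file.isFile() && predicate.test(file)) {
        matchingFiles.add(file.getPath().toString());
      }
    }

    if (subDirs.size() > maxDirectSubDirs) {
      remainingSubDirs.addAll(subDirs);  // too wide for the driver: defer to the distributed listing
      return;
    }

    for (String subDir : subDirs) {
      listDirRecursively(subDir, predicate, conf, maxDepth - 1, maxDirectSubDirs, remainingSubDirs, matchingFiles);
    }
  } catch (IOException e) {
    throw new RuntimeIOException(e);
  }
}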

Contributor Author:

@rdblue, what do you think?


return (FlatMapFunction<Iterator<String>, String>) dirs -> {
List<String> files = Lists.newArrayList();
Predicate<FileStatus> predicate = file -> file.getModificationTime() < olderThanTimestamp;

Contributor:

Nit: predicate isn't a very descriptive name. Maybe pastOperationTimeLimit instead?

Contributor Author:

I kept the shorter name so that the usage below stays on one line.

@rdblue (Contributor) commented Apr 9, 2020

@aokolnychyi, this looks almost ready to me. The parallel file listing looks like it needs to be updated, and we need javadoc. Otherwise I think the other points can be done as follow-ups.

@aokolnychyi changed the title from "[WIP] Spark: Implement an action to remove orphan files" to "Spark: Implement an action to remove orphan files" on Apr 10, 2020
return manifestDF.union(otherMetadataFileDF);
}

private Dataset<Row> buildActualFileDF() {

@prodeezy (Contributor), Apr 10, 2020:

A comment here on the criteria for collecting all actual files would be helpful.

Contributor Author:

@prodeezy, do you mean the actual algorithm or that we select files older than a given timestamp?


private String location = null;
private long olderThanTimestamp = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
private Consumer<String> deleteFunc = new Consumer<String>() {

Contributor:

Nit: we should be able to write this as table.io()::deleteFile.

@aokolnychyi (Contributor Author), Apr 10, 2020:

That's what I tried in the first place. Unfortunately, it complains with:

Variable 'table' might not have been initialized

The method reference evaluates table.io() when the field initializer runs, before the constructor has assigned table, whereas the anonymous class only reads table when accept() is invoked. I think we had the same problem in RemoveSnapshots.

Contributor:

That makes sense. Let's go with this then.


Predicate<FileStatus> predicate = file -> file.getModificationTime() < olderThanTimestamp;

int maxDepth = Integer.MAX_VALUE;

Contributor:

This seems excessive, but not really that dangerous. When listing in executors, the purpose is to exit even if there is a reference cycle in the file system. This would technically do that, but it would recurse 2 billion levels, so the more likely failure is a stack overflow.

That's alright since it's the behavior that was here before, but I think it would be better to set this to 2,000, or something large but reasonable, and then throw an exception if there are remaining directories when it returns.
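
Something along these lines, for example (a sketch that assumes the recursive listing helper takes a depth limit and collects unlisted directories):

// Sketch: cap the recursion in executors and fail loudly instead of silently skipping directories.
int maxDepth = 2000;
List<String> remainingDirs = Lists.newArrayList();
listDirRecursively(dir, predicate, conf, maxDepth, Integer.MAX_VALUE, remainingDirs, files);
if (!remainingDirs.isEmpty()) {
  throw new RuntimeException("Could not list all directories, reached max depth: " + maxDepth);
}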

@rdblue merged commit bedc9c7 into apache:master on Apr 10, 2020

@rdblue (Contributor) commented Apr 10, 2020

I'm merging this since it's large and the remaining comments are minor. That avoids needing to re-read the whole commit for small updates. Thanks for adding this, @aokolnychyi! I think it is going to be really useful.
