
Conversation

@laithalzyoud (Contributor) commented Mar 22, 2024

This PR adds a new Spark action to copy an Iceberg table. The action rewrites metadata files, manifest lists, and position delete files to reflect a change in location prefix, supporting operations such as migration to a new storage location or table duplication.

Here's a breakdown of what it does and how it works (a hypothetical usage sketch follows the list):

  • Initializes and validates input parameters, such as the source and target location prefixes and the start and end versions.
  • Executes the copy action, which involves rebuilding metadata to reflect the new locations, creating a staging area for the copied files, and generating lists of data and metadata files to move.
  • Rewrites metadata files, manifest files, manifest lists, and position delete files so that the paths inside them reflect the new location prefix.
  • Identifies which data files need to be moved based on the snapshots included between the specified start and end versions of the table.
  • Utilizes Spark to parallelize the rewriting of files, enabling efficient handling of large tables.
  • Supports Spark 3.3, 3.4, and 3.5.
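As a hypothetical usage sketch — the `copyTable`, `rewriteLocationPrefix`, and `stagingLocation` names below are assumptions drawn from this description, not a confirmed API, while `lastCopiedVersion` does appear in the diff:

```java
// Illustrative only: builder method names are assumptions, not the merged Iceberg API.
// Assumes `spark` (SparkSession) and `sourceTable` (an Iceberg Table) are in scope.
CopyTable.Result result =
    SparkActions.get(spark)
        .copyTable(sourceTable)
        .rewriteLocationPrefix("s3://old-bucket/warehouse/db/tbl", "s3://new-bucket/warehouse/db/tbl")
        .lastCopiedVersion("v1.metadata.json") // resume after a previously copied version
        .stagingLocation("s3://new-bucket/staging/db/tbl")
        .execute();

// The action rewrites metadata and emits lists of data/metadata files to move;
// the physical copy of those files is left to the caller.
```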

This PR extends @flyrain's original PR #4705.

@laithalzyoud changed the title from "Spark: Add CopyTable spark action" to "[Draft] Spark: Add CopyTable spark action" Mar 22, 2024
@manuzhang (Member)

How about position delete files?

@laithalzyoud (Contributor, Author)

> How about position delete files?

@manuzhang They are covered in this PR, let me add that to the description as well 👌

@laithalzyoud marked this pull request as draft March 22, 2024 15:18
@amogh-jahagirdar self-requested a review March 22, 2024 17:13
@laithalzyoud changed the title from "[Draft] Spark: Add CopyTable spark action" to "Spark: Add CopyTable spark action" Mar 25, 2024
@laithalzyoud marked this pull request as ready for review March 25, 2024 10:30
@nastra (Contributor) commented Mar 25, 2024

> Supports Spark 3.3, 3.4 and 3.5

We should probably first focus on a single Spark version (3.5) and once the PR is merged, backport the changes to previous Spark versions. Otherwise it will be difficult to review/change the same stuff across multiple Spark versions.

package org.apache.iceberg.actions;

public class BaseCopyTableActionResult implements CopyTable.Result {

this should probably be similar to how all the other Result classes are implemented, such as

@Value.Enclosing
@SuppressWarnings("ImmutablesStyle")
@Value.Style(
    typeImmutableEnclosing = "ImmutableDeleteReachableFiles",
    visibilityString = "PUBLIC",
    builderVisibilityString = "PUBLIC")
interface BaseDeleteReachableFiles extends DeleteReachableFiles {
  @Value.Immutable
  interface Result extends DeleteReachableFiles.Result {}
}
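Applied to this PR, that pattern might look like the following sketch (type names are illustrative, assuming the action interface is named `CopyTable`):

```java
@Value.Enclosing
@SuppressWarnings("ImmutablesStyle")
@Value.Style(
    typeImmutableEnclosing = "ImmutableCopyTable",
    visibilityString = "PUBLIC",
    builderVisibilityString = "PUBLIC")
interface BaseCopyTable extends CopyTable {
  @Value.Immutable
  interface Result extends CopyTable.Result {}
}
```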

.expireSnapshotId(sourceTable.currentSnapshot().parentId())
.execute();

AssertHelpers.assertThrows(

this is deprecated code. please use assertThatThrownBy(...).isInstanceOf(..).hasMessage(...)
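For example, a migration sketch against the assertion above — the expected exception type and message here are placeholders, not the test's actual values:

```java
// import static org.assertj.core.api.Assertions.assertThatThrownBy;
// actions() is a hypothetical test helper returning SparkActions.
assertThatThrownBy(
        () ->
            actions()
                .expireSnapshots(sourceTable)
                .expireSnapshotId(sourceTable.currentSnapshot().parentId())
                .execute())
    .isInstanceOf(IllegalArgumentException.class) // placeholder exception type
    .hasMessageContaining("snapshot");            // placeholder message fragment
```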


assertThat(count)
.as("The rebuilt metadata file number should be")
.isEqualTo(filesToMove.size());

actual/expected are wrong here. Should be assertThat(filesToMove).hasSize(count)

.as(Encoders.STRING())
.collectAsList();

assertThat(count).as("The rebuilt data file number should be").isEqualTo(filesToMove.size());

actual/expected are wrong here and should be the other way around

.as(Encoders.STRING())
.collectAsList();

assertThat(versionFileCount)

it seems a bunch of assertions have actual/expected in the wrong order. Please also update all the other places
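i.e., a corrected sketch for the snippets above:

```java
// The collection under test is the "actual"; the expected size comes second.
// (Cast count to int if it is a long.)
assertThat(filesToMove)
    .as("The rebuilt metadata file number should be")
    .hasSize(count);
```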

// Utility class
}

public static TableMetadata replacePaths(

please add a TestTableMetadataUtil with some tests where metadata/prefixes can be null/empty/invalid/valid
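A sketch of what such a test could cover — the parameter order of `replacePaths`, the expected exception type, and the fixture helper are all assumptions:

```java
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.iceberg.TableMetadata;
import org.junit.jupiter.api.Test;

public class TestTableMetadataUtil {

  @Test
  public void testNullMetadata() {
    // Assumption: invalid input is rejected with an IllegalArgumentException.
    assertThatThrownBy(() -> TableMetadataUtil.replacePaths(null, "s3://old/", "s3://new/"))
        .isInstanceOf(IllegalArgumentException.class);
  }

  @Test
  public void testValidPrefixes() {
    TableMetadata metadata = newTableMetadataFixture(); // hypothetical fixture helper
    TableMetadata rewritten = TableMetadataUtil.replacePaths(metadata, "s3://old/", "s3://new/");
    assertThat(rewritten.location()).startsWith("s3://new/");
  }
}
```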

}

@Override
public CopyTable lastCopiedVersion(String sVersion) {

it's not clear what sVersion refers to, so why not newStartVersion? Same for all the other params

@nastra (Contributor) commented Mar 25, 2024

@laithalzyoud thanks for working on this. I just did a very quick high-level review, but will do a more thorough one this week

@laithalzyoud (Contributor, Author)

> @laithalzyoud thanks for working on this. I just did a very quick high-level review, but will do a more thorough one this week

Thanks for taking a look @nastra. I'll stash the 3.3 and 3.4 implementations for now and move them to another PR after this one is merged, and I'll address the comments as well 👍

rewriteVersionFile(metadata, stagingPath);

List<MetadataLogEntry> versions = metadata.previousFiles();
for (int i = versions.size() - 1; i >= 0; i--) {


Could be rewritten as:

List<MetadataLogEntry> versions = Lists.reverse(metadata.previousFiles());
for (MetadataLogEntry version : versions) {
  if (version.file().equals(startVersion)) {
    break;
  }
}
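For reference, `Lists` here would be the Guava helper (relocated in Iceberg), and `reverse` returns a reversed view without copying the list:

```java
import java.util.List;
import org.apache.iceberg.TableMetadata.MetadataLogEntry;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

// Lists.reverse returns a reversed *view*; no copy of previousFiles() is made.
List<MetadataLogEntry> versions = Lists.reverse(metadata.previousFiles());
```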

@RussellSpitzer (Member)

@flyrain You should take a look at this as well

@amogh-jahagirdar (Contributor) left a comment

Thanks @laithalzyoud this is great to see! Beyond @nastra's point of doing Spark 3.5 separately, it would be ideal to have the CopyTable API changes be in a separate PR first.

Having the API changes be separate allows us to discuss things like semantics/expectations/preconditions of the API just so the community is on the same page as to what they can expect when working with this action and if the right options are exposed.

After that point, we can look at the implementation. One aspect of implementation that I think I'd also separate is the exposing of ManifestLists/ManifestReader. I totally get why that's required for this action but I think it's worth having that in a separate commit.

Lastly, this gets more into the API and implementation, but I think we should figure out whether there should be separate exposed "operations" for replicating a given manifest/manifest list rather than having it all embedded in the Spark procedure. Those operations can individually be used for reasons beyond copying for replication; they can be used for fixing corrupt metadata if the right APIs are exposed.
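To make that concrete, one purely hypothetical shape for such an operation — none of these names exist in Iceberg; only `InputFile`/`OutputFile` are the existing `org.apache.iceberg.io` types:

```java
// Hypothetical standalone operation the Spark action could compose;
// the interface and method names are invented for illustration.
interface RewriteManifestList {
  RewriteManifestList sourcePrefix(String sourcePrefix);
  RewriteManifestList targetPrefix(String targetPrefix);

  // Read the manifest list at `source`, rewrite embedded manifest paths
  // from sourcePrefix to targetPrefix, and write the result to `target`.
  void execute(InputFile source, OutputFile target);
}
```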

Comment on lines +346 to +358
try {
  dataFiles
      .repartition(1)
      .write()
      .mode(SaveMode.Overwrite)
      .format("text")
      .save(dataFileListPath);
} catch (Exception e) {
  throw new UnsupportedOperationException(
      "Failed to build the data files dataframe, the end version you are "
          + "trying to copy may contain invalid snapshots, please use a younger version "
          + "which doesn't have invalid snapshots",
      e);

This is something to discuss I think when we define the API semantics. I think it's a bit awkward to dump paths into our own "text manifest". Why shouldn't the action execute the copy of the actual data files (I mean the actual Parquet files)? There's a bunch of ways to do that and I think in Iceberg we should have the right interfaces and some basic implementations to facilitate that. Lmk if that makes sense.

@laithalzyoud (Contributor, Author)

If we want to support actually moving the files, then we will need to support different cloud providers (GCP, AWS, Azure) as well as on-prem setups (i.e., copying on Linux). It might not be ideal for some use cases either. For example, in our case, where we are using this in a production setting, copying with the GCS client libraries was extremely slow and inefficient for huge tables (>5 TB), so we instead used a managed GCP service to handle the move efficiently. The right mechanism is very specific to where the files are stored and where you want to move them; consider someone migrating between two different cloud providers, or moving data from on-premise to the cloud, and so on. So from my perspective it's better to just rewrite the paths so the table is usable in the new location and leave the actual copying to the users. Maybe in a future iteration a basic interface to move files for common use cases can be implemented.
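For illustration, a caller consuming the generated file list could drive the physical copy with whatever tool suits their storage. A minimal Hadoop `FileSystem` sketch, assuming one absolute path per line in the list (the list location and target-prefix swap below are assumptions; production setups may prefer bulk transfer services, as noted above):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Assumed output of the action: a plain-text file with one absolute path per line.
Configuration conf = new Configuration();
Path fileList = new Path("gs://new-bucket/staging/data-files-to-move/part-00000");
FileSystem listFs = fileList.getFileSystem(conf);

try (BufferedReader reader =
    new BufferedReader(new InputStreamReader(listFs.open(fileList)))) {
  String line;
  while ((line = reader.readLine()) != null) {
    Path source = new Path(line);
    // Derive the target by swapping the location prefix, mirroring the metadata rewrite.
    Path target = new Path(line.replace("gs://old-bucket/", "gs://new-bucket/"));
    FileUtil.copy(
        source.getFileSystem(conf), source,
        target.getFileSystem(conf), target,
        false /* keep the source files */, conf);
  }
}
```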

@flyrain (Contributor) commented Mar 28, 2024

Thanks @laithalzyoud for taking the lead on the copy table action. Agreed with @amogh-jahagirdar: can we separate the PR into an interface-only PR and implementation PRs? That way, we can get consensus on the interface first.

@ajantha-bhat (Member)

@laithalzyoud : Are you planning to address the comments on this? This feature is definitely useful.
If not, I would like to take it up.

@laithalzyoud (Contributor, Author)

> @laithalzyoud : Are you planning to address the comments on this? This feature is definitely useful. If not, I would like to take it up.

Hey @ajantha-bhat! Yes I'm planning to continue working on this soon, you can help in the code review if you'd like 👍

@huaxingao (Contributor)

@laithalzyoud Thanks for your work on this PR! I've noticed there hasn't been activity for a while, and I wanted to check if you're still able to continue working on it. If you're busy with other commitments and would like some help, I’d be glad to take over or assist. Thanks!

@moomindani

@laithalzyoud I am super interested in this PR and it will unblock many use cases. Are you working on this now?

@huaxingao (Contributor)

@laithalzyoud Thanks for the thumbs-up! Could you please confirm if you are planning to continue working on this PR, or would you like me to take over? I’m happy to help in any way needed. Thank you!

@laithalzyoud (Contributor, Author)

Hey @huaxingao! I'm planning to continue working on it starting this week. For now I'll close this PR and open a new one that just adds the interface. Once we agree on the interface, I'll create the implementation PR, as agreed earlier with @flyrain and @amogh-jahagirdar 👍

@laithalzyoud (Contributor, Author)

I created the PR that just adds the interface, please feel free to review it and provide feedback!

@loudwanderingdune

I see the interface PR #10920 has been merged. Is the implementation ready to be worked on?

@manuzhang (Member)

@loudwanderingdune This feature is already implemented and released. Please check out https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication.
