Spark: Add CopyTable spark action #10024
Conversation
How about position delete files?
@manuzhang They are covered in this PR, let me add that in the description as well 👌
We should probably first focus on a single Spark version (3.5) and, once the PR is merged, backport the changes to previous Spark versions. Otherwise it will be difficult to review/change the same stuff across multiple Spark versions.
 */
package org.apache.iceberg.actions;

public class BaseCopyTableActionResult implements CopyTable.Result {
This should probably be similar to how all the other Result classes are implemented, such as
iceberg/core/src/main/java/org/apache/iceberg/actions/BaseDeleteReachableFiles.java
Lines 23 to 33 in 82e0a56
@Value.Enclosing
@SuppressWarnings("ImmutablesStyle")
@Value.Style(
    typeImmutableEnclosing = "ImmutableDeleteReachableFiles",
    visibilityString = "PUBLIC",
    builderVisibilityString = "PUBLIC")
interface BaseDeleteReachableFiles extends DeleteReachableFiles {
  @Value.Immutable
  interface Result extends DeleteReachableFiles.Result {}
}
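Applied to this PR, a minimal sketch of that pattern might look like the following. This is an illustration only: the enclosing name ImmutableCopyTable and the shape of the interface are assumptions, standing in for the hand-written BaseCopyTableActionResult.

import org.immutables.value.Value;

// Sketch only: mirrors the BaseDeleteReachableFiles pattern quoted above,
// letting Immutables generate an ImmutableCopyTable.Result builder instead
// of the hand-written BaseCopyTableActionResult class.
@Value.Enclosing
@SuppressWarnings("ImmutablesStyle")
@Value.Style(
    typeImmutableEnclosing = "ImmutableCopyTable",
    visibilityString = "PUBLIC",
    builderVisibilityString = "PUBLIC")
interface BaseCopyTable extends CopyTable {
  @Value.Immutable
  interface Result extends CopyTable.Result {}
}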
    .expireSnapshotId(sourceTable.currentSnapshot().parentId())
    .execute();

AssertHelpers.assertThrows(
This is deprecated code. Please use assertThatThrownBy(...).isInstanceOf(...).hasMessage(...) instead.
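For example, the assertion above could become something like this. The lambda body, exception type, and message are placeholders, not the actual values asserted in this test:

import static org.assertj.core.api.Assertions.assertThatThrownBy;

// AssertJ replacement for the deprecated AssertHelpers.assertThrows;
// the action call and the expected message here are placeholders.
assertThatThrownBy(() -> actions.copyTable(sourceTable).execute())
    .isInstanceOf(IllegalArgumentException.class)
    .hasMessage("Cannot copy table: ...");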
assertThat(count)
    .as("The rebuilt metadata file number should be")
    .isEqualTo(filesToMove.size());
actual/expected are wrong here. Should be assertThat(filesToMove).hasSize(count)
    .as(Encoders.STRING())
    .collectAsList();

assertThat(count).as("The rebuilt data file number should be").isEqualTo(filesToMove.size());
actual/expected are wrong here and should be the other way around
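Swapped around, following the pattern suggested above and assuming count is an int (otherwise a cast is needed for hasSize):

// the collection under test is the actual value; the expected size is count
assertThat(filesToMove).as("The rebuilt data file number should be").hasSize(count);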
    .as(Encoders.STRING())
    .collectAsList();

assertThat(versionFileCount)
it seems a bunch of assertions have actual/expected in the wrong order. Please also update all the other places
  // Utility class
}

public static TableMetadata replacePaths(
Please add a TestTableMetadataUtil with some tests where metadata/prefixes can be null/empty/invalid/valid.
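A rough sketch of what such a test could cover. The replacePaths parameter list is cut off in the diff above, so a (metadata, sourcePrefix, targetPrefix) shape and fail-fast IllegalArgumentException behavior are assumptions here, not part of the PR:

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.SortOrder;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
import org.apache.iceberg.types.Types;
import org.junit.jupiter.api.Test;

public class TestTableMetadataUtil {

  private TableMetadata metadata() {
    // small single-column table fixture rooted at the old location
    Schema schema = new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));
    return TableMetadata.newTableMetadata(
        schema, PartitionSpec.unpartitioned(), SortOrder.unsorted(), "s3://old/table", ImmutableMap.of());
  }

  @Test
  public void nullPrefixIsRejected() {
    // assumed behavior: a null prefix should fail fast rather than NPE later
    assertThatThrownBy(() -> TableMetadataUtil.replacePaths(metadata(), null, "s3://new/table"))
        .isInstanceOf(IllegalArgumentException.class);
  }

  @Test
  public void validPrefixIsReplaced() {
    TableMetadata result = TableMetadataUtil.replacePaths(metadata(), "s3://old/table", "s3://new/table");
    assertThat(result.location()).isEqualTo("s3://new/table");
  }
}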
}

@Override
public CopyTable lastCopiedVersion(String sVersion) {
it's not clear what sVersion refers to, so why not newStartVersion? Same for all the other params
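For instance, something along these lines (the precondition body is illustrative, not from the PR; Preconditions refers to Iceberg's relocated Guava class):

@Override
public CopyTable lastCopiedVersion(String newStartVersion) {
  // a descriptive name makes it clear this is the metadata version copied last
  // time, i.e. where the incremental copy should start from
  Preconditions.checkArgument(
      newStartVersion != null && !newStartVersion.trim().isEmpty(),
      "Last copied version cannot be null or empty");
  this.startVersion = newStartVersion;
  return this;
}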
@laithalzyoud thanks for working on this. I just did a very quick high-level review, but will do a more thorough one this week
Thanks for taking a look @nastra. I'll stash the 3.3 and 3.4 implementations for now and move them to another PR after this one is merged, and I'll address the comments as well 👍
rewriteVersionFile(metadata, stagingPath);

List<MetadataLogEntry> versions = metadata.previousFiles();
for (int i = versions.size() - 1; i >= 0; i--) {
Could be rewritten as:

List<MetadataLogEntry> versions = Lists.reverse(metadata.previousFiles());
for (MetadataLogEntry version : versions) {
  if (version.file().equals(startVersion)) {
    break;
  }
}
@flyrain You should take a look at this as well
Thanks @laithalzyoud this is great to see! Beyond @nastra's point of doing Spark 3.5 separately, it would be ideal to have the CopyTable API changes be in a separate PR first.
Having the API changes be separate allows us to discuss things like semantics/expectations/preconditions of the API just so the community is on the same page as to what they can expect when working with this action and if the right options are exposed.
After that point, we can look at the implementation. One aspect of implementation that I think I'd also separate is the exposing of ManifestLists/ManifestReader. I totally get why that's required for this action but I think it's worth having that in a separate commit.
Lastly, this gets more into the API and implementation, but I think we should figure out whether there should be separate exposed "operations" for replicating a given manifest/manifest list rather than having it all embedded in the Spark procedure. Those operations can individually be used for reasons beyond copying for replication; they can be used for fixing corrupt metadata if the right APIs are exposed.
try {
  dataFiles
      .repartition(1)
      .write()
      .mode(SaveMode.Overwrite)
      .format("text")
      .save(dataFileListPath);
} catch (Exception e) {
  throw new UnsupportedOperationException(
      "Failed to build the data files dataframe, the end version you are "
          + "trying to copy may contain invalid snapshots, please use the younger version which doesn't have invalid "
          + "snapshots",
      e);
This is something to discuss I think when we define the API semantics. I think it's a bit awkward to dump paths into our own "text manifest". Why shouldn't the action execute the copy of the actual data files (I mean the actual Parquet files)? There's a bunch of ways to do that and I think in Iceberg we should have the right interfaces and some basic implementations to facilitate that. Lmk if that makes sense.
If we want to support actually moving the files, then we will need to support different cloud providers (GCP, AWS, Azure) as well as on-prem setups (i.e. copying on Linux). It might not be ideal for some use cases either. For example, in our case, where we are actually using this in a production setting, copying with the GCS client libraries was extremely slow and inefficient for huge tables (>5TB), so we instead used a managed GCP service to handle the move efficiently. So the right approach can be very specific to where the files are stored and where you want to move them; consider the use case of someone migrating between two different cloud providers, or moving data from on-premise to the cloud, and so on. From my perspective, it's better to just rewrite the paths so the table is usable in a new location and leave the actual copying to the users. Maybe in a future iteration a basic interface to move files for common use cases can be implemented, as sketched below.
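As a loose illustration of what such a pluggable interface might look like, entirely hypothetical and not part of this PR. The naive default uses Iceberg's FileIO, which does exist, but everything else (names, shape) is invented for the sake of the discussion:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

import org.apache.iceberg.io.FileIO;

// Hypothetical plugin point for the physical byte copy, so cloud-specific
// tooling or managed transfer services could be swapped in.
public interface FileCopyService {
  // copies the object at sourcePath to targetPath and returns the bytes copied
  long copy(String sourcePath, String targetPath);
}

// A naive default built on Iceberg's FileIO, suitable only for small tables.
class FileIOCopyService implements FileCopyService {
  private final FileIO io;

  FileIOCopyService(FileIO io) {
    this.io = io;
  }

  @Override
  public long copy(String sourcePath, String targetPath) {
    try (InputStream in = io.newInputFile(sourcePath).newStream();
        OutputStream out = io.newOutputFile(targetPath).create()) {
      return in.transferTo(out);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to copy " + sourcePath, e);
    }
  }
}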
Thanks @laithalzyoud for taking the lead on the copy table action. Agreed with @amogh-jahagirdar, can we separate the PR into an interface-only PR and implementation PRs? That way, we can get consensus on the interface first.
@laithalzyoud: Are you planning to address the comments on this? This feature is definitely useful.
Hey @ajantha-bhat! Yes, I'm planning to continue working on this soon; you can help with the code review if you'd like 👍
@laithalzyoud Thanks for your work on this PR! I've noticed there hasn't been activity for a while, and I wanted to check if you're still able to continue working on it. If you're busy with other commitments and would like some help, I’d be glad to take over or assist. Thanks!
@laithalzyoud I am super interested in this PR, and it will unblock many use cases. Are you working on this now?
@laithalzyoud Thanks for the thumbs-up! Could you please confirm if you are planning to continue working on this PR, or would you like me to take over? I’m happy to help in any way needed. Thank you!
Hey @huaxingao! I'm planning to continue working on it starting this week. For now I'll close this PR and open a new one that just adds the interface; once we agree on the interface, I'll create the implementation PR, as agreed earlier with @flyrain and @amogh-jahagirdar 👍
I created the PR to just add the interface, please feel free to review it and provide feedback!
I see the interface PR #10920 has been merged. Is the implementation ready to be worked on?
@loudwanderingdune This feature is already implemented and released. Please check out https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication.
This PR adds a new Spark action to copy an Iceberg table. The action rewrites metadata files, manifest lists, and position delete files to reflect a new location prefix, supporting operations like migration to a new storage location or table duplication.
Here's a breakdown of what it does and how it works:
This PR extends @flyrain's original PR #4705.
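For context, a hypothetical invocation of the action might look like the following. This is a sketch only: copyTable(...) on SparkActions and rewriteLocationPrefix(...) are assumed names based on this discussion, while lastCopiedVersion(...) appears in the diff above; none of this is the final API.

// Hypothetical sketch of invoking the proposed action to rewrite a table's
// location prefix; the actual file bytes would be copied separately.
CopyTable.Result result =
    SparkActions.get()
        .copyTable(sourceTable)
        .rewriteLocationPrefix("s3://old-bucket/db/tbl", "s3://new-bucket/db/tbl")
        .lastCopiedVersion("v2.metadata.json")
        .execute();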