Spark: Add 'skip_file_list' option to RewriteTablePathProcedure for optional file-list generation #12844

slfan1989 · 2025-04-18T23:48:27Z

This is a minor feature improvement. The background is that we are using RewriteTablePathProcedure to convert Hive tables to Iceberg tables, as detailed in #12762. RewriteTablePathProcedure generates a file-list file, and I need to manually clean up this file after each conversion. I understand that the file-list is mainly used to check data integrity, but since it is not essential for metadata, I believe allowing users to decide whether to generate this file would offer greater flexibility.

szehon-ho · 2025-04-19T05:18:25Z

Interesting, is that all you need to do for a Hive -> Iceberg conversion? Seems simple and makes sense to me. cc @flyrain @dramaticlly for any thoughts

slfan1989 · 2025-04-19T06:09:51Z


@szehon-ho Thank you for your reply! The sample code (#12769) I provided is consistent with the code in our production environment. We have successfully used this method to batch convert over 80 Hive tables to Iceberg tables. The migrated tables behave as expected, with both read and write tasks running normally. This feature relies on the community's existing Snapshot and RewriteTablePath functions, which offer good stability. Additionally, adjusting some Spark parameters enables fast migration of large tables.

szehon-ho

Nice. As I said, I'm OK with the idea; I left some preliminary comments. But let's wait to see what the others say as well.

szehon-ho · 2025-04-19T06:16:31Z

api/src/main/java/org/apache/iceberg/actions/RewriteTablePath.java

+   *     false to not skip.
+   * @return this for method chaining
+   */
+  RewriteTablePath skipFileList(boolean skipFileList);


We should add a default implementation to avoid a breaking change.

Thank you for your suggestion! I have added a default implementation.
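
The compatibility concern raised above can be illustrated with a minimal sketch. The type names here (`PathRewriteAction`, `LegacyAction`, `UpdatedAction`) are hypothetical stand-ins rather than Iceberg's types, and having the default throw `UnsupportedOperationException` is an assumption; the thread does not show the body that was actually added.

```java
// Hypothetical stand-in for an action interface that gains a new method.
interface PathRewriteAction {
  PathRewriteAction stagingLocation(String location);

  // Adding this as a default method means existing implementers still compile.
  // Throwing here is an assumption; the PR's actual default body is not shown in the thread.
  default PathRewriteAction skipFileList(boolean skipFileList) {
    throw new UnsupportedOperationException(
        getClass().getName() + " does not support skipFileList");
  }
}

// An implementer written before the new method existed; it needs no changes.
class LegacyAction implements PathRewriteAction {
  String staging;

  @Override
  public PathRewriteAction stagingLocation(String location) {
    this.staging = location;
    return this;
  }
}

// An implementer that opts in by overriding the default.
class UpdatedAction extends LegacyAction {
  boolean skipFileList;

  @Override
  public PathRewriteAction skipFileList(boolean skip) {
    this.skipFileList = skip;
    return this;
  }
}
```

Without the `default` keyword, `LegacyAction` would fail to compile as soon as the interface changed, which is the breaking change the review comment warns about.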

szehon-ho · 2025-04-19T06:17:58Z

...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java


+    // skip file list
+    if (skipFileList) {
+      return "skip-file-list";


Just thinking out loud: would 'null' be better?

I believe using null is fine in principle, but during unit testing I found that I cannot use null directly because of a non-null validation. I therefore considered an empty string or the string 'null' as alternatives. In this version, I opted for the empty string. Is this approach appropriate?
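
The trade-off described above can be sketched as follows. Everything here is illustrative: `ResultSketch`, `resolveFileListLocation`, and the `file-list` name under the staging directory are assumptions, not the PR's code. The sketch shows why a non-null check on the result rules out `null`, and how an empty string can act as the "no file list" marker instead.

```java
import java.util.Objects;

// Hypothetical result holder that, like the validation mentioned above, rejects null.
final class ResultSketch {
  private final String fileListLocation;

  ResultSketch(String fileListLocation) {
    this.fileListLocation =
        Objects.requireNonNull(fileListLocation, "fileListLocation must not be null");
  }

  String fileListLocation() {
    return fileListLocation;
  }

  // Callers test the empty-string sentinel instead of comparing against null.
  boolean hasFileList() {
    return !fileListLocation.isEmpty();
  }

  // Assumed naming: the file list lives at <stagingDir>/file-list when it is saved.
  static String resolveFileListLocation(boolean skipFileList, String stagingDir) {
    return skipFileList ? "" : stagingDir + "/file-list";
  }
}
```

A sentinel such as the earlier `"skip-file-list"` string would also pass the non-null check, but callers could mistake it for a real path; the empty string is at least unambiguous as "nothing was written".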

dramaticlly · 2025-04-21T17:02:15Z


Glad to hear RewriteTablePath can be used this way, and I am OK with adding a flag to control whether the file list is saved. I think the file-list location is currently tied to the staging location, and changing the staging location also affects where the metadata files are saved on disk.

Also a nit: I think your change includes both Spark 3.4 and Spark 3.5, so you might want to reflect that in the PR title.

dramaticlly · 2025-04-21T16:59:42Z

api/src/main/java/org/apache/iceberg/actions/RewriteTablePath.java

+  /**
+   * Allows the user to skip saving the file list, determining whether certain files should be
+   * skipped from being saved.
+   *
+   * @param skipFileList A boolean value indicating whether to skip file saving. Pass true to skip,
+   *     false to not skip.
+   * @return this for method chaining
+   */


I feel we can be more concise here since this is a boolean flag. How about:

/**
 * Whether to skip saving the file list location.
 *
 * @param skipFileList true to skip saving the file list, false to include it
 * @return this instance for method chaining
 */

Thank you for your suggestion! I have updated this part of the comments.

slfan1989 · 2025-04-22T01:55:00Z


@szehon-ho @dramaticlly Thank you very much for your messages and for providing the RewriteTablePathProcedure, which makes Hive2Iceberg much simpler. I will continue improving #12762, and once this PR is ready, I will ask you to review the code. Thanks again!

manuzhang · 2025-04-22T13:27:10Z

@slfan1989 I'd suggest adding an option to print copy plan rather than not returning any info. Also, it will be easier to iterate on the idea if you target Spark 3.5 first.

szehon-ho · 2025-04-23T00:00:15Z

@manuzhang Just curious, why print the copy plan if the user doesn't want it?

manuzhang · 2025-04-23T01:10:26Z

@szehon-ho Otherwise, we don't know whether this procedure has run as expected, and the intention of this PR is only to avoid having to clean up the copy plan file.

manuzhang · 2025-04-23T03:10:09Z

.../v3.4/spark/src/main/java/org/apache/iceberg/spark/procedures/RewriteTablePathProcedure.java

  private static final ProcedureParameter STAGING_LOCATION_PARAM =
      ProcedureParameter.optional("staging_location", DataTypes.StringType);
+  private static final ProcedureParameter SKIP_FILE_LIST_PARAM =
+      ProcedureParameter.optional("skip_file_list", DataTypes.BooleanType);


I prefer an option like save_file_list, which defaults to true. skip_file_list doesn't convey that what is skipped is saving the file list to a file.
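
The two naming options differ mainly in what an omitted argument means. A minimal sketch, assuming the procedure arguments arrive as a name-to-value map (the `ProcedureArgsSketch` class and its helpers are hypothetical and not part of Iceberg's procedure framework):

```java
import java.util.Map;

// Hypothetical helper contrasting the two flag polarities discussed above.
final class ProcedureArgsSketch {
  // Positive flag: when the argument is absent, the file list is saved (default true).
  static boolean saveFileList(Map<String, Object> args) {
    Object value = args.get("save_file_list");
    return value == null || (Boolean) value;
  }

  // Negative flag from the current diff: when absent, nothing is skipped (default false).
  static boolean skipFileList(Map<String, Object> args) {
    Object value = args.get("skip_file_list");
    return value != null && (Boolean) value;
  }
}
```

Both variants preserve today's behavior when the argument is omitted; the positive form simply avoids the double negative at call sites such as `skip_file_list => false`.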

slfan1989 · 2025-04-23T03:21:30Z

@manuzhang @szehon-ho @dramaticlly Thank you very much for reviewing this PR. I will continue to improve it.

slfan1989 · 2025-04-23T03:34:28Z

> @slfan1989 I'd suggest adding an option to print copy plan rather than not returning any info.

@manuzhang Thank you for your suggestion! I would like to confirm what you have in mind, though. I have previously looked at the content of the copy plan, and it contains many records. When you refer to printing the copy plan, do you mean outputting it to the console, or should it be written to the file-list? From my understanding, writing it to the file-list seems more reasonable.

> Also, it will be easier to iterate on the idea if you target Spark 3.5 first.

We can indeed focus on the changes in Spark 3.5 first, and once the modifications for Spark 3.5 are complete, we can backport them to Spark 3.4.

szehon-ho · 2025-04-23T06:44:13Z


Yeah, I was also a bit confused about what you mean by 'print' the plan. It can be a big plan.

manuzhang · 2025-04-23T12:56:38Z

@slfan1989 @szehon-ho I meant outputting to console if user doesn't want to save to file. If that's not possible when the plan is big, maybe add two more output values, "rewrite_delete_files_count" and "rewrite_metadata_files_count". Then I think file_list file can be optional.

szehon-ho · 2025-04-23T19:12:06Z

Makes sense. I think adding a count sounds fine.

slfan1989 · 2025-04-24T01:13:39Z


@manuzhang @szehon-ho Thank you for your suggestions! I will make improvements to the code as soon as possible.

slfan1989 · 2025-05-09T03:30:26Z

@szehon-ho @manuzhang @dramaticlly Could you please review this PR again? Thank you very much!

manuzhang · 2025-05-09T03:46:23Z

@slfan1989 FYI, Spark 4.0 integration will be redone. Anyway, could you please create a separate PR for API change only?

slfan1989 · 2025-05-09T04:51:35Z


@manuzhang Thank you for your message! I will continue to follow up on this PR once the Spark 4.0 integration process is completed.

slfan1989 · 2025-05-22T05:23:54Z

@manuzhang @szehon-ho @dramaticlly The Spark 4.0 module has now been successfully merged. Can we restart this PR? Also, do we need to extract the API definitions in RewriteTablePath.java into a separate PR?

manuzhang · 2025-05-28T15:06:24Z

api/src/main/java/org/apache/iceberg/actions/RewriteTablePath.java

+   * @param skipFileList true to skip saving the file list, false to include it
+   * @return this instance for method chaining
+   */
+  default RewriteTablePath skipFileList(boolean skipFileList) {


I still prefer saveFileList or generateFileList.

@manuzhang Thank you for your suggestion! I will optimize the code based on your feedback.

@szehon-ho @dramaticlly @manuzhang Could you please spare some time to review this PR? Thanks a lot! Apologies for the delayed response. Recently, I've been working on migrating Hive tables to Iceberg tables, and following the approach mentioned in #12762, I've migrated over 500 tables in the past two months. During this process, I've encountered some issues that I'd like to discuss with you. I’m currently summarizing some information from the migration process and plan to update #12762 within 1-2 days. Looking forward to your feedback.

dramaticlly · 2025-06-10T15:00:26Z

api/src/main/java/org/apache/iceberg/actions/RewriteTablePath.java

+    /** count of rewrite delete files, default value is 0 */
+    default int deleteFilesCount() {
+      return 0;
+    }
+
+    /** count of rewrite metadata files involved, default value is 0 */
+    default int metadataFilesCount() {
+      return 0;


Can you share a bit more on what those 2 results are used for? I think adding some unit tests with non-zero results would help.

> @slfan1989 @szehon-ho I meant outputting to console if user doesn't want to save to file. If that's not possible when the plan is big, maybe add two more output values, "rewrite_delete_files_count" and "rewrite_metadata_files_count". Then I think file_list file can be optional.

@dramaticlly Thank you very much for reviewing the code! These changes were made based on discussions with @manuzhang.

Do you have any suggestions for the default values?
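
The role of the two counts, and the kind of non-zero check requested above, can be sketched as follows. The types `RewriteResultSketch` and `CountingResult` are hypothetical stand-ins for the action's result; only the method names and the default value of 0 come from the diff.

```java
import java.util.List;

// Hypothetical stand-in for the result interface; defaults mirror the diff above.
interface RewriteResultSketch {
  // Defaulting to 0 keeps implementations that predate these methods compiling.
  default int deleteFilesCount() {
    return 0;
  }

  default int metadataFilesCount() {
    return 0;
  }
}

// An implementation that reports how many files the rewrite actually touched,
// so a caller can confirm the run did something even when no file list is saved.
final class CountingResult implements RewriteResultSketch {
  private final int deleteFiles;
  private final int metadataFiles;

  CountingResult(List<String> rewrittenDeleteFiles, List<String> rewrittenMetadataFiles) {
    this.deleteFiles = rewrittenDeleteFiles.size();
    this.metadataFiles = rewrittenMetadataFiles.size();
  }

  @Override
  public int deleteFilesCount() {
    return deleteFiles;
  }

  @Override
  public int metadataFilesCount() {
    return metadataFiles;
  }
}
```

One caveat with a default of 0 is that a caller cannot tell "no files rewritten" apart from "this implementation does not report counts", which may be worth stating in the Javadoc.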


slfan1989 · 2025-06-19T01:57:14Z

@manuzhang @dramaticlly @szehon-ho I've completed the rebase of the code. If you have some time, could you please review this PR and provide any feedback? Thank you very much!

github-actions · 2025-07-20T00:21:03Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

slfan1989 · 2025-07-25T23:34:44Z

Thank you all for your attention. This piece of code is a bit outdated, so I’ll reorganize it and submit a new version soon.

github-actions bot added API spark labels Apr 18, 2025

szehon-ho reviewed Apr 19, 2025


dramaticlly reviewed Apr 21, 2025


slfan1989 changed the title ~~Spark3.5: Add 'skip_file_list' option to RewriteTablePathProcedure for optional file-list generation~~ Spark: Add 'skip_file_list' option to RewriteTablePathProcedure for optional file-list generation Apr 22, 2025

manuzhang reviewed Apr 23, 2025


github-actions bot added the core label May 8, 2025

slfan1989 force-pushed the add_skip_file_list branch from b0d6cb8 to 98bff3f Compare May 9, 2025 01:33

slfan1989 force-pushed the add_skip_file_list branch from 991c614 to e77338d Compare May 20, 2025 13:51

manuzhang reviewed May 28, 2025


slfan1989 requested review from dramaticlly, manuzhang and szehon-ho June 10, 2025 01:41

dramaticlly reviewed Jun 10, 2025


slfan1989 added 2 commits June 17, 2025 10:44

Add 'skip_file_list' option to RewriteTablePathProcedure for optional…

c14bbc4

… file-list generation.

Add 'skip_file_list' option to RewriteTablePathProcedure for optional…

7a34790

… file-list generation.

slfan1989 force-pushed the add_skip_file_list branch from 50e0b4b to 7a34790 Compare June 17, 2025 02:58

github-actions bot added the stale label Jul 20, 2025

slfan1989 closed this Jul 25, 2025

slfan1989 mentioned this pull request Aug 18, 2025

API, Spark 4.0: Add create_file_list option to RewriteTablePathProcedure. #13837

Merged
