Skip to content

Conversation

@slfan1989
Copy link
Contributor

This is a minor feature improvement. The background is that we are using RewriteTablePathProcedure to convert Hive tables to Iceberg tables, as detailed in #12762. RewriteTablePathProcedure generates a file-list file, and I need to manually clean up this file after each conversion. I understand that the file-list is mainly used to check data integrity, but since it is not essential for metadata, I believe allowing users to decide whether to generate this file would offer greater flexibility.

@szehon-ho
Copy link
Member

Interesting, is it all that you need to do Hive -> Iceberg conversion. Seems simple and make sense to me. cc @flyrain @dramaticlly for any thoughts

@slfan1989
Copy link
Contributor Author

Interesting, is it all that you need to do Hive -> Iceberg conversion. Seems simple and make sense to me. cc @flyrain @dramaticlly for any thoughts

@szehon-ho Thank you for your reply! The sample code(#12769) I provided is consistent with the code in our production environment. We have successfully used this method to batch convert over 80 Hive tables to Iceberg tables. The migrated tables are expected, with both read and write tasks normally. This feature relies on the community's existing Snapshot and RewriteTablePath functions, which offer good stability. Additionally, by adjusting some Spark parameters, it enables the fast migration of large tables.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. As I said, Im ok with the idea, left some preliminary comment. But lets wait to see what the other say as well

* false to not skip.
* @return this for method chaining
*/
RewriteTablePath skipFileList(boolean skipFileList);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should make a default impl to avoid breaking change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion! I have added a default implementation.


// skip file list
if (skipFileList) {
return "skip-file-list";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking out loud, would 'null' be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe using null is fine, but during unit testing, I found that I cannot use null directly due to a non-null validation. Therefore, I considered using an empty string or the string 'null' as alternatives. In this version, I opted for the empty string as a replacement. I'm wondering if this approach is appropriate?

@dramaticlly
Copy link
Contributor

Interesting, is it all that you need to do Hive -> Iceberg conversion. Seems simple and make sense to me. cc @flyrain @dramaticlly for any thoughts

Glad to hear RewriteTablePath can be used this way and I am ok to add a flag to control the behavior of saving file list location. I think currently it's tied to staging location and changing staging location will also influence where the metadata files will be saved on disk.

Also nit, I think your change include both Spark 3.4 and Spark 3.5, so you might want to reflect that in the PR title.

Comment on lines 89 to 94
/**
* Allows the user to skip saving the file list, determining whether certain files should be
* skipped from being saved.
*
* @param skipFileList A boolean value indicating whether to skip file saving. Pass true to skip,
* false to not skip.
* @return this for method chaining
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we can be more concise about it since this is a boolean flag, how about?

  /**
   * Whether to skip saving the file list location.
   *
   * @param skipFileList true to skip saving the file list, false to include it
   * @return this instance for method chaining
   */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion! I have updated this part of the comments.

@slfan1989 slfan1989 changed the title Spark3.5: Add 'skip_file_list' option to RewriteTablePathProcedure for optional file-list generation Spark: Add 'skip_file_list' option to RewriteTablePathProcedure for optional file-list generation Apr 22, 2025
@slfan1989
Copy link
Contributor Author

Interesting, is it all that you need to do Hive -> Iceberg conversion. Seems simple and make sense to me. cc @flyrain @dramaticlly for any thoughts

Glad to hear RewriteTablePath can be used this way and I am ok to add a flag to control the behavior of saving file list location. I think currently it's tied to staging location and changing staging location will also influence where the metadata files will be saved on disk.

Also nit, I think your change include both Spark 3.4 and Spark 3.5, so you might want to reflect that in the PR title.

@szehon-ho @dramaticlly Thank you very much for your messages and for providing the RewriteTablePathProcedure, which makes Hive2Iceberg much simpler. I will continue improving #12762, and once this PR is ready, I will ask you to review the code. Thanks again!

@manuzhang
Copy link
Member

manuzhang commented Apr 22, 2025

@slfan1989 I'd suggest adding an option to print copy plan rather than not returning any info. Also, it will be easier to iterate on the idea if you target Spark 3.5 first.

@szehon-ho
Copy link
Member

@manuzhang just curious, why print the copy plan, if the user doesnt want it?

@manuzhang
Copy link
Member

@szehon-ho Otherwise, we don't know whether this procedure has run as expected and the intention of this PR is not to worry about cleaning the copy plan file.

private static final ProcedureParameter STAGING_LOCATION_PARAM =
ProcedureParameter.optional("staging_location", DataTypes.StringType);
private static final ProcedureParameter SKIP_FILE_LIST_PARAM =
ProcedureParameter.optional("skip_file_list", DataTypes.BooleanType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer an option like save_file_list which defaults to true. skip_file_list doesn't reflect that it's skipping saving file list to a file.

@slfan1989
Copy link
Contributor Author

@manuzhang @szehon-ho @dramaticlly Thank you very much for reviewing this PR. I will continue to improve it.

@slfan1989
Copy link
Contributor Author

@slfan1989 I'd suggest adding an option to print copy plan rather than not returning any info.

@manuzhang Thank you for your suggestion! but I would like to confirm your thoughts. I have previously looked at the content of the copy plan, which displays many records. When you refer to printing the copy plan, do you mean outputting it to the console, or should it be outputted to the file-list? From my understanding, outputting it to the file-list seems more reasonable.

Also, it will be easier to iterate on the idea if you target Spark 3.5 first.

We can indeed focus on the changes in Spark 3.5 first, and once the modifications for Spark 3.5 are complete, we can backport them to Spark 3.4.

@szehon-ho
Copy link
Member

@manuzhang Thank you for your suggestion! but I would like to confirm your thoughts. I have previously looked at the content of the copy plan, which displays many records. When you refer to printing the copy plan, do you mean outputting it to the console, or should it be outputted to the file-list? From my understanding, outputting it to the file-list seems more reasonable.

Yea I was also a bit confused what you mean 'print' the plan? It can be a big plan

@manuzhang
Copy link
Member

manuzhang commented Apr 23, 2025

@slfan1989 @szehon-ho I meant outputting to console if user doesn't want to save to file. If that's not possible when the plan is big, maybe add two more output values, "rewrite_delete_files_count" and "rewrite_metadata_files_count". Then I think file_list file can be optional.

@szehon-ho
Copy link
Member

Makes sense, i think adding a count sounds fine to me.

@slfan1989
Copy link
Contributor Author

@slfan1989 @szehon-ho I meant outputting to console if user doesn't want to save to file. If that's not possible when the plan is big, maybe add two more output values, "rewrite_delete_files_count" and "rewrite_metadata_files_count". Then I think file_list file can be optional.

@manuzhang @szehon-ho Thank you for your suggestions! I will make improvements to the code as soon as possible.

@github-actions github-actions bot added the core label May 8, 2025
@slfan1989 slfan1989 force-pushed the add_skip_file_list branch from b0d6cb8 to 98bff3f Compare May 9, 2025 01:33
@slfan1989
Copy link
Contributor Author

@szehon-ho @manuzhang @dramaticlly Could you please review this PR again? Thank you very much!

@manuzhang
Copy link
Member

@slfan1989 FYI, Spark 4.0 integration will be redone. Anyway, could you please create a separate PR for API change only?

@slfan1989
Copy link
Contributor Author

@slfan1989 FYI, Spark 4.0 integration will be redone. Anyway, could you please create a separate PR for API change only?

@manuzhang Thank you for your message! I will continue to follow up on this PR once the Spark 4.0 integration process is completed.

@slfan1989 slfan1989 force-pushed the add_skip_file_list branch from 991c614 to e77338d Compare May 20, 2025 13:51
@slfan1989
Copy link
Contributor Author

@manuzhang @szehon-ho @dramaticlly The Spark 4.0 module has now been successfully merged. Can we restart this PR? Also, do we need to extract the API definitions in RewriteTablePath.java into a separate pr?

* @param skipFileList true to skip saving the file list, false to include it
* @return this instance for method chaining
*/
default RewriteTablePath skipFileList(boolean skipFileList) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still prefer saveFileList or generateFileList.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manuzhangThank you for your suggestion! I will optimize the code based on your feedback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho @dramaticlly @manuzhang Could you please spare some time to review this PR? Thanks a lot! Apologies for the delayed response. Recently, I've been working on migrating Hive tables to Iceberg tables, and following the approach mentioned in #12762, I've migrated over 500 tables in the past two months. During this process, I've encountered some issues that I'd like to discuss with you. I’m currently summarizing some information from the migration process and plan to update #12762 within 1-2 days. Looking forward to your feedback.

Comment on lines +126 to +133
/** count of rewrite delete files, default value is 0 */
default int deleteFilesCount() {
return 0;
}

/** count of rewrite metadata files involved, default value is 0 */
default int metadataFilesCount() {
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share a bit more on what are those 2 results are used for? I think add some unit tests with non-zero results would help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slfan1989 @szehon-ho I meant outputting to console if user doesn't want to save to file. If that's not possible when the plan is big, maybe add two more output values, "rewrite_delete_files_count" and "rewrite_metadata_files_count". Then I think file_list file can be optional.

@dramaticlly Thank you very much for reviewing the code! These changes were made based on discussions with @manuzhang.

Do you have any suggestions for the default values?

@slfan1989 slfan1989 force-pushed the add_skip_file_list branch from 50e0b4b to 7a34790 Compare June 17, 2025 02:58
@slfan1989
Copy link
Contributor Author

@manuzhang @dramaticlly @szehon-ho I've completed the rebase on the code. If you have some time, Could you please review this PR and provide any feedback? Thank you very much!

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 20, 2025
@slfan1989 slfan1989 closed this Jul 25, 2025
@slfan1989
Copy link
Contributor Author

Thank you all for your attention. This piece of code is a bit outdated, so I’ll reorganize it and submit a new version soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants