-
Notifications
You must be signed in to change notification settings - Fork 3k
Spark: Add 'skip_file_list' option to RewriteTablePathProcedure for optional file-list generation #12844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Interesting, is it all that you need to do Hive -> Iceberg conversion. Seems simple and make sense to me. cc @flyrain @dramaticlly for any thoughts |
@szehon-ho Thank you for your reply! The sample code(#12769) I provided is consistent with the code in our production environment. We have successfully used this method to batch convert over 80 Hive tables to Iceberg tables. The migrated tables are expected, with both read and write tasks normally. This feature relies on the community's existing Snapshot and RewriteTablePath functions, which offer good stability. Additionally, by adjusting some Spark parameters, it enables the fast migration of large tables. |
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. As I said, Im ok with the idea, left some preliminary comment. But lets wait to see what the other say as well
| * false to not skip. | ||
| * @return this for method chaining | ||
| */ | ||
| RewriteTablePath skipFileList(boolean skipFileList); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should make a default impl to avoid breaking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion! I have added a default implementation.
|
|
||
| // skip file list | ||
| if (skipFileList) { | ||
| return "skip-file-list"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just thinking out loud, would 'null' be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe using null is fine, but during unit testing, I found that I cannot use null directly due to a non-null validation. Therefore, I considered using an empty string or the string 'null' as alternatives. In this version, I opted for the empty string as a replacement. I'm wondering if this approach is appropriate?
Glad to hear RewriteTablePath can be used this way and I am ok to add a flag to control the behavior of saving file list location. I think currently it's tied to staging location and changing staging location will also influence where the metadata files will be saved on disk. Also nit, I think your change include both Spark 3.4 and Spark 3.5, so you might want to reflect that in the PR title. |
| /** | ||
| * Allows the user to skip saving the file list, determining whether certain files should be | ||
| * skipped from being saved. | ||
| * | ||
| * @param skipFileList A boolean value indicating whether to skip file saving. Pass true to skip, | ||
| * false to not skip. | ||
| * @return this for method chaining | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we can be more concise about it since this is a boolean flag, how about?
/**
* Whether to skip saving the file list location.
*
* @param skipFileList true to skip saving the file list, false to include it
* @return this instance for method chaining
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion! I have updated this part of the comments.
@szehon-ho @dramaticlly Thank you very much for your messages and for providing the RewriteTablePathProcedure, which makes Hive2Iceberg much simpler. I will continue improving #12762, and once this PR is ready, I will ask you to review the code. Thanks again! |
|
@slfan1989 I'd suggest adding an option to print copy plan rather than not returning any info. Also, it will be easier to iterate on the idea if you target Spark 3.5 first. |
|
@manuzhang just curious, why print the copy plan, if the user doesnt want it? |
|
@szehon-ho Otherwise, we don't know whether this procedure has run as expected and the intention of this PR is not to worry about cleaning the copy plan file. |
| private static final ProcedureParameter STAGING_LOCATION_PARAM = | ||
| ProcedureParameter.optional("staging_location", DataTypes.StringType); | ||
| private static final ProcedureParameter SKIP_FILE_LIST_PARAM = | ||
| ProcedureParameter.optional("skip_file_list", DataTypes.BooleanType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer an option like save_file_list which defaults to true. skip_file_list doesn't reflect that it's skipping saving file list to a file.
|
@manuzhang @szehon-ho @dramaticlly Thank you very much for reviewing this PR. I will continue to improve it. |
@manuzhang Thank you for your suggestion! but I would like to confirm your thoughts. I have previously looked at the content of the copy plan, which displays many records. When you refer to printing the copy plan, do you mean outputting it to the console, or should it be outputted to the file-list? From my understanding, outputting it to the file-list seems more reasonable.
We can indeed focus on the changes in Spark 3.5 first, and once the modifications for Spark 3.5 are complete, we can backport them to Spark 3.4. |
Yea I was also a bit confused what you mean 'print' the plan? It can be a big plan |
|
@slfan1989 @szehon-ho I meant outputting to console if user doesn't want to save to file. If that's not possible when the plan is big, maybe add two more output values, "rewrite_delete_files_count" and "rewrite_metadata_files_count". Then I think |
|
Makes sense, i think adding a count sounds fine to me. |
@manuzhang @szehon-ho Thank you for your suggestions! I will make improvements to the code as soon as possible. |
b0d6cb8 to
98bff3f
Compare
|
@szehon-ho @manuzhang @dramaticlly Could you please review this PR again? Thank you very much! |
|
@slfan1989 FYI, Spark 4.0 integration will be redone. Anyway, could you please create a separate PR for API change only? |
@manuzhang Thank you for your message! I will continue to follow up on this PR once the Spark 4.0 integration process is completed. |
991c614 to
e77338d
Compare
|
@manuzhang @szehon-ho @dramaticlly The Spark 4.0 module has now been successfully merged. Can we restart this PR? Also, do we need to extract the API definitions in RewriteTablePath.java into a separate pr? |
| * @param skipFileList true to skip saving the file list, false to include it | ||
| * @return this instance for method chaining | ||
| */ | ||
| default RewriteTablePath skipFileList(boolean skipFileList) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still prefer saveFileList or generateFileList.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@manuzhangThank you for your suggestion! I will optimize the code based on your feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szehon-ho @dramaticlly @manuzhang Could you please spare some time to review this PR? Thanks a lot! Apologies for the delayed response. Recently, I've been working on migrating Hive tables to Iceberg tables, and following the approach mentioned in #12762, I've migrated over 500 tables in the past two months. During this process, I've encountered some issues that I'd like to discuss with you. I’m currently summarizing some information from the migration process and plan to update #12762 within 1-2 days. Looking forward to your feedback.
| /** count of rewrite delete files, default value is 0 */ | ||
| default int deleteFilesCount() { | ||
| return 0; | ||
| } | ||
|
|
||
| /** count of rewrite metadata files involved, default value is 0 */ | ||
| default int metadataFilesCount() { | ||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you share a bit more on what are those 2 results are used for? I think add some unit tests with non-zero results would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slfan1989 @szehon-ho I meant outputting to console if user doesn't want to save to file. If that's not possible when the plan is big, maybe add two more output values, "rewrite_delete_files_count" and "rewrite_metadata_files_count". Then I think
file_listfile can be optional.
@dramaticlly Thank you very much for reviewing the code! These changes were made based on discussions with @manuzhang.
Do you have any suggestions for the default values?
… file-list generation.
… file-list generation.
50e0b4b to
7a34790
Compare
|
@manuzhang @dramaticlly @szehon-ho I've completed the rebase on the code. If you have some time, Could you please review this PR and provide any feedback? Thank you very much! |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
Thank you all for your attention. This piece of code is a bit outdated, so I’ll reorganize it and submit a new version soon. |
This is a minor feature improvement. The background is that we are using
RewriteTablePathProcedureto convert Hive tables to Iceberg tables, as detailed in #12762.RewriteTablePathProceduregenerates afile-listfile, and I need to manually clean up this file after each conversion. I understand that thefile-listis mainly used to check data integrity, but since it is not essential for metadata, I believe allowing users to decide whether to generate this file would offer greater flexibility.