-
Notifications
You must be signed in to change notification settings - Fork 395
feat(transaction): Add option to check added data files in FastAppendAction #2025
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hi, thanks for contributing! the change looks good
Yes, this would be nice! |
|
added |
|
@CTTY what does the approval and merge process look like in this project? |
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @mitchellciupak thanks for this pr, but I don't think this is the right way to go. Maybe adding some kind of cache is a better approach to optimize it.
|
Hey @liurenjie1024, thanks for taking a look, I was planning on handling the cache prior to the commit call, essentially validating the snapshot once beforehand. The flag would then allow for cheaper retries on the commit method with data file checks disabled. Different users may implement their caching/validation strategies differently, but I am certainly happy to add a solution for it. Do you have any initial ideas on implementation? |
Which issue does this PR close?
This PR does not close any existing issues. It addresses an optimization opportunity in the fast append workflow.
Use Case
I need to validate data files before committing them to the table. Currently,
validate_added_data_files()is called internally duringcommit(), which means validation occurs on every commit attempt, including retries.Enhancement
By disabling
validate_added_data_files()in the commit method, I can perform validation once before attempting said commit. This allows for commit retries without re-running validation, reducing overhead in retry scenarios.It's a performance optimization that provides more control over the validation/commit lifecycle.
What changes are included in this PR?
This commit adds an option to the FastAppendAction to disable the validation step
snapshot_producer.validate_added_data_files()during commits. This is similar to the option to disablesnapshot_producer.validate_duplicate_files()append.rs.The change is implemented in
crates/iceberg/src/transaction/append.rs.Are these changes tested?
These changes have been manually tested outside the test framework. I noticed that the existing
with_check_duplicate()method has no test coverage. I'm not sure if either feature is just too small to be considered in scope for the project's test strategy. If helpful, I can add tests for bothwith_check_duplicate()and the newvalidate_added_data_files()method here in this PR.