feat: Infer partition values from bounds #1079
Conversation
```rust
));
}

if lower != upper {
```
CMIIW, it looks like lower and upper can be different while their partitions are the same. So we need to check transform(lower) != transform(upper)? iceberg-python has the same logic; should we fix it? cc @kevinjqliu @Fokko
I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of each row are all the same.
> I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of each row are all the same.
This is interesting. The check here restricts the appended data file to have the same value for the partition column. But in the spec, the data file only needs to guarantee that the partition value of the partition column within a single data file is the same. e.g. for year(ts), 2015-10-13 and 2015-11-13 are OK to exist in a single data file, I think. But under this restriction, we could not append a data file containing these two rows, right?
I'm not sure whether it's worth it; I think there are two ways to avoid this restriction:
- Scan the whole data file to compute the partition values and make sure they are the same.
- For partition transforms that preserve the original order properties (I'm not sure whether this description is accurate, e.g. year, month), transform(lower) == transform(upper) means the transformed results of each row are all the same?
There was a problem hiding this comment.
I agree that if the transform preserves order, we can relax the check.
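A minimal Rust sketch of that argument, with an illustrative stand-in for an order-preserving transform (the crude year-of-epoch-day function and the day values below are assumptions for illustration, not the real iceberg-rust transforms): for a monotonic transform, transform(lower) == transform(upper) implies every value between the bounds lands in the same partition.

```rust
// Illustrative stand-in for an order-preserving partition transform:
// roughly "year of an epoch-day value" (NOT the real iceberg-rust `year`).
fn year_transform(epoch_day: i64) -> i64 {
    epoch_day.div_euclid(365)
}

// For a monotonic transform, equal transformed bounds imply every row
// in [lower, upper] maps to the same partition value.
fn single_partition(lower: i64, upper: i64) -> Option<i64> {
    let (lo, hi) = (year_transform(lower), year_transform(upper));
    if lo == hi { Some(lo) } else { None }
}

fn main() {
    // ~2015-10-13 and ~2015-11-13 as approximate days since the epoch:
    // different values, same year bucket, so one partition value suffices.
    assert_eq!(single_partition(16721, 16752), Some(45));
    // A range crossing the bucket boundary cannot be a single partition.
    assert_eq!(single_partition(16721, 16800), None);
    println!("ok");
}
```

This is why the check could be relaxed for order-preserving transforms but not for arbitrary ones (e.g. bucket transforms, where equal transformed bounds say nothing about the rows in between).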
liurenjie1024
left a comment
Hi @jonathanc-n, I'm quite confused about this PR: how can you infer partition values from statistics? First of all, statistics are optional, and they may be inaccurate. For example, long strings may be truncated. If you want to use them when appending parquet files to a table transaction, you need to read the partition source columns back and recalculate them.
I think this implementation is adapted from pyiceberg, see: https://github.com/apache/iceberg-python/blob/4d4714a46241d0d89519a2a605dbce27b713a60e/pyiceberg/io/pyarrow.py#L2236. It uses the lower bound and upper bound to compute the partition. Here the statistics (lower bound, upper bound) are generated when reading the parquet file, so I think we can guarantee that they are valid and accurate. 🤔
I think the function name is misleading; I will change that. We are passing in the lower and upper bounds computed during the original parquet file read.
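To illustrate the distinction being made here (a hedged sketch; the helper name is hypothetical, not the PR's API): bounds computed directly from the column values while reading the file are exact, unlike optional file-level statistics that may be missing or truncated.

```rust
// Hypothetical helper: compute exact lower/upper bounds from the values
// actually read, rather than trusting optional file-level statistics.
fn exact_bounds(values: &[i64]) -> Option<(i64, i64)> {
    let min = *values.iter().min()?;
    let max = *values.iter().max()?;
    Some((min, max))
}

fn main() {
    let col = [7, 3, 9, 3];
    assert_eq!(exact_bounds(&col), Some((3, 9)));
    assert_eq!(exact_bounds(&[]), None); // no rows -> no bounds
    println!("ok");
}
```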
liurenjie1024
left a comment
Thanks @jonathanc-n for this PR, and @ZENOTME for the review; just one minor bug.
```rust
for field in table_spec.fields() {
    if let (Some(lower), Some(upper)) = (
        lower_bounds.get(&field.field_id),
```
If we are checking the source value, this should be source_id?
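A sketch of the fix being suggested (the struct and helper below are illustrative of the hunk above, not the exact iceberg-rust types): the bounds maps are keyed by the *source* column id, so the lookup should use `source_id` rather than the partition field's own `field_id`.

```rust
use std::collections::HashMap;

// Illustrative partition field: `field_id` identifies the partition field
// itself, `source_id` the source column the transform reads from.
struct PartitionField {
    field_id: i32,
    source_id: i32,
}

// Bounds are keyed by the source column id, so look up with `source_id`.
fn bounds_for_field<'a>(
    field: &PartitionField,
    lower_bounds: &'a HashMap<i32, i64>,
    upper_bounds: &'a HashMap<i32, i64>,
) -> Option<(&'a i64, &'a i64)> {
    Some((
        lower_bounds.get(&field.source_id)?,
        upper_bounds.get(&field.source_id)?,
    ))
}

fn main() {
    let field = PartitionField { field_id: 1000, source_id: 3 };
    let lower = HashMap::from([(3, 10)]);
    let upper = HashMap::from([(3, 10)]);
    // Keyed by source_id (3), not field_id (1000).
    assert_eq!(bounds_for_field(&field, &lower, &upper), Some((&10, &10)));
    println!("ok (field_id {} unused for lookup)", field.field_id);
}
```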
@liurenjie1024 Good catch, should be good now. Thank you for the review!
liurenjie1024
left a comment
Thanks @jonathanc-n for this PR, LGTM!
Which issue does this PR close?
What changes are included in this PR?
Added an API for creating a partition struct from statistics.
Are these changes tested?
Tests will be added in a follow-up PR integrating this with the add_parquet_file API.