
feat: Infer partition values from bounds#1079

Merged
liurenjie1024 merged 11 commits into apache:main from jonathanc-n:add-to-partitioned
Apr 8, 2025

Conversation

Contributor

@jonathanc-n jonathanc-n commented Mar 13, 2025

Which issue does this PR close?

What changes are included in this PR?

Added API for creating partition struct from statistics

Are these changes tested?

Will add tests in a follow-up PR that integrates this with the add_parquet_file API
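The core idea of the new API can be sketched as follows. This is a minimal illustration with hypothetical names and simplified i64 bounds, not the actual iceberg-rust types: a partition value is taken from a column's bounds only when the lower and upper bound agree, since otherwise the file may span multiple partitions.

```rust
use std::collections::HashMap;

// Hypothetical sketch, not the actual iceberg-rust API: derive a partition
// value for each partition source column from its lower/upper bounds,
// rejecting files whose bounds differ (such a file may span partitions).
fn partition_values_from_bounds(
    source_ids: &[i32],
    lower_bounds: &HashMap<i32, i64>,
    upper_bounds: &HashMap<i32, i64>,
) -> Result<Vec<i64>, String> {
    let mut partition = Vec::with_capacity(source_ids.len());
    for id in source_ids {
        match (lower_bounds.get(id), upper_bounds.get(id)) {
            // Bounds agree: every row in the file has this value.
            (Some(lower), Some(upper)) if lower == upper => partition.push(*lower),
            (Some(_), Some(_)) => {
                return Err(format!("bounds differ for column {id}; cannot infer partition"))
            }
            _ => return Err(format!("missing bounds for column {id}")),
        }
    }
    Ok(partition)
}
```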

@jonathanc-n jonathanc-n changed the title feat: Infer partition values statistics feat: Infer partition values from statistics Mar 13, 2025
));
}

if lower != upper {
Contributor

CMIIW, but it looks like lower and upper can be different while their partition values are the same. So should we check transform(lower) != transform(upper) instead? iceberg-python has the same logic; should we fix it there too? cc @kevinjqliu @Fokko

Contributor

I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of all rows are the same.

Contributor

> I don't think so; transform(lower) == transform(upper) doesn't mean the transformed results of all rows are the same.

This is interesting. The check here restricts the appended data file to a single value for the partition source column. But per the spec, a data file only needs to guarantee that the partition value of the partition column is the same within a single data file. e.g. for year(ts), 2015-10-13 and 2015-11-13 are fine to coexist in a single data file, but under this restriction we could not append a data file containing those two rows, right?
I'm not sure whether it's worth it, but I think there are two ways to avoid this restriction:

  1. Scan the whole data file to compute the partition values and make sure they are the same.
  2. For partition transforms that preserve the original order (I'm not sure whether this description is accurate; e.g. year, month), transform(lower) == transform(upper) does mean the transformed result of each row is the same.
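The order-preservation argument in option 2 can be sketched as follows, using hypothetical simplified transforms over i32 values rather than the actual iceberg-rust Transform implementations: for a monotonic transform, equal transformed bounds pin every row to one partition, while for a non-monotonic transform (e.g. bucket) they prove nothing.

```rust
// Hypothetical order-preserving transform: "year"-like bucketing of days.
fn year_like(days: i32) -> i32 {
    days.div_euclid(365)
}

// Hypothetical non-order-preserving transform: a toy modulo bucket.
fn toy_bucket(v: i32) -> i32 {
    v.rem_euclid(4)
}

// For monotonic f and any v in [lower, upper], f(lower) <= f(v) <= f(upper),
// so f(lower) == f(upper) forces every row into the same partition. A
// non-monotonic f gives no such guarantee for values between the bounds.
fn all_same_partition(rows: &[i32], f: fn(i32) -> i32) -> bool {
    rows.iter().all(|&v| f(v) == f(rows[0]))
}
```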

Contributor

I agree that if the transform preserves order, we can relax the check.

Contributor

@liurenjie1024 liurenjie1024 left a comment

Hi @jonathanc-n, I'm quite confused by this PR: how can you infer partition values from statistics? First of all, statistics are optional, and they may be inaccurate; for example, long strings may be truncated. If you want to use them when appending parquet files in a table transaction, you need to read the partition source columns back and recalculate them.

Contributor

ZENOTME commented Apr 2, 2025

> Hi @jonathanc-n, I'm quite confused by this PR: how can you infer partition values from statistics? First of all, statistics are optional, and they may be inaccurate; for example, long strings may be truncated. If you want to use them when appending parquet files in a table transaction, you need to read the partition source columns back and recalculate them.

I think this implementation is based on pyiceberg; see: https://github.com/apache/iceberg-python/blob/4d4714a46241d0d89519a2a605dbce27b713a60e/pyiceberg/io/pyarrow.py#L2236. It uses the lower bound and upper bound to compute the partition. Here, these statistics (lower bound, upper bound) are generated when reading the parquet file, so I think we can guarantee that they are valid and accurate. 🤔

Contributor Author

I think the function name is misleading; I will change that. We are passing in the lower and upper bounds computed from the original parquet file read during parquet_to_data_file_builder.

@jonathanc-n jonathanc-n changed the title feat: Infer partition values from statistics feat: Infer partition values from bounds Apr 2, 2025
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @jonathanc-n for this PR, and @ZENOTME for the review. Just one minor bug.


for field in table_spec.fields() {
    if let (Some(lower), Some(upper)) = (
        lower_bounds.get(&field.field_id),
Contributor

If we are checking the source value, this should be source_id?
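The distinction matters because a partition spec field carries both its own id and the id of the source column its transform reads, while column statistics (and thus the bounds maps) are keyed by the source column's id. A sketch with hypothetical simplified types, not the actual iceberg-rust structs:

```rust
use std::collections::HashMap;

// Hypothetical shapes, not the actual iceberg-rust types.
#[allow(dead_code)]
struct PartitionField {
    field_id: i32,  // id of the partition field itself (typically 1000+)
    source_id: i32, // id of the source column the transform reads
}

// Bounds come from column statistics, which are keyed by the source
// column id, so the lookup must use source_id, not field_id.
fn lookup_lower_bound<'a>(
    field: &PartitionField,
    lower_bounds: &'a HashMap<i32, i64>,
) -> Option<&'a i64> {
    lower_bounds.get(&field.source_id)
}
```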

Contributor Author

@liurenjie1024 Good catch, should be good now. Thank you for the review!

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @jonathanc-n for this PR, LGTM!

@liurenjie1024 liurenjie1024 merged commit e3ef617 into apache:main Apr 8, 2025
17 checks passed
@jonathanc-n jonathanc-n deleted the add-to-partitioned branch April 8, 2025 17:15
