-
Notifications
You must be signed in to change notification settings - Fork 3k
Data: Add partition stats writer and reader #11216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
941505a to
05a80f6
Compare
core/src/main/java/org/apache/iceberg/data/PartitionStatsRecord.java
Outdated
Show resolved
Hide resolved
| Schema schema, | ||
| PartitionSpec spec, | ||
| int formatVersion, | ||
| Map<String, String> properties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was no option to pass the table properties before.
Needed to pass different file format for paramterized test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not adding the parameter to the old create method and call the new method from the old one?
Like:
public static TestTable create(
File temp,
String name,
Schema schema,
PartitionSpec spec,
SortOrder sortOrder,
int formatVersion) {
return create(temp, name, schema, spec, SortOrder.unsorted(), formatVersion, ImmutableMap.of());
}
public static TestTable create(
File temp,
String name,
Schema schema,
PartitionSpec spec,
int formatVersion,
Map<String, String> properties) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed a similar style when they added MetricsReporter,
Why not adding the parameter to the old create
It is public. So, need to modify all the callers.
But we can refactor a private method that can be helper to all these public method. I can do it in a follow up to keep minimal changes for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I don't really understand your answer here 😢
I'm not suggesting just simply adding the parameter.
What I'm suggesting is to create a new constructor with the extra parameter, but in the old constructor we should remove the implementation and call the new constructor with an empty property map. This way we will not have duplicated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, at some point I was considering introducing a Builder class for these TestTables because there are som many different create() functions now. Once this PR is merged, I think I'll re-visit this plan since we have yet another version of these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Builder method is definitely better. I just extracted common code into a private function as of now to keep minimal changes in the PR.
|
@aokolnychyi: This PR is ready. But as we discussed previously, this PR wraps the I will explore adding the internal writers for Parquet and Orc. Similar to #11108. |
|
@RussellSpitzer: It would be good to have this in 1.7.0. |
|
I already tried POC for internal writers on another branch, The problems: b) Also, Using partitionData in StructLikeMap is not working fine. Some keys are missing in the map (looks like equals() logic), If I use Record, it is fine. Maybe in the next version we can have optimized writer and reader (without converter using internal reader and writers). |
data/src/test/java/org/apache/iceberg/data/TestPartitionStatsHandler.java
Outdated
Show resolved
Hide resolved
|
Moving out of 1.7.0 since we still have a bit of discussion here |
05a80f6 to
ee3b273
Compare
|
@RussellSpitzer: I have added the Assertion for Partition type as you suggested and replied to #11216 (comment), do you have anymore comments for this PR? |
|
I had a conversation with @rdblue today about internal writers. Ryan should have a bit of time to help/guide. |
data/src/main/java/org/apache/iceberg/data/PartitionStatsHandler.java
Outdated
Show resolved
Hide resolved
|
@RussellSpitzer @aokolnychyi I'm reviewing the stale PRs, and this one is open for month. Do we have a way to move forward ? I can do a new review, but at the end of the day, it won't help for the merge (as only committers can merge PR). |
7fafa91 to
113dc7d
Compare
|
Rebased and PR is ready. |
data/src/main/java/org/apache/iceberg/data/PartitionStatsHandler.java
Outdated
Show resolved
Hide resolved
pvary
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one last comment, but otherwise looks good to me.
I would wait @gaborkaszab to approve too, as he seems genuinely interested. He is OOO for this week, but next week we can merge
|
@pvary: Thanks a lot for the review and approval. |
gaborkaszab
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for waiting for me! I went through the PR one more time.
| schema, spec, sortOrder, temp.toString(), ImmutableMap.of(), formatVersion)); | ||
|
|
||
| return new TestTable(ops, name); | ||
| return createTable(temp, name, schema, spec, formatVersion, ImmutableMap.of(), sortOrder, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You moved the implementation of these create functions into a common place that is a nice refactor. However, I see one create() function above at L55 that still has this same implementation body that seems redundant. Can't you also replace that with a function call to the common implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function came from latest rebase. Someone added it as they wanted to pass properties.
Updated it.
| return Parquet.writeData(outputFile) | ||
| .schema(dataSchema) | ||
| .createWriterFunc(InternalWriter::create) | ||
| .overwrite() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Shouldn't we also remove the .overwrite() call for the writer here? In case we assume the stat files have unique name we shouldn't allow overwriting anything that shares the same name.
| .metadataFileLocation( | ||
| fileFormat.addExtension( | ||
| String.format( | ||
| Locale.ROOT, "partition-stats-%d-%s", snapshotId, UUID.randomUUID())))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the way to follow, however, we should somehow raise awareness that the approach to update both the table stats and partition stats can produce orphan files within the table folder. I think we should keep pinging the ongoing conversation on dev@ about this because with this approach we could easily flood the folder with orphan files and when we drop the table these files will remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it is same as how other stats are supported (puffin files). I saw your reply on mailing list. Lets see what others think. We can normalize that behavior for all stats. No need to wait or block this PR for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we shouldn't block this PR with the orphaned stat files issue as it is an existing one already. I just wanted to raise awareness that we should continue that discussion on dev@ to to figure out a solution if there is any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Let's move forward and discuss on dev@
| return Avro.writeData(outputFile) | ||
| .schema(dataSchema) | ||
| .createWriterFunc(org.apache.iceberg.avro.InternalWriter::create) | ||
| .overwrite() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above: stat file names are expected to be unique, we shouldn't overwrite any existing files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followed other writers pattern. Removed it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree to keep it aligned with other writers pattern.
| @Override | ||
| public UpdatePartitionStatistics setPartitionStatistics(PartitionStatisticsFile file) { | ||
| Preconditions.checkArgument(null != file, "partition statistics file must not be null"); | ||
| if (file == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we update the comment on the interface to be clear that a null param results a noop?
I checked the implementation again for SetStatistics.setStatistics() and apparently there is no special handling for null inputs. I became a bit hesitant now because there giving a null stat file as param would result in an NPE when calling the Optional.of(statisticsFile).
Not sure now if noop is the way to go here or the Precondition was better. Sorry to this back and forth. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked Trino & Spark Puffin implementation, caller is making sure it is not null.
IcebergMetadata#finishStatisticsCollection() in Trino
ComputeTableStatsSparkAction#execute() in spark
API level NPE is not a good idea. I am ok with no-op as it avoids extra null handling for the callers.
So, I don't think we need to revert back again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think it was reasonable to throw an NPE and be consistent with UpdateStatistics. There is no problem with NPE as long as the error message is descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We went little back and forth on this.
For the end users, it avoids extra null check if the stats are null by going with no-op. So, went ahead with no-op
2551fc8 to
3003a5d
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ajantha-bhat !
|
I did the review and it looks good to me. Thanks ! |
|
@RussellSpitzer, @aokolnychyi: Anymore comments? others have approved the changes. |
3003a5d to
4c5ff33
Compare
|
Just rebased the PR as there was a conflict from #12419 in |
|
Thanks for all your work @ajantha-bhat! |
|
Thanks everyone for the review and merging it. |
aokolnychyi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see this in, sorry it took me so long to get back. I left some comments to consider. Thanks, @ajantha-bhat!
| @Override | ||
| public UpdatePartitionStatistics setPartitionStatistics(PartitionStatisticsFile file) { | ||
| Preconditions.checkArgument(null != file, "partition statistics file must not be null"); | ||
| if (file == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think it was reasonable to throw an NPE and be consistent with UpdateStatistics. There is no problem with NPE as long as the error message is descriptive.
|
|
||
| private final int id; | ||
|
|
||
| Column(int id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether this enum is actually required. Why not simply define NestedField constants? We need those to create a schema anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added it for two reasons,
a) I didn't want to use hardcoded strings or string constant for each field name, field id while preparing NestedField, hence using the name from enum.
b) Plus I thought enum is cleaner than multiple nested fields constants.
Do you see any drawback for using this enum in terms of readability or performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @aokolnychyi suggests to use constants for the fields instead of the field names.
Checked the codebase, and this is the typical pattern.
See:
- https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/MetadataColumns.java
- https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestEntry.java
- https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/DataFile.java
I agree that we should follow the same pattern.
@ajantha-bhat: Could you please create a PR for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't predefine the field for
Types.NestedField partition = NestedField.required(1, Column.PARTITION.name(), unifiedPartitionType);
because the unifiedPartitionType is a variable.
For other fields, I can define the NestedField, but for this I can't. So, it won't become uniform. I still feel Enum is cleaner. But I can modify if you guys really want it to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. We have similar non-uniform code in DataFile also for this kind.
| static StructType getType(StructType partitionType) { |
I will raise a PR to have it similar way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | ||
|
|
||
| private static PartitionStats recordToPartitionStats(StructLike record) { | ||
| PartitionStats stats = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajantha-bhat, could you confirm whether CloseableIterable<StructLike> returned by the data reader reuses the same object? If so, we don't have to create a new PartitionStats object for every record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CloseableIterable returned by the data reader reuses the same object? If so, we don't have to create a new PartitionStats object for every record
By default it doesn't. While adding the internal readers, I have added the support for reuseContainers too. So, we can enable it.
I am not sure about enabling it here as it is an end user API, if the user want to keep stats for more than one partition at once, like our testcase where we prepare a list from this iterable (instead of consuming one by one), we cannot reuse the container. Hence, I didn't enable it.
|
@deniskuzZ : Hi, Is Hive intersted in the incremental compute of these partition stats instead of reading all the manifests at once? If so, I am planning to add a new API to incrementally compute stats (it checks the last snapshot that had partition stats and compute the difference manifest from that snapshot to current snapshot and compute new stats by reading the old stats + new stats from new manifests) |
Introduce APIs to write the partition stats into files in table default format using Iceberg internal writers and readers.