Core: Auto promote sorted column metrics to truncate(16) #2240
Change Background: A suggestion from @aokolnychyi: we find that users do not tend to fine-tune column metrics. The default truncate(16) does not do much for certain sortable types and reduces the benefits of predicate pruning. The cost of storing full metrics should be small, as the number of sorted columns will generally be small.
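For context, a column's metrics mode is resolved from table properties, falling back to the table-wide default when no per-column setting exists. A minimal sketch of that lookup (illustrative only, not Iceberg's MetricsConfig implementation; the property names are the ones from TableProperties in this PR):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not Iceberg's code): how a per-column metrics mode
// falls back to the table-wide default, which itself defaults to truncate(16).
public class MetricsModeLookup {
    static final String COLUMN_PREFIX = "write.metadata.metrics.column.";
    static final String DEFAULT_KEY = "write.metadata.metrics.default";
    static final String DEFAULT_MODE = "truncate(16)";

    public static String modeFor(Map<String, String> props, String column) {
        String explicit = props.get(COLUMN_PREFIX + column);
        if (explicit != null) {
            return explicit; // per-column setting wins
        }
        return props.getOrDefault(DEFAULT_KEY, DEFAULT_MODE);
    }
}
```

With this fallback, a user who sets only the table default to counts silently loses bound metrics for sortable columns, which is exactly the situation this PR addresses.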
Thanks for the PR, @szehon-ho! Let me take a look now.
public static final String METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column.";
public static final String DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default";
public static final String DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)";
public static final String SORTED_COL_DEFAULT_METRICS_MODE = "write.metadata.sorted.metrics.default";
Actually, I think we should simplify it a bit. The use case I was talking about is when the user configures the default sort mode as none or counts but creates a table with a sort order. In that case, we should promote the metrics for sort columns to be at least truncate(16) unless the user sets a mode for sort columns explicitly. It is probably too dangerous to promote to full as the values may be too long. I guess truncate(16) is a reasonable default for sort columns and it will apply only to string and binary columns. Longs/integers won't be affected.
Internally, we may want to change the default value to counts instead of truncate(16) for tables with many columns as we have a lot of tables with 100+ columns and we don't want tons of unnecessary metadata. But I am not sure that the community wants to do the same.
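The rule proposed above can be sketched in a few lines (assumed semantics based on this discussion, not the merged Iceberg code): an explicit per-column setting always wins, and only weak defaults (none, counts) are promoted for sort columns, never all the way to full.

```java
import java.util.Set;

// Sketch of the promotion rule discussed in this thread: sort columns with
// no explicit per-column mode are bumped to at least truncate(16) when the
// table default is weaker (none or counts).
public class SortColumnPromotion {
    private static final Set<String> WEAKER_THAN_TRUNCATE = Set.of("none", "counts");

    public static String effectiveMode(String defaultMode, boolean isSortColumn, String explicitMode) {
        if (explicitMode != null) {
            return explicitMode; // user intent always wins
        }
        if (isSortColumn && WEAKER_THAN_TRUNCATE.contains(defaultMode)) {
            return "truncate(16)"; // promote weak defaults, but never to full
        }
        return defaultMode;
    }
}
```

Note that promotion only changes behavior for string and binary columns, since truncate does nothing to fixed-width types like longs and integers.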
Thoughts, @szehon-ho @RussellSpitzer @rdblue?
I'd also say the promotion should be implicit and we don't need extra table properties.
I think I'm also in favor of just making the change without a flag to configure. Just to keep configuration one parameter smaller: do we really have many use cases where truncate(16) isn't the right thing to do? And if it isn't, we can let users manually reconfigure those columns individually if they need.
@szehon-ho, could you update the PR to match the behavior in this comment, please?
Yes, I'll take a look. Thanks for the feedback, guys.
Recap, if I understand correctly: if the default is weaker than truncate(16), then sort columns are promoted to truncate(16). And no new table property (the user can override the sort column's metric property itself).
That’s just my suggestion, I can always be convinced otherwise too 😊
No, you guys have a point: full is unnecessary if someone sorts by a huge text column, for example, and there's already a good way to override the config. Updated the PR and changed the title to reflect this.
I agree with this behavior and I also would not add a new table property.
Let me do another round now.
I see that the current approach tries to promote the sort columns on read, which is one way of doing this. I think I'd prefer the approach we took with […]. Basically, we could modify […]. That will need two new methods in […]. What are your thoughts, @RussellSpitzer @rdblue @holdenk @shardulm94 @pvary?
Would that also catch alters, @aokolnychyi? I don't think that's unreasonable, since it does make sense to keep it off the read path if possible...
We will catch altering the sort order separately, just like we catch a few other places for […].
I don't think I would approach the problem as @aokolnychyi is suggesting. I thought that this would promote when accessing the metrics. So basically […]. The advantage of that approach is that the sort order doesn't make changes to table properties. Those are always configured by users.

The problem is that we build […]. I think having something like […].
But I do get the concern about modifying table properties. However, it looks like there will be a few places where we will need to do that anyway (e.g. schema evolution).
My concern isn't modifying table properties specifically. We need to do that so that we update the properties to reflect the user's intent. If we have a setting for a column, renaming the property to carry it through a column rename is preserving the user's intent. But if we update table properties, then we are potentially losing what the user chose to do. For example, what do we do when the sort field is removed and there is a metrics config property? We can't remove a […]. Even if we don't pass the table around, I think we should definitely pass the sort order in to get the metrics config. It is probably not the time to refactor and pass the table, but we can pass the sort order like we do the other table config.
Yeah, that's the point: we won't be able to differentiate if we modified the props. I am convinced now. I took a look at how hard it will be to move to […]. I am fine either way.
It sounds like it would take some work to serialize Table. I think we'd want to update it so that at least one version of TableMetadata is also carried through, to avoid that read on every executor. In the long run, that's probably a good thing to do. But for now we can make easier progress serializing things separately. Eventually, I think we should broadcast the table and FileIO together to take care of this issue for all of the things that might be expensive to serialize.
Thanks guys for the time and discussion (learning a lot on the side). If I understand correctly, it looks like metrics promotion on read is the way to go due to its simplicity, especially with regard to ALTER TABLE not having to alter auto-promoted metrics properties. But in this approach we have to manually pass in sortOrder in all the places we write, and I missed it in the new SparkTableUtil and Flink, which I can add.
Edit: and for now, as Table is hard to serialize, we can have MetricsConfig.forSortOrder(properties, sortOrder) to make progress, since SortOrder is already serializable.
Hi @rdblue @aokolnychyi, so I propagated the SortOrder through the various writers as discussed, including the ones I missed before. I spent some time verifying it with unit tests (TestMergingMetrics gives good coverage for the Flink/Generic/Spark AppenderFactory, and TestSparkDataWrite is an end-to-end test on the Spark side). When you get a chance, let me know what you think. Thanks again.
Sorry, I was distracted by other things. Let me take a look at this PR tomorrow.
I think we have a similar effort in #2214.
Thanks for looking, @aokolnychyi. What a coincidence! If that could be merged, then I can rebase and just make the changes to the metrics part.
Split out the ORC nested metric mode fix into a new PR: #2977, which should make this change even smaller.
Thanks, @szehon-ho! Let me take a look now.
aokolnychyi left a comment:
Looks like this one is converging. I did one more pass.
Let's get #2977 in and then rebase this one.
} else {
  return sortOrder.fields().stream()
      .map(SortField::sourceId)
      .map(sid -> sortOrder.schema().findColumnName(sid))
I am not sure it is always correct as not all transforms are order preserving.
For example, if we sort by bucket(id, 8), it does not mean the data is sorted by id.
That's why we added Transform#preservesOrder.
I think you have to filter this stream to include only sort fields whose transforms are order preserving.
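The suggested filter can be sketched with plain collections (the Field type and the id-to-name map below are hypothetical simplifications of Iceberg's SortField and schema lookup, not the real API): only sort fields whose transform preserves order contribute columns for promotion.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the suggested fix: keep only sort fields whose
// transform preserves order, then resolve their source column names.
public class SortedColumns {
    public static final class Field {
        final int sourceId;
        final boolean preservesOrder; // stand-in for Transform#preservesOrder

        public Field(int sourceId, boolean preservesOrder) {
            this.sourceId = sourceId;
            this.preservesOrder = preservesOrder;
        }
    }

    public static List<String> promotable(List<Field> fields, Map<Integer, String> namesById) {
        return fields.stream()
                .filter(f -> f.preservesOrder)   // drops bucket-style transforms
                .map(f -> namesById.get(f.sourceId))
                .collect(Collectors.toList());
    }
}
```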
After another look with fresh eyes, I don't think my statement above is accurate. If we sort using an order-preserving transform, it means the source columns are somehow sorted (though maybe not perfectly sorted).
For example, sorting by month(date) means our dates are sorted across months but not necessarily within each month. Still, it probably makes sense to promote such columns. Identity and truncate transforms are also order preserving.
@RussellSpitzer @jackye1995, thoughts?
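The distinction can be demonstrated with toy integer stand-ins for the transforms (simplified versions, not Iceberg's Transform implementations): stably sorting rows by an order-preserving transform such as truncate leaves the source column sorted, while a bucket-style transform scrambles it, so min/max bounds on the source column become much less useful for pruning.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Toy demonstration of why Transform#preservesOrder matters for metrics
// promotion (simplified stand-ins, not Iceberg's transforms).
public class OrderPreservation {
    // Stand-in for truncate(W) on integers: order preserving.
    public static int truncate(int v, int width) {
        return v - Math.floorMod(v, width);
    }

    // Stand-in for bucket(N): a modulo "hash", NOT order preserving.
    public static int bucket(int v, int n) {
        return Math.floorMod(v, n);
    }

    // Stable-sort ids by the transform value, then check whether the
    // source values are still in ascending order.
    public static boolean sourceStaysSorted(List<Integer> ids, IntUnaryOperator transform) {
        List<Integer> sorted = new ArrayList<>(ids);
        sorted.sort(Comparator.comparingInt(transform::applyAsInt));
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i - 1) > sorted.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```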
Yeah, it makes sense. I wonder a bit about the use case for sorting by bucket(id, 8), but it does not make much sense to auto-promote such columns.
The implementation here looks correct to me now.
@szehon-ho, the other PR is in. Could you rebase this one?
Rebased, and still working through some of the review comments. I need to take a closer look at the order-preserving transform.
@aokolnychyi, the comments should be addressed now, whenever you have time for another look. Thanks again!
.asc("stringCol")
.asc("dateCol").build();
PartitionSpec spec = PartitionSpec.unpartitioned();
Table table = TestTables.create(tableDir, "test", SIMPLE_SCHEMA, spec, sortOrder, FORMAT_V2);
Why does this test create a v2 table? Won't this work with v1 as well?
Originally I wanted to avoid increasing the test runtime too much, as metrics seemed pretty independent of formatVersion.
Changed to make the tests parameterized to run with v1 and v2; let me know if you prefer the original.
rdblue left a comment:
I left a couple comments, but overall I think this looks ready. @aokolnychyi can you take another look?
karuppayya left a comment:
Left a few minor comments. Thanks, @szehon-ho, for working on this.
public static Object[][] parameters() {
  return new Object[][] {
      {FileFormat.ORC},
      {FileFormat.PARQUET}
Should we include avro too?
Avro returns null metrics, so it is usually skipped in metrics tests.
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/AvroMetrics.java
@Before
public void setupTable() throws Exception {
  this.tableDir = temp.newFolder();
nit: tableDir can be local to this method
Done
@rdblue and @karuppayya, thanks for taking a look! Addressed the comments; I hope I understood them correctly. For the original format V2 hardcoding in the test, I parameterized the new tests with v1 and v2, let me know if that's not what we want to do.
Sorry, I was off last week. I'll take a look now.
aokolnychyi left a comment:
Looks solid to me too. I think we need to use the new method in WriteBuilder in Avro.
I'd also slightly tweak the javadoc.
withSpec(table.spec());
setAll(table.properties());
metricsConfig(MetricsConfig.fromProperties(table.properties()));
metricsConfig(MetricsConfig.forTable(table));
Do we need to do the same in the Avro WriteBuilder too?
I don't think we use that method right now, but it should make sense for consistency.
We already handle that for Parquet.
Done
}
}

/**
Can we add a sentence to each method to describe what they do?
Creates a metrics config from table properties.
Creates a metrics config for a table.
Super nit: also, the case sensitivity in the param docs seems inconsistent. Can we fix that? I think we usually use lower-case letters unless it is a name.
Done, and fixed the case sensitivity. I also removed the return type documentation as it's now a bit redundant with this javadoc; let me know if you prefer to keep it.
}

// First set sorted column with sorted column default (can be overridden by user)
MetricsMode sortedColDefaultMode = sortedColumnDefaultMode(spec.defaultMode);
Looks clean!
} else {
  return sortOrder.fields().stream()
      .map(SortField::sourceId)
      .map(sid -> sortOrder.schema().findColumnName(sid))
The implementation here looks correct to me now.
-- Fix unrelated bug in configuring nested Orc Metrics
-- Welcome feedback on how to avoid propagation of sortOrder through AppenderFactories
-- Promote sort columns to truncate(16) if default is counts || none and columns are not explicitly configured
-- Extend to rest of writers
-- Add more unit tests for these writers
Looks great to me. Thanks for the work, @szehon-ho! I'll merge when tests pass.
-- Add flag "write.metadata.sorted.metrics.default" which defaults to "full"