-
Notifications
You must be signed in to change notification settings - Fork 3k
Disable metrics for very wide tables to reduce metadata size #3959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f815584 to
c4145dd
Compare
d1d8686 to
7631ee8
Compare
|
|
||
| public static final String METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column."; | ||
| public static final String DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default"; | ||
| public static final String DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changing a public API and to a new type. May need to reconsider this. Would probably be safer to just add the new variable below with a slightly different name?
Although if other folks don't mind, I think leaving this as is is also fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, changed it back, it made the diff smaller
| finalDefaultMode = MetricsModes.fromString(defaultModeProp); | ||
| } catch (IllegalArgumentException err) { | ||
| // User override was invalid, log the error and use the default | ||
| LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeProp, err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we had this as a log warn before, seems like this should just fail ... but I don't think we need to change the behavior yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be mistaken, but I think it was the same behavior, warn log is still there. but in any case after changing the default from enum back to a string the diff now looks less. Now in this method only the original variable name "defaultMode" should be changed to avoid conflict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I just don't know why we originally had this behavior ... seems odd to me 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here was to not fail writes just because a user messed up a tangentially related config property. I think it's better to allow pipelines to continue.
|
@danielcweeks When we discussed this were you thinking about no metrics for tables with more than 100 columns? or no metrics for anything other than the first 100 columns. Obviously in both cases we keep sort column metrics by default |
|
Hi @rdblue @danielcweeks , any thought on this? Thanks in advance |
@RussellSpitzer I would think it would be either all or nothing. I'm not sure we can speculate that the first 100 columns are somehow more meaningful than those that follow (though I guess you could make that argument). For now it seems like just taking a simple approach would be better. There are some edge cases like what happens in the event that someone is incrementally adding columns and it crosses that threshold, but I wouldn't worry about it too much at this point. |
| public static MetricsConfig forTable(Table table) { | ||
| return from(table.properties(), table.sortOrder()); | ||
| String defaultMode; | ||
| if (table.schema().columns().size() <= MAX_COLUMNS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to override this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I can make a table config , does that sound reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"defaultMetricCollection : all" or something? Don't we already have that?
Also this feels like it should probably be a catalog level prop 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was initially thinking like 'write.metadata.metrics.max.columns'='100'? Yea good point maybe catalog level prop is cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took another look, and actually it's a bit messy to make this a catalog property in the current code (the Table is serialized and sent to writers, but Catalog is not..).
Was thinking it makes sense as a table property as well (ie, if i know this table is worth optimizing, set the max columns to be higher). May be a bit awkward to have a different catalog just for this. And if user really wants to have one global setting, this change is coming: #4011
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay for now. Tables will still respect whatever is set as write.metadata.metrics.default so this really just changes Iceberg's default in a reasonable way. It is also good to note that metrics for sort columns are automatically promoted to at least truncate[16] so it isn't as though we're losing all stats.
|
I am not sure how useful a table property is or whether we should even make it configurable. Specifically, I am worried about adding a rarely used table property. We are usually pretty conservative when it comes to adding new table properties. Is there a good use case for this to be configurable? It seems that if anyone needs to tune the metrics config, the existing framework is already good enough and allows us to set default and per-column modes. I think we should encourage users to set that. Also, I think 100 is too big. I'd prefer it to be around 24-32. I've seen tables with 100 string columns and the metadata size already had an impact on the performance. Any thoughts on that? Keep in mind we always collect stats for sort columns and will collect stats for identity columns in the future. |
f953922 to
4d40cd6
Compare
4d40cd6 to
181cddc
Compare
|
Hi @aokolnychyi @RussellSpitzer @rdblue @danielcweeks if you guys want another look after the last comments, thanks |
| * @param defaultMode default, if not set by user property | ||
| * @return metrics configuration | ||
| */ | ||
| private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of renaming the variable that is used everywhere, I'd just rename the incoming argument. There would be fewer changes if this used defaultDefaultMode or something.
| } else { | ||
| defaultMode = MetricsModes.None.get().toString(); | ||
| } | ||
| return from(table.properties(), table.sortOrder(), defaultMode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing newline after control flow block and before return.
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is about ready, but there is still some active discussion about overriding the number of columns. I'd probably opt to ignore this for now because users still have final control through write.metadata.metrics.default.
|
Merging this. Thanks, @szehon-ho and everyone for the reviews! |
|
Thanks @rdblue for merge and all the reviewers! Sorry didn't see the last style comments before merge, i can prepare another pr for those |
|
@szehon-ho, don't worry about it. I didn't want to keep this outstanding for minor things like that. |
…ize (apache#3959) Reference: (cherry picked from commit 67f4716)
Giving a try to the suggestion in the last community sync to address very-wide-table.
This changes the default MetricsMode from "truncate(16)" to "none" if there are over 100 columns (should it be counts?). This is overriden if the user explicitly sets a default, and of course if a user explicitly sets metrics on a particular column.