Skip to content

Conversation

@szehon-ho
Copy link
Member

Giving a try to the suggestion in the last community sync to address very-wide-table.

This changes the default MetricsMode from "truncate(16)" to "none" if there are over 100 columns (should it be counts?). This is overriden if the user explicitly sets a default, and of course if a user explicitly sets metrics on a particular column.


public static final String METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column.";
public static final String DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default";
public static final String DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)";
Copy link
Member

@RussellSpitzer RussellSpitzer Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing a public API and to a new type. May need to reconsider this. Would probably be safer to just add the new variable below with a slightly different name?

Although if other folks don't mind, I think leaving this as is is also fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, changed it back, it made the diff smaller

finalDefaultMode = MetricsModes.fromString(defaultModeProp);
} catch (IllegalArgumentException err) {
// User override was invalid, log the error and use the default
LOG.warn("Ignoring invalid default metrics mode: {}", defaultModeProp, err);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we had this as a log warn before, seems like this should just fail ... but I don't think we need to change the behavior yet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be mistaken, but I think it was the same behavior, warn log is still there. but in any case after changing the default from enum back to a string the diff now looks less. Now in this method only the original variable name "defaultMode" should be changed to avoid conflict.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I just don't know why we originally had this behavior ... seems odd to me 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here was to not fail writes just because a user messed up a tangentially related config property. I think it's better to allow pipelines to continue.

@RussellSpitzer
Copy link
Member

@danielcweeks When we discussed this were you thinking about no metrics for tables with more than 100 columns? or no metrics for anything other than the first 100 columns.

Obviously in both cases we keep sort column metrics by default

@szehon-ho
Copy link
Member Author

Hi @rdblue @danielcweeks , any thought on this? Thanks in advance

@danielcweeks
Copy link
Contributor

@danielcweeks When we discussed this were you thinking about no metrics for tables with more than 100 columns? or no metrics for anything other than the first 100 columns.

Obviously in both cases we keep sort column metrics by default

@RussellSpitzer I would think it would be either all or nothing. I'm not sure we can speculate that the first 100 columns are somehow more meaningful than those that follow (though I guess you could make that argument).

For now it seems like just taking a simple approach would be better. There are some edge cases like what happens in the event that someone is incrementally adding columns and it crosses that threshold, but I wouldn't worry about it too much at this point.

public static MetricsConfig forTable(Table table) {
return from(table.properties(), table.sortOrder());
String defaultMode;
if (table.schema().columns().size() <= MAX_COLUMNS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to override this behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I can make a table config , does that sound reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"defaultMetricCollection : all" or something? Don't we already have that?

Also this feels like it should probably be a catalog level prop 🤷

Copy link
Member Author

@szehon-ho szehon-ho Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was initially thinking like 'write.metadata.metrics.max.columns'='100'? Yea good point maybe catalog level prop is cleaner

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took another look, and actually it's a bit messy to make this a catalog property in the current code (the Table is serialized and sent to writers, but Catalog is not..).

Was thinking it makes sense as a table property as well (ie, if i know this table is worth optimizing, set the max columns to be higher). May be a bit awkward to have a different catalog just for this. And if user really wants to have one global setting, this change is coming: #4011

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay for now. Tables will still respect whatever is set as write.metadata.metrics.default so this really just changes Iceberg's default in a reasonable way. It is also good to note that metrics for sort columns are automatically promoted to at least truncate[16] so it isn't as though we're losing all stats.

@aokolnychyi
Copy link
Contributor

I am not sure how useful a table property is or whether we should even make it configurable. Specifically, I am worried about adding a rarely used table property. We are usually pretty conservative when it comes to adding new table properties. Is there a good use case for this to be configurable? It seems that if anyone needs to tune the metrics config, the existing framework is already good enough and allows us to set default and per-column modes. I think we should encourage users to set that.

Also, I think 100 is too big. I'd prefer it to be around 24-32. I've seen tables with 100 string columns and the metadata size already had an impact on the performance. Any thoughts on that? Keep in mind we always collect stats for sort columns and will collect stats for identity columns in the future.

@szehon-ho szehon-ho force-pushed the metrics_demotion_master branch from f953922 to 4d40cd6 Compare February 10, 2022 18:10
@szehon-ho szehon-ho force-pushed the metrics_demotion_master branch from 4d40cd6 to 181cddc Compare February 10, 2022 18:12
@szehon-ho
Copy link
Member Author

Hi @aokolnychyi @RussellSpitzer @rdblue @danielcweeks if you guys want another look after the last comments, thanks

* @param defaultMode default, if not set by user property
* @return metrics configuration
*/
private static MetricsConfig from(Map<String, String> props, SortOrder order, String defaultMode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of renaming the variable that is used everywhere, I'd just rename the incoming argument. There would be fewer changes if this used defaultDefaultMode or something.

} else {
defaultMode = MetricsModes.None.get().toString();
}
return from(table.properties(), table.sortOrder(), defaultMode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing newline after control flow block and before return.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is about ready, but there is still some active discussion about overriding the number of columns. I'd probably opt to ignore this for now because users still have final control through write.metadata.metrics.default.

@rdblue rdblue merged commit 67f4716 into apache:master Feb 14, 2022
@rdblue
Copy link
Contributor

rdblue commented Feb 14, 2022

Merging this. Thanks, @szehon-ho and everyone for the reviews!

@szehon-ho
Copy link
Member Author

Thanks @rdblue for merge and all the reviewers! Sorry didn't see the last style comments before merge, i can prepare another pr for those

@rdblue
Copy link
Contributor

rdblue commented Feb 15, 2022

@szehon-ho, don't worry about it. I didn't want to keep this outstanding for minor things like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants