Conversation

@szehon-ho
Member

@szehon-ho szehon-ho commented Feb 16, 2021

-- Add flag "write.metadata.sorted.metrics.default" which defaults to "full"
-- Fix unrelated bug in configuring nested Orc Metrics
-- Welcome feedback how to avoid propagation of sortOrder through AppenderFactories

@szehon-ho
Member Author

szehon-ho commented Feb 16, 2021

Change Background:

Following a suggestion from @aokolnychyi: we find that users do not tend to fine-tune column metrics. The default truncate(16) does not do much for certain sortable types and reduces the benefits of predicate pruning. The cost of storing full metrics should be small, as the number of sorted columns will generally be small.

@aokolnychyi
Contributor

Thanks for the PR, @szehon-ho! Let me take a look now.

public static final String METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column.";
public static final String DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default";
public static final String DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)";
public static final String SORTED_COL_DEFAULT_METRICS_MODE = "write.metadata.sorted.metrics.default";
Contributor

@aokolnychyi aokolnychyi Feb 17, 2021

Actually, I think we should simplify it a bit. The use case I was talking about is when the user configures the default sort mode as none or counts but creates a table with a sort order. In that case, we should promote the metrics for sort columns to be at least truncate(16) unless the user sets a mode for sort columns explicitly. It is probably too dangerous to promote to full as the values may be too long. I guess truncate(16) is a reasonable default for sort columns and it will apply only to string and binary columns. Longs/integers won't be affected.

Internally, we may want to change the default value to counts instead of truncate(16) for tables with many columns as we have a lot of tables with 100+ columns and we don't want tons of unnecessary metadata. But I am not sure that the community wants to do the same.
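The promotion rule described above could be sketched as a small pure function. This is a hypothetical illustration only; the class, method, and mode names are stand-ins for Iceberg's actual MetricsConfig logic:

```java
import java.util.Map;
import java.util.Set;

public class MetricsPromotionSketch {
  // Modes weaker than truncate(16) for pruning purposes
  static final Set<String> WEAK_MODES = Set.of("none", "counts");

  // Effective metrics mode for a sort column: promote to truncate(16) only
  // when the table default is weaker and the user has not configured the
  // column explicitly.
  static String effectiveMode(String defaultMode, Map<String, String> columnModes, String column) {
    String explicit = columnModes.get(column);
    if (explicit != null) {
      return explicit; // explicit user config always wins
    }
    return WEAK_MODES.contains(defaultMode) ? "truncate(16)" : defaultMode;
  }
}
```

The explicit per-column setting taking precedence over promotion matches the "unless the user sets a mode for sort columns explicitly" condition in the comment above.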


Contributor

I'd also say the promotion should be implicit and we don't need extra table properties.

Member

I think I'm also in favor of just making the change without a flag to configure. Just to keep configuration one parameter smaller: do we really have many use cases where truncate(16) isn't the right thing to do? And if it isn't, we can let users manually reconfigure those columns individually as needed.

Contributor

@szehon-ho, could you update the PR to match the behavior in this comment, please?

Member Author

Yes I'll take a look, thanks for the feedback guys

Member Author

Recap, if I understand correctly: if the default mode is weaker than truncate(16), then sort columns are promoted to truncate(16). And no new table property (the user can override the sort column's metrics property itself).

Member

That’s just my suggestion, I can always be convinced otherwise too 😊

Member Author

No, you have a point: full is unnecessary if someone sorts by a huge text column, for example, and there's already a good way to override the config. Updated the PR and changed the title to reflect this.

Contributor

I agree with this behavior and I also would not add a new table property.

@szehon-ho szehon-ho changed the title Auto promote sorted column metrics to full Auto promote sorted column metrics to truncate(16) Feb 18, 2021
@aokolnychyi
Contributor

Let me do another round now.

@aokolnychyi
Contributor

I see that the current approach tries to promote the sort columns on read, which is one way of doing this. I think I'd prefer the approach we took with validateReferencedColumns that does its validation during table creation and in a few other relevant places. If we do it that way, we don't need to modify any ORC and Parquet code and the scope of the change will be really small.

Basically, we could modify TableMetadata#newTableMetadata as this:

    MetricsConfig metricsConfig = MetricsConfig.fromProperties(properties);
    metricsConfig.validateReferencedColumns(schema);
    MetricsConfig freshMetricsConfig = metricsConfig.copyWithPromotedSortColumns(freshSortOrder);

    Map<String, String> freshProperties = Maps.newHashMap(properties);
    freshProperties.putAll(freshMetricsConfig.toProperties());

That will need two new methods in MetricsConfig (let's not make them public):

  MetricsConfig copyWithPromotedSortColumns(SortOrder sortOrder) {
    // create a copy
    // check the default mode
    // if none or count -> iterate through the sort order fields and promote as needed if not set by the user
  }

  Map<String, String> toProperties() {
    // map into table props
  }

What are your thoughts, @RussellSpitzer @rdblue @holdenk @shardulm94 @pvary?
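A self-contained sketch of what these two helpers might look like, simplified to plain maps and column name strings. The real methods would take Iceberg's SortOrder and return a MetricsConfig; the class name, constants, and simplified signatures here are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetricsConfigSketch {
  static final String DEFAULT_KEY = "write.metadata.metrics.default";
  static final String COLUMN_PREFIX = "write.metadata.metrics.column.";

  // Mirrors the proposed copyWithPromotedSortColumns + toProperties: if the
  // default mode is none/counts, set truncate(16) for each sort column that
  // is not configured explicitly, and return the updated table properties.
  static Map<String, String> promoteSortColumns(Map<String, String> props, List<String> sortColumns) {
    Map<String, String> result = new HashMap<>(props);
    String defaultMode = props.getOrDefault(DEFAULT_KEY, "truncate(16)");
    if (defaultMode.equals("none") || defaultMode.equals("counts")) {
      for (String column : sortColumns) {
        // putIfAbsent keeps any mode the user set explicitly
        result.putIfAbsent(COLUMN_PREFIX + column, "truncate(16)");
      }
    }
    return result;
  }
}
```

One consequence of writing the result back into table properties, discussed further below, is that a later reader cannot tell a promoted setting apart from one the user configured.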

@RussellSpitzer
Member

Would that also catch alters, @aokolnychyi? I don't think that's unreasonable, since it does make sense to keep it off the read path if possible ...

@aokolnychyi
Contributor

We will catch altering sort order separately just like we catch a few other places for validateReferencedColumns.

@rdblue
Contributor

rdblue commented Feb 26, 2021

I don't think I would approach the problem as @aokolnychyi is suggesting. I thought that this would promote when accessing the metrics. So basically MetricsConfig.fromProperties becomes MetricsConfig.forTable so that it can be based on both the properties and the sort order for a table. I think that would be a reasonably small change.

The advantage of that approach is that the sort order doesn't make changes to table properties. Those are always configured by users.

@aokolnychyi
Contributor

aokolnychyi commented Feb 26, 2021

So basically MetricsConfig.fromProperties becomes MetricsConfig.forTable so that it can be based on both the properties and the sort order for a table.

The problem is that we build MetricsConfig in places where we don't have access to the table object. We made a recent change to make BaseTable serializable, but we would need to modify quite a few places to pass it around. Also, I am not sure about a potential performance penalty. We do serialize things like the spec and props, so maybe it won't be a big deal to serialize the complete table object, but someone has to test that. We could pass the sort order around, but even that seemed like major work to redo in all places. Plus, it is easy to miss something. For example, the current change does not cover Flink or our import code in SparkTableUtil, yet all tests are green.

The advantage of that approach is that the sort order doesn't make changes to table properties. Those are always configured by users.

I think having something like toTableProperties on MetricsConfig will make it possible to migrate to column ids instead of names in the future. If we want to use column ids, then modifying table props seems inevitable to me.

@aokolnychyi
Contributor

aokolnychyi commented Feb 26, 2021

But I do get the concern about modifying table properties. However, it looks like there will be a few places where we will need to do that anyway (e.g. schema evolution).

@rdblue
Contributor

rdblue commented Feb 26, 2021

My concern isn't modifying table properties specifically. We need to do that so that we update the properties to reflect the user's intent. If we have a setting for a column, renaming the property to carry it through a column rename is preserving the user's intent. But if we update table properties then we are potentially losing what the user chose to do.

For example, what do we do when the sort field is removed and there is a metrics config property? We can't remove a truncate(16) setting because we don't know whether it was added when the field was added to the sort, or whether it was specifically set by the table owner. Similarly, we have cases where the owner may set a field to counts and then add it to the sort. I think the table owner's config should take precedence, so we don't replace the property; but then how does the owner signal that it should be defaulted instead? Last, how do we handle tables that currently have sort orders? If we base metrics on the current sort order, then everything starts working automatically. If we don't, then we would need to trigger a commit to update the properties.

Even if we don't pass the table around, I think we should definitely pass the sort order in to get the metrics config. It is probably not the time to refactor and pass the table, but we can pass the sort order like we do the other table config.

@aokolnychyi
Contributor

For example, what do we do when the sort field is removed and there is a metrics config property?

Yeah, that's the point: we won't be able to differentiate if we modified the props. I am convinced now.
If we have enough confidence that this feature will be useful, I am OK with modifying all the places where we write.

I took a look at how hard it will be to move to Table, as that would simplify our life in the future. Since BaseTable uses a proxy for serializing, we only serialize FileIO and the metadata location. We will need to read the metadata file back on each executor. In addition, FileIO may contain a full Hadoop conf; that's why we broadcast FileIO and EncryptionManager in Spark. We could consider broadcasting Table instead of FileIO and EncryptionManager. However, we will still have to read the metadata file from each executor. Is that a big deal?

I am fine either way.

@rdblue
Contributor

rdblue commented Feb 26, 2021

It sounds like it would take some work to serialize Table. I think we'd want to update it so that at least one version of TableMetadata is also carried through to avoid that read on every executor. In the long run, that's probably a good thing to do. But for now we can make easier progress serializing things separately.

Eventually, I think we should broadcast the table and FileIO together to take care of this issue for all of the things that might be expensive to serialize.

@szehon-ho
Member Author

szehon-ho commented Feb 27, 2021

Thanks guys for the time and discussion (learning a lot on the side).

If I understand correctly, it looks like metrics promotion on read is the way to go due to its simplicity, especially since alter table then does not have to alter auto-promoted metrics properties. But in this approach we have to manually pass sortOrder in to all the write paths, and I missed it in the new SparkTableUtil and in Flink, which I can add.

@szehon-ho
Member Author

Edit: and for now, as Table is hard to serialize, we can have MetricsConfig.forSortOrder(properties, sortOrder) to make progress, since SortOrder is already serializable.

@szehon-ho
Member Author

Hi @rdblue @aokolnychyi , so I propagated the SortOrder through the various writers as discussed, including the ones I missed before.

I spent some time verifying it by unit tests (TestMergingMetrics gives good coverage for Flink/Generic/Spark AppenderFactory, and TestSparkDataWrite is an end-to-end test on Spark side). When you get a chance, let me know what you think, thanks again.

@szehon-ho szehon-ho changed the title Auto promote sorted column metrics to truncate(16) Core: Auto promote sorted column metrics to truncate(16) Mar 2, 2021
@aokolnychyi
Contributor

Sorry, I was distracted by other things. Let me take a look at this PR tomorrow.

@aokolnychyi
Contributor

I think we have a similar effort in #2214.

@szehon-ho
Member Author

Thanks for looking, @aokolnychyi. Oh yes, what a coincidence. If that one could be merged, then I can rebase and just make the changes to the metrics part.

@szehon-ho
Member Author

Split the ORC nested metric mode fix out into a new PR: #2977, which should make this change even smaller.

@aokolnychyi
Contributor

Thanks, @szehon-ho! Let me take a look now.

Contributor

@aokolnychyi aokolnychyi left a comment

Looks like this one is converging. I did one more pass.

Let's get #2977 in and then rebase this one.

} else {
return sortOrder.fields().stream()
.map(SortField::sourceId)
.map(sid -> sortOrder.schema().findColumnName(sid))
Contributor

I am not sure it is always correct as not all transforms are order preserving.

For example, if we sort by bucket(id, 8), it does not mean the data is sorted by id.
That's why we added Transform#preservesOrder.

I think you have to filter this stream to include only sort fields whose transforms are order preserving.

Contributor

After another look with fresh eyes, I don't think my statement above is accurate. If we sort using an order preserving transform, it means the source columns are somehow sorted (but may not be perfectly sorted).

For example, sorting by month(date), means our dates are sorted across months but not necessarily within each month. Still, it probably makes sense to promote such columns. Identity and truncate transforms are also order preserving.

@RussellSpitzer @jackye1995, thoughts?
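The filtering discussed in this thread could be sketched as follows. This is an illustrative, self-contained stand-in: the SortField record and the string-based preservesOrder check are simplifications of Iceberg's SortField/Transform#preservesOrder API, not the actual implementation:

```java
import java.util.List;
import java.util.stream.Collectors;

public class OrderPreservingSketch {
  // Simplified stand-in for Iceberg's SortField: a source column plus the
  // name of the transform applied to it in the sort order.
  record SortField(String column, String transform) {}

  // Stand-in for Transform#preservesOrder: bucket hashes values, so source
  // column order is lost; identity, truncate, and date/time transforms keep
  // at least a coarse ordering of the source values.
  static boolean preservesOrder(String transform) {
    return !transform.startsWith("bucket");
  }

  // Only sort fields with order-preserving transforms are candidates for
  // metrics promotion.
  static List<String> promotableColumns(List<SortField> fields) {
    return fields.stream()
        .filter(f -> preservesOrder(f.transform()))
        .map(SortField::column)
        .collect(Collectors.toList());
  }
}
```

For example, a table sorted by bucket(id, 8), month(ts) would promote metrics only for ts, not for id.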

Member Author

Yes, that makes sense. I wonder a bit about the use case for sorting by bucket(id, 8), but it does not make much sense to auto-promote such columns.

Contributor

The implementation here looks correct to me now.

@aokolnychyi
Contributor

@szehon-ho, the other PR is in. Could you rebase this one?

@szehon-ho
Member Author

Rebased, and still working through some of the review comments. I need to take a closer look at the order-preserving transforms.

@szehon-ho
Member Author

@aokolnychyi, the comments should all be addressed now, whenever you have time for another look. Thanks again.

.asc("stringCol")
.asc("dateCol").build();
PartitionSpec spec = PartitionSpec.unpartitioned();
Table table = TestTables.create(tableDir, "test", SIMPLE_SCHEMA, spec, sortOrder, FORMAT_V2);
Contributor

Why does this test create a v2 table? Won't this work with v1 as well?

Member Author

Originally I wanted to avoid increasing the test runtime too much, as metrics seemed pretty independent of formatVersion.

I changed the tests to be parameterized to run with both v1 and v2; let me know if you prefer the original.

Contributor

@rdblue rdblue left a comment

I left a couple comments, but overall I think this looks ready. @aokolnychyi can you take another look?

Contributor

@karuppayya karuppayya left a comment

Left a few minor comments. Thanks, @szehon-ho, for working on this.

public static Object[][] parameters() {
return new Object[][] {
{FileFormat.ORC},
{FileFormat.PARQUET}
Contributor

Should we include avro too?

Member Author

Avro returns null metrics, so it is usually skipped in metrics tests.
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/AvroMetrics.java


@Before
public void setupTable() throws Exception {
this.tableDir = temp.newFolder();
Contributor

nit: tableDir can be local to this method

Member Author

Done

@szehon-ho
Member Author

@rdblue and @karuppayya, thanks for taking a look! Addressed the comments; I hope I understood them correctly. For the original format V2 hardcoding in the test, I parameterized the new tests with v1 and v2; let me know if that's not what we want to do.

@aokolnychyi
Contributor

Sorry, I was off last week. I'll take a look now.

Contributor

@aokolnychyi aokolnychyi left a comment

Looks solid to me too. I think we need to use the new method in WriteBuilder in Avro.
I'd also slightly tweak the javadoc.

withSpec(table.spec());
setAll(table.properties());
metricsConfig(MetricsConfig.fromProperties(table.properties()));
metricsConfig(MetricsConfig.forTable(table));
Contributor

Do we need to do the same in Avro WriteBuilder too?
I don't think we use that method right now but should make sense for consistency.
We already handle that for Parquet.

Member Author

Done

}
}

/**
Contributor

Can we add a sentence to each method to describe what they do?

Creates a metrics config from table properties.
Creates a metrics config for a table.

Contributor

Super nit: Also, case-sensitivity in param docs seems inconsistent. Can we fix that? I think we usually use lower case letters unless it is a name.

Member Author

Done, and fixed the case sensitivity. I also removed the return type documentation, as it's now a bit redundant with this javadoc; let me know if you prefer to keep it.

}

// First set sorted column with sorted column default (can be overridden by user)
MetricsMode sortedColDefaultMode = sortedColumnDefaultMode(spec.defaultMode);
Contributor

Looks clean!

} else {
return sortOrder.fields().stream()
.map(SortField::sourceId)
.map(sid -> sortOrder.schema().findColumnName(sid))
Contributor

The implementation here looks correct to me now.

Szehon Ho and others added 11 commits September 14, 2021 14:20
-- Fix unrelated bug in configuring nested Orc Metrics
-- Welcome feedback how to avoid propagation of sortOrder through AppenderFactories
-- Promote sort columns if default is counts || none to truncate(16) if columns are not explicitly configured.
-- Extend to rest of writers
-- Add more unit tests for these writers
@aokolnychyi
Contributor

Looks great to me. Thanks for the work, @szehon-ho! I'll merge when tests pass.
