[SPARK-28948][SQL] Support passing all Table metadata in TableProvider #26750
Conversation
This is preferred over #26297, because …

retest this please

Hi, @cloud-fan. Could you fix the following two …

Test build #114857 has finished for PR 26750 at commit …

Test build #114889 has finished for PR 26750 at commit …
    * schema.
    */
    Table getTable(CaseInsensitiveStringMap options);
    Table getTable(StructType schema, Map<String, String> properties);
So, the main idea of this PR is to remove the case-insensitive requirements from the original java TableProvider DSv2 design?
cc @dbtsai and @aokolnychyi for Iceberg.
No, the main idea is to add a new overload method to accept user-specified partitioning. Since we need to change the API, we change the option type as well, see https://github.com/apache/spark/pull/26750/files#r354692676
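For readers skimming the thread, here is a rough Scala sketch of the overload set being discussed. The actual change is to the Java TableProvider interface; the trait and parameter names below are illustrative only.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

// Illustrative sketch only: mirrors the shape of the Java overloads discussed in this PR.
trait TableProviderSketch {
  // Infer schema and partitioning from the (case-preserving) table properties.
  def getTable(properties: util.Map[String, String]): Table

  // Schema is supplied (user-specified or from a metastore); infer partitioning.
  def getTable(schema: StructType, properties: util.Map[String, String]): Table

  // New overload: both schema and partitioning are supplied along with properties.
  def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table
}
```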
    * Return a {@link Table} instance with specified table properties to do read/write.
    * Implementations should infer the table schema and partitioning.
    *
    * @param properties The specified table properties. It's case preserving (contains exactly what
I did miss most of the discussions unfortunately. Would this be used as table properties directly? Also include read/write options? Or are the read/write options going to translate to a catalog and identifier as we had discussed some time ago?
I guess a path-based table would have a location table property, which constitutes the old option "path", correct?
Yes, it's used as table properties directly, not read/write options. The read/write options are case-insensitive and passed through Table.newScanBuilder/Table.newWriteBuilder.
Table properties don't have to be case-insensitive, so here I just define them as case-preserving. The implementation is free to interpret them case-sensitively or case-insensitively (Spark can't control that anyway).
If you read a table with DataFrameReader, the options will be passed to the data source twice: once with TableProvider.getTable, once with Table.newScanBuilder.
If you create the table first with CREATE TABLE ... USING v2Provider TBLPROPERTIES ..., and then read it with DataFrameReader.option(...).table(...), then the table properties and the read options are different.
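A hedged illustration of the two flows described above; the provider class name and the option keys are placeholders, not real sources.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Flow 1: DataFrameReader only. The same options end up at the source twice:
// as case-preserving table properties via TableProvider.getTable, and again as
// case-insensitive options when the scan is built.
val df1 = spark.read
  .format("com.example.MyV2Provider") // placeholder TableProvider implementation
  .option("path", "/tmp/events")
  .load()

// Flow 2: the table is created first, so TBLPROPERTIES become the table
// properties, while .option(...) values are only read options for this scan.
spark.sql(
  "CREATE TABLE events USING com.example.MyV2Provider " +
    "TBLPROPERTIES ('path' = '/tmp/events')")
val df2 = spark.read.option("someReadOption", "value").table("events")
```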
A path-based table would have a location table property, but I don't know why we can't reuse the old name "path". cc @rdblue
brkyvz left a comment:
I like this a lot. After this, we're a single step away from being able to create/replace v2 tables through the "save" API. I'd love others to also weigh in, since I've missed most V2 discussions in the last month.
One concern I have is the mismatch between path and location as data source options and Hive metastore table properties.
    // A simple version of `TableProvider` which doesn't support specified table schema/partitioning
    // and treats table properties case-insensitively. This is private and only used in builtin sources.
    private[sql] trait SimpleTableProvider extends TableProvider {
If this is private and only used by built-in sources, it should not be in the catalyst connector package. That is the public API package, and this will appear public outside of Scala.
I think this should be in org.apache.spark.sql.execution.datasources.v2.
    override def getTable(properties: util.Map[String, String]): Table = {
      getTable(new CaseInsensitiveStringMap(properties))
    }
    override def getTable(schema: StructType, properties: util.Map[String, String]): Table = {
Nit: These methods should have a newline between them.
    - private [connector] trait SessionCatalogTest[T <: Table, Catalog <: TestV2SessionCatalogBase[T]]
    + private[connector] trait SessionCatalogTest[T <: Table, Catalog <: TestV2SessionCatalogBase[T]]
This isn't a necessary change. If you were already changing this line it would be fine, but as it is this change can cause conflicts and should probably be reverted.
I find this approach a little awkward because it mixes really different use cases into the same API. One is where you have a metastore as the source of truth for schema and partitioning, and the other is where the implementation is the source of truth. This leads to strange requirements, like throwing an exception.

That's why I liked the approach of moving the schema and partitioning inference outside of this API. That way, Spark is responsible for determining things like whether schemas "match" and can use more context to make a reasonable choice.

Why abandon the other approach? I thought that we were making progress and that the primary blocker was trying to do too much to be reviewed in a single PR.
We can also do that too, e.g. first call …

Even if we have a separated method …

My main point is, this PR is a natural extension of the existing API: if the … If we all agree that the existing API is wrong (the way we accept user-specified schema), then this PR should be rejected as it extends a wrong API. But this seems not the case here.
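For context, here is a hypothetical Scala sketch of the "move inference outside getTable" alternative referenced above, where Spark queries the source for schema/partitioning and then hands back whatever it decided to use. The trait and method names are illustrative, not the API in this PR.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical alternative design: inference is a separate step, so Spark (not
// the source) decides whether an inferred schema matches a user-specified or
// metastore schema before asking for the Table.
trait InferenceBasedProviderSketch {
  def inferSchema(options: CaseInsensitiveStringMap): StructType
  def inferPartitioning(options: CaseInsensitiveStringMap): Array[Transform]

  // Spark passes back the schema/partitioning it settled on, plus the properties.
  def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table
}
```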
    throw new UnsupportedOperationException(
      this.getClass().getSimpleName() + " source does not support user-specified schema");
    }
    Table getTable(
I am a bit curious about the parameter order in these 3 methods:

    getTable(properties)
    getTable(schema, properties)
    getTable(schema, partitioning, properties)

Is it on purpose? Why not:

    getTable(properties)
    getTable(properties, schema)
    getTable(properties, schema, partitioning)
I just follow the order in TableCatalog.createTable.
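For reference, a Scala rendering (as a sketch, omitting the declared exceptions) of the TableCatalog.createTable signature whose ordering is being followed here: schema, then partitioning, then properties, after the table identifier.

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Identifier, Table}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

// Sketch of the existing createTable parameter order that getTable mirrors.
trait CreateTableOrderSketch {
  def createTable(
      ident: Identifier,
      schema: StructType,
      partitions: Array[Transform],
      properties: util.Map[String, String]): Table
}
```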
Do you mind changing the parameter order? It is a bit weird.
Besides, previously the parameter order was:

    getTable(options)
    getTable(options, schema)
I don't have a strong preference, but it's probably better to be consistent with createTable?
Well, I think consistency in the trait TableProvider itself is more important.
Why is this not consistent?

    getTable(properties)
    getTable(schema, properties)
    getTable(schema, partitioning, properties)
I think this is the common practice. The three methods look neater when each parameter is in a fixed position:

    getTable(properties)
    getTable(properties, schema)
    getTable(properties, schema, partitioning)
What changes were proposed in this pull request?

The TableProvider only accepts table schema and properties. It should accept table partitioning as well. This is extracted from #25651, to only keep the API changes and make the diff smaller.
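A hedged example of the scenario this enables (the provider class name is a placeholder): once a v2 table is created with partitioning in the Hive metastore, Spark needs a way to hand the stored schema and partitioning back to the source when the table is read.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// The schema and partitioning are recorded in the metastore at CREATE TABLE time.
// "com.example.MyV2Provider" stands in for any TableProvider implementation.
spark.sql("""
  CREATE TABLE events (ts TIMESTAMP, region STRING, payload STRING)
  USING com.example.MyV2Provider
  PARTITIONED BY (region)
  TBLPROPERTIES ('owner' = 'data-team')
""")

// When the table is later resolved, Spark can pass the stored schema,
// partitioning, and properties back via the new getTable overload.
spark.table("events").show()
```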
Why are the changes needed?

Although DataFrameReader/DataStreamReader don't support user-specified partitioning, we need to pass the table partitioning when getting tables from TableProvider if we store tables in the Hive metastore with a v2 provider.

Does this PR introduce any user-facing change?

Not yet.
How was this patch tested?
Existing tests.