
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Currently, TableProvider only accepts a table schema and table properties. It should accept table partitioning as well.

This is extracted from #25651, to only keep the API changes and make the diff smaller.
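
For context, a rough sketch of the overloads under discussion, rendered here as a Scala trait (the real TableProvider interface is Java, and the partitioning type is assumed to be the DSv2 Transform array used by TableCatalog.createTable):

import java.util
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

// Illustrative trait mirroring the proposed TableProvider overloads.
trait ProposedTableProvider {
  // Infer both schema and partitioning from the table properties alone.
  def getTable(properties: util.Map[String, String]): Table

  // Use the given schema; infer the partitioning.
  def getTable(schema: StructType, properties: util.Map[String, String]): Table

  // New in this PR: also accept the table partitioning, e.g. as read back
  // from the Hive metastore for a table stored with a v2 provider.
  def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table
}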

Why are the changes needed?

Although DataFrameReader/DataStreamReader don't support user-specified partitioning, we still need to pass the table partitioning when getting tables from TableProvider if we store tables in the Hive metastore with a v2 provider.

Does this PR introduce any user-facing change?

not yet.

How was this patch tested?

existing tests

@cloud-fan
Contributor Author

This is preferred over #26297, because

  1. This follows the existing API style, so the diff is much smaller.
  2. It's hard to decouple schema and partition inference. For example, file sources need to infer partitioning before reporting their schema, because partition columns are part of the table schema.
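
To illustrate point 2: for a file source, the partition columns discovered from the directory layout are part of the reported schema, so partition inference cannot be deferred until after schema inference. A minimal sketch (the path and column names are made up; assumes a SparkSession named spark):

// Hypothetical partitioned layout on disk:
//   /data/events/date=2019-12-01/part-00000.parquet
//   /data/events/date=2019-12-02/part-00000.parquet
val df = spark.read.parquet("/data/events")

// The inferred schema contains the data columns plus the partition column
// `date`, so the source must run partition discovery before it can report
// its schema at all.
df.printSchema()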

@cloud-fan
Contributor Author

retest this please

@dongjoon-hyun
Member

Hi, @cloud-fan. Could you fix the following two lint-java errors detected by GitHub Actions? Our Jenkins PR builder seems to ignore them.

[ERROR] src/test/java/test/org/apache/spark/sql/connector/JavaSimpleDataSourceV2.java:[43] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/connector/JavaSchemaRequiredDataSource.java:[73] (sizes) LineLength: Line is longer than 100 characters (found 102).

@SparkQA

SparkQA commented Dec 4, 2019

Test build #114857 has finished for PR 26750 at commit 9fe392b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

@apache deleted a comment from SparkQA Dec 5, 2019
@apache deleted a comment from SparkQA Dec 5, 2019
@apache deleted a comment from SparkQA Dec 5, 2019
@cloud-fan
Contributor Author

cc @rdblue @brkyvz @gengliangwang

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114889 has finished for PR 26750 at commit 7fcee0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

* schema.
*/
Table getTable(CaseInsensitiveStringMap options);
Table getTable(StructType schema, Map<String, String> properties);
Member

So, the main idea of this PR is to remove the case-insensitive requirements from the original java TableProvider DSv2 design?

Member

cc @dbtsai and @aokolnychyi for Iceberg.

Contributor Author

No, the main idea is to add a new overloaded method that accepts user-specified partitioning. Since we need to change the API anyway, we change the options type as well; see https://github.com/apache/spark/pull/26750/files#r354692676
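
In other words (a minimal sketch, not part of this PR): the new overloads take a plain, case-preserving java.util.Map, and an implementation that still wants case-insensitive lookups can wrap it itself, which is what the SimpleTableProvider shim added for the built-in sources does.

import java.util
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical helper inside a TableProvider implementation: the properties
// map is case-preserving, but wrapping it restores case-insensitive lookup.
def lookupPath(properties: util.Map[String, String]): Option[String] = {
  val options = new CaseInsensitiveStringMap(properties)
  Option(options.get("path")) // matches "path", "PATH", "Path", ...
}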

* Return a {@link Table} instance with specified table properties to do read/write.
* Implementations should infer the table schema and partitioning.
*
* @param properties The specified table properties. It's case preserving (contains exactly what
Contributor

I unfortunately missed most of the discussions. Would this be used as table properties directly? Would it also include read/write options? Or are the read/write options going to be translated to a catalog and identifier, as we discussed some time ago?

I guess a path-based table would have a location table property, which corresponds to the old option "path", correct?

Contributor Author
@cloud-fan Dec 6, 2019

Yes, it's used as table properties directly, not as read/write options. The read/write options are case-insensitive and are passed through Table.newScanBuilder/Table.newWriteBuilder.

Table properties don't have to be case-insensitive, so here I just define them as case-preserving. The implementation is free to interpret them case-sensitively or case-insensitively (Spark can't control that anyway).

If you read a table with DataFrameReader, the options are passed to the data source twice: once via TableProvider.getTable and once via Table.newScanBuilder.

If you create the table first with CREATE TABLE ... USING v2Provider TBLPROPERTIES ... and then read it with DataFrameReader.option(...).table, then the table properties and the read options are different.
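
A rough illustration of the two flows described above (the format name v2Provider and the option/property names are made up; assumes a SparkSession named spark):

// Flow 1: DataFrameReader without a metastore table. The same options end up
// as the table properties passed to TableProvider.getTable and as the read
// options passed to Table.newScanBuilder.
val df1 = spark.read
  .format("v2Provider")
  .option("path", "/data/t")
  .load()

// Flow 2: the table is created first, so the table properties (from CREATE
// TABLE) and the per-read options (from DataFrameReader.option) are
// different maps.
spark.sql("CREATE TABLE t USING v2Provider TBLPROPERTIES ('someProp' = 'x')")
val df2 = spark.read.option("readOpt", "y").table("t")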

Contributor Author

A path-based table would have a location table property, but I don't know why we can't reuse the old name path. cc @rdblue

Contributor
@brkyvz left a comment

I like this a lot. After this, we're a single step away from being able to create/replace v2 tables through the "save" API. I'd love others to also weigh in, since I've missed most V2 discussions in the last month.

One concern I have is the mismatch between path and location as data source options versus Hive metastore table properties.


// A simple version of `TableProvider` which doesn't support specified table schema/partitioning
// and treats table properties case-insensitively. This is private and only used in builtin sources.
private[sql] trait SimpleTableProvider extends TableProvider {
Contributor

If this is private and only used by built-in sources, it should not be in the catalyst connector package. That is the public API package, and this will appear public outside of Scala.

I think this should be in org.apache.spark.sql.execution.datasources.v2.

override def getTable(properties: util.Map[String, String]): Table = {
  getTable(new CaseInsensitiveStringMap(properties))
}
override def getTable(schema: StructType, properties: util.Map[String, String]): Table = {
Contributor

Nit: These methods should have a newline between them.

}

private [connector] trait SessionCatalogTest[T <: Table, Catalog <: TestV2SessionCatalogBase[T]]
private[connector] trait SessionCatalogTest[T <: Table, Catalog <: TestV2SessionCatalogBase[T]]
Contributor

This isn't a necessary change. If you were already changing this line it would be fine, but as it is this change can cause conflicts and should probably be reverted.

@rdblue
Contributor

rdblue commented Dec 8, 2019

I find this approach a little awkward because it mixes really different use cases into the same API. One is where you have a metastore as the source of truth for schema and partitioning, and the other is where the implementation is the source of truth.

This leads to strange requirements, like throwing an IllegalArgumentException to reject a schema or partitioning. That doesn't make much sense when the source of truth is the metastore. And, the API doesn't distinguish between these cases, so an implementation doesn't know whether the table is being created by a DataFrameWriter (and should reject partitioning that doesn't match) or if it is created from metastore information (and should use the partitioning from the metastore).

That's why I liked the approach of moving the schema and partitioning inference outside of this API. That way, Spark is responsible for determining things like whether schemas "match" and can use more context to make a reasonable choice.

Why abandon the other approach? I thought that we were making progress and that the primary blocker was trying to do too much to be reviewed in a single PR.

@cloud-fan
Contributor Author

That way, Spark is responsible for determining things like whether schemas "match" and can use more context to make a reasonable choice.

We can do that too, e.g. first call getTable(schema, partitioning, properties) and then check that the returned table reports a schema/partitioning compatible with the one passed in.

Even if we had separate inferSchema and inferPartitioning methods, we would still require the getTable method to throw IllegalArgumentException to reject an incompatible schema/partitioning, e.g. when there is a user-provided schema and we pass it to getTable directly.

My main point is: this PR is a natural extension of the existing API. If TableProvider accepts a user-specified schema, why not accept user-specified partitioning as well? The refactoring might be good, but it should be a separate story.

If we all agree that the existing API is wrong (i.e. the way we accept a user-specified schema), then this PR should be rejected, as it extends a wrong API. But that doesn't seem to be the case here.
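
For what it's worth, the check described above could look roughly like this on the Spark side (a sketch only; provider, userSchema, userPartitioning and properties are assumed to be in scope, and exact equality stands in for a real compatibility check):

import org.apache.spark.sql.connector.catalog.Table

// Pass the user/metastore-provided metadata through, then verify that the
// returned table reports something compatible with what was passed in.
val table: Table = provider.getTable(userSchema, userPartitioning, properties)
if (table.schema() != userSchema) {
  throw new IllegalArgumentException(
    s"Table ${table.name()} reports schema ${table.schema()}, " +
      s"which does not match the specified schema $userSchema")
}
if (!table.partitioning().sameElements(userPartitioning)) {
  throw new IllegalArgumentException(
    s"Table ${table.name()} reports partitioning that does not match the specified one")
}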

  throw new UnsupportedOperationException(
    this.getClass().getSimpleName() + " source does not support user-specified schema");
}
Table getTable(
Member

I am a bit curious about the parameter order in these 3 methods:

getTable(properties)
getTable(schema, properties)
getTable(schema, partitioning, properties)

Is it on purpose? Why not:

getTable(properties)
getTable(properties, schema)
getTable(properties, schema, partitioning)

Contributor Author

I just followed the order in TableCatalog.createTable.

Member

Do you mind changing the parameter order? It is a bit weird.
Besides, the previous parameter order was:

getTable(options)
getTable(options, schema)

Contributor Author

I don't have a strong preference, but isn't it better to be consistent with createTable?

Member

Well, I think consistency in the trait TableProvider itself is more important.

Contributor Author

why is this not consistent?

getTable(properties)
getTable(schema, properties)
getTable(schema, partitioning, properties)

Member
@gengliangwang Dec 10, 2019

I think this is common practice. The three methods look neater when each parameter is in a fixed position:

getTable(properties)
getTable(properties, schema)
getTable(properties, schema, partitioning)

@cloud-fan closed this Feb 10, 2020