
Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Sep 2, 2019

What changes were proposed in this pull request?

Currently Data Source V2 has two major use cases:

  1. Users plug in a custom catalog that is tightly coupled with its own data. For example, users can plug in a Cassandra catalog and use Spark to read/write Cassandra tables directly.
  2. Users read/write external data as a table directly via DataFrameReader/Writer, or register it as a table in Spark.

Use case 1 is newly introduced in the master branch, and it greatly improves the user experience when interacting with external storage systems that have catalogs, e.g. Cassandra, JDBC, etc.

Use case 2 is the main use case of Data Source V1, which works well when the external storage system doesn't have a catalog, e.g. Parquet files on S3.

However, use case 2 is not well supported by Data Source V2. For example:

class MyTableProvider extends TableProvider ...
sql("CREATE TABLE t USING com.abc.MyTableProvider")

This fails with AnalysisException: com.abc.MyTableProvider is not a valid Spark SQL Data Source, because the session catalog always treats the table provider as a V1 source.

To support it, this PR updates TableProvider#getTable to accept additional table metadata. The expected behaviors are defined in https://docs.google.com/document/d/1oaS0eIVL1WsCjr4CqIpRv6CGkS5EoMQrngn3FsY1d-Q/edit?usp=sharing
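For illustration, the intended direction can be sketched roughly as follows. This is only a Scala-style sketch: the trait name, parameter names, and import paths (taken from the Spark 3.0 connector packages) are assumptions, not the final API in this patch.

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

trait TableProviderSketch {
  // Existing behavior: the provider infers schema/partitioning from the options alone.
  def getTable(options: CaseInsensitiveStringMap): Table

  // Proposed addition: a catalog (e.g. the session catalog handling CREATE TABLE ... USING)
  // can hand the provider the table metadata it already knows about.
  def getTable(
      options: CaseInsensitiveStringMap,
      schema: StructType,
      partitioning: Array[Transform]): Table
}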

Why are the changes needed?

It makes Data Source V2 support the use case that is already supported by Data Source V1.

Does this PR introduce any user-facing change?

Yes, it's a new feature.

How was this patch tested?

A new test suite.

* @throws UnsupportedOperationException
*/
default Table getTable(CaseInsensitiveStringMap options, StructType schema) {
default Table getTable(
Contributor Author

I'll refine the classdoc of this interface after we reach an agreement on the proposal.


@SparkQA

SparkQA commented Sep 2, 2019

Test build #110014 has finished for PR 25651 at commit 7fd7d23.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2019

Test build #110027 has finished for PR 25651 at commit 0da5453.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

retest this please.

@SparkQA

SparkQA commented Sep 3, 2019

Test build #110035 has finished for PR 25651 at commit 0da5453.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

private[sql] object V2SessionCatalog {
case class SchemaChangedException(
Member

How about just SchemaException?

@gengliangwang
Member

+1 to the proposal.

val dsOptions = new CaseInsensitiveStringMap(finalOptions.asJava)
val table = userSpecifiedSchema match {
case Some(schema) => provider.getTable(dsOptions, schema)
case Some(schema) => provider.getTable(dsOptions, schema, Array.empty)
Contributor

This isn't correct. The DataFrameReader does not know that the table is unpartitioned.

Contributor

Yeah, this is one of the most annoying and confusing behaviors of Data Source V1: being able to provide a schema but not partitioning information leads to minutes of partition schema inference.

override def getTable(
    options: CaseInsensitiveStringMap,
    schema: StructType,
    partitions: Array[Transform]): Table = {
Contributor

Why is this not identical to the metadata passed to createTable? Is there a reason not to pass the table properties as well as the read options?

Contributor

I agree, this should look similar to createTable.

Contributor Author

Read options should be passed to Table.newScanBuilder. The options here are the table properties.

Contributor Author

But we do have a problem here. Table properties are case-sensitive while scan options are case-insensitive.

Think about two cases:

  1. spark.read.format("myFormat").options(...).schema(...).load().
    We need to get the table with the user-specified options and schema. When scanning the table, we need to use the user-specified options as scan options. The problem is, DataFrameReader.options specifies both table properties and scan options in this case.
  2. CREATE TABLE t USING myFormat TABLEPROP ... and then spark.read.options(...).table("t")
    In this case, DataFrameReader.options only specifies scan options.

Ideally, TableProvider.getTable takes table properties, which should be case-sensitive. However, DataFrameReader.options also specifies scan options, which should be case-insensitive.

I don't have a good solution now. Maybe it's OK to treat this as a special table that accepts case-insensitive table properties.
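To make the mismatch concrete, here is a small standalone Scala sketch (not from this patch; the option names are made up) showing that CaseInsensitiveStringMap lookups ignore case while a plain Java map does not:

import scala.collection.JavaConverters._
import org.apache.spark.sql.util.CaseInsensitiveStringMap

object CaseSensitivityDemo {
  def main(args: Array[String]): Unit = {
    // Options as a user might pass them to DataFrameReader.options(...).
    val userOptions = Map("Path" -> "/data/t", "MergeSchema" -> "true")

    // Treated as scan options: lookups are case-insensitive, so "path" finds "Path".
    val scanOptions = new CaseInsensitiveStringMap(userOptions.asJava)
    assert(scanOptions.get("path") == "/data/t")

    // Treated as table properties in a plain (case-sensitive) map: "path" is not found.
    val tableProps: java.util.Map[String, String] = userOptions.asJava
    assert(tableProps.get("path") == null)
  }
}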

Contributor Author

Or we can make table properties case-insensitive.

Contributor

This interface should pass the table properties. There is no need to pass read or write options at this point, unless they can't be separated from table properties (as in the DataFrameReader case). The read options and write options should be passed to the logical plan -- this is added in #25681: https://github.com/apache/spark/pull/25681/files#diff-94fbd986b04087223f53697d4b6cab24R275

I propose passing table properties as a string map (java.util) through this interface. When the properties come from the metastore, this is fine. When the properties come from DataFrameReader.option (or the write equivalent), the original case-sensitive map should be passed. Then the read options should additionally be passed to the correct plan node so that the physical plan can push them into the scan or the write.

val partitions = new mutable.ArrayBuffer[Transform]()

v1Table.partitionColumnNames.foreach { col =>
  partitions += LogicalExpressions.identity(col)
Contributor

Nit: This parses the column's name as a multi-part identifier, which is subtly incorrect. (It'll cause issues if the column name contains special characters like ':'.)

Contributor

Aren't column names with special characters supported? I thought you could escape any identifier using back-ticks.

@rdblue
Contributor

rdblue commented Sep 4, 2019

@cloud-fan, can you update the PR title and description? The USING clause is not the problem. That is passed to all catalogs. The problem is that generic catalogs can't pass table information to Table instances created by TableProvider. I think the title should be "Support passing all Table metadata in TableProvider".

I think that clarifying the problem statement will also help clean up the proposed changes. For example, this passes some -- but not all -- Table metadata. It should probably pass all of the fields needed to create a Table instance that behaves like V1Table.

@cloud-fan cloud-fan changed the title from [SPARK-28948][SQL] support data source v2 in CREATE TABLE USING to [SPARK-28948][SQL] Support passing all Table metadata in TableProvider on Sep 12, 2019
throw new NoSuchTableException(ident)
}

tryResolveTableProvider(V1Table(catalogTable))
Contributor Author

@cloud-fan cloud-fan Sep 18, 2019

This is the core change of this PR.

@SparkQA

SparkQA commented Sep 18, 2019

Test build #110919 has finished for PR 25651 at commit 2333585.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111290 has finished for PR 25651 at commit 40e2894.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogExtensionForTableProvider extends DelegatingCatalogExtension

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111293 has finished for PR 25651 at commit 3a6d13d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogExtensionForTableProvider extends DelegatingCatalogExtension

@SparkQA

SparkQA commented Sep 24, 2019

Test build #111298 has finished for PR 25651 at commit 0f9faca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogExtensionForTableProvider extends DelegatingCatalogExtension

@rdblue
Contributor

rdblue commented Sep 25, 2019

@cloud-fan, thanks for working on this. I plan to review it tomorrow. Looks like this is huge and touches about 50 files. Is there a way to make it smaller?

@cloud-fan
Contributor Author

@rdblue yeah, it's possible. In this PR, I tried to adopt your suggestion to make it clear that TableProvider.getTable should take all the table metadata, so the method signature becomes

def getTable(schema: StructType, partitions: Array[Transform], properties: Map[String, String])

TableProvider has another getTable method that needs to infer schema/partitioning, and previously its signature was

def getTable(options: CaseInsensitiveStringMap)

To make it consistent, I changed it to use properties: Map[String, String], and also renamed it to loadTable since we need to touch many files anyway.

We can still keep the old method signature with a TODO to change it later, so that this PR can be much smaller.
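Put side by side, the shape described above looks roughly like the following sketch (a reading of this comment, not the committed code; the trait name is made up and whether properties should be a Scala or Java map is left open):

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

trait TableProviderRevised {
  // Infers schema and partitioning from the table properties alone.
  def loadTable(properties: Map[String, String]): Table

  // Takes all the table metadata, mirroring what a catalog's createTable receives.
  def getTable(
      schema: StructType,
      partitions: Array[Transform],
      properties: Map[String, String]): Table
}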

@SparkQA

SparkQA commented Oct 8, 2019

Test build #111905 has finished for PR 25651 at commit 1124b47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 14, 2019

Test build #112044 has finished for PR 25651 at commit 1124b47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112108 has finished for PR 25651 at commit 1235d78.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Oct 15, 2019

I should have time to review this on Thursday if it is ready. Until then, I commented on some of the open threads.

@cloud-fan
Contributor Author

cloud-fan commented Oct 21, 2019

@rdblue, when I try to have separate methods inferSchema and inferPartitioning, problems keep popping up (mostly from file source V2).

The major problems hit so far:

  1. inferSchema/inferPartitioning need to list files, and we should only do file listing once when we scan a directory without a user-specified schema. This can be resolved by using a static cache or simply caching the listed files in the FileDataSourceV2 instance.
  2. When writing to a directory, no schema/partition inference should be done. It looks to me like we need two separate methods, getTableToRead and getTableToWrite, where getTableToWrite does not take schema/partitioning and we don't need to call inferSchema/inferPartitioning. But this makes the API ugly.

I feel that the existing API (getTable with several overloads) is more flexible and allows implementations to have their own special logic. For example, it allows the file source to do schema inference lazily, so it won't be triggered at all during writes.
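For context, the alternative API style being debated here -- separate inference methods plus a single metadata-taking getTable -- would look roughly like the sketch below. The method names follow the discussion; everything else (trait name, default implementation, property map type) is an assumption.

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

trait TableProviderWithInference {
  // Called only when Spark does not already know the schema,
  // e.g. spark.read.format(...).load() without a user-specified schema.
  def inferSchema(options: CaseInsensitiveStringMap): StructType

  // Called only when Spark does not already know the partitioning.
  def inferPartitioning(options: CaseInsensitiveStringMap): Array[Transform] = Array.empty

  // Always called with complete metadata, whether it came from the metastore,
  // from the user, or from the inference methods above.
  def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: java.util.Map[String, String]): Table
}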

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112405 has finished for PR 25651 at commit cfbe0a7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Oct 21, 2019

inferSchema/inferPartitioning need to list files, and we should only do file listing once when we scan a directory without a user-specified schema

Why should this internal concern of one source affect the API? Partitioning inference does not require listing all the files in a table, and I doubt that it is a good idea to do that for schema inference either. If a table is small, it doesn't matter if this work is done twice (before it is fixed); and if a table is really large, then it isn't a good idea to do this for schema inference anyway.

when writing to a directory, no schema/partition inference should be done.

This statement makes assumptions about the behavior of path-based tables and that behavior hasn't been clearly defined yet. Can you be more specific about the case and how you think path-based tables will behave?

I disagree that no schema or partition inference should be done for writing. Maybe it isn't done today, but if there is existing data, Spark shouldn't allow writing new data that will break a table by using an incompatible schema or partition layout. In that case, we would want to infer the schema and partitioning.

Also, if it isn't necessary to infer schema and partitioning, then this information still needs to be passed to the table. When running a CTAS operation, Spark might be called with partitionBy. In that case, if Spark doesn't call inferPartitioning then what is the problem?

@cloud-fan
Contributor Author

cloud-fan commented Oct 22, 2019

I was not trying to define the behavior, but to describe the existing behavior. df.write.mode("append").parquet("path_with_existing_data") does not do schema/partition inference, and just appends the data even if the schema is incompatible. A subsequent read will then fail during schema inference. I think this is reasonable behavior, as there is no "user-specified schema" in DataFrameWriter to skip schema inference.

@cloud-fan
Contributor Author

If you really think that schema or partition inference should be done for writing, we should disable file source v2 by default to not surprise users.

@rdblue
Contributor

rdblue commented Oct 22, 2019

If you really think that schema or partition inference should be done for writing, we should disable file source v2 by default to not surprise users.

I'm not saying that it is what we should do. That should be covered by a design doc for path-based tables. My point is that the claim that it won't be done is not necessarily true and makes assumptions about how these tables will behave.

@cloud-fan
Contributor Author

Do you mean we should block this PR until we figure out the behavior of path-based tables? This PR simply makes TableProvider accept user-specified partitioning, while keeping the API style and existing file source behavior unchanged. I think we've gone too far in proposing a new API style for TableProvider.

I'm OK with adopting the new API style if it doesn't break the existing behavior. But it seems we are now unable to keep the file source skipping schema/partition inference during writes. Shall we discuss the new API style later?

@rdblue
Contributor

rdblue commented Oct 23, 2019

Do you mean we should block this PR until we figure out the behavior of path-based tables?

No, and sorry for the misunderstanding! My point is that your claim that inference won't be used in the write path is not necessarily correct and depends on the behavior we decide for path-based tables.

But it seems we are now unable to keep the file source skipping schema/partition inference during writes.

I think it's an exaggeration to say "unable". Partition inference in particular can be done much more easily and efficiently than depending on a recursive directory listing to find all data files. Granted, the current implementation would need to change, but do you really think that "unable" is an accurate description?

The problem is that this needs to be decided because it affects the API that will go into Spark 3.0. I think we should go with what we agreed was a good solution for the API -- adding the inferSchema and inferPartitioning methods -- because I haven't heard a very strong argument against it. Let's talk about this in the next v2 sync to get more opinions.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 19, 2020
@cloud-fan cloud-fan removed the Stale label Mar 19, 2020
@SparkQA

SparkQA commented May 18, 2020

Test build #122766 has finished for PR 25651 at commit cfbe0a7.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 27, 2020
@cloud-fan cloud-fan removed the Stale label Aug 27, 2020
@github-actions

github-actions bot commented Dec 6, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 6, 2020
@github-actions github-actions bot closed this Dec 7, 2020