[SPARK-26695][SQL] data source v2 API refactor - continuous read #23619
Conversation
With the new abstraction, we should only stop sources when the streaming query ends, instead of at each reconfiguration.
As above, this looks like it's correctly implemented to me but we should keep an eye out for flakiness.
Thanks for the reminder! Yeah, I'll keep an eye on it.
Ditto: with the new abstraction, we should create the ContinuousStream at the beginning of the ContinuousExecution, instead of at each reconfiguration.
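The lifecycle change described in these two comments can be sketched as follows. This is a hypothetical simplification, not Spark's actual `ContinuousExecution`; every class and method name here is illustrative:

```java
// Hypothetical sketch: the ContinuousStream is created once when execution
// starts and stopped once when the query ends; reconfigurations in between
// reuse the same stream instead of tearing it down and recreating it.
import java.util.ArrayList;
import java.util.List;

interface ContinuousStream {
    void stop();
}

class ContinuousExecutionSketch {
    private final ContinuousStream stream;  // created once, up front
    final List<String> events = new ArrayList<>();

    ContinuousExecutionSketch(ContinuousStream stream) {
        this.stream = stream;
        events.add("stream created");
    }

    void reconfigure() {
        // The old code path recreated the source here; the new one does not.
        events.add("reconfigured, stream reused");
    }

    void terminate() {
        stream.stop();  // the only place the stream is stopped
        events.add("stream stopped");
    }
}

public class Lifecycle {
    public static void main(String[] args) {
        List<String> stops = new ArrayList<>();
        ContinuousExecutionSketch exec =
            new ContinuousExecutionSketch(() -> stops.add("stop"));
        exec.reconfigure();
        exec.reconfigure();
        exec.terminate();
        System.out.println(stops.size());  // stop() ran exactly once
    }
}
```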
This is a small fix. The test needs to specify numPartitions, but the socket source always uses the Spark default parallelism. Here I make numPartitions configurable.
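The shape of the fix can be illustrated with a small option-parsing sketch. The option name matches the comment's intent, but the class and the default value are hypothetical stand-ins, not the real socket source code:

```java
// Hypothetical sketch: a numPartitions option that overrides the default
// parallelism when present, and falls back to it otherwise.
import java.util.Map;

public class SocketSourceOptions {
    static final int DEFAULT_PARALLELISM = 8;  // stand-in for spark.default.parallelism

    static int numPartitions(Map<String, String> options) {
        String value = options.get("numPartitions");
        return value == null ? DEFAULT_PARALLELISM : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        System.out.println(numPartitions(Map.of("numPartitions", "2")));  // prints 2
        System.out.println(numPartitions(Map.of()));                      // prints 8
    }
}
```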
Test build #101564 has finished for PR 23619 at commit
We don't need this test now. With the new TableProvider abstraction, the lookup logic is unified between micro-batch and continuous.
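Why one code path suffices can be sketched like this: with a single TableProvider-style lookup, the execution mode only changes which method is called on the resulting scan, so there is no separate lookup to test per mode. All interfaces below are illustrative stand-ins, not the actual Spark API:

```java
// Hypothetical sketch: one provider lookup serves both execution modes.
interface Scan {
    default String toMicroBatchStream() { return "micro-batch"; }
    default String toContinuousStream() { return "continuous"; }
}

interface Table {
    Scan newScan();
}

interface TableProvider {
    Table getTable();
}

public class UnifiedLookup {
    // The lookup (provider -> table -> scan) is identical for both modes;
    // only the final method call differs.
    static String run(TableProvider provider, boolean continuous) {
        Scan scan = provider.getTable().newScan();
        return continuous ? scan.toContinuousStream() : scan.toMicroBatchStream();
    }

    public static void main(String[] args) {
        TableProvider provider = () -> () -> new Scan() {};
        System.out.println(run(provider, false));  // prints micro-batch
        System.out.println(run(provider, true));   // prints continuous
    }
}
```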
Test build #101567 has finished for PR 23619 at commit

Test build #101571 has finished for PR 23619 at commit

retest this please

Test build #101578 has finished for PR 23619 at commit

retest this please

Test build #101583 has finished for PR 23619 at commit
jose-torres left a comment:
Mostly looks good.
```diff
- if scan.readSupport.isInstanceOf[KafkaContinuousReadSupport] =>
-   scan.scanConfig.asInstanceOf[KafkaContinuousScanConfig]
- }.exists { config =>
+ if scan.stream.isInstanceOf[KafkaContinuousStream] =>
```
I think this logic is correct, but let's keep an eye on the tests after merging since some flakiness slipped through in the last iteration of the refactoring.
```java
 * @throws UnsupportedOperationException
 */
default ContinuousStream toContinuousStream(String checkpointLocation) {
  throw new UnsupportedOperationException("Continuous scans are not supported");
```
nit: I think the message should indicate they're unsupported just for this type of Scan - this makes it sound like they're not supported in general.
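The suggested tweak can be sketched as follows. This is a hypothetical illustration of the nit, not the wording Spark actually merged, and the return type is loosened to `Object` to keep the sketch self-contained:

```java
// Hypothetical sketch: include the concrete scan in the message so the
// exception reads as "unsupported by this scan" rather than in general.
interface Scan {
    String description();

    default Object toContinuousStream(String checkpointLocation) {
        throw new UnsupportedOperationException(
            description() + ": Continuous processing is not supported by this scan");
    }
}

public class ScanMessage {
    public static void main(String[] args) {
        Scan batchOnly = () -> "batch-only-scan";
        try {
            batchOnly.toContinuousStream("/tmp/checkpoint");
        } catch (UnsupportedOperationException e) {
            // The message now names the offending scan instead of implying
            // continuous mode is unsupported everywhere.
            System.out.println(e.getMessage());
        }
    }
}
```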
As above, this looks like it's correctly implemented to me but we should keep an eye out for flakiness.
Test build #101792 has finished for PR 23619 at commit
LGTM
Thanks! Merged to master
## What changes were proposed in this pull request?

Following apache#23430, this PR does the API refactor for continuous read, w.r.t. the [doc](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)

The major changes:
1. rename `XXXContinuousReadSupport` to `XXXContinuousStream`
2. at the beginning of continuous streaming execution, convert `StreamingRelationV2` to `StreamingDataSourceV2Relation` directly, instead of `StreamingExecutionRelation`
3. remove all the hacks as we have finished all the read side API refactor

## How was this patch tested?

existing tests

Closes apache#23619 from cloud-fan/continuous.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
```scala
// TODO: unify the equal/hashCode implementation for all data source v2 query plans.
override def equals(other: Any): Boolean = other match {
  case other: BatchScanExec => this.batch == other.batch
```
Hi Wenchen, just bumped into this code. Do you remember why `output` is not included in the equality comparison, here as well as in the V1 scan?
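The behavior being asked about can be reproduced with a small sketch (a hypothetical simplification, not the real plan node): because only the batch field participates in equals/hashCode, two nodes with different output compare equal.

```java
// Hypothetical simplification: equality is defined on `batch` alone, so
// `output` differences are invisible to equals/hashCode.
import java.util.List;
import java.util.Objects;

public class BatchScanSketch {
    final Object batch;
    final List<String> output;  // deliberately excluded from equals/hashCode

    BatchScanSketch(Object batch, List<String> output) {
        this.batch = batch;
        this.output = output;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof BatchScanSketch
            && Objects.equals(batch, ((BatchScanSketch) other).batch);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(batch);
    }

    public static void main(String[] args) {
        Object sharedBatch = "same-batch";
        BatchScanSketch a = new BatchScanSketch(sharedBatch, List.of("col_a"));
        BatchScanSketch b = new BatchScanSketch(sharedBatch, List.of("col_b"));
        System.out.println(a.equals(b));  // prints true despite differing output
    }
}
```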