[SPARK-26695][SQL] data source v2 API refactor - continuous read #23619
Conversation
With the new abstraction, we should only stop sources when the streaming query ends, instead of at each reconfiguration.
As above, this looks like it's correctly implemented to me but we should keep an eye out for flakiness.
Thanks for the reminder! Yeah, I'll keep an eye on it.
Ditto: with the new abstraction, we should create the ContinuousStream at the beginning of the ContinuousExecution, instead of at each reconfiguration.
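The lifecycle change described in these two comments can be sketched as follows. This is a hypothetical simplification, not Spark's actual `ContinuousExecution`; every class and method name here is illustrative:

```java
// Hypothetical sketch: the ContinuousStream is created once when execution
// starts and stopped once when the query ends; reconfigurations in between
// reuse the same stream instead of tearing it down and recreating it.
import java.util.ArrayList;
import java.util.List;

interface ContinuousStream {
    void stop();
}

class ContinuousExecutionSketch {
    private final ContinuousStream stream;  // created once, up front
    final List<String> events = new ArrayList<>();

    ContinuousExecutionSketch(ContinuousStream stream) {
        this.stream = stream;
        events.add("stream created");
    }

    void reconfigure() {
        // The old code path recreated the source here; the new one does not.
        events.add("reconfigured, stream reused");
    }

    void terminate() {
        stream.stop();  // the only place the stream is stopped
        events.add("stream stopped");
    }
}

public class Lifecycle {
    public static void main(String[] args) {
        List<String> stops = new ArrayList<>();
        ContinuousExecutionSketch exec =
            new ContinuousExecutionSketch(() -> stops.add("stop"));
        exec.reconfigure();
        exec.reconfigure();
        exec.terminate();
        System.out.println(stops.size());  // stop() ran exactly once
    }
}
```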
This is a small fix. The test needs to specify numPartitions, but the socket source always uses the Spark default parallelism. Here I make numPartitions configurable.
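The shape of the fix can be illustrated with a small option-parsing sketch. The option name matches the comment's intent, but the class and the default value are hypothetical stand-ins, not the real socket source code:

```java
// Hypothetical sketch: a numPartitions option that overrides the default
// parallelism when present, and falls back to it otherwise.
import java.util.Map;

public class SocketSourceOptions {
    static final int DEFAULT_PARALLELISM = 8;  // stand-in for spark.default.parallelism

    static int numPartitions(Map<String, String> options) {
        String value = options.get("numPartitions");
        return value == null ? DEFAULT_PARALLELISM : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        System.out.println(numPartitions(Map.of("numPartitions", "2")));  // prints 2
        System.out.println(numPartitions(Map.of()));                      // prints 8
    }
}
```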
Test build #101564 has finished for PR 23619 at commit
We don't need this test now. With the new TableProvider abstraction, the lookup logic is unified between micro-batch and continuous.
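Why one code path suffices can be sketched like this: with a single TableProvider-style lookup, the execution mode only changes which method is called on the resulting scan, so there is no separate lookup to test per mode. All interfaces below are illustrative stand-ins, not the actual Spark API:

```java
// Hypothetical sketch: one provider lookup serves both execution modes.
interface Scan {
    default String toMicroBatchStream() { return "micro-batch"; }
    default String toContinuousStream() { return "continuous"; }
}

interface Table {
    Scan newScan();
}

interface TableProvider {
    Table getTable();
}

public class UnifiedLookup {
    // The lookup (provider -> table -> scan) is identical for both modes;
    // only the final method call differs.
    static String run(TableProvider provider, boolean continuous) {
        Scan scan = provider.getTable().newScan();
        return continuous ? scan.toContinuousStream() : scan.toMicroBatchStream();
    }

    public static void main(String[] args) {
        TableProvider provider = () -> () -> new Scan() {};
        System.out.println(run(provider, false));  // prints micro-batch
        System.out.println(run(provider, true));   // prints continuous
    }
}
```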
Test build #101567 has finished for PR 23619 at commit

Test build #101571 has finished for PR 23619 at commit

retest this please

Test build #101578 has finished for PR 23619 at commit

retest this please

Test build #101583 has finished for PR 23619 at commit
jose-torres left a comment:
Mostly looks good.
```diff
- if scan.readSupport.isInstanceOf[KafkaContinuousReadSupport] =>
-   scan.scanConfig.asInstanceOf[KafkaContinuousScanConfig]
- }.exists { config =>
+ if scan.stream.isInstanceOf[KafkaContinuousStream] =>
```
I think this logic is correct, but let's keep an eye on the tests after merging since some flakiness slipped through in the last iteration of the refactoring.
```java
 * @throws UnsupportedOperationException
 */
default ContinuousStream toContinuousStream(String checkpointLocation) {
  throw new UnsupportedOperationException("Continuous scans are not supported");
```
nit: I think the message should indicate they're unsupported just for this type of Scan - this makes it sound like they're not supported in general.
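The suggested tweak can be sketched as follows. This is a hypothetical illustration of the nit, not the wording Spark actually merged, and the return type is loosened to `Object` to keep the sketch self-contained:

```java
// Hypothetical sketch: include the concrete scan in the message so the
// exception reads as "unsupported by this scan" rather than in general.
interface Scan {
    String description();

    default Object toContinuousStream(String checkpointLocation) {
        throw new UnsupportedOperationException(
            description() + ": Continuous processing is not supported by this scan");
    }
}

public class ScanMessage {
    public static void main(String[] args) {
        Scan batchOnly = () -> "batch-only-scan";
        try {
            batchOnly.toContinuousStream("/tmp/checkpoint");
        } catch (UnsupportedOperationException e) {
            // The message now names the offending scan instead of implying
            // continuous mode is unsupported everywhere.
            System.out.println(e.getMessage());
        }
    }
}
```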
As above, this looks like it's correctly implemented to me but we should keep an eye out for flakiness.
Test build #101792 has finished for PR 23619 at commit
LGTM
Thanks! Merged to master
## What changes were proposed in this pull request?

Following apache#23430, this PR does the API refactor for continuous read, w.r.t. the [doc](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)

The major changes:
1. rename `XXXContinuousReadSupport` to `XXXContinuousStream`
2. at the beginning of continuous streaming execution, convert `StreamingRelationV2` to `StreamingDataSourceV2Relation` directly, instead of `StreamingExecutionRelation`
3. remove all the hacks as we have finished all the read side API refactor

## How was this patch tested?

existing tests

Closes apache#23619 from cloud-fan/continuous.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
```scala
// TODO: unify the equal/hashCode implementation for all data source v2 query plans.
override def equals(other: Any): Boolean = other match {
  case other: BatchScanExec => this.batch == other.batch
```
Hi Wenchen, just bumped into this code. Do you remember why `output` is not included in the equality comparison, here as well as in the V1 scan?
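The behavior being asked about can be reproduced with a small sketch (a hypothetical simplification, not the real plan node): because only the batch field participates in equals/hashCode, two nodes with different output compare equal.

```java
// Hypothetical simplification: equality is defined on `batch` alone, so
// `output` differences are invisible to equals/hashCode.
import java.util.List;
import java.util.Objects;

public class BatchScanSketch {
    final Object batch;
    final List<String> output;  // deliberately excluded from equals/hashCode

    BatchScanSketch(Object batch, List<String> output) {
        this.batch = batch;
        this.output = output;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof BatchScanSketch
            && Objects.equals(batch, ((BatchScanSketch) other).batch);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(batch);
    }

    public static void main(String[] args) {
        Object sharedBatch = "same-batch";
        BatchScanSketch a = new BatchScanSketch(sharedBatch, List.of("col_a"));
        BatchScanSketch b = new BatchScanSketch(sharedBatch, List.of("col_b"));
        System.out.println(a.equals(b));  // prints true despite differing output
    }
}
```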