Support VR test including TestStream for Spark runner in streaming mode #22620

mosche · 2022-08-08T15:36:51Z

Run ValidatesRunner (VR) tests for the Spark streaming runner rather than custom tests (the custom tests are already run as part of the "normal" unit test run).

If forceStreaming is set to true, the TestSparkRunner will replace Read.Bounded with UnboundedReadFromBoundedSource so tests are run in streaming mode.
Additionally this PR adds support for TestStream.

Closes #22472

mosche · 2022-08-08T15:39:28Z

Run Spark ValidatesRunner

mosche · 2022-08-08T15:52:25Z

Unfortunately, I'm stuck with some flaky tests. It looks like watermarks are not advanced in a deterministic way.
Below are some logs from org.apache.beam.sdk.schemas.AvroSchemaTest.testAvroPipelineGroupBy (edited for readability).

@aromanenko-dev @echauchot if you have some time I'd be more than grateful for a 2nd pair of 👀 .

Successful run (watermark advanced early enough, so that timer is triggered and the only element is emitted):

14:42:52,938 [3] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: input elements: [ValueInGlobalWindow{value=Row1, pane=NO_FIRING}]
14:42:52,940 [3] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: non expired input elements: [ValueInGlobalWindow{value=Row1, pane=NO_FIRING}]
14:42:52,949 [3] TRACE WindowTracing  - ReduceFnRunner.scheduleGarbageCollectionTimer: Scheduling at GLOBALW_MAX for key:Row2; window:GlobalWindow where inputWatermark:BOUNDEDW_MIN; outputWatermark:null
14:42:52,957 [3] TRACE WindowTracing  - WatermarkHold.addHolds: element hold at GLOBALW_MAX is on time for key:Row2; window:GlobalWindow; inputWatermark:BOUNDEDW_MIN; outputWatermark:null
14:42:52,960 [3] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timerInternals before advance are SparkTimerInternals{highWatermark=BOUNDEDW_MIN, synchronizedProcessingTime=EPOCH, timers=[TimerData{timerId=0, timerFamilyId=, namespace=Window(GlobalWindow), timestamp=GLOBALW_MAX, outputTimestamp=GLOBALW_MAX, domain=EVENT_TIME, deleted=false}], inputWatermark=BOUNDEDW_MIN}
14:42:52,961 [3] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timers eligible for processing are [] [inputWatermark: BOUNDEDW_MIN]
14:42:52,962 [3] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: output elements are  0

14:42:53,137 [spark-listener-group-appStatus] INFO  GlobalWatermarkHolder  - Put new watermark block: {0=SparkWatermarks{lowWatermark=BOUNDEDW_MIN, highWatermark=BOUNDEDW_MAX, synchronizedProcessingTime=NOW}}

14:42:53,146 [15] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timerInternals before advance are SparkTimerInternals{highWatermark=BOUNDEDW_MAX, synchronizedProcessingTime=NOW, timers=[TimerData{timerId=0, timerFamilyId=, namespace=Window(GlobalWindow), timestamp=GLOBALW_MAX, outputTimestamp=GLOBALW_MAX, domain=EVENT_TIME, deleted=false}], inputWatermark=BOUNDEDW_MIN}
14:42:53,146 [15] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timers eligible for processing are [TimerData{timerId=0, timerFamilyId=, namespace=Window(GlobalWindow), timestamp=GLOBALW_MAX, outputTimestamp=GLOBALW_MAX, domain=EVENT_TIME, deleted=false}] [inputWatermark: BOUNDEDW_MAX]

14:42:53,146 [15] DEBUG WindowTracing  - ReduceFnRunner: Received timer key:Row2; window:GlobalWindow; data:TimerData{timerId=0, timerFamilyId=, namespace=Window(GlobalWindow), timestamp=GLOBALW_MAX, outputTimestamp=GLOBALW_MAX, domain=EVENT_TIME, deleted=false} with inputWatermark:BOUNDEDW_MAX; outputWatermark:null
14:42:53,148 [15] DEBUG WindowTracing  - ReduceFnRunner: Cleaning up for key:Row2; window:GlobalWindow with inputWatermark:BOUNDEDW_MAX; outputWatermark:null
14:42:53,148 [15] DEBUG WindowTracing  - WatermarkHold.extractAndRelease: for key:Row2; window:GlobalWindow; inputWatermark:BOUNDEDW_MAX; outputWatermark:null
14:42:53,149 [15] DEBUG WindowTracing  - WatermarkHold.extractAndRelease.read: clearing for key:Row2; window:GlobalWindow
14:42:53,150 [15] DEBUG WindowTracing  - describePane: ON_TIME pane (prev was null) for key:Row2; windowMaxTimestamp:GLOBALW_MAX; inputWatermark:BOUNDEDW_MAX; outputWatermark:null; isLateForOutput:false
14:42:53,152 [15] TRACE WindowTracing  - ReduceFnRunner.onTrigger: outputWindowedValue key:Row2 value:[Row1] at GLOBALW_MAX
14:42:53,152 [15] DEBUG WindowTracing  - WatermarkHold.clearHolds: For key:Row2; window:GlobalWindow; inputWatermark:BOUNDEDW_MAX; outputWatermark:null
14:42:53,153 [15] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: output elements are TimestampedValueInGlobalWindow{value=KV{Row2, [Row1]}, timestamp=GLOBALW_MAX, pane=PaneInfo{isFirst=true, isLast=true, timing=ON_TIME, index=0, onTimeIndex=0}} 1

Failed run (watermark is advanced too late, timer doesn't trigger and element is lost):

14:41:51,453 [3] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: input elements: [ValueInGlobalWindow{value=Row1, pane=NO_FIRING}]
14:41:51,455 [3] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: non expired input elements: [ValueInGlobalWindow{value=Row1, pane=NO_FIRING}]
14:41:51,463 [3] TRACE WindowTracing  - ReduceFnRunner.scheduleGarbageCollectionTimer: Scheduling at GLOBALW_MAX for key:Row2; window:GlobalWindow where inputWatermark:BOUNDEDW_MIN; outputWatermark:null
14:41:51,471 [3] TRACE WindowTracing  - WatermarkHold.addHolds: element hold at GLOBALW_MAX is on time for key:Row2; window:GlobalWindow; inputWatermark:BOUNDEDW_MIN; outputWatermark:null
14:41:51,474 [3] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timerInternals before advance are SparkTimerInternals{highWatermark=BOUNDEDW_MIN, synchronizedProcessingTime=EPOCH, timers=[TimerData{timerId=0, timerFamilyId=, namespace=Window(GlobalWindow), timestamp=GLOBALW_MAX, outputTimestamp=GLOBALW_MAX, domain=EVENT_TIME, deleted=false}], inputWatermark=BOUNDEDW_MIN}
14:41:51,474 [3] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timers eligible for processing are [] [inputWatermark: BOUNDEDW_MIN]
14:41:51,476 [3] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: output elements are  0

14:41:51,658 [15] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timerInternals before advance are SparkTimerInternals{highWatermark=BOUNDEDW_MIN, synchronizedProcessingTime=EPOCH, timers=[TimerData{timerId=0, timerFamilyId=, namespace=Window(GlobalWindow), timestamp=GLOBALW_MAX, outputTimestamp=GLOBALW_MAX, domain=EVENT_TIME, deleted=false}], inputWatermark=BOUNDEDW_MIN}
14:41:51,658 [15] DEBUG SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: timers eligible for processing are [] [inputWatermark: BOUNDEDW_MIN]
14:41:51,658 [15] TRACE SparkGroupAlsoByWindowViaWindowSet  - Group.ByFields/ToKvs/GroupByKey: output elements are  0

14:41:51,662 [spark-listener-group-appStatus] INFO  GlobalWatermarkHolder  - Put new watermark block: {0=SparkWatermarks{lowWatermark=BOUNDEDW_MIN, highWatermark=BOUNDEDW_MAX, synchronizedProcessingTime=NOW}}
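
To make the difference between the two runs easier to see, here is a minimal sketch of the race in plain Java. It is a toy model, not the Spark runner's code: the class WatermarkRaceSketch, the Event enum and the numeric placeholder values are all hypothetical, and it assumes that a timer becomes eligible once the input watermark is past its timestamp and that no further micro-batch runs after the last event.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the race in the logs above; NOT the Spark runner's actual code. */
final class WatermarkRaceSketch {
  // Placeholder values that only preserve the ordering of the constants in the logs.
  static final long BOUNDEDW_MIN = Long.MIN_VALUE;
  static final long GLOBALW_MAX = Long.MAX_VALUE - 1;
  static final long BOUNDEDW_MAX = Long.MAX_VALUE;

  /** Hypothetical event kinds: a micro-batch runs, or the listener publishes a new watermark. */
  enum Event { MICRO_BATCH, WATERMARK_ADVANCE }

  /** Returns how many elements the GroupByKey emits for the given interleaving of events. */
  static int emittedElements(List<Event> schedule) {
    long inputWatermark = BOUNDEDW_MIN;
    boolean timerPending = true; // garbage collection timer at GLOBALW_MAX, set when Row1 arrived
    int emitted = 0;
    for (Event event : schedule) {
      if (event == Event.WATERMARK_ADVANCE) {
        inputWatermark = BOUNDEDW_MAX;
      } else if (timerPending && GLOBALW_MAX < inputWatermark) {
        // Assumption: a timer is eligible once the watermark is past its timestamp.
        timerPending = false;
        emitted++;
      }
    }
    return emitted;
  }

  public static void main(String[] args) {
    List<Event> successful =
        Arrays.asList(Event.MICRO_BATCH, Event.WATERMARK_ADVANCE, Event.MICRO_BATCH);
    List<Event> failed =
        Arrays.asList(Event.MICRO_BATCH, Event.MICRO_BATCH, Event.WATERMARK_ADVANCE);
    System.out.println("successful run emitted: " + emittedElements(successful));
    System.out.println("failed run emitted: " + emittedElements(failed));
  }
}
```

In this model the first interleaving emits the single element, while the second emits nothing because the watermark is only published after the last micro-batch has already looked at the timers.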

github-actions · 2022-08-08T16:05:37Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @lukecwik for label java.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

mosche · 2022-08-09T08:39:03Z

Run Spark ValidatesRunner

mosche · 2022-08-09T08:49:09Z

Run Spark ValidatesRunner

mosche · 2022-08-09T09:07:16Z

Run Spark ValidatesRunner

mosche · 2022-08-10T14:21:19Z

sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java

@@ -105,6 +105,9 @@ public void testLateDataAccumulating() {
            .advanceWatermarkTo(instant.plus(Duration.standardMinutes(6)))
            // These elements are late but within the allowed lateness
            .addElements(TimestampedValue.of(4L, instant), TimestampedValue.of(5L, instant))
+            .advanceWatermarkTo(instant.plus(Duration.standardMinutes(10)))


@kennknowles Maybe you could answer this? I'm wondering if this is an issue of the Spark streaming runner (and how this is handled by other runners) or a gap in my own understanding.

Without advancing the watermark once more, the (lower) input watermark remains at 6 mins, but data in [0, 5 min) won't be considered late until the watermark passes 10 mins.
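
The arithmetic behind the 6 and 10 minute marks can be sketched in plain Java as follows. This is a toy model under assumptions, not Beam code: WindowExpirySketch is a hypothetical class, all times are offsets from the test's instant, the 5 minute allowed lateness is inferred from the figures above, and the rule that a window expires once the watermark is past its max timestamp plus the allowed lateness is my understanding of the Beam model.

```java
import java.time.Duration;

/** Toy model of window expiry; all times are offsets from the test's `instant`. */
final class WindowExpirySketch {
  /**
   * Assumed rule: a window [start, end) expires once the input watermark is past
   * (end - 1 ms) + allowedLateness.
   */
  static boolean isExpired(Duration windowEnd, Duration allowedLateness, Duration inputWatermark) {
    Duration maxTimestamp = windowEnd.minusMillis(1);
    Duration expiry = maxTimestamp.plus(allowedLateness);
    return inputWatermark.compareTo(expiry) > 0;
  }

  public static void main(String[] args) {
    Duration windowEnd = Duration.ofMinutes(5);       // window [0, 5 min)
    Duration allowedLateness = Duration.ofMinutes(5); // assumed from the 10 min figure
    System.out.println("expired at 6 min: "
        + isExpired(windowEnd, allowedLateness, Duration.ofMinutes(6)));
    System.out.println("expired at 10 min: "
        + isExpired(windowEnd, allowedLateness, Duration.ofMinutes(10)));
  }
}
```

With the watermark stuck at 6 minutes the window is still open in this model, so a final pane would only be produced after the extra advanceWatermarkTo call moves the watermark to 10 minutes.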

Just responding to let you know I have been on vacation and I will look at this later today.

Thanks a lot! I'm off as well for a bit, so no rush on this.

@kennknowles Finally back to this, if you could have a look it would be great:)

mosche · 2022-08-11T12:08:57Z

Run Spark ValidatesRunner

mosche · 2022-08-11T12:58:01Z

Run Java PreCommit

github-actions · 2022-08-19T12:13:59Z

Reminder, please take a look at this pr: @lukecwik

github-actions · 2022-08-23T12:14:25Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.

github-actions · 2022-08-30T12:15:33Z

Reminder, please take a look at this pr: @kennknowles

github-actions · 2022-09-02T12:14:09Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @lukecwik for label java.

github-actions · 2022-09-12T12:14:20Z

Reminder, please take a look at this pr: @lukecwik

kennknowles

This is super useful! Thank you!

kennknowles · 2022-09-12T15:23:20Z

runners/spark/spark_runner.gradle


  classpath = configurations.validatesRunner
-  testClassesDirs += files(project.sourceSets.test.output.classesDirs)
+  testClassesDirs += files(
+    project(":sdks:java:core").sourceSets.test.output.classesDirs,


This change makes sense. I don't understand how it worked before. Was it not actually running the VR tests?

Yes, in fact VR tests were never run/supported for Spark in streaming mode. I just stumbled on this accidentally when I started looking into bugs related to onWindowExpiration :(
Instead there were some custom tests in the module that try to mimic the VR tests, but they only cover a very small part (and also run as unit tests).

kennknowles · 2022-09-12T15:24:20Z

runners/spark/src/main/java/org/apache/beam/runners/spark/TestSparkRunner.java

+   * Override factory to replace {@link Read.Unbounded} with {@link UnboundedReadFromBoundedSource}
+   * to force streaming mode.
+   */
+  private static class UnboundedReadFromBoundedSourceOverrideFactory<T>


This seems useful as a general thing that could be in runners-core-construction FWIW.

Happy to do that, though I'm not entirely sure the factory is of much value by itself. I also had to fix the outputs in a non-trivial way using a visitor after the replacement, see https://github.com/apache/beam/pull/22620/files#diff-d81f49eb0330230bd03ce6cd33b5f70f59c443aac57741e877ececbada32b16bR246-R274. I couldn't find a way to achieve this in mapOutputs of the override factory itself.
Let me know what you think.

.../org/apache/beam/runners/spark/translation/streaming/SparkRunnerStreamingContextFactory.java

...ers/spark/src/main/java/org/apache/beam/runners/spark/translation/streaming/TestDStream.java

sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java

kennknowles · 2022-09-12T15:31:56Z

Run Spark ValidatesRunner

mosche · 2022-09-12T16:00:12Z

@kennknowles FYI, a significant number of VR tests are consistently failing here. If I run them independently they usually succeed. It looks like there's some nondeterminism around watermark propagation in the runner, see #23129.
Wondering, would you know anyone who's familiar with that code?

kennknowles · 2022-09-13T20:44:44Z

I don't know if anyone currently around would be familiar with SparkRunner watermark propagation.

kennknowles · 2022-09-13T20:48:52Z

I think it is valuable to get these tests running and disable the failing ones. The test suite and its list of disabled tests can then be a real representation of the current state. That way things that are green can stay green.

kennknowles · 2022-09-13T20:49:00Z

run spark validatesrunner

mosche · 2022-09-14T12:18:28Z

Run Spark ValidatesRunner

mosche · 2022-09-14T12:29:11Z

Run Spark ValidatesRunner

mosche · 2022-09-14T13:36:05Z

I took a bit of a turn here after validating my initial approach (replacing bounded sources with UnboundedReadFromBoundedSource) against VR tests in Flink:

Tests that failed, likely due to watermark issues with the Spark runner ([Bug]: Issues with Watermark propagation in Spark runner (streaming) #23129, see test results), ran fine with Flink, suggesting there really is a major problem with the Spark runner (in streaming mode).
Nevertheless, it also showed that the approach is somehow flawed. Some bounded test cases simply cannot be forced into a streaming execution, e.g. any GroupByKey will fail on the GlobalWindow if there's no trigger set.

The initial reason for this approach was to prevent the Spark runner from failing when streaming was forced via pipeline options in VR tests for bounded test cases: Spark refuses to start if there's no streaming workload scheduled.
Instead TestSparkRunner now just detects the translation mode and acts accordingly.

Unfortunately, this hides the watermark issues uncovered above, since the VR tests now succeed.
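
As a rough illustration of the mode detection described above, here is a minimal sketch in plain Java. It is not the TestSparkRunner implementation: TranslationModeSketch and its enums are hypothetical names, and the rule that a pipeline runs in streaming mode as soon as any of its sources is unbounded is an assumption about how the detection behaves.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of batch/streaming detection; NOT the actual TestSparkRunner logic. */
final class TranslationModeSketch {
  enum Boundedness { BOUNDED, UNBOUNDED }

  enum Mode { BATCH, STREAMING }

  /** Assumed rule: any unbounded source switches the whole pipeline to streaming. */
  static Mode detect(List<Boundedness> sources) {
    return sources.contains(Boundedness.UNBOUNDED) ? Mode.STREAMING : Mode.BATCH;
  }

  public static void main(String[] args) {
    // A purely bounded VR test stays in batch mode instead of being forced into streaming.
    System.out.println(detect(Arrays.asList(Boundedness.BOUNDED, Boundedness.BOUNDED)));
    // A test using an unbounded source such as TestStream runs in streaming mode.
    System.out.println(detect(Arrays.asList(Boundedness.BOUNDED, Boundedness.UNBOUNDED)));
  }
}
```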

kennknowles · 2022-09-14T20:11:12Z

Nevertheless, it also showed that the approach is somehow flawed. Some bounded test cases simply cannot be forced into a streaming execution, e.g. any GroupByKey will fail on the GlobalWindow if there's no trigger set.

In the Beam model, this condition is that a GroupByKey of an unbounded PCollection in the global window must have a trigger. But you can still have a bounded PCollection in streaming mode.

So the summary is:

forcing a run in streaming mode, but leaving bounded PCollections as bounded is OK
automatically making all PCollections unbounded is flawed (but still can be useful to find bugs sometimes)
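
The rule summarized above can be sketched as a small check in plain Java. This is a simplified model, not Beam's validation code: GroupByKeyRuleSketch and its enums are hypothetical, and reading "must have a trigger" as "must have a non-default trigger" is my assumption.

```java
/** Toy model of the GroupByKey rule discussed above; NOT Beam's actual validation code. */
final class GroupByKeyRuleSketch {
  enum Boundedness { BOUNDED, UNBOUNDED }

  enum Windowing { GLOBAL, NON_GLOBAL }

  enum Trigger { DEFAULT, CUSTOM }

  /**
   * Assumed rule: only the combination of an unbounded input, the global window and
   * the default trigger is rejected. Streaming vs. batch execution plays no role.
   */
  static boolean groupByKeyAllowed(Boundedness input, Windowing windowing, Trigger trigger) {
    boolean rejected =
        input == Boundedness.UNBOUNDED
            && windowing == Windowing.GLOBAL
            && trigger == Trigger.DEFAULT;
    return !rejected;
  }

  public static void main(String[] args) {
    // Bounded PCollection, even when the pipeline runs in streaming mode: allowed.
    System.out.println(
        groupByKeyAllowed(Boundedness.BOUNDED, Windowing.GLOBAL, Trigger.DEFAULT));
    // Same test after all PCollections were made unbounded: rejected.
    System.out.println(
        groupByKeyAllowed(Boundedness.UNBOUNDED, Windowing.GLOBAL, Trigger.DEFAULT));
  }
}
```

This is why forcing streaming mode while leaving PCollections bounded keeps such tests valid, whereas converting every source to an unbounded one makes them fail at the GroupByKey.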

mosche · 2022-09-19T09:09:47Z

@kennknowles fine to merge this?

github-actions · 2022-09-26T12:14:32Z

Reminder, please take a look at this pr: @lukecwik

github-actions · 2022-09-29T12:14:25Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.

aromanenko-dev · 2022-09-29T15:20:49Z

@kennknowles kind ping, are you ok to merge it?

github-actions bot added java runners spark labels Aug 8, 2022

github-actions bot added the Next Action: Reviewers label Aug 8, 2022

mosche force-pushed the 22472-Spark-TestStream branch from fed3793 to d6cfb9a Compare August 9, 2022 07:53

mosche commented Aug 10, 2022

mosche force-pushed the 22472-Spark-TestStream branch from d6cfb9a to e50fec9 Compare August 11, 2022 09:30

github-actions bot added the slow-review label Aug 19, 2022

github-actions bot removed the slow-review label Aug 23, 2022

github-actions bot added the slow-review label Aug 30, 2022

github-actions bot removed the slow-review label Sep 2, 2022

mosche mentioned this pull request Sep 9, 2022

[Bug]: Issues with Watermark propagation in Spark runner (streaming) #23129

Open

github-actions bot added the slow-review label Sep 12, 2022

kennknowles reviewed Sep 12, 2022

github-actions bot removed the slow-review label Sep 12, 2022

mosche mentioned this pull request Sep 13, 2022

Annotate stateful VR test in TestStreamTest with UsesStatefulParDo (related to #22472) #23202

Merged

4 tasks

Moritz Mack added 2 commits September 13, 2022 14:05

Support VR test including TestStream for Spark runner in streaming mo…

690b75a

…de (apache#22472).

Review feedback

9eaf52e

mosche force-pushed the 22472-Spark-TestStream branch from e50fec9 to 9eaf52e Compare September 13, 2022 12:42

Moritz Mack added 2 commits September 14, 2022 12:20

Detect batch/streaming mode based on Pipeline in TestSparkRunner

480560c

Don't force streaming by replacing bounded sources with unbounded rea…

b35dbaa

…ds for Spark VR tests

mosche force-pushed the 22472-Spark-TestStream branch from 79c1953 to b35dbaa Compare September 14, 2022 12:28

github-actions bot added the slow-review label Sep 26, 2022

github-actions bot removed the slow-review label Sep 29, 2022

kennknowles approved these changes Sep 30, 2022

kennknowles merged commit 3c7a4e0 into apache:master Sep 30, 2022

mosche deleted the 22472-Spark-TestStream branch October 1, 2022 06:22
