-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-124] Flink and Spark running Examples WordCountIT #345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…park as supported runners in examples. Signed-off-by: Jason Kuster <jason@google.com>
|
Hey Max, Amit. Looking for some feedback on this pull request. The purpose is to remove the dependencies of Spark and Flink runner on Beam to enable them to run the WordCountIT in examples, as Dataflow currently does. As things were in the codebase, both Spark and Flink depended on examples for some of the code in WordCount.java and TfIdf.java. In the Flink case I've removed the tests; in the Spark case I've just added the code in. I'd love to hear your guys' thoughts on what the right thing to do is going forward. The benefit we get from this is that this is where the new End-to-End tests seem to be going, such that they can be written in a runner-agnostic way and then run just by flipping a few flags (for example, see the commands below for running this test). Let me know your thoughts! mvn clean verify -pl examples/java -am -rf :java-examples-all -DskipITs=false -DintegrationTestPipelineOptions='[ "--tempRoot=/tmp", "--inputFile=/tmp/kinglear.txt", "--runner=org.apache.beam.runners.spark.SparkPipelineRunner", "--sparkMaster=local" ]' mvn clean verify -pl examples/java -am -rf :java-examples-all -DskipITs=false -DintegrationTestPipelineOptions='[ "--tempRoot=/tmp", "--inputFile=/tmp/kinglear.txt", "--runner=org.apache.beam.runners.flink.FlinkPipelineRunner" ] mvn clean verify -pl examples/java -am -rf :java-examples-all -DskipITs=false -DintegrationTestPipelineOptions='[ "--tempRoot=gs://clouddfe-testing-temp-storage", "--runner=org.apache.beam.sdk.testing.TestDataflowPipelineRunner" ]' Jason |
|
I think this is the right way to go. In #343 I'm also removing these two examples because all RunnableOnService tests will be executed on Flink with those changes. |
| * Count) as a reusable PTransform subclass. Using composite transforms allows for easy reuse, | ||
| * modular testing, and an improved monitoring experience. | ||
| */ | ||
| public static class CountWords extends PTransform<PCollection<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use SimpleWordCountTest.CountWords instead. Maybe need to make a small change to the format function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look. Thanks!
|
I generally agree with @aljoscha and once #294 is done, and RunnableOnService tests will cover those use cases, they might be removed from Spark runner tests as well. |
|
Hi @jasonkuster! +1 for enabling end-to-end tests for all Runners. A couple questions: I wonder why do you remove a Flink TfIdf integration test and add one for Spark? |
|
Hey @mxm! I removed in Flink and added in Spark just to see what the two different methods of resolving the dependency issues would look like. I'm happy to do either for either one, but based on the above comments it looks like the |
|
(should be rebased, given relevant changes to the pom.) |
|
@jasonkuster The +1 for merging from my side (needs rebasing though) |
Add KafkaIO to Contrib KafkaIO is an Unbounded source for reading from Apache Kafka. Backports KafkaIO from Apache Beam. See apache/incubator-beam 7b175df
Manually add portability page to content
Be sure to do all of the following to help us incorporate your contribution
quickly and easily:
[BEAM-<Jira issue #>] Description of pull requestmvn clean verify. (Even better, enableTravis-CI on your fork and ensure the whole test matrix passes).
<Jira issue #>in the title with the actual Jira issuenumber, if there is one.
Individual Contributor License Agreement.