@jasonkuster
Contributor

@jasonkuster jasonkuster commented May 17, 2016

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

…park as supported runners in examples.

Signed-off-by: Jason Kuster <jason@google.com>
@jasonkuster
Contributor Author

R: @mxm @amitsela

Hey Max, Amit.

Looking for some feedback on this pull request. The purpose is to remove the Spark and Flink runners' dependencies on the Beam examples so that WordCountIT in examples can run on them, as it currently does on Dataflow. As things were in the codebase, both Spark and Flink depended on examples for some of the code in WordCount.java and TfIdf.java. In the Flink case I've removed the tests; in the Spark case I've just added the code in. I'd love to hear your thoughts on what the right thing to do is going forward.

The benefit is that this is where the new end-to-end tests seem to be heading: they can be written in a runner-agnostic way and then run on a given runner just by flipping a few flags (for example, see the commands below for running this test). Let me know your thoughts!

mvn clean verify -pl examples/java -am -rf :java-examples-all -DskipITs=false -DintegrationTestPipelineOptions='[ "--tempRoot=/tmp", "--inputFile=/tmp/kinglear.txt", "--runner=org.apache.beam.runners.spark.SparkPipelineRunner", "--sparkMaster=local" ]'

mvn clean verify -pl examples/java -am -rf :java-examples-all -DskipITs=false -DintegrationTestPipelineOptions='[ "--tempRoot=/tmp", "--inputFile=/tmp/kinglear.txt", "--runner=org.apache.beam.runners.flink.FlinkPipelineRunner" ]'

mvn clean verify -pl examples/java -am -rf :java-examples-all -DskipITs=false -DintegrationTestPipelineOptions='[ "--tempRoot=gs://clouddfe-testing-temp-storage", "--runner=org.apache.beam.sdk.testing.TestDataflowPipelineRunner" ]'
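
The commands above differ only in the options array passed via -DintegrationTestPipelineOptions. As a rough sketch of that idea, the shared options can be combined with a runner class name and rendered in the same array format. This is plain Java, not Beam code: the class RunnerFlags and its methods withRunner, parse, and toJsonArray are hypothetical names introduced purely for illustration, and the rendering assumes option values contain no double quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of Beam.
final class RunnerFlags {

    // Returns a copy of the shared options with a --runner flag appended.
    static List<String> withRunner(List<String> shared, String runnerClass) {
        List<String> args = new ArrayList<>(shared);
        args.add("--runner=" + runnerClass);
        return args;
    }

    // Parses arguments of the form --key=value into an insertion-ordered map.
    static Map<String, String> parse(List<String> args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 0) {
                throw new IllegalArgumentException("Expected --key=value, got: " + arg);
            }
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    // Renders the arguments as a quoted array such as [ "--a=1", "--b=2" ].
    // Assumes values contain no double quotes, so no escaping is performed.
    static String toJsonArray(List<String> args) {
        StringBuilder sb = new StringBuilder("[ ");
        for (int i = 0; i < args.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append('"').append(args.get(i)).append('"');
        }
        return sb.append(" ]").toString();
    }

    public static void main(String[] unused) {
        // Options shared by the local Spark and Flink commands above.
        List<String> shared = Arrays.asList("--tempRoot=/tmp", "--inputFile=/tmp/kinglear.txt");
        for (String runner : Arrays.asList(
                "org.apache.beam.runners.spark.SparkPipelineRunner",
                "org.apache.beam.runners.flink.FlinkPipelineRunner")) {
            System.out.println(toJsonArray(withRunner(shared, runner)));
        }
    }
}
```

Runner-specific extras, such as --sparkMaster=local in the Spark command, would simply be added to the list before rendering.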

Jason

@jasonkuster jasonkuster changed the title from "[BEAM-] Flink and Spark running Examples WordCountIT" to "[BEAM-124] Flink and Spark running Examples WordCountIT" May 17, 2016
@aljoscha
Contributor

I think this is the right way to go. In #343 I'm also removing these two examples because all RunnableOnService tests will be executed on Flink with those changes.

* Count) as a reusable PTransform subclass. Using composite transforms allows for easy reuse,
* modular testing, and an improved monitoring experience.
*/
public static class CountWords extends PTransform<PCollection<String>,
Member

Could use SimpleWordCountTest.CountWords instead. Maybe need to make a small change to the format function.

Contributor Author

I'll take a look. Thanks!

@amitsela
Member

I generally agree with @aljoscha. Once #294 is done and the RunnableOnService tests cover those use cases, they could be removed from the Spark runner tests as well.
@davorbonaci, your thoughts on the examples pom.xml?

@mxm
Contributor

mxm commented May 23, 2016

Hi @jasonkuster! +1 for enabling end-to-end tests for all runners. A couple of questions: why did you remove a Flink TfIdf integration test and add one for Spark? ☺️ Presumably because the RunnableOnService tests are not yet integrated with the Spark runner?

@jasonkuster
Contributor Author

Hey @mxm! I removed the tests in Flink and added the code in Spark just to see what the two different methods of resolving the dependency issues would look like. I'm happy to do either for either one, but based on the above comments it looks like the RunnableOnService tests are in progress on both Spark and Flink, so once those are done and in, it sounds like the right thing to do is just to remove the offending tests. I'm flexible though; my goal is just to get the E2E tests running everywhere. 😄

@davorbonaci
Member

(should be rebased, given relevant changes to the pom.)

@mxm
Contributor

mxm commented May 24, 2016

@jasonkuster The RunnableOnService tests are integrated with the Flink Runner for batch execution, so removing the batch examples is fine. The streaming side still needs side-input support before it can run the tests.

+1 for merging from my side (needs rebasing though)

@jasonkuster jasonkuster closed this Aug 8, 2016
dhalperi pushed a commit to dhalperi/beam that referenced this pull request Aug 23, 2016
Add KafkaIO to Contrib

KafkaIO is an Unbounded source for reading from Apache Kafka.

Backports KafkaIO from Apache Beam. See apache/incubator-beam
7b175df
iemejia pushed a commit to iemejia/beam that referenced this pull request Jan 12, 2018
Manually add portability page to content