[BEAM-124] Spark Running WordCountIT Example #769
|
Are we modifying wordcount so that it counts the words in the apache license instead of the shakespeare example? |
|
The reason is that SparkRunner can't resolve |
|
Sure, makes sense to me. |
|
"Unable to find any files matching gs://dataflow-samples/apache/LICENSE" from the Jenkins output - looks like until we get that updated this will break presubmits. |
|
@jasonkuster Sorry for not pointing that out. I don't have write access to "gs://dataflow-samples/". Can you grant me access or help me upload the file? |
public static class InputFactory implements DefaultValueFactory<String> {
  @Override
  public String create(PipelineOptions options) {
    if (options.getRunner().isAssignableFrom(SparkRunner.class)) {
|
Do we want to have everyone use GCS by default?
What if Dataflow was the only one that used the GCS one?
Also, this sets a poor precedent where there is "runner" specific configuration being done on a per test basis.
|
Yes, we want everyone to use GCS by default, and FlinkRunner already supports it. But as far as I know, WordCountIT can't run on SparkRunner with a path starting with "gs://". This is a temporary solution so that this E2E test can be added to the pre/post-submit tests. Otherwise, the SparkRunner side will be a blocker.
|
Can we construct the input file path in WordCountIT, and pass it to WordCount?
|
Yes, use the --inputFile flag. Put this key-value pair inside -DintegrationTestPipelineOptions.
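A minimal sketch of the suggestion in this thread, assuming the test (not the example) chooses the input path and hands it over as an `--inputFile` argument. The class `WordCountItArgs`, the method `withInputFile`, and the path `/tmp/LICENSE` are hypothetical names used only for illustration; none of them come from Beam or from this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Beam: builds the arguments a test could pass to WordCount. */
public class WordCountItArgs {

  /**
   * Returns the base arguments plus an explicit --inputFile, so the example
   * itself needs no runner-specific default.
   */
  public static String[] withInputFile(String[] baseArgs, String inputFile) {
    List<String> args = new ArrayList<>();
    for (String arg : baseArgs) {
      // Drop any earlier --inputFile so the path chosen by the test wins.
      if (!arg.startsWith("--inputFile=")) {
        args.add(arg);
      }
    }
    args.add("--inputFile=" + inputFile);
    return args.toArray(new String[0]);
  }

  public static void main(String[] unused) {
    String[] base = {"--tempRoot=/tmp", "--runner=org.apache.beam.runners.spark.SparkRunner"};
    // "/tmp/LICENSE" is a placeholder local path, assumed for illustration only.
    String[] args = withInputFile(base, "/tmp/LICENSE");
    System.out.println(String.join(" ", args));
  }
}
```

With this shape, the runner check stays out of the example's options class: each runner's test configuration supplies a path it can actually read.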
|
R: @lukecwik |
|
Working with @dhalperi to put new test data in a proper directory. |
Force-pushed: 1e6ec6e to e979c82
|
PTAL @lukecwik
|
|
LGTM, will merge once jenkins/travis runs finish |
Be sure to do all of the following to help us incorporate your contribution
quickly and easily:
- Make sure the PR title is formatted like: [BEAM-<Jira issue #>] Description of pull request
- Make sure tests pass via mvn clean verify. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes).
- Replace <Jira issue #> in the title with the actual Jira issue number, if there is one.
- If this contribution is large, please file an Apache Individual Contributor License Agreement.
gs:// right now.
gs://apache-beam-samples/apache/LICENSE
The following command is used to run WordCountIT with SparkRunner:
mvn clean verify -pl examples/java -DskipITs=false -Dit.test=WordCountIT -DintegrationTestPipelineOptions='[ "--tempRoot=/tmp", "--runner=org.apache.beam.runners.spark.SparkRunner" ]'
This PR duplicates PR #703, since we want to review Flink and Spark separately.