Skip to content

Conversation

@markflyhigh
Copy link
Contributor

@markflyhigh markflyhigh commented Aug 2, 2016

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

  • Add Spark dependency in order to use SparkRunner to execute WordCountIT.
  • Change default test file and add to project which avoid the problem that SparkRunner can't resove gs:// right now.
  • New test input is: gs://apache-beam-samples/apache/LICENSE

Following command is used to run WordCountIT with SparkRunner:

mvn clean verify -pl examples/java -DskipITs=false -Dit.test=WordCountIT -DintegrationTestPipelineOptions='[ "--tempRoot=/tmp", "--runner=org.apache.beam.runners.spark.SparkRunner" ]'

This PR is duplicated from PR(#703), since we want to have Flink and Spark in separate review.

@markflyhigh
Copy link
Contributor Author

+R: @amitsela @dhalperi

@jasonkuster
Copy link
Contributor

Are we modifying wordcount so that it counts the words in the apache license instead of the shakespeare example?

@markflyhigh
Copy link
Contributor Author

For the reason of SparkRunner can't resolve gs:// right now. Also there is a concern that using apache license instead of shakespeare example can avoid apache license issue if we want to put it in the project build.

@jasonkuster
Copy link
Contributor

Sure, makes sense to me.

@jasonkuster
Copy link
Contributor

"Unable to find any files matching gs://dataflow-samples/apache/LICENSE" from the Jenkins output - looks like until we get that updated this will break presubmits.

@markflyhigh
Copy link
Contributor Author

@jasonkuster Sorry for point it out. I don't have write access right to "gs://dataflow-samples/", can you give me the authentication or help me upload the file?

public static class InputFactory implements DefaultValueFactory<String> {
@Override
public String create(PipelineOptions options) {
if (options.getRunner().isAssignableFrom(SparkRunner.class)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to have everyone use GCS by default?

What if Dataflow was the only one that used the GCS one?

Also, this sets a poor precedent where there is "runner" specific configuration being done on a per test basis.

Copy link
Contributor Author

@markflyhigh markflyhigh Aug 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we want everyone use GCS by default, and FilnkRunner already support it. But WordCountIT can't use SparkRunner with path starting with "gs://" as for as I know. This is one tmp solution in order to aggregate this E2E test to pre/post-submit test. Otherwise, SparkRunner side will be a blocker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we construct the input file path in WordCountIT, and pass it to WordCount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, use --inputFile flag. Put this key value pair inside -DintegrationTestPipelineOptions.

@lukecwik
Copy link
Member

lukecwik commented Aug 9, 2016

R: @lukecwik

@markflyhigh
Copy link
Contributor Author

Working with @dhalperi to put new test data in a proper directory.

@markflyhigh markflyhigh force-pushed the wordcount-e2e-spark-runner branch from 1e6ec6e to e979c82 Compare August 9, 2016 22:54
@markflyhigh
Copy link
Contributor Author

markflyhigh commented Aug 9, 2016

PTAL @lukecwik

  • New test input is updated on GCS and verified.
  • Solve the merge conflicts.

@lukecwik
Copy link
Member

lukecwik commented Aug 9, 2016

LGTM, will merge once jenkins/travis runs finish

@asfgit asfgit merged commit e979c82 into apache:master Aug 10, 2016
asfgit pushed a commit that referenced this pull request Aug 10, 2016
@markflyhigh markflyhigh deleted the wordcount-e2e-spark-runner branch November 7, 2016 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants