Conversation

@kennknowles (Member) commented May 5, 2016

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

This PR demonstrates how to configure the integration tests. A number of transforms are not supported; perhaps we can solve this via Surefire exclusions.

@kennknowles (Member Author)

R: @amitsela just throwing this out there for comment; probably with your understanding it can come together quickly.

</exclusions>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
Member Author:

There's something amiss with the Maven config: omitting this results in a NoClassDefFoundError for Mockito classes.

Member:

According to the table here: https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html, the runner has a test-scoped dependency on java-sdk-all (classifier tests), and java-sdk-all has a test-scoped dependency on Mockito, so that dependency is not transitive anymore.
Actually, it seems test-scoped dependencies are never transitive.

Member Author:

Ah, yes, I think I see. To get the behavior automatically, we would probably need a java-sdk-tests artifact with a compile-scoped dependency for everything that is currently a test-scoped dependency. I propose solving this in the future instead of now.

Member:

Agree

@kennknowles force-pushed the spark-integration branch from d8a2a34 to 6b20a73 (May 6, 2016 01:27)
"--streaming=false"
]
</beamTestPipelineOptions>
<dataflow.spark.test.reuseSparkContext>true</dataflow.spark.test.reuseSparkContext>
Member Author:

I discovered this property, and it solved some of the issues, though the result appears to be extremely slow.

@amitsela (Member) commented May 6, 2016:

Since all the tests run in the same JVM, and only one SparkContext is allowed per JVM, you have to use this flag.
This should actually make things (a little bit) faster, since you're not creating the context for each test.
Keep in mind that every test that runs a Spark pipeline will spin up a standalone Spark instance - and as you can see in the runner unit tests, that takes ~3-4 seconds each.
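For context, a minimal sketch of what that flag implies, assuming a hypothetical SparkContextFactory and reusing the property name from the diff above (the runner's actual class and logic may differ):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hedged sketch, not the runner's actual code: when the reuse flag is set, all
// tests in the JVM share one JavaSparkContext instead of creating one per pipeline.
final class SparkContextFactory {

  private static final String REUSE_PROPERTY = "dataflow.spark.test.reuseSparkContext";
  private static JavaSparkContext sharedContext;

  private SparkContextFactory() {}

  static synchronized JavaSparkContext getSparkContext(String master, String appName) {
    SparkConf conf = new SparkConf().setMaster(master).setAppName(appName);
    if (Boolean.getBoolean(REUSE_PROPERTY)) {
      // Spark allows only one active context per JVM, so cache and reuse it.
      if (sharedContext == null) {
        sharedContext = new JavaSparkContext(conf);
      }
      return sharedContext;
    }
    return new JavaSparkContext(conf);
  }
}
```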

Member Author:

Is there startup/teardown time included in the 3-4 seconds that we might reduce?

Member:

Well, let me check, but take into account that it starts a standalone Spark node with a WebUI server and all...
Currently all tests run in a single thread; I need to check whether we could speed things up by increasing the number of threads without breaking some tests.

On it!

Member:

So I took a deeper look into this seconds-scale overhead we're hitting, and I don't see an easy way to avoid it. There are some optimizations we can do (e.g., turn off the Spark WebUI), but I couldn't find anything that makes a real difference - and that makes sense.
These "unit tests" are actually IT tests (though I've heard arguments about the terminology) that spin up a Spark instance, which takes a few seconds; that's reasonable, and it's the same for Flink (last time I checked).
This leads to a bigger question: do we want to execute ALL RunnableOnService tests on ALL runners? Each test that executes a pipeline will probably incur more overhead than it does on the Direct/InProcessPipelineRunner.
I guess it's a trade-off between what could become a very long build and a less comprehensive testing infrastructure.
We could also consider grouping tests so that one pipeline execution tests more than one component, but that's a compromise as well.

Thoughts?

@kennknowles (Member Author) commented May 12, 2016:

Yeah, we definitely don't want to run all of them all the time, especially if they are too expensive.

In particular, there are three conflated things that should be teased apart somehow:

  • A test that can run on a service. Basically, a test that does not depend on in-memory hackery.
  • A test that should run on each runner (either in precommit or postcommit) to confirm the capability matrix and that we don't break the runner by messing with the SDK.
  • A test that requires a runner but is really an SDK unit test. These are tests, for example, that the composite transforms are correctly assembled. This category is sort of new from moving all runners out of the SDK, and is mostly not that interesting to run on anything other than the new in-process runner.

The current RunnableOnService tag is the first one, but actually that one is the least interesting now, IMO. I thought I might get lucky and be able to gain some testing signal prior to properly separating these and organizing the tests by the capabilities they require (one possible split is sketched below).
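For illustration, a hedged sketch of how JUnit categories could tease these apart. The ValidatesRunner and NeedsRunner marker interfaces below are hypothetical (only RunnableOnService existed at the time of this discussion); Surefire/Failsafe can then include or exclude groups per runner:

```java
import org.junit.Test;
import org.junit.experimental.categories.Category;

// Hypothetical marker interfaces for the three kinds of tests described above.
interface RunnableOnService {}  // can run against a real service; no in-memory hackery
interface ValidatesRunner {}    // should run on every runner to confirm the capability matrix
interface NeedsRunner {}        // needs a runner, but is really an SDK unit test

public class ExampleTransformTest {

  @Test
  @Category({RunnableOnService.class, ValidatesRunner.class})
  public void runsOnEveryRunner() {
    // Build a pipeline, attach PAsserts, and run it on the runner configured
    // through beamTestPipelineOptions.
  }

  @Test
  @Category(NeedsRunner.class)
  public void compositeIsAssembledCorrectly() {
    // Only interesting on the in-process runner.
  }
}
```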

@kennknowles force-pushed the spark-integration branch from 6b20a73 to c562e11 (May 6, 2016 01:35)
@kennknowles (Member Author)

Two side notes:

  • Seems to take forever in Travis, whereas not so much on my machine.
  • Jenkins appears to fail as an "error", not a "test failure". CC: @davorbonaci @jasonkuster

*
* To create a Spark streaming pipeline runner use {@link SparkStreamingPipelineOptions}
*/
public final class TestSparkPipelineRunner extends PipelineRunner<EvaluationResult> {
Member:

Why do we need the TestSparkPipelineRunner?

Member Author:

Perhaps you don't - it is really just a hook to set configuration options. I see some test cases that seem to set them up prior to the run, so this might be a place for that knowledge to reside. You could do it in the pom, too.
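A minimal sketch of such a hook, assuming a SparkPipelineRunner.fromOptions factory and a SparkPipelineOptions.setSparkMaster setter; names and packages may differ from the code actually in this PR:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.runners.PipelineRunner;
// Spark runner classes assumed to live under this package at the time.
import org.apache.beam.runners.spark.EvaluationResult;
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkPipelineRunner;

// Hedged sketch: the wrapper only forces test-friendly options, then delegates.
public final class TestSparkPipelineRunner extends PipelineRunner<EvaluationResult> {

  private final SparkPipelineRunner delegate;

  private TestSparkPipelineRunner(SparkPipelineOptions options) {
    this.delegate = SparkPipelineRunner.fromOptions(options);
  }

  public static TestSparkPipelineRunner fromOptions(PipelineOptions options) {
    SparkPipelineOptions sparkOptions = options.as(SparkPipelineOptions.class);
    sparkOptions.setSparkMaster("local[4]"); // run locally for tests
    return new TestSparkPipelineRunner(sparkOptions);
  }

  @Override
  public EvaluationResult run(Pipeline pipeline) {
    return delegate.run(pipeline);
  }
}
```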

Member:

I don't really mind the additional class, but I'm wondering if you see some future use for it?
To rephrase: do you think runners will benefit from having a TestPipelineRunner? Taking it to the next level, should it be part of the Runner API?

Member Author:

This comment of yours was prescient, though I don't know if it is in the way you meant. The TestDataflowPipelineRunner does some additional bookkeeping, using aggregators, to make sure that PAsserts ran as expected. Otherwise you might have an assertion that simply never ran because the PCollection was empty, giving a false test pass. The logic really should be extracted into a generic TestPipelineRunner which could wrap a "sub" runner or some such. (Or it could just be a utility used in Java code to create the various test wrappers.)
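A rough, generic sketch of that bookkeeping, assuming a PAssert.countAsserts helper exists and leaving the aggregator query runner-specific (the real TestDataflowPipelineRunner differs in detail):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.runners.PipelineRunner;
import org.apache.beam.sdk.testing.PAssert;

// Hedged sketch: wrap any runner and verify that every PAssert actually executed.
public abstract class TestPipelineRunner<R extends PipelineResult> extends PipelineRunner<R> {

  private final PipelineRunner<R> delegate;

  protected TestPipelineRunner(PipelineRunner<R> delegate) {
    this.delegate = delegate;
  }

  /** Runner-specific: read PAssert's success aggregator (or metric) from the result. */
  protected abstract long successfulAssertions(R result);

  @Override
  public R run(Pipeline pipeline) {
    int expected = PAssert.countAsserts(pipeline); // assumed helper
    R result = delegate.run(pipeline);
    long successes = successfulAssertions(result);
    if (successes < expected) {
      // An assertion never ran (e.g. its input PCollection was empty),
      // so a "passing" result would be meaningless.
      throw new AssertionError(String.format(
          "Expected %d successful PAssert assertions, but only %d ran.", expected, successes));
    }
    return result;
  }
}
```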

@amitsela (Member) commented May 6, 2016

Funny, when I look at the Jenkins console output I see the same summary I have locally (Tests run: 165, Failures: 1, Errors: 53, Skipped: 1) but then I see: [ERROR] There are test failures.

@kennknowles (Member Author)

I just rebased and actually looked closer at the test results. It may sound crazy but I looked at each one :-) and here is my triage summary (without looking at the code) in order of frequency:

  1. Create.TimestampedValues missing a translator (note that it is a composite)
  2. Read.Bounded missing a translator.
  3. Some null pointer exceptions in ParDo with side outputs.
  4. Perhaps one real issue in windowed combine, not sure.

If a translator for Read.Bounded were added, it should also take care of Create.Values, which is currently treated as a primitive by the Spark runner, and Create.TimestampedValues should definitely work too (I think it already should; I'm a bit confused about that).
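For reference, a hypothetical pipeline fragment whose translation would exercise the two missing translators named above; packages and factory methods are as recalled for that era of the SDK and may differ slightly:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.CountingSource;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Instant;

// Illustrative only, not code from this PR.
public class MissingTranslatorsExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // 1. Create.TimestampedValues — a composite transform, not a primitive.
    p.apply(Create.timestamped(
        TimestampedValue.of("a", new Instant(0)),
        TimestampedValue.of("b", new Instant(1))));

    // 2. Read.Bounded — reading from a BoundedSource.
    p.apply(Read.from(CountingSource.upTo(10)));

    p.run();
  }
}
```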

@amitsela (Member) commented May 13, 2016

Wow, thanks for the triage; I will take a look. I think it's definitely time for the Spark runner to conform to Read.from as in the Runner API. And yes, Create.TimestampedValues is supported, and used in WindowedWordCountTest.

@amitsela (Member) commented May 15, 2016

So the SparkPipelineRunner overrides seem to be the problem: Create.TimestampedValues is an instance of Create.Values, so it's treated as a SinglePrimitiveOutputPTransform even though it's a composite.
This brings the count down to 33 errors and 6 failures. I'm not going to take care of this specifically, because I think handling it as part of implementing Read.from is a better use of resources.
But this also means that this PR might "hang" for a while. Is this a blocker for something immediate? If not, I could get to it within a week or two.

@kennknowles (Member Author)

Ah, I missed the question at the end - not a blocker for anything in particular. Just working towards the ideal of having a test suite for the capability matrix. And also working out the kinks in the integration testing framework to be sure it is flexible enough for all our runners.

@jasonkuster (Contributor)

What's the story with this PR? I'm working on getting Spark running the E2E tests in #345 and this was listed as a blocker there.

@amitsela (Member) commented Jun 8, 2016

The main issue here is the Read.from implementation. I will try to get to that, but since the Spark API is changing now, this might take a while.

@kennknowles force-pushed the spark-integration branch 4 times, most recently from 06cf37f to 11bcccc (June 16, 2016 03:44)
@kennknowles (Member Author)

After a rebase, all tests are passing! It takes 6 minutes, but we are way up there already in build time so I'm inclined to merge now and optimize later.

@amitsela (Member)

Awesome!
BTW, how did you get past the tests that use Read.from?

@kennknowles (Member Author)

Ah, of course: they are now separated into the various sdks/io modules. So everything in sdks/core (Create, TextIO, ...) is handled specially by the Spark runner.

Still, a great step. We would have arrived at the same point by using test categories to filter, but this is convenient.

This reminds me: I think the capability matrix suggests the Read.from transform is implemented, so we should update that.

@amitsela (Member)

In the Source API section? True. I think I didn't really get this one at first.

@asfgit merged commit 4254749 into apache:master on Jun 20, 2016
asfgit pushed a commit that referenced this pull request Jun 20, 2016
@kennknowles deleted the spark-integration branch on November 10, 2016 03:10
iemejia pushed a commit to iemejia/beam that referenced this pull request Jan 12, 2018