[BEAM-3734] Performance tests for XmlIO #4747

lgajowy · 2018-02-26T13:59:48Z

This PR adds IO Integration Tests for XmlIO. It includes:

The tests
mvn configuration for running them (with or without perfkit)
Jenkins jobs to be triggered periodically or on demand

Why/how: https://beam.apache.org/documentation/io/testing/#implementing-integration-tests

lgajowy · 2018-02-26T14:19:27Z

Run seed job

lgajowy · 2018-02-26T14:22:50Z

Run XmlIO Sink and ReadFiles Performance Test

The test can be parametrized with charset, number of records and filename prefix.

- generify getHashForRecordCount() (it's reusable in all IOITs now)
- move and rename appendTimestampToPrefix method
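
The two helpers named in the commit message might look roughly like the following. This is a minimal sketch, not the PR's code: the class name `IOTestHelperSketch`, the method signatures, the timestamp format, and the placeholder hash values are all assumptions made for illustration.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the shared test helper; names and signatures are assumed. */
public class IOTestHelperSketch {

  /** Appends a millisecond timestamp so that repeated runs write to distinct file prefixes. */
  public static String appendTimestampToPrefix(String filenamePrefix) {
    return String.format("%s_%d", filenamePrefix, new Date().getTime());
  }

  /** Looks up the expected hash for a record count and fails loudly for unsupported counts. */
  public static String getHashForRecordCount(long recordCount, Map<Long, String> hashes) {
    String hash = hashes.get(recordCount);
    if (hash == null) {
      throw new UnsupportedOperationException(
          "No expected hash registered for " + recordCount + " records");
    }
    return hash;
  }

  public static void main(String[] args) {
    // Placeholder values, not real hashes from the PR.
    Map<Long, String> expectedHashes = new HashMap<>();
    expectedHashes.put(1000L, "placeholder-hash-for-1000");
    expectedHashes.put(100000L, "placeholder-hash-for-100000");

    System.out.println(appendTimestampToPrefix("XMLIOIT"));
    System.out.println(getHashForRecordCount(1000L, expectedHashes));
  }
}
```

Keeping the expected hashes in a map passed by the caller is what would make such a lookup reusable across IOITs: each test supplies its own table and shares the error handling.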

lgajowy · 2018-02-26T15:39:49Z

@jkff could you please take a look? You seem to know the IO's code best.

I realized (after the trials visible above) that I cannot test the Jenkins jobs here until this code is merged to master. Perfkit (in the Jenkins jobs) runs git clone https://github.com/apache/beam.git and works on that code. There is no XmlIOIT-related code on master yet, so it cannot run those tests on the PR. This is visible in the logs of my failed trials. I see two possible moves now:

revert the jenkins jobs and merge as soon as the xml tests are ok and merged
leave the PR as is and fix the jobs if there will be problems

WDYT?

CC: @chamikaramj

chamikaramj

Thanks.

chamikaramj · 2018-02-28T18:40:24Z

.test-infra/jenkins/job_beam_PerformanceTests_XmlIO_IT.groovy

+                prCommitStatusName: 'Java XmlIO Sink and ReadFiles Performance Test',
+                prTriggerPhase    : 'Run XmlIO Sink and ReadFiles Performance Test',
+                extraPipelineArgs: [
+                        numberOfRecords: '100000000',


How long does this test take to run for 100000000 elements ?

These two are run separately:
writeThenReadViaSinkAndReadFiles(): ~20 minutes
writeThenReadViaWriteAndRead(): ~ 15 minutes
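
For a rough sense of scale, those durations translate into records per second as shown below. This is back-of-the-envelope arithmetic on the approximate figures quoted above, not a measurement; the class and method names are made up for illustration.

```java
public class ThroughputEstimate {

  /** Average records per second for a run of the given length in minutes. */
  public static long recordsPerSecond(long records, long minutes) {
    return records / (minutes * 60);
  }

  public static void main(String[] args) {
    long numberOfRecords = 100_000_000L; // value from the Jenkins job definition

    // ~20 minutes for writeThenReadViaSinkAndReadFiles()
    System.out.println(recordsPerSecond(numberOfRecords, 20)); // prints 83333
    // ~15 minutes for writeThenReadViaWriteAndRead()
    System.out.println(recordsPerSecond(numberOfRecords, 15)); // prints 111111
  }
}
```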

chamikaramj · 2018-02-28T19:30:27Z

sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/DeleteFileFn.java

+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *


Shouldn't this file be under java/io/file-based-io-tests ?

In that case, should the xml module depend on the file-based-io-tests module? My intention was to move code that is used by both modules to a place that both modules can easily access (hence the common module).

chamikaramj · 2018-02-28T19:31:29Z

sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestHelper.java

+package org.apache.beam.sdk.io.common;
+
+import java.util.Date;
+import java.util.Map;


Ditto - please move this to 'java/io/file-based-io-tests' if this is only intended to be used by file-based tests.

The same question as above - the XmlIOIT tests are in a separate xml module.

chamikaramj · 2018-02-28T19:33:50Z

sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java


  void setCompressionType(String compressionType);
+
+  /* Xml */


Update comment to "Used by XmlIOIT"

chamikaramj · 2018-02-28T19:49:22Z

sdks/java/io/xml/build.gradle

+/*
+ * We need to rely on manually specifying these evaluationDependsOn to ensure that
+ * the following projects are evaluated before we evaluate this project. This is because
+ * we are attempting to reference the "sourceSets.test.output" directly.


Where do we attempt to reference "sourceSets.test.output" ?

You are right! We don't need it here - my bad.

chamikaramj · 2018-02-28T20:03:49Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+    PCollection<String> testFileNames = pipeline
+      .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords))
+      .apply("Create Birds", MapElements.via(new LongToBird()))
+      .apply("Write birds to xml files", FileIO.<Bird>write()


"Write XML files"

chamikaramj · 2018-02-28T20:14:48Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+  }
+
+  @Test
+  public void writeThenReadViaSinkAndReadFiles() {


Will it be easier to just add a ReadAll transform to XMLIO (so that we can reduce this to a single test writeThenReadAll) ?

I think adding a ReadAll transform is a good idea. IMO, it would be even better if a Write transform returning output filenames is added too. I can add those. Should I do this in this PR or maybe finish this one first and then add the transforms? It looks like a separate JIRA issue to me. WDYT?

chamikaramj · 2018-02-28T21:04:36Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+        .withCharset(charset));
+
+    PCollection<String> consolidatedHashcode = birds
+      .apply("Map birds to strings", MapElements.via(new BirdToString()))


Map XML records to strings

chamikaramj · 2018-02-28T21:05:17Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+  }
+
+  @Test
+  public void writeThenReadViaWriteAndRead() {


Not sure what this name means. Could you please update it?

chamikaramj · 2018-02-28T21:06:56Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+
+    pipeline.run().waitUntilFinish();
+
+    PCollection<String> consolidatedHashcode = readPipeline


Why do we need both single pipeline and two pipeline versions of this test ?

I tested the two ways of using the XmlIO:

using FileIO with XmlIO.sink and then using XmlIO.readFiles() method (this involves FileIO usage but allows us to perform all operations on one pipeline)

using write() and read() methods from XmlIO only (this needs two pipelines)

This (in my opinion) has the following benefits:

tests both scenarios (which can be helpful in finding regressions)

shows two ways of implementing read and write scenario

can show the difference between execution times of the tests

It was low effort for me, so I decided to keep both. :) If this is not needed I can delete the "two pipeline" one. Should I?
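
Both variants follow the same pattern: write the records, read them back, and verify a consolidated hash. The sketch below shows that pattern with only the JDK, so it is an analogy rather than Beam code: it does not use XmlIO, FileIO, or a pipeline, and the record layout, the class name, and the XOR-of-SHA-256 consolidation are assumptions chosen for brevity.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WriteThenReadSketch {

  /** Renders record i as one XML element (hypothetical layout). */
  public static String toXml(long i) {
    return "<bird><name>Testing_" + i + "</name></bird>";
  }

  /**
   * Order-independent digest: XOR of the SHA-256 hashes of all records.
   * Note that identical records cancel out under XOR; the records here are distinct.
   */
  public static String consolidatedHash(List<String> records) throws NoSuchAlgorithmException {
    byte[] combined = new byte[32];
    for (String record : records) {
      byte[] h = MessageDigest.getInstance("SHA-256")
          .digest(record.getBytes(StandardCharsets.UTF_8));
      for (int i = 0; i < combined.length; i++) {
        combined[i] ^= h[i];
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : combined) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /** Writes the records to a temp file, reads them back, and deletes the file afterwards. */
  public static List<String> roundTrip(long numberOfRecords, Charset charset) throws IOException {
    Path file = Files.createTempFile("xmlioit-sketch", ".xml");
    try {
      StringBuilder doc = new StringBuilder("<birds>");
      for (long i = 0; i < numberOfRecords; i++) {
        doc.append(toXml(i));
      }
      doc.append("</birds>");
      Files.write(file, doc.toString().getBytes(charset));

      String content = new String(Files.readAllBytes(file), charset);
      List<String> read = new ArrayList<>();
      Matcher m = Pattern.compile("<bird>.*?</bird>").matcher(content);
      while (m.find()) {
        read.add(m.group());
      }
      return read;
    } finally {
      Files.deleteIfExists(file);
    }
  }

  public static void main(String[] args) throws Exception {
    long numberOfRecords = 1000;
    List<String> expected = new ArrayList<>();
    for (long i = 0; i < numberOfRecords; i++) {
      expected.add(toXml(i));
    }
    List<String> actual = roundTrip(numberOfRecords, StandardCharsets.UTF_8);
    // The test passes when the hash of what was read matches the hash of what was generated.
    System.out.println(consolidatedHash(expected).equals(consolidatedHash(actual)));
  }
}
```

Comparing one consolidated hash instead of the full record list keeps the check cheap for large record counts and insensitive to the order in which records are read back.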

lgajowy · 2018-03-02T16:36:49Z

Thank you @chamikaramj! Please find my questions in above comments.

chamikaramj

Thanks.

chamikaramj · 2018-03-05T21:58:53Z

sdks/java/io/xml/build.gradle

+/*
+ * We need to rely on manually specifying these evaluationDependsOn to ensure that
+ * the following projects are evaluated before we evaluate this project. This is because
+ * we are attempting to reference the "sourceSets.test.output" directly.


lgajowy wrote:
You are right! We don't need it here - my bad.

Seems like the code wasn't updated. Forgot to push?

chamikaramj · 2018-03-05T21:58:53Z

.test-infra/jenkins/job_beam_PerformanceTests_XmlIO_IT.groovy

+                prCommitStatusName: 'Java XmlIO Sink and ReadFiles Performance Test',
+                prTriggerPhase    : 'Run XmlIO Sink and ReadFiles Performance Test',
+                extraPipelineArgs: [
+                        numberOfRecords: '100000000',


lgajowy wrote:
These two are run separately:
writeThenReadViaSinkAndReadFiles(): ~20 minutes
writeThenReadViaWriteAndRead(): ~ 15 minutes

Acknowledged.

chamikaramj · 2018-03-05T21:58:53Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+  public void writeThenReadViaSinkAndReadFiles() {
+    PCollection<String> testFileNames = pipeline
+      .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords))
+      .apply("Create Birds", MapElements.via(new LongToBird()))


lgajowy wrote:
Ok

Please update.

chamikaramj · 2018-03-05T21:58:53Z

sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/DeleteFileFn.java

+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *


lgajowy wrote:
In such case, should xml module depend on file-based-io-tests module? My intention was to move code that is used by both modules to a place where both modules could easily access (hence the common module).

I think we should move XmlIOIT to the file-based-io-tests module as well, given that all other file-based IO tests are in this module. This will allow us to consolidate all similar ITs and common classes/resources.

chamikaramj · 2018-03-05T21:58:53Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+  }
+
+  @Test
+  public void writeThenReadViaSinkAndReadFiles() {


lgajowy wrote:
I think adding a ReadAll transform is a good idea. IMO, it would be even better if a Write transform returning output filenames is added too. I can add those. Should I do this in this PR or maybe finish this one first and then add the transforms? It looks like a separate JIRA issue to me. WDYT?

Agree that it should be a separate JIRA/PR and should not block this PR.

chamikaramj · 2018-03-05T21:58:53Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+
+    pipeline.run().waitUntilFinish();
+
+    PCollection<String> consolidatedHashcode = readPipeline


lgajowy wrote:
I tested the two ways of using the XmlIO:

using FileIO with XmlIO.sink and then using XmlIO.readFiles() method (this involves FileIO usage but allows us to perform all operations on one pipeline)

using write() and read() methods from XmlIO only (this needs two pipelines)

This (in my opinion) has the following benefits:

tests both scenarios (which can be helpful in finding regressions)

shows two ways of implementing read and write scenario

can show the difference between execution times of the tests

It was a low effort to me so I decided to leave both. :) If this is not needed I can delete the "two pipeline" one. Should I?

I think we should simplify and go with the single pipeline option. There is not enough of a diff between XmlIO.readFiles() and XmlIO.read() to justify an additional test.

chamikaramj · 2018-03-05T21:58:53Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+    PCollection<String> testFileNames = pipeline
+      .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords))
+      .apply("Create Birds", MapElements.via(new LongToBird()))
+      .apply("Write birds to xml files", FileIO.<Bird>write()


chamikaramj wrote:
"Write XML files"

Ditto.

chamikaramj · 2018-03-05T21:58:54Z

sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java


  void setCompressionType(String compressionType);
+
+  /* Xml */


lgajowy wrote:
Ok

Seems like this hasn't been done yet.

chamikaramj · 2018-03-05T21:58:54Z

sdks/java/io/xml/src/test/java/org/apache/beam/sdk/io/xml/XmlIOIT.java

+  }
+
+  @Test
+  public void writeThenReadViaWriteAndRead() {


lgajowy wrote:
Ok

Ditto.

lgajowy · 2018-03-07T12:40:49Z

@chamikaramj Thanks again. I posted the fixes. Could you take a look?

Two important points:

I also removed the separate Jenkins job file for XML. It is no longer needed since we have only one test now. The Jenkins job configuration was added to the common Jenkins job file for file-based IO tests: .test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
I created issues we discussed: BEAM-3796, BEAM-3795

chamikaramj · 2018-03-07T19:19:43Z

LGTM.

Will merge after tests pass.

chamikaramj · 2018-03-07T19:20:10Z

Run seed job

chamikaramj · 2018-03-07T19:45:13Z

Run Java XmlIO Performance Test

lgajowy · 2018-03-07T20:35:14Z

A kind reminder: please see the comment above

chamikaramj · 2018-03-07T20:55:58Z

Got it. I think it's fine to merge this and fix any issues.

Still have to wait for pre-commits to pass :)

chamikaramj · 2018-03-09T01:55:46Z

Retest this please

lgajowy force-pushed the xml-io-it branch from 1b00522 to 7584537 Compare February 26, 2018 14:18

lgajowy added 6 commits February 26, 2018 16:25

[BEAM-3734] Add XmlIOIT using sink and readFiles()

9eca6e1

The test can be parametrized with charset, number of records and filename prefix.

[BEAM-3734] Refactor. Reduce code duplication

f2b87a0

- generify getHashForRecordCount() (it's reusable in all IOITs now)
- move and rename appendTimestampToPrefix method

[BEAM-3734] Add Perfkit and Dataflow support

e538394

[BEAM-3734] Add hashes for various record quantity

a7a744b

[BEAM-3734] Add another XmlIOIT using write and read

8642edf

[BEAM-3734] Add Jenkins job definitions for Large scale tests

e3960b9

lgajowy force-pushed the xml-io-it branch from 7584537 to e3960b9 Compare February 26, 2018 15:25

lgajowy mentioned this pull request Feb 27, 2018

[BEAM-3217] Jenkins job for HadoopInputFormatIOIT #4758

Merged

10 tasks

chamikaramj self-requested a review February 28, 2018 19:48

chamikaramj reviewed Feb 28, 2018

chamikaramj reviewed Mar 5, 2018

[BEAM-3734] Post code review fixes

c8b66c3

lgajowy force-pushed the xml-io-it branch from 7120219 to c8b66c3 Compare March 7, 2018 13:51

chamikaramj merged commit bd3c087 into apache:master Mar 9, 2018

lgajowy deleted the xml-io-it branch March 14, 2018 11:32

