@lgajowy
Contributor

@lgajowy lgajowy commented Feb 26, 2018

This PR adds IO integration tests for XmlIO. It includes:

  • The tests
  • mvn configuration for running them (with or without perfkit)
  • Jenkins jobs to be triggered periodically or on demand

Why/how: https://beam.apache.org/documentation/io/testing/#implementing-integration-tests


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand:
    • What the pull request does
    • Why it does it
    • How it does it
    • Why this approach
  • Each commit in the pull request should have a meaningful subject line and body.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@lgajowy
Contributor Author

lgajowy commented Feb 26, 2018

Run seed job

@lgajowy
Contributor Author

lgajowy commented Feb 26, 2018

Run XmlIO Sink and ReadFiles Performance Test

@lgajowy
Contributor Author

lgajowy commented Feb 26, 2018

@jkff could you please take a look? You seem to know the IO's code best.

I realized (after the trials visible above) that I cannot test the Jenkins jobs here until this code is merged to master. Perfkit (in the Jenkins jobs) runs git clone https://github.com/apache/beam.git and works on that code. There is no XmlIOIT-related code on master yet, so it cannot run those tests on the PR. This is visible in the logs of my failed trials. I see two possible moves now:

  • remove the Jenkins jobs from this PR and add them as soon as the XML tests are OK and merged
  • leave the PR as is and fix the jobs afterwards if problems appear

WDYT?

CC: @chamikaramj

Contributor

@chamikaramj chamikaramj left a comment

Thanks.

prCommitStatusName: 'Java XmlIO Sink and ReadFiles Performance Test',
prTriggerPhase : 'Run XmlIO Sink and ReadFiles Performance Test',
extraPipelineArgs: [
numberOfRecords: '100000000',

Contributor

How long does this test take to run for 100000000 elements?

Contributor Author

These two are run separately:
writeThenReadViaSinkAndReadFiles(): ~20 minutes
writeThenReadViaWriteAndRead(): ~ 15 minutes
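
For a rough sense of throughput, here is a minimal sketch that assumes the approximate timings above and the 100000000 records from the job configuration. `ThroughputEstimate` is a hypothetical helper for illustration only, not part of the PR, and the results are only as precise as the "~" timings.

```java
import java.time.Duration;

/** Back-of-the-envelope throughput for the timings quoted above (hypothetical helper). */
class ThroughputEstimate {

    /** Average records per second, rounded down. */
    static long recordsPerSecond(long records, Duration elapsed) {
        return records / elapsed.getSeconds();
    }

    public static void main(String[] args) {
        long numberOfRecords = 100_000_000L; // value from the Jenkins job configuration

        // Approximate durations reported in this thread.
        long sinkAndReadFiles = recordsPerSecond(numberOfRecords, Duration.ofMinutes(20));
        long writeAndRead = recordsPerSecond(numberOfRecords, Duration.ofMinutes(15));

        System.out.println("writeThenReadViaSinkAndReadFiles: ~" + sinkAndReadFiles + " records/s");
        System.out.println("writeThenReadViaWriteAndRead: ~" + writeAndRead + " records/s");
    }
}
```

That works out to roughly 83,000 and 111,000 records per second respectively, covering both the write and the read phase of each test.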

* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*

Contributor

Shouldn't this file be under java/io/file-based-io-tests?

Contributor Author

In that case, should the xml module depend on the file-based-io-tests module? My intention was to move the code used by both modules to a place that both can easily access (hence the common module).

package org.apache.beam.sdk.io.common;

import java.util.Date;
import java.util.Map;

Contributor

Ditto - please move this to 'java/io/file-based-io-tests' if this is only intended to be used by file-based tests.

Contributor Author

The same question as above - the XmlIOIT tests are in the separate xml module.


void setCompressionType(String compressionType);

/* Xml */

Contributor

Update comment to "Used by XmlIOIT"

Contributor Author

Ok

/*
* We need to rely on manually specifying these evaluationDependsOn to ensure that
* the following projects are evaluated before we evaluate this project. This is because
* we are attempting to reference the "sourceSets.test.output" directly.

Contributor

Where do we attempt to reference "sourceSets.test.output"?

Contributor Author

You are right! We don't need it here - my bad.

PCollection<String> testFileNames = pipeline
.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords))
.apply("Create Birds", MapElements.via(new LongToBird()))
.apply("Write birds to xml files", FileIO.<Bird>write()

Contributor

"Write XML files"

}

@Test
public void writeThenReadViaSinkAndReadFiles() {

Contributor

Would it be easier to just add a ReadAll transform to XmlIO (so that we can reduce this to a single test, writeThenReadAll)?

Contributor Author

I think adding a ReadAll transform is a good idea. IMO, it would be even better if a Write transform returning output filenames is added too. I can add those. Should I do this in this PR or maybe finish this one first and then add the transforms? It looks like a separate JIRA issue to me. WDYT?

.withCharset(charset));

PCollection<String> consolidatedHashcode = birds
.apply("Map birds to strings", MapElements.via(new BirdToString()))

Contributor

Map XML records to strings

Contributor Author

Ok

}

@Test
public void writeThenReadViaWriteAndRead() {

Contributor

Not sure what this name means. Could you please update it?

Contributor Author

Ok


pipeline.run().waitUntilFinish();

PCollection<String> consolidatedHashcode = readPipeline

Contributor

Why do we need both the single-pipeline and two-pipeline versions of this test?

Contributor Author

I tested the two ways of using XmlIO:

  • using FileIO with XmlIO.sink and then using XmlIO.readFiles() method (this involves FileIO usage but allows us to perform all operations on one pipeline)
  • using write() and read() methods from XmlIO only (this needs two pipelines)

This (in my opinion) has the following benefits:

  • tests both scenarios (which can be helpful in finding regressions)
  • shows two ways of implementing a read-and-write scenario
  • can show the difference between execution times of the tests

It was low effort for me, so I decided to keep both. :) If this is not needed, I can delete the "two pipeline" one. Should I?
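
Both variants end by reducing the records read back to a single `consolidatedHashcode` (see the snippets in this review), so a test only needs to compare one value. Below is a minimal plain-Java sketch of that idea, assuming an order-independent combination of per-record hashes. It does not use Beam and is not the hashing code the tests actually rely on; `ConsolidatedHashSketch` and `consolidatedHash` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one order-independent hash over a set of records. */
class ConsolidatedHashSketch {

    /** Sums the SHA-256 digests byte-wise, so record order does not matter. */
    static String consolidatedHash(List<String> records) {
        byte[] combined = new byte[32]; // SHA-256 digest length in bytes
        for (String record : records) {
            byte[] digest = sha256(record);
            for (int i = 0; i < combined.length; i++) {
                combined[i] += digest[i]; // addition modulo 256 is commutative
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : combined) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    private static byte[] sha256(String value) {
        try {
            return MessageDigest.getInstance("SHA-256")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    public static void main(String[] args) {
        // Records "written" and "read back" in a different order, as a parallel read might return them.
        List<String> written = Arrays.asList("bird-0", "bird-1", "bird-2");
        List<String> readBack = Arrays.asList("bird-2", "bird-0", "bird-1");
        System.out.println(consolidatedHash(written).equals(consolidatedHash(readBack))); // prints "true"
    }
}
```

Because the combination is commutative, the single-pipeline and two-pipeline tests could be checked against the same expected value regardless of how the records are sharded or ordered on read.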

@lgajowy
Contributor Author

lgajowy commented Mar 2, 2018

Thank you @chamikaramj! Please find my questions in the comments above.

Contributor

@chamikaramj chamikaramj left a comment

Thanks.

/*
* We need to rely on manually specifying these evaluationDependsOn to ensure that
* the following projects are evaluated before we evaluate this project. This is because
* we are attempting to reference the "sourceSets.test.output" directly.

Contributor

lgajowy wrote:
You are right! We don't need it here - my bad.

It seems the code wasn't updated. Did you forget to push?

prCommitStatusName: 'Java XmlIO Sink and ReadFiles Performance Test',
prTriggerPhase : 'Run XmlIO Sink and ReadFiles Performance Test',
extraPipelineArgs: [
numberOfRecords: '100000000',

Contributor

lgajowy wrote:
These two are run separately:
writeThenReadViaSinkAndReadFiles(): ~20 minutes
writeThenReadViaWriteAndRead(): ~ 15 minutes

Acknowledged.

public void writeThenReadViaSinkAndReadFiles() {
PCollection<String> testFileNames = pipeline
.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords))
.apply("Create Birds", MapElements.via(new LongToBird()))

Contributor

lgajowy wrote:
Ok

Please update.

* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*

Contributor

lgajowy wrote:
In that case, should the xml module depend on the file-based-io-tests module? My intention was to move the code used by both modules to a place that both can easily access (hence the common module).

I think we should move XmlIOIT to the file-based-io-tests module as well, given that all other file-based IO tests are in this module. This will allow us to consolidate all similar ITs and common classes/resources.

}

@Test
public void writeThenReadViaSinkAndReadFiles() {

Contributor

lgajowy wrote:
I think adding a ReadAll transform is a good idea. IMO, it would be even better if a Write transform returning output filenames is added too. I can add those. Should I do this in this PR or maybe finish this one first and then add the transforms? It looks like a separate JIRA issue to me. WDYT?

Agree that it should be a separate JIRA/PR and should not block this PR.


pipeline.run().waitUntilFinish();

PCollection<String> consolidatedHashcode = readPipeline

Contributor

lgajowy wrote:
I tested the two ways of using XmlIO:

  • using FileIO with XmlIO.sink and then using XmlIO.readFiles() method (this involves FileIO usage but allows us to perform all operations on one pipeline)
  • using write() and read() methods from XmlIO only (this needs two pipelines)

This (in my opinion) has the following benefits:

  • tests both scenarios (which can be helpful in finding regressions)
  • shows two ways of implementing a read-and-write scenario
  • can show the difference between execution times of the tests

It was low effort for me, so I decided to keep both. :) If this is not needed, I can delete the "two pipeline" one. Should I?

I think we should simplify and go with the single-pipeline option. There is not enough of a difference between XmlIO.readFiles() and XmlIO.read() to justify an additional test.

PCollection<String> testFileNames = pipeline
.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords))
.apply("Create Birds", MapElements.via(new LongToBird()))
.apply("Write birds to xml files", FileIO.<Bird>write()

Contributor

chamikaramj wrote:
"Write XML files"

Ditto.


void setCompressionType(String compressionType);

/* Xml */

Contributor

lgajowy wrote:
Ok

Seems like this hasn't been done yet.

}

@Test
public void writeThenReadViaWriteAndRead() {

Contributor

lgajowy wrote:
Ok

Ditto.

@lgajowy
Contributor Author

lgajowy commented Mar 7, 2018

@chamikaramj Thanks again. I posted the fixes. Could you take a look?

Two important points:

  • I also removed the separate Jenkins job file for XML. A separate file is not needed since we now have one test. The Jenkins job configuration was added to the common Jenkins job file for file-based IO tests: .test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
  • I created the issues we discussed: BEAM-3796, BEAM-3795

@chamikaramj
Contributor

LGTM.

Will merge after tests pass.

@chamikaramj
Contributor

Run seed job

@chamikaramj
Contributor

Run Java XmlIO Performance Test

@lgajowy
Contributor Author

lgajowy commented Mar 7, 2018

@chamikaramj
Contributor

Got it. I think it's fine to merge this and fix any issues.

Still have to wait for pre-commits to pass :)

@chamikaramj
Contributor

Retest this please

@chamikaramj chamikaramj merged commit bd3c087 into apache:master Mar 9, 2018
@lgajowy lgajowy deleted the xml-io-it branch March 14, 2018 11:32