-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-3734] Performance tests for XmlIO #4747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Run seed job |
|
Run XmlIO Sink and ReadFiles Performance Test |
The test can be parametrized with charset, number of records and filename prefix.
- generify getHashForRecordCount() (it's reusable in all IOITs now - move and rename appendTimestampToPrefix method
|
@jkff could you please take a look? You seem to know the IO's code best. I realized (after trials that are visible above) that I cannot test Jenkins jobs here until this code is merged to master. Perfkit (in Jenkins jobs) runs
WDYT? CC: @chamikaramj |
chamikaramj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
| prCommitStatusName: 'Java XmlIO Sink and ReadFiles Performance Test', | ||
| prTriggerPhase : 'Run XmlIO Sink and ReadFiles Performance Test', | ||
| extraPipelineArgs: [ | ||
| numberOfRecords: '100000000', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long does this test take to run for 100000000 elements ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two are run separately:
writeThenReadViaSinkAndReadFiles(): ~20 minutes
writeThenReadViaWriteAndRead(): ~ 15 minutes
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this file be under java/io/file-based-io-tests ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In such case, should xml module depend on file-based-io-tests module? My intention was to move code that is used by both modules to a place where both modules could easily access (hence the common module).
| package org.apache.beam.sdk.io.common; | ||
|
|
||
| import java.util.Date; | ||
| import java.util.Map; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto - please move this to 'java/io/file-based-io-tests' if this is only intended to be by file-based tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same question as above - XmlIOIT tests are in xml separate module.
|
|
||
| void setCompressionType(String compressionType); | ||
|
|
||
| /* Xml */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update comment to "Used by XmlIOIT"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
sdks/java/io/xml/build.gradle
Outdated
| /* | ||
| * We need to rely on manually specifying these evaluationDependsOn to ensure that | ||
| * the following projects are evaluated before we evaluate this project. This is because | ||
| * we are attempting to reference the "sourceSets.test.output" directly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we attempt to reference "sourceSets.test.output" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right! We don't need it here - my bad.
| PCollection<String> testFileNames = pipeline | ||
| .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords)) | ||
| .apply("Create Birds", MapElements.via(new LongToBird())) | ||
| .apply("Write birds to xml files", FileIO.<Bird>write() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Write XML files"
| } | ||
|
|
||
| @Test | ||
| public void writeThenReadViaSinkAndReadFiles() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be easier to just add a ReadAll transform to XMLIO (so that we can reduce this to a single test writeThenReadAll) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding a ReadAll transform is a good idea. IMO, it would be even better if a Write transform returning output filenames is added too. I can add those. Should I do this in this PR or maybe finish this one first and then add the transforms? It looks like a separate JIRA issue to me. WDYT?
| .withCharset(charset)); | ||
|
|
||
| PCollection<String> consolidatedHashcode = birds | ||
| .apply("Map birds to strings", MapElements.via(new BirdToString())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Map XML records to strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
| } | ||
|
|
||
| @Test | ||
| public void writeThenReadViaWriteAndRead() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this name means. Could you please update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
|
|
||
| pipeline.run().waitUntilFinish(); | ||
|
|
||
| PCollection<String> consolidatedHashcode = readPipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need both single pipeline and two pipeline versions of this test ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the two ways of using the XmlIO:
- using FileIO with XmlIO.sink and then using XmlIO.readFiles() method (this involves FileIO usage but allows us to perform all operations on one pipeline)
- using write() and read() methods from XmlIO only (this needs two pipelines)
This (in my opinion) has the following benefits:
- tests both scenarios (which can be helpful in finding regressions)
- shows two ways of implementing read and write scenario
- can show the difference between execution times of the tests
It was a low effort to me so I decided to leave both. :) If this is not needed I can delete the "two pipeline" one. Should I?
|
Thank you @chamikaramj! Please find my questions in above comments. |
chamikaramj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
sdks/java/io/xml/build.gradle
Outdated
| /* | ||
| * We need to rely on manually specifying these evaluationDependsOn to ensure that | ||
| * the following projects are evaluated before we evaluate this project. This is because | ||
| * we are attempting to reference the "sourceSets.test.output" directly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
You are right! We don't need it here - my bad.
Seems like code wasn't updated. Forgot to push ?
| prCommitStatusName: 'Java XmlIO Sink and ReadFiles Performance Test', | ||
| prTriggerPhase : 'Run XmlIO Sink and ReadFiles Performance Test', | ||
| extraPipelineArgs: [ | ||
| numberOfRecords: '100000000', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
These two are run separately:
writeThenReadViaSinkAndReadFiles(): ~20 minutes
writeThenReadViaWriteAndRead(): ~ 15 minutes
Acknowledged.
| public void writeThenReadViaSinkAndReadFiles() { | ||
| PCollection<String> testFileNames = pipeline | ||
| .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords)) | ||
| .apply("Create Birds", MapElements.via(new LongToBird())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
Ok
Please update.
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
In such case, shouldxmlmodule depend onfile-based-io-testsmodule? My intention was to move code that is used by both modules to a place where both modules could easily access (hence thecommonmodule).
I think we should move XmlIOIT to file-bsed-io-tests module as well given that all other file-based IO tests are in this module. This will allow us to consolidate all similar ITs and common classes/resources.
| } | ||
|
|
||
| @Test | ||
| public void writeThenReadViaSinkAndReadFiles() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
I think adding a ReadAll transform is a good idea. IMO, it would be even better if a Write transform returning output filenames is added too. I can add those. Should I do this in this PR or maybe finish this one first and then add the transforms? It looks like a separate JIRA issue to me. WDYT?
Agree that it should be a separate JIRA/PR and should not block this PR.
|
|
||
| pipeline.run().waitUntilFinish(); | ||
|
|
||
| PCollection<String> consolidatedHashcode = readPipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
I tested the two ways of using the XmlIO:
- using FileIO with XmlIO.sink and then using XmlIO.readFiles() method (this involves FileIO usage but allows us to perform all operations on one pipeline)
- using write() and read() methods from XmlIO only (this needs two pipelines)
This (in my opinion) has the following benefits:
- tests both scenarios (which can be helpful in finding regressions)
- shows two ways of implementing read and write scenario
- can show the difference between execution times of the tests
It was a low effort to me so I decided to leave both. :) If this is not needed I can delete the "two pipeline" one. Should I?
I think we should simplify and go with the single pipeline option. There is not enough of a diff between XmlIO.readFiles() and XmlIO.read() to justify an additional test.
| PCollection<String> testFileNames = pipeline | ||
| .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRecords)) | ||
| .apply("Create Birds", MapElements.via(new LongToBird())) | ||
| .apply("Write birds to xml files", FileIO.<Bird>write() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chamikaramj wrote:
"Write XML files"
Ditto.
|
|
||
| void setCompressionType(String compressionType); | ||
|
|
||
| /* Xml */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
Ok
Seems like this hasn't been done yet.
| } | ||
|
|
||
| @Test | ||
| public void writeThenReadViaWriteAndRead() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgajowy wrote:
Ok
Ditto.
|
@chamikaramj Thanks again. I posted the fixes. Could you take a look? Two important points:
|
|
LGTM. Will merge after tests pass. |
|
Run seed job |
|
Run Java XmlIO Performance Test |
|
Got it. I think it's fine to merge this and fix any issues. Still have to wait for pre-commits to pass :) |
|
Retest this please |
This PR adds IO Integration Tests for XmlIO. The It includes:
Why/how: https://beam.apache.org/documentation/io/testing/#implementing-integration-tests
Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue.mvn clean verifyto make sure basic checks pass. A more thorough check will be performed on your pull request automatically.