
Conversation

@lgajowy
Contributor

@lgajowy lgajowy commented Dec 29, 2017

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@chamikaramj could you take a look? What do you think about the idea of using JdbcIO to write the data? There is no write transform in HadoopInputFormatIO.

Currently the test works only for small datasets (100 000 rows, which is about 3.43 MB). For 1 000 000 rows the test is flaky (different hashes get generated). I'm investigating the cause now. After that I'm planning to add a hash for the large dataset.

Also, one last thing: large scale scenarios (e.g. 40 000 000 rows, which is 10 GB according to Dataflow's estimation) can take quite a long time to run. The write pipeline alone executes for more than 35 minutes. JdbcIO.write() seems to be the bottleneck, as it is done sequentially - one-row inserts, one after another. I think grouping the rows and then inserting them into the database in batches would speed things up, though I don't know by how much. Should I do this, or is the execution time bearable? Are there some other optimisations I might want to consider?
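For illustration, this is roughly the kind of batched insert I have in mind - a hypothetical DoFn, not code from this PR. The JDBC URL, the table name, and the two-column (id, name) layout with TestRow's id()/name() accessors are assumptions based on the test setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.beam.sdk.io.common.TestRow;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical sketch: buffer rows and flush them with executeBatch() instead of one INSERT per row. */
class BatchedInsertFn extends DoFn<TestRow, Void> {
  private static final int MAX_BATCH_SIZE = 1000;

  private final String jdbcUrl;    // assumed to come from the test's pipeline options
  private final String tableName;  // assumed table with (id, name) columns, as in the test

  private transient Connection connection;
  private transient PreparedStatement statement;
  private int batched = 0;

  BatchedInsertFn(String jdbcUrl, String tableName) {
    this.jdbcUrl = jdbcUrl;
    this.tableName = tableName;
  }

  @Setup
  public void setup() throws SQLException {
    connection = DriverManager.getConnection(jdbcUrl);
    statement = connection.prepareStatement(
        String.format("insert into %s values (?, ?)", tableName));
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws SQLException {
    // Accumulate rows into a JDBC batch and flush once the batch is full.
    statement.setInt(1, c.element().id());
    statement.setString(2, c.element().name());
    statement.addBatch();
    if (++batched >= MAX_BATCH_SIZE) {
      statement.executeBatch();
      batched = 0;
    }
  }

  @FinishBundle
  public void finishBundle() throws SQLException {
    // Flush whatever is left at the end of the bundle.
    if (batched > 0) {
      statement.executeBatch();
      batched = 0;
    }
  }

  @Teardown
  public void teardown() throws SQLException {
    if (statement != null) {
      statement.close();
    }
    if (connection != null) {
      connection.close();
    }
  }
}
```

Something like `.apply("Insert batches", ParDo.of(new BatchedInsertFn(jdbcUrl, tableName)))` in place of the current JdbcIO.write() step is what I mean - batch size and error handling would of course need more care.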

PS: Happy new year! 🎊 :)

Contributor

@chamikaramj chamikaramj left a comment


Thanks.

.apply("Get values only", Values.<TestRowDBWritable>create())
.apply("Values as string", ParDo.of(new SelectNameFn()))
.apply("Calculate hashcode", Combine.globally(new HashingFn()))
.apply(Reshuffle.<String>viaRandomKey());

Add a comment explaining why we need this reshuffle and add a link to the JIRA.
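Something along these lines - the wording and JIRA id are placeholders, and `readTransform` stands for the HadoopInputFormatIO read configured above:

```java
PCollection<String> consolidatedHashcode = readPipeline
    .apply("Read using HadoopInputFormat", readTransform)
    .apply("Get values only", Values.<TestRowDBWritable>create())
    .apply("Values as string", ParDo.of(new SelectNameFn()))
    .apply("Calculate hashcode", Combine.globally(new HashingFn()))
    // TODO: replace with the actual reason and JIRA id, e.g.:
    // "Reshuffle is a workaround for <describe the issue>, see
    // https://issues.apache.org/jira/browse/BEAM-XXXX."
    .apply(Reshuffle.<String>viaRandomKey());
```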

writePipeline.run().waitUntilFinish();

PCollection<String> consolidatedHashcode = readPipeline
.apply("Read using DBInputFormat", HadoopInputFormatIO

Prob. rename to "Read using HadoopInputFormat" since that explains better what we are testing here.

public void writeThenReadUsingDBInputFormat() {
writePipeline.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
.apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
.apply(JdbcIO.<TestRow>write()

Add a label, for example "Write using JDBCIO".

}

@Test
public void writeThenReadUsingDBInputFormat() {

Rename to readUsingHadoopInputFormat()? (since the write part is done using JDBCIO).

}
}

public static void cleanUpDataTable(DataSource dataSource, String tableName)

deleteTable ?

return dataSource;
}

public static void createDataTable(DataSource dataSource, String tableName)

createTable ?

@chamikaramj
Contributor

cc: @jbonofre @iemejia

public void writeThenReadUsingDBInputFormat() {
writePipeline.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
.apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
.apply(JdbcIO.<TestRow>write()

Writing using JDBCIO is fine. Probably your pipeline is slow since you do all the writing from a single worker. Consider creating a PCollection of seed objects (splits) followed by a Reshuffle() followed by writing.
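Roughly something like this, sketched with the transforms already used in the test. `dataSourceConfiguration` and `tableName` stand for whatever the test already sets up, and the statement setter assumes TestRow's id/name columns:

```java
writePipeline
    .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
    .apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
    // Breaking fusion here redistributes the generated rows, so the JDBC inserts
    // run on many workers instead of staying on the single worker running the generator.
    .apply("Spread writes", Reshuffle.<TestRow>viaRandomKey())
    .apply("Write using JDBCIO", JdbcIO.<TestRow>write()
        .withDataSourceConfiguration(dataSourceConfiguration)
        .withStatement(String.format("insert into %s values (?, ?)", tableName))
        .withPreparedStatementSetter((element, statement) -> {
          statement.setInt(1, element.id());
          statement.setString(2, element.name());
        }));
```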

@lgajowy
Contributor Author

lgajowy commented Jan 9, 2018

@chamikaramj thank you for your suggestions. @jbonofre @iemejia could you also take a look? I also added the io-it-suite-local profile that was missing and a Jenkins job definition.

Writing using JDBCIO is fine. Probably your pipeline is slow since you do all the writing From a single worker. Consider creating a PCollection of seed objects (splits) followed by a Reshuffle() followed by writing.

I added only the reshuffle and it seems to help a little. I didn't optimise it further due to a problem: different "consolidatedHashes" get calculated on each test run for datasets bigger than 600 000 rows. This makes it impossible to determine a hash for a large scale dataset (e.g. 40 000 000 rows). The number of rows read and written is the same. I also have the same problems while running JdbcIOIT on larger datasets. Also, as far as I checked, the database content seems to be all right. I can create a JIRA for that after you review this PR and agree that this behavior is odd, ok?

600 000 rows is approx. 160 MB. I wouldn't call that a large scale test, but I think it is something we can start with and then increase the scale and optimize gradually if needed and if possible (e.g. after tackling the hash calculation problem I described). What do you think?

<name>Apache Beam :: SDKs :: Java :: IO :: Hadoop :: input-format</name>
<description>IO to read data from data sources which implement Hadoop Input Format.</description>

<profiles>

For now I simply duplicated the code from JdbcIO, applying the necessary changes. I think it can be improved later, see: https://issues.apache.org/jira/browse/BEAM-3440

Contributor

@chamikaramj chamikaramj left a comment


Thanks. LGTM

}
}

public static void cleanUpDataTable(DataSource dataSource, String tableName)

chamikaramj wrote:
deleteTable ?

Done.

}

@Test
public void writeThenReadUsingDBInputFormat() {

chamikaramj wrote:
Rename to readUsingHadoopInputFormat()? (since the write part is done using JDBCIO).

Done.

return dataSource;
}

public static void createDataTable(DataSource dataSource, String tableName)

chamikaramj wrote:
createTable ?

Done.

public void writeThenReadUsingDBInputFormat() {
writePipeline.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
.apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
.apply(JdbcIO.<TestRow>write()

chamikaramj wrote:
Writing using JDBCIO is fine. Probably your pipeline is slow since you do all the writing from a single worker. Consider creating a PCollection of seed objects (splits) followed by a Reshuffle() followed by writing.

Done.

writePipeline.run().waitUntilFinish();

PCollection<String> consolidatedHashcode = readPipeline
.apply("Read using DBInputFormat", HadoopInputFormatIO

chamikaramj wrote:
Prob. rename to "Read using HadoopInputFormat" since that explains better what we are testing here.

Done.

.apply("Get values only", Values.<TestRowDBWritable>create())
.apply("Values as string", ParDo.of(new SelectNameFn()))
.apply("Calculate hashcode", Combine.globally(new HashingFn()))
.apply(Reshuffle.<String>viaRandomKey());

chamikaramj wrote:
Add a comment explaining why we need this reshuffle and add a link to the JIRA.

Done.

@chamikaramj
Contributor

Might be worth filing a JIRA for the JDBC issue in case it's a bug in the sink.

I'm fine with enabling these tests for a smaller dataset and increasing the size later after fixes.

@chamikaramj
Contributor

Run seed job

@chamikaramj
Contributor

Run Java HadoopInputFormatIO Performance Test

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from cef5e27 to 7b2d06b on January 10, 2018 12:18
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run seed job

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

1 similar comment
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from 7b2d06b to c27f1a5 on January 10, 2018 13:14
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run seed job

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

2 similar comments
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Any idea why the following failed?
https://builds.apache.org/job/beam_PerformanceTests_HadoopInputFormatIO_IT/1/console

@chamikaramj Judging from the logs you attached, either the kubeconfig's location is wrong or the postgres.yml's is. In the new commit I tried to override the kubeconfig's location with the value I got from our local Jenkins setup, hoping that it's some default value. No luck.

I think someone with access to Jenkins is needed - we need to know the path to the kubeconfig to set it up correctly.

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from c27f1a5 to 67cff14 on January 10, 2018 14:11
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run seed job

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

I also tried to set a different path to the Kubernetes scripts - analogous to the one used in the JDBC tests: https://builds.apache.org/view/A-D/view/Beam/job/beam_PerformanceTests_JDBC/215/console

Also no luck, because I got a "permission denied" error even earlier (the path didn't even matter at that point): https://builds.apache.org/job/beam_PerformanceTests_HadoopInputFormatIO_IT/16/console

@jbonofre can you help in diagnosing what is going on?

The kubernetes infrastructure that is needed for the jenkins job
to run is not available for now. We should add it once
the infrastructure is there.
@lgajowy
Contributor Author

lgajowy commented Jan 17, 2018

Removed the Jenkins job due to the reasons described in pull request 4392. The job should be added in a separate PR after the problems are solved.

@chamikaramj could you take a look?

@chamikaramj
Contributor

Can you merge and resolve conflicts?

LGTM other than that.

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from ee7400e to d502078 on January 18, 2018 11:55
@chamikaramj
Contributor

LGTM. Thanks.

@chamikaramj chamikaramj merged commit 2a96c1c into apache:master Jan 18, 2018
@lukecwik
Member

This broke the Gradle build:
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_GradleBuild/src/sdks/java/io/hadoop/input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java:20: error: package org.apache.beam.sdk.io.common does not exist
import static org.apache.beam.sdk.io.common.TestRow.DeterministicallyConstructTestRowFn;
^
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_GradleBuild/src/sdks/java/io/hadoop/input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java:20: error: static import only from classes and interfaces
import static org.apache.beam.sdk.io.common.TestRow.DeterministicallyConstructTestRowFn;
^

Filed: https://issues.apache.org/jira/browse/BEAM-3496

boyuanzz pushed a commit to boyuanzz/beam that referenced this pull request Feb 6, 2018
…mat (apache#4332)

The kubernetes infrastructure that is needed for the jenkins job
to run is not available for now. We should add it once
the infrastructure is there.
@lgajowy lgajowy deleted the HadoopInputFormatIO-test branch March 14, 2018 11:40
