
Conversation

@lgajowy
Contributor

@lgajowy lgajowy commented Dec 29, 2017

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@chamikaramj could you take a look? What do you think about the idea of using JdbcIO to write the data? There is no write transform in HadoopInputFormatIO.

Currently the test works only for small datasets (100 000 rows, which is about 3.43 MB). For 1 000 000 rows the test is flaky (different hashes get generated). I'm investigating the cause now. After that I'm planning to add a hash for the large dataset.

Also, one last thing: large scale scenarios (e.g. 40 000 000 rows, which is 10 GB according to Dataflow's estimation) can take quite a long time to run. The write pipeline alone executes for more than 35 minutes. JdbcIO.write() seems to be the bottleneck, as it is done sequentially - one-row inserts, one after another. I think grouping the rows and then inserting them into the database in batches would speed things up, though I don't know by how much. Should I do this, or is the execution time bearable? Are there some other optimisations I might want to consider?
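For illustration, this is roughly the kind of batched insert I have in mind - a hypothetical DoFn, not code from this PR. The JDBC URL, the table name, and the two-column (id, name) layout with TestRow's id()/name() accessors are assumptions based on the test setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.beam.sdk.io.common.TestRow;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical sketch: buffer rows and flush them with executeBatch() instead of one INSERT per row. */
class BatchedInsertFn extends DoFn<TestRow, Void> {
  private static final int MAX_BATCH_SIZE = 1000;

  private final String jdbcUrl;    // assumed to come from the test's pipeline options
  private final String tableName;  // assumed table with (id, name) columns, as in the test

  private transient Connection connection;
  private transient PreparedStatement statement;
  private int batched = 0;

  BatchedInsertFn(String jdbcUrl, String tableName) {
    this.jdbcUrl = jdbcUrl;
    this.tableName = tableName;
  }

  @Setup
  public void setup() throws SQLException {
    connection = DriverManager.getConnection(jdbcUrl);
    statement = connection.prepareStatement(
        String.format("insert into %s values (?, ?)", tableName));
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws SQLException {
    // Accumulate rows into a JDBC batch and flush once the batch is full.
    statement.setInt(1, c.element().id());
    statement.setString(2, c.element().name());
    statement.addBatch();
    if (++batched >= MAX_BATCH_SIZE) {
      statement.executeBatch();
      batched = 0;
    }
  }

  @FinishBundle
  public void finishBundle() throws SQLException {
    // Flush whatever is left at the end of the bundle.
    if (batched > 0) {
      statement.executeBatch();
      batched = 0;
    }
  }

  @Teardown
  public void teardown() throws SQLException {
    if (statement != null) {
      statement.close();
    }
    if (connection != null) {
      connection.close();
    }
  }
}
```

Something like `.apply("Insert batches", ParDo.of(new BatchedInsertFn(jdbcUrl, tableName)))` in place of the current JdbcIO.write() step is what I mean - batch size and error handling would of course need more care.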

PS: Happy new year! 🎊 :)

Contributor

@chamikaramj chamikaramj left a comment


Thanks.

.apply("Get values only", Values.<TestRowDBWritable>create())
.apply("Values as string", ParDo.of(new SelectNameFn()))
.apply("Calculate hashcode", Combine.globally(new HashingFn()))
.apply(Reshuffle.<String>viaRandomKey());

Add a comment explaining why we need this reshuffle and add a link to the JIRA.
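Something along these lines - the wording and JIRA id are placeholders, and `readTransform` stands for the HadoopInputFormatIO read configured above:

```java
PCollection<String> consolidatedHashcode = readPipeline
    .apply("Read using HadoopInputFormat", readTransform)
    .apply("Get values only", Values.<TestRowDBWritable>create())
    .apply("Values as string", ParDo.of(new SelectNameFn()))
    .apply("Calculate hashcode", Combine.globally(new HashingFn()))
    // TODO: replace with the actual reason and JIRA id, e.g.:
    // "Reshuffle is a workaround for <describe the issue>, see
    // https://issues.apache.org/jira/browse/BEAM-XXXX."
    .apply(Reshuffle.<String>viaRandomKey());
```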

writePipeline.run().waitUntilFinish();

PCollection<String> consolidatedHashcode = readPipeline
.apply("Read using DBInputFormat", HadoopInputFormatIO

Prob. rename to "Read using HadoopInputFormat" since that explains better what we are testing here.

public void writeThenReadUsingDBInputFormat() {
writePipeline.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
.apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
.apply(JdbcIO.<TestRow>write()

Add a label, for example "Write using JDBCIO".

}

@Test
public void writeThenReadUsingDBInputFormat() {

Rename to readUsingHadoopInputFormat()? (since the write part is done using JDBCIO).

}
}

public static void cleanUpDataTable(DataSource dataSource, String tableName)

deleteTable ?

return dataSource;
}

public static void createDataTable(DataSource dataSource, String tableName)

createTable ?

@chamikaramj
Contributor

cc: @jbonofre @iemejia

public void writeThenReadUsingDBInputFormat() {
writePipeline.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
.apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
.apply(JdbcIO.<TestRow>write()

Writing using JDBCIO is fine. Probably your pipeline is slow since you do all the writing from a single worker. Consider creating a PCollection of seed objects (splits) followed by a Reshuffle() followed by writing.
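Roughly something like this, sketched with the transforms already used in the test. `dataSourceConfiguration` and `tableName` stand for whatever the test already sets up, and the statement setter assumes TestRow's id/name columns:

```java
writePipeline
    .apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
    .apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
    // Breaking fusion here redistributes the generated rows, so the JDBC inserts
    // run on many workers instead of staying on the single worker running the generator.
    .apply("Spread writes", Reshuffle.<TestRow>viaRandomKey())
    .apply("Write using JDBCIO", JdbcIO.<TestRow>write()
        .withDataSourceConfiguration(dataSourceConfiguration)
        .withStatement(String.format("insert into %s values (?, ?)", tableName))
        .withPreparedStatementSetter((element, statement) -> {
          statement.setInt(1, element.id());
          statement.setString(2, element.name());
        }));
```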

@lgajowy
Contributor Author

lgajowy commented Jan 9, 2018

@chamikaramj thank you for your suggestions. @jbonofre @iemejia could you also take a look? I also added the io-it-suite-local profile that was missing and a Jenkins job definition.

Writing using JDBCIO is fine. Probably your pipeline is slow since you do all the writing From a single worker. Consider creating a PCollection of seed objects (splits) followed by a Reshuffle() followed by writing.

I added only the reshuffle and it seems to help a little. I didn't optimise it further due to a problem: different "consolidatedHashes" get calculated on each test run for datasets bigger than 600 000 rows. This makes it impossible to determine a hash for a large scale dataset (e.g. 40 000 000 rows). The number of rows read and written is the same. I also have the same problems while running JdbcIOIT on larger datasets. Also, as far as I checked, the database content seems to be all right. I can create a JIRA for that after you review this PR and agree that this behavior is odd, ok?

600 000 rows is approx. 160 MB. I wouldn't call that a large scale test, but I think it is something we can start with and then increase the scale and optimize gradually if needed and if possible (e.g. after tackling the hash calculation problem I described). What do you think?

<name>Apache Beam :: SDKs :: Java :: IO :: Hadoop :: input-format</name>
<description>IO to read data from data sources which implement Hadoop Input Format.</description>

<profiles>

For now I simply duplicated the code from JdbcIO, applying the necessary changes. I think it can be improved later, see: https://issues.apache.org/jira/browse/BEAM-3440

Contributor

@chamikaramj chamikaramj left a comment


Thanks. LGTM

}
}

public static void cleanUpDataTable(DataSource dataSource, String tableName)

chamikaramj wrote:
deleteTable ?

Done.

}

@Test
public void writeThenReadUsingDBInputFormat() {

chamikaramj wrote:
Rename to readUsingHadoopInputFormat()? (since the write part is done using JDBCIO).

Done.

return dataSource;
}

public static void createDataTable(DataSource dataSource, String tableName)

chamikaramj wrote:
createTable ?

Done.

public void writeThenReadUsingDBInputFormat() {
writePipeline.apply("Generate sequence", GenerateSequence.from(0).to(numberOfRows))
.apply("Produce db rows", ParDo.of(new DeterministicallyConstructTestRowFn()))
.apply(JdbcIO.<TestRow>write()

chamikaramj wrote:
Writing using JDBCIO is fine. Probably your pipeline is slow since you do all the writing from a single worker. Consider creating a PCollection of seed objects (splits) followed by a Reshuffle() followed by writing.

Done.

writePipeline.run().waitUntilFinish();

PCollection<String> consolidatedHashcode = readPipeline
.apply("Read using DBInputFormat", HadoopInputFormatIO

chamikaramj wrote:
Prob. rename to "Read using HadoopInputFormat" since that explains better what we are testing here.

Done.

.apply("Get values only", Values.<TestRowDBWritable>create())
.apply("Values as string", ParDo.of(new SelectNameFn()))
.apply("Calculate hashcode", Combine.globally(new HashingFn()))
.apply(Reshuffle.<String>viaRandomKey());

chamikaramj wrote:
Add a comment explaining why we need this reshuffle and add a link to the JIRA.

Done.

@chamikaramj
Contributor

Might be worth filing a JIRA for the JDBC issue in case it's a bug in the sink.

I'm fine with enabling these tests for a smaller dataset and increasing the size later after fixes.

@chamikaramj
Contributor

Run seed job

@chamikaramj
Contributor

Run Java HadoopInputFormatIO Performance Test

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from cef5e27 to 7b2d06b on January 10, 2018 12:18
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run seed job

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

1 similar comment
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from 7b2d06b to c27f1a5 on January 10, 2018 13:14
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run seed job

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

2 similar comments
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Any idea why the following failed?
https://builds.apache.org/job/beam_PerformanceTests_HadoopInputFormatIO_IT/1/console

@chamikaramj Judging from the logs you attached, either the kubeconfig's location is wrong or the postgres.yml's is. In the new commit I tried to override the kubeconfig's location with the value I got from our local Jenkins setup, hoping that it's some default value. No luck.

I think someone with access to Jenkins is needed - we need to know the path to the kubeconfig to set it up correctly.

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from c27f1a5 to 67cff14 on January 10, 2018 14:11
@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run seed job

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

Run Java HadoopInputFormatIO Performance Test

@lgajowy
Contributor Author

lgajowy commented Jan 10, 2018

I also tried to set a different path to the Kubernetes scripts - analogous to the one used in the JDBC tests: https://builds.apache.org/view/A-D/view/Beam/job/beam_PerformanceTests_JDBC/215/console

Also no luck, because I got a "permission denied" error even earlier (the path didn't even matter at that point): https://builds.apache.org/job/beam_PerformanceTests_HadoopInputFormatIO_IT/16/console

@jbonofre can you help in diagnosing what is going on?

The kubernetes infrastructure that is needed for the jenkins job
to run is not available for now. We should add it once
the infrastructure is there.
@lgajowy
Contributor Author

lgajowy commented Jan 17, 2018

Removed the Jenkins job due to the reasons described in pull request 4392. The job should be added in a separate PR after the problems are solved.

@chamikaramj could you take a look?

@chamikaramj
Contributor

Can you merge and resolve conflicts?

LGTM other than that.

@lgajowy lgajowy force-pushed the HadoopInputFormatIO-test branch from ee7400e to d502078 on January 18, 2018 11:55
@chamikaramj
Contributor

LGTM. Thanks.

@chamikaramj chamikaramj merged commit 2a96c1c into apache:master Jan 18, 2018
@lukecwik
Member

This broke the Gradle build:
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_GradleBuild/src/sdks/java/io/hadoop/input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java:20: error: package org.apache.beam.sdk.io.common does not exist
import static org.apache.beam.sdk.io.common.TestRow.DeterministicallyConstructTestRowFn;
^
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_GradleBuild/src/sdks/java/io/hadoop/input-format/src/test/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIOIT.java:20: error: static import only from classes and interfaces
import static org.apache.beam.sdk.io.common.TestRow.DeterministicallyConstructTestRowFn;
^

Filed: https://issues.apache.org/jira/browse/BEAM-3496

boyuanzz pushed a commit to boyuanzz/beam that referenced this pull request Feb 6, 2018
…mat (apache#4332)

The kubernetes infrastructure that is needed for the jenkins job
to run is not available for now. We should add it once
the infrastructure is there.
@lgajowy lgajowy deleted the HadoopInputFormatIO-test branch March 14, 2018 11:40
