suztomo (Contributor) commented Jan 13, 2021

Closed this in favor of #13804

This PR upgrades the non-vendored Guava version to the latest 30.1-jre, while keeping the version 25.1-jre for certain modules (Hadoop and Cassandra-related) that require the old version of Guava.

Why do I want the latest Guava?

When Beam publishes a recommended version of Guava for Dataflow users (#13737, WIP), I want the recommended version in line with the one in the GCP Libraries BOM (with "-jre" suffix). This is because Google Cloud client libraries are built and tested with the newer version of Guava. I want Beam's Dataflow and Google Cloud Platform modules to be built, tested, and used with the same version of Guava as much as possible.

If we don't do this PR, we would end up in a situation where the GCP Libraries BOM recommends Guava 30 while Beam's GCP BOM recommends Guava 25.

What is the problem with Guava 25?

When a library touches classes or methods that exist only in a newer version of Guava, it fails at runtime with NoClassDefFoundError or NoSuchMethodError. For example, gcsio uses Uninterruptibles.sleepUninterruptibly(java.time.Duration), and Linkage Checker detects the usage:

(com.google.guava:guava:25.1-jre) com.google.common.util.concurrent.Uninterruptibles's method sleepUninterruptibly(java.time.Duration) is not found;
  referenced by 3 class files
    com.google.cloud.hadoop.gcsio.cooplock.CoopLockOperationDao (com.google.cloud.bigdataoss:gcsio:2.1.6)
    com.google.cloud.hadoop.gcsio.cooplock.CoopLockRecordsDao (com.google.cloud.bigdataoss:gcsio:2.1.6)
    com.google.cloud.hadoop.gcsio.testing.InMemoryObjectEntry (com.google.cloud.bigdataoss:gcsio:2.1.6)

The method exists only in Guava 28 or higher. This might not be a problem for Dataflow-only users for now, but it may cause problems in other use cases of the library. Therefore, I want to recommend the newer version of Guava to GCP users.

Problem with newer Guava version in Hadoop/Cassandra

If I naively upgrade the Guava version to 30.1-jre, the tests fail with NoSuchMethodError for Futures.addCallback and NoSuchFieldError for DIGIT (CharMatcher). Details are in BEAM-11626.

This PR fixes the problem by keeping the Guava version lower for the Hadoop/Cassandra-related modules.

Where is the Guava dependency declared?

The following Gradle modules declare a dependency through the library.java.guava variable:

suztomo-macbookpro44% find . -name 'build.gradle' |xargs grep 'library.java.guava'
./sdks/java/core/build.gradle:  shadowTest library.java.guava_testlib
./sdks/java/io/kinesis/build.gradle:  compile library.java.guava
./sdks/java/io/kinesis/build.gradle:  testCompile library.java.guava_testlib
./sdks/java/io/amazon-web-services2/build.gradle:  testCompile library.java.guava_testlib
./sdks/java/io/google-cloud-platform/build.gradle:  compile library.java.guava
./sdks/java/io/contextualtextio/build.gradle:    testCompile library.java.guava_testlib
./sdks/java/extensions/sql/zetasql/build.gradle:  compile library.java.guava
./sdks/java/maven-archetypes/examples/build.gradle:    'guava.version': dependencies.create(project.library.java.guava).getVersion(),
./runners/google-cloud-dataflow-java/build.gradle:  testCompile library.java.guava_testlib

Other than tests, the 3 modules declaring the Guava dependencies are sdks/java/io/kinesis, sdks/java/io/google-cloud-platform, and sdks/java/extensions/sql/zetasql.

  • sdks/java/io/kinesis module has
    • com.amazonaws:amazon-kinesis-client:1.13.0 built with Guava 26.0-jre
    • com.amazonaws:amazon-kinesis-producer:0.14.1 built with Guava 24.1.1-jre
    • Linkage Checker found no new linkage errors for beam-sdks-java-io-kinesis (link).
  • Google Cloud libraries are built with the latest Guava
  • The zetasql client 2020.10.1 is built with Guava 29.
    • Linkage Checker detected a potential conflict between org.apache.hadoop:hadoop-yarn-common:2.10.1 and Guava 30. The same conflict already exists with Guava 29, so declaring the dependency on Guava 30 introduces no new problem.

The sdks/java/maven-archetypes/examples module is a tricky one: I want Hadoop/Cassandra users to use Guava 25.1 and others to use Guava 30.

What's the impact on Beam's Cassandra / Hadoop users?

There's no impact on the Beam Cassandra and Hadoop artifacts. None of the Maven artifacts org.apache.beam:beam-sdks-java-io-hadoop-format:2.27.0, org.apache.beam:beam-sdks-java-io-cassandra:2.27.0, or org.apache.beam:beam-sdks-java-io-hadoop-file-system:2.27.0 declares a Guava dependency.

Instruction for Hadoop / Cassandra Beam users

If Beam Cassandra / Hadoop users also use beam-sdks-java-io-kinesis, beam-sdks-java-io-google-cloud-platform, or beam-sdks-java-extensions-sql-zetasql, they need to pin the Guava version to 25.1-jre. They can use <dependencyManagement> in Maven and force in Gradle.
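
For Gradle users, the pin described above could look roughly like the fragment below. This is a minimal sketch for a hypothetical user project, not something taken from Beam's own build: it assumes the project pulls in Guava only transitively and that pinning every configuration is acceptable.

```groovy
// build.gradle of a user's pipeline project (hypothetical).
// Assumption: the project combines Beam's Hadoop/Cassandra IOs with a Beam
// module that declares Guava 30.1-jre.
configurations.all {
  resolutionStrategy {
    // Override whatever Guava version is requested transitively.
    force 'com.google.guava:guava:25.1-jre'
  }
}
```

Without the pin, Gradle's default conflict resolution would normally pick the highest requested version (30.1-jre here), which is the version the Hadoop/Cassandra code paths fail against.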

Linkage Checker

The sdks/java/build-tools/beam-linkage-check.sh found a new conflict in a dependency of org.apache.hadoop:hadoop-client:2.10.1 (provided) in beam-sdks-java-extensions-sql-zetasql module.

(com.google.guava:guava:30.1-jre) com.google.common.base.CharMatcher's field WHITESPACE is not found;
  referenced by 1 class file
    org.apache.hadoop.yarn.webapp.WebApp (org.apache.hadoop:hadoop-yarn-common:2.10.1)

https://gist.github.com/suztomo/e5fa71d8a0800dbbbc9cd2626d50730e

If the zetasql module happens to be used with the YARN web app, then "Instruction for Hadoop / Cassandra Beam users" applies here to resolve the incompatibility.


suztomo commented Jan 13, 2021

Run Java PostCommit

codecov bot commented Jan 13, 2021

Codecov Report

Merging #13740 (0183d98) into master (24179c3) will decrease coverage by 0.00%.
The diff coverage is 92.59%.

@@            Coverage Diff             @@
##           master   #13740      +/-   ##
==========================================
- Coverage   82.75%   82.74%   -0.01%     
==========================================
  Files         466      466              
  Lines       57527    57543      +16     
==========================================
+ Hits        47607    47615       +8     
- Misses       9920     9928       +8     
Impacted Files Coverage Δ
sdks/python/apache_beam/io/kafka.py 80.76% <60.00%> (-4.95%) ⬇️
sdks/python/apache_beam/dataframe/frames.py 91.97% <100.00%> (+0.09%) ⬆️
sdks/python/apache_beam/internal/metrics/metric.py 86.45% <0.00%> (-1.05%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py 93.44% <0.00%> (-0.39%) ⬇️
...eam/runners/interactive/interactive_environment.py 89.92% <0.00%> (-0.36%) ⬇️
sdks/python/apache_beam/io/iobase.py 84.55% <0.00%> (-0.27%) ⬇️

suztomo commented Jan 13, 2021

Run Java_Examples_Dataflow PreCommit failed.

03:21:12 > Task :runners:google-cloud-dataflow-java:worker:legacy-worker:compileJava
03:23:13 Build timed out (after 30 minutes). Marking the build as aborted.
03:23:13 Build was aborted
03:23:13 Recording test results
03:23:16 > Task :runners:google-cloud-dataflow-java:worker:legacy-worker:compileJava FAILED

Retrying.

suztomo commented Jan 13, 2021

Run Java_Examples_Dataflow PreCommit

suztomo commented Jan 13, 2021

Java PostCommit check failed

Error Message
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.OutOfMemoryError: Java heap space
Stacktrace
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.OutOfMemoryError: Java heap space
	at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:371)
	at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:339)
	at org.apache.beam.sdk.io.gcp.healthcare.FhirIOSearchIT.testFhirIOSearch(FhirIOSearchIT.java:154)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at 

Retrying.

suztomo commented Jan 13, 2021

Run Java PostCommit

suztomo commented Jan 13, 2021

Another error.

com.google.gson.JsonParseException: Failed parsing JSON source: JsonReader at line 1 column 47913 path $[43].resource.code.coding[0].display to Json
	at com.google.gson.JsonParser.parseReader(JsonParser.java:89)
	at com.google.gson.JsonParser.parseReader(JsonParser.java:60)
	at com.google.gson.JsonParser.parseString(JsonParser.java:47)
	at org.apache.beam.sdk.io.gcp.healthcare.JsonArrayCoder.decode(JsonArrayCoder.java:46)
	at org.apache.beam.sdk.io.gcp.healthcare.JsonArrayCoder.decode(JsonArrayCoder.java:28)
	at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)
	at org.apache.beam.sdk.util.CoderUtils.decodeFromSafeStream(CoderUtils.java:118)
	at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:101)
	at org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:95)
	at org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.<init>(MutationDetectors.java:122)
	at org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:49)
	at org.apache.beam.runners.direct.ImmutabilityEnforcementFactory$ImmutabilityCheckingEnforcement.beforeElement(ImmutabilityEnforcementFactory.java:124)
	at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:162)
	at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:129)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at 

suztomo commented Jan 13, 2021

Run Java PostCommit

suztomo commented Jan 13, 2021

The test passed. Where is the Cassandra problem now?

suztomo commented Jan 13, 2021

Run SQL PostCommit

suztomo commented Jan 13, 2021

Run Java HadoopFormatIO Performance Test

suztomo commented Jan 13, 2021

Run Dataflow ValidatesRunner

suztomo commented Jan 13, 2021

Run Spark ValidatesRunner

suztomo commented Jan 13, 2021

Run SQL Postcommit

suztomo commented Jan 13, 2021

Run SQL Postcommit

suztomo commented Jan 13, 2021

Run Spark ValidatesRunner

suztomo commented Jan 13, 2021

Run Dataflow ValidatesRunner

suztomo commented Jan 13, 2021

Run Java HadoopFormatIO Performance Test

suztomo commented Jan 13, 2021

Run Java PostCommit

suztomo changed the title from "[BEAM-11626] The latest Guava version while keeping 25.1 for Cassandra integration" to "[BEAM-11626] Guava version 30.1-jre (latest)" Jan 13, 2021

suztomo commented Jan 13, 2021

Run Java_Examples_Dataflow PreCommit

suztomo commented Jan 13, 2021

Now "Run Java PreCommit" failed, and it shows what I was looking for:

Test Result (22 failures / +22)
org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrarTest.testServiceLoader
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testDeleteNonExisting
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testRenameExistingDestination
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testMatch
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testCopy
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testMatchForNonExistentFile
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testCreateAndReadFile
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testMatchDirectory
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testRenameRetryScenario
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testRenameMissingTargetDir
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testCreateAndReadFileWithShift
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testCreateAndReadFileWithShiftToEnd
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testCopySourceMissing
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testRenameMissingSource
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testMatchNewResource
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testMatchForRecursiveGlob
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testDelete
org.apache.beam.sdk.io.hdfs.HadoopFileSystemTest.testRename
org.apache.beam.sdk.io.hdfs.HadoopResourceIdTest.testGetFilename
org.apache.beam.sdk.io.hdfs.HadoopResourceIdTest.testResourceIdTester
org.apache.beam.sdk.io.hadoop.format.HadoopFormatIOCassandraTest.classMethod
org.apache.beam.sdk.io.hadoop.format.HadoopFormatIOCassandraTest.classMethod

https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/15498/#showFailuresLink

suztomo commented Jan 13, 2021

Run Java PreCommit

suztomo commented Jan 14, 2021

Run SQL Postcommit

suztomo commented Jan 14, 2021

Run Spark ValidatesRunner

suztomo commented Jan 14, 2021

Run Dataflow ValidatesRunner

suztomo commented Jan 14, 2021

Run Java HadoopFormatIO Performance Test

suztomo commented Jan 14, 2021

Run Java PostCommit

suztomo commented Jan 14, 2021

Run Java_Examples_Dataflow PreCommit

suztomo commented Jan 14, 2021

Run Java PreCommit

suztomo commented Jan 14, 2021

Run Java PostCommit

suztomo commented Jan 14, 2021

Run Java PreCommit

Java precommit failed twice:

23:45:21 * What went wrong:
23:45:21 Execution failed for task ':sdks:java:io:cassandra:test'.
23:45:21 > Process 'Gradle Test Executor 40' finished with non-zero exit value 3
23:45:21   This problem might be caused by incorrect test process configuration.
23:45:21   Please refer to the test execution section in the User Manual at https://docs.gradle.org/6.7.1/userguide/java_testing.html#sec:test_execution

suztomo commented Jan 14, 2021

Run SQL Postcommit

suztomo commented Jan 14, 2021

Run Spark ValidatesRunner

suztomo commented Jan 14, 2021

Run Dataflow ValidatesRunner

suztomo commented Jan 14, 2021

Run Java HadoopFormatIO Performance Test

suztomo commented Jan 14, 2021

Run Java PostCommit

suztomo commented Jan 14, 2021

Run Java_Examples_Dataflow PreCommit

suztomo changed the title from "[BEAM-11626] Guava version 30.1-jre (latest)" to "[BEAM-11626] Guava version 25.1-jre for Hadoop/Cassandra and Guava version 30.1 (latest) for the rest" Jan 14, 2021

suztomo commented Jan 14, 2021

Run Java_Examples_Dataflow PreCommit

@suztomo suztomo marked this pull request as ready for review January 15, 2021 00:07
@aaltay aaltay requested a review from kennknowles January 15, 2021 01:10
  // Try to keep grpc_version consistent with gRPC version in google_cloud_platform_libraries_bom
  def grpc_version = "1.32.2"
- def guava_version = "25.1-jre"
+ def guava_version = guava25Projects.contains(project.path) ? "25.1-jre" : "30.1-jre"
Member

Wouldn't this be problematic, causing Beam to depend on two different versions? Which version will Beam users depend on if they need to use Beam with one of these 3 projects?

suztomo (Contributor, Author) commented Jan 15, 2021

> Which version will Beam users depend on if they need to use Beam with one of these 3 projects?

There's no impact on the Beam Cassandra and Hadoop artifacts. None of the Maven artifacts org.apache.beam:beam-sdks-java-io-hadoop-format:2.27.0, org.apache.beam:beam-sdks-java-io-cassandra:2.27.0, or org.apache.beam:beam-sdks-java-io-hadoop-file-system:2.27.0 declares a Guava dependency.

However, if Beam Cassandra / Hadoop users also use beam-sdks-java-io-kinesis, beam-sdks-java-io-google-cloud-platform, or beam-sdks-java-extensions-sql-zetasql (which declare a Guava dependency), they need to pin the Guava version to 25.1-jre. They can use <dependencyManagement> in Maven and force in Gradle.

If Beam users don't depend on any of beam-sdks-java-io-kinesis, beam-sdks-java-io-google-cloud-platform, or beam-sdks-java-extensions-sql-zetasql, this change has no effect on them.

Member

Ack. I think this would be an undocumented hurdle for the impacted users. I am not sure what the best course of action is. Hopefully @kennknowles would have a recommendation.

Contributor Author

Yes, I think I should document that condition ("if Beam Cassandra / Hadoop users use Beam with beam-sdks-java-io-kinesis, ...") somewhere.

@suztomo suztomo mentioned this pull request Jan 22, 2021
Comment on lines +70 to +74
* The Java artifacts "beam-sdks-java-io-kinesis", "beam-sdks-java-io-google-cloud-platform", and
"beam-sdks-java-extensions-sql-zetasql" declare Guava 30.1-jre dependency (It was 25.1-jre in Beam 2.27.0).
This new Guava version may introduce dependency conflicts if your project or dependencies rely
on removed APIs. If affected, ensure to use an appropriate Guava version via `dependencyManagement` in Maven and
`force` in Gradle.
Contributor Author

@aaltay I added this note about the potential impact on Beam users. The risk described here is not specific to this Guava version. Every dependency upgrade, in general, carries a risk of introducing dependency conflicts if a user relies on removed methods or classes. (Therefore this note might not be needed.)

  // Try to keep grpc_version consistent with gRPC version in google_cloud_platform_libraries_bom
  def grpc_version = "1.32.2"
- def guava_version = "25.1-jre"
+ def guava_version = guava25Projects.contains(project.path) ? "25.1-jre" : "30.1-jre"

We always treat library.java as a global constant. In all existing cases where a project requires a library version that deviates from library.java, we don't use library.java and instead hard-code that dependency in the project's build.gradle.

IMO making library.java conditional on the project being compiled defeats the purpose of declaring a common version in the first place.

suztomo (Contributor, Author) commented Jan 25, 2021

> In all existing cases where a project requires a library version that deviates from library.java, we don't use library.java and instead hard-code that dependency in the project's build.gradle.

That's great information. Let me try that. I see hadoop-common does that with force. Thanks.

Memo for myself in hadoop-common:

hadoopVersions.each {kv ->
  configurations."hadoopVersion$kv.key" {
    resolutionStrategy {
      force "org.apache.hadoop:hadoop-client:$kv.value"
      force "org.apache.hadoop:hadoop-common:$kv.value"
      force "org.apache.hadoop:hadoop-mapreduce-client-core:$kv.value"
    }
  }
}
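
Applied to Guava, the same hard-coding approach might look like the fragment below for a Hadoop- or Cassandra-related module. This is only a sketch under assumptions: the variable name `guava25` is hypothetical, and the choice of the `testCompile` configuration is an illustration rather than what #13804 actually does.

```groovy
// Hypothetical fragment of a Hadoop- or Cassandra-related module's build.gradle.
// The coordinates are hard-coded on purpose instead of using library.java.guava,
// so library.java stays a single global constant.
def guava25 = "com.google.guava:guava:25.1-jre"  // hypothetical local name

dependencies {
  // Declared directly for the tests that need the old Guava APIs.
  testCompile guava25
}

configurations.all {
  resolutionStrategy {
    // Stop transitive requests for a newer Guava from winning in this module.
    force guava25
  }
}
```

The `force` is what keeps the module on 25.1-jre; declaring the dependency alone would not be enough once another dependency requests a newer Guava.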

suztomo commented Jan 25, 2021

Closing this in favor of #13804
