
Conversation


@nastra nastra commented Oct 17, 2022

#5410 added monitor metrics for Flink, and the intention there was to include org.apache.flink:flink-metrics-dropwizard in the runtime jar. However, because exclude group: 'org.apache.flink' was applied to the entire implementation configuration, that dependency never made it into the jar. This can also be seen in the latest Flink runtime jar published to the snapshot repo.

When using that runtime jar, things will fail with a NoClassDefFoundError as can be seen below:

java.lang.NoClassDefFoundError: com/codahale/metrics/Reservoir
	at org.apache.iceberg.flink.sink.IcebergStreamWriter.open(IcebergStreamWriter.java:55)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:107)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:700)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:676)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:643)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:917)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
	at java.base/java.lang.Thread.run(Unknown Source)

This PR moves the exclude out of the entire implementation configuration and also ensures that the correct resolvable version (1.15.0 rather than 1.15) of org.apache.flink:flink-metrics-dropwizard is selected.
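The change can be sketched in Gradle terms as follows (a minimal sketch, not the exact PR diff; the flinkMajorVersion/flinkVersion property names follow the build snippets quoted in the review comments):

```groovy
// Minimal sketch of the fix (not the exact PR diff). Previously the exclude
// was applied to the whole implementation configuration:
//
//   configurations {
//     implementation {
//       exclude group: 'org.apache.flink'
//     }
//   }
//
// which also stripped flink-metrics-dropwizard out of the runtime jar.
// Scoping the exclude to the iceberg-flink project dependency keeps Flink
// itself out of the jar while letting the dropwizard bridge in:
dependencies {
  implementation(project(":iceberg-flink:iceberg-flink-${flinkMajorVersion}")) {
    exclude group: 'org.apache.flink'
  }

  // Use the fully resolvable version (flinkVersion, e.g. 1.15.0) rather than
  // the major version alone (flinkMajorVersion, e.g. 1.15), which does not
  // exist as a published artifact.
  implementation "org.apache.flink:flink-metrics-dropwizard:${flinkVersion}"
}
```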

@stevenzwu could you review this one please?

@rdblue
Contributor

rdblue commented Oct 17, 2022

@nastra, can you add some context about what went wrong and why this is the correct solution?


  // for dropwizard histogram metrics implementation
- implementation "org.apache.flink:flink-metrics-dropwizard:${flinkMajorVersion}"
+ implementation "org.apache.flink:flink-metrics-dropwizard:${flinkVersion}"
Contributor


thx for fixing this bug

Contributor


@rdblue I believe this is the main fix. Previously it was using the wrong version number, 1.15; it should be the full version, 1.15.0.

Contributor Author


the main fix is actually moving exclude group: 'org.apache.flink' out of the entire implementation configuration so that org.apache.flink:flink-metrics-dropwizard can make it into the runtime jar. Once I did that, I noticed that it was trying to pick up version 1.15 (which doesn't exist) rather than 1.15.0

  dependencies {
-   implementation project(":iceberg-flink:iceberg-flink-${flinkMajorVersion}")
+   implementation(project(":iceberg-flink:iceberg-flink-${flinkMajorVersion}")) {
+     exclude group: 'org.apache.flink'
Contributor

@stevenzwu stevenzwu Oct 17, 2022


@nastra I assume this moved from line 124 above as this is more desired/precise style, right?

Contributor Author


yes, because otherwise, if the exclude is configured for the entire implementation configuration, org.apache.flink:flink-metrics-dropwizard will never make it into the runtime jar (because it is excluded).


configurations {
  implementation {
    exclude group: 'org.apache.flink'
Member


Nice catch.

Can we add a smoke test to the Flink runtime jar to catch these kinds of issues? Spark already has one for its runtime jar: https://github.com/apache/iceberg/blob/master/spark/v3.3/spark-runtime/src/integration/java/org/apache/iceberg/spark/SmokeTest.java

Contributor Author


good point, I added a smoke test that was failing/hanging without this fix and passes with it
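The smoke test in the PR exercises the runtime jar end to end; as a minimal illustration of the failure mode only (a hypothetical probe, not the actual test added here), one can check whether the missing dropwizard class is loadable from the classpath:

```java
// Hypothetical minimal classpath probe illustrating the failure mode
// (not the actual SmokeTest added in this PR): checks whether a class
// that should be bundled in the runtime jar can actually be loaded.
public class RuntimeJarSmokeCheck {

    // Returns true if the named class is loadable from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // com.codahale.metrics.Reservoir is the class whose absence caused
        // the NoClassDefFoundError above; on a bare JVM it is not present.
        System.out.println(isOnClasspath("com.codahale.metrics.Reservoir"));
        // A JDK class is always loadable, for contrast.
        System.out.println(isOnClasspath("java.lang.String"));
    }
}
```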

@nastra nastra force-pushed the flink-runtime-jar branch from dadac5e to d5a1db9 on October 18, 2022 08:11
inputs.file(shadowJar.archiveFile.get().asFile.path)
}
integrationTest.dependsOn shadowJar
check.dependsOn integrationTest
Contributor Author


the integration test takes less than 20 seconds in total, so it should be OK to run it as part of normal CI when the check task runs

@nastra nastra changed the title from "Flink: Fix NoClassDefFound with Flink runtime jar" to "Flink: Fix NoClassDefFound with Flink runtime jar / Add integration test" on Oct 18, 2022
@nastra
Contributor Author

nastra commented Oct 18, 2022

@stevenzwu could you re-review the latest changes that add an integration test (similar to the integration testing we do for Spark)?

@stevenzwu
Contributor

Thanks @nastra for fixing the problem and adding a smoke test for the runtime module. Thanks @hililiwei and @ajantha-bhat for reviewing.

@stevenzwu stevenzwu merged commit ef3bbe7 into apache:master Oct 18, 2022
@nastra nastra deleted the flink-runtime-jar branch October 19, 2022 06:16