Fix NPE in RemoteTaskRunner event handler causes JVM shutdown#9610
Fix NPE in RemoteTaskRunner event handler causes JVM shutdown#9610clintropolis merged 12 commits intoapache:masterfrom
Conversation
|
PR description says curator 4.3.0, but changes upgraded to 4.2.0? |
Good catch. It is suppose to be 4.3.0. Thanks |
|
This pull request introduces 2 alerts when merging 5110642 into b541996 - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging 8681768 into b541996 - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging 258bfe4 into b541996 - view on LGTM.com new alerts:
|
| <type>test-jar</type> | ||
| <scope>test</scope> | ||
| </dependency> | ||
| <dependency> |
There was a problem hiding this comment.
what is this change about, was this relying on curator providing it?
There was a problem hiding this comment.
Yes. curator 4.1.0 was providing both org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13 and org.codehaus.jackson:jackson-core-asl:jar:1.9.13. Curator 4.3.0 no longer provides those which then causes java.lang.NoClassDefFoundError: org/codehaus/jackson/map/AnnotationIntrospector.
There was a problem hiding this comment.
Hence, explicitly adding dependecy on those org.codehaus.jackson:jackson-mapper-asl:jar and org.codehaus.jackson:jackson-core-asl:jar
There was a problem hiding this comment.
Actually I think they can be scope test
There was a problem hiding this comment.
Changed to scope test.
There was a problem hiding this comment.
This was the dependency tree before the change in this PR
[INFO] | +- org.apache.curator:curator-x-discovery:jar:4.1.0:test
[INFO] | | - org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:test
[INFO] | | - org.codehaus.jackson:jackson-core-asl:jar:1.9.13:test
There was a problem hiding this comment.
If jackson-mapper-asl:1.9.13 is no longer in production code, can you check if its security vulnerability suppression can be removed? https://github.com/apache/druid/blob/master/owasp-dependency-check-suppressions.xml#L125
You can run the security vulnerability scan by running: mvn dependency-check:check
There was a problem hiding this comment.
I tried to run mvn dependency-check:check but it failed on druid-console. It also fails on druid-console even on master. Am I doing something wrong? I dont think I see these failures on the cron job
One or more dependencies were identified with known vulnerabilities in druid-console:
package.json (pkg:npm/acorn@6.3.0) : 1488
package.json (pkg:npm/kind-of@3.2.2) : 1490
package.json (pkg:npm/yargs-parser@13.1.1) : CVE-2020-7608
package.json (pkg:npm/minimist@0.0.10) : 1179
package.json (pkg:npm/yargs@7.1.0) : CVE-2020-7608
package.json (pkg:npm/yargs-parser@5.0.0) : CVE-2020-7608
package.json (pkg:npm/yargs-parser@10.1.0) : CVE-2020-7608
package.json (pkg:npm/yargs@12.0.5) : CVE-2020-7608
package.json (pkg:npm/yargs-parser@11.1.1) : CVE-2020-7608
There was a problem hiding this comment.
If you skip druid console, do all the other modules pass? If so, then the suppression for jackson-mapper-asl can be removed. We'll need to fix the other issues in different PRs.
There was a problem hiding this comment.
No. Other modules such as druid-orc-extensions and druid-hdfs-storage fails. This is because jackson-mapper-asl-1.9.13.jar is pulled in by org.apache.hadoop:hadoop-common:jar:2.8.5
| switch (event.getType()) { // lgtm [java/dereferenced-value-may-be-null] | ||
| case CHILD_ADDED: | ||
| case CHILD_UPDATED: | ||
| taskId = ZKPaths.getNodeFromPath(event.getData().getPath()); // lgtm [java/dereferenced-value-may-be-null] |
There was a problem hiding this comment.
Per the PR description/linked javadocs, can this and other usages of event.getData() in this switch statement be null? I guess we are relying on an NPE happening and then the catch checking and handling there? It might be nicer to explicitly check and throw a more useful error about the ChildData for the event unexpectedly being null so that the alerted exception is more useful
There was a problem hiding this comment.
Yes. event.getData() can be null here and everywhere else in this function. I dont think we need to do anything differently in handling the null from event.getData() so I just let NPE happens and the catch will take care of logging and making sure the JVM doesnt shutdown. You can deduce that the NPE from event.getData() happens from the data map of the alert and the Exception that was produced
There was a problem hiding this comment.
The childData itself is null. The only other data that we have is the eventType which we can always add to the alert? or only add to alert if childData is null
There was a problem hiding this comment.
Yes. event.getData() can be null here and everywhere else in this function. I dont think we need to do anything differently in handling the null from event.getData() so I just let NPE happens and the catch will take care of logging and making sure the JVM doesnt shutdown.
even.getData() can be null depending on the event type; it must not be null for CHILD_ADDED, CHILD_UPDATED, and CHILD_REMOVED events. It should be null otherwise. I think it's worth checking null depending on the event type.
There was a problem hiding this comment.
You can deduce that the NPE from event.getData() happens from the data map of the alert and the Exception that was produced
Ah yeah, I know I can deduce this, my point was more to be useful for operators who see an error or alert and don't want to dig through the code, which is probably most of them, that we could indicate friendlier messaging that an unexpected situation where the child data was null occurred on a zk event.
There was a problem hiding this comment.
- Confirmed that event cannot be null. Removed null check for the event.
- Added a null check blocks for event.getData() with different alert message than the existing message for catching any Exception. This should be more helpful to operators. Let me know if you have suggestion on different wording or other KV data to add to alert. This check is for the CHILD_ADDED, CHILD_UPDATED, and CHILD_REMOVED events which even.getData() must not be null.
|
This pull request introduces 1 alert when merging 0e62d3d into 4d277db - view on LGTM.com new alerts:
|
| <groupId>org.codehaus.jackson</groupId> | ||
| <artifactId>jackson-core-asl</artifactId> | ||
| <version>${codehaus.jackson.version}</version> | ||
| <scope>test</scope> |
There was a problem hiding this comment.
Please put these test dependencies with others below (at Line 133) together.
| catch (Exception e) { | ||
| String znode = null; | ||
| String eventType = null; | ||
| if (event != null) { |
There was a problem hiding this comment.
I don't think event can ever be null. Check out callers of PathChildrenCache.callListeners().
There was a problem hiding this comment.
event cannot be null. I added the check just in case upstream and/or dependency library changes. This is to prevent the JVM from shutting down. I removed it for now.
| switch (event.getType()) { // lgtm [java/dereferenced-value-may-be-null] | ||
| case CHILD_ADDED: | ||
| case CHILD_UPDATED: | ||
| taskId = ZKPaths.getNodeFromPath(event.getData().getPath()); // lgtm [java/dereferenced-value-may-be-null] |
There was a problem hiding this comment.
Yes. event.getData() can be null here and everywhere else in this function. I dont think we need to do anything differently in handling the null from event.getData() so I just let NPE happens and the catch will take care of logging and making sure the JVM doesnt shutdown.
even.getData() can be null depending on the event type; it must not be null for CHILD_ADDED, CHILD_UPDATED, and CHILD_REMOVED events. It should be null otherwise. I think it's worth checking null depending on the event type.
…#9610) * Fix NPE in RemoteTaskRunner event handler causes JVM shutdown * address comments * fix compile * fix checkstyle * fix lgtm * fix merge * fix test * fix tests * change scope * address comments * address comments
…#9641) * Fix NPE in RemoteTaskRunner event handler causes JVM shutdown * address comments * fix compile * fix checkstyle * fix lgtm * fix merge * fix test * fix tests * change scope * address comments * address comments
…#9610) * Fix NPE in RemoteTaskRunner event handler causes JVM shutdown * address comments * fix compile * fix checkstyle * fix lgtm * fix merge * fix test * fix tests * change scope * address comments * address comments
Fix NPE in RemoteTaskRunner event handler causes JVM shutdown
Description
An overlord can encountered the following NPE and subsequently triggered a JVM shutdown:
This is in the following code reproduced below:
This method needs a null check for
event.getData(), since data can be null: https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/cache/PathChildrenCacheEvent.htmlAlso, added null checks for other parameters just to be safe.
Also updated the version of Curator to 4.3.0. Note that Curator version 4.3.0 has the fix for #8177 which earlier (before 4.3.0 exist) prevent us upgrading Curator's version.
This PR has: