Skip to content

Fix NPE in RemoteTaskRunner event handler causes JVM shutdown#9610

Merged
clintropolis merged 12 commits intoapache:masterfrom
maytasm:IMPLY-2589
Apr 7, 2020
Merged

Fix NPE in RemoteTaskRunner event handler causes JVM shutdown#9610
clintropolis merged 12 commits intoapache:masterfrom
maytasm:IMPLY-2589

Conversation

@maytasm
Copy link
Copy Markdown
Contributor

@maytasm maytasm commented Apr 3, 2020

Fix NPE in RemoteTaskRunner event handler causes JVM shutdown

Description

An overlord can encountered the following NPE and subsequently triggered a JVM shutdown:

java.lang.NullPointerException: null
at org.apache.druid.indexing.overlord.RemoteTaskRunner.lambda$addWorker$4(RemoteTaskRunner.java:1070) 

This is in the following code reproduced below:

catch (Exception e) {
log.makeAlert(e, "Failed to handle new worker status")
.addData("worker", zkWorker.getWorker().getHost())
.addData("znode", event.getData().getPath())
.emit();
}

This method needs a null check for event.getData(), since data can be null: https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/cache/PathChildrenCacheEvent.html
Also, added null checks for other parameters just to be safe.

Also updated the version of Curator to 4.3.0. Note that Curator version 4.3.0 has the fix for #8177 which earlier (before 4.3.0 exist) prevent us upgrading Curator's version.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

@ccaominh
Copy link
Copy Markdown
Contributor

ccaominh commented Apr 3, 2020

PR description says curator 4.3.0, but changes upgraded to 4.2.0?

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Apr 3, 2020

PR description says curator 4.3.0, but changes upgraded to 4.2.0?

Good catch. It is suppose to be 4.3.0. Thanks

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Apr 3, 2020

This pull request introduces 2 alerts when merging 5110642 into b541996 - view on LGTM.com

new alerts:

  • 2 for Dereferenced variable may be null

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Apr 4, 2020

This pull request introduces 1 alert when merging 8681768 into b541996 - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Apr 4, 2020

This pull request introduces 1 alert when merging 258bfe4 into b541996 - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this change about, was this relying on curator providing it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. curator 4.1.0 was providing both org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13 and org.codehaus.jackson:jackson-core-asl:jar:1.9.13. Curator 4.3.0 no longer provides those which then causes java.lang.NoClassDefFoundError: org/codehaus/jackson/map/AnnotationIntrospector.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hence, explicitly adding dependecy on those org.codehaus.jackson:jackson-mapper-asl:jar and org.codehaus.jackson:jackson-core-asl:jar

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think they can be scope test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to scope test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the dependency tree before the change in this PR
[INFO] | +- org.apache.curator:curator-x-discovery:jar:4.1.0:test
[INFO] | | - org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:test
[INFO] | | - org.codehaus.jackson:jackson-core-asl:jar:1.9.13:test

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If jackson-mapper-asl:1.9.13 is no longer in production code, can you check if its security vulnerability suppression can be removed? https://github.com/apache/druid/blob/master/owasp-dependency-check-suppressions.xml#L125

You can run the security vulnerability scan by running: mvn dependency-check:check

Copy link
Copy Markdown
Contributor Author

@maytasm maytasm Apr 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to run mvn dependency-check:check but it failed on druid-console. It also fails on druid-console even on master. Am I doing something wrong? I dont think I see these failures on the cron job

One or more dependencies were identified with known vulnerabilities in druid-console:

package.json (pkg:npm/acorn@6.3.0) : 1488
package.json (pkg:npm/kind-of@3.2.2) : 1490
package.json (pkg:npm/yargs-parser@13.1.1) : CVE-2020-7608
package.json (pkg:npm/minimist@0.0.10) : 1179
package.json (pkg:npm/yargs@7.1.0) : CVE-2020-7608
package.json (pkg:npm/yargs-parser@5.0.0) : CVE-2020-7608
package.json (pkg:npm/yargs-parser@10.1.0) : CVE-2020-7608
package.json (pkg:npm/yargs@12.0.5) : CVE-2020-7608
package.json (pkg:npm/yargs-parser@11.1.1) : CVE-2020-7608

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you skip druid console, do all the other modules pass? If so, then the suppression for jackson-mapper-asl can be removed. We'll need to fix the other issues in different PRs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Other modules such as druid-orc-extensions and druid-hdfs-storage fails. This is because jackson-mapper-asl-1.9.13.jar is pulled in by org.apache.hadoop:hadoop-common:jar:2.8.5

switch (event.getType()) { // lgtm [java/dereferenced-value-may-be-null]
case CHILD_ADDED:
case CHILD_UPDATED:
taskId = ZKPaths.getNodeFromPath(event.getData().getPath()); // lgtm [java/dereferenced-value-may-be-null]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the PR description/linked javadocs, can this and other usages of event.getData() in this switch statement be null? I guess we are relying on an NPE happening and then the catch checking and handling there? It might be nicer to explicitly check and throw a more useful error about the ChildData for the event unexpectedly being null so that the alerted exception is more useful

Copy link
Copy Markdown
Contributor Author

@maytasm maytasm Apr 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. event.getData() can be null here and everywhere else in this function. I dont think we need to do anything differently in handling the null from event.getData() so I just let NPE happens and the catch will take care of logging and making sure the JVM doesnt shutdown. You can deduce that the NPE from event.getData() happens from the data map of the alert and the Exception that was produced

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The childData itself is null. The only other data that we have is the eventType which we can always add to the alert? or only add to alert if childData is null

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. event.getData() can be null here and everywhere else in this function. I dont think we need to do anything differently in handling the null from event.getData() so I just let NPE happens and the catch will take care of logging and making sure the JVM doesnt shutdown.

even.getData() can be null depending on the event type; it must not be null for CHILD_ADDED, CHILD_UPDATED, and CHILD_REMOVED events. It should be null otherwise. I think it's worth checking null depending on the event type.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can deduce that the NPE from event.getData() happens from the data map of the alert and the Exception that was produced

Ah yeah, I know I can deduce this, my point was more to be useful for operators who see an error or alert and don't want to dig through the code, which is probably most of them, that we could indicate friendlier messaging that an unexpected situation where the child data was null occurred on a zk event.

Copy link
Copy Markdown
Contributor Author

@maytasm maytasm Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Confirmed that event cannot be null. Removed null check for the event.
  • Added a null check blocks for event.getData() with different alert message than the existing message for catching any Exception. This should be more helpful to operators. Let me know if you have suggestion on different wording or other KV data to add to alert. This check is for the CHILD_ADDED, CHILD_UPDATED, and CHILD_REMOVED events which even.getData() must not be null.

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Apr 4, 2020

This pull request introduces 1 alert when merging 0e62d3d into 4d277db - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>${codehaus.jackson.version}</version>
<scope>test</scope>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put these test dependencies with others below (at Line 133) together.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

catch (Exception e) {
String znode = null;
String eventType = null;
if (event != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think event can ever be null. Check out callers of PathChildrenCache.callListeners().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event cannot be null. I added the check just in case upstream and/or dependency library changes. This is to prevent the JVM from shutting down. I removed it for now.

switch (event.getType()) { // lgtm [java/dereferenced-value-may-be-null]
case CHILD_ADDED:
case CHILD_UPDATED:
taskId = ZKPaths.getNodeFromPath(event.getData().getPath()); // lgtm [java/dereferenced-value-may-be-null]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. event.getData() can be null here and everywhere else in this function. I dont think we need to do anything differently in handling the null from event.getData() so I just let NPE happens and the catch will take care of logging and making sure the JVM doesnt shutdown.

even.getData() can be null depending on the event type; it must not be null for CHILD_ADDED, CHILD_UPDATED, and CHILD_REMOVED events. It should be null otherwise. I think it's worth checking null depending on the event type.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 after CI

@clintropolis clintropolis merged commit b95a1b9 into apache:master Apr 7, 2020
maytasm added a commit to maytasm/druid that referenced this pull request Apr 7, 2020
…#9610)

* Fix NPE in RemoteTaskRunner event handler causes JVM shutdown

* address comments

* fix compile

* fix checkstyle

* fix lgtm

* fix merge

* fix test

* fix tests

* change scope

* address comments

* address comments
fjy pushed a commit that referenced this pull request Apr 8, 2020
…#9641)

* Fix NPE in RemoteTaskRunner event handler causes JVM shutdown

* address comments

* fix compile

* fix checkstyle

* fix lgtm

* fix merge

* fix test

* fix tests

* change scope

* address comments

* address comments
@jihoonson jihoonson added this to the 0.18.0 milestone Apr 21, 2020
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jun 12, 2020
…#9610)

* Fix NPE in RemoteTaskRunner event handler causes JVM shutdown

* address comments

* fix compile

* fix checkstyle

* fix lgtm

* fix merge

* fix test

* fix tests

* change scope

* address comments

* address comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants