KAFKA-7136: Avoid deadlocks in synchronized metrics reporters#5341
KAFKA-7136: Avoid deadlocks in synchronized metrics reporters#5341guozhangwang merged 2 commits intoapache:trunkfrom
Conversation
junrao
left a comment
There was a problem hiding this comment.
@rajinisivaram : Thanks for the patch. LGTM. Just a couple of minor comments below.
| op.run(); | ||
| } | ||
| } catch (Throwable t) { | ||
| log.error("Metric {} failed with exception", opName, t); |
There was a problem hiding this comment.
If we get an exception, should we fail the test?
There was a problem hiding this comment.
The test fails if there is an exception since it verifies that the task is still running.
| * </p><p> | ||
| * Locking order (assume all MetricsReporter methods may be synchronized): | ||
| * <ul> | ||
| * <li>Sensor#add: Sensor -> Metrics -> MetricsReporter</li> |
There was a problem hiding this comment.
It would be useful to document that the Sensor object itself is used as a lock to protect the access to stats and metrics.
There was a problem hiding this comment.
@junrao Thanks for the review, I have added that to the comment.
guozhangwang
left a comment
There was a problem hiding this comment.
One nit comment, otherwise LGTM.
| */ | ||
| private Object metricLock(Stat stat) { | ||
| return this; | ||
| private Object metricLock() { |
There was a problem hiding this comment.
nit: do we still need this function? Could we just reference the metricLock object directly? Since it is private my understanding is that it was not intended to be used outside this class.
There was a problem hiding this comment.
@guozhangwang Thanks for the review. I left the method in since it gives a good place to document :-) I think the method would get optimized away at runtime anyway. I don't mind changing it if you think it may be confusing.
There was a problem hiding this comment.
We can document the field too, right? If there's no reason to have a method, I'd remove it.
We need to use the same lock for metric update and read to avoid NPE and concurrent modification exceptions. Sensor add/remove/update are synchronized on Sensor since they access lists and maps that are not thread-safe. Reporters are notified of metrics add/remove while holding (Sensor, Metrics) locks and reporters may synchronize on the reporter lock. Metric read may be invoked by metrics reporters while holding a reporter lock. So read/update cannot be synchronized using Sensor since that could lead to deadlock. This PR introduces a new lock in Sensor for update/read. Locking order: - Sensor#add: Sensor -> Metrics -> MetricsReporter - Metrics#removeSensor: Sensor -> Metrics -> MetricsReporter - KafkaMetric#metricValue: MetricsReporter -> Sensor#metricLock - Sensor#record: Sensor -> Sensor#metricLock Reviewers: Jun Rao <junrao@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
We need to use the same lock for metric update and read to avoid NPE and concurrent modification exceptions. Sensor add/remove/update are synchronized on Sensor since they access lists and maps that are not thread-safe. Reporters are notified of metrics add/remove while holding (Sensor, Metrics) locks and reporters may synchronize on the reporter lock. Metric read may be invoked by metrics reporters while holding a reporter lock. So read/update cannot be synchronized using Sensor since that could lead to deadlock. This PR introduces a new lock in Sensor for update/read. Locking order: - Sensor#add: Sensor -> Metrics -> MetricsReporter - Metrics#removeSensor: Sensor -> Metrics -> MetricsReporter - KafkaMetric#metricValue: MetricsReporter -> Sensor#metricLock - Sensor#record: Sensor -> Sensor#metricLock Reviewers: Jun Rao <junrao@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
guozhangwang
left a comment
There was a problem hiding this comment.
Cherry-picked to 1.1 and 2.0.
This was broken when picking #5341 into 1.1. Reviewers: Guozhang Wang <wangguoz@gmail.com>
…:1.1.1-sync to 1.1-nflx * commit '9611672e287c1a7933a78590e3f381da2ae7d136': (57 commits) MINOR: increase dev version from 1.1.1-SNAPSHOT to 1.1.2-SNAPSHOT (apache#5409) MINOR: Add thread dumps if broker node cannot be stopped (apache#5373) MINOR: update release.py MINOR: fix upgrade docs for Streams (apache#5392) MINOR: improve docs version numbers (apache#5372) Update version on the branch to 1.1.2-SNAPSHOT KAFKA-6292; Improve FileLogInputStream batch position checks to avoid type overflow (apache#4928) HOTFIX: Fix checkstyle errors in MetricsTest (apache#5345) KAFKA-7136: Avoid deadlocks in synchronized metrics reporters (apache#5341) MINOR: Close timing window in SimpleAclAuthorizer startup (apache#5318) MINOR: Use kill_java_processes when killing ConsoleConsumer in system tests (apache#5297) KAFKA-7104: More consistent leader's state in fetch response (apache#5305) Revert "MINOR: Avoid coarse lock in Pool#getAndMaybePut (apache#5258)" MINOR: Avoid coarse lock in Pool#getAndMaybePut (apache#5258) MINOR: bugfix streams total metrics (apache#5277) KAFKA-7082: Concurrent create topics may throw NodeExistsException (apache#5259) MINOR: Upgrade to Gradle 4.8.1 KAFKA-7012: Don't process SSL channels without data to process (apache#5237) KAFKA-7058: Comparing schema default values using Objects#deepEquals() KAFKA-7047: Added SimpleHeaderConverter to plugin isolation whitelist ...
We need to use the same lock for metric update and read to avoid NPE and concurrent modification exceptions. Sensor add/remove/update are synchronized on
Sensorsince they access lists and maps that are not thread-safe. Reporters are notified of metrics add/remove while holding (Sensor,Metrics) locks and reporters may synchronize on the reporter lock. Metric read may be invoked by metrics reporters while holding a reporter lock. So read/update cannot be synchronized usingSensorsince that could lead to deadlock. This PR introduces a new lock in Sensor for update/read.Locking order:
Committer Checklist (excluded from commit message)