[Common] Reduce duration of locks when processing items held in ConcurrentOpenHashMaps #9764
Conversation
eolivelli left a comment
Overall I like this work and I am +1 on the new method.
But I am not sure we can touch so many sensitive parts of the codebase in one single patch.
We could break things without knowing it, and it would be hard to track down the cause.
I don't know what the best strategy is to incorporate this work safely.
That's a valid concern when a major change is made. However, the problems that the long-lived locks cause are nondeterministic and much harder to track down. One of the issues with the StampedLocks used by ConcurrentOpenHashMap is that a deadlock doesn't get detected by the JVM. I made a thread dump of an experiment where I caused a deadlock by modifying the …
Here's an example of a suspicious stack trace from one of the ReplicatorTest timeouts (full thread dump): whenever Policies get updated, this code gets executed: pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java, lines 1659 to 1682 at dbf29e9.
One can find the other possible participants of the deadlock by searching for "ConcurrentOpenHashMap" in the thread dump file. This type of issue stopped occurring in ReplicatorTest after applying the changes in this PR. When looking at the existing code in …
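For illustration, here is a minimal, self-contained sketch of the kind of interaction described above. This is not Pulsar code or the actual thread-dump scenario: the single-threaded executor stands in for ZooKeeper's response thread, and the keys/values are placeholders.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap;

public class ForEachDeadlockSketch {
    public static void main(String[] args) {
        ConcurrentOpenHashMap<String, String> map = new ConcurrentOpenHashMap<>();
        map.put("topic-1", "policies-v1");

        // Stand-in for ZooKeeper's single response/callback thread.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();

        // forEach holds the section's read lock while the processing function runs.
        map.forEach((topic, policies) -> {
            CompletableFuture<String> response = CompletableFuture.supplyAsync(() -> {
                // The event thread updates the same key (and therefore the same section)
                // before it can deliver the response; this put() blocks on the read lock
                // that forEach is still holding.
                map.put("topic-1", "policies-v2");
                return "done";
            }, eventThread);

            // Blocking call inside forEach: waits for the event thread, which in turn
            // waits for the section lock. Neither side can make progress, and because
            // StampedLock tracks no owner, the JVM's deadlock detection won't report it.
            response.join();
        });
    }
}
```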
Force-pushed from ff62064 to 576b13b
The primary goal of having these custom concurrent map implementations was actually to avoid memory allocations to the maximum possible extent :) Even if these allocations are short-lived, they still generate garbage that will fill up the new-gen space at a faster rate. That in turn will cause more "in-flight", reasonably short-lived objects to get caught in a collection event and copied over to the old-gen space, where they'll have to get compacted later on. On the class there are also already 2 methods, …
We shouldn't be doing blocking calls from within the process method. I think we should rather refactor that code instead.
Yes, it makes sense to avoid memory allocations in general, but here there could be a reason to make a compromise in order to make Pulsar more stable. In this case, the memory allocated in …
This is easier said than done. Refactoring the code in that way is a major undertaking, and it won't happen quickly. There's plenty of code that does blocking calls within … After reviewing how the current …

I have added a few signs of deadlocks from ReplicatorTest in the comments of this PR. All such issues go away after applying the changes in this PR. That's why I'm confident that this PR is useful.

In the past, I have investigated some problems in real environments where the symptoms observed in the thread dumps look similar to the thread dumps from stalled ReplicatorTests. A certain set of problems seems to happen only in Pulsar deployments where there is a large number of topics. When there are a lot of topics, the …

Perhaps we could evaluate each individual usage of …
/pulsarbot run-failure-checks
@lhotari
@eolivelli Why limit this to the cases that happen to show up? Matteo commented earlier: "We shouldn't be doing blocking calls from within the process method. I think we should rather refactor that code instead." Based on this rule, it can be determined from the code whether a certain usage of …
/pulsarbot run-failure-checks
This map is used in many places throughout the code base and, depending on the use case, there can be instances with a lot of entries or many instances with few entries. Also, the rate of scanning through the map could be quite high in some cases. In light of that, I'm not convinced that a model where we always defensively copy the entire map when we need to do a map scan is a good general option, or a good replacement in the current code where most places assumed the scanning to not be expensive (allocations-wise), even though there might be a few specific cases in which it's appropriate.

If we do have blocking in the code that processes the map items, and that cannot be easily removed, I believe a better alternative would be to make the scanning more flexible. For example, we're not expecting to do a scan over a consistent snapshot of the map (for the simple reason that the map is broken down into independent sections). We only need to maintain the consistency between a key and its associated value. With that in mind, we don't strictly need to hold a read lock for the entire duration of scanning a whole section.
Another alternative could be to measure the time spent in the process function and decide, based on that, whether to make a copy or keep the read lock during the scan.
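For what it's worth, the per-entry-lock idea mentioned above could look roughly like the sketch below. This is not Pulsar code: the class and the interleaved key/value array layout are assumptions for illustration. The only point is the locking pattern: take the read lock just long enough to copy one key/value pair, then run the possibly blocking processor with no lock held.

```java
import java.util.concurrent.locks.StampedLock;
import java.util.function.BiConsumer;

class SectionScanSketch<K, V> {
    private final Object[] table;                 // [k0, v0, k1, v1, ...] (assumed layout)
    private final StampedLock lock = new StampedLock();

    SectionScanSketch(int capacity) {
        this.table = new Object[2 * capacity];
    }

    @SuppressWarnings("unchecked")
    void forEachOutsideLock(BiConsumer<? super K, ? super V> processor) {
        for (int bucket = 0; bucket < table.length; bucket += 2) {
            K key;
            V value;
            long stamp = lock.readLock();         // held only while copying one pair
            try {
                key = (K) table[bucket];
                value = (V) table[bucket + 1];
            } finally {
                lock.unlockRead(stamp);
            }
            // No lock is held here, so a blocking processor no longer stalls writers.
            // Tradeoff: the entry may have been removed or moved by a rehash before
            // it is processed.
            if (key != null && value != null) {
                processor.accept(key, value);
            }
        }
    }
}
```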
That's the reason for the previous suggestion that we evaluate each usage of …
It's not really a defensive copy of the entire map. The copy is done one section at a time. Memory allocation and GC are extremely efficient in modern JVMs. Since it's about 28 bytes per entry, there would have to be millions of entries for the memory allocation overhead to have any noticeable significance. If necessary, this could be benchmarked.
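As a back-of-the-envelope check of that figure (the ~28 bytes per entry is taken from the comment above, not measured here; the entry count is an illustrative assumption):

```java
public class SnapshotAllocationEstimate {
    public static void main(String[] args) {
        long entries = 1_000_000L;                    // assumed: one very large map
        long bytesPerEntry = 28L;                     // figure quoted in the discussion
        long bytesPerScan = entries * bytesPerEntry;  // 28,000,000 bytes
        System.out.printf("~%.0f MB of short-lived garbage per full scan%n",
                bytesPerScan / (1024.0 * 1024.0));    // prints "~27 MB"
    }
}
```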
We do have I/O operations (interactions with ZooKeeper) in most of the code that processes the map items, and that cannot be easily removed.
The … Improving Pulsar stability is a high priority and this should be addressed. Please don't take this as me claiming that this PR resolves all such issues. I know that it will only resolve the issues that are caused by a deadlock which could be avoided by not locking at all when processing items of a ConcurrentOpenHashMap. It would be interesting if @Vanlightly could take a look from a formal verification / modelling perspective at what would be practical in this case. @merlimat WDYT?
@lhotari … if we execute the loop over the sections always in the same order and keep/release the locks in the appropriate order?
Not an easy question. I don't have an answer to that. Perhaps @Vanlightly could help us?
I see now, this is a case of deadlock: we are issuing a blocking call to ZooKeeper inside a forEach, and the ZooKeeper library handles every operation in only a single thread.
I've not spent much time in the Pulsar codebase, so I don't know much about the wider context. But if it helps, I could model this ConcurrentOpenHashMap in TLA+ and see if it has any deadlock scenarios. I couldn't give any commitment on timing though.
By defensive, I mean that we'd be taking a copy of the entire map (section by section) each time we need to traverse it, even though there would be no need to do it (e.g. when a section is not being modified). In the case of a map which is not updated super frequently, that would impose a sizeable overhead.
I've been battling GC and allocations for many years now, and while that might be true to a certain extent, there are also many bad GC behaviors that we have fixed by making changes like switching from the regular …
The problem is that there can easily be millions of entries (aggregated from hundreds of thousands of maps) on a single instance. That has been the case in many production deployments for many years.
Sure, but we should take care of them (e.g. during the refactoring to port these components to use MetadataStore instead of direct ZK access). Also, I was suggesting that we could do it such that we automatically switch to individual locks (instead of the whole-section lock) if the processing is taking more than X amount of time.
While it's clear that a slow scan will provoke a spike in write latency on the map, I'm still not clear on the exact deadlock case here. Typically the way to get deadlocked on the map is if, while doing a scan, you're trying to access it again from a different thread (or even the same thread, since the lock is non-reentrant) and that blocks the processing function.
@eolivelli Even if we do a blocking ZK call (which we shouldn't), that doesn't necessarily mean that there will be a deadlock. It will only be the case if the ZK callback thread is blocked trying to read/write on the same map. Do you have the other threads' stacks?

To conclude, in my opinion the easiest, lowest-risk, lowest-overhead change here would be to switch the … That will ensure that the process function is always called without holding a lock on the map. It can lead to spurious inconsistencies during re-hashing, but I don't think that will be a problem. And anyway, that can be solved by requesting to behave like the current implementation for these specific cases.
Scratch that. It's actually possible to do it with the current concurrency level, no funky stuff.
Thank you for the detailed feedback @merlimat. This makes sense. I believe that it would solve most of the problems. I'll revisit this PR with that approach.
I have a WIP solution already. I need to look at it a bit more to see if I'm missing anything, and then I'll create a PR.
Ok great. I'm looking forward to that. |
I didn't paste the full dump of the competing stack traces, sorry.
@eolivelli @lhotari Created #9787 |
Motivation
Currently, ConcurrentOpenHashMap.forEach keeps a read lock in each section for the duration of the processing. This can lead to various issues when I/O operations are done in the processing function, which is a common pattern in the Pulsar code base. The I/O usually goes to ZooKeeper, and it leads to threads piling up waiting for locks in ConcurrentOpenHashMap when there are both reads and writes for the same section in concurrently executing threads. It is also possible that long-lived locks lead to deadlocks.
This is the current implementation of ConcurrentOpenHashMap.forEach (pulsar/pulsar-common/src/main/java/org/apache/pulsar/common/util/collections/ConcurrentOpenHashMap.java, lines 157 to 161 at 34ca893)
and the referenced Section.forEach (pulsar/pulsar-common/src/main/java/org/apache/pulsar/common/util/collections/ConcurrentOpenHashMap.java, lines 356 to 395 at 34ca893).
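Since the embedded snippets don't render here, the pattern those lines implement can be approximated as follows. This is a simplified paraphrase, not the literal source, and the interleaved key/value array layout is assumed for illustration; the essential point is that the section's read lock stays held while the user-supplied processor runs.

```java
import java.util.concurrent.locks.StampedLock;
import java.util.function.BiConsumer;

class LockedSectionScanSketch<K, V> {
    private final Object[] table;               // [k0, v0, k1, v1, ...] (assumed layout)
    private final StampedLock lock = new StampedLock();

    LockedSectionScanSketch(int capacity) {
        this.table = new Object[2 * capacity];
    }

    @SuppressWarnings("unchecked")
    void forEach(BiConsumer<? super K, ? super V> processor) {
        long stamp = lock.readLock();           // held for the entire scan of the section
        try {
            for (int bucket = 0; bucket < table.length; bucket += 2) {
                K key = (K) table[bucket];
                V value = (V) table[bucket + 1];
                if (key != null && value != null) {
                    processor.accept(key, value);   // any blocking here also blocks writers
                }
            }
        } finally {
            lock.unlockRead(stamp);
        }
    }
}
```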
Modifications
- Add ConcurrentOpenHashMap.forEachInSnapshot, which makes a copy of the entries in each section before processing the entries.
- Replace forEach with forEachInSnapshot in the affected call sites.

There are tradeoffs involved. Some memory allocations have to be made for the copy. However, allocations aren't a real issue: JVMs are really efficient with such short-lived allocations, and the allocations don't cause new bottlenecks or performance issues.
The other tradeoff is related to data consistency. An entry might already have been removed from the collection at the time it is processed. This is also a non-issue in most cases, since it would be a bad solution in the first place to determine state based on membership in the collection.
Since the previous forEach implementation remains, it's possible to choose between forEach and forEachInSnapshot in each specific use case. If this flexibility isn't needed, it is possible to replace the forEach implementation with the snapshotting implementation completely.
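To make the tradeoff concrete, here is a hedged sketch of the snapshotting idea expressed against the public API only. This helper is illustrative and is not the forEachInSnapshot implementation added by this PR, which copies one section at a time internally.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap;

final class SnapshotScanSketch {

    // Copy the entries first (each section's read lock is held only long enough to
    // append to the list), then run the potentially blocking processor with no lock held.
    static <K, V> void forEachInSnapshotSketch(ConcurrentOpenHashMap<K, V> map,
                                               BiConsumer<? super K, ? super V> processor) {
        List<Map.Entry<K, V>> snapshot = new ArrayList<>((int) map.size());
        map.forEach((k, v) -> snapshot.add(new AbstractMap.SimpleImmutableEntry<>(k, v)));
        // All section locks are released here; an entry may already have been removed
        // or replaced in the live map by the time it is processed.
        for (Map.Entry<K, V> e : snapshot) {
            processor.accept(e.getKey(), e.getValue());
        }
    }

    private SnapshotScanSketch() {}
}
```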