Fix Huge Number of Watches in ZooKeeper #17482

Merged: kgyrtkirk merged 53 commits into apache:master from GWphua:zk-fix on May 8, 2025

@GWphua GWphua commented Nov 15, 2024

Fixes #6647

Description

This PR is built upon #6683 and #9172 and aims to reduce the number of ZooKeeper watch counts.

Fixed Huge Number of Watches in ZooKeeper

The current Announcer.java uses Apache Curator's PathChildrenCache. In its present form, the announcement mechanism watches the immediate parent of the specified path. As a result, the ZooKeeper ensemble monitors every child node under that parent, including sibling nodes and children of the specified path, which produces an unnecessarily large number of ZooKeeper watches.

The new NodeAnnouncer.java class is essentially Announcer.java, but it uses NodeCache to watch a single node during announcement. By eliminating the watches on child nodes, this approach significantly reduces the total watch count in ZooKeeper. Users can opt in to the new NodeAnnouncer by setting the feature flag druid.zk.service.pathChildrenCacheStrategy=false.
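The difference between the two strategies can be illustrated with a toy model. This is a minimal sketch, not Curator or Druid code: the class `WatchCountSketch`, its methods, and the example paths are hypothetical, and the model only counts which znodes end up watched (the parent and its direct children), not the exact watches ZooKeeper registers.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model (assumption): counts watched znodes per strategy. Not Curator code. */
final class WatchCountSketch
{
  /** Returns the parent of a znode path, e.g. "/a/b" gives "/a". */
  static String parentOf(String path)
  {
    int idx = path.lastIndexOf('/');
    return idx <= 0 ? "/" : path.substring(0, idx);
  }

  /** PathChildrenCache-style: the parent of each announced path plus every child of that parent. */
  static Set<String> watchedByPathChildren(Map<String, List<String>> tree, List<String> announced)
  {
    Set<String> watched = new LinkedHashSet<>();
    for (String path : announced) {
      String parent = parentOf(path);
      watched.add(parent);
      String prefix = parent.equals("/") ? "" : parent;
      for (String child : tree.getOrDefault(parent, List.of())) {
        watched.add(prefix + "/" + child);
      }
    }
    return watched;
  }

  /** NodeCache-style: only the announced paths themselves. */
  static Set<String> watchedByNode(List<String> announced)
  {
    return new LinkedHashSet<>(announced);
  }

  public static void main(String[] args)
  {
    // Hypothetical tree: one parent with four children, of which this process announces one.
    Map<String, List<String>> tree =
        Map.of("/druid/announcements", List.of("host1", "host2", "host3", "host4"));
    List<String> announced = List.of("/druid/announcements/host1");

    System.out.println("PathChildrenCache-style: " + watchedByPathChildren(tree, announced).size());
    System.out.println("NodeCache-style: " + watchedByNode(announced).size());
  }
}
```

In this hypothetical tree the PathChildrenCache-style count is 5 (the parent plus four children) and the NodeCache-style count is 1. The gap grows with the number of siblings under the shared parent.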

Tests conducted on the production server also indicate a decrease in watch counts resulting from this change.

[Figure: ZK Watch Count]

Note:
Using the two announcer classes simultaneously may result in a KeeperException.NotEmptyException. This happens when two nodes share the same parent: because neither announcer has a full picture of the nodes under that parent, the exception is thrown when the following occurs:

  1. PathChildrenAnnouncer removes all of its tracked child nodes.
  2. Assuming the parent node has no children left after the removal, PathChildrenAnnouncer tries to remove the parent node.
  3. If NodeAnnouncer is still watching one or more child nodes, the attempt by PathChildrenAnnouncer to remove the parent node results in the exception.
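The three steps above can be reproduced in a small in-memory model. This is a sketch under stated assumptions, not ZooKeeper or Curator code: `SharedParentSketch`, `ParentNode`, `Announcer`, and the local `NotEmptyException` are hypothetical stand-ins, and the only behavior borrowed from ZooKeeper is that a node with children cannot be deleted.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy in-memory model (assumption) of two announcers sharing one parent znode. */
final class SharedParentSketch
{
  /** Stand-in for KeeperException.NotEmptyException. */
  static final class NotEmptyException extends RuntimeException
  {
    NotEmptyException(String path)
    {
      super("Node not empty: " + path);
    }
  }

  /** A parent node that refuses deletion while it still has children. */
  static final class ParentNode
  {
    final String path;
    final Set<String> children = new HashSet<>();

    ParentNode(String path)
    {
      this.path = path;
    }

    void delete()
    {
      if (!children.isEmpty()) {
        throw new NotEmptyException(path);
      }
    }
  }

  /** An announcer that only knows about the children it created itself. */
  static final class Announcer
  {
    private final ParentNode parent;
    private final Set<String> tracked = new HashSet<>();

    Announcer(ParentNode parent)
    {
      this.parent = parent;
    }

    void announce(String child)
    {
      parent.children.add(child);
      tracked.add(child);
    }

    /** Steps 1 and 2: remove tracked children, then assume the parent is empty and delete it. */
    void unannounceAllAndRemoveParent()
    {
      parent.children.removeAll(tracked);
      tracked.clear();
      parent.delete();
    }
  }

  /** Returns true if removing the parent fails because another announcer still has a child there. */
  static boolean removalFails(boolean secondAnnouncerActive)
  {
    ParentNode parent = new ParentNode("/druid/example/parent"); // hypothetical path
    Announcer pathChildrenStyle = new Announcer(parent);
    Announcer nodeStyle = new Announcer(parent);

    pathChildrenStyle.announce("a");
    if (secondAnnouncerActive) {
      nodeStyle.announce("b"); // step 3: the other announcer still owns a child
    }
    try {
      pathChildrenStyle.unannounceAllAndRemoveParent();
      return false;
    }
    catch (NotEmptyException e) {
      return true;
    }
  }

  public static void main(String[] args)
  {
    System.out.println("single announcer fails: " + removalFails(false));
    System.out.println("mixed announcers fail: " + removalFails(true));
  }
}
```

In this model the failure only appears when both announcers hold children under the same parent, which matches the scenario in the note.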

Documentation

  • Remove humor in error logs.
  • Add JavaDocs and comments within code to better describe the process.
  • Add documentation on how to enable NodeAnnouncer.

Refactoring

  • Rename Announcer to PathChildrenAnnouncer
  • Shift Announceable class out of PathChildrenAnnouncer.
  • Add ServiceAnnouncer interface to facilitate dependency injection for different flavours of caching strategies.
  • Refactor long methods by creating helper functions.
  • Add ZKPathsUtils.java to abstract the retrieval of ZooKeeper path and ZooKeeper node.
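How a shared interface lets a flag pick between the two announcers can be sketched as follows. This is an illustration only: the interface methods shown are a simplified assumption rather than the PR's actual `ServiceAnnouncer` signature, the nested classes are empty stand-ins, and plain `java.util.Properties` replaces the Guice and `CuratorConfig` wiring that the PR uses. The default of `true` follows the documented default for `druid.zk.service.pathChildrenCacheStrategy`.

```java
import java.util.Properties;

/** Sketch (assumption): choosing an announcer implementation from the feature flag. */
final class AnnouncerSelectionSketch
{
  /** Simplified, hypothetical stand-in for the PR's ServiceAnnouncer interface. */
  interface ServiceAnnouncer
  {
    void announce(String path, byte[] payload);

    void unannounce(String path);
  }

  /** Stand-in for the PathChildrenCache-based announcer; bodies intentionally empty. */
  static final class PathChildrenAnnouncer implements ServiceAnnouncer
  {
    @Override
    public void announce(String path, byte[] payload)
    {
    }

    @Override
    public void unannounce(String path)
    {
    }
  }

  /** Stand-in for the NodeCache-based announcer; bodies intentionally empty. */
  static final class NodeAnnouncer implements ServiceAnnouncer
  {
    @Override
    public void announce(String path, byte[] payload)
    {
    }

    @Override
    public void unannounce(String path)
    {
    }
  }

  static final String FLAG = "druid.zk.service.pathChildrenCacheStrategy";

  /** Returns the PathChildrenCache-style announcer unless the flag is explicitly set to false. */
  static ServiceAnnouncer select(Properties props)
  {
    boolean usePathChildren = Boolean.parseBoolean(props.getProperty(FLAG, "true"));
    return usePathChildren ? new PathChildrenAnnouncer() : new NodeAnnouncer();
  }

  public static void main(String[] args)
  {
    Properties props = new Properties();
    System.out.println(select(props).getClass().getSimpleName());

    props.setProperty(FLAG, "false");
    System.out.println(select(props).getClass().getSimpleName());
  }
}
```

Callers depend only on the interface, so switching strategies is a configuration change rather than a code change.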

Release note

New: A new opt-in caching strategy is provided that uses a much smaller number of ZooKeeper watches for service announcement.


Key changed/added classes in this PR
  • Announcer.java -> PathChildrenAnnouncer.java
  • ServiceAnnouncer.java
  • NodeAnnouncer.java
  • Announceable.java
  • AnnouncerModule.java
  • CuratorConfig.java
  • DirectExecutorAnnouncer & SingleThreadedAnnouncer annotations for Guice.
  • docs/configuration/index.md & .spelling for docs.
  • Related test files.

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@GWphua GWphua marked this pull request as draft November 21, 2024 01:24

GWphua commented Mar 17, 2025

Hi @kgyrtkirk,

I have made changes according to your suggestions.
I also took @cryptoe's suggestion and created a feature flag that lets users choose between the two flavours of announcers currently available. The default is the old PathChildrenCache announcer, and users can opt in to the new NodeCache announcer if they face problems with a huge ZooKeeper watch count.

The PR description is edited accordingly.

@GWphua GWphua requested a review from kgyrtkirk April 10, 2025 02:55
@kgyrtkirk
Member

Sorry for forgetting about this; I'll reply today.

@kgyrtkirk kgyrtkirk left a comment

looks good to me


@LifecycleStart
@Override
public void start()
Member

NodeAnnouncer and PathChildrenAnnouncer share a lot of common pieces; I wonder whether these could be placed into a common abstract class, or whether the old approach should be removed once the new cache is proven to work correctly?

Contributor Author

I do not mind either choice; I can try to make an abstract class by next week.

@GWphua GWphua Apr 20, 2025

Hi @kgyrtkirk, I have taken a look at making an abstract class. It is true that there are a lot of shared methods, but making the change is not trivial. I am considering this instead:

  1. We give NodeAnnouncer some time in use. If it proves to perform better, we can simply delete PathChildrenAnnouncer.
  2. If using NodeCache turns out to trade CPU for memory, a further PR can move PathChildrenAnnouncer off the deprecated PathChildrenCache and onto CuratorCache. We can decide to make an abstract class then, because some of the complexities would go away (for example, listeners of type ConcurrentMap<String, PathChildrenCache> would become ConcurrentMap<String, CuratorCache>, which can be shared with NodeAnnouncer). This seems to be the direction Apache Curator is taking with its caches.

|`druid.zk.service.connectionTimeoutMs`|ZooKeeper connection timeout, in milliseconds.|`15000`|
|`druid.zk.service.compress`|Boolean flag for whether or not created Znodes should be compressed.|`true`|
|`druid.zk.service.acl`|Boolean flag for whether or not to enable ACL security for ZooKeeper. If ACL is enabled, zNode creators will have all permissions.|`false`|
|`druid.zk.service.pathChildrenCacheStrategy`|Dictates the underlying caching strategy for service announcements. Set to true to make announcers use Apache Curator's PathChildrenCache strategy; otherwise they use the NodeCache strategy. Consider the NodeCache strategy when your cluster has a huge number of ZooKeeper watches.|`true`|
Member

I wonder why not make the NodeCache approach the default as that should work better - but retain the old approach if some issue happens?

cc: @cryptoe

@FrankChen021 FrankChen021 requested a review from Copilot April 21, 2025 02:16
Copilot AI left a comment

Pull Request Overview

This PR reduces the number of ZooKeeper watches by introducing a new NodeAnnouncer that leverages Curator’s NodeCache instead of the PathChildrenCache, along with refactoring the announcement classes and their dependency injections.

  • Replaces Announcer with NodeAnnouncer (and conditionally with PathChildrenAnnouncer based on a config flag).
  • Introduces a new ServiceAnnouncer interface, refactors tests and updates configuration accordingly.
  • Updates related documentation and dependency injection modules.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| server/src/test/java/org/apache/druid/server/coordination/coordination/BatchDataSegmentAnnouncerTest.java | Updated tests to use NodeAnnouncer and reformatted multi-line JOINER calls. |
| server/src/test/java/org/apache/druid/curator/discovery/CuratorDruidNodeAnnouncerAndDiscoveryTest.java | Replaced Announcer instantiation with NodeAnnouncer. |
| server/src/test/java/org/apache/druid/curator/announcement/PathChildrenAnnouncerTest.java | Renamed test cases and updated instantiation to PathChildrenAnnouncer. |
| server/src/test/java/org/apache/druid/curator/announcement/NodeAnnouncerTest.java | Added new tests for NodeAnnouncer feature including update and session kill cases. |
| server/src/test/java/org/apache/druid/client/client/BatchServerInventoryViewTest.java | Updated to use NodeAnnouncer and refined thread pool executor creation. |
| server/src/main/java/org/apache/druid/server/coordination/CuratorDataSegmentServerAnnouncer.java | Changed injection from Announcer to ServiceAnnouncer. |
| server/src/main/java/org/apache/druid/server/coordination/BatchDataSegmentAnnouncer.java | Updated type references and dependency injection for the new ServiceAnnouncer. |
| server/src/main/java/org/apache/druid/guice/AnnouncerModule.java | Provided alternative bindings for single-threaded and direct executor announcers. |
| server/src/main/java/org/apache/druid/curator/discovery/CuratorDruidNodeAnnouncer.java | Updated constructor injection and logging to reflect the new ServiceAnnouncer. |
| server/src/main/java/org/apache/druid/curator/announcement/ServiceAnnouncer.java | Added new interface to abstract announcer behavior. |
| server/src/main/java/org/apache/druid/curator/announcement/PathChildrenAnnouncer.java | Refactored implementation and logging to support the PathChildrenAnnouncer functionality. |
| server/src/main/java/org/apache/druid/curator/announcement/Announceable.java | Moved the Announceable class out of the original Announcer to support reuse. |
| server/src/main/java/org/apache/druid/curator/CuratorConfig.java | Added new configuration property to switch between caching strategies. |
| docs/configuration/index.md & docs/api-reference/tasks-api.md | Updated documentation to explain and reflect the new caching strategy option. |
| Various indexing-service test files | Updated tests to use NodeAnnouncer and adjusted thread configuration for executor creation. |

@kgyrtkirk
Member

@GWphua you've a conflict with master; I think we could merge it after that's addressed!

GWphua commented May 6, 2025

@kgyrtkirk The conflicts have been addressed. Thanks for the heads-up!

@kgyrtkirk kgyrtkirk merged commit 228304e into apache:master May 8, 2025
74 checks passed
@kgyrtkirk
Member

thank you @GWphua for improving on this!

@GWphua GWphua deleted the zk-fix branch May 13, 2025 06:23
@capistrant capistrant added this to the 34.0.0 milestone Jul 22, 2025

maytasm commented Aug 20, 2025

@GWphua What version of zk are you running?

GWphua commented Aug 22, 2025

We are using 3.5.9

jtuglu1 commented Aug 22, 2025

> 3.5.9

Thanks @GWphua. What about Curator version? Same as OSS? Did you have to make any ZK/Curator-related upgrades/config changes to get things to work here?
We've seen some issues with the following combination:

Curator version = 5.8.0 (Druid v34 OSS version)
ZK server version = 3.5.8
ZK client version = 3.8.4 (Druid v34 OSS version)

GWphua commented Aug 22, 2025

Maybe the Curator/ZK version I'm using is too old:

<apache.curator.version>4.3.0</apache.curator.version>
<zookeeper.version>3.5.9</zookeeper.version>

What are the issues you are facing? Did you experience problems with both PathChildrenCache and NodeCache?

maytasm commented Aug 22, 2025

> Maybe the Curator/ZK version i'm using is too old:
>
> <apache.curator.version>4.3.0</apache.curator.version>
> <zookeeper.version>3.5.9</zookeeper.version>
>
> What are the issues you are facing + Did you experience problems in both PathChildrenCache + NodeCache?

Yes, we are experiencing issues with Curator versions greater than 5.0.0, which are used in the latest OSS (v34) Druid. We have encountered problems with both versions 5.5.0 and 5.8.0. This appears to be related to CURATOR-549, which was introduced in Curator 5.0.0 and later. Curator is attempting to use a new feature that is only available in ZooKeeper server version 3.6 or higher. Since we are running ZooKeeper server version 3.5.8, this causes Druid to fail to connect to ZooKeeper. As a result, we are seeing continuous SUSPEND/RECONNECT events, with the ZooKeeper server closing the connection.

Note that this issue only occurs with NodeCache, as that is where Curator attempts to use the new feature. PathChildrenCache does not have this problem because it does not use the new ZooKeeper feature code path.

We suspect that this new feature Curator is trying to use is related to persistent watchers/CuratorCache. Seems related to https://lists.apache.org/thread/nl7zrzgyfp2b5wxdkrovk0yhqfto9yl7

So, I think to use NodeCache, you would either have to be on Curator <5.0.0 or running ZK Server >=3.6. Although, I think Curator has some critical fixes in 5.x.x too.

maytasm commented Aug 22, 2025

@GWphua By the way, related to the huge number of watches: have you tried setting druid.announcer.skipSegmentAnnouncementOnZk to true and using HTTP for segment discovery (druid.serverview.type=http)? HTTP for segment discovery has been the default since Druid v25 (#13592 (comment)).

GWphua commented Aug 25, 2025

Hey @maytasm, we did not configure this setting in our clusters. However, we can take a look at whether it helps us with our use-case. Thanks 😄

GWphua commented Aug 25, 2025

> Note that this issue only occurs with NodeCache, as that is where Curator attempts to use the new feature. PathChildrenCache does not have this problem because it does not use the new ZooKeeper feature code path.
>
> We suspect that this new feature Curator is trying to use is related to persistent watchers/CuratorCache. Seems related to https://lists.apache.org/thread/nl7zrzgyfp2b5wxdkrovk0yhqfto9yl7

Not too confident about this, but I feel the problem may be because of upgrading the deprecated NodeCache to CuratorCache. Should Curator remove the deprecated PathChildrenCache in the future, we may be forced to:

  1. Stay at a lower version
  2. Upgrade PathChildrenCache to CuratorCache (Which may cause similar problems?)
  3. Move away from ZooKeeper

Development

Successfully merging this pull request may close these issues.

huge number of watch in zookeeper cause zookeeper full gc

8 participants