Skip to content

Conversation

@gaozhangmin
Copy link
Contributor

@gaozhangmin gaozhangmin commented Aug 13, 2025

Motivation

I found there is problem related to bk's zktimeout config, the default bk zktimeout of bk is 10s, but
We caculate tickTimeMillis = zktimeout/15.

this.tickTimeMillis = zk.getSessionTimeout() / 15;

In below future.get, always timeout.

try {
zkClientState = future.get(tickTimeMillis, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
// Consider zk disconnection if zk operation takes more than TICK_TIME
zkClientState = Watcher.Event.KeeperState.Disconnected;

We should support set bk zktimeut by
bkConf.setZkTimeout((int) conf.getMetadataStoreSessionTimeoutMillis());

Log

2025-08-13T00:14:32,770+0800 [metadata-store-zk-session-watcher-38-1] ^[[33mWARN ^[[m org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:115) - ^[[33mZooKeeper operation timed out, xxxxx:2182/bk-clusters/xxxxx-2000-30000^[[m
2025-08-13T00:14:32,770+0800 [metadata-store-zk-session-watcher-19-1] ^[[33mWARN ^[[m org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:115) - ^[[33mZooKeeper operation timed out, xxxxx:2182/bk-clusters/xxxxxxx-2000-30000^[[m
2025-08-13T00:14:32,771+0800 [metadata-store-zk-session-watcher-19-1] ^[[33mWARN ^[[m org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkState(ZKSessionWatcher.java:161) - ^[[33m[73863839042194267] ZooKeeper client is disconnected. Waiting to reconnect, time remaining = 25.0 seconds^[[m
2025-08-13T00:14:32,771+0800 [metadata-store-zk-session-watcher-38-1] ^[[33mWARN ^[[m org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkState(ZKSessionWatcher.java:161) - ^[[33m[73863839042194268] ZooKeeper client is disconnected. Waiting to reconnect, time remaining = 25.0 seconds^[[m
2025-08-13T00:14:34,772+0800 [metadata-store-zk-session-watcher-38-1] ^[[32mINFO ^[[m org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkState(ZKSessionWatcher.java:173) - ^[[32mZooKeeper client reconnection with server quorum. Current status: ConnectionLost^[[m
2025-08-13T00:14:34,772+0800 [metadata-store-zk-session-watcher-19-1] ^[[32mINFO ^[[m org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkState(ZKSessionWatcher.java:173) - ^[[32mZooKeeper client reconnection with server quorum. Current status: ConnectionLost^[[m

Modifications

bkConf.setZkTimeout((int) conf.getMetadataStoreSessionTimeoutMillis());

Verifying this change

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@gaozhangmin gaozhangmin self-assigned this Aug 13, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Aug 13, 2025
@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.32%. Comparing base (bbf56f6) to head (43e970d).

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24624      +/-   ##
============================================
+ Coverage     74.23%   74.32%   +0.09%     
+ Complexity    33179    33177       -2     
============================================
  Files          1881     1881              
  Lines        146850   146851       +1     
  Branches      16866    16866              
============================================
+ Hits         109019   109153     +134     
+ Misses        29191    29027     -164     
- Partials       8640     8671      +31     
Flag Coverage Δ
inttests 26.78% <100.00%> (?)
systests 23.29% <100.00%> (-0.04%) ⬇️
unittests 73.82% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...che/pulsar/broker/BookKeeperClientFactoryImpl.java 80.18% <100.00%> (+0.18%) ⬆️

... and 90 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The zk timeout setting would only matter when bookkeeperMetadataServiceUri is set.

@lhotari lhotari changed the title [fix][broker] fix CheckConnectionStatus disconnected [fix][broker] Pass metadataStoreSessionTimeoutMillis to bookkeeper client configuration to properly configure ZK timeout Aug 27, 2025
@lhotari
Copy link
Member

lhotari commented Aug 27, 2025

@gaozhangmin During a test related PR #24171 I noticed this about the same code location:

try {
zkClientState = future.get(tickTimeMillis, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
// Consider zk disconnection if zk operation takes more than TICK_TIME
zkClientState = Watcher.Event.KeeperState.Disconnected;
}
checkState(zkClientState);

This will trigger a ConnectionLost session event. That's why a delay should be kept under 2000 ms.
Currently MockZooKeeper's session timeout is hard coded to 30000ms and the check interval is 1/15 of this, therefore 2000ms.

When there's a test that blocks the ZooKeeper thread for more than 2000ms, it will result in a ConnectionLost event.
This PR makes the session timeout configurable and it's possible to increase the session timeout for tests where it's necessary to use a longer session timeout to avoid ConnectionLost session events.

This applies likewise to production code. This is problematic since the effective timeout becomes 1/15 of the zk session timeout for Pulsar operations.

Together with this change made in #23018 it will add significant load to ZooKeepers:

// Clear cache after session expired.
if (event == SessionEvent.SessionReestablished || event == SessionEvent.Reconnected) {
for (MetadataCacheImpl metadataCache : metadataCaches) {
metadataCache.invalidateAll();
}
invalidateAll();
}

I'll file a new issue about resolving this problem.

@lhotari
Copy link
Member

lhotari commented Aug 27, 2025

Issue about ZKSessionWatcher is #24674

@Technoboy- Technoboy- merged commit 0903f45 into apache:master Sep 1, 2025
98 of 100 checks passed
@Technoboy- Technoboy- added this to the 4.2.0 milestone Sep 1, 2025
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
…ient configuration to properly configure ZK timeout (apache#24624)
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
…ient configuration to properly configure ZK timeout (apache#24624)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs ready-to-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants