Conversation

Contributor

@heesung-sohn heesung-sohn commented Jan 27, 2025

Fixes #23889

Motivation

Fixes #23889
zk.put creates a persistent znode even though the ephemeral node creation option is passed.

Modifications

  • pass the creation option when set fails (if the node already exists)
  • fix zk stat ephemeral node check

ephemeralOwner The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.

ref : https://zookeeper.apache.org/doc/r3.5.5/zookeeperProgrammers.html

  • delete the lock if the returned result is non-ephemeral
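
The ephemeral check described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `EphemeralPutSketch`, `Node`, and this `CreateOption` enum are hypothetical stand-ins, not Pulsar or ZooKeeper classes, and the real put path in the metadata store is more involved. The only fact taken from the ZooKeeper documentation is that `ephemeralOwner` is the owning session id for an ephemeral znode and zero otherwise.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Toy in-memory model. None of these types are Pulsar or ZooKeeper classes.
class EphemeralPutSketch {
    enum CreateOption { Ephemeral }

    static final class Node {
        final byte[] value;
        // Mirrors ZooKeeper's ephemeralOwner: owning session id, or 0 if persistent.
        final long ephemeralOwner;

        Node(byte[] value, long ephemeralOwner) {
            this.value = value;
            this.ephemeralOwner = ephemeralOwner;
        }

        // A node is ephemeral exactly when its owner is non-zero.
        boolean isEphemeral() {
            return ephemeralOwner != 0L;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();
    private final long sessionId; // assumed to be non-zero

    EphemeralPutSketch(long sessionId) {
        this.sessionId = sessionId;
    }

    // Update the node if it exists; otherwise create it, honoring the
    // caller's options so that the ephemeral flag is not silently dropped.
    Node put(String path, byte[] value, EnumSet<CreateOption> options) {
        Node existing = nodes.get(path);
        long owner;
        if (existing != null) {
            owner = existing.ephemeralOwner; // an update keeps the current owner
        } else {
            owner = options.contains(CreateOption.Ephemeral) ? sessionId : 0L;
        }
        Node node = new Node(value, owner);
        nodes.put(path, node);
        return node;
    }

    public static void main(String[] args) {
        EphemeralPutSketch store = new EphemeralPutSketch(42L);
        Node lock = store.put("/lock", new byte[0], EnumSet.of(CreateOption.Ephemeral));
        Node plain = store.put("/plain", new byte[0], EnumSet.noneOf(CreateOption.class));
        System.out.println(lock.isEphemeral() + " " + plain.isEphemeral());
    }
}
```

In this model, dropping the options on the create path would produce a node whose owner is zero, which is the symptom reported in the linked issue.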

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 27, 2025
Member

@lhotari lhotari left a comment

LGTM, great work @heesung-sn

@lhotari lhotari changed the title from "[fix][metadata] fixed ephemeral zk put" to "[fix][meta] Fix ephemeral zk put" Jan 27, 2025
        CompletableFuture<Void> result = new CompletableFuture<>();
        store.put(path, payload, Optional.of(version), EnumSet.of(CreateOption.Ephemeral))
                .thenAccept(stat -> {
                    if (!stat.isEphemeral()) {
Contributor Author

@heesung-sohn heesung-sohn Jan 28, 2025

I hit this error in testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle, where the lock suddenly became persistent after invalidating the cache and updating the lock value with null.

It is a bit surprising to me that an ephemeral lock can suddenly become persistent.

        cache.invalidateLocalOwnerCache();
        final var lock = pulsar.getCoordinationService().getLockManager(NamespaceEphemeralData.class)
                .acquireLock(ServiceUnitUtils.path(bundle), new NamespaceEphemeralData()).join();
        lock.updateValue(null);
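
The guard under review (checking the returned stat and cleaning up when the node is not ephemeral) can be sketched as follows. This is a minimal sketch, not the code in ResourceLockImpl: the `Store` interface and its `putEphemeral` and `delete` methods are hypothetical and only stand in for a metadata store whose put reports whether the resulting node is ephemeral.

```java
import java.util.concurrent.CompletableFuture;

class EphemeralGuardSketch {
    // Hypothetical minimal store; not the Pulsar MetadataStore API.
    interface Store {
        // Completes with true if the written node is ephemeral.
        CompletableFuture<Boolean> putEphemeral(String path, byte[] value);

        CompletableFuture<Void> delete(String path);
    }

    // Write the lock node, then verify that it really is ephemeral.
    // If it is not, delete it and fail the returned future.
    static CompletableFuture<Void> acquire(Store store, String path, byte[] value) {
        return store.putEphemeral(path, value).thenCompose(isEphemeral -> {
            if (isEphemeral) {
                CompletableFuture<Void> ok = CompletableFuture.completedFuture(null);
                return ok;
            }
            // A persistent lock node would outlive the session, so remove it.
            CompletableFuture<Void> failed = store.delete(path).thenApply(ignored -> {
                throw new IllegalStateException("Lock at " + path + " is not ephemeral");
            });
            return failed;
        });
    }

    public static void main(String[] args) {
        // A store that always hands back a persistent node, to exercise the guard.
        Store broken = new Store() {
            @Override
            public CompletableFuture<Boolean> putEphemeral(String path, byte[] value) {
                return CompletableFuture.completedFuture(false);
            }

            @Override
            public CompletableFuture<Void> delete(String path) {
                System.out.println("deleted " + path);
                return CompletableFuture.completedFuture(null);
            }
        };
        CompletableFuture<Void> result = acquire(broken, "/lock", new byte[0]);
        System.out.println("failed: " + result.isCompletedExceptionally());
    }
}
```

Whether such a guard is still needed once the store and the mock report `ephemeralOwner` correctly is one of the questions raised later in this thread.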

Member

I wonder if this problem is caused by org.apache.zookeeper.MockZooKeeper? org.apache.pulsar.zookeeper.ZookeeperServerTest would provide a real ZooKeeper server. It would be great to have support directly in org.apache.pulsar.broker.testcontext.PulsarTestContext for using a real ZooKeeper server. It's not that hard to implement in a similar way to withMockZookeeper():

    /**
     * Configure this PulsarTestContext to use a mock ZooKeeper instance which is
     * shared for both the local and configuration metadata stores.
     *
     * @return the builder
     */
    public Builder withMockZookeeper() {
        return withMockZookeeper(false);
    }

    /**
     * Configure this PulsarTestContext to use a mock ZooKeeper instance.
     *
     * @param useSeparateGlobalZk if true, the global (configuration) zookeeper will be a separate instance
     * @return the builder
     */
    public Builder withMockZookeeper(boolean useSeparateGlobalZk) {
        try {
            mockZooKeeper(createMockZooKeeper());
            if (useSeparateGlobalZk) {
                mockZooKeeperGlobal(createMockZooKeeper());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return this;
    }

    private MockZooKeeper createMockZooKeeper() throws Exception {
        MockZooKeeper zk = MockZooKeeper.newInstance(MoreExecutors.newDirectExecutorService());
        List<ACL> dummyAclList = new ArrayList<>(0);
        ZkUtils.createFullPathOptimistic(zk, "/ledgers/available/192.168.1.1:" + 5000,
                "".getBytes(StandardCharsets.UTF_8), dummyAclList, CreateMode.PERSISTENT);
        zk.create("/ledgers/LAYOUT", "1\nflat:1".getBytes(StandardCharsets.UTF_8), dummyAclList,
                CreateMode.PERSISTENT);
        registerCloseable(zk::shutdown);
        return zk;
    }

Adding a similar withRealZookeeper() (using org.apache.pulsar.zookeeper.ZookeeperServerTest) would be useful.

Member

This rings a bell; I found #13066. There are probably more changes required to implement it properly in MockZooKeeper.

Member

There's https://github.com/apache/pulsar/blob/master/pulsar-broker/src/test/java/org/apache/pulsar/broker/MultiBrokerTestZKBaseTest.java, which is currently one way to use a real ZooKeeper in tests. However, adding direct support to PulsarTestContext, and making it easy to choose it by overriding a protected method in MockedPulsarServiceBaseTest, would be more flexible.

Member

@heesung-sn I pushed some changes to this PR so that MockZooKeeperSession/MockZooKeeper support the ephemeral owner. The value for persistent nodes is now 0. Support was missing completely for multi-ops. In the mock ZooKeeper solution, the session id gets passed in a thread local since there is no other way.
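
The thread-local approach mentioned above can be illustrated with a small standalone sketch. It is not the MockZooKeeper or MockZooKeeperSession code: the class `SessionThreadLocalSketch` and its methods `withSession`, `create`, and `ephemeralOwner` are hypothetical names. The idea it shows is that a shared mock has no session parameter on its operations, so a per-session wrapper sets the session id on the current thread before delegating.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical shared mock; one instance serves several sessions.
class SessionThreadLocalSketch {
    // Session id of the caller running on this thread; 0 means "no session".
    private static final ThreadLocal<Long> CURRENT_SESSION = ThreadLocal.withInitial(() -> 0L);

    // path -> ephemeralOwner (0 for persistent nodes)
    private final Map<String, Long> owners = new ConcurrentHashMap<>();

    // A per-session wrapper would call this around every delegated operation.
    <T> T withSession(long sessionId, Supplier<T> call) {
        long previous = CURRENT_SESSION.get();
        CURRENT_SESSION.set(sessionId);
        try {
            return call.get();
        } finally {
            CURRENT_SESSION.set(previous); // restore, so nested calls behave
        }
    }

    // The operation itself takes no session argument; it reads the thread local.
    long create(String path, boolean ephemeral) {
        long owner = ephemeral ? CURRENT_SESSION.get() : 0L;
        owners.put(path, owner);
        return owner;
    }

    long ephemeralOwner(String path) {
        return owners.getOrDefault(path, 0L);
    }

    public static void main(String[] args) {
        SessionThreadLocalSketch mock = new SessionThreadLocalSketch();
        mock.withSession(7L, () -> mock.create("/lock", true));
        mock.withSession(7L, () -> mock.create("/config", false));
        System.out.println(mock.ephemeralOwner("/lock") + " " + mock.ephemeralOwner("/config"));
    }
}
```

One limitation of this sketch is that an ephemeral create issued outside `withSession` records owner 0 and therefore looks persistent, so every entry point has to go through the wrapper.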

Contributor

I tested the testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle method with a real ZooKeeper before this fix, and the result was the same as after this fix; future.get() always managed to get a result, causing the unit test to fail. It appears that the testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle method previously relied on a flaw in MockZooKeeper, which incidentally allowed it to pass.

Next, should we:

  1. Optimize the testLookupConnectionNotCloseIfFailedToAcquireOwnershipOfBundle method to ensure its flow can pass and guarantee the code merge.
  2. Remove the part of the code in ResourceLockImpl.java that actively deletes the ZooKeeper node, as it is redundant. After fixing MockZooKeeper or using a real ZooKeeper, this piece of code will no longer be needed.
  3. Add support for withRealZookeeper in PulsarTestContext.

Member

👆@heesung-sn do you have a chance to check what @Joforde suggested?

Contributor Author

I tried option 2 and still see some test failures.

Contributor Author

Planning to get back to this work on Thursday this week.

Member

@lhotari lhotari Feb 11, 2025

@heesung-sn I added .withTestZookeeper support (it uses TestZKServer, which is a real ZooKeeper) to PulsarTestContext, along with an easy way to use it in tests; that's in commit 17ec37a. In 083d3ed I modified BrokerServiceLookupTest to use it so that it runs with both the mock and the real ZooKeeper. That's useful for comparing the results and catching problems when using a real ZooKeeper. It seems that there's a bigger mess to fix.

@lhotari lhotari changed the title from "[fix][meta] Fix ephemeral zk put" to "[fix][meta] Fix ephemeral Zookeeper put which creates a persistent znode" Feb 11, 2025
@lhotari lhotari added the triage/lhotari/important lhotari's triaging label for important issues or PRs label Feb 11, 2025
Member

lhotari commented Feb 14, 2025

@heesung-sn I'm closing this PR since it is superseded by #23984. I'll push a new PR with the remaining fixes, including fixes to MockZooKeeper so that it properly supports ephemeral nodes and properly returns ZK's stat information. That appears to be broken in MockZooKeeper, along with many other details.

@lhotari lhotari closed this Feb 14, 2025
Member

lhotari commented Feb 14, 2025

PR to fix remaining issues: #23988

Labels

  • doc-not-needed: Your PR changes do not impact docs
  • release/3.0.10
  • release/3.3.5
  • release/4.0.3
  • triage/lhotari/important: lhotari's triaging label for important issues or PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Metadata information is not cleaned when broker exits abnormally

3 participants