[fix][broker] Fix delete namespace fail by a In-flight topic #19374
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractTopic.java
pulsar-common/src/main/java/org/apache/pulsar/common/naming/SystemTopicNames.java
pulsar-broker/src/test/java/org/apache/pulsar/broker/admin/AdminApi2Test.java
This patch is related. @dlg99 @aymkhalil PTAL
I am not sure that this holds: "Always assume the existence of an event topic"
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractTopic.java
pulsar-common/src/main/java/org/apache/pulsar/common/naming/SystemTopicNames.java
@mattisonchao can this be done slightly differently: There is still a chance that a topic creation passes the check before the ns deletion starts and completes after the ns is marked deleted. I don't think we can solve it easily; this will require either distributed locking or serialization of ns/topic metadata operations, e.g. via a global system topic.
As you said, a topic can still hit the race condition.
We don't need to solve this for user-created topics, but we should make sure the system topics are cleaned up.
@mattisonchao what's happening in this PR as I understand:
This does not sound like a comprehensive solution (can't topic creation happen after p.2 and still cause namespace deletion to fail?), plus it breaks the contract for handling of ns.deleted, as @eolivelli noted. On another note, events topic is being deleted using I'd leave the ns.deleted handling logic as it is and set ns.deleted for the namespace earlier, before actually listing topics (seems to be the right thing to do logically).
Yes, that is what I'm changing right now. I would move the
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractTopic.java
pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/NamespacesBase.java
@eolivelli @dlg99 Please review it again. I've changed the logic to retry the deletion so that in-flight topics also get deleted.
    RetryUtil.retryAsynchronously(() -> internalDeleteNamespaceAsync0(force),
            new BackoffBuilder()
                    .setInitialTime(200, TimeUnit.MILLISECONDS)
                    .setMandatoryStop(15, TimeUnit.SECONDS)
I don't think it is good to set a time-based limit this short, because a namespace may have thousands of topics and deleting them may really take more than 15 seconds.
We should set a value on the order of minutes (10 minutes?),
and we should log something at every attempt, so that it is clear from the logs that something is going on.
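To make the suggestion above concrete, here is a minimal sketch of an asynchronous retry loop with capped exponential backoff that logs every failed attempt. It does not use Pulsar's `RetryUtil` or `BackoffBuilder`; the class `RetrySketch`, its `retry` method, and all parameter names are hypothetical and exist only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Pulsar: retries an async task with capped exponential backoff. */
public final class RetrySketch {
    private RetrySketch() {}

    /** Runs the task until it succeeds or maxAttempts attempts have failed. */
    public static <T> CompletableFuture<T> retry(Supplier<CompletableFuture<T>> task,
                                                 int maxAttempts,
                                                 long initialDelayMs,
                                                 long maxDelayMs,
                                                 ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(task, 1, maxAttempts, initialDelayMs, maxDelayMs, scheduler, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> task, int attemptNo, int maxAttempts,
                                    long delayMs, long maxDelayMs,
                                    ScheduledExecutorService scheduler, CompletableFuture<T> result) {
        CompletableFuture<T> future;
        try {
            future = task.get();
        } catch (Throwable t) {
            // Treat a synchronous failure like a failed future.
            future = new CompletableFuture<>();
            future.completeExceptionally(t);
        }
        future.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptNo >= maxAttempts) {
                // Out of attempts: surface the last error to the caller.
                result.completeExceptionally(error);
            } else {
                // Log every failed attempt so operators can see progress.
                System.err.printf("attempt %d/%d failed (%s), retrying in %d ms%n",
                        attemptNo, maxAttempts, error, delayMs);
                scheduler.schedule(
                        () -> attempt(task, attemptNo + 1, maxAttempts,
                                Math.min(delayMs * 2, maxDelayMs), maxDelayMs, scheduler, result),
                        delayMs, TimeUnit.MILLISECONDS);
            }
        });
    }
}
```

Bounding the loop by attempt count and per-attempt delay, rather than by a short wall-clock stop, leaves room for namespaces with many topics; the actual limits would have to be chosen for the deployment.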
What happens if the user gives up waiting and then issues the same command again while the backoff is still running? Ideally everything should work well, and the second execution should wait for the previous execution to complete. The user expects that when the command completes, the namespace is deleted.
Hi, @eolivelli
I am trying to use a retry count instead of a backoff.
what happens if the user gives up waiting and then it issues again the same command while the backoff is still running ? Ideally everything should work well and the second execution should wait for the previous execution to complete. The expectation for the user is that when the command completes the namespace is deleted.
For this problem, we could guard operations of the same kind with a distributed lock so that the same operation is never run concurrently. I can start a discussion on the mailing list.
Distributed locks are very expensive, and in any case you will have to deal with timeouts.
We should make the operations idempotent or chain them.
If a new operation arrives while the deletion is already in progress, it must wait for the result of the pending operation.
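One way to read the "chain them" suggestion is to keep a map of in-flight deletions and hand the pending future to any later caller. The sketch below assumes all requests for a namespace reach the same process, which may not hold across brokers; the class `PendingDeletions` and its `delete` method are hypothetical and are not Pulsar APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Pulsar: coalesces concurrent deletions of the same namespace. */
public final class PendingDeletions {
    private final ConcurrentMap<String, CompletableFuture<Void>> inFlight = new ConcurrentHashMap<>();

    /**
     * Starts the deletion unless one is already pending for this namespace,
     * in which case the caller receives the pending operation's future.
     */
    public CompletableFuture<Void> delete(String namespace, Supplier<CompletableFuture<Void>> deletion) {
        CompletableFuture<Void> shared = new CompletableFuture<>();
        CompletableFuture<Void> existing = inFlight.putIfAbsent(namespace, shared);
        if (existing != null) {
            return existing; // wait for the result of the pending operation
        }
        // Register cleanup first so the entry is removed even if the deletion completes synchronously.
        shared.whenComplete((ignored, error) -> inFlight.remove(namespace, shared));
        try {
            deletion.get().whenComplete((ignored, error) -> {
                if (error != null) {
                    shared.completeExceptionally(error);
                } else {
                    shared.complete(null);
                }
            });
        } catch (Throwable t) {
            shared.completeExceptionally(t);
        }
        return shared;
    }
}
```

Because the entry is removed once the shared future completes, a later request starts a fresh deletion instead of reusing a stale result.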
dlg99
left a comment
LGTM assuming you address Enrico's comments + CI passes (currently "You have 1 Checkstyle violation")
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java
…ersistent/PersistentTopic.java
pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/NamespacesBase.java
eolivelli
left a comment
The retry logic is broken.
Please check my comments.
pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/NamespacesBase.java
pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/NamespacesBase.java
eolivelli
left a comment
LGTM
eolivelli
left a comment
LGTM
Fixes #19083
Fixes #11866
Motivation
Currently, the force delete namespace operation has the following steps:
deleted to avoid creating the topic again by client lookup or reconnect.
A race condition can leave step 2 unable to list all of the topics, e.g.:
deleted check
deleted
Directory not empty for /managed-ledgers/test-tenant/test-ns2/persistent
Modifications
namespacePolicies.deleted logic to ensure the new topic will not be deleted by PersistentTopic#checkReplication.
Verifying this change
Documentation
doc
doc-required
doc-not-needed
doc-complete