Conversation

@mattisonchao (Member) commented Jan 31, 2023

Fixes #19083
Fixes #11866

Motivation

Currently, the force-delete namespace operation performs the following steps (sketched in code below):

  1. Do pre-checks
  2. Get the full list of topics
  3. Mark the namespace as deleted so that client lookups or reconnects cannot re-create topics
  4. Delete all user-created topics
  5. Delete all system topics
  6. Delete the namespace event topics
  7. Clean up namespace metadata and resources
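
For orientation, here is a minimal sketch of that flow; every helper name below (precheck, listAllTopics, markNamespaceDeleted, and so on) is a hypothetical stand-in for the broker internals, not the actual Pulsar API:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class ForceDeleteFlow {
    CompletableFuture<Void> forceDeleteNamespace(String ns) {
        return precheck(ns)                                    // step 1: pre-checks
            .thenCompose(v -> listAllTopics(ns))               // step 2: full topic list
            .thenCompose(topics -> markNamespaceDeleted(ns)    // step 3: block re-creation
                .thenCompose(v -> deleteUserTopics(topics))    // step 4
                .thenCompose(v -> deleteSystemTopics(ns))      // step 5
                .thenCompose(v -> deleteEventTopics(ns)))      // step 6
            .thenCompose(v -> cleanupMetadata(ns));            // step 7: drop metadata
    }

    // Stubs so the sketch compiles; the real implementations live in the broker.
    CompletableFuture<Void> precheck(String ns) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<List<String>> listAllTopics(String ns) { return CompletableFuture.completedFuture(List.of()); }
    CompletableFuture<Void> markNamespaceDeleted(String ns) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<Void> deleteUserTopics(List<String> topics) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<Void> deleteSystemTopics(String ns) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<Void> deleteEventTopics(String ns) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<Void> cleanupMetadata(String ns) { return CompletableFuture.completedFuture(null); }
}
```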

A race condition can leave step 2 with an incomplete topic list, e.g.:

| Time | Thread A | Thread B |
|------|----------|----------|
| 1 | Gets the full list of topics | Tries to create a new topic and passes the deleted check |
| 2 | Marks the namespace deleted | Does the other checks |
| 3 | Does the rest of steps 4, 5, 6 | Creates the managed ledger and persists its info to metadata |
| 4 | Step 7 fails with `Directory not empty for /managed-ledgers/test-tenant/test-ns2/persistent` | Does the rest of the work |
| 5 | Returns the exception | Topic created |

Modifications

The stray deletion comes from `PersistentTopic#checkReplication`, which logs:

> Deleting topic xxxx because local cluster is not part of global namespace repl list

  • Remove the namespacePolicies.deleted logic to ensure new topics are not deleted by `PersistentTopic#checkReplication`.
  • Add a retry to handle topics whose metadata creation was still in flight during the deletion.
  • Add a test case verifying that no topic object is left behind.

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 31, 2023
@mattisonchao mattisonchao requested a review from Jason918 January 31, 2023 14:04
@mattisonchao mattisonchao self-assigned this Jan 31, 2023
@mattisonchao mattisonchao added this to the 3.0.0 milestone Jan 31, 2023
@eolivelli (Contributor) commented:

This patch is related
#19097

@dlg99 @aymkhalil PTAL

@eolivelli (Contributor) commented:

I am not sure that this assumption holds: "Always assume the existence of an event topic".

@dlg99 (Contributor) commented Jan 31, 2023

@mattisonchao can this be done slightly differently:
swap the order of steps 2 and 3 (mark deleted first, do the rest, unmark on error)?
In combo with #19097 this should work well enough.

There is still a chance that a topic creation passed the check before the ns deletion started and completed after the ns was marked deleted. I don't think we can solve that easily; it would require either distributed locking or serialization of ns/topic metadata operations, e.g. via a global system topic.
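
For illustration, a minimal sketch of that mark-first/unmark-on-error ordering (helper names are hypothetical stand-ins; `exceptionallyCompose` needs Java 12+):

```java
import java.util.concurrent.CompletableFuture;

class MarkFirstDeletion {
    // Mark the namespace deleted before listing/deleting topics, and roll the
    // marker back if anything downstream fails.
    CompletableFuture<Void> deleteWithMarkFirst(String ns) {
        return markNamespaceDeleted(ns)                       // step 3 moved first
            .thenCompose(v -> deleteAllTopicsAndMetadata(ns)) // then steps 2 and 4-7
            .exceptionallyCompose(ex -> unmarkNamespaceDeleted(ns)
                .thenCompose(v -> CompletableFuture.failedFuture(ex)));
    }

    // Hypothetical stubs standing in for the broker internals.
    CompletableFuture<Void> markNamespaceDeleted(String ns) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<Void> unmarkNamespaceDeleted(String ns) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<Void> deleteAllTopicsAndMetadata(String ns) { return CompletableFuture.completedFuture(null); }
}
```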

@mattisonchao (Member, Author) replied to @dlg99:

> swap the order of steps 2 and 3 (mark deleted first, do the rest, unmark on error)?
> In combo with #19097 this should work well enough.
>
> There is still a chance that a topic creation passed the check before the ns deletion started and completed after the ns was marked deleted. I don't think we can solve that easily; it would require either distributed locking or serialization of ns/topic metadata operations, e.g. via a global system topic.

As you said, the topic still has a chance to hit the race condition.

> I don't think we can solve that easily

We don't need to solve this for user-created topics, but we should make sure the system topics are cleaned up.

@dlg99 (Contributor) commented Feb 1, 2023

@mattisonchao here is what's happening in this PR, as I understand it:

  1. Delete all topics + all system topics except for the events one.
  2. Then delete the events topic.
  3. But that brings in some problems with handling of the ns.deleted metadata (because of RPC?), so that logic is removed in some places.

This does not sound like a comprehensive solution (can't a topic creation happen after step 2 and still cause the namespace deletion to fail?), plus it breaks the contract for handling ns.deleted, as @eolivelli noted.

On another note, the events topic is deleted using admin.topics().deletePartitionedTopicAsync, while for other partitioned topics we use namespaceResources().getPartitionedTopicResources().deletePartitionedTopicAsync (in internalDeletePartitionedTopicsAsync). Is that done on purpose?

I'd leave the ns.deleted handling logic as it is, and set ns.deleted for the namespace earlier, before actually listing topics (that seems to be the logically right thing to do).
Do the rest of the steps with retries/backoff - since the ns is marked as deleted, topic creations should settle at some point. List topics again and delete again, then delete the namespace metadata.
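
For illustration, a minimal sketch of that "list again and delete again" drain loop (all helper names are hypothetical stand-ins, not broker API; once ns.deleted is set, new creations stop, so the list eventually drains):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class DrainAndDelete {
    // Re-list and re-delete until the namespace is empty, then drop its metadata.
    CompletableFuture<Void> deleteRemaining(String ns, int attemptsLeft) {
        return listAllTopics(ns).thenCompose(topics -> {
            if (topics.isEmpty()) {
                return cleanupMetadata(ns);              // nothing left behind
            }
            if (attemptsLeft <= 0) {
                return CompletableFuture.failedFuture(
                        new IllegalStateException("topics still present after retries"));
            }
            return deleteTopics(topics)
                    .thenCompose(v -> deleteRemaining(ns, attemptsLeft - 1));
        });
    }

    // Hypothetical stubs for the broker internals.
    CompletableFuture<List<String>> listAllTopics(String ns) { return CompletableFuture.completedFuture(List.of()); }
    CompletableFuture<Void> deleteTopics(List<String> topics) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<Void> cleanupMetadata(String ns) { return CompletableFuture.completedFuture(null); }
}
```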

@mattisonchao (Member, Author) commented Feb 1, 2023

> I'd leave the ns.deleted handling logic as it is, and set ns.deleted for the namespace earlier, before actually listing topics (that seems to be the logically right thing to do).
> Do the rest of the steps with retries/backoff - since the ns is marked as deleted, topic creations should settle at some point. List topics again and delete again, then delete the namespace metadata.

Yes, that is what I'm changing right now. I will move the markDelete before listing topics and handle the duplicate-deletion exception.
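
Handling the duplicate-deletion exception essentially means treating "already deleted" as success, as in this sketch (TopicNotFoundException is a hypothetical placeholder for whatever the broker actually throws):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class IdempotentDelete {
    static class TopicNotFoundException extends RuntimeException { }

    // A second delete of the same topic succeeds instead of failing the
    // whole namespace deletion.
    CompletableFuture<Void> deleteTopicIdempotent(String topic) {
        return deleteTopic(topic).exceptionallyCompose(ex -> {
            Throwable cause = ex instanceof CompletionException ? ex.getCause() : ex;
            if (cause instanceof TopicNotFoundException) {
                return CompletableFuture.completedFuture(null); // already gone: fine
            }
            return CompletableFuture.failedFuture(ex);
        });
    }

    // Hypothetical stub for the broker's per-topic delete.
    CompletableFuture<Void> deleteTopic(String topic) { return CompletableFuture.completedFuture(null); }
}
```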

@mattisonchao mattisonchao marked this pull request as draft February 1, 2023 01:08
@mattisonchao mattisonchao marked this pull request as ready for review February 1, 2023 09:42
@mattisonchao (Member, Author) commented Feb 1, 2023

@eolivelli @dlg99 Please review again. I've changed the logic to retry, so that in-flight topics get deleted.

@mattisonchao mattisonchao changed the title [fix][broker] Fix delete namespace fail by a race condition [fix][broker] Fix delete namespace fail by a In-flight topic Feb 1, 2023
@mattisonchao mattisonchao reopened this Feb 1, 2023
```java
RetryUtil.retryAsynchronously(() -> internalDeleteNamespaceAsync0(force),
        new BackoffBuilder()
                .setInitialTime(200, TimeUnit.MILLISECONDS)
                .setMandatoryStop(15, TimeUnit.SECONDS)
```
@eolivelli (Contributor) commented on the diff:

I don't think it is good to set a time-based limit, also because a namespace may have thousands of topics, and deletion may really take more than 15 seconds.

We should set a value on the order of minutes (10 minutes?), and we should log something at every attempt; that way it will be clear from the logs that something is going on.
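
A sketch of what that could look like: a deadline on the order of minutes plus a log line per attempt (retryWithDeadline is a hypothetical wrapper, not an existing Pulsar utility):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LoggedRetry {
    private static final Logger log = LoggerFactory.getLogger(LoggedRetry.class);

    // Retry until the deadline (e.g. now + 10 minutes), logging every attempt
    // so operators can see the deletion is still making progress. A real
    // implementation would also pause between attempts.
    static <T> CompletableFuture<T> retryWithDeadline(Supplier<CompletableFuture<T>> op,
                                                      int attempt, long deadlineMillis) {
        log.info("Namespace deletion attempt #{}", attempt);
        return op.get().exceptionallyCompose(ex -> {
            if (System.currentTimeMillis() >= deadlineMillis) {
                return CompletableFuture.failedFuture(ex); // give up past the deadline
            }
            return retryWithDeadline(op, attempt + 1, deadlineMillis);
        });
    }
}
```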

@eolivelli (Contributor) commented:

What happens if the user gives up waiting and then issues the same command again while the backoff is still running? Ideally everything should work well and the second execution should wait for the previous execution to complete. The user's expectation is that when the command completes, the namespace is deleted.

@mattisonchao (Member, Author) replied:

Hi @eolivelli,
I am trying to use a retry count instead of the time-based backoff.

> What happens if the user gives up waiting and then issues the same command again while the backoff is still running? Ideally everything should work well and the second execution should wait for the previous execution to complete. The user's expectation is that when the command completes, the namespace is deleted.

For this problem, we could guard the operation with a distributed lock to avoid running the same operation concurrently. I can bring the discussion to the mailing list.

@eolivelli (Contributor) replied:

Distributed locks are very expensive, and in any case you would have to deal with timeouts.
We should make the operations idempotent or chain them:
if a new operation comes in while the deletion is already in progress, we must wait for the result of the pending operation.
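
For illustration, a minimal sketch of that chaining, where a concurrent request for the same namespace simply joins the in-flight future (all names hypothetical):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class DeletionCoordinator {
    private final ConcurrentMap<String, CompletableFuture<Void>> inProgress = new ConcurrentHashMap<>();

    // A second "delete namespace" call made while one is running returns the
    // same future, so callers wait for the pending operation instead of
    // starting a competing one.
    CompletableFuture<Void> deleteNamespace(String ns) {
        CompletableFuture<Void> pending = inProgress.get(ns);
        if (pending != null) {
            return pending;                         // join the in-flight deletion
        }
        CompletableFuture<Void> promise = new CompletableFuture<>();
        CompletableFuture<Void> existing = inProgress.putIfAbsent(ns, promise);
        if (existing != null) {
            return existing;                        // lost the race: join the winner
        }
        doDelete(ns).whenComplete((r, ex) -> {
            inProgress.remove(ns);                  // allow future delete attempts
            if (ex != null) {
                promise.completeExceptionally(ex);
            } else {
                promise.complete(r);
            }
        });
        return promise;
    }

    private CompletableFuture<Void> doDelete(String ns) {
        return CompletableFuture.completedFuture(null); // real deletion logic here
    }
}
```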

@dlg99 (Contributor) left a comment

LGTM assuming you address Enrico's comments + CI passes (currently "You have 1 Checkstyle violation")

@eolivelli (Contributor) left a comment

The retry logic is broken.

Please check my comments.

@eolivelli (Contributor) left a comment

LGTM

@eolivelli (Contributor) left a comment

LGTM

@Technoboy- Technoboy- merged commit 3855585 into apache:master Feb 17, 2023
@mattisonchao mattisonchao deleted the fix/admin/delete_namespace branch February 17, 2023 15:16