@poorbarcode
Contributor

@poorbarcode poorbarcode commented Jun 27, 2022

Motivation

Problem

With the transaction feature enabled, we sent and received messages while executing the admin API unload namespace 1000 times. The consumer then stopped receiving messages, and no error was logged. When we called the admin API get topic stats, the response showed that only producers were registered on the topic and no consumers were, although the consumer state in the client was Ready. In other words, the consumer state was inconsistent between the broker and the client.

Locating the problem

We then found the cause: two PersistentTopic instances with the same name were registered on one broker node. The consumer was registered on one of them (call it topic-c) and the producer on the other (call it topic-p). When a message is sent in this state, the data flows like this:

client: producer sends a message

broker: handle cmd-send

broker: find the topic by name, it is "topic-p"

broker: find all subscriptions registered on "topic-p"

broker: found one subscription, but it has no consumers registered

broker: no need to send the message to the client

But the consumer is actually registered on the other topic, topic-c, so it never receives any messages.

Reproduce

How can two topics with the same name end up registered on the same broker node?

Run transaction buffer recovery, admin unload namespace, client create consumer, and client create producer at the same time. The process flows like this (the problem occurs at step 11):

| Time | transaction buffer recover | admin unload namespace | client create consumer | client create producer |
|---|---|---|---|---|
| 1 | TB recover | | | |
| 2 | TB recover failure | topic.unload | | |
| 3 | topic.close(false) | topic.close(true) | | |
| 4 | brokerService.topics.remove(topicName) | | | |
| 5 | remove topic finish | | lookup | |
| 6 | | | create topic-c | |
| 7 | | | consumer registered on topic-c | |
| 8 | | brokerService.topics.remove(topic) | | |
| 9 | | remove topic-c finish | | lookup |
| 10 | | | | create topic-p |
| 11 | | | | producer registered on topic-p |
  • Each column represents an individual process, e.g. client create consumer or client create producer.
  • Multiple processes run at the same time, and all of them affect brokerService.topics.
  • The Time column indicates only the order of the steps, not the actual time.
  • The important steps are explained below:

step 3

Even if the persistent topic property isClosingOrDeleting has already changed to true, close can still be executed once more, see line 1247:

public CompletableFuture<Void> close(boolean closeWithoutWaitingClientDisconnect) {
    CompletableFuture<Void> closeFuture = new CompletableFuture<>();
    lock.writeLock().lock();
    try {
        // closing managed-ledger waits until all producers/consumers/replicators get closed. Sometimes, broker
        // forcefully wants to close managed-ledger without waiting all resources to be closed.
        if (!isClosingOrDeleting || closeWithoutWaitingClientDisconnect) {
            fenceTopicToCloseOrDelete();
        } else {

Whether close runs depends on two predicates: the topic is not already closing or deleting, or @param closeWithoutWaitingClientDisconnect is true. This means that topic.close can be executed re-entrantly when closeWithoutWaitingClientDisconnect is true, and in the implementation of the admin API unload namespace this parameter is exactly true.

public CompletableFuture<Void> unloadNamespaceBundle(NamespaceBundle bundle, long timeout, TimeUnit timeoutUnit) {
    return unloadNamespaceBundle(bundle, timeout, timeoutUnit, true);
}

So when transaction buffer recovery fails while admin unload namespace is executing, and the recovery failure comes first, the topic is removed from brokerService.topics twice.

step-4 / step-8

The current implementation of BrokerService.removeTopicFromCache calls map.remove(key) rather than map.remove(key, value), so it can remove any value in the map, even one that is not the intended instance.
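The difference between the two calls can be seen in a minimal sketch. It assumes a plain java.util.concurrent.ConcurrentHashMap and a hypothetical FakeTopic class; neither is the broker's actual cache type, and the topic name is made up for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

final class TopicCacheRemoveDemo {
    // Hypothetical stand-in for a topic instance; only object identity matters here.
    static final class FakeTopic {
        final String label;
        FakeTopic(String label) { this.label = label; }
    }

    /** Returns the label of the instance left in the cache after a stale remover runs, or null. */
    static String staleRemove(boolean conditional) {
        ConcurrentHashMap<String, FakeTopic> topics = new ConcurrentHashMap<>();
        String name = "persistent://tenant/ns/t1"; // illustrative name
        FakeTopic oldTopic = new FakeTopic("old");
        FakeTopic newTopic = new FakeTopic("new");

        topics.put(name, oldTopic);
        topics.remove(name);        // the first close removes the old instance
        topics.put(name, newTopic); // a client lookup re-creates the topic

        // The second, stale close still holds a reference to oldTopic.
        if (conditional) {
            topics.remove(name, oldTopic); // no-op: the mapped value is newTopic
        } else {
            topics.remove(name);           // removes newTopic by mistake
        }
        FakeTopic left = topics.get(name);
        return left == null ? null : left.label;
    }

    public static void main(String[] args) {
        System.out.println("remove(key):        " + staleRemove(false));
        System.out.println("remove(key, value): " + staleRemove(true));
    }
}
```

With the unconditional remove the freshly created instance disappears from the map, so the next lookup creates yet another one; with the conditional remove the stale call leaves the new instance in place.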

To sum up, we should make these two changes:

  • Make the method topic.close non-reentrant. Also prevent re-entrance between topic.close and topic.delete.
  • Use map.remove(key, value) instead of map.remove(key) in the implementation of BrokerService.removeTopicFromCache. This change applies to both scenarios: topic.close and topic.delete.
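One way to make a close method non-reentrant is to let the first caller install a future and have every later caller return that same future. The sketch below is only an illustration under that assumption: the class name, the closeRuns counter, and the omission of the closeWithoutWaitingClientDisconnect parameter are all hypothetical simplifications, not the code in this PR.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class NonReentrantCloseDemo {
    // Holds the future of the close that is in progress or finished; null means "never closed".
    private final AtomicReference<CompletableFuture<Void>> closeFuture = new AtomicReference<>();
    // Counts how many times the (simulated) close work actually ran.
    final AtomicInteger closeRuns = new AtomicInteger();

    CompletableFuture<Void> close() {
        CompletableFuture<Void> mine = new CompletableFuture<>();
        if (!closeFuture.compareAndSet(null, mine)) {
            // Another caller already started closing: share its result instead of closing again.
            return closeFuture.get();
        }
        closeRuns.incrementAndGet(); // stands in for the real close work
        mine.complete(null);
        return mine;
    }

    public static void main(String[] args) {
        NonReentrantCloseDemo topic = new NonReentrantCloseDemo();
        CompletableFuture<Void> first = topic.close();
        CompletableFuture<Void> second = topic.close();
        System.out.println("same future: " + (first == second));
        System.out.println("close runs:  " + topic.closeRuns.get());
    }
}
```

Because the removal from the cache would happen inside the close work, running that work only once also means the topic is removed from the map only once.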

Other Modifications

In the current implementation, if closing the ledger fails, closing the topic is considered failed, and the topic state is reset to unfenced. But this changes two states [isFenced, isClosingOrDeleting] without holding a lock, which cannot guarantee consistency between them. I will fix this in this PR too (this change may not be related to the main subject).

}).exceptionally(exception -> {
    log.error("[{}] Error closing topic", topic, exception);
    unfenceTopicToResume();
    closeFuture.completeExceptionally(exception);
    return null;
});
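A minimal sketch of keeping the two flags consistent, assuming a ReentrantReadWriteLock similar to the lock used in the close snippet above. The class name FenceStateDemo and the snapshot helper are hypothetical and exist only for illustration.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class FenceStateDemo {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean isFenced;
    private boolean isClosingOrDeleting;

    /** Sets both flags under the write lock so they always change together. */
    void fence() {
        lock.writeLock().lock();
        try {
            isFenced = true;
            isClosingOrDeleting = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Resets both flags under the write lock, e.g. after a failed close. */
    void unfence() {
        lock.writeLock().lock();
        try {
            isFenced = false;
            isClosingOrDeleting = false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reads both flags under the read lock, so a reader never sees a half-applied update. */
    boolean[] snapshot() {
        lock.readLock().lock();
        try {
            return new boolean[] {isFenced, isClosingOrDeleting};
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        FenceStateDemo state = new FenceStateDemo();
        state.fence();
        state.unfence();
        boolean[] s = state.snapshot();
        System.out.println("isFenced=" + s[0] + ", isClosingOrDeleting=" + s[1]);
    }
}
```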

PR Relations

#16240

When the same topic is created repeatedly on one broker node, the following occurs: the transaction pending ack store reuses the cached managed cursor object during initialization. The process flows like this:

| Time | client create consumer 1 | client create consumer 2 |
|---|---|---|
| 1 | lookup | lookup |
| 1 | create topic' | |
| 2 | create subscription' | create topic'' |
| 2 | open new pending_ackledger | create subscription'' |
| 3 | open new pending_ack_cursor | reuse the cached pending_ackledger |
| 4 | | reuse the cached pending_ack_cursor |

If the transaction pending ack store reuses the managed cursor in the cache, the task that replays the pending ack log loops forever. This PR solves the "repeated topic creation" problem and also eliminates the possibility of reusing the pending-ack cursor: #16240.

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
    (Your PR needs to update docs, and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 27, 2022
@poorbarcode
Contributor Author

@gaoran10 @congbobo184 @Technoboy- Please take a look. ^_^

@poorbarcode poorbarcode changed the title from "[fix] [broker] Repeat create same topic" to "[fix] [broker] Repeat creates same topic" Jun 27, 2022
@poorbarcode
Contributor Author

/pulsarbot run-failure-checks


@poorbarcode poorbarcode changed the title from "[fix] [broker] Repeat creates same topic" to "[fix] [broker] The broker has two identical Perisitenttopics" Jun 28, 2022
@poorbarcode poorbarcode changed the title from "[fix] [broker] The broker has two identical Perisitenttopics" to "[fix] [broker] The broker has two identical Persitenttopics" Jun 28, 2022
@codelipenghui codelipenghui added this to the 2.11.0 milestone Jun 28, 2022
@codelipenghui codelipenghui added type/bug The PR fixed a bug or issue reported a bug area/broker release/2.10.2 release/2.9.4 labels Jun 28, 2022
Contributor

@codelipenghui codelipenghui left a comment

We should also consider preventing a topic from being closed more than once.

Comment on lines 1925 to 1978
final CompletableFuture<Optional<Topic>> createTopicFuture = getTopic(topicNameString, false);
return createTopicFuture.thenCompose(optionalTopic -> {
    if (optionalTopic.isPresent() && optionalTopic.get() == topic) {
        return removeTopicFromCache(topicNameString, createTopicFuture);
    }
    // If topic is not in Cache, do nothing.
    return CompletableFuture.completedFuture(null);
});
Contributor

Can we use map.compute() to simplify the logic? Also, it looks like we don't need to wait for the future to complete, because we already have the topic reference here.

Contributor Author

Can we use map.compute() to simplify the logic? Also, it looks like we don't need to wait for the future to complete

Yes, I've rewritten the logic.

Contributor Author

We should also consider preventing a topic from being closed more than once.

I have added the reason why topic.close was executed twice to the Motivation section, and in this PR I have rewritten topic.close to fix it. I also added a lock to the 'reset topic state to unfenced' operation. Could you review the code?

@poorbarcode poorbarcode reopened this Jun 29, 2022
@poorbarcode
Contributor Author

Hi @codelipenghui

We should also consider preventing a topic from being closed more than once.

I have added the reason why topic.close was executed twice to the Motivation section, and in this PR I have rewritten topic.close to fix it. I also added a lock to the 'reset topic state to unfenced' operation. Could you review the code?

Can we use map.compute() to simplify the logic?

Unfortunately, we can't use map.compute to simplify the logic.

Also, it looks like we don't need to wait for the future to complete, because we already have the topic reference here.

Yes, I have fixed it. The current implementation doesn't need to wait for the future to complete.

I have also rewritten the Motivation of this PR to make it easier to understand. Thanks.

@poorbarcode
Contributor Author

@eolivelli @lhotari @michaeljmarshall @gaozhangmin @Jason918 @nicoloboschi Please take a look, if you have time. Thanks.

poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Jun 29, 2022
@poorbarcode poorbarcode force-pushed the fix/topic_repeat_create branch from 3a3e08e to 0707daa Compare June 29, 2022 17:20
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Jun 29, 2022
@poorbarcode poorbarcode force-pushed the fix/topic_repeat_create branch from 0707daa to 482fdfe Compare June 29, 2022 17:27
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Jun 29, 2022
@poorbarcode poorbarcode force-pushed the fix/topic_repeat_create branch from 482fdfe to 7a9a1a0 Compare June 29, 2022 17:47
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Jun 30, 2022
@poorbarcode poorbarcode force-pushed the fix/topic_repeat_create branch from 7a9a1a0 to 8ce8d11 Compare June 30, 2022 01:23
@poorbarcode
Contributor Author

/pulsarbot run-failure-checks

@Anonymitaet Anonymitaet removed the doc-not-needed Your PR changes do not impact docs label Jun 30, 2022
@Anonymitaet Anonymitaet added doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. doc-not-needed Your PR changes do not impact docs and removed doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. labels Jun 30, 2022
@poorbarcode poorbarcode removed their assignment Jul 3, 2022
@codelipenghui codelipenghui modified the milestones: 2.11.0, 2.12.0 Jul 26, 2022
@poorbarcode poorbarcode force-pushed the fix/topic_repeat_create branch from 8ce8d11 to 5337613 Compare August 17, 2022 03:08
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

return CompletableFuture.completedFuture(null);
}
// We don't need to wait for the future complete, because we already have the topic reference here.
if (!createTopicFuture.isDone()){
Contributor

I think that we should wait for this future to complete; otherwise it may create another Topic.

Contributor

We can avoid using getNow and chain the CompletableFuture with "thenCompose" instead.

Contributor Author

I think that we should wait for this future to complete; otherwise it may create another Topic.

An earlier review comment suggested the opposite: "don't need to wait for the future complete"
#16247 (comment)

We can avoid using getNow and chain the CompletableFuture with "thenCompose" instead.

I now use "thenCompose" instead of "getNow". Thanks.

return removeTopicFromCache(topic, (CompletableFuture) null);
}

public CompletableFuture<Void> removeTopicFromCache(String topic, CompletableFuture createTopicFuture) {
Contributor

Please add a generic type to CompletableFuture.

Contributor Author

Added a generic type to CompletableFuture. Thanks.

return this.fullyCloseFuture;
} else {
// Why not return this.fullyCloseFuture ?
// I don't know, just keep the same implementation as before.
Contributor

Please remove this kind of comment ("I don't know").

Contributor Author

Removed. Thanks.

// Close limiters.
try {
closeLimiters();
} catch (Throwable t){
Contributor

Why are we catching "Throwable"? This is usually bad practice.

Contributor Author

Replaced "Throwable" with "Exception". Thanks.

closePhase2Future = closeClientsFuture.thenCompose(__ -> asyncCloseLedger());
}
// Complete resultFuture. If managed ledger close failure, reset topic to resume.
closePhase2Future.thenApply(__ -> {
Contributor

nit: we can use "whenComplete" instead of thenApply/exceptionally

Contributor Author

Now using "whenComplete" instead of "thenApply". Thanks.
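For readers following the thread, this is roughly what handling both outcomes in a single whenComplete callback looks like. It is a sketch, not the PR's code: finishClose and the resumed flag are hypothetical names, and the flag merely stands in for the call to unfenceTopicToResume().

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

final class WhenCompleteDemo {
    /** Mirrors the outcome of phase2 into a new result future; sets resumed on failure. */
    static CompletableFuture<Void> finishClose(CompletableFuture<Void> phase2, AtomicBoolean resumed) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        phase2.whenComplete((ignored, ex) -> {
            if (ex == null) {
                result.complete(null);
            } else {
                resumed.set(true); // stands in for unfenceTopicToResume()
                result.completeExceptionally(ex);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        AtomicBoolean resumed = new AtomicBoolean();
        CompletableFuture<Void> failing = new CompletableFuture<>();
        failing.completeExceptionally(new IllegalStateException("ledger close failed"));
        CompletableFuture<Void> result = finishClose(failing, resumed);
        System.out.println("failed: " + result.isCompletedExceptionally() + ", resumed: " + resumed.get());
    }
}
```

A single callback keeps the success and failure paths next to each other, which is the readability gain the nit is asking for.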

}).exceptionally(exception -> {
log.error("[{}] Error closing topic", topic, exception);
// Restart rate-limiter after close managed ledger failure. Success is not guaranteed.
// TODO Guarantee rate-limiter open finish.
Contributor

Please do not leave "TODOs"; open a new GH ticket and link it.

Contributor Author

Removed the "TODO". Thanks.

// TODO Guarantee rate-limiter open finish.
try {
restartLimitersAfterCloseTopicFail();
} catch (Throwable t){
Contributor

please do not catch Throwable blindly

Contributor Author

Replaced "Throwable" with "Exception". Thanks.

@poorbarcode poorbarcode requested review from eolivelli and removed request for codelipenghui August 18, 2022 06:50
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

@poorbarcode poorbarcode force-pushed the fix/topic_repeat_create branch from 60180d2 to bb68f61 Compare August 18, 2022 09:18
@poorbarcode
Contributor Author

/pulsarbot rerun-failure-checks

@poorbarcode
Contributor Author

Hi @eolivelli

Could you review this PR again?

@eolivelli eolivelli requested a review from lhotari August 22, 2022 07:25
@Jason918
Contributor

@codelipenghui @eolivelli @lhotari PTAL

}

public CompletableFuture<Void> removeTopicFromCache(String topic){
return removeTopicFromCache(topic, (CompletableFuture) null);
Contributor

Suggested change
- return removeTopicFromCache(topic, (CompletableFuture) null);
+ return removeTopicFromCache(topic, (CompletableFuture<Optional<Topic>>) null);

Contributor Author

Already fixed

// Close limiters.
try {
closeLimiters();
} catch (Exception t){
Contributor

Why should we catch the Exception?

Contributor Author

Just to keep the logic the same as before: a failure to close the limiters does not affect closing the topic.

}

// Close client components.
CompletableFuture<Void> closeClientsFuture = asyncCloseClients();
Contributor

Suggested change
- CompletableFuture<Void> closeClientsFuture = asyncCloseClients();
+ CompletableFuture<Void> closeClientsFuture = asyncCloseClients(boolean closeWithoutWaitingClientDisconnect);

Contributor Author

@poorbarcode poorbarcode Aug 30, 2022

The code at lines 1294~1298 handles the closeWithoutWaitingClientDisconnect logic:

CompletableFuture<Void> closePhase2Future;
if (closeWithoutWaitingClientDisconnect){
    closePhase2Future = asyncCloseLedger();
} else {
    closePhase2Future = closeClientsFuture.thenCompose(__ -> asyncCloseLedger());
}
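The effect of that branch can be illustrated with a small standalone sketch. ClosePhaseDemo and its Supplier parameter are hypothetical; the supplier stands in for asyncCloseLedger(), and the counter only records when the ledger close is started.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ClosePhaseDemo {
    /** Starts the ledger close immediately, or only after the clients have closed. */
    static CompletableFuture<Void> closePhase2(
            boolean closeWithoutWaitingClientDisconnect,
            CompletableFuture<Void> closeClientsFuture,
            Supplier<CompletableFuture<Void>> asyncCloseLedger) {
        if (closeWithoutWaitingClientDisconnect) {
            return asyncCloseLedger.get();
        }
        return closeClientsFuture.thenCompose(ignored -> asyncCloseLedger.get());
    }

    public static void main(String[] args) {
        AtomicInteger ledgerCloses = new AtomicInteger();
        Supplier<CompletableFuture<Void>> closeLedger = () -> {
            ledgerCloses.incrementAndGet();
            return CompletableFuture.<Void>completedFuture(null);
        };
        CompletableFuture<Void> clients = new CompletableFuture<>();
        closePhase2(false, clients, closeLedger);
        System.out.println("before clients closed: " + ledgerCloses.get());
        clients.complete(null);
        System.out.println("after clients closed:  " + ledgerCloses.get());
    }
}
```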

log.error("[{}] Error closing topic", topic, ex);
// Restart rate-limiter after close managed ledger failure. Success is not guaranteed.
try {
restartLimitersAfterCloseTopicFail();
Contributor

Why should we catch the Exception?

Contributor Author

Just to keep the logic the same as before: restart the rate limiter after the managed ledger fails to close. Success is not guaranteed.

@poorbarcode
Contributor Author

To make this PR easier to review, it was split into two other PRs:

@poorbarcode poorbarcode closed this Sep 7, 2022
@poorbarcode poorbarcode deleted the fix/topic_repeat_create branch September 17, 2022 02:53

Labels

area/broker doc-not-needed Your PR changes do not impact docs release/2.9.4 release/2.10.3 type/bug The PR fixed a bug or issue reported a bug


6 participants