Conversation

@massakam (Contributor)

Fixes #5234

Motivation

In the C++ client library, memory usage increases as producers are repeatedly created and closed. This is because each `deadline_timer`, such as `sendTimer_`, holds a reference to the `ProducerImpl` object, so the `ProducerImpl` destructor is never executed:

```cpp
sendTimer_->async_wait(
    std::bind(&ProducerImpl::handleSendTimeout, shared_from_this(), std::placeholders::_1));
```

ClientConnection seems to have the same problem as ProducerImpl.

cf. https://stackoverflow.com/questions/27065954/boost-deadline-timer-holds-reference-to-object

Modifications

The `ProducerImpl` destructor is now executed because the shared pointers to these `deadline_timer` objects are reset when the `ProducerImpl` object is closed.

@massakam massakam added type/bug The PR fixed a bug or issue reported a bug component/c++ labels Sep 21, 2019
@massakam massakam added this to the 2.4.2 milestone Sep 21, 2019
@massakam massakam self-assigned this Sep 21, 2019
```cpp
    return future;
}

void ConnectionPool::close() {
```
Contributor:

Could this be done implicitly in the destructor itself?

Contributor Author:

Yes. I added the destructor instead of the close() method.

@massakam massakam changed the title [Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference [WIP][Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference Sep 22, 2019
@massakam (Contributor Author)

When creating and closing a Pulsar client object repeatedly, the program may stop when the ClientConnection destructor is called. I don't know the cause yet...

```cpp
}

if (executor_) {
    executor_.reset();
```
Contributor Author:

The program doesn't seem to stop unless executor_ is reset here. However, the ClientConnection destructor is not executed, causing a memory leak.

Contributor Author:

Removed this in this PR.

@merlimat (Contributor)

> When creating and closing a Pulsar client object repeatedly, the program may stop when the ClientConnection destructor is called. I don't know the cause yet...

@massakam Do you have a code example to reproduce the issue?

Also, would that be strictly related to this PR (and what is fixing) or is that a separate issue?

@massakam (Contributor Author)

@merlimat

> Do you have a code example to reproduce the issue?

Running code like the following reproduces the issue:

```cpp
#include <iostream>
#include <pulsar/Client.h>

using namespace std;
using namespace pulsar;

int main() {
    for (int i = 0; i < 100; i++) {
        Client client("pulsar://localhost:6650");

        Producer producer;
        Result result = client.createProducer("persistent://public/default/test", producer);

        if (result != ResultOk) {
            cerr << "Failed to create producer: " << result << endl;
            return -1;
        }

        Message msg = MessageBuilder().setContent("my-message").build();
        Result res = producer.send(msg);

        if (res == ResultOk) {
            cout << "Send: " << msg.getDataAsString() << endl;
        } else {
            cerr << "Failed to send message: " << res << endl;
        }

        producer.close();
        client.close();
    }
}
```

> Also, would that be strictly related to this PR (and what is fixing) or is that a separate issue?

As I wrote above, this seems to occur when `executor_.reset()` is executed in the `close()` method of `ClientConnection`. The fix for the leak of the `ProducerImpl` object, the original purpose of this pull request, is unrelated to that modification, so I will create a separate issue.

@massakam massakam changed the title [WIP][Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference [Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference Sep 25, 2019
```cpp
    pool_.clear();
}

lock.unlock();
```
Contributor:

nit: the unlock() is automatically done in the lock destructor.

```diff
 ExecutorServiceProviderPtr executorProvider_;
 AuthenticationPtr authentication_;
-typedef std::map<std::string, ClientConnectionWeakPtr> PoolMap;
+typedef std::map<std::string, ClientConnectionPtr> PoolMap;
```
Contributor:

What is the reason for changing the weak_ptr into a shared_ptr?

This could also potentially have side effects, like not destructing the connections (that are already closed) while the pool is active.

As I understand this change, the reason is to iterate through the map in the ~ConnectionPool and force calling ClientConnectionPtr::close(). Couldn't we just do the same with the weak_ptr and calling lock() while iterating through the map and closing the ptr that are still valid?

Contributor Author:

> What is the reason for changing the weak_ptr into a shared_ptr?

While iterating over the pool map, I am worried that the iterator would be invalidated when a ClientConnection is destroyed.

> Couldn't we just do the same with the weak_ptr and calling lock() while iterating through the map and closing the ptr that are still valid?

I will try to fix it like that.

Contributor:

> While iterating the pool map, I am worried that the iterator would be broken when ClientConnection is destructed.

The iterator itself is on a map in ConnectionPool so that will still be always safe to use here.

The ClientConnectionWeakPtr can, of course, be possibly already destroyed, though the lock() operation will attempt to acquire a ref count on the object and get a shared_ptr. The resulting shared_ptr has to be checked if it's valid, and if it is, it will be kept alive by the shared_ptr.

@massakam (Contributor Author)

PTAL

@merlimat (Contributor) left a comment:

👍

@merlimat merlimat merged commit d430441 into apache:master Sep 25, 2019
@massakam massakam deleted the fix-cpp-memory-leak branch September 26, 2019 01:23
sijie pushed a commit that referenced this pull request Oct 10, 2019
Master Issue: #5234

### Motivation

The other day, I fixed a memory leak caused by the destructor of the C++ producer not being executed (#5246). However, when running a producer application written in Go in an environment with the modified C++ client library installed, the program occasionally crashes with a "bad_weak_ptr" error.

```
2019/10/01 16:34:30.210 c_client.go:68: [info] INFO  | ProducerImpl:481 | [persistent://massakam/global/test/t1, dc1-904-1012912] Closing producer for topic persistent://massakam/global/test/t1
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012912] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012912] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | BatchMessageContainer:171 | [numberOfBatchesSent = 1] [averageBatchSize = 1]
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012911] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012911] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
bad_weak_ptr
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | BatchMessageContainer:171 | [numberOfBatchesSent = 1] [averageBatchSize = 1]
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012910] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012910] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
SIGABRT: abort
PC=0x7fc78d39d2c7 m=0 sigcode=18446744073709551610
```

As a result of the investigation, I found that the destructor was called, and the object destroyed, before the process of closing `ProducerImpl` had completed.

### Modifications

To keep the `ProducerImpl` object alive, obtain its own shared pointer at the beginning of `ProducerImpl::closeAsync()` and pass that pointer to `ProducerImpl::handleClose()`. Otherwise, the object is destroyed before `handleClose()` is called.

So far, this issue has not been reproduced in `ConsumerImpl`, but I applied the same fix there as in `ProducerImpl`.
wolfstudy pushed a commit that referenced this pull request Nov 20, 2019
…imer holding object reference (#5246)

* Fix memory leak caused by deadline_timer holding object reference

* Add ConnectionPool destructor

* Do not reset executor_ in ClientConnection::close()

* Make ConnectionPool have shared_ptr instead of weak_ptr to ClientConnection

* Revert PoolMap type

* Remove lock.unlock() from ~ConnectionPool()

(cherry picked from commit d430441)
wolfstudy pushed a commit that referenced this pull request Nov 20, 2019
(cherry picked from commit dbd48ab)
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Sep 1, 2022
Fixes apache#17392

### Motivation

All timers in `ProducerImpl` are `std::shared_ptr` objects that can be
reset with `nullptr` in `ProducerImpl::cancelTimers`. It could lead to
null pointer access in some cases.

See
apache#17392 (comment)
for the analysis.

Generally it's not necessary to hold a nullable pointer to the timer.
However, to resolve the cyclic reference issue, apache#5246 reset the shared
pointer to reduce the reference count manually. It's not a good solution
because we have to perform null check for timers everywhere. The null
check still has some race condition issue like:

Thread 1:

```c++
if (timer) {  // [1] timer is not nullptr
    timer->async_wait(/* ... */);  // [3] timer is null now, see [2] below
}
```

Thread 2:

```c++
timer.reset();  // [2]
```

The best solution is to capture `weak_ptr` in timer's callback and call
`lock()` to check if the referenced object is still valid.

### Modifications
- Change the type of `sendTimer_` and `batchTimer_` to `deadline_timer`,
  not a `shared_ptr`.
- Use `PeriodicTask` instead of the `deadline_timer` for token refresh.
- Migrate `weak_from_this()` method from C++17 and capture
  `weak_from_this()` instead of `shared_from_this()` in callbacks.

### Verifying this change

Run `testResendViaSendCallback` many times; after this patch it no longer fails.

```bash
./tests/main --gtest_filter='BasicEndToEndTest.testResendViaSendCallback' --gtest_repeat=30
```
BewareMyPower added a commit that referenced this pull request Sep 6, 2022
BewareMyPower added a commit that referenced this pull request Sep 13, 2022

(cherry picked from commit 7d6f394)
BewareMyPower added a commit that referenced this pull request Sep 13, 2022

(cherry picked from commit 7d6f394)
BewareMyPower added a commit that referenced this pull request Sep 13, 2022

(cherry picked from commit 7d6f394)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Sep 16, 2022

(cherry picked from commit 7d6f394)
(cherry picked from commit 25b691b)
Development

Successfully merging this pull request may close these issues:

Memory Leak in CreateProducer of pulsar-client-go