[Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference #5246
Conversation
```cpp
    return future;
}

void ConnectionPool::close() {
```
Could this be done implicitly in the destructor itself?
Yes. I added the destructor instead of the close() method.
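A minimal sketch of that idea, with hypothetical member names (`pool_`, `mutex_` are assumptions here; the final shape of the destructor is discussed further down in this thread):

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct ClientConnection;  // stand-in; the real class lives in the C++ client

class ConnectionPool {
   public:
    // No explicit close() call required: cleanup runs implicitly when the
    // owning client destroys the pool.
    ~ConnectionPool() {
        std::lock_guard<std::mutex> lock(mutex_);
        pool_.clear();  // drop the pool's references to the connections
    }

   private:
    std::mutex mutex_;
    std::map<std::string, std::shared_ptr<ClientConnection>> pool_;
};
```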
When creating and closing a Pulsar client object repeatedly, the program may stop when the …

```cpp
}

if (executor_) {
    executor_.reset();
```
The program doesn't seem to stop unless `executor_` is reset here. However, the `ClientConnection` destructor is not executed, causing a memory leak.
Removed this in this PR.
@massakam Do you have a code example to reproduce the issue? Also, would that be strictly related to this PR (and what it is fixing), or is that a separate issue?
Running code like the following reproduces the issue:

```cpp
#include <iostream>
#include <pulsar/Client.h>

using namespace std;
using namespace pulsar;

int main() {
    for (int i = 0; i < 100; i++) {
        Client client("pulsar://localhost:6650");
        Producer producer;
        Result result = client.createProducer("persistent://public/default/test", producer);
        if (result != ResultOk) {
            cerr << "Failed to create producer: " << result << endl;
            return -1;
        }
        Message msg = MessageBuilder().setContent("my-message").build();
        Result res = producer.send(msg);
        if (res == ResultOk) {
            cout << "Send: " << msg.getDataAsString() << endl;
        } else {
            cerr << "Failed to send message: " << res << endl;
        }
        producer.close();
        client.close();
    }
}
```
As I wrote above, this seems to occur if …
```cpp
pool_.clear();
}

lock.unlock();
```
nit: the `unlock()` is done automatically in the lock's destructor.
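The RAII behavior behind that nit, as a tiny standalone sketch:

```cpp
#include <mutex>

std::mutex mutex_;

void closePool() {
    std::unique_lock<std::mutex> lock(mutex_);
    // ... clear the pool under the lock ...
}   // no explicit lock.unlock() needed: ~unique_lock() releases the mutex here
```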
```diff
 ExecutorServiceProviderPtr executorProvider_;
 AuthenticationPtr authentication_;
-typedef std::map<std::string, ClientConnectionWeakPtr> PoolMap;
+typedef std::map<std::string, ClientConnectionPtr> PoolMap;
```
What is the reason for changing the `weak_ptr` into a `shared_ptr`?

This could also potentially have side effects, like not destructing connections (that are already closed) while the pool is active.

As I understand this change, the reason is to iterate through the map in `~ConnectionPool` and force calling `ClientConnection::close()`. Couldn't we just do the same with the `weak_ptr`, calling `lock()` while iterating through the map and closing the pointers that are still valid?
> What is the reason for changing the `weak_ptr` into a `shared_ptr`?
While iterating the pool map, I am worried that the iterator would be broken when ClientConnection is destructed.
> Couldn't we just do the same with the `weak_ptr` and calling `lock()` while iterating through the map and closing the ptr that are still valid?
I will try to fix it like that.
> While iterating the pool map, I am worried that the iterator would be broken when ClientConnection is destructed.
The iterator itself is on a map in `ConnectionPool`, so it will always be safe to use here.
The object behind a `ClientConnectionWeakPtr` can, of course, already be destroyed, but the `lock()` operation will attempt to acquire a reference count on the object and return a `shared_ptr`. The resulting `shared_ptr` has to be checked for validity; if it is valid, the object will be kept alive by that `shared_ptr`.
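A minimal sketch of that pattern (the types and names here are illustrative stand-ins, not the actual client code):

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for ClientConnection, for illustration only.
struct ClientConnection {
    void close() { /* tear down the socket, cancel timers, ... */ }
};

using ClientConnectionWeakPtr = std::weak_ptr<ClientConnection>;
using PoolMap = std::map<std::string, ClientConnectionWeakPtr>;

// Iterating the map is safe: the map owns only weak_ptrs, so destroying a
// ClientConnection elsewhere never invalidates the iterator.
void closeAllConnections(PoolMap& pool) {
    for (auto& entry : pool) {
        // lock() tries to acquire a reference count; it returns an empty
        // shared_ptr if the connection has already been destroyed.
        if (auto cnx = entry.second.lock()) {
            cnx->close();  // the shared_ptr keeps the object alive here
        }
    }
}
```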
PTAL
merlimat left a comment:
👍
Master Issue: #5234

### Motivation

The other day, I fixed a memory leak caused by the destructor of the C++ producer not being executed in #5246. However, when running a producer application written in Go in an environment with the modified C++ client library installed, the program occasionally crashes with a "bad_weak_ptr" error.

```
2019/10/01 16:34:30.210 c_client.go:68: [info] INFO | ProducerImpl:481 | [persistent://massakam/global/test/t1, dc1-904-1012912] Closing producer for topic persistent://massakam/global/test/t1
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012912] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012912] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO | BatchMessageContainer:171 | [numberOfBatchesSent = 1] [averageBatchSize = 1]
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012911] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012911] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
bad_weak_ptr
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO | BatchMessageContainer:171 | [numberOfBatchesSent = 1] [averageBatchSize = 1]
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012910] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012910] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
SIGABRT: abort
PC=0x7fc78d39d2c7 m=0 sigcode=18446744073709551610
```

As a result of the investigation, I found that the destructor was called before the process of closing `ProducerImpl` was completed, and the object was destroyed.

### Modifications

To keep the `ProducerImpl` object alive, obtain its own shared pointer at the beginning of `ProducerImpl::closeAsync()` and pass that pointer through to `ProducerImpl::handleClose()`; otherwise, the object can be destroyed before `handleClose()` is called. So far, this issue has not been reproduced in `ConsumerImpl`, but I fixed it in the same way as `ProducerImpl`.
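A self-contained sketch of the fix described above, with simplified signatures (the real `ProducerImpl` API differs; `sendCloseCommand` is a placeholder invented here for illustration):

```cpp
#include <functional>
#include <memory>

class ProducerImpl : public std::enable_shared_from_this<ProducerImpl> {
   public:
    void closeAsync() {
        // Pin this object for the whole close sequence. Without this, the
        // last external reference can drop while the close is in flight,
        // destroying the object before handleClose() runs (and making
        // shared_from_this() inside callbacks throw std::bad_weak_ptr).
        auto self = shared_from_this();
        sendCloseCommand([self](int result) { self->handleClose(result, self); });
    }

   private:
    // Passing the shared_ptr through keeps the object alive until
    // handleClose() itself has finished.
    void handleClose(int /*result*/, std::shared_ptr<ProducerImpl> /*self*/) {}

    // Placeholder for sending CommandCloseProducer and registering the
    // completion callback; the real code goes through ClientConnection.
    void sendCloseCommand(std::function<void(int)> callback) { callback(0); }
};

int main() {
    auto producer = std::make_shared<ProducerImpl>();
    producer->closeAsync();
}
```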
Fix memory leak caused by deadline_timer holding object reference (#5246)

* Fix memory leak caused by deadline_timer holding object reference
* Add ConnectionPool destructor
* Do not reset executor_ in ClientConnection::close()
* Make ConnectionPool have shared_ptr instead of weak_ptr to ClientConnection
* Revert PoolMap type
* Remove lock.unlock() from ~ConnectionPool()

(cherry picked from commit d430441)
Fixes #17392

### Motivation

All timers in `ProducerImpl` are `std::shared_ptr` objects that can be reset with `nullptr` in `ProducerImpl::cancelTimers`. This can lead to null pointer access in some cases; see #17392 (comment) for the analysis.

Generally it's not necessary to hold a nullable pointer to a timer. However, to resolve the cyclic reference issue, #5246 reset the shared pointer to reduce the reference count manually. That is not a good solution because we have to perform a null check for the timers everywhere, and the null check still has a race condition, like:

Thread 1:

```c++
if (timer) {                       // [1] timer is not nullptr
    timer->async_wait(/* ... */);  // [3] timer is null now, see [2] below
}
```

Thread 2:

```c++
timer.reset();  // [2]
```

The best solution is to capture a `weak_ptr` in the timer's callback and call `lock()` to check whether the referenced object is still valid.

### Modifications

- Change the type of `sendTimer_` and `batchTimer_` to `deadline_timer`, not a `shared_ptr`.
- Use `PeriodicTask` instead of the `deadline_timer` for token refresh.
- Migrate the `weak_from_this()` method from C++17 and capture `weak_from_this()` instead of `shared_from_this()` in callbacks.

### Verifying this change

Run `testResendViaSendCallback` many times and observe that it no longer fails after this patch:

```bash
./tests/main --gtest_filter='BasicEndToEndTest.testResendViaSendCallback' --gtest_repeat=30
```
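A self-contained sketch of the `weak_ptr`-capture pattern using Boost.Asio directly (class and member names are illustrative, not the actual `ProducerImpl` code; an explicit `std::weak_ptr` construction stands in for `weak_from_this()` to stay within C++11):

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <memory>

class Producer : public std::enable_shared_from_this<Producer> {
   public:
    explicit Producer(boost::asio::io_context& io) : sendTimer_(io) {}

    void startSendTimer() {
        sendTimer_.expires_from_now(boost::posix_time::seconds(1));
        // Capture only a weak_ptr: the pending wait no longer co-owns the
        // Producer, so there is no cycle and no need to null out the timer.
        std::weak_ptr<Producer> weakSelf{shared_from_this()};
        sendTimer_.async_wait([weakSelf](const boost::system::error_code& ec) {
            auto self = weakSelf.lock();
            if (self && !ec) {
                self->handleSendTimeout();  // object still alive, safe to use
            }
            // If lock() fails, the Producer is already gone; do nothing.
        });
    }

   private:
    void handleSendTimeout() { std::cout << "send timeout fired\n"; }

    boost::asio::deadline_timer sendTimer_;  // plain member, not a shared_ptr
};

int main() {
    boost::asio::io_context io;
    auto producer = std::make_shared<Producer>(io);
    producer->startSendTimer();
    io.run();  // fires the timeout after one second
}
```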
Fixes #5234
### Motivation

In the C++ client library, memory usage increases as producers are repeatedly created and closed. This is because a `deadline_timer` such as `sendTimer_` holds a reference to the `ProducerImpl` object, so the `ProducerImpl` destructor is never executed.

pulsar/pulsar-client-cpp/lib/ProducerImpl.cc, lines 183 to 184 in 822531c

`ClientConnection` seems to have the same problem as `ProducerImpl`.

cf. https://stackoverflow.com/questions/27065954/boost-deadline-timer-holds-reference-to-object
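To make the cycle concrete, a minimal, self-contained demonstration (hypothetical names, mirroring the Stack Overflow discussion linked above):

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <memory>

// A pending async_wait stores its completion handler inside the io_context.
// If that handler captures shared_from_this(), it co-owns the object for as
// long as the wait is outstanding.
struct ProducerLike : std::enable_shared_from_this<ProducerLike> {
    explicit ProducerLike(boost::asio::io_context& io)
        : sendTimer_(std::make_shared<boost::asio::deadline_timer>(io)) {}
    ~ProducerLike() { std::cout << "destructor ran\n"; }

    void arm() {
        sendTimer_->expires_from_now(boost::posix_time::hours(1));
        auto self = shared_from_this();  // handler now co-owns *this
        sendTimer_->async_wait([self](const boost::system::error_code&) {});
    }

    std::shared_ptr<boost::asio::deadline_timer> sendTimer_;
};

int main() {
    boost::asio::io_context io;
    auto p = std::make_shared<ProducerLike>(io);
    p->arm();
    p.reset();  // "destructor ran" is NOT printed here: the stored handler
                // still owns the object, and only releases it when the timer
                // fires, is cancelled, or the io_context is destroyed.
}
```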
### Modifications

The destructor is executed by resetting the shared pointers to these `deadline_timer`s when closing the `ProducerImpl` object.
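A sketch of that modification, assuming `sendTimer_` and `batchTimer_` are `std::shared_ptr<boost::asio::deadline_timer>` members as described above (simplified from the actual patch):

```cpp
#include <boost/asio.hpp>
#include <memory>

class ProducerImpl {
   public:
    explicit ProducerImpl(boost::asio::io_context& io)
        : sendTimer_(std::make_shared<boost::asio::deadline_timer>(io)),
          batchTimer_(std::make_shared<boost::asio::deadline_timer>(io)) {}

    void closeAsync() {
        // ... send CommandCloseProducer, unregister from the connection, ...

        // Destroying the timers cancels their outstanding async_wait calls
        // and frees the stored completion handlers. Those handlers are what
        // hold shared_ptr references back to this ProducerImpl, so releasing
        // them lets the reference count reach zero and the destructor run.
        sendTimer_.reset();
        batchTimer_.reset();
    }

   private:
    std::shared_ptr<boost::asio::deadline_timer> sendTimer_;
    std::shared_ptr<boost::asio::deadline_timer> batchTimer_;
};

int main() {
    boost::asio::io_context io;
    ProducerImpl producer(io);
    producer.closeAsync();
}
```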