Conversation

@massakam (Contributor)

Fixes #5234

Motivation

In the C++ client library, memory usage increases as producers are repeatedly created and closed. This is because each `deadline_timer`, such as `sendTimer_`, holds a reference to the `ProducerImpl` object, so the `ProducerImpl` destructor is never executed:

```cpp
sendTimer_->async_wait(
    std::bind(&ProducerImpl::handleSendTimeout, shared_from_this(), std::placeholders::_1));
```

ClientConnection seems to have the same problem as ProducerImpl.

cf. https://stackoverflow.com/questions/27065954/boost-deadline-timer-holds-reference-to-object

Modifications

The `ProducerImpl` destructor is now executed because the shared pointers to these `deadline_timer` objects are reset when the `ProducerImpl` object is closed.

@massakam massakam added type/bug The PR fixed a bug or issue reported a bug component/c++ labels Sep 21, 2019
@massakam massakam added this to the 2.4.2 milestone Sep 21, 2019
@massakam massakam self-assigned this Sep 21, 2019
```cpp
    return future;
}

void ConnectionPool::close() {
```
Contributor:

Could this be done implicitly in the destructor itself?

Contributor Author:

Yes. I added the destructor instead of the close() method.

@massakam massakam changed the title [Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference [WIP][Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference Sep 22, 2019
@massakam (Contributor Author)

When creating and closing a Pulsar client object repeatedly, the program may stop when the ClientConnection destructor is called. I don't know the cause yet...

```cpp
}

if (executor_) {
    executor_.reset();
```
Contributor Author:

The program doesn't seem to stop unless executor_ is reset here. However, the ClientConnection destructor is not executed, causing a memory leak.

Contributor Author:

Removed this in this PR.

@merlimat (Contributor)

> When creating and closing a Pulsar client object repeatedly, the program may stop when the ClientConnection destructor is called. I don't know the cause yet...

@massakam Do you have a code example to reproduce the issue?

Also, would that be strictly related to this PR (and what is fixing) or is that a separate issue?

@massakam (Contributor Author)

@merlimat

> Do you have a code example to reproduce the issue?

Running code like the following reproduces the issue:

```cpp
#include <iostream>
#include <pulsar/Client.h>

using namespace std;
using namespace pulsar;

int main() {
    for (int i = 0; i < 100; i++) {
        Client client("pulsar://localhost:6650");

        Producer producer;
        Result result = client.createProducer("persistent://public/default/test", producer);

        if (result != ResultOk) {
            cerr << "Failed to create producer: " << result << endl;
            return -1;
        }

        Message msg = MessageBuilder().setContent("my-message").build();
        Result res = producer.send(msg);

        if (res == ResultOk) {
            cout << "Send: " << msg.getDataAsString() << endl;
        } else {
            cerr << "Failed to send message: " << res << endl;
        }

        producer.close();
        client.close();
    }
}
```

> Also, would that be strictly related to this PR (and what is fixing) or is that a separate issue?

As I wrote above, this seems to occur when `executor_.reset()` is executed in the `close()` method of `ClientConnection`. The fix for the leak of the `ProducerImpl` object, the original purpose of this pull request, is unrelated to that modification, so I will create a separate issue.

@massakam massakam changed the title [WIP][Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference [Issue #5234][pulsar-client-cpp] Fix memory leak caused by deadline_timer holding object reference Sep 25, 2019
```cpp
    pool_.clear();
}

lock.unlock();
```
Contributor:

nit: the unlock() is automatically done in the lock destructor.

```diff
 ExecutorServiceProviderPtr executorProvider_;
 AuthenticationPtr authentication_;
-typedef std::map<std::string, ClientConnectionWeakPtr> PoolMap;
+typedef std::map<std::string, ClientConnectionPtr> PoolMap;
```
Contributor:

What is the reason for changing the weak_ptr into a shared_ptr?

This could also potentially have side effects, like not destructing the connections (that are already closed) while the pool is active.

As I understand this change, the reason is to iterate through the map in the ~ConnectionPool and force calling ClientConnectionPtr::close(). Couldn't we just do the same with the weak_ptr and calling lock() while iterating through the map and closing the ptr that are still valid?

Contributor Author:

> What is the reason for changing the weak_ptr into a shared_ptr?

While iterating over the pool map, I am worried that the iterator would be invalidated when a ClientConnection is destroyed.

> Couldn't we just do the same with the weak_ptr and calling lock() while iterating through the map and closing the ptr that are still valid?

I will try to fix it like that.

Contributor:

> While iterating the pool map, I am worried that the iterator would be broken when ClientConnection is destructed.

The iterator itself is on a map in ConnectionPool so that will still be always safe to use here.

The ClientConnectionWeakPtr can, of course, be possibly already destroyed, though the lock() operation will attempt to acquire a ref count on the object and get a shared_ptr. The resulting shared_ptr has to be checked if it's valid, and if it is, it will be kept alive by the shared_ptr.

@massakam (Contributor Author)

PTAL

@merlimat (Contributor) left a comment:

👍

@merlimat merlimat merged commit d430441 into apache:master Sep 25, 2019
@massakam massakam deleted the fix-cpp-memory-leak branch September 26, 2019 01:23
sijie pushed a commit that referenced this pull request Oct 10, 2019
Master Issue: #5234

### Motivation

The other day, I fixed a memory leak caused by the destructor of the C++ producer not being executed (#5246). However, when running a producer application written in Go in an environment with the modified C++ client library installed, the program occasionally crashes with a "bad_weak_ptr" error.

```
2019/10/01 16:34:30.210 c_client.go:68: [info] INFO  | ProducerImpl:481 | [persistent://massakam/global/test/t1, dc1-904-1012912] Closing producer for topic persistent://massakam/global/test/t1
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012912] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012912] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | BatchMessageContainer:171 | [numberOfBatchesSent = 1] [averageBatchSize = 1]
terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012911] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012911] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
bad_weak_ptr
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | BatchMessageContainer:171 | [numberOfBatchesSent = 1] [averageBatchSize = 1]
2019/10/01 16:34:30.211 c_client.go:68: [info] INFO  | ProducerImpl:463 | Producer - [persistent://massakam/global/test/t1, dc1-904-1012910] , [batchMessageContainer = { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://massakam/global/test/t1] [producerName_ = dc1-904-1012910] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 1] [averageBatchSize = 1]}]
SIGABRT: abort
PC=0x7fc78d39d2c7 m=0 sigcode=18446744073709551610
```

As a result of the investigation, I found that the destructor was called, and the object destroyed, before the process of closing `ProducerImpl` had completed.

### Modifications

To keep the `ProducerImpl` object alive, obtain its own shared pointer at the beginning of `ProducerImpl::closeAsync()` and pass that pointer to `ProducerImpl::handleClose()`. Otherwise, the object is destroyed before `handleClose()` is called.

So far, this issue has not been reproduced in `ConsumerImpl`, but I applied the same fix there as in `ProducerImpl`.
wolfstudy pushed a commit that referenced this pull request Nov 20, 2019
…imer holding object reference (#5246)

* Fix memory leak caused by deadline_timer holding object reference

* Add ConnectionPool destructor

* Do not reset executor_ in ClientConnection::close()

* Make ConnectionPool have shared_ptr instead of weak_ptr to ClientConnection

* Revert PoolMap type

* Remove lock.unlock() from ~ConnectionPool()

(cherry picked from commit d430441)
wolfstudy pushed a commit that referenced this pull request Nov 20, 2019
(cherry picked from commit dbd48ab)
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Sep 1, 2022
Fixes apache#17392

### Motivation

All timers in `ProducerImpl` are `std::shared_ptr` objects that can be
reset with `nullptr` in `ProducerImpl::cancelTimers`. It could lead to
null pointer access in some cases.

See
apache#17392 (comment)
for the analysis.

Generally it's not necessary to hold a nullable pointer to the timer.
However, to resolve the cyclic reference issue, apache#5246 reset the shared
pointer to reduce the reference count manually. It's not a good solution
because we have to perform null check for timers everywhere. The null
check still has some race condition issue like:

Thread 1:

```c++
if (timer) {  // [1] timer is not nullptr
    timer->async_wait(/* ... */);  // [3] timer is null now, see [2] below
}
```

Thread 2:

```c++
timer.reset();  // [2]
```

The best solution is to capture `weak_ptr` in timer's callback and call
`lock()` to check if the referenced object is still valid.

### Modifications
- Change the type of `sendTimer_` and `batchTimer_` to `deadline_timer`,
  not a `shared_ptr`.
- Use `PeriodicTask` instead of the `deadline_timer` for token refresh.
- Migrate `weak_from_this()` method from C++17 and capture
  `weak_from_this()` instead of `shared_from_this()` in callbacks.

### Verifying this change

Run `testResendViaSendCallback` many times; after this patch it no longer fails.

```bash
./tests/main --gtest_filter='BasicEndToEndTest.testResendViaSendCallback' --gtest_repeat=30
```
BewareMyPower added a commit that referenced this pull request Sep 6, 2022
BewareMyPower added a commit that referenced this pull request Sep 13, 2022

(cherry picked from commit 7d6f394)
BewareMyPower added a commit that referenced this pull request Sep 13, 2022

(cherry picked from commit 7d6f394)
BewareMyPower added a commit that referenced this pull request Sep 13, 2022

(cherry picked from commit 7d6f394)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Sep 16, 2022

(cherry picked from commit 7d6f394)
(cherry picked from commit 25b691b)
Development

Successfully merging this pull request may close these issues:

Memory Leak in CreateProducer of pulsar-client-go