Description
There's a memory leak in the Pulsar Java Client that happens under high load. It occurs when the Reader API is used with many short-lived Reader instances (created, used, and closed via the async API) and the Pulsar server side (brokers/bookies) is so heavily loaded that it does not respond to all requests.
The symptom is that heap memory consumption grows until an out-of-memory error occurs.
After running out of memory, the system is sometimes able to resume operations. After some time, the memory gets freed, since some behavior eventually closes the connection (perhaps related to maxNumberOfRejectedRequestPerConnection). Closing the connection releases all the memory tied to ClientCnx and the system resumes. However, GC uses about 50% of CPU before the system stalls completely.
Analysis of heap dumps shows a large number of CompletableFutures accumulating in pendingGetLastMessageIdRequests that never get removed.
This is happening in an application that extensively uses the Reader API with short-lived Reader instances: a Reader is created, used, and then closed, via the asynchronous API.
The pending getLastMessageId requests originate from this Reader API usage. Looking at the Pulsar Java client source code, closing a Reader does not remove its getLastMessageId requests from the ClientCnx. The CompletableFutures held in ClientCnx's pendingGetLastMessageIdRequests therefore keep strong references to each Reader's underlying ConsumerImpl, preventing it from being garbage collected.
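The mechanism can be reproduced in miniature without Pulsar at all. The sketch below uses only the JDK; FakeClientCnx and FakeConsumer are hypothetical stand-ins for the structures described above, not the actual client classes. Each pending future's completion callback captures the consumer, so as long as the broker never replies and nothing removes the map entry, the consumer stays strongly reachable:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class PendingRequestLeakSketch {

    static class FakeConsumer {
        // Stand-in for the large per-consumer state (receive queues etc.)
        final byte[] state = new byte[1024];
    }

    static class FakeClientCnx {
        final Map<Long, CompletableFuture<String>> pendingGetLastMessageIdRequests =
                new ConcurrentHashMap<>();
        final AtomicLong requestIdGenerator = new AtomicLong();

        CompletableFuture<String> newGetLastMessageIdRequest(FakeConsumer consumer) {
            long requestId = requestIdGenerator.incrementAndGet();
            CompletableFuture<String> future = new CompletableFuture<>();
            // The dependent stage captures `consumer`, so the map entry keeps
            // the consumer strongly reachable until the "broker" replies.
            future.thenApply(msgId -> consumer.state.length + ": " + msgId);
            pendingGetLastMessageIdRequests.put(requestId, future);
            return future;
        }
    }

    public static void main(String[] args) {
        FakeClientCnx cnx = new FakeClientCnx();
        // Simulate many short-lived readers whose requests never get answered:
        for (int i = 0; i < 10_000; i++) {
            FakeConsumer consumer = new FakeConsumer();
            cnx.newGetLastMessageIdRequest(consumer);
            // The reader is "closed" here, but nothing removes its entry.
        }
        // prints: pending entries: 10000
        System.out.println("pending entries: "
                + cnx.pendingGetLastMessageIdRequests.size());
    }
}
```

Every iteration leaves one entry (and one unreachable-by-the-application consumer) behind, which matches the heap-dump observation above.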
Unlike pendingLookupRequests and pendingRequests, pendingGetLastMessageIdRequests has no timeout handling in ClientCnx.
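A timeout sweep analogous to the one already applied to lookups could complete and drop stale entries. The following is a minimal, self-contained sketch of that idea; all class, field, and method names are illustrative stand-ins, not the actual ClientCnx code:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

public class PendingRequestTimeoutSketch {

    static class TimedFuture {
        final long createdAtNanos = System.nanoTime();
        final CompletableFuture<String> future = new CompletableFuture<>();
    }

    final Map<Long, TimedFuture> pendingGetLastMessageIdRequests =
            new ConcurrentHashMap<>();

    /** Fails and removes every pending request older than the given timeout. */
    void failTimedOutRequests(long timeoutNanos) {
        long now = System.nanoTime();
        Iterator<Map.Entry<Long, TimedFuture>> it =
                pendingGetLastMessageIdRequests.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, TimedFuture> entry = it.next();
            if (now - entry.getValue().createdAtNanos >= timeoutNanos) {
                // Completing exceptionally lets the caller observe the timeout;
                // removing the entry drops the reference chain to the consumer.
                entry.getValue().future.completeExceptionally(
                        new TimeoutException("getLastMessageId request timed out"));
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        PendingRequestTimeoutSketch cnx = new PendingRequestTimeoutSketch();
        cnx.pendingGetLastMessageIdRequests.put(1L, new TimedFuture());
        cnx.failTimedOutRequests(0); // a 0 ns timeout expires everything
        // prints: pending after sweep: 0
        System.out.println("pending after sweep: "
                + cnx.pendingGetLastMessageIdRequests.size());
    }
}
```

In the real client, such a sweep would run periodically on the client's timer with the configured operation timeout, the same way the existing lookup/request timeout handling does.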
Since each ConsumerImpl consumes a lot of memory (#7680), the heap is quickly filled and the JVM runs out of memory.
When a ClientCnx is closed, the memory gets released so this is why the system is able to resume after an OOM.
However, the system becomes almost completely unavailable, since 50% of CPU is spent in constant full GCs before the connection gets closed and the memory is released.
Current behavior
- Using the Reader API for a lot of operations under heavy load causes the client's memory consumption to grow until there is an OOM.
Expected behavior
- When a Consumer or Reader is closed, all related resources should be removed and cleaned up so that memory isn't leaked. No references should be held to the closed Consumer or Reader instance. Currently, the pendingGetLastMessageIdRequests entries hold references to the ConsumerImpl instances.
- When the server doesn't reply to a getLastMessageId request, there should be timeout handling that completes the future held in pendingGetLastMessageIdRequests and removes it.
- When the system is under heavy load, there should be proper backpressure for the Reader API so that heavy load doesn't break the system. Backpressure can take the form of rejecting requests; some form of it is necessary so that an application using the Reader API can in turn reject requests from its own clients, giving end-to-end backpressure. I assume the Pulsar Client's design already handles this elsewhere; the expectation is that the Reader API would also have backpressure in some form.
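The close-time cleanup expected in the first bullet could look roughly like the following stand-alone sketch, where each consumer tracks the request ids it registered and drains them on close. All names are hypothetical illustrations, not the real client's API:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class CloseCleanupSketch {

    static class FakeConsumer {
        // Request ids this consumer has in flight on the connection.
        final Set<Long> inFlightRequestIds = ConcurrentHashMap.newKeySet();
    }

    final Map<Long, CompletableFuture<String>> pendingGetLastMessageIdRequests =
            new ConcurrentHashMap<>();
    final AtomicLong requestIdGenerator = new AtomicLong();

    CompletableFuture<String> newGetLastMessageIdRequest(FakeConsumer consumer) {
        long requestId = requestIdGenerator.incrementAndGet();
        CompletableFuture<String> future = new CompletableFuture<>();
        pendingGetLastMessageIdRequests.put(requestId, future);
        consumer.inFlightRequestIds.add(requestId);
        return future;
    }

    /** On close, fail and remove every pending request the consumer owns. */
    void closeConsumer(FakeConsumer consumer) {
        for (long requestId : consumer.inFlightRequestIds) {
            CompletableFuture<String> future =
                    pendingGetLastMessageIdRequests.remove(requestId);
            if (future != null) {
                future.completeExceptionally(
                        new IllegalStateException("Consumer was closed"));
            }
        }
        consumer.inFlightRequestIds.clear();
    }

    public static void main(String[] args) {
        CloseCleanupSketch cnx = new CloseCleanupSketch();
        FakeConsumer consumer = new FakeConsumer();
        cnx.newGetLastMessageIdRequest(consumer);
        cnx.closeConsumer(consumer);
        // prints: pending after close: 0
        System.out.println("pending after close: "
                + cnx.pendingGetLastMessageIdRequests.size());
    }
}
```

With this shape, the connection holds no reference to a closed consumer, so it becomes garbage-collectable regardless of whether the broker ever replies.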
Pulsar Client version: 2.6.1
Java version: 11.0.7