KAFKA-4840 : BufferPool errors can cause buffer pool to go into a bad state #2659
smccauliff wants to merge 10 commits into apache:trunk from
We don't need the BUFFERPOOL prefix here since it's already defined in the BufferPool class. Thanks for the PR, will hopefully review it soon.
Thanks for the feedback. I've removed this prefix.
Refer to this link for build results (access rights to CI server needed):
Readability.
… something in the free list.
Always signal waiters when allocation is complete. Remove BUFFERPOOL prefix from constant.
I am not sure it is useful to have a static variable for this string since it is used only once. It also differs from the code style in Sender.SenderMetrics, which assigns the sensor name directly. Maybe we can preserve the existing style for simplicity and consistency.
There is a mock that refers to this variable; otherwise I would not have bothered to put this into a static final variable.
Since we only release lock in the finally here, do we still need to check lock.isHeldByCurrentThread?
Probably not. The original code was unlocking in different places so this was needed. I will remove this.
The name restoreAvailableMemoryOnFailure is a bit odd because we should always restore available memory on failure. Maybe we can name it hasError and set it to false right before returning the buffer.
We can probably remove this line restoreAvailableMemoryOnFailure = false.
Should we also remove this condition variable within a finally block?
Would it be possible and simpler to merge the two finally into one?
Not without additional complications. One finally deals with cases caused by waiting for memory to become available; the other deals with cases caused by just allocating memory.
I think there is actually a reasonable way to merge the two finally blocks into one. The idea is that the code in the first finally block doesn't throw an exception and is a no-op if this.availableMemory + freeListSize >= size. It is a no-op if you change it to the following:
    if (hasError)
        this.availableMemory += accumulated;
    if (moreMemory != null)
        this.waiters.remove(moreMemory);
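To make the suggestion above concrete, here is a minimal sketch of a single merged finally block. It is not the actual BufferPool code: the class MergedFinallyPool, the method reserve, and the helper accessors are hypothetical, and the pool only tracks a byte budget rather than real buffers. Only the names hasError, accumulated, moreMemory, waiters and availableMemory come from the discussion.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified pool: it accounts for bytes only and hands out no buffers.
class MergedFinallyPool {
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<Condition> waiters = new ArrayDeque<>();
    private long availableMemory;

    MergedFinallyPool(long totalMemory) {
        this.availableMemory = totalMemory;
    }

    /** Reserves size bytes, waiting up to maxWaitMs for other threads to release memory. */
    void reserve(long size, long maxWaitMs) throws InterruptedException, TimeoutException {
        lock.lock();
        long accumulated = 0;        // bytes taken from the budget so far
        Condition moreMemory = null; // non-null only if we had to queue up and wait
        boolean hasError = true;     // cleared right before the successful exit
        try {
            if (availableMemory >= size) {
                availableMemory -= size;
                accumulated = size;
            } else {
                moreMemory = lock.newCondition();
                waiters.addLast(moreMemory);
                long remainingNs = TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
                while (accumulated < size) {
                    // Take whatever is available now, then wait for the rest.
                    long got = Math.min(size - accumulated, availableMemory);
                    availableMemory -= got;
                    accumulated += got;
                    if (accumulated < size) {
                        if (remainingNs <= 0)
                            throw new TimeoutException("Failed to reserve memory within " + maxWaitMs + " ms");
                        remainingNs = moreMemory.awaitNanos(remainingNs);
                    }
                }
            }
            hasError = false;
        } finally {
            // Single cleanup path: both statements are no-ops on the paths they do not apply to.
            if (hasError)
                availableMemory += accumulated;
            if (moreMemory != null)
                waiters.remove(moreMemory);
            lock.unlock();
        }
    }

    /** Returns size bytes to the budget and wakes the longest-waiting thread, if any. */
    void release(long size) {
        lock.lock();
        try {
            availableMemory += size;
            Condition first = waiters.peekFirst();
            if (first != null)
                first.signal();
        } finally {
            lock.unlock();
        }
    }

    long availableMemory() {
        lock.lock();
        try {
            return availableMemory;
        } finally {
            lock.unlock();
        }
    }

    int waiterCount() {
        lock.lock();
        try {
            return waiters.size();
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch a timed-out or interrupted reserve leaves the budget and the waiter queue exactly as it found them, which is the property the two separate finally blocks were protecting.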
Don't check to see if lock is held before unlocking, since there is only one place where the lock is unlocked.
@lindong28 Can you check this again? Thanks.
@smccauliff I think we can further simplify the code by merging the two
    freeUp(size);
    this.availableMemory -= size;
    lock.unlock();
    return allocateByteBuffer(size);
If an OOM is thrown from here, it seems the available memory would be already decremented and not added back.
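One way to address this concern, sketched under assumptions, is to track whether the reservation has been made and undo it in a finally block when the allocation throws. The class RestoringPool and its injectable allocator are hypothetical and exist only so that an OutOfMemoryError can be simulated; this version allocates while the lock is held and is not the patch itself.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.IntFunction;

// Hypothetical, simplified pool used only to illustrate the accounting fix.
class RestoringPool {
    private final ReentrantLock lock = new ReentrantLock();
    // Assumption: the allocator is injectable so a failure can be simulated.
    private final IntFunction<ByteBuffer> allocator;
    private long availableMemory;

    RestoringPool(long totalMemory, IntFunction<ByteBuffer> allocator) {
        this.availableMemory = totalMemory;
        this.allocator = allocator;
    }

    ByteBuffer allocate(int size) {
        lock.lock();
        boolean reserved = false; // true once availableMemory has been decremented
        boolean hasError = true;  // cleared right before returning the buffer
        try {
            if (size > availableMemory)
                throw new IllegalStateException("Not enough memory for " + size + " bytes");
            availableMemory -= size;
            reserved = true;
            ByteBuffer buffer = allocator.apply(size); // may throw OutOfMemoryError
            hasError = false;
            return buffer;
        } finally {
            // Give the bytes back if anything failed after the reservation.
            if (hasError && reserved)
                availableMemory += size;
            lock.unlock();
        }
    }

    long availableMemory() {
        lock.lock();
        try {
            return availableMemory;
        } finally {
            lock.unlock();
        }
    }
}
```

With the snippet quoted above, the same simulated failure would leave availableMemory permanently reduced by size, because nothing runs after allocateByteBuffer throws.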
    // over for them
    if (!(this.availableMemory == 0 && this.free.isEmpty()) && !this.waiters.isEmpty())
        this.waiters.peekFirst().signal();
    lock.unlock();
It seems findBugs is not happy with this finally block although I think it should be fine. Can we have a try finally block here to ensure lock.unlock() is executed so that findBugs is happy?
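A minimal sketch of the requested shape follows, assuming the goal is simply that lock.unlock() sits in its own finally so it runs on every path out of the cleanup code. The class SignallingPool and its methods are hypothetical, and the free list from the real condition is left out for brevity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified pool: byte accounting only, no free list.
class SignallingPool {
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<Condition> waiters = new ArrayDeque<>();
    private long availableMemory;

    SignallingPool(long totalMemory) {
        this.availableMemory = totalMemory;
    }

    void reserve(long size) {
        lock.lock();
        try {
            if (size > availableMemory)
                throw new IllegalStateException("Not enough memory for " + size + " bytes");
            availableMemory -= size;
        } finally {
            // Signal the next waiter if memory is left over, and make sure the
            // unlock below runs even if the signalling code throws.
            try {
                if (availableMemory > 0 && !waiters.isEmpty())
                    waiters.peekFirst().signal();
            } finally {
                lock.unlock();
            }
        }
    }

    boolean isLocked() {
        return lock.isLocked();
    }

    long availableMemory() {
        lock.lock();
        try {
            return availableMemory;
        } finally {
            lock.unlock();
        }
    }
}
```

Nesting the try inside the outer finally keeps the unlock as the last statement on every path, which is the pattern static analysers usually look for.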
@smccauliff Thanks for the update. LGTM. findBugs was complaining about one of the finally blocks. Can we change it to make findBugs happy?
Thanks for updating the patch. Merged to trunk.
    // we have enough unallocated or pooled memory to immediately
    // satisfy the request
    freeUp(size);
    ByteBuffer allocatedBuffer = allocateByteBuffer(size);
I think it was intentional that we didn't allocate memory with the lock held. It seems like this optimisation was lost?
I'll see if there is some clean way to restore that optimization, but if there is OOM on buffer allocation then the lock needs to be acquired and the allocation freed once more. Seems pretty ugly.
Yeah, it's not pretty. However, reducing the concurrency of BufferPool in the common path is not desirable. We definitely need to handle OOMs correctly, but they are relatively rare and it's OK if that path is slower.
Good point. I think it is still worth keeping the optimization, although typically the producer will only allocate buffers of the poolable batch size, so the actual memory allocation should not happen very often.
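The trade-off discussed in this thread could look roughly like the sketch below: the reservation is made under the lock, the allocation runs without it, and only the rare failure path re-acquires the lock to return the bytes. The class UnlockedAllocationPool, its injectable allocator, and the accessor methods are hypothetical names chosen for illustration, not the code that was eventually committed.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.IntFunction;

// Hypothetical, simplified pool showing allocation outside the lock.
class UnlockedAllocationPool {
    private final ReentrantLock lock = new ReentrantLock();
    private final Deque<Condition> waiters = new ArrayDeque<>();
    // Assumption: the allocator is injectable so a failure can be simulated.
    private final IntFunction<ByteBuffer> allocator;
    private long availableMemory;

    UnlockedAllocationPool(long totalMemory, IntFunction<ByteBuffer> allocator) {
        this.availableMemory = totalMemory;
        this.allocator = allocator;
    }

    ByteBuffer allocate(int size) {
        // Step 1: reserve the bytes while holding the lock.
        lock.lock();
        try {
            if (size > availableMemory)
                throw new IllegalStateException("Not enough memory for " + size + " bytes");
            availableMemory -= size;
        } finally {
            lock.unlock();
        }

        // Step 2: allocate without the lock so other threads are not blocked.
        boolean allocated = false;
        try {
            ByteBuffer buffer = allocator.apply(size); // may throw OutOfMemoryError
            allocated = true;
            return buffer;
        } finally {
            // Step 3 (rare, slower path): re-acquire the lock and undo the reservation.
            if (!allocated) {
                lock.lock();
                try {
                    availableMemory += size;
                    Condition first = waiters.peekFirst();
                    if (first != null)
                        first.signal();
                } finally {
                    lock.unlock();
                }
            }
        }
    }

    boolean heldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }

    long availableMemory() {
        lock.lock();
        try {
            return availableMemory;
        } finally {
            lock.unlock();
        }
    }
}
```

The common path pays for one lock acquisition, and the second acquisition happens only when the allocation fails, which matches the point that the failure path is allowed to be slower.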