This repository was archived by the owner on Jan 24, 2024. It is now read-only.
Fix NPE caused by empty polls for a consumer of multiple partitions #1033
Fixes #1032
Motivation
When a consumer of multiple partitions gets an empty poll, i.e. there is no available message, a DelayedFetch instance is created and added to DelayedOperationPurgatory. The DelayedFetch instance can be triggered either when the timeout (maxWaitMs) expires or by KafkaRequestHandler#notifyPendingFetches when some messages are sent successfully.
The latter behavior was introduced in #973. However, it can cause an NPE because the MessageFetchContext instance held by DelayedFetch might have already been recycled. The reason is that there is no way to notify the purgatory that there is a delayed fetch operation whose messageFetchContext field is now null. For a MessageFetchContext instance, if one partition has available messages, the tryComplete method will eventually be triggered and recycle() will then be called, but the delayed fetch operation won't be removed from the purgatory.
Therefore, onDataWrittenToSomePartition will be called on a recycled MessageFetchContext in DelayedFetch#tryComplete, and an NPE will occur because all fields of MessageFetchContext are now null.
KoP releases 2.8.2.2+ and 2.9.1.2+ are affected by this bug.
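The failure mode described above can be illustrated with a minimal sketch. This is not KoP's actual code: FetchContextModel, DelayedFetchModel, PurgatoryModel and StaleFetchDemo are hypothetical, heavily simplified stand-ins that only model the relationship between a recyclable context and a delayed operation that still references it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for MessageFetchContext: its fields are nulled on recycle().
class FetchContextModel {
    private Map<String, Integer> writesPerPartition = new HashMap<>();

    void onDataWrittenToSomePartition(String partition) {
        // Throws NullPointerException once recycle() has cleared the field.
        writesPerPartition.merge(partition, 1, Integer::sum);
    }

    void recycle() {
        writesPerPartition = null;
    }
}

// Hypothetical stand-in for DelayedFetch: it keeps a reference to the context.
class DelayedFetchModel {
    private final FetchContextModel context;

    DelayedFetchModel(FetchContextModel context) {
        this.context = context;
    }

    void tryComplete(String partition) {
        context.onDataWrittenToSomePartition(partition);
    }
}

// Hypothetical stand-in for the purgatory: it is never told about recycle().
class PurgatoryModel {
    private final List<DelayedFetchModel> watched = new ArrayList<>();

    void watch(DelayedFetchModel op) {
        watched.add(op);
    }

    void notifyPendingFetches(String partition) {
        for (DelayedFetchModel op : watched) {
            op.tryComplete(partition);
        }
    }
}

class StaleFetchDemo {
    // Returns true if notifying a stale delayed fetch raises an NPE.
    static boolean reproduces() {
        PurgatoryModel purgatory = new PurgatoryModel();
        FetchContextModel context = new FetchContextModel();
        purgatory.watch(new DelayedFetchModel(context));

        context.recycle(); // the context is recycled, but the delayed fetch stays watched

        try {
            purgatory.notifyPendingFetches("topic-0");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

In this model the purgatory keeps calling into the delayed fetch after the context has been recycled, which is the same shape as the stale DelayedFetch entry described in the motivation.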
Modifications
We could simply fix this bug by adding a null check in MessageFetchContext#onDataWrittenToSomePartition. Instead, this PR saves the DelayedFetch instances in MessageFetchContext and removes them from DelayedOperationPurgatory in the recycle() method.
Compared with adding a null check in onDataWrittenToSomePartition, this solution also removes the invalid DelayedFetch instances from the purgatory and the associated TimerTask instances from the task list.
This PR adds a test, testEmptyPollWhenProduceAndConsumeConcurrently, which reproduces the empty polls for a consumer of multiple partitions by increasing maxWaitMs to 3 seconds. The test fails without the changes in the kafka-impl module.
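The approach chosen here can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code in this PR: PurgatorySketch, DelayedFetchSketch, FetchContextSketch and RecycleFixDemo are hypothetical names, and the real DelayedOperationPurgatory also manages timers and watch keys that are omitted here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified purgatory that supports explicit removal.
class PurgatorySketch {
    private final Set<DelayedFetchSketch> watched = new HashSet<>();

    void watch(DelayedFetchSketch op) {
        watched.add(op);
    }

    void remove(DelayedFetchSketch op) {
        watched.remove(op);
    }

    int size() {
        return watched.size();
    }

    void notifyPendingFetches(String partition) {
        // Iterate over a copy so completion may modify the watched set.
        for (DelayedFetchSketch op : new ArrayList<>(watched)) {
            op.tryComplete(partition);
        }
    }
}

// Hypothetical stand-in for DelayedFetch.
class DelayedFetchSketch {
    private final FetchContextSketch context;

    DelayedFetchSketch(FetchContextSketch context) {
        this.context = context;
    }

    void tryComplete(String partition) {
        context.onDataWrittenToSomePartition(partition);
    }
}

// Hypothetical stand-in for MessageFetchContext that tracks its delayed fetches.
class FetchContextSketch {
    private final PurgatorySketch purgatory;
    private List<DelayedFetchSketch> delayedFetches = new ArrayList<>();
    private List<String> writtenPartitions = new ArrayList<>();

    FetchContextSketch(PurgatorySketch purgatory) {
        this.purgatory = purgatory;
    }

    DelayedFetchSketch newDelayedFetch() {
        DelayedFetchSketch op = new DelayedFetchSketch(this);
        delayedFetches.add(op); // remember it so recycle() can clean it up
        purgatory.watch(op);
        return op;
    }

    void onDataWrittenToSomePartition(String partition) {
        writtenPartitions.add(partition);
    }

    void recycle() {
        // Remove every delayed fetch created for this context before clearing fields.
        for (DelayedFetchSketch op : delayedFetches) {
            purgatory.remove(op);
        }
        delayedFetches = null;
        writtenPartitions = null;
    }
}

class RecycleFixDemo {
    // Returns how many delayed fetches are still watched after recycle().
    static int watchedAfterRecycle() {
        PurgatorySketch purgatory = new PurgatorySketch();
        FetchContextSketch context = new FetchContextSketch(purgatory);
        context.newDelayedFetch();
        context.recycle();
        purgatory.notifyPendingFetches("topic-0"); // nothing left to trigger
        return purgatory.size();
    }
}
```

Because the context deregisters its own delayed fetches, a later notification never reaches a recycled context, and no stale entries linger in the purgatory until their timeout.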