KAFKA-8483/KAFKA-8484; Ensure safe handling of producerId resets #6883
hachikuji merged 6 commits into apache:trunk
Note that this patch also contains a fix for KAFKA-8484. I can separate them if preferable, but I needed some of the common testing logic.
It's fine to include both fixes; can we update the PR title and description to mention that?
I think it would be great to include this in 2.3, if possible.
guozhangwang left a comment
LGTM overall. Some minor comments.
}
}

public synchronized void handleCompletedBatch(ProducerBatch batch, ProduceResponse.PartitionResponse response) {
Just to clarify: this function is simply moved here, with no logical changes?
If so, could you note that in a comment when creating the PR, for ease of review? :)
Yes, it is moved to simplify the TransactionManager API so that I could write better test cases. Otherwise it was very difficult to hit the cases without effectively rewriting this logic in the test case.
synchronized void adjustSequencesDueToFailedBatch(ProducerBatch batch) {
    if (!topicPartitionBookkeeper.contains(batch.topicPartition))
private void adjustSequencesDueToFailedBatch(ProducerBatch batch) {
    if (!topicPartitionBookkeeper.contains(batch.topicPartition) || !hasProducerIdAndEpoch(batch.producerId(), batch.producerEpoch()))
Why add the second condition?
This was the fix for KAFKA-8484. However, it seems redundant after I moved things around. Now handleFailedBatch verifies the producerId and epoch, so I think we can remove it.
}

public synchronized void handleFailedBatch(ProducerBatch batch, RuntimeException exception, boolean adjustSequenceNumbers) {
    maybeTransitionToErrorState(exception);
Same here: the caller function moved here does not seem to change any logic (the key change is in adjustSequencesDueToFailedBatch), right?
The main fix is the producerId and epoch check below.
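As a rough illustration of the kind of check being discussed, here is a minimal sketch. It is not the code from this patch: `SimpleTransactionManager`, its fields, and the method signatures are hypothetical simplifications, and only the name `hasProducerIdAndEpoch` is borrowed from the diff above. The idea is that a failed batch stamped with an older producerId or epoch must not touch the sequence state of the current producerId.

```java
public class ProducerIdGuardSketch {
    /** Hypothetical stand-in for the producer's current identity. */
    public static final class ProducerIdAndEpoch {
        final long producerId;
        final short epoch;

        public ProducerIdAndEpoch(long producerId, short epoch) {
            this.producerId = producerId;
            this.epoch = epoch;
        }
    }

    /** Hypothetical, heavily simplified bookkeeping: a single sequence counter. */
    public static final class SimpleTransactionManager {
        private ProducerIdAndEpoch current;
        private int nextSequence;

        public SimpleTransactionManager(ProducerIdAndEpoch initial) {
            this.current = initial;
        }

        public synchronized boolean hasProducerIdAndEpoch(long producerId, short epoch) {
            return current.producerId == producerId && current.epoch == epoch;
        }

        public synchronized int nextSequence() {
            return nextSequence;
        }

        /** Records that a batch with the given number of records was sent. */
        public synchronized void advance(int recordCount) {
            nextSequence += recordCount;
        }

        /** Simulates a producerId reset: new identity, sequences start over. */
        public synchronized void resetProducerId(ProducerIdAndEpoch fresh) {
            current = fresh;
            nextSequence = 0;
        }

        /** Returns true only if the sequence state was actually adjusted. */
        public synchronized boolean handleFailedBatch(long batchProducerId, short batchEpoch, int recordCount) {
            if (!hasProducerIdAndEpoch(batchProducerId, batchEpoch)) {
                // Stale batch from before the reset: leave the new state alone.
                return false;
            }
            nextSequence -= recordCount;
            return true;
        }
    }

    public static void main(String[] args) {
        SimpleTransactionManager tm = new SimpleTransactionManager(new ProducerIdAndEpoch(1L, (short) 0));
        tm.advance(3);                                            // batch sent under producerId 1
        tm.resetProducerId(new ProducerIdAndEpoch(2L, (short) 0)); // reset while the batch is inflight
        System.out.println(tm.handleFailedBatch(1L, (short) 0, 3)); // prints false
        System.out.println(tm.nextSequence());                      // prints 0
    }
}
```

Without the guard, the stale failure in `main` would drive the counter to -3, which is the flavor of inconsistent state that the real producer reports as an IllegalStateException.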
int sequence = 0;
for (ProducerBatch inFlightBatch : topicPartitionBookkeeper.getPartition(topicPartition).inflightBatchesBySequence) {
private void startSequencesAtBeginning(TopicPartition topicPartition) {
    final AtomicInteger sequence = new AtomicInteger(0);
The caller function canRetry is synchronized; do we need an atomic integer?
It was needed only because of the lambda. I guess this is the ugly side of Java 8.
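For readers unfamiliar with the constraint mentioned here, a small standalone sketch (not taken from the patch; the class and method names are made up for illustration): a local variable captured by a lambda must be effectively final, so a running counter cannot be a plain `int`. An `AtomicInteger` serves as a mutable holder even when no concurrency is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class LambdaCounterSketch {
    /** Returns the starting sequence assigned to each batch, given the batch sizes in order. */
    public static List<Integer> assignStartingSequences(List<Integer> batchSizes) {
        List<Integer> starts = new ArrayList<>();
        // A plain `int sequence = 0;` followed by `sequence += size` inside the
        // lambda would not compile: captured locals must be effectively final.
        final AtomicInteger sequence = new AtomicInteger(0);
        batchSizes.forEach(size -> starts.add(sequence.getAndAdd(size)));
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(assignStartingSequences(Arrays.asList(3, 2, 4))); // prints [0, 3, 5]
    }
}
```

A one-element `int[]` or a plain for-each loop would work just as well; the atomic type is used here purely as a mutable box.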
TransactionManager transactionManager = new TransactionManager();
transactionManager.setProducerIdAndEpoch(producerIdAndEpoch);

ProducerBatch b1 = writeIdempotentBatchWithValue(transactionManager, tp0, "1");
Reading the iterator() code of PriorityQueue, I think three batches are sufficient to expose the unordered behavior of its iterator(). Is there any reason you want to have 5, or is it just your favorite magic number?
Not really; 5 seemed like a sufficiently interesting number to catch this bug and any future regressions.
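To see the behavior under discussion outside of Kafka, the following sketch uses plain integers as stand-ins for batch sequence numbers (an assumption made for brevity; the real code tracks ProducerBatch objects). It adds the sequences in order, completes the head batch, and then iterates over what remains in a `PriorityQueue` and in a `TreeSet`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

public class IterationOrderSketch {
    /** Sequences visited when iterating a PriorityQueue after the head batch completes. */
    public static List<Integer> priorityQueueIteration(int batches) {
        PriorityQueue<Integer> inflight = new PriorityQueue<>();
        for (int seq = 0; seq < batches; seq++) {
            inflight.add(seq);
        }
        inflight.poll(); // the head batch completes and leaves the queue
        List<Integer> visited = new ArrayList<>();
        for (Integer seq : inflight) { // heap order, not guaranteed to be sorted
            visited.add(seq);
        }
        return visited;
    }

    /** Same scenario with a TreeSet, whose iterator is specified to be ascending. */
    public static List<Integer> sortedSetIteration(int batches) {
        TreeSet<Integer> inflight = new TreeSet<>();
        for (int seq = 0; seq < batches; seq++) {
            inflight.add(seq);
        }
        inflight.pollFirst();
        return new ArrayList<>(inflight);
    }

    public static void main(String[] args) {
        System.out.println("PriorityQueue: " + priorityQueueIteration(5));
        System.out.println("TreeSet:       " + sortedSetIteration(5));
    }
}
```

The `TreeSet` line is always `[1, 2, 3, 4]`. The `PriorityQueue` line depends on the heap layout and may or may not come out sorted, which is exactly why a test needs enough batches to make a misordering likely.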
LGTM, feel free to merge after Jenkins is green.
The idempotent producer attempts to detect spurious UNKNOWN_PRODUCER_ID errors and handles them by reassigning sequence numbers to the inflight batches. The inflight batches are tracked in a PriorityQueue. The problem is that the reassignment of sequence numbers depends on the iteration order of PriorityQueue, which is not guaranteed to follow the queue's priority order, so sequence numbers can be assigned in the wrong order. This patch fixes the problem by using a sorted set instead of a priority queue so that the iteration order preserves the sequence order. Note that resetting sequence numbers is an exceptional case.

This patch also fixes KAFKA-8484, which can cause an IllegalStateException when the producerId is reset while there are pending produce requests inflight. The solution is to ensure that sequence numbers are only reset if the producerId of a failed batch corresponds to the current producerId.

Reviewers: Guozhang Wang <wangguoz@gmail.com>
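The sorted-set approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the `Batch` class, its fields, and `reassignFromZero` are hypothetical, and the real code operates on ProducerBatch objects inside TransactionManager. The point is only that iterating a set ordered by base sequence hands out the new sequence numbers in the original send order.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class ResequenceSketch {
    /** Hypothetical stand-in for an inflight batch. */
    public static final class Batch {
        final String name;
        final int baseSequence;
        final int recordCount;

        public Batch(String name, int baseSequence, int recordCount) {
            this.name = name;
            this.baseSequence = baseSequence;
            this.recordCount = recordCount;
        }
    }

    /** Creates an empty set ordered by each batch's current base sequence. */
    public static SortedSet<Batch> newInflightSet() {
        return new TreeSet<>(Comparator.comparingInt((Batch b) -> b.baseSequence));
    }

    /** Computes new base sequences starting at 0, walking batches in sequence order. */
    public static Map<String, Integer> reassignFromZero(SortedSet<Batch> inflightBySequence) {
        Map<String, Integer> newBaseSequences = new LinkedHashMap<>();
        int sequence = 0;
        for (Batch batch : inflightBySequence) { // ascending by old base sequence
            newBaseSequences.put(batch.name, sequence);
            sequence += batch.recordCount;
        }
        return newBaseSequences;
    }

    public static void main(String[] args) {
        SortedSet<Batch> inflight = newInflightSet();
        // Insertion order is deliberately shuffled; the set restores sequence order.
        inflight.add(new Batch("c", 17, 1));
        inflight.add(new Batch("a", 10, 3));
        inflight.add(new Batch("b", 13, 4));
        System.out.println(reassignFromZero(inflight)); // prints {a=0, b=3, c=7}
    }
}
```

The sketch returns the new assignments instead of mutating the batches in place, which sidesteps the question of changing a field that the set's comparator depends on.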