KAFKA-7786: Ignore OffsetsForLeaderEpoch response if leader epoch changed while request in flight #6101
…nged while request in flight.
It would be good to have a unit/integration test as well.
hachikuji
left a comment
Thanks for the patch. If we can hit the case in a unit test, that is probably sufficient. It may be possible to do something through MockFetcherThread by letting it block in the call to fetchEpochsFromLeader.
| //Check no leadership and no leader epoch changes happened whilst we were unlocked, fetching epochs | ||
| val leaderEpochs = fetchedEpochs.filter { case (tp, _) => | ||
| val curPartitionState = partitionStates.stateValue(tp) | ||
| val leaderEpochInRequest = epochRequests.get(tp).get.currentLeaderEpoch.get |
Perhaps no harm being a little more defensive here. At least perhaps we can ensure tp is contained in epochRequests?
But we still throw an exception, right? I guess it's better to throw IllegalStateException with a descriptive message vs. NPE.
Or we can log a warning and ignore that partition. Well, perhaps an exception with a nice message is preferable since this would suggest the remote broker gave us data for a partition that we didn't request.
Yeah, generally Option.get should be avoided in favour of a more descriptive error.
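To make the suggestion in this thread concrete, here is a minimal sketch of replacing `Option.get` with a pattern match that fails with a descriptive `IllegalStateException`. `TopicPartition` and `EpochRequest` are hypothetical stand-ins rather than the Kafka classes, and `currentLeaderEpoch` is assumed to be a Scala `Option` purely for illustration.

```scala
// Hypothetical stand-ins for illustration; these are not the Kafka classes.
final case class TopicPartition(topic: String, partition: Int)
final case class EpochRequest(currentLeaderEpoch: Option[Int])

object DefensiveLookup {
  // Returns the leader epoch that was sent in the request for `tp`.
  // Fails with a message naming the partition instead of a bare Option.get.
  def leaderEpochInRequest(epochRequests: Map[TopicPartition, EpochRequest],
                           tp: TopicPartition): Int =
    epochRequests.get(tp) match {
      case Some(request) =>
        request.currentLeaderEpoch.getOrElse(
          throw new IllegalStateException(
            s"Request for partition $tp carried no current leader epoch"))
      case None =>
        // The remote broker returned data for a partition we never asked about.
        throw new IllegalStateException(
          s"Leader epoch response contained partition $tp that was not requested")
    }
}
```

Throwing here, rather than logging and skipping the partition, matches the reasoning above: a response for an unrequested partition suggests a bug worth surfacing.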
hachikuji
left a comment
Thanks, the fix looks good. Just a couple small comments.
| import kafka.log.LogAppendInfo | ||
| import kafka.message.NoCompressionCodec | ||
| import kafka.server.AbstractFetcherThread.ResultWithPartitions | ||
| import kafka.server.PartitionFetchState |
| val curPartitionState = partitionStates.stateValue(tp) | ||
| val leaderEpochInRequest = epochRequests.get(tp) match { | ||
| case Some(request) => request.currentLeaderEpoch.get | ||
| case _ => |
nit: case None since there are no other alternatives
| val leaderLog = Seq( | ||
| mkBatch(baseOffset = 0, leaderEpoch = 0, new SimpleRecord("c".getBytes))) | ||
| val leaderState = MockFetcherThread.PartitionState(leaderLog, leaderEpoch = 1, highWatermark = 0L) |
In this case, the leader has updated the epoch before the follower and sends back the fenced error. It is also possible that the leader is still on the old epoch and returns a valid response (which we should also ignore). Is it worthwhile having a separate test for that case?
Good point, I agree it's worthwhile having a separate test like that
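The two leader behaviours discussed in this thread can be sketched with a small model of the leader-side epoch check. Everything here is a simplified assumption for illustration: `respond`, `EpochResponse`, and its cases are hypothetical names, and a real response returns the end offset for a separately requested epoch rather than the log end offset used below.

```scala
// Hypothetical, simplified model of how a leader answers an epoch request.
sealed trait EpochResponse
case object FencedLeaderEpoch extends EpochResponse
case object UnknownLeaderEpoch extends EpochResponse
final case class EndOffset(leaderEpoch: Int, endOffset: Long) extends EpochResponse

object LeaderSide {
  // Compares the epoch carried by the request with the leader's own epoch.
  def respond(requestEpoch: Int, leaderEpoch: Int, logEndOffset: Long): EpochResponse =
    if (requestEpoch < leaderEpoch) FencedLeaderEpoch        // requester is behind the leader
    else if (requestEpoch > leaderEpoch) UnknownLeaderEpoch  // leader is behind the requester
    else EndOffset(leaderEpoch, logEndOffset)                // epochs agree: valid answer
}

object Scenarios {
  // The follower sent its request at epoch 0 and then moved to epoch 1.
  val leaderAlreadyUpdated: EpochResponse = LeaderSide.respond(0, 1, 1L)
  val leaderStillOnOldEpoch: EpochResponse = LeaderSide.respond(0, 0, 1L)
}
```

In both scenarios the follower has already moved to a newer epoch than the one in its request, so both responses are stale from its point of view: the fenced error from the first and the valid-looking end offset from the second.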
… changed on follower
hachikuji
left a comment
LGTM. Thanks for the patch!
| val mockNetwork = new ReplicaFetcherMockBlockingSend(offsetsReply, brokerEndPoint, new SystemTime()) | ||
| val thread = new ReplicaFetcherThread("bob", 0, brokerEndPoint, configs(0), replicaManager, new Metrics(), new SystemTime(), quota, Some(mockNetwork)) | ||
| thread.addPartitions(Map(t1p0 -> offsetAndEpoch(0L), t2p1 -> offsetAndEpoch(0L))) | ||
| thread.addPartitions(Map(t1p0 -> offsetAndEpoch(0L), t1p1 -> offsetAndEpoch(0L))) |
@hachikuji thanks for the review. PR builder
…ged during leader epoch request
@hachikuji I fixed the ReplicaManagerTest. A bug in the test triggered exactly the race condition this PR fixes (unintentionally, hence the failure), and the test itself had a race condition, which is why it did not fail 100% of the time. The most recent build failures are unrelated to this PR: JDK 11:
…ile request in flight (#6101)
There is a race condition in ReplicaFetcherThread, where we can update PartitionFetchState with the new leader epoch (same leader) before handling the OffsetsForLeaderEpoch response with FENCED_LEADER_EPOCH error which causes removing partition from partitionStates, which in turn causes no fetching until the next LeaderAndIsr.
This patch adds logic to ensure that the leader epoch doesn't change while an OffsetsForLeaderEpoch request is in flight (which could happen with back-to-back leader elections). If it has changed, we ignore the response.
Also added toString() implementation to PartitionData, because some log messages did not show useful info which I found while investigating the above system test failure.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
There is a race condition in ReplicaFetcherThread: we can update PartitionFetchState with the new leader epoch (same leader) before handling an OffsetsForLeaderEpoch response that carries a FENCED_LEADER_EPOCH error. Handling that error removes the partition from partitionStates, which in turn stops fetching for that partition until the next LeaderAndIsr request.
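As a rough illustration of the guard against this race, the sketch below drops any epoch response whose partition is no longer tracked, or whose current leader epoch differs from the epoch that was sent in the request. It is a simplified model, not the patch itself: `TopicPartition`, `FetchState`, and `stillCurrent` are hypothetical names, and plain Scala maps stand in for the fetcher's internal state.

```scala
// Hypothetical stand-ins for illustration; these are not the Kafka classes.
final case class TopicPartition(topic: String, partition: Int)
final case class FetchState(currentLeaderEpoch: Int)

object EpochResponseFilter {
  // Keeps only responses for partitions that are still tracked at the same
  // leader epoch as when the request was sent; anything else is stale.
  def stillCurrent[R](fetched: Map[TopicPartition, R],
                      requestEpochs: Map[TopicPartition, Int],
                      currentStates: Map[TopicPartition, FetchState]): Map[TopicPartition, R] =
    fetched.filter { case (tp, _) =>
      val requestEpoch = requestEpochs.getOrElse(tp,
        throw new IllegalStateException(
          s"Leader epoch response contained partition $tp that was not requested"))
      // A removed partition or a changed epoch both mean "ignore this response".
      currentStates.get(tp).exists(_.currentLeaderEpoch == requestEpoch)
    }
}
```

Because stale responses are filtered out before any error handling runs, a FENCED_LEADER_EPOCH error for an outdated request no longer removes the partition in this model; the partition is simply retried at its current epoch.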
Our system test kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True failed 3 times due to this error in the last couple of months. Since this test already exercises this condition, I am not adding any more tests.
Also added a toString() implementation to PartitionData, because some log messages did not show useful info, which I noticed while investigating the above system test failure.
cc @hachikuji who suggested the fix.