KAFKA-7786: Ignore OffsetsForLeaderEpoch response if leader epoch changed while request in flight by apovzner · Pull Request #6101 · apache/kafka

apovzner · 2019-01-07T22:00:41Z

There is a race condition in ReplicaFetcherThread: PartitionFetchState can be updated with a new leader epoch (same leader) before the OffsetsForLeaderEpoch response is handled. If that response carries a FENCED_LEADER_EPOCH error, the partition is removed from partitionStates, and no fetching happens for it until the next LeaderAndIsr request.
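The approach named in the PR title (ignore a response whose leader epoch changed while the request was in flight) can be pictured with a minimal sketch. The types below are simplified stand-ins rather than Kafka's actual classes, and `EpochResponseFilter.stillCurrent` is a hypothetical name used only for illustration.

```scala
// Simplified stand-ins for Kafka's internal types; every name here is hypothetical.
final case class TopicPartition(topic: String, partition: Int)
final case class FetchState(currentLeaderEpoch: Int)
final case class EpochRequest(currentLeaderEpoch: Int)
final case class EpochResponse(error: Option[String], endOffset: Long)

object EpochResponseFilter {
  /** Keeps only the responses whose request epoch still matches the follower's current epoch. */
  def stillCurrent(
      requests: Map[TopicPartition, EpochRequest],
      responses: Map[TopicPartition, EpochResponse],
      currentStates: Map[TopicPartition, FetchState]
  ): Map[TopicPartition, EpochResponse] =
    responses.filter { case (tp, _) =>
      val sameEpoch = for {
        request <- requests.get(tp)      // epoch the follower sent in the request
        state   <- currentStates.get(tp) // epoch the follower holds now
      } yield request.currentLeaderEpoch == state.currentLeaderEpoch
      // Assumption for brevity: a partition missing from either map is dropped silently.
      sameEpoch.getOrElse(false)
    }
}
```

With this shape, a FENCED_LEADER_EPOCH response for a partition whose epoch has since moved on never reaches the error handling that would remove the partition.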

Our system test kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.reassign_from_offset_zero=True failed 3 times due to this error in the last couple of months. Because this test already exercises the condition, I am not adding any more tests.

Also added a toString() implementation to PartitionData, because some log messages did not show useful information. I noticed this while investigating the above system test failure.

cc @hachikuji who suggested the fix.


ijuma · 2019-01-07T22:16:15Z

It would be good to have a unit/integration test as well.

hachikuji

Thanks for the patch. If we can hit the case in a unit test, that is probably sufficient. It may be possible to do something through MockFetcherThread by letting it block in the call to fetchEpochsFromLeader.

hachikuji · 2019-01-07T23:15:48Z

+        //Check no leadership and no leader epoch changes happened whilst we were unlocked, fetching epochs
+        val leaderEpochs = fetchedEpochs.filter { case (tp, _) =>
+          val curPartitionState = partitionStates.stateValue(tp)
+          val leaderEpochInRequest = epochRequests.get(tp).get.currentLeaderEpoch.get


Perhaps no harm being a little more defensive here. At least perhaps we can ensure tp is contained in epochRequests?

But we still throw an exception, right? I guess it's better to throw IllegalStateException with a descriptive message vs. NPE.

Or we can log a warning and ignore that partition. Well, perhaps an exception with a nice message is preferable since this would suggest the remote broker gave us data for a partition that we didn't request.

Yeah, generally Option.get should be avoided in favour of a more descriptive error.
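A minimal sketch of the defensive lookup discussed in this thread, assuming a simplified request map keyed by a partition name with an optional leader epoch. `EpochLookup.leaderEpochInRequest` and the message texts are hypothetical and are not the code that was merged.

```scala
object EpochLookup {
  /** Returns the leader epoch sent for `partition`, or fails with a message that names the problem. */
  def leaderEpochInRequest(requests: Map[String, Option[Int]], partition: String): Int =
    requests.get(partition) match {
      case Some(Some(epoch)) => epoch
      case Some(None) =>
        // The partition was requested, but the request carried no leader epoch.
        throw new IllegalStateException(
          s"Request for partition $partition did not carry a leader epoch")
      case None =>
        // The remote broker returned data for a partition that was never requested.
        throw new IllegalStateException(
          s"Received a response for partition $partition, which was not requested. " +
            s"Requested partitions: ${requests.keySet.mkString(", ")}")
    }
}
```

Compared with a bare `Option.get`, both failure modes surface as an IllegalStateException whose message says which partition was involved and why the lookup failed.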

hachikuji

Thanks, the fix looks good. Just a couple small comments.

hachikuji · 2019-01-08T18:12:28Z

 import kafka.log.LogAppendInfo
 import kafka.message.NoCompressionCodec
 import kafka.server.AbstractFetcherThread.ResultWithPartitions
+import kafka.server.PartitionFetchState


nit: unneeded import

hachikuji · 2019-01-08T18:15:51Z

+          val curPartitionState = partitionStates.stateValue(tp)
+          val leaderEpochInRequest = epochRequests.get(tp) match {
+            case Some(request) => request.currentLeaderEpoch.get
+            case _ =>


nit: case None since there are no other alternatives

hachikuji · 2019-01-08T18:27:58Z

+
+    val leaderLog = Seq(
+      mkBatch(baseOffset = 0, leaderEpoch = 0, new SimpleRecord("c".getBytes)))
+    val leaderState = MockFetcherThread.PartitionState(leaderLog, leaderEpoch = 1, highWatermark = 0L)


In this case, the leader has updated the epoch before the follower and sends back the fenced error. It is also possible that the leader is still on the old epoch and returns a valid response (which we should also ignore). Is it worthwhile having a separate test for that case?

Good point, I agree it's worthwhile having a separate test like that
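The two cases raised here (a fenced error from a leader already on the new epoch, and a valid response from a leader still on the old epoch) can be laid out as a small decision function. This is only a sketch: the `Action` names and `EpochResponseHandling.decide` are hypothetical, and the follow-up behaviour is reduced to labels.

```scala
sealed trait Action
case object IgnoreResponse extends Action  // follower's epoch moved on while the request was in flight
case object RemovePartition extends Action // fenced, and the follower's epoch has not changed
case object ApplyResponse extends Action   // valid response for the epoch still in use

object EpochResponseHandling {
  /** Decides what to do with one partition's OffsetsForLeaderEpoch response. */
  def decide(requestEpoch: Int, currentEpoch: Int, fenced: Boolean): Action =
    if (requestEpoch != currentEpoch) IgnoreResponse // covers both the fenced and the valid case
    else if (fenced) RemovePartition
    else ApplyResponse
}
```

Checking the epoch before looking at the error code is what makes both cases collapse into the same outcome, which is why each of them is worth its own unit test.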


hachikuji

LGTM. Thanks for the patch!

hachikuji · 2019-01-08T20:49:04Z

    val mockNetwork = new ReplicaFetcherMockBlockingSend(offsetsReply, brokerEndPoint, new SystemTime())
    val thread = new ReplicaFetcherThread("bob", 0, brokerEndPoint, configs(0), replicaManager, new Metrics(), new SystemTime(), quota, Some(mockNetwork))
-    thread.addPartitions(Map(t1p0 -> offsetAndEpoch(0L), t2p1 -> offsetAndEpoch(0L)))
+    thread.addPartitions(Map(t1p0 -> offsetAndEpoch(0L), t1p1 -> offsetAndEpoch(0L)))


Nice catch.

apovzner · 2019-01-08T21:27:14Z

@hachikuji thanks for the review. The JDK 11 PR builder failed consistently on each of 3 PR builds due to ReplicaManagerTest.testBecomeFollowerWhenLeaderIsUnchangedButMissedLeaderUpdate, so the failure is most likely related to this PR. It looks like an issue with how we do mocking (still investigating, since it does not fail locally for me). I will add a commit to fix this once I find the issue.


apovzner · 2019-01-09T03:04:59Z

@hachikuji I fixed the ReplicaManagerTest. A bug in the test caused exactly the race condition this PR fixes (unintentionally, hence the failure), and the test itself had a race condition, which is why it did not fail 100% of the time.

Most recent build failures are unrelated to this PR:
JDK 8:
kafka.api.AdminClientIntegrationTest.testForceClose
org.junit.runners.model.TestTimedOutException: test timed out after 120000 milliseconds

JDK 11:

kafka.api.ConsumerBounceTest.testCloseDuringRebalance
org.apache.kafka.common.KafkaException: Socket server failed to bind to localhost:43476: Address already in use.
org.apache.kafka.streams.KafkaStreamsTest.statefulTopologyShouldCreateStateDirectory
java.io.UncheckedIOException: java.nio.file.NoSuchFileException: /tmp/kafka-9Bd3Q/appId/1_0/rocksdb/statefulTopologyShouldCreateStateDirectory-counts/MANIFEST-000001

hachikuji

LGTM

…ile request in flight (#6101)

There is a race condition in ReplicaFetcherThread, where we can update PartitionFetchState with the new leader epoch (same leader) before handling the OffsetsForLeaderEpoch response with FENCED_LEADER_EPOCH error which causes removing partition from partitionStates, which in turn causes no fetching until the next LeaderAndIsr.

This patch adds logic to ensure that the leader epoch doesn't change while an OffsetsForLeaderEpoch request is in flight (which could happen with back-to-back leader elections). If it has changed, we ignore the response.

Also added toString() implementation to PartitionData, because some log messages did not show useful info which I found while investigating the above system test failure.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>


KAFKA-7786: Ignore OffsetsForLeaderEpoch response if leader epoch changed while request in flight.

fe4ad21

hachikuji reviewed Jan 7, 2019


Throw nicer exception and added unit tests

7746809

hachikuji reviewed Jan 8, 2019


apovzner added 2 commits January 8, 2019 10:33

using getOrElse

7706833

Added unit test for ignoring successful epoch fetch response if epoch changed on follower

a9f6d1a

hachikuji approved these changes Jan 8, 2019


Fix ReplicaManagerTest exposed by check that leader epoch hasn't changed during leader epoch request

19f1ac3

hachikuji approved these changes Jan 9, 2019


hachikuji merged commit b2b79c4 into apache:trunk Jan 9, 2019

apovzner mentioned this pull request Jan 11, 2019

KAFKA-7040: Ignore OffsetsForLeaderEpoch response if leader epoch changed while request in flight #6122

Closed

