
KAFKA-6361: Fix log divergence between leader and follower after fast leader fail over#4882

Merged
junrao merged 24 commits into apache:trunk from apovzner:kafka-6361
May 10, 2018

Conversation

@apovzner
Contributor

@apovzner apovzner commented Apr 16, 2018

Implementation of KIP-279 as described here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-279%3A+Fix+log+divergence+between+leader+and+follower+after+fast+leader+fail+over

In summary:

  • Added leader_epoch to OFFSET_FOR_LEADER_EPOCH_RESPONSE
  • The leader replies with the pair (largest epoch less than or equal to the requested epoch, the end offset of that epoch)
  • If the follower does not know about the leader epoch that the leader replied with, it truncates to the end offset of its largest leader epoch less than the epoch the leader replied with, and sends another OffsetForLeaderEpoch request containing that epoch.
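The handshake described in these bullets can be sketched as follows (a simplified Python model, not the actual Scala implementation; epoch caches are modeled as sorted (epoch, startOffset) lists, and edge cases such as empty caches or v0 responses are omitted):

```python
UNDEFINED_EPOCH = -1

def end_offset_for(epoch_cache, log_end_offset, requested_epoch):
    """Return (largest epoch <= requested_epoch, that epoch's end offset).

    epoch_cache is a sorted list of (epoch, start_offset); an epoch's end
    offset is the start offset of the next cached epoch, or the log end
    offset for the latest epoch.
    """
    for i in reversed(range(len(epoch_cache))):
        epoch, _ = epoch_cache[i]
        if epoch <= requested_epoch:
            end = epoch_cache[i + 1][1] if i + 1 < len(epoch_cache) else log_end_offset
            return epoch, end
    return UNDEFINED_EPOCH, -1

def find_truncation_offset(leader_cache, leader_leo, follower_cache, follower_leo):
    """Iterate the OffsetsForLeaderEpoch exchange until both sides agree on
    an epoch. Both caches are assumed to share their oldest epoch, so the
    loop always converges."""
    requested = follower_cache[-1][0]  # start from the follower's latest epoch
    while True:
        # leader's answer to OffsetsForLeaderEpoch(requested)
        l_epoch, l_end = end_offset_for(leader_cache, leader_leo, requested)
        # follower looks up its own end offset for the epoch the leader returned
        f_epoch, f_end = end_offset_for(follower_cache, follower_leo, l_epoch)
        if f_epoch == l_epoch:
            return min(l_end, f_end, follower_leo)
        requested = f_epoch  # follower doesn't know l_epoch: retry with a smaller epoch
```

For example, a leader with epochs [(0, 0), (2, 10)] and a follower that instead recorded epoch 1 (or epoch 3) starting at offset 10 converge on truncation offset 10, the point where the logs diverged.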

Added an integration test, EpochDrivenReplicationProtocolAcceptanceTest.logsShouldNotDivergeOnUncleanLeaderElections, that performs three fast leader changes with unclean leader election enabled and min ISR set to 1. The test failed before the fix was implemented.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@apovzner
Contributor Author

apovzner commented Apr 26, 2018

@lindong28 and @junrao Regarding the truncation logic for the future replica (ReplicaAlterLogDirsThread): I thought about this a bit more, and I think we don't need the same truncation logic as in the replica fetcher. We can implement much simpler logic based on the truncation offset of the local replica. Here are the main points:

  1. Currently (and in the original implementation), if the replica becomes a follower, it does a truncation, forces the truncation state on the future replica, and provides it with its truncation offset. The original implementation and this PR currently do not use that offset except when the future or current replica does not have any epochs recorded.
  2. I think it is safe for the future replica to truncate and fetch from the offset that the local replica truncated to, i.e., the min of that offset and the future replica's log end offset. The reason is that the future replica only ever fetches from a single source, the local replica. So, the only way the future replica's and local replica's logs may diverge is if the local replica truncates and fetches different offsets from the new leader. In that case, we already force truncation of the future replica and provide the new truncation offset of the local replica.
  3. One possible argument against this: what if we "lose" the future replica's epoch sequence file or entries from it? The leader epoch truncation logic may help if we lose some recent entries, because it will truncate to the end offset of the last known epoch. However, it does not guarantee proper recovery in many cases. Plus, I understand that the future replica will do the same recovery logic as a normal replica if, say, something gets corrupted or we lose the leader epoch file, e.g., someone deletes it (it gets rebuilt from the log).

Based on the above, unless I missed something or my assumptions are wrong, I think we should truncate future replica to the truncation offset of the local replica. This will result in a simpler code/logic which is easier to reason about and debug in the future.

The PR is also ready to review.

@lindong28
Member

@apovzner Thanks for the patch! I will finish the first round of review this week.

@lindong28 lindong28 self-requested a review April 26, 2018 08:22
Member

@lindong28 lindong28 left a comment

Thanks for the patch! LGTM. I only have some minor comments regarding the Java doc and the consistency between ReplicaFetcherThread and ReplicaAlterDirThread.

Member

nits: leader replied with an offset $offsetToTruncateTo not (logEndOffset)

Contributor Author

Oh, that's a good catch, and it also made me realize the log message is not completely correct for all cases; will fix.

Member

Since we are changing this Java doc, can we make it a bit more consistent with the actual implementation? For example, "If the leader replied with undefined epoch" can probably be the first case.

Member

nits: This is not related to this patch. Can we add one comment saying that the initial offset in this case will be the high watermark, so that it is more consistent with the Java doc?

Contributor Author

Added comment above the logging

Member

nits: Would it be a bit more accurate to replace <= with <?

Member

nits: It probably does not affect the correctness of the code. But I am wondering if we can make ReplicaFetcherThread.maybeTruncate() a bit more consistent with the ReplicaAlterDirThread.maybeTruncate(). Currently ReplicaFetcherThread.maybeTruncate() will specifically handle the scenario that epochOffset.leaderEpoch == UNDEFINED_EPOCH whereas ReplicaAlterDirThread.maybeTruncate() handles the scenario through futureEndOffset == UNDEFINED_EPOCH_OFFSET.

Also, should we consider taking the min with the initial offset here, just like this patch does in ReplicaAlterDirThread.maybeTruncate()? I am wondering whether we can make them more consistent, or whether there is a reason that the initial offset is needed in only one of them.

Contributor Author

About the question in the second paragraph: it would be incorrect to take the min with the initial offset (high watermark) here, because that falls back to the pre-KIP-101 implementation and we can actually lose a committed message (see scenario 1 in KIP-101). This particular case can happen if the leader is on a protocol version prior to this KIP but post-KIP-101, so it replies with a valid offset but an invalid leader epoch. In this case, we want the KIP-101 behavior of truncating to the leader's offset, rather than falling back to the pre-KIP-101 implementation.

Regarding the bigger question of making ReplicaAlterLogDirsThread more consistent with ReplicaFetcherThread, I wanted to discuss the possibility of ReplicaAlterLogDirsThread using only the initial offset (which is the truncation offset of the main replica) for truncation, instead of following the offset-for-leader-epoch logic. I left a comment earlier in this PR and wanted to get your opinion.
-- The initial offset in ReplicaAlterLogDirsThread is the main replica's truncation offset (if the main replica is a follower) or the main replica's high watermark (if the main replica is the leader), which differs from the initial offset in ReplicaFetcherThread, which is the high watermark.
-- There is no way the future replica needs to truncate further back than the initial offset, because it is always a follower of the main replica; and if the main replica truncated and re-fetched offsets from the leader, causing temporary log divergence with the future replica, we already force truncation on the future replica, setting the future replica's initial offset to the main replica's truncation offset.

Otherwise, I will change ReplicaAlterDirThread.maybeTruncate() to be more consistent with ReplicaFetcherThread.maybeTruncate() in the case mentioned above. The current implementation is still correct, because falling back to the initial offset is safe for the future replica (unlike for a follower replica), since the future replica is always a follower. The whole reason for KIP-101 and KIP-279 is to deal with replicas changing their leader/follower status, while the future replica is always a follower.

Member

@apovzner Thanks for the explanation. The case where the leader is on a protocol version prior to this KIP but post-KIP-101 AND this patch is used on some broker can only happen while the Kafka cluster is being upgraded to use this patch. The time window of this state is very small, and maybe we do not need to take care of this scenario.

Regarding the possibility of ReplicaAlterLogDirThread using only initial offset, Jun has provided a very good example. Basically this approach is not reliable if the future replica is offline when leader replica is truncated.

Contributor Author

Actually, to be more precise, both leader and follower could be on the pre-this-KIP protocol version if the user upgrades the brokers but does not bump the protocol version. So I think we want the post-KIP-101 behavior, which is what's implemented, vs. going back to pre-KIP-101.

OK, offline future replica is a good example I did not consider. I agree we should use the same algorithm in ReplicaAlterLogDirThread. I will make it more consistent.

Contributor

@junrao junrao left a comment

@apovzner : Thanks for the patch. Left a few comments below.

Contributor

Since this affects inter broker protocol, we need to (1) document this api change for "2.0-IV0" in ApiVersion.scala, (2) update the upgrade section in the doc, (3) only use the new protocol if the inter broker protocol is 2.0-IV0 or above.

Contributor Author

Done all three.

Contributor Author

Regarding (3), the fetcher falls back to the KIP-101 logic if the inter-broker protocol version is < KAFKA_2_0_IV0 (it ignores the leader epoch returned in the response and uses the end offset).
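As an illustrative sketch of that fallback decision (hypothetical Python names, not the actual Kafka code; the inter-broker protocol version is modeled as a tuple):

```python
UNDEFINED_EPOCH = -1
KAFKA_2_0_IV0 = (2, 0)

def truncation_offset(inter_broker_protocol, leader_epoch, leader_end_offset,
                      follower_log_end_offset):
    """With a pre-2.0 IBP, or a v0 response (no epoch field), ignore the
    leader epoch and truncate to min(leader's end offset, follower's LEO),
    i.e. the KIP-101 behavior."""
    if inter_broker_protocol < KAFKA_2_0_IV0 or leader_epoch == UNDEFINED_EPOCH:
        return min(leader_end_offset, follower_log_end_offset)
    # Otherwise the full KIP-279 epoch-based truncation applies (omitted here).
    raise NotImplementedError("KIP-279 path not shown in this sketch")
```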

Contributor

watermark => high watermark

Contributor

To be more precise, 1) should be "the leader is still using message format older than KAFKA_0_11_0_IV2".

Contributor

KAFKA_1_1_IV0 should be KAFKA_2_0_IV0

Contributor

I think in this case, it's probably better to fall back to high watermark. That way, if the leader epoch logic doesn't apply, we always consistently fall back to the old method.

Contributor Author

My concern about falling back to the high watermark in this particular case is that post-KIP-101 (and pre-2.0) code behaves exactly as described: since the leader does not send a leader epoch, we don't check it and use the leader's offset to truncate. The same applies if brokers upgrade to 2.0 but do not bump the protocol version; then, once we upgrade to the 2.0 protocol version, we would be back to the high watermark in this case.

Contributor

The above logic may still be needed in the following sequence: (1) future replica copies data above HW from current replica; (2) future replica goes offline (e.g. disk failure); (3) current replica truncates data above HW and re-replicates new data from the leader at the truncated offsets. To avoid duplicated code, perhaps we can share the code between ReplicaFetcherThread and here?

Contributor Author

That's a good example I did not consider. In that case, I agree, we need the leader epoch logic. I will try to move this code out into a common method that both fetchers re-use.

Contributor

typo "the the". Also, we are now returning the leader epoch and the end offset.

Contributor

Would it be simpler to just initialize updatedOffsetsOpt to offsets and make it a non-Option?

Contributor

This is probably because the broker port changes on restart?

Contributor

Change -1 to UNDEFINED_EPOCH_OFFSET?

@apovzner
Contributor Author

apovzner commented May 1, 2018

@junrao and @lindong28 Thanks a lot for your comments. I addressed all of them.

Based on the use case of the future replica being offline and missing the "mark for truncation" event, I agree that ReplicaAlterLogDirsThread should use the same leader epoch logic for truncation as in ReplicaFetcherThread. I moved the common logic that finds the truncation offset to AbstractFetcherThread.getOffsetTruncationState, so now the truncation logic in both fetchers is consistent.

Contributor

@junrao junrao left a comment

@apovzner : Thanks for the updated patch. A few more comments below. A couple of other things.

  1. Could you also run the system tests?
  2. This is not an issue directly related to this patch. But I noticed that in Log.truncateTo(), if the truncation point is in the middle of a message set, we will actually be truncating to the first offset of the message set. In that case, the replica fetcher thread should adjust the fetch offset to the actual truncated offset. Typically, the truncation point should never be in the middle of a message set. However, this could potentially happen during message format upgrade. We can tighten this up in a separate jira.

Contributor

Hmm, in ReplicaManager.alterReplicaLogDirs(), the initial offset for the future replica is also set to its HW. We update the future replica's HW in ReplicaAlterLogDirsThread.processPartitionData(). We probably want to bound it by the future replica's log end offset.

Contributor Author

I fixed the comment to say it is either high watermark for future replica, or current replica's truncation offset.
Also, changed ReplicaAlterLogDirsThread.processPartitionData() to bound future replica's high watermark to its log end offset.

Contributor

Do we need to test useLeaderEpochInResponse? The only case we want to cover here is that the follower uses version 0 of OffsetForLeaderEpoch.

Contributor Author

I am testing it in ReplicaFetcherThreadTest.shouldUseLeaderEndOffsetIfInterBrokerVersionBelow20

Contributor

not tracking offsets => not tracking leader epochs ?

Contributor

This is a normal behavior. So the logging probably should be info.

Contributor

Now that we can truncate in more than 1 step, it's probably useful to always bound the truncation point by the replica's log end offset.

Contributor

We want to mention that we are returning both the epoch and the offset.

Contributor

This exists in line 23 already.

Contributor Author

removed dup from line 23

Contributor

This test should actually fail right now since the version of the leaderEpochRequest is always version 1. So, we probably want to check the latestAllowedVersion() in the builder in ReplicaFetcherMockBlockingSend.

Contributor Author

Right, this test ended up testing a local broker on 0.11 and a remote broker on the latest version, which actually does not fail because we don't check the leader epoch in leaderEpochResponse, and the truncation is done using the KIP-101 approach (which is what this test verifies). I will update the test to use an undefined leader epoch in the response to simulate the other broker also being on the older protocol version.

Contributor

replicaLeaderEpoch and leaderEpochOffset may be confusing. How about followerEpoch and leaderEpochOffset?

Contributor

KAFKA_0_11_0_IV2 => KAFKA_0_11_0

Member

@lindong28 lindong28 left a comment

Thanks for the update! Left a few comments

private static final Schema OFFSET_FOR_LEADER_EPOCH_REQUEST_V0 = new Schema(
new Field(TOPICS_KEY_NAME, new ArrayOf(OFFSET_FOR_LEADER_EPOCH_REQUEST_TOPIC_V0), "An array of topics to get epochs for"));

/* v2 request is the same as v1. Per-partition leader epoch has been added to response */
Member

typo. Probably should be v1 instead of v2.

// OFFSET_FOR_LEADER_EPOCH_RESPONSE_PARTITION_V1 added a per-partition leader epoch field,
// which specifies which leader epoch the end offset belongs to
private static final Schema OFFSET_FOR_LEADER_EPOCH_RESPONSE_PARTITION_V1 = new Schema(
ERROR_CODE,
Member

nits: can we make the indentation the same as the existing indentation in this file?

// and KafkaStorageException for fetch requests.
"1.1-IV0" -> KAFKA_1_1_IV0,
"1.1" -> KAFKA_1_1_IV0,
// Introduced OffsetsForLeaderEpochRequest/OffsetsForLeaderEpochResponse V1 via KIP-279
Member

nits: to be more consistent with the existing comment, we can just say Introduced OffsetsForLeaderEpochRequest V1 via KIP-279

/**
* @param leaderEpoch Requested leader epoch
* @return The last offset of messages published under this leader epoch.
* @return The requested leader epoch and the last offset of messages published under this
Member

It looks like the existing Java doc (prior to this patch) of this method is not correct. According to Java doc of LeaderEpochFileCache.endOffsetFor(...), it says The End Offset is the end offset of this epoch, which is defined as the start offset of the first Leader Epoch larger than the Leader Epoch requested, or else the Log End Offset if the latest epoch was requested.

Contributor Author

Yes, I think the prior description is more of a shortcut, which is actually not correct. I just realized that we should use "end offset" instead of "the last offset of messages published" here -- the description in LeaderEpochFileCache is more precise. I will update this comment accordingly.
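To illustrate the distinction under discussion (a hedged toy example, not the LeaderEpochFileCache implementation; the cache is modeled as sorted (epoch, start_offset) pairs):

```python
# Hypothetical epoch cache: sorted (epoch, start_offset) pairs.
cache = [(0, 0), (2, 10)]
log_end_offset = 20

def end_offset_for(cache, leo, requested_epoch):
    """Start offset of the first epoch larger than the requested one,
    or the log end offset if no larger epoch exists."""
    for epoch, start in cache:
        if epoch > requested_epoch:
            return start
    return leo

# Epoch 0's end offset is 10 (the start of epoch 2), i.e. one past the
# last offset (9) actually written under epoch 0 -- an "end offset",
# not "the last offset of messages published".
print(end_offset_for(cache, log_end_offset, 0))  # 10
print(end_offset_for(cache, log_end_offset, 2))  # 20
```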

val followerName = if (isFutureReplica) "future replica" else "follower"

// Called when 'offsetToTruncateTo' is the final offset to truncate to.
def finalFetchLeaderEpochOffset(offsetToTruncateTo: Long, offsetFromLeader: Long): OffsetTruncationState = {
Member

nits: now that we don't have any logging in finalFetchLeaderEpochOffset(), we can probably remove this method and replace its usage with one line. For example, finalFetchLeaderEpochOffset(leaderEpochOffset.endOffset, leaderEpochOffset.endOffset) is equivalent to OffsetTruncationState(math.min(offsetToTruncateTo, replica.logEndOffset.messageOffset), truncationCompleted = true)

isInterruptible = false,
includeLogTruncation = true) {
includeLogTruncation = true,
useLeaderEpochInResponse = brokerConfig.interBrokerProtocolVersion >= KAFKA_2_0_IV0) {
Member

In ReplicaFetcherThread.fetchEpochsFromLeader(), the version of OffsetsForLeaderEpochRequest should probably be determined based on the interBrokerProtocolVersion. We can use OffsetsForLeaderEpochRequest V1 only if the interBrokerProtocolVersion >= KAFKA_2_0_IV0. Otherwise, when we rolling-bounce the cluster to upgrade the code, the leader may still be running the old code and not recognize OffsetsForLeaderEpochRequest V1.

Contributor Author

The OffsetsForLeaderEpoch request is exactly the same in v0 and v1, so we don't need to explicitly check the protocol version when building requests. If the leader is on an older version, it will send a v0 response, which will not include the leader epoch field; this is handled in the OffsetsForLeaderEpochResponse constructor by setting the leader epoch field to undefined. In the fetcher thread, we handle this case (where the leader epoch is undefined) in maybeTruncate() and fall back to KIP-101 behavior, the same as when this broker is on the older protocol version.

Member

@lindong28 lindong28 May 8, 2018

Hmm.. my understanding is that the version of the response should always match the version of the request. Thus in order to receive OffsetsForLeaderEpochResponse V1, the broker needs to send OffsetsForLeaderEpochRequest V1. And the broker should reject the request if the version of the request is not recognized. Did I miss something?

Contributor Author

Yes, correct. I meant there is nothing different to do in the fetcher thread. If I understood the code correctly, ReplicaFetcherThread.fetchEpochsFromLeader() passes the OffsetsForLeaderEpochRequest.Builder to sendRequest(), and then build() is called on that builder with a version in NetworkClient.doSend. It looks like the proper version will be used in that case.

Member

@lindong28 lindong28 May 8, 2018

Currently, if we do not explicitly specify the version for AbstractRequest.Builder(), the latest version of the request, as determined by ApiKeys.latestVersion(), will be used. The latest version of OffsetsForLeaderEpochRequest will be V1 after this patch. We probably need to explicitly pass the version (determined by the IBP) to OffsetsForLeaderEpochRequest.Builder, similar to what we do for UpdateMetadataRequest.Builder() in ControllerBrokerRequestBatch.sendRequestsToBrokers().

Contributor Author

Oh I see, thank you, let me take a look.

Contributor Author

Thanks a lot for your help, I updated the code to use the OffsetsForLeaderEpochRequest version when building a request.
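A minimal sketch of the version selection agreed on above (hypothetical helper name, not Kafka's actual API; the inter-broker protocol version is modeled as a tuple):

```python
KAFKA_2_0_IV0 = (2, 0)

def offsets_for_leader_epoch_request_version(inter_broker_protocol):
    # Use V1 (which adds a per-partition leader epoch to the response)
    # only once every broker understands it; otherwise stay on V0.
    return 1 if inter_broker_protocol >= KAFKA_2_0_IV0 else 0
```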

*/
def getOffsetTruncationState(tp: TopicPartition, leaderEpochOffset: EpochEndOffset, replica: Replica, isFutureReplica: Boolean = false): OffsetTruncationState = {
// to make sure we can distinguish log output for fetching from remote leader or local replica
val followerName = if (isFutureReplica) "future replica" else "follower"
Member

nits: this replica can be either follower or future replica. Maybe the variable can be named replicaName?

Contributor Author

Yeah, I already went back and forth a couple of times regarding "replica" vs. "follower" (also re: your comment below). Jun commented (in this PR) that "replica" is also confusing, in that the leader is also a replica. And the future replica is also a follower, just of a different type. I propose to keep this name as is, but replace replicaEndOffset with followerEndOffset re: your comment below.

} else {
// get (leader epoch, end offset) pair that corresponds to the largest leader epoch
// less than or equal to the requested epoch.
val (followerEpoch, replicaEndOffset) = replica.epochs.get.endOffsetFor(leaderEpochOffset.leaderEpoch)
Member

nits: would the name replicaEpoch be more consistent with the name replicaEndOffset?


case class OffsetTruncationState(offset: Long, truncationCompleted: Boolean) {

def this (offset: Long) = this(offset, true)
Member

nits: can we remove the space after this?

isInterruptible: Boolean = true,
includeLogTruncation: Boolean)
includeLogTruncation: Boolean,
useLeaderEpochInResponse: Boolean = true)
Member

My personal opinion is that it may be more general to just pass the interBrokerProtocolVersion to the constructor of AbstractFetcherThread, and use this variable to determine the version of OffsetsForLeaderEpochRequest when we actually generate the builder for OffsetsForLeaderEpochRequest. It is more consistent with the existing usage of KafkaConfig.interBrokerProtocolVersion in the code base. And if in the future some other logic in AbstractFetcherThread relies on the inter-broker protocol, we won't need to add more variables to the constructor.

Contributor Author

I agree. However, I just tried it, and it requires changes to ConsumerFetcherThread constructor, and then ConsumerFetcherManager, and so on. I think it would be easy to change later when we need more logic dependent on inter broker protocol version, and especially once we remove old consumer fetcher code.

Contributor

@junrao junrao left a comment

@apovzner : Thanks for the latest patch. LGTM. Just a few more minor comments below.

* @param fetchOffsets the partitions to mark truncation complete
*/
private def markTruncationCompleteAndUpdateFetchOffset(fetchOffsets: Map[TopicPartition, Long]) {
private def markTruncationCompleteAndUpdateFetchOffset(fetchOffsets: Map[TopicPartition, OffsetTruncationState]) {
Contributor

The method name now is not very accurate. It doesn't always mark truncation as completed.

warn(s"Based on $followerName's leader epoch, leader replied with an unknown offset in ${replica.topicPartition}. " +
s"The initial fetch offset ${partitionStates.stateValue(tp).fetchOffset} will be used for truncation.")
OffsetTruncationState(partitionStates.stateValue(tp).fetchOffset, truncationCompleted = true)
} else if (leaderEpochOffset.leaderEpoch == UNDEFINED_EPOCH || !useLeaderEpochInResponse) {
Contributor

It seems that we don't really need the flag useLeaderEpochInResponse. If interBrokerProtocolVersion < KAFKA_2_0_IV0, it's guaranteed that leaderEpochOffset.leaderEpoch is UNDEFINED_EPOCH.

Member

@junrao I have thought about this as well. But for the future replica, even if interBrokerProtocolVersion < KAFKA_2_0_IV0, ReplicaAlterLogDirsThread.fetchEpochsFromLeader() may still return an EpochEndOffset whose leaderEpoch is not UNDEFINED_EPOCH. Maybe this method should also return UNDEFINED_EPOCH if interBrokerProtocolVersion < KAFKA_2_0_IV0?

Contributor Author

Oh, thanks for raising this, Dong. I think we should then make ReplicaAlterLogDirsThread.fetchEpochsFromLeader() return a response with UNDEFINED_EPOCH to match the older protocol response.

* truncate the leader's offset (and do not send any more leader epoch requests).
* -- Otherwise, truncate to min(leader's offset, end offset on the follower for epoch that
* leader replied with, follower's Log End Offset).
*/
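As context for the rule quoted in the comment above, here is a rough Python sketch of the truncation decision. This is illustrative only: the actual logic is Scala in AbstractFetcherThread.getOffsetTruncationState, and the helper names and return shape here are made up:

```python
# Illustrative model of the KIP-279 truncation decision (not Kafka's code).
UNDEFINED_EPOCH = -1
UNDEFINED_EPOCH_OFFSET = -1

def get_offset_truncation_state(leader_epoch, leader_end_offset,
                                follower_end_offset_for, follower_leo):
    """Returns (truncation offset, truncation_completed).
    follower_end_offset_for(epoch) models the follower's epoch cache lookup:
    it returns (largest known epoch <= epoch, end offset of that epoch)."""
    if leader_epoch == UNDEFINED_EPOCH:
        # Leader replied without an epoch (older protocol): fall back to
        # truncating to the leader's offset.
        return (min(leader_end_offset, follower_leo), True)
    follower_epoch, follower_end_offset = follower_end_offset_for(leader_epoch)
    if follower_end_offset == UNDEFINED_EPOCH_OFFSET:
        # Follower was not tracking epochs at that point: truncate to the
        # leader's offset and stop.
        return (min(leader_end_offset, follower_leo), True)
    if follower_epoch != leader_epoch:
        # Follower does not know the epoch the leader replied with: truncate
        # to the end offset of its largest epoch <= the leader's epoch, and
        # send another OffsetsForLeaderEpoch request (not completed yet).
        return (min(follower_end_offset, follower_leo), False)
    # Epochs match: truncate to min(leader's offset, follower's end offset
    # for that epoch, follower's log end offset), and we are done.
    return (min(follower_end_offset, leader_end_offset, follower_leo), True)
```

The "not completed" branch is what drives the iterative back-and-forth until leader and follower find a consistent point.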
Contributor

It seems that this comment is really for AbstractFetcherThread.getOffsetTruncationState(). If we move the comment there, we can also simplify the comment in ReplicaAlterLogDirsThread.maybeTruncate().

// less than or equal to the requested epoch.
val (followerEpoch, followerEndOffset) = replica.epochs.get.endOffsetFor(leaderEpochOffset.leaderEpoch)
if (followerEndOffset == UNDEFINED_EPOCH_OFFSET) {
// This can happen if replica was not tracking leader epochs at that point (before the
Contributor

Since the code uses follower, perhaps we can say "if follower was not"

//We should have truncated to the offsets in the response
assertTrue(truncateToCapture.getValues.asScala.contains(156))
assertTrue(truncateToCapture.getValues.asScala.contains(172))
assertTrue("Expected offset 156 in captured truncation offsets " + truncateToCapture.getValues,
Contributor

Perhaps we can change the text to something like "Expect partition t1p0 to truncate to offset 156".



//We should have truncated to the offsets in the first response
assertTrue("Expected offset 155 in captured truncation offsets " + truncateToCapture.getValues,
Contributor

Should we further assert that the builder for OFFSET_FOR_LEADER_EPOCH in ReplicaFetcherMockBlockingSend.sendRequest() is set with the right version?

Contributor Author

I modified ReplicaFetcherMockBlockingSend to save the version of OffsetsForLeaderEpochRequest and added a couple of checks in the test.

Contributor

@junrao junrao left a comment

@apovzner : Thanks for the new patch. A couple of more comments.

tp -> new EpochEndOffset(Errors.NONE, replicaMgr.getReplicaOrException(tp).epochs.get.endOffsetFor(epoch))
val (leaderEpoch, leaderOffset) = replicaMgr.getReplicaOrException(tp).epochs.get.endOffsetFor(epoch)
val leaderEpochInResponse: Int =
if (brokerConfig.interBrokerProtocolVersion >= KAFKA_2_0_IV0) leaderEpoch
Contributor

Do we need this check? Since we are getting the leader epoch from the current replica's log directly, even when IBP < KAFKA_2_0_IV0, it seems that we can just return leaderEpoch.

Contributor Author

If we are on protocol < 2.0, then the local replica will be fetching from the leader based on the older protocol (not using leader epochs). If we don't check here, the future replica will be fetching from the local replica based on leader epochs. Seems inconsistent? On the other hand, it should still work for the future replica to truncate using the leader epoch in that case too.

Contributor

Yes, preserving the leader epoch always gives a better outcome, so if we can do it, there is no reason to switch to a worse method. We have no choice between the follower and the leader because of the IBP. However, here everything is local, so there is no need to be constrained by the IBP.

Member

If we keep the leader epoch here for better outcome, should we still check useLeaderEpochInResponse in getOffsetTruncationState() so that it returns OffsetTruncationState(min(leaderEpochOffset.endOffset, replica.logEndOffset.messageOffset), truncationCompleted = true) if useLeaderEpochInResponse is false?

Contributor Author

If we use the leader epoch, then we should go all the way with the new protocol, i.e., continue truncating until we find the consistent point.
OK, I will change back to using the leader epoch, if available, for the future replica.

OffsetTruncationState(intermediateOffsetToTruncateTo, truncationCompleted = false)
} else {
val offsetToTruncateTo = min(followerEndOffset, leaderEpochOffset.endOffset)
OffsetTruncationState(min(offsetToTruncateTo, replica.logEndOffset.messageOffset), truncationCompleted = true)
Contributor

In general, we don't expect the truncation point to be < local HW, so it would be useful to log a warning when this happens. Not sure what the easiest way is, since now we can have intermediate truncation points.

Contributor

@junrao junrao left a comment

@apovzner : Thanks for the update. A minor comment below. Also, (1) have you added the logic to warn if the truncation point < local HW? (2) have you run all system tests?

import java.util

import AbstractFetcherThread.ResultWithPartitions
import kafka.api._
Contributor

unused import

@apovzner
Contributor Author

apovzner commented May 9, 2018

@junrao I added the warning about truncating below the HW to ReplicaFetcherThread.maybeTruncate. I explicitly compare replica.highWatermark to the offset we are truncating to. If we truncate several times, and more than once below the HW, we will output the warning multiple times, which I think is OK.

I ran system tests yesterday (https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1746/) and there was only one failure, in kafkatest.benchmarks.streams.streams_simple_benchmark_test.StreamsSimpleBenchmarkTest.test_simple_benchmark.test=streams-join.scale=1, which was due to the streams test process taking too long to exit. I don't think it is related to any changes in this PR.
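The check described above might look roughly like this. This is an illustrative Python sketch, not the actual Scala in ReplicaFetcherThread.maybeTruncate; the function name is made up:

```python
import logging

log = logging.getLogger("ReplicaFetcherThread")

def warn_if_truncating_below_hw(tp, truncation_offset, high_watermark):
    """Sketch of the warning check: truncating below the local high
    watermark is unexpected (committed data may be lost, e.g. under
    unclean leader election), so surface a warning when it happens.
    Returns True if a warning was emitted."""
    below_hw = truncation_offset < high_watermark
    if below_hw:
        log.warning("Truncating %s to offset %d below high watermark %d",
                    tp, truncation_offset, high_watermark)
    return below_hw
```

Since the comparison runs on every truncation, repeated truncations below the HW produce repeated warnings, as noted above.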

Contributor

@junrao junrao left a comment

@apovzner : Thanks for the patch. LGTM.

@junrao junrao merged commit 9679c44 into apache:trunk May 10, 2018
Comment thread on docs/upgrade.html

<script id="upgrade-template" type="text/x-handlebars-template">

<h4><a id="upgrade_2_0_0" href="#upgrade_2_0_0">Upgrading from 0.8.x, 0.9.x, 0.10.0.x, 0.10.1.x, 0.10.2.x, 0.11.0.x, 1.0.x, 1.1.x, or 1.2.x to 2.0.0</a></h4>
Contributor

@apovzner @junrao while working on another PR, I realized this one duplicated part of upgrade_2_0_0 with upgrade_1_2_0 (we renamed 1.2 to 2.0). If there is no new content added, I'll go ahead and remove the duplicated section in my PR.

ijuma added a commit to ijuma/kafka that referenced this pull request May 11, 2018
…-record-version

* apache-github/trunk:
  KAFKA-6894: Improve err msg when connecting processor with global store (apache#5000)
  KAFKA-6893; Create processors before starting acceptor in SocketServer (apache#4999)
  MINOR: Fix typo in ConsumerRebalanceListener JavaDoc (apache#4996)
  MINOR: Remove deprecated valueTransformer.punctuate (apache#4993)
  MINOR: Update dynamic broker configuration doc for truststore update (apache#4954)
  KAFKA-6870 Concurrency conflicts in SampledStat (apache#4985)
  KAFKA-6361: Fix log divergence between leader and follower after fast leader fail over (apache#4882)
  KAFKA-6813: Remove deprecated APIs in KIP-182, Part II (apache#4976)
  KAFKA-6878 Switch the order of underlying.init and initInternal (apache#4988)
  KAFKA-6299; Fix AdminClient error handling when metadata changes (apache#4295)
  KAFKA-6878: NPE when querying global state store not in READY state (apache#4978)
  KAFKA 6673: Implemented missing override equals method (apache#4745)
  KAFKA-6834: Handle compaction with batches bigger than max.message.bytes (apache#4953)
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
…t leader fail over (apache#4882)

Implementation of KIP-279 as described here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-279%3A+Fix+log+divergence+between+leader+and+follower+after+fast+leader+fail+over

In summary:
- Added leader_epoch to OFFSET_FOR_LEADER_EPOCH_RESPONSE
- Leader replies with the pair( largest epoch less than or equal to the requested epoch, the end offset of this epoch)
- If Follower does not know about the leader epoch that leader replies with, it truncates to the end offset of largest leader epoch less than leader epoch that leader replied with, and sends another OffsetForLeaderEpoch request. That request contains the largest leader epoch less than leader epoch that leader replied with.

Reviewers: Dong Lin <lindong28@gmail.com>, Jun Rao <junrao@gmail.com>