KAFKA-13488: Producer fails to recover if topic gets deleted midway#11552
KAFKA-13488: Producer fails to recover if topic gets deleted midway#11552dajac merged 6 commits intoapache:trunkfrom
Conversation
hachikuji
left a comment
There was a problem hiding this comment.
Thanks for the patch! I left a few comments for consideration.
| Integer currentEpoch = lastSeenLeaderEpochs.get(tp); | ||
| if (topicId != null && oldTopicId != null && !topicId.equals(oldTopicId)) { | ||
| // oldTopicId can be null (when metadata is fetched during topic recreation), update the metadata in that case as well. | ||
| if (topicId != null && !topicId.equals(oldTopicId)) { |
There was a problem hiding this comment.
I think this approach is reasonable. The root of the problem is the loose binding between the leader epoch and the topicId. I'm still a bit tempted to take this a little further and only permit the epoch check when the topicIds are matching. This is a little difficult to do at the moment because the epoch may be learned in contexts where we have not yet exposed the topicId. Hopefully this can be tightened up in https://issues.apache.org/jira/browse/KAFKA-13447. For now, this workaround seems ok.
| int newEpoch = partitionMetadata.leaderEpoch.get(); | ||
| Integer currentEpoch = lastSeenLeaderEpochs.get(tp); | ||
| if (topicId != null && oldTopicId != null && !topicId.equals(oldTopicId)) { | ||
| // oldTopicId can be null (when metadata is fetched during topic recreation), update the metadata in that case as well. |
There was a problem hiding this comment.
How about this?
// Between the time that a topic is deleted and re-created, the client may lose
// track of the corresponding topicId (i.e. `oldTopicid` will be null). In this case,
// when we discover the new topicId, we allow the corresponding leader epoch
// to override the last seen value.Also, would it make sense to move this into the corresponding branch that it applies to?
There was a problem hiding this comment.
Ack. Changed the comment as per the suggestion.
would it make sense to move this into the corresponding branch that it applies to?
Sorry, couldn't get it. Can you elaborate on this please. (Do you mean a separate if branch? The current If branch deals with separate topicId, so that should be the one we should modify as part of this patch.)
| // We should treat an added topic ID as though it is the same topic. Handle only when epoch increases. | ||
| // Don't update to an older one | ||
| metadataResponse = RequestTestUtils.metadataUpdateWithIds("dummy", 1, Collections.emptyMap(), Collections.singletonMap("topic-1", 1), _tp -> 1, topicIds); | ||
| // If the Older topic Id is null, we should go with the new TopicId as the leader Epoch |
There was a problem hiding this comment.
It might be worthwhile having a separate case which goes through the sequence described in the jira. Basically this:
- Receive metadata response with topicID A.
- Receive metadata response with UNKNOWN_TOPIC error.
- Receive metadata response with topicID B.
There was a problem hiding this comment.
added a new testcase with the suggested flow. Old test-case still needs to be changed as that case fails now, so modified the case as per the new changes.
There was a problem hiding this comment.
nit: Older -> older; topic Id -> topic ID; TopicId -> topic ID; Epoch -> epoch.
|
@hachikuji Made the suggested changes. Please review. |
|
@hachikuji Gentle bump on the review. PTAL. |
|
@hachikuji @jolshan Bump on the review. |
dajac
left a comment
There was a problem hiding this comment.
@prat0318 Thanks for the PR! I do agree with @hachikuji that the workaround looks reasonable. I have left some comments below (mainly nits).
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
There was a problem hiding this comment.
nit: Empty line could be removed.
| MetadataResponse metadataResponse = emptyMetadataResponse(); | ||
| metadata.updateWithCurrentRequestVersion(metadataResponse, false, 0L); | ||
|
|
||
| Map<String, Uuid> topicIds = Collections.singletonMap("topic-1", Uuid.randomUuid()); |
There was a problem hiding this comment.
nit: We could have kept the comment before the block to be consistent with the other two blocks. I just wanted to rework it a bit.
| metadata.updateWithCurrentRequestVersion(metadataResponse, false, 1L); | ||
| assertEquals(Optional.of(10), metadata.lastSeenLeaderEpoch(tp)); | ||
|
|
||
| // Create topic-1 again but this time with a topic ID bar. LeaderEpoch should be updated to new even if lower. |
There was a problem hiding this comment.
nit: Create topic-1 again but this time with a topic ID.?
| new TopicPartition(topic, 1) -> Seq(1, 0) | ||
| ) | ||
|
|
||
| // Change leader to 1 for both the partitions to increase leader Epoch from 0 -> 1 |
|
@prat0318 Thanks for the update. I left a few more nits to fix typos. Would you have time to quickly address them? |
…11552) Allow the leader epoch to be re-assigned to the new value from the Metadata response if `oldTopicId` is not present in the cache. This is needed because `oldTopicId` is removed from the cache if the topic gets deleted but the leader epoch is not removed. Hence, metadata for the newly recreated topic won't be accepted unless we allow `oldTopicId` to be null. Reviewers: Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>
…11552) Allow the leader epoch to be re-assigned to the new value from the Metadata response if `oldTopicId` is not present in the cache. This is needed because `oldTopicId` is removed from the cache if the topic gets deleted but the leader epoch is not removed. Hence, metadata for the newly recreated topic won't be accepted unless we allow `oldTopicId` to be null. Reviewers: Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>
…11552) Allow the leader epoch to be re-assigned to the new value from the Metadata response if `oldTopicId` is not present in the cache. This is needed because `oldTopicId` is removed from the cache if the topic gets deleted but the leader epoch is not removed. Hence, metadata for the newly recreated topic won't be accepted unless we allow `oldTopicId` to be null. Reviewers: Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>
|
Merged to trunk, 3.1, 3.0 and 2.8. |
…pache#11552) Allow the leader epoch to be re-assigned to the new value from the Metadata response if `oldTopicId` is not present in the cache. This is needed because `oldTopicId` is removed from the cache if the topic gets deleted but the leader epoch is not removed. Hence, metadata for the newly recreated topic won't be accepted unless we allow `oldTopicId` to be null. Reviewers: Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>
…pache#11552) Allow the leader epoch to be re-assigned to the new value from the Metadata response if `oldTopicId` is not present in the cache. This is needed because `oldTopicId` is removed from the cache if the topic gets deleted but the leader epoch is not removed. Hence, metadata for the newly recreated topic won't be accepted unless we allow `oldTopicId` to be null. Reviewers: Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>
Allow LeaderEpoch to be re-assigned to the new value from the Metadata Response if oldTopicId is not present in the cache. This is needed because oldTopicId is removed from the cache if the topic gets deleted but the LeaderEpoch is not removed. Hence, metadata for the newly recreated topic won't be accepted unless we allow oldTopicId to be null.
This is a fix on top of earlier made #10952 and #11004 PRs but still don't solve the bug mentioned in KAFKA-13488. This is now fixed in this PR.
Committer Checklist (excluded from commit message)