KAFKA-15468: Prevent transaction coordinator reloads on already loaded leaders #14489
jolshan wants to merge 5 commits into apache:trunk from
clolov
left a comment
Besides some of the suggested cosmetic changes, what you are proposing makes sense to me!
| private final Map<TopicPartition, PartitionInfo> electedLeaders; | ||
| private final Map<TopicPartition, PartitionInfo> updatedLeaders; |
Does it make sense to change these names to tpToPartitionEpochs and tpToLeaderEpochs? I can anticipate this naming being confusing for a person reading the code for the first time given that what classifies as each is defined in https://github.com/apache/kafka/pull/14489/files#diff-be8b1b8ad296c48bbdc3df55fdb859881f150ceadd0959ebf02fb3caac13ee5aR146-R151
I struggled with naming this quite a bit 😅. I was also wondering if I should make it clearer about leader epoch changes vs partition epoch changes. The tricky thing is that the map only contains the leaders that experienced the changes (not followers), so I also wanted to make that clear. I will also think on that some more.
I pushed changes for the simpler fixes and will continue to think on this one.
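To make the naming question concrete, here is a minimal sketch of the distinction under discussion. It is not the code from this PR: `LeaderChangeSketch`, `PartitionState`, `LeaderChanges`, and the string partition keys are hypothetical stand-ins, and treating the two maps as disjoint is an assumption. It only shows that a leader epoch bump and a partition-epoch-only bump can be separated, and that followers land in neither map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; these types are NOT Kafka's TopicDelta or LocalReplicaChanges.
final class LeaderChangeSketch {
    // Minimal stand-in for the state of one partition.
    static final class PartitionState {
        final int leaderId;
        final int leaderEpoch;
        final int partitionEpoch;

        PartitionState(int leaderId, int leaderEpoch, int partitionEpoch) {
            this.leaderId = leaderId;
            this.leaderEpoch = leaderEpoch;
            this.partitionEpoch = partitionEpoch;
        }
    }

    // The two maps from the review thread, keyed by a "topic-partition" string for brevity.
    static final class LeaderChanges {
        final Map<String, PartitionState> electedLeaders = new HashMap<>();
        final Map<String, PartitionState> updatedLeaders = new HashMap<>();
    }

    static LeaderChanges classify(int brokerId,
                                  Map<String, PartitionState> previous,
                                  Map<String, PartitionState> current) {
        LeaderChanges changes = new LeaderChanges();
        for (Map.Entry<String, PartitionState> entry : current.entrySet()) {
            PartitionState now = entry.getValue();
            if (now.leaderId != brokerId) {
                continue; // this broker is a follower: recorded in neither map
            }
            PartitionState before = previous.get(entry.getKey());
            if (before == null || before.leaderEpoch != now.leaderEpoch) {
                // Leader epoch bump (or brand-new partition): treated as an election.
                changes.electedLeaders.put(entry.getKey(), now);
            } else if (before.partitionEpoch != now.partitionEpoch) {
                // Partition epoch bump only, e.g. an ISR or replica change.
                changes.updatedLeaders.put(entry.getKey(), now);
            }
        }
        return changes;
    }
}
```

Under this reading, names such as tpToLeaderEpochs would describe what the values hold, while electedLeaders and updatedLeaders describe why an entry is present.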
clolov
left a comment
With or without a change to the leader naming I am happy with the implementation!
Heya @jolshan! Is there something else I can help with in order for this pull request to make it into trunk?
@jolshan I started a new build.
@jolshan there are failing tests for this PR. Can you take a look when you have time?
I'm also waiting for some confirmation from @hachikuji about the transaction changes. I will look at the build issues in the meantime.
| // left off during the unloading phase. Ensure we remove all associated state for this partition before we continue | ||
| // loading it. | ||
| // loading it. In the case where the state partition is already loaded, we want to remove inflight markers with the | ||
| // old epoch. |
nit: remove inflight markers with the old epoch and replace them with the new epoch?
| * metadata cache and for all the pending markers. | ||
| */ | ||
| def loadTransactionsForTxnTopicPartition(partitionId: Int, coordinatorEpoch: Int, sendTxnMarkers: SendTxnMarkersCallback): Unit = { | ||
| def maybeLoadTransactionsAndBumpEpochForTxnTopicPartition(partitionId: Int, |
nit: The rename seems borderline overkill. I would consider the epoch bump part of transaction loading.
Ok. 😅 I think I was trying to distinguish the difference between logical and physical loading. But maybe that is too specific. Do you think it should just keep the original name?
Yeah, the original name seems fine to me. We are still loading the transactions. We just have an optimization when we already had state from a previous epoch.
| txnManager.loadTransactionsForTxnTopicPartition(txnTopicPartitionId, coordinatorEpoch, | ||
| txnMarkerChannelManager.addTxnMarkersToSend) | ||
| txnManager.maybeLoadTransactionsAndBumpEpochForTxnTopicPartition(txnTopicPartitionId, coordinatorEpoch, | ||
| txnMarkerChannelManager.addTxnMarkersToSend, txnManager.txnStateLoaded(txnTopicPartitionId)) |
It's curious that we need to pass the result of txnStateLoaded. Couldn't txnManager figure it out on its own?
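One way to read this suggestion, as a rough sketch only: the manager consults its own record of loaded partitions instead of receiving a boolean from the caller. `TxnStateManagerSketch`, `LoadResult`, and `coordinatorEpochFor` are hypothetical names, not Kafka's TransactionStateManager API; the log read is a placeholder and the marker callback is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the real TransactionStateManager.
final class TxnStateManagerSketch {
    enum LoadResult { LOADED_FROM_LOG, EPOCH_BUMPED_ONLY }

    // partitionId -> coordinator epoch of the state currently held in memory
    private final Map<Integer, Integer> loadedEpochs = new HashMap<>();

    synchronized boolean txnStateLoaded(int partitionId) {
        return loadedEpochs.containsKey(partitionId);
    }

    // Returns the recorded epoch, or -1 if the partition is not loaded.
    synchronized int coordinatorEpochFor(int partitionId) {
        Integer epoch = loadedEpochs.get(partitionId);
        return epoch == null ? -1 : epoch;
    }

    // The caller passes only the partition and epoch; whether a physical load
    // is needed is decided here, from the manager's own state.
    synchronized LoadResult loadTransactionsForTxnTopicPartition(int partitionId, int coordinatorEpoch) {
        if (loadedEpochs.containsKey(partitionId)) {
            // State from a previous epoch is already in memory: only bump the epoch.
            loadedEpochs.put(partitionId, coordinatorEpoch);
            return LoadResult.EPOCH_BUMPED_ONLY;
        }
        // Placeholder for reading the transaction log for this partition.
        loadedEpochs.put(partitionId, coordinatorEpoch);
        return LoadResult.LOADED_FROM_LOG;
    }
}
```

Keeping the check and the update inside one synchronized method also avoids a gap between the caller reading the loaded flag and the manager acting on it.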
| s"$totalLoadingTimeMs milliseconds, of which $schedulerTimeMs milliseconds was spent in the scheduler.") | ||
| Some(loadedTransactions) | ||
| } else { | ||
| None |
Perhaps we should have a log message in this path?
| def maybeLoadTransactionsAndBumpEpochForTxnTopicPartition(partitionId: Int, | ||
| coordinatorEpoch: Int, | ||
| sendTxnMarkers: SendTxnMarkersCallback, | ||
| transactionStateLoaded: Boolean): Unit = { |
As mentioned above, I don't think we should pass this as an argument.
On a higher level, I'm trying to figure out the safety of this loading process. Suppose we have two epoch bumps in quick succession. Do we get a strong ordering guarantee given that it is done asynchronously? I think I would expect that we would check for the existence of the partition in loadingPartitions when we first acquire the write lock below. If it exists, then we need to ensure the monotonicity of the epoch. If the entry has a higher epoch, then we ignore the call.
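The check described here could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: `LoadingPartitionsSketch`, `tryBeginLoad`, and `isCurrent` are hypothetical names, the map stands in for loadingPartitions, and a call at an equal epoch is allowed through because the comment only mentions ignoring calls when the entry has a higher epoch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of an epoch-monotonicity guard; not Kafka's actual code.
final class LoadingPartitionsSketch {
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    // partitionId -> highest coordinator epoch for which a load was accepted
    private final Map<Integer, Integer> loadingPartitions = new HashMap<>();

    // Called when a load request arrives. Returns false if the request is stale.
    boolean tryBeginLoad(int partitionId, int coordinatorEpoch) {
        stateLock.writeLock().lock();
        try {
            Integer existing = loadingPartitions.get(partitionId);
            if (existing != null && existing > coordinatorEpoch) {
                return false; // an entry with a higher epoch exists: ignore this call
            }
            loadingPartitions.put(partitionId, coordinatorEpoch);
            return true;
        } finally {
            stateLock.writeLock().unlock();
        }
    }

    // Called by the asynchronous load task before publishing its result, so a
    // task scheduled for an older epoch can detect that it was superseded.
    boolean isCurrent(int partitionId, int coordinatorEpoch) {
        stateLock.readLock().lock();
        try {
            Integer existing = loadingPartitions.get(partitionId);
            return existing != null && existing == coordinatorEpoch;
        } finally {
            stateLock.readLock().unlock();
        }
    }
}
```

With two epoch bumps in quick succession, the second call overwrites the entry, and the task from the first call sees `isCurrent` return false when it finishes, regardless of the order in which the asynchronous tasks complete.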
I've been asked to split these changes into the two parts I mentioned. Will follow up with that.
Part 1 PR here: #15139
… loaded leaders (#15139)
This originally was #14489, which covered 2 aspects -- reloading on partition epoch changes where the leader epoch did not change, and reloading when the leader epoch changed but we were already the leader. I've cut out the second part of the change since the first part is much simpler.
This redefines the TopicDelta fields to better distinguish when a leader is elected (leader epoch bump) vs when a leader has isr/replica changes (partition epoch bump). There are some cases where we bump the partition epoch but not the leader epoch. In those cases we do not need to do operations that only care about the leader epoch bump (i.e., onElect callbacks).
Reviewers: Artem Livshits <alivshits@confluent.io>, José Armando García Sancio <jsancio@apache.org>
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR has two parts: