KAFKA-15468: Prevent transaction coordinator reloads on already loaded leaders by jolshan · Pull Request #14489 · apache/kafka

jolshan · 2023-10-04T23:59:32Z

This PR has two parts:

1. Redefining the TopicDelta fields to better distinguish when a leader is elected (leader epoch bump) from when a leader has ISR/replica changes (partition epoch bump). There are some cases where we bump the partition epoch but not the leader epoch. In those cases we do not need to run operations that only care about the leader epoch bump (i.e., onElect callbacks).
2. Even in the case of leader epoch bumps, we may be electing the same leader. In that case, we can skip some operations (such as loading state from disk that is already loaded). The GroupCoordinator already handles this case, but the transaction coordinator does not. I've updated this code to not read from disk, but to take the existing metadata and update the leader epoch, as well as send markers with the new epoch.
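
To make the distinction in part 1 concrete, here is a minimal sketch of how a partition change might be classified. It is not the code in this PR: `PartitionState`, `Change`, and `classify` are hypothetical names, and the exact rule the PR uses lives in the TopicDelta diff. The sketch assumes that a higher leader epoch means an election and that a higher partition epoch with an unchanged leader epoch means an ISR/replica update.

```java
/** Illustrative only: all names here are hypothetical, not Kafka's API. */
class LeaderChangeClassifier {

    /** Simplified view of a partition's registration. */
    static final class PartitionState {
        final int leaderId;
        final int leaderEpoch;
        final int partitionEpoch;

        PartitionState(int leaderId, int leaderEpoch, int partitionEpoch) {
            this.leaderId = leaderId;
            this.leaderEpoch = leaderEpoch;
            this.partitionEpoch = partitionEpoch;
        }
    }

    enum Change { NONE, ELECTED, UPDATED }

    /**
     * Classifies a change from the point of view of the local broker.
     * Only partitions led by the local broker after the change are tracked.
     */
    static Change classify(PartitionState before, PartitionState after, int localBrokerId) {
        if (after.leaderId != localBrokerId) {
            return Change.NONE; // followers are not recorded in either map
        }
        if (before == null || after.leaderEpoch > before.leaderEpoch) {
            return Change.ELECTED; // leader epoch bump: onElect-style callbacks apply
        }
        if (after.partitionEpoch > before.partitionEpoch) {
            return Change.UPDATED; // ISR/replica change only: no election work needed
        }
        return Change.NONE;
    }
}
```

Under these assumptions, a change from `(leader=1, leaderEpoch=5, partitionEpoch=10)` to `(1, 5, 11)` is classified as `UPDATED`, while a change to `(1, 6, 11)` is classified as `ELECTED` even though the leader is the same broker.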

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

clolov

Besides some of the suggested cosmetic changes what you are proposing makes sense to me!

clolov · 2023-10-09T13:44:22Z

+    private final Map<TopicPartition, PartitionInfo> electedLeaders;
+    private final Map<TopicPartition, PartitionInfo> updatedLeaders;


Does it make sense to change these names to tpToPartitionEpochs and tpToLeaderEpochs? I can anticipate this naming being confusing for a person reading the code for the first time given that what classifies as each is defined in https://github.com/apache/kafka/pull/14489/files#diff-be8b1b8ad296c48bbdc3df55fdb859881f150ceadd0959ebf02fb3caac13ee5aR146-R151

I struggled with naming this quite a bit 😅 . I was also wondering if I should make it clearer about leader epoch changes vs partition epoch changes. The thing that is tricky is that the map only contains the leaders that experienced the changes (not followers) so I also wanted to make that clear. I will also think on that some more.

I pushed changes for the simpler fixes and will continue to think on this one.

clolov

With or without a change to the leader naming I am happy with the implementation!

clolov · 2023-10-19T09:10:02Z

Heya @jolshan! Is there something else I can help with in order for this pull request to make it in trunk?

jsancio

Thanks @jolshan . LGTM outside some formatting issues

jsancio · 2023-10-25T23:48:26Z

@jolshan I started a new build.

jsancio · 2023-10-27T18:11:02Z

@jolshan there are failing tests for this PR. Can you take a look when you have time?

jolshan · 2023-11-01T17:46:10Z

I'm also waiting for some confirmation from @hachikuji about the transaction changes. I will look at the build issues in the meantime.

jolshan · 2023-11-01T23:35:57Z

Newest test failures look unrelated:
Build / JDK 17 and Scala 2.13 / kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.testDescribeTokenForOtherUserFails(String).quorum=kraft
Build / JDK 11 and Scala 2.13 / integration.kafka.server.FetchFromFollowerIntegrationTest.testRackAwareRangeAssignor(String).quorum=zk
Build / JDK 11 and Scala 2.13 / org.apache.kafka.trogdor.coordinator.CoordinatorTest.testTaskRequestWithOldStartMsGetsUpdated()
Build / JDK 21 and Scala 2.13 / integration.kafka.server.FetchFromFollowerIntegrationTest.testRackAwareRangeAssignor(String).quorum=zk
Build / JDK 21 and Scala 2.13 / kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.testNoConsumeWithDescribeAclViaSubscribe(String).quorum=kraft
Build / JDK 21 and Scala 2.13 / kafka.api.DelegationTokenEndToEndAuthorizationWithOwnerTest.testDescribeTokenForOtherUserPasses(String).quorum=kraft
Build / JDK 21 and Scala 2.13 / kafka.server.DescribeClusterRequestTest.testDescribeClusterRequestExcludingClusterAuthorizedOperations(String).quorum=kraft

hachikuji · 2023-10-12T22:45:16Z

    // left off during the unloading phase. Ensure we remove all associated state for this partition before we continue
-    // loading it.
+    // loading it. In the case where the state partition is already loaded, we want to remove inflight markers with the
+    // old epoch.


nit: remove inflight markers with the old epoch and replace them with the new epoch?

hachikuji · 2023-11-27T18:59:11Z

+   * metadata cache and for all the pending markers.
   */
-  def loadTransactionsForTxnTopicPartition(partitionId: Int, coordinatorEpoch: Int, sendTxnMarkers: SendTxnMarkersCallback): Unit = {
+  def maybeLoadTransactionsAndBumpEpochForTxnTopicPartition(partitionId: Int,


nit: The rename seems borderline overkill. I would consider the epoch bump part of transaction loading.

Ok. 😅 I think I was trying to distinguish the difference between logical and physical loading. But maybe that is too specific. Do you think it should just keep the original name?

Yeah, the original name seems fine to me. We are still loading the transactions. We just have an optimization when we already had state from a previous epoch.

hachikuji · 2023-11-27T19:06:07Z

-    txnManager.loadTransactionsForTxnTopicPartition(txnTopicPartitionId, coordinatorEpoch,
-      txnMarkerChannelManager.addTxnMarkersToSend)
+    txnManager.maybeLoadTransactionsAndBumpEpochForTxnTopicPartition(txnTopicPartitionId, coordinatorEpoch,
+      txnMarkerChannelManager.addTxnMarkersToSend, txnManager.txnStateLoaded(txnTopicPartitionId))


It's curious that we need to pass the result of txnStateLoaded. Couldn't txnManager figure it out on its own?
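
One way to read this suggestion is that the manager keeps its own record of which partitions are loaded and decides internally whether a disk read is needed. The sketch below is an assumption-laden illustration, not the Scala code under review: `TxnStateSketch`, `loadedEpochs`, and `diskLoads` are hypothetical, and the disk read is replaced by a counter.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a stand-in for a manager that tracks loaded state itself. */
class TxnStateSketch {
    // partitionId -> coordinator epoch of the state held in memory (hypothetical structure)
    private final Map<Integer, Integer> loadedEpochs = new HashMap<>();

    // Counts simulated disk loads so the behavior can be observed.
    int diskLoads = 0;

    /**
     * Loads transactions for a partition. The caller does not pass a
     * "state already loaded" flag; the manager checks its own map.
     *
     * @return true if a (simulated) disk load happened, false if only the epoch was updated
     */
    synchronized boolean loadTransactions(int partitionId, int coordinatorEpoch) {
        if (loadedEpochs.containsKey(partitionId)) {
            // State is already in memory: keep it and record the new epoch.
            loadedEpochs.put(partitionId, coordinatorEpoch);
            return false;
        }
        diskLoads++; // stand-in for reading the transaction log from disk
        loadedEpochs.put(partitionId, coordinatorEpoch);
        return true;
    }

    /** Returns the epoch recorded for the partition, or null if it is not loaded. */
    synchronized Integer epochOf(int partitionId) {
        return loadedEpochs.get(partitionId);
    }
}
```

With this shape the call site stays a three-argument call (partition, epoch, marker callback in the real code), and the "already loaded" decision cannot drift out of sync with the manager's internal state.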

hachikuji · 2023-11-27T19:12:41Z

+            s"$totalLoadingTimeMs milliseconds, of which $schedulerTimeMs milliseconds was spent in the scheduler.")
+          Some(loadedTransactions)
+        } else {
+          None


Perhaps we should have a log message in this path?

hachikuji · 2023-11-27T19:43:08Z

+  def maybeLoadTransactionsAndBumpEpochForTxnTopicPartition(partitionId: Int,
+                                                            coordinatorEpoch: Int,
+                                                            sendTxnMarkers: SendTxnMarkersCallback,
+                                                            transactionStateLoaded: Boolean): Unit = {


As mentioned above, I don't think we should pass this as an argument.

On a higher level, I'm trying to figure out the safety of this loading process. Suppose we have two epoch bumps in quick succession. Do we get a strong ordering guarantee given that it is done asynchronously? I think I would expect that we would check for the existence of the partition in loadingPartitions when we first acquire the write lock below. If it exists, then we need to ensure the monotonicity of the epoch. If the entry has a higher epoch, then we ignore the call.
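
A minimal sketch of the check described here, assuming a map from partition id to the epoch of the load in progress. The class and method names are hypothetical, and the real `loadingPartitions` structure in the transaction state manager may hold different data. Calls with an equal epoch are allowed through in this sketch because the comment only says to ignore the call when the existing entry is higher.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: enforces epoch monotonicity before an asynchronous load starts. */
class LoadingEpochGuard {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // partitionId -> epoch of the load currently registered (hypothetical structure)
    private final Map<Integer, Integer> loadingPartitions = new HashMap<>();

    /**
     * Registers a load for the partition unless a load with a higher epoch
     * is already registered.
     *
     * @return true if the caller should proceed, false if the call is stale
     */
    boolean tryBeginLoad(int partitionId, int coordinatorEpoch) {
        lock.writeLock().lock();
        try {
            Integer existing = loadingPartitions.get(partitionId);
            if (existing != null && existing > coordinatorEpoch) {
                return false; // a newer epoch is already loading: ignore this call
            }
            loadingPartitions.put(partitionId, coordinatorEpoch);
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Because the check and the registration happen under the same write lock, two epoch bumps arriving in quick succession cannot leave the older epoch registered last, regardless of the order in which the asynchronous tasks run.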

jolshan · 2024-01-06T00:29:43Z

I've been asked to split these changes into the two parts I mentioned. Will follow up with that.

jolshan · 2024-01-06T01:10:15Z

Part 1 PR here: #15139

… loaded leaders (#15139)

This originally was #14489 which covered 2 aspects -- reloading on partition epoch changes where leader epoch did not change and reloading when leader epoch changed but we were already the leader. I've cut out the second part of the change since the first part is much simpler.

Redefining the TopicDelta fields to better distinguish when a leader is elected (leader epoch bump) vs when a leader has isr/replica changes (partition epoch bump). There are some cases where we bump the partition epoch but not the leader epoch. We do not need to do operations that only care about the leader epoch bump. (ie -- onElect callbacks)

Reviewers: Artem Livshits <alivshits@confluent.io>, José Armando García Sancio <jsancio@apache.org>


github-actions · 2024-08-05T03:34:08Z

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

jolshan added 3 commits October 3, 2023 15:39

Don't load multiple times

e7ea42c

Add test for transactional reloads

fc9c9e9

Fixes

e4eeb9f

jolshan commented Oct 5, 2023

Comment thread core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala Outdated

cmccabe added the kraft label Oct 5, 2023

clolov reviewed Oct 9, 2023

Style fixes and renames

7c687f0

clolov approved these changes Oct 10, 2023

jsancio self-assigned this Oct 12, 2023

jsancio approved these changes Oct 19, 2023

Comment thread metadata/src/test/java/org/apache/kafka/image/TopicsImageTest.java Outdated

Comment thread metadata/src/test/java/org/apache/kafka/image/TopicsImageTest.java Outdated

Comment thread metadata/src/test/java/org/apache/kafka/image/TopicsImageTest.java Outdated

fix spacing

53486ac

hachikuji reviewed Nov 27, 2023

jolshan mentioned this pull request Jan 6, 2024

KAFKA-15468 [1/2]: Prevent transaction coordinator reloads on already loaded leaders #15139

Merged

jolshan marked this pull request as draft January 25, 2024 19:27

jsancio removed the kraft label Apr 7, 2024

jsancio removed their assignment Apr 7, 2024

github-actions Bot added the stale Stale PRs label Aug 5, 2024

jolshan closed this Aug 9, 2024

6 participants