KAFKA-15784: Ensure atomicity of in memory update and write when transactionally committing offsets by jolshan · Pull Request #14774 · apache/kafka

jolshan · 2023-11-16T01:50:48Z

Rewrote the verification flow to pass a callback to execute after verification completes.
For the TxnOffsetCommit, we will call doTxnCommitOffsets. This allows us to do offset validations post verification.

I've reorganized the verification code and group coordinator code to make these code paths clearer. The followup refactor (https://issues.apache.org/jira/browse/KAFKA-15987) will further clean up the produce verification code.
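As a rough illustration of the callback-based flow described above, here is a minimal sketch. Everything in it is a hypothetical stand-in: `Err`, `Guard`, `maybeStartVerification` and the simplified `doTxnCommitOffsets` signature are invented for this example and are not the broker's actual API. The verification here completes synchronously, whereas in the broker it may complete later.

```scala
object VerificationFlowSketch {
  // Hypothetical stand-ins for Kafka's Errors and VerificationGuard types.
  sealed trait Err
  case object NoError extends Err
  case object InvalidTxnState extends Err

  final case class Guard(id: Long)
  val Sentinel: Guard = Guard(0L)

  // Hypothetical verifier: runs the check, then hands the result to the callback.
  // The check is synchronous here to keep the sketch self-contained.
  def maybeStartVerification(partition: String, ongoingTxns: Set[String])
                            (callback: (Err, Guard) => Unit): Unit = {
    if (ongoingTxns.contains(partition)) callback(NoError, Guard(1L))
    else callback(InvalidTxnState, Sentinel)
  }

  // Hypothetical commit step: it runs only after verification has produced
  // a result, so offset validation happens post verification.
  def doTxnCommitOffsets(partition: String, offset: Long, err: Err, guard: Guard): Either[Err, Long] =
    err match {
      case NoError => Right(offset)
      case other => Left(other)
    }

  def commit(partition: String, offset: Long, ongoingTxns: Set[String]): Either[Err, Long] = {
    var result: Either[Err, Long] = Left(InvalidTxnState)
    maybeStartVerification(partition, ongoingTxns) { (err, guard) =>
      result = doTxnCommitOffsets(partition, offset, err, guard)
    }
    result
  }
}
```

The point of the shape is that the commit logic is passed in as the continuation, so nothing is validated or written before the verification result is known.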

junrao · 2023-12-06T21:37:05Z

@jolshan : Thanks for the PR. Should we fix the append call in CoordinatorRuntime (#14705) too? There, a partition level lock in CoordinatorRuntime is held while checking/updating the coordinator state and calling append.

jolshan · 2023-12-06T23:16:09Z

@junrao Thanks for taking a look. I was just rewriting the code to make this clearer. I will take a look at @artemlivshits and your questions about locking now.

I wasn't sure if this PR was trying to remove locks -- I think we want to address that as a follow-up?

jolshan · 2023-12-06T23:18:48Z

-            (topicIdPartition, Errors.NOT_COORDINATOR)
-          }
-          responseCallback(commitStatus)
+    group.inLock {


For the group.inLocks in this method -- we should always be holding the group lock when we call storeOffsets from doTxnCommitOffsets and doCommitOffsets. What do we think about removing these locks and stating that we should be holding the group lock on this method? (and/or just wrapping the method in a lock)

I don't think we can just wrap this method in the lock to show the proper intent -- the intent is that the lock must already be held by the caller, because the caller also does some validation under the lock, and the atomicity of that validation needs to be preserved across the local write. Note that this is not a correctness issue (the lock is already taken outside, so any locking pattern works); it's a comment about how to make the code easier to maintain. The atomicity requirement is absolutely non-obvious and required a lot of effort from multiple people to figure out, and I think this effort is not reflected in this change in any way -- the code got rearranged into some form that makes it work, but the underlying issue (unclear and confusing atomicity invariants) is not addressed.

So I'd do 3 things:

Remove explicit locking.

Add a comment in the Java doc stating a requirement that this function must be called under the lock.

Add a comment near the appendForGroup call that we rely on it not returning until the local append is done to preserve atomicity protected by the lock.

If it wasn't soon-to-be-dead code I'd probably do more with naming conventions and asserts, but in this situation adding proper comments should be good and easy.
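To make the "asserts" idea concrete, here is a minimal sketch of a method that documents and checks the lock requirement instead of taking the lock itself. The `Group` class, `lockHeldByCaller`, and the simplified `storeOffsets` and `commitOffsets` signatures are assumptions made up for this example; only `ReentrantLock.isHeldByCurrentThread` is a real JDK call.

```scala
import java.util.concurrent.locks.ReentrantLock
import scala.collection.mutable

object GroupLockSketch {
  // Hypothetical, simplified group: one lock guards the in-memory offsets.
  final class Group {
    private val lock = new ReentrantLock()
    val pendingOffsets = mutable.Map.empty[String, Long]
    val log = mutable.ArrayBuffer.empty[(String, Long)]

    def inLock[T](body: => T): T = {
      lock.lock()
      try body finally lock.unlock()
    }

    def lockHeldByCaller: Boolean = lock.isHeldByCurrentThread
  }

  /**
   * Must be called with the group lock held. The caller validates under the
   * same lock, so validation, the local append and the in-memory update
   * stay atomic.
   */
  def storeOffsets(group: Group, partition: String, offset: Long): Unit = {
    require(group.lockHeldByCaller, "storeOffsets must be called under the group lock")
    // The append is synchronous: it must finish before we return, otherwise
    // the caller would release the lock before the local write is done.
    group.log += ((partition, offset))
    group.pendingOffsets(partition) = offset
  }

  def commitOffsets(group: Group, partition: String, offset: Long): Boolean =
    group.inLock {
      val valid = offset >= 0 // validation happens under the lock
      if (valid) storeOffsets(group, partition, offset)
      valid
    }
}
```

A check like this would also surface the unit-test concern raised in this thread: a test that calls the method without holding the lock fails immediately rather than passing by accident.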

Ok makes sense.

The only thing that made me wonder about keeping the locking is that this method is used in unit tests without the locking on the outside. I wasn't sure if the best solution is to put locks around those calls or not. I would hope that the test doesn't rely on locking given there really should only be one thread running the tests and the ReplicaManager is mocked (so no async appends) but I would need to double check.

dajac · 2023-12-07T07:24:32Z

@junrao @jolshan I am working on implementing the transactional offset commits in the new world. I will take care of adding the verification steps there when this PR is done. We can replicate the same pattern there.

hachikuji · 2023-12-07T17:53:36Z

+   * @param verificationGuards            the mapping from topic partition to verification guards if transaction verification is used
+   * @param preAppendErrors               the mapping from topic partition to LogAppendResult for errors that occurred before appending
+   */
+  def appendForGroup(timeout: Long,


Can we generalize the name? I don't see any logic specific to offsets or groups.

I planned to take this out after the refactor. It is only used for appendForGroup to minimize the diff.

It will be unified with appendRecords in the refactor. I can leave a comment referencing https://issues.apache.org/jira/browse/KAFKA-15987 if that helps.

Makes sense. But any harm having a more general name for now? It might be a week or two before the refactor gets checked in.

I guess I just didn't see anything else using it before I refactored and wanted to make the usage clear.

I didn't want to name it appendRecords as to not cause conflicts with that flow. What name were you thinking?

We should also add a comment that would reflect 2 points:

(For the maintainer of this code) -- this code must not return until the local write is done; it is an important invariant that the callers rely upon. Otherwise it looks like a generic async call that can return and continue asynchronously at any point. This way, if an additional async stage is required in this function before the local write is complete, the maintainer would know to hunt down all usages of this function and figure out the correct action.

(For the caller of this code) -- a quick example of the full workflow of how the caller should use this method: call maybeStartTransactionVerificationForPartition with a callback that would call this method.

call maybeStartTransactionVerificationForPartition with a callback that would call this method

This will change when I do the refactor since this will become appendRecords where we only call maybeStartTransactionVerificationForPartition(s) if the append requires transaction verification. I can add the comment now, but it will be changed in the refactor (https://issues.apache.org/jira/browse/KAFKA-15987)

Adding comment in the refactor should be fine.

If we keep the name, how about we add a check to ensure that the write is for __consumer_offsets? Also, we can drop the internalTopicsAllowed and appendOrigin arguments since they will be implicit.

hachikuji · 2023-12-07T22:41:34Z

+                                recordValidationStatsCallback: Map[TopicPartition, RecordValidationStats] => Unit = _ => (),
+                                requestLocal: RequestLocal = RequestLocal.NoCaching,
+                                actionQueue: ActionQueue = null,
+                                verificationGuards: Map[TopicPartition, VerificationGuard] = Map.empty,


What is the difference between passing no verification guard and passing VerificationGuard.SENTINEL?

I believe checking if the map is empty is a shortcut for skipping verification. That doesn't really matter for the offset change but does for the produce flow.

When we get to the log layer, if we don't have an entry in the map, we do a getOrElse and return the sentinel.
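The lookup behaviour described in these replies can be sketched as follows. `Guard`, `guardFor` and `requiresVerification` are hypothetical names for illustration, not the real `VerificationGuard` API, and the empty-map shortcut reflects the "I believe" in the reply above rather than something confirmed here.

```scala
object GuardLookupSketch {
  // Hypothetical stand-in for VerificationGuard; the sentinel never verifies.
  final case class Guard(id: Long) {
    def verifies(other: Guard): Boolean = this != Guard.Sentinel && this == other
  }
  object Guard { val Sentinel: Guard = Guard(0L) }

  // A missing entry falls back to the sentinel, as described for the log layer.
  def guardFor(partition: String, guards: Map[String, Guard]): Guard =
    guards.getOrElse(partition, Guard.Sentinel)

  // Assumed shortcut: an empty map means no verification was requested.
  def requiresVerification(guards: Map[String, Guard]): Boolean = guards.nonEmpty
}
```

Under these assumptions, passing no guard and passing the sentinel end up equivalent at the lookup; the difference is only whether the caller signals that verification was requested at all.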

hachikuji · 2023-12-07T22:51:39Z

      }
    }
+
+    appendForGroup(group, records, requestLocal, putCacheCallback, verificationGuards)


What error do we expect if the guard check fails during write?

We expect INVALID_TXN_STATE which is fatal. I think we previously discussed this and decided it was ok for old clients.

We have logic in the createPutCacheCallback to convert the error code returned in TxnOffsetCommit. Do we want to add a case for INVALID_TXN_STATE?

We considered this on the first PR but decided that the abortable/retriable errors were not specific enough.

From KIP-890

Return Abortable Error for TxnOffsetCommitRequests
Instead of INVALID_TXN_STATE and INVALID_PID_MAPPING we considered using UNKNOWN_MEMBER_ID which is abortable. However, this is not a clear message and is not guaranteed to be abortable on non-Java clients. Since we can't specify a message in the response, we thought it would be better to just send the actual (but fatal) errors.

Ok. Mainly I was considering whether we should have an explicit case for this so that it is clearly intentional. What do you think?

I can do that. I just thought that all the ones there were ones that were changed. There are also errors that are returned but not mapped (e.g., COORDINATOR_NOT_AVAILABLE).

We can also include InvalidPidMapping if we do want to map errors.

We could just include a comment under the default case to emphasize that no mapping is expected for these error codes?
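One way the "explicit case" and "comment under the default case" suggestions might look is sketched below. The `Err` hierarchy and `toCommitError` are hypothetical and cover only a few codes; the single mapping shown is illustrative and is not meant to reproduce the actual contents of createPutCacheCallback.

```scala
object ErrorMappingSketch {
  // Hypothetical subset of error codes, for illustration only.
  sealed trait Err
  case object NotLeaderOrFollower extends Err
  case object NotCoordinator extends Err
  case object CoordinatorNotAvailable extends Err
  case object InvalidTxnState extends Err
  case object InvalidPidMapping extends Err

  def toCommitError(appendError: Err): Err = appendError match {
    // Illustrative mapping of a log-level error to a coordinator-level one.
    case NotLeaderOrFollower => NotCoordinator
    // Intentionally unmapped: these fatal errors are returned to the client
    // as is, per the KIP-890 reasoning quoted above.
    case InvalidTxnState | InvalidPidMapping => appendError
    // Default: no mapping is expected for any other error code.
    case other => other
  }
}
```

The explicit case is redundant with the default at runtime; its only job is to record that passing these errors through was a deliberate decision.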

hachikuji · 2023-12-12T02:04:10Z

+          }
+        }
+
+        groupManager.replicaManager.maybeStartTransactionVerificationForPartition(


The access to replicaManager here probably suggests that we probably should be going through GroupMetadataManager. We could expose a wrapped maybeStartTransactionVerificationForPartition from GroupMetadataManager instead. That might also help us encapsulate the error conversion a little better.

I was just about to push my change before I saw this comment. I will address this comment tomorrow.

hachikuji

LGTM

jolshan · 2023-12-14T01:39:35Z

I took a look at the tests, and the only ones that looked suspicious were the mirrorIntegration tests related to delete.retention.ms not being configured correctly. I saw some flakes in trunk and other PRs, and that seems unrelated to this change.

Note, although the last build had a failure for a version, the previous build succeeded and only included a minor code change. Given the nature of the change and the failure, I believe this is non-blocking.

I ran system tests and noticed an issue with GroupModeTransactionsTest when bouncing brokers. After investigating with @hachikuji we believe it is unrelated. He will file a JIRA as a followup. -- edit followup here: https://issues.apache.org/jira/browse/KAFKA-16012

…sactionally committing offsets (#14774) Rewrote the verification flow to pass a callback to execute after verification completes. For the TxnOffsetCommit, we will call doTxnCommitOffsets. This allows us to do offset validations post verification. I've reorganized the verification code and group coordinator code to make these code paths clearer. The followup refactor (https://issues.apache.org/jira/browse/KAFKA-15987) will further clean up the produce verification code. Reviewers: Artem Livshits <alivshits@confluent.io>, Jason Gustafson <jason@confluent.io>, David Jacot <djacot@confluent.io>, Jun Rao <junrao@gmail.com>

junrao

@jolshan : Thanks for the updated PR. Sorry for the late review. Just a few minor comments.

junrao · 2023-12-07T18:42:38Z

+    )
+  }
+
+  def maybeStartTransactionVerificationForPartitions(


Could this be private?

I am doing a refactor PR so I can address your comments there. https://issues.apache.org/jira/browse/KAFKA-15987

junrao · 2023-12-13T23:30:34Z

+   * This method should not return until the write to the local log is completed because updating offsets requires updating
+   * the in-memory and persisted state under a lock together.
+   *
+   * Noted that all pending delayed check operations are stored in a queue. All callers to ReplicaManager.appendRecords()


It's weird to refer to ReplicaManager.appendRecords() here since this method is appendForGroup.

This will be fixed in the refactor. I plan to get rid of appendForGroup and unify in a single appendRecords method.

junrao · 2023-12-13T23:41:26Z

-        // Produce requests (only requests that require verification) should only have one batch per partition in "batches" but check all just to be safe.
-        val transactionalBatches = records.batches.asScala.filter(batch => batch.hasProducerId && batch.isTransactional)
-        transactionalBatches.foreach(batch => transactionalProducerIds.add(batch.producerId))
+  private def sendInvalidRequiredAcksResponse(entries: Map[TopicPartition, MemoryRecords],


Should we reuse this method in appendRecords?

Another change for the refactor. Jason asked me to keep the diff here to a minimum, but I plan to unify the code in the refactor. (He asked me to revert these changes for this PR)

junrao · 2023-12-13T23:55:45Z

-    if (delayedProduceRequestRequired(requiredAcks, allEntries, allResults)) {
+    debug("Produce to local log in %d ms".format(time.milliseconds - sTime))
+
+    val allResults = localProduceResults


Could we just get rid of allResults and just use localProduceResults?

junrao · 2023-12-13T23:57:27Z

+    }
+  }
+
+  private def buildProducePartitionStatus(


Could we reuse buildProducePartitionStatus in appendEntries?

All of these will be covered in the refactor. I didn't touch the produce flow to minimize the diff and cause minimal confusion when reviewing.

junrao · 2023-12-13T23:58:46Z

+    )
+  }
+
+  private def maybeAddDelayedProduce(


Could we reuse maybeAddDelayedProduce in appendEntries?

junrao · 2023-12-14T00:06:16Z

+   *
+   * When the verification returns, the callback will be supplied the error if it exists or Errors.NONE.
+   * If the verification guard exists, it will also be supplied. Otherwise the SENTINEL verification guard will be returned.
+   * This guard can not be used for verification and any appends that attenpt to use it will fail.


typo attenpt

junrao · 2023-12-14T00:13:30Z

+      requestLocal
+    )
+
+    addPartitionsToTxnManager.get.verifyTransaction(


Hmm, addPartitionsToTxnManager could be empty in tests. Should we change addPartitionsToTxnManager.get to addPartitionsToTxnManager.foreach like in the original code?

Thanks for pointing this out. I will fix this in the followup.
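The difference between the two forms comes down to standard `Option` behaviour, sketched here with a hypothetical `TxnManager` class standing in for the real manager. `Option.get` throws `NoSuchElementException` on `None`, while `Option.foreach` simply does nothing.

```scala
object OptionalManagerSketch {
  // Hypothetical stand-in that just counts verification calls.
  final class TxnManager {
    var calls = 0
    def verifyTransaction(): Unit = calls += 1
  }

  // Throws NoSuchElementException when the manager is absent.
  def verifyWithGet(manager: Option[TxnManager]): Unit =
    manager.get.verifyTransaction()

  // Silently skips verification when the manager is absent.
  def verifyWithForeach(manager: Option[TxnManager]): Unit =
    manager.foreach(_.verifyTransaction())
}
```

One consequence to keep in mind with `foreach`: when the manager is empty, nothing runs at all, so any callback that was supposed to fire after verification would need to be handled separately.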

#15087)

I originally did some refactors in #14774, but we decided to keep the changes minimal since the ticket was a blocker. Here are those refactors:

* Removed separate append paths so that produce, group coordinator, and other append paths all call appendRecords
* AppendRecords has been simplified
* Removed unneeded error conversions in verification code since group coordinator and produce path convert errors differently, removed test for that
* Fixed incorrect capital param name in KafkaRequestHandler
* Updated ReplicaManager test to handle produce appends separately when transactions are used.

Reviewers: David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>

Redo verification path

e894d84

jolshan commented Nov 16, 2023

Comment thread core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala Outdated

jolshan added 7 commits November 16, 2023 16:59

Fix build issues

2a52318

Fix tests

865078b

Merge branch 'trunk' of github.com:apache/kafka into kafka-15784

a20f238

Fix test failures

9bac9a9

Rewrite GroupCoordinator and GroupMetadataManager to handle checks for txnOffsetCommits

b4a920b

Merge branch 'trunk' of github.com:apache/kafka into kafka-15784

05c0d28

Update comments and method names

7bc6d06

jolshan marked this pull request as ready for review December 4, 2023 18:02

jolshan changed the title ~~WIP KAFKA-15784: Ensure atomicity of in memory update and write when transactionally committing offsets~~ KAFKA-15784: Ensure atomicity of in memory update and write when transactionally committing offsets Dec 4, 2023

jolshan commented Dec 4, 2023

Comment thread core/src/test/scala/unit/kafka/coordinator/AbstractCoordinatorConcurrencyTest.scala Outdated

jolshan commented Dec 4, 2023

Comment thread core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala Outdated

Fix style issues and passing verification guards

a471f04

jolshan requested a review from dajac December 4, 2023 19:43

jolshan commented Dec 4, 2023

Comment thread core/src/main/scala/kafka/coordinator/group/GroupCoordinator.scala Outdated

artemlivshits reviewed Dec 5, 2023

Comment thread core/src/main/scala/kafka/server/KafkaApis.scala Outdated

Comment thread core/src/main/scala/kafka/server/ReplicaManager.scala Outdated

Comment thread core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala Outdated

jolshan added 2 commits December 6, 2023 15:00

Clean up GroupCoordinator, GroupMetadataManger, and ReplicaManager code

7c01682

remove package private scoping that is not needed

965ec39

jolshan commented Dec 6, 2023

hachikuji reviewed Dec 6, 2023

Comment thread core/src/main/scala/kafka/server/ReplicaManager.scala

jolshan added 3 commits December 7, 2023 09:45

Remove produce path refactor

8ecbe4d

spacing cleanups

99e8577

space

a83b136

hachikuji reviewed Dec 7, 2023

remove simple fix

56a44ba

hachikuji reviewed Dec 7, 2023

jolshan added 2 commits December 11, 2023 13:35

Fix error code

3a3160d

Fix other error

4f757e2