MINOR: Add RaftReplicaManager #10069
Conversation
```scala
val partitionsMadeLeader = makeLeaders(partitionsAlreadyExisting, leaderPartitionStates,
  highWatermarkCheckpoints, -1, mostRecentMetadataOffsets)
val partitionsMadeFollower = makeFollowers(partitionsAlreadyExisting,
  createMetadataBrokersFromCurrentCache, followerPartitionStates,
```
Do we have any guarantee that the metadata cache is in a state that is consistent with the deferred changes?
Good question. Any metadata changes up through the point where we apply the deferred partition changes should already have been applied to the metadata cache by then.
One thing we need to think about is the fact that we currently don't defer metadata cache changes at all. The metadata cache will contain partition states that are ahead of ReplicaManager during the time when ReplicaManager is deferring its changes. This means, for example, that the following will reflect deferred partition changes that have been applied to the metadata cache but that have not been applied to ReplicaManager. We may have to write test cases for each of these conditions so we can be clear on what the expected behavior should be.
- MetadataRequest
- FindCoordinatorRequest
- ElectLeadersRequest with topicPartitions = null
- DelayedCreatePartitions (in topic purgatory)
- DelayedElectLeader (in elect leader purgatory)
- Anything that calls ReplicaManager.fetchMessages() and DelayedFetch (in fetch purgatory), though these seem okay since they wait until they can get enough data?
- TransactionMarkerChannelManager.addTxnMarkersToBrokerQueue
- DescribeConfigsRequest
- OffsetCommitRequest (whether or not to send UNKNOWN_TOPIC_OR_PARTITION)
- ProduceRequest (whether or not to send UNKNOWN_TOPIC_OR_PARTITION)
- FetchRequest (whether or not to send UNKNOWN_TOPIC_OR_PARTITION)
- DeleteRecordsRequest (whether or not to send UNKNOWN_TOPIC_OR_PARTITION)
- AddPartitionsToTxnRequest (whether or not to send UNKNOWN_TOPIC_OR_PARTITION)
- TxnOffsetCommitRequest (whether or not to send UNKNOWN_TOPIC_OR_PARTITION)
- OffsetDeleteRequest (whether or not to send UNKNOWN_TOPIC_OR_PARTITION)
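To make the window described above concrete, here is a minimal toy model (not Kafka's actual classes; all names are illustrative) of the invariant in play: while changes are deferred, the highest metadata offset absorbed by the metadata cache can run ahead of the highest offset applied by RaftReplicaManager, so any request path served purely from the cache can describe partition state the replica manager does not yet reflect.

```java
// Toy model of the deferral window discussed above. Illustrative only:
// these are not Kafka's real classes or field names.
class DeferralWindow {
    private long cacheOffset = -1L;    // highest metadata offset applied to the cache
    private long appliedOffset = -1L;  // highest offset the replica manager has applied

    void cacheSawOffset(long offset) { cacheOffset = offset; }

    void replicaManagerApplied(long offset) { appliedOffset = offset; }

    // While deferring, requests served from the cache (MetadataRequest,
    // FindCoordinatorRequest, ...) may reflect state in (appliedOffset, cacheOffset].
    boolean cacheAheadOfReplicaManager() { return cacheOffset > appliedOffset; }
}
```

Each request type in the list above would need a test asserting its behavior while `cacheAheadOfReplicaManager()` is true.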
I added the partition's deferral state to MetadataPartitions. Storing the information in the metadata cache could help here.
```scala
stateChangeLogger.info(s"Applied ${partitionsMadeLeader.size + partitionsMadeFollower.size} deferred partitions prior to the error: " +
  s"${partitionsMadeLeader.size} leader(s) and ${partitionsMadeFollower.size} follower(s)")
// Re-throw the exception for it to be caught in BrokerMetadataListener
throw e
```
If we fail to apply changes, I guess we have to see that as a fatal error? The only possible way of recovering would be to replay the changes.
Yeah, I think so. We may need to put effort into minimizing the blast radius of these failures.
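The "log partial progress, then rethrow" shape discussed here can be sketched as follows. This is a hedged illustration, not Kafka's actual code: the class, counters, and logging format are invented for the example; the real logic lives in RaftReplicaManager and the exception is caught in BrokerMetadataListener.

```java
// Sketch of the fatal-error path: record how far the deferred application
// got, then propagate the exception so the caller treats it as unrecoverable.
// Names are illustrative, not Kafka's real API.
class DeferredApplier {
    int leadersApplied = 0;
    int followersApplied = 0;

    void applyDeferred(Runnable applyAll) {
        try {
            applyAll.run();
        } catch (RuntimeException e) {
            // Logging partial progress keeps the failure debuggable and
            // bounds the "blast radius" investigation to the remaining partitions.
            System.err.printf("Applied %d deferred partitions prior to the error: " +
                "%d leader(s) and %d follower(s)%n",
                leadersApplied + followersApplied, leadersApplied, followersApplied);
            throw e; // re-thrown for the metadata listener to handle as fatal
        }
    }
}
```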
@rondagostino Looks like a file is missing the license:
All builds are green!
```scala
object MetadataPartition {
  def apply(name: String, record: PartitionRecord): MetadataPartition = {
    val OffsetNeverDeferred = 0L // must not be a valid offset we could see (i.e. must not be positive)
```
Can you add JavaDoc for this?
Also, what about NoDeferredOffset as a name?
One last question... why is this 0 and not -1? 0 is a valid offset in the log, whereas -1 is not.
As discussed offline, we will remove this and not include the last seen offset in log messages when applying deferred changes.
```diff
-      Collections.emptyList())
+      Collections.emptyList(),
+      largestDeferredOffsetEverSeen = deferredAtOffset.getOrElse(OffsetNeverDeferred),
+      isCurrentlyDeferringChanges = deferredAtOffset.isDefined)
```
Hmm... why do we need this boolean? Can't we just check whether `largestDeferredOffsetEverSeen` is not `OffsetNeverDeferred`?
We basically use largestDeferredOffsetEverSeen only for logging at this point -- we also check it in a few private def sanityCheckState...() RaftReplicaManager methods. We could completely eliminate largestDeferredOffsetEverSeen if we didn't want to log when the partition was last deferred. It just tracks when the partition was last seen and the change at that offset was deferred rather than directly applied. Once the partition is no longer deferred the value remains whatever it was and the boolean flips to false.
It does seem on the surface that we could change the declaration to deferredSinceOffset and get rid of the boolean -- and deferredSinceOffset would change to -1 once those changes are applied. But there is a problem with this if the partition changes to not being deferred in the metadata cache before we ask RaftReplicaManager to process all of its deferred changes: the value will be -1 in the metadata cache under those circumstances, and we wouldn't have the value to log.
So I think we have a few options.
- Do the logging, apply the changes to the metadata cache before replica manager, and keep the `Long` and `Boolean` as currently defined
- Do the logging, apply the changes to the metadata cache after replica manager, and use just a `Long` (with the semantics being changed as described above)
- Just use a `Boolean` and don't do the logging.
As discussed offline, we will eliminate the information from the messages we log when applying deferred changes, and we won't carry that info around in MetadataPartition. Currently RaftReplicaManager knows if it is deferring changes or not. Maybe later when we get BrokerLifecycleManager and BrokerMetadataListener committed we can think about where a global boolean might live to identify if the broker is fenced or not. It isn't critical to decide right now because we are only going to defer the application of partition metadata at startup in 2.8.
Thanks for this PR, @rondagostino! For now I left two small comments... LGTM after those are addressed.
rondagostino left a comment:
@cmccabe I pushed a commit getting rid of the information from MetadataPartition and removing the information from the log messages.
This adds the logic to apply partition metadata when consuming from the Raft-based metadata log. RaftReplicaManager extends ReplicaManager for now to minimize changes to existing code for the 2.8 release. We will likely adjust this hierarchy at a later time (e.g. introducing a trait and adding a helper to refactor common code). For now, we expose the necessary fields and methods in ReplicaManager by changing their scope from private to protected, and we refactor out a couple of pieces of logic that are shared between the two implementations (stopping replicas and adding log dir fetchers). Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
This adds the logic to apply partition metadata when consuming from the Raft-based metadata log.

`RaftReplicaManager` extends `ReplicaManager` for now to minimize changes to existing code for the 2.8 release. We will likely adjust this hierarchy at a later time (e.g. introducing a trait and adding a helper to refactor common code). For now, we expose the necessary fields and methods in `ReplicaManager` by changing their scope from `private` to `protected`, and we refactor out a couple of pieces of logic that are shared between the two implementations (stopping replicas and adding log dir fetchers).

Existing tests are sufficient to expose regressions in the current `ReplicaManager`. We intend to exercise the new `RaftReplicaManager` code via system tests and unit/integration tests (both to come in later PRs).