KAFKA-16686: Wait for given offset in TopicBasedRemoteLogMetadataManagerTest #15885
chia7712 merged 2 commits into apache:trunk from
Conversation
kamalcph
left a comment
Thanks @gaurav-narula for the patch! Left minor comments to address.
We can rewrite this test to be concise. I'll file a separate ticket for this.
```diff
 if (leaderMetadataPartition == followerMetadataPartition) {
-    if (topicBasedRlmm().readOffsetForPartition(leaderMetadataPartition).orElse(-1L) >= 1) {
+    Assertions.assertEquals(targetLeaderMetadataPartitionOffset, targetFollowerMetadataPartitionOffset);
+    if (topicBasedRlmm().readOffsetForPartition(leaderMetadataPartition).orElse(-1L) >= targetLeaderMetadataPartitionOffset) {
```
Previously we were waiting for >= 1; after this change, >= 0. This will make the test more flaky.
When the leader and follower partitions map to the same metadata partition, we have to wait for twice the number of messages.
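For illustration, a rough sketch of a wait condition that accounts for this (helper names follow the test code in the diff above; the exact combined-offset calculation is an assumption, not the PR's code):

```java
// Sketch only: when both topic-id-partitions map to the same metadata partition,
// leader and follower events are appended to one partition, so the consumer must
// reach roughly twice as many records before the assertions can run safely.
long expectedOffset = (leaderMetadataPartition == followerMetadataPartition)
        ? targetLeaderMetadataPartitionOffset + targetFollowerMetadataPartitionOffset
        : targetLeaderMetadataPartitionOffset;
boolean caughtUp = topicBasedRlmm()
        .readOffsetForPartition(leaderMetadataPartition)
        .orElse(-1L) >= expectedOffset;
```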
satishd
left a comment
Thanks @gaurav-narula for the PR, left a meta comment.
```diff
 TopicIdPartition newFollowerTopicIdPartition,
-long timeoutMs) throws TimeoutException {
+long timeoutMs,
+long targetLeaderMetadataPartitionOffset,
```
These parameters will not help much here, as this method was written for testNewPartitionUpdates, but other tests in this class used the functionality with gaps. It is better to revisit those use cases and refactor this method accordingly.
d2684d3 force-pushed to 8586c03
Thanks for the feedback @kamalcph @satishd! I've modified the tests so that we propagate a … This allows us to replace … Please have a look and let me know your thoughts!
```java
private void processConsumerRecord(ConsumerRecord<byte[], byte[]> record) {
    final RemoteLogMetadata remoteLogMetadata = serde.deserialize(record.value());
    onConsume.accept(remoteLogMetadata);
```
This is not the correct way to capture the events. Assume that the test case doesn't want to process an event (the shouldProcess check returns false); we don't want that event to be captured.
Instead, we can have a setter method for RemotePartitionMetadataStore and pass a custom implementation, similar to DummyEventHandler, where we capture the event and delegate it to the real implementation.
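A minimal sketch of that suggestion (method names are taken from this thread; the constructor wiring, list-based capture, and exact signatures are assumptions, not the PR's code):

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Delegating store in the spirit of DummyEventHandler: only events the real store
// actually processes are captured, then handling is delegated to the real behaviour,
// so records filtered out by shouldProcess never show up in the captured list.
class CapturingRemotePartitionMetadataStore extends RemotePartitionMetadataStore {
    // Assumption: RemotePartitionMetadataStore has a no-arg constructor; if it
    // requires arguments, they would simply be forwarded from here.
    private final List<RemoteLogSegmentMetadata> handled = new CopyOnWriteArrayList<>();

    @Override
    public void handleRemoteLogSegmentMetadata(RemoteLogSegmentMetadata metadata) {
        handled.add(metadata);                           // capture the processed event
        super.handleRemoteLogSegmentMetadata(metadata);  // delegate to the real behaviour
    }

    List<RemoteLogSegmentMetadata> handledSegmentMetadata() {
        return Collections.unmodifiableList(handled);
    }
}
```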
Thanks for the suggestion! This made me realise there's also a possible race where, even after RemotePartitionMetadataStore::handleRemoteLogSegmentMetadata is invoked, the assertions on topicBasedRlmm().listRemoteLogSegments may fail because remoteLogMetadataCache.isInitialized() may return false.
Inspired by your suggestion to hook on RemotePartitionMetadataStore, I've modified TopicBasedRemoteLogMetadataManagerHarness to accept a spy object for it, which is passed down to ConsumerTask. The tests are modified to ensure handleRemoteLogSegmentMetadata and markInitialized are invoked an appropriate number of times.
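For instance, a sketch of how such verification could look with a Mockito spy (the timeouts, invocation counts, matchers, and the realRemotePartitionMetadataStore variable are illustrative assumptions, not the PR's actual code):

```java
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

// The harness is given a spy wrapping the real store, so ConsumerTask exercises the
// real behaviour while the test can block on the interactions it expects.
RemotePartitionMetadataStore spyStore = spy(realRemotePartitionMetadataStore);

// ... start the harness with spyStore and add the remote log segment metadata ...

// Wait (up to 30s) until both segment-metadata events have been handled and the
// partition has been marked initialized; only then can listRemoteLogSegments be
// asserted without racing remoteLogMetadataCache.isInitialized().
verify(spyStore, timeout(30_000).times(2)).handleRemoteLogSegmentMetadata(any());
verify(spyStore, timeout(30_000).times(1)).markInitialized(any());
```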
Thanks for updating the test, the approach LGTM!
Nit: Why are we using the supplier pattern instead of adding a setter to TopicBasedRemoteLogMetadataManager and marking it as visibleForTesting?
> Nit: Why are we using the supplier pattern instead of adding a setter to TopicBasedRemoteLogMetadataManager and marking it as visibleForTesting?
IIUC, you're alluding to something similar to what we do for remoteLogMetadataTopicPartitioner at
That can work, but I feel it's very easy to introduce a race inadvertently, since TopicBasedRemoteLogMetadataManager::configure spawns a thread.
remoteLogMetadataTopicPartitioner is prone to a race where, if the test thread yields before line 109 is executed, the ProducerManager and ConsumerManager instances can get instantiated with the incorrect remoteLogMetadataTopicPartitioner instance. (We can avoid it by invoking the setter before calling TopicBasedRemoteLogMetadataManager::configure, but I feel it's easier to enforce it by using a Supplier instead.) Either way, I feel this race should be fixed as well now :)
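A rough sketch of why the Supplier closes that window (class shape and method bodies are simplified assumptions, not the actual implementation):

```java
import java.util.Map;
import java.util.function.Supplier;

class TopicBasedRemoteLogMetadataManagerSketch {
    // With a setter there is a window between construction and the setter call in
    // which configure() may already have spawned threads that captured the default
    // instance. With a Supplier fixed at construction time, no such window exists.
    private final Supplier<RemotePartitionMetadataStore> storeSupplier;

    TopicBasedRemoteLogMetadataManagerSketch(Supplier<RemotePartitionMetadataStore> storeSupplier) {
        this.storeSupplier = storeSupplier;
    }

    void configure(Map<String, ?> configs) {
        // Resolved here, before any consumer/producer threads are started, so
        // ConsumerTask always sees the instance the test supplied.
        RemotePartitionMetadataStore store = storeSupplier.get();
        // ... construct ConsumerManager/ProducerManager with `store`, then start threads ...
    }
}
```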
I've created https://issues.apache.org/jira/browse/KAFKA-16712
8586c03 force-pushed to eb6fd01
kamalcph
left a comment
LGTM, thanks for fixing the flaky test!
```diff
 private volatile boolean initializationFailed;
+private final Supplier<RemotePartitionMetadataStore> remoteLogMetadataManagerSupplier;

 public TopicBasedRemoteLogMetadataManager() {
```
Could you add a comment saying this default constructor is required because we create RemoteLogMetadataManager dynamically?
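Presumably something along these lines (the comment wording and the delegation target are illustrative assumptions based on the constructor shown in the next hunk):

```java
/**
 * Required default constructor: brokers instantiate the configured
 * RemoteLogMetadataManager reflectively from its class name, which needs a
 * public no-arg constructor.
 */
public TopicBasedRemoteLogMetadataManager() {
    this(true, RemotePartitionMetadataStore::new);
}
```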
```diff
 // Visible for testing.
-public TopicBasedRemoteLogMetadataManager(boolean startConsumerThread) {
+public TopicBasedRemoteLogMetadataManager(boolean startConsumerThread, Supplier<RemotePartitionMetadataStore> remoteLogMetadataManagerSupplier) {
```
It seems package-private is enough for testing, right?
chia7712
left a comment
LGTM
I have re-triggered QA and will merge if there are no objections.
satishd
left a comment
Thanks @gaurav-narula for addressing the review comments, this approach LGTM.
Some tests in TopicBasedRemoteLogMetadataManagerTest flake because `waitUntilConsumerCatchesUp` may break early before the consumer manager has caught up with all the events. This change allows passing a spy object for `RemotePartitionMetadataStore` down to `ConsumerTask`, which lets the test code ensure the methods on it were invoked an appropriate number of times before performing assertions. Refer to the [Gradle Enterprise Report](https://ge.apache.org/scans/tests?search.timeZoneId=Europe%2FLondon&tests.container=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest) for more information on the flakiness.
a8ba568 force-pushed to 58424dc
@chia7712 looks like we still suffer from thread leaks in CI :( I've rebased from trunk to trigger CI again
I have noticed that too. so sad :(
KAFKA-16686: Wait for given offset in TopicBasedRemoteLogMetadataManagerTest (apache#15885)

Some tests in TopicBasedRemoteLogMetadataManagerTest flake because waitUntilConsumerCatchesUp may break early before the consumer manager has caught up with all the events. This PR adds expected offsets for the leader/follower metadata partitions and ensures we wait for the offset to be at least equal to the argument, to avoid flakiness.

Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
Some tests in TopicBasedRemoteLogMetadataManagerTest flake because waitUntilConsumerCatchesUp may break early before the consumer manager has caught up with all the events. This PR adds expected offsets for the leader/follower metadata partitions and ensures we wait for the offset to be at least equal to the argument, to avoid flakiness.
Refer to the Gradle Enterprise Report for more information on the flakiness.