KAFKA-16686: Wait for given offset in TopicBasedRemoteLogMetadataManagerTest #15885
chia7712 merged 2 commits into apache:trunk from
Conversation
kamalcph
left a comment
Thanks @gaurav-narula for the patch! Left minor comments to address.
We can rewrite this test to be concise. I'll file a separate ticket for this.
```diff
 if (leaderMetadataPartition == followerMetadataPartition) {
-    if (topicBasedRlmm().readOffsetForPartition(leaderMetadataPartition).orElse(-1L) >= 1) {
+    Assertions.assertEquals(targetLeaderMetadataPartitionOffset, targetFollowerMetadataPartitionOffset);
+    if (topicBasedRlmm().readOffsetForPartition(leaderMetadataPartition).orElse(-1L) >= targetLeaderMetadataPartitionOffset) {
```
Previously we were waiting for >= 1; after this change, >= 0. This will make the test more flaky.
When the leader and follower partitions map to the same metadata partition, we have to wait for twice the number of messages.
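For illustration, a rough sketch of a wait condition that accounts for this (helper names follow the test code in the diff above; the exact combined-offset calculation is an assumption, not the PR's code):

```java
// Sketch only: when both topic-id-partitions map to the same metadata partition,
// leader and follower events are appended to one partition, so the consumer must
// reach roughly twice as many records before the assertions can run safely.
long expectedOffset = (leaderMetadataPartition == followerMetadataPartition)
        ? targetLeaderMetadataPartitionOffset + targetFollowerMetadataPartitionOffset
        : targetLeaderMetadataPartitionOffset;
boolean caughtUp = topicBasedRlmm()
        .readOffsetForPartition(leaderMetadataPartition)
        .orElse(-1L) >= expectedOffset;
```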
satishd
left a comment
Thanks @gaurav-narula for the PR, left a meta comment.
```diff
 TopicIdPartition newFollowerTopicIdPartition,
-long timeoutMs) throws TimeoutException {
+long timeoutMs,
+long targetLeaderMetadataPartitionOffset,
```
These parameters will not help much here, as this method was written for testNewPartitionUpdates, but other tests in this class used the functionality with gaps. It is better to revisit those use cases and refactor this method accordingly.
d2684d3 force-pushed to 8586c03
Thanks for the feedback @kamalcph @satishd! I've modified the tests so that we propagate a … This allows us to replace … Please have a look and let me know your thoughts!
```java
private void processConsumerRecord(ConsumerRecord<byte[], byte[]> record) {
    final RemoteLogMetadata remoteLogMetadata = serde.deserialize(record.value());
    onConsume.accept(remoteLogMetadata);
```
This is not the correct way to capture the events. Assume that the test case doesn't want to process an event (the shouldProcess check returns false); we don't want that event to be captured.
Instead, we can have a setter method for RemotePartitionMetadataStore and pass a custom implementation, similar to DummyEventHandler, where we capture the event and delegate it to the real implementation.
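A minimal sketch of that suggestion (method names are taken from this thread; the constructor wiring, list-based capture, and exact signatures are assumptions, not the PR's code):

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Delegating store in the spirit of DummyEventHandler: only events the real store
// actually processes are captured, then handling is delegated to the real behaviour,
// so records filtered out by shouldProcess never show up in the captured list.
class CapturingRemotePartitionMetadataStore extends RemotePartitionMetadataStore {
    // Assumption: RemotePartitionMetadataStore has a no-arg constructor; if it
    // requires arguments, they would simply be forwarded from here.
    private final List<RemoteLogSegmentMetadata> handled = new CopyOnWriteArrayList<>();

    @Override
    public void handleRemoteLogSegmentMetadata(RemoteLogSegmentMetadata metadata) {
        handled.add(metadata);                           // capture the processed event
        super.handleRemoteLogSegmentMetadata(metadata);  // delegate to the real behaviour
    }

    List<RemoteLogSegmentMetadata> handledSegmentMetadata() {
        return Collections.unmodifiableList(handled);
    }
}
```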
Thanks for the suggestion! This made me realise there's also a possible race where, even after RemotePartitionMetadataStore::handleRemoteLogSegmentMetadata is invoked, the assertions on topicBasedRlmm().listRemoteLogSegments may fail because remoteLogMetadataCache.isInitialized() may return false.
Inspired by your suggestion to hook on RemotePartitionMetadataStore, I've modified TopicBasedRemoteLogMetadataManagerHarness to accept a spy object for it, which is passed down to ConsumerTask. The tests are modified to ensure handleRemoteLogSegmentMetadata and markInitialized are invoked an appropriate number of times.
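For instance, a sketch of how such verification could look with a Mockito spy (the timeouts, invocation counts, matchers, and the realRemotePartitionMetadataStore variable are illustrative assumptions, not the PR's actual code):

```java
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.spy;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

// The harness is given a spy wrapping the real store, so ConsumerTask exercises the
// real behaviour while the test can block on the interactions it expects.
RemotePartitionMetadataStore spyStore = spy(realRemotePartitionMetadataStore);

// ... start the harness with spyStore and add the remote log segment metadata ...

// Wait (up to 30s) until both segment-metadata events have been handled and the
// partition has been marked initialized; only then can listRemoteLogSegments be
// asserted without racing remoteLogMetadataCache.isInitialized().
verify(spyStore, timeout(30_000).times(2)).handleRemoteLogSegmentMetadata(any());
verify(spyStore, timeout(30_000).times(1)).markInitialized(any());
```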
Thanks for updating the test, the approach LGTM!
Nit: Why are we using the supplier pattern instead of adding a setter to TopicBasedRemoteLogMetadataManager and marking it as visibleForTesting?
> Nit: Why are we using the supplier pattern instead of adding a setter to TopicBasedRemoteLogMetadataManager and marking it as visibleForTesting?
IIUC, you're alluding to something similar to what we do for remoteLogMetadataTopicPartitioner at
That can work, but I feel it's very easy to introduce a race inadvertently, since TopicBasedRemoteLogMetadataManager::configure spawns a thread.
remoteLogMetadataTopicPartitioner is prone to a race where, if the test thread yields before line 109 is executed, the ProducerManager and ConsumerManager instances can get instantiated with the incorrect remoteLogMetadataTopicPartitioner instance. (We can avoid it by invoking the setter before calling TopicBasedRemoteLogMetadataManager::configure, but I feel it's easier to enforce it by using a Supplier instead.) Either way, I feel this race should be fixed as well now :)
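A rough sketch of why the Supplier closes that window (class shape and method bodies are simplified assumptions, not the actual implementation):

```java
import java.util.Map;
import java.util.function.Supplier;

class TopicBasedRemoteLogMetadataManagerSketch {
    // With a setter there is a window between construction and the setter call in
    // which configure() may already have spawned threads that captured the default
    // instance. With a Supplier fixed at construction time, no such window exists.
    private final Supplier<RemotePartitionMetadataStore> storeSupplier;

    TopicBasedRemoteLogMetadataManagerSketch(Supplier<RemotePartitionMetadataStore> storeSupplier) {
        this.storeSupplier = storeSupplier;
    }

    void configure(Map<String, ?> configs) {
        // Resolved here, before any consumer/producer threads are started, so
        // ConsumerTask always sees the instance the test supplied.
        RemotePartitionMetadataStore store = storeSupplier.get();
        // ... construct ConsumerManager/ProducerManager with `store`, then start threads ...
    }
}
```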
I've created https://issues.apache.org/jira/browse/KAFKA-16712
8586c03 force-pushed to eb6fd01
kamalcph
left a comment
LGTM, thanks for fixing the flaky test!
```diff
 private volatile boolean initializationFailed;
+private final Supplier<RemotePartitionMetadataStore> remoteLogMetadataManagerSupplier;

 public TopicBasedRemoteLogMetadataManager() {
```
Could you add a comment saying this default constructor is required because we create RemoteLogMetadataManager dynamically?
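Presumably something along these lines (the comment wording and the delegation target are illustrative assumptions based on the constructor shown in the next hunk):

```java
/**
 * Required default constructor: brokers instantiate the configured
 * RemoteLogMetadataManager reflectively from its class name, which needs a
 * public no-arg constructor.
 */
public TopicBasedRemoteLogMetadataManager() {
    this(true, RemotePartitionMetadataStore::new);
}
```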
```diff
 // Visible for testing.
-public TopicBasedRemoteLogMetadataManager(boolean startConsumerThread) {
+public TopicBasedRemoteLogMetadataManager(boolean startConsumerThread, Supplier<RemotePartitionMetadataStore> remoteLogMetadataManagerSupplier) {
```
It seems package-private is enough for testing, right?
chia7712
left a comment
LGTM
I have re-triggered QA and will merge if there are no objections.
satishd
left a comment
Thanks @gaurav-narula for addressing the review comments, this approach LGTM.
Some tests in TopicBasedRemoteLogMetadataManagerTest flake because `waitUntilConsumerCatchesUp` may break early before the consumer manager has caught up with all the events. This change allows passing a spy object for `RemotePartitionMetadataStore` down to `ConsumerTask`, which lets the test code ensure the methods on it were invoked an appropriate number of times before performing assertions. Refer to the [Gradle Enterprise Report](https://ge.apache.org/scans/tests?search.timeZoneId=Europe%2FLondon&tests.container=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest) for more information on the flakiness.
a8ba568 force-pushed to 58424dc
@chia7712 looks like we still suffer from thread leaks in CI :( I've rebased from trunk to trigger CI again
I have noticed that too. so sad :(
KAFKA-16686: Wait for given offset in TopicBasedRemoteLogMetadataManagerTest (apache#15885)

Some tests in TopicBasedRemoteLogMetadataManagerTest flake because waitUntilConsumerCatchesUp may break early before the consumer manager has caught up with all the events. This PR adds expected offsets for the leader/follower metadata partitions and ensures we wait for the offset to be at least equal to the argument, to avoid flakiness.

Reviewers: Satish Duggana <satishd@apache.org>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
Some tests in TopicBasedRemoteLogMetadataManagerTest flake because waitUntilConsumerCatchesUp may break early before the consumer manager has caught up with all the events. This PR adds expected offsets for the leader/follower metadata partitions and ensures we wait for the offset to be at least equal to the argument, to avoid flakiness.
Refer to the Gradle Enterprise Report for more information on the flakiness.