KAFKA-18513: Validate share state topic records produced in tests. #18521

Merged
AndrewJSchofield merged 10 commits into apache:trunk from smjn:KAFKA-18513
Jan 15, 2025

Conversation

@smjn
Collaborator

@smjn smjn commented Jan 14, 2025

  • Currently, the tests that use DefaultStatePersister do not validate the records written to the __share_group_state topic.
  • This PR adds a small utility method that performs this validation in the relevant tests.

@github-actions bot added the labels triage (PRs from the community), core (Kafka Broker), tests (Test fixes (including flaky tests)), and small (Small PRs) on Jan 14, 2025
@AndrewJSchofield added the labels KIP-932 (Queues for Kafka) and ci-approved, and removed the label triage (PRs from the community), on Jan 14, 2025
Member

@AndrewJSchofield AndrewJSchofield left a comment

Thanks for the PR. Just a few minor comments.

DEFAULT_MAX_WAIT_MS, 100L, () -> "cache not up yet");
Map<String, Object> consumerConfigs = new HashMap<>();
consumerConfigs.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, cluster.bootstrapServers());
consumerConfigs.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString());
Member

I think you could use KafkaConsumer.assign() instead so that the test doesn't have to wait for the consumer to join the consumer group.

Collaborator Author

This too is causing flakiness.

Member

I think that implies that the test would also be flaky with subscribe instead of assign, just that it runs slightly more slowly so the risk of flakiness is lower.

Collaborator Author

Perhaps, but since the pattern is the same as in the existing tests, this addition should not decrease reliability.

Collaborator Author

@smjn smjn Jan 14, 2025

@AndrewJSchofield It was a blunder on my part. The SGS topic is set to have 3 partitions in the cluster config and I was only assigning to 1 partition. Will rectify.
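
To illustrate the mistake described in the comment above, here is a minimal sketch (not the PR's code): when records are spread over three partitions and the reader is assigned only one of them, it sees only a subset. The class, method names, and sample records are hypothetical stand-ins that do not use the Kafka client. With the real client, the corresponding fix would presumably be to pass every partition of the topic to KafkaConsumer.assign().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; these are not Kafka classes.
final class PartitionCoverageSketch {

    // Collects the records a reader would see if it were assigned only the given partitions.
    static List<String> readAssigned(Map<Integer, List<String>> topic, List<Integer> assigned) {
        List<String> seen = new ArrayList<>();
        for (int partition : assigned) {
            seen.addAll(topic.getOrDefault(partition, List.of()));
        }
        return seen;
    }

    // Made-up contents of a three-partition topic (partition id -> records).
    static Map<Integer, List<String>> sampleTopic() {
        return Map.of(
            0, List.of("warmup-0"),
            1, List.of("state-a", "state-b"),
            2, List.of("warmup-1"));
    }
}
```

Reading with `List.of(0)` returns only the single record in partition 0, while reading with `List.of(0, 1, 2)` returns all four records, which is why a count-based check can never pass when only one partition is assigned.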

if (msgs.count() > 0) {
msgs.records(Topic.SHARE_GROUP_STATE_TOPIC_NAME).forEach(records::add);
}
return records.size() == messageCount + 2; // +2 because of extra warmup records
Member

nit: extra space before ==

consumerConfigs.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString());
consumerConfigs.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
consumerConfigs.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
consumerConfigs.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
Member

Then you could also use seekToBeginning()

Collaborator Author

This is causing flakiness, so I will keep the config.

@smjn smjn requested a review from AndrewJSchofield January 14, 2025 11:39
Member

@AndrewJSchofield AndrewJSchofield left a comment

It seems to me that testMultipleConsumersInGroupSequentialConsumption is flaky with this PR.

@smjn smjn requested a review from AndrewJSchofield January 14, 2025 16:48
@smjn
Collaborator Author

smjn commented Jan 14, 2025

> It seems to me that testMultipleConsumersInGroupSequentialConsumption is flaky with this PR.

It appears that this error occurs occasionally:

[2025-01-14 22:23:21,456] ERROR [group-coordinator-event-processor-0]: Failed to run event CoordinatorWriteEvent(name=share-group-heartbeat) due to: Histogram recorded value cannot be negative.. (org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread:150)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be negative.
	at org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399) ~[HdrHistogram-2.2.2.jar:?]

This seems to result in off-by-one errors: sometimes there are 4 RPCs and sometimes 5. I will remove this check from this test for now.

@AndrewJSchofield
Member

I also had a failure locally in testConsumerCloseInGroupSequential because the number of records was incorrect. I expect it's unpredictable for the more complex tests.

@smjn
Collaborator Author

smjn commented Jan 15, 2025

> I also had a failure locally in testConsumerCloseInGroupSequential because the number of records was incorrect. I expect it's unpredictable for the more complex tests.

Then maybe an exact record count match is too strict. Perhaps we should only check that records are being produced. Another reason for this is RPC coalescing and batching, which can make the number of records actually written differ from run to run and, as a result, make the tests brittle.
@AndrewJSchofield
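
The relaxation proposed in the comment above can be sketched roughly as follows. This is a hedged, standard-library-only illustration: `waitForCondition` here is a hypothetical stand-in loosely modelled on the polling style of the TestUtils.waitForCondition call visible in the diff, and `atLeast` with its threshold is an assumption rather than the code that was merged.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch for illustration only; not the helper added in this PR.
final class StateTopicCheckSketch {

    // Brittle: fails whenever coalescing or batching changes the number of records written.
    static boolean exactCount(int observed, int expected) {
        return observed == expected;
    }

    // Tolerant: only requires that at least minExpected records were written.
    static boolean atLeast(int observed, int minExpected) {
        return observed >= minExpected;
    }

    // Polls the condition until it holds or maxWaitMs elapses; returns whether it held.
    static boolean waitForCondition(BooleanSupplier condition, long maxWaitMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With 4 RPCs in one run and 5 in the next, `exactCount(5, 4)` fails while `atLeast(5, 1)` and `atLeast(4, 1)` both pass, which matches the off-by-one behaviour reported earlier in the thread.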

Member

@AndrewJSchofield AndrewJSchofield left a comment

Thanks for the changes. I ran this test repeatedly yesterday and I think it's time to remove the no-op persister from this test now. The new changes add about 6 minutes to the execution time. The no-op persister is not adding any value any longer and it takes about 6 minutes too. Please remove the parameterisation of the persister and just use the default persister in the tests.

}

- private void maybeVerifyShareGroupStateTopicRecordCount(String persister, int messageCount) {
+ private void maybeVerifyShareGroupStateTopicRecordCount(String persister) {
Member

The method name now seems redundant. I suggest verifyShareGroupStateTopicRecordsProduced.

@github-actions bot removed the small (Small PRs) label on Jan 15, 2025
@smjn smjn requested a review from AndrewJSchofield January 15, 2025 09:22
@smjn changed the title from "KAFKA-18513: Validate share state topic record count in tests." to "KAFKA-18513: Validate share state topic records produced in tests." on Jan 15, 2025
Member

@AndrewJSchofield AndrewJSchofield left a comment

Just a couple of additional comments. Thanks for the update.

.setConfigProp("group.share.enable", "true")
.setConfigProp("group.share.partition.max.record.locks", "10000")
- .setConfigProp("group.share.persister.class.name", persisterClassName)
+ .setConfigProp("group.share.persister.class.name", DEFAULT_STATE_PERSISTER)
Member

This is the default. Not required.

DEFAULT_MAX_WAIT_MS, 100L, () -> "cache not up yet");
Map<String, Object> consumerConfigs = new HashMap<>();
consumerConfigs.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, cluster.bootstrapServers());
consumerConfigs.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString());
Member

I think you can remove the group.id config too because of the use of KafkaConsumer.assign.
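
As a rough sketch of what the suggestion above amounts to, the consumer configuration for an assign-only reader could omit `group.id` entirely. The class and method names below are hypothetical, and the bootstrap address is supplied by the caller. The config keys are written as plain strings so that the snippet compiles without the Kafka client on the classpath; in the test they would normally be the ConsumerConfig constants shown in the diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch for illustration only; not the code merged in this PR.
final class AssignOnlyConsumerConfigSketch {

    // Builds consumer configs for a reader that uses manual partition assignment,
    // so no group.id entry is set.
    static Map<String, Object> consumerConfigs(String bootstrapServers) {
        Map<String, Object> configs = new HashMap<>();
        configs.put("bootstrap.servers", bootstrapServers);
        configs.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        configs.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // Start from the beginning of each assigned partition when no position exists.
        configs.put("auto.offset.reset", "earliest");
        return configs;
    }
}
```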

@smjn smjn requested a review from AndrewJSchofield January 15, 2025 09:40
Member

@AndrewJSchofield AndrewJSchofield left a comment

One final comment. When the build next finishes, please can you triage the failed tests and make sure there are open issues for failed tests. ClientUtilsTest#testParseAndValidateAddressesWithReverseLookup is expected to fail until #18549 is merged. Thanks.


private void verifyShareGroupStateTopicRecordsProduced() {
try {
TestUtils.waitForCondition(() ->
Member

This condition will already be satisfied because it mirrors the warmup() method which runs before each test. I think you can remove this as redundant code.

@smjn
Collaborator Author

smjn commented Jan 15, 2025

> One final comment. When the build next finishes, please can you triage the failed tests and make sure there are open issues for failed tests. ClientUtilsTest#testParseAndValidateAddressesWithReverseLookup is expected to fail until #18549 is merged. Thanks.

They seem to be completely unrelated to our work.

@AndrewJSchofield
Member

> > One final comment. When the build next finishes, please can you triage the failed tests and make sure there are open issues for failed tests. ClientUtilsTest#testParseAndValidateAddressesWithReverseLookup is expected to fail until #18549 is merged. Thanks.
>
> They seem to be completely unrelated to our work.

True, but if some of the failed tests are flaky but not marked as flaky, then we ought to do something about it.

@smjn
Collaborator Author

smjn commented Jan 15, 2025

> > > One final comment. When the build next finishes, please can you triage the failed tests and make sure there are open issues for failed tests. ClientUtilsTest#testParseAndValidateAddressesWithReverseLookup is expected to fail until #18549 is merged. Thanks.
> >
> > They seem to be completely unrelated to our work.
>
> True, but if some of the failed tests are flaky but not marked as flaky, then we ought to do something about it.

I ran each of the failed tests multiple times:

NO ERRORS

UserQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=classic
MetricsDuringTopicCreationDeletionTest > "testMetricsDuringTopicCreateDelete(String).quorum=kraft"
PlaintextAdminIntegrationTest > "testElectPreferredLeaders(String).quorum=kraft"
AbstractCoordinatorTest > testWakeupAfterSyncGroupReceivedExternalCompletion() [ALREADY MARKED FLAKY]

FLAKE FOUND

https://issues.apache.org/jira/browse/KAFKA-18550

 KafkaAdminClientTest > testAdminClientApisAuthenticationFailure()

org.opentest4j.AssertionFailedError: Expected an authentication error, but got java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: createPartitions
	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:155)
	at org.apache.kafka.clients.admin.KafkaAdminClientTest.lambda$callAdminClientApisAndExpectAnAuthenticationError$38(KafkaAdminClientTest.java:1808)
	at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:53)
	at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35)
	at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3115)
	at org.apache.kafka.clients.admin.KafkaAdminClientTest.callAdminClientApisAndExpectAnAuthenticationError(KafkaAdminClientTest.java:1808)
	at org.apache.kafka.clients.admin.KafkaAdminClientTest.testAdminClientApisAuthenticationFailure(KafkaAdminClientTest.java:1793)

https://issues.apache.org/jira/browse/KAFKA-18551

EagerConsumerCoordinatorTest  > testOutdatedCoordinatorAssignment()

org.opentest4j.AssertionFailedError: 
Expected :1
Actual   :2

	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
	at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinatorTest.testOutdatedCoordinatorAssignment(ConsumerCoordinatorTest.java:1066)
	at org.apache.kafka.clients.consumer.internals.EagerConsumerCoordinatorTest.some(EagerConsumerCoordinatorTest.java:29)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)

@AndrewJSchofield

@smjn smjn requested a review from AndrewJSchofield January 15, 2025 15:04
Member

@AndrewJSchofield AndrewJSchofield left a comment

lgtm. Will merge once the build is complete.

@AndrewJSchofield AndrewJSchofield merged commit e3a56f3 into apache:trunk Jan 15, 2025
pranavt84 pushed a commit to pranavt84/kafka that referenced this pull request Jan 27, 2025
airlock-confluentinc Bot pushed a commit to confluentinc/kafka that referenced this pull request Jan 27, 2025
manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025
Labels

ci-approved core Kafka Broker KIP-932 Queues for Kafka tests Test fixes (including flaky tests)

2 participants