KAFKA-7670: Admin client testUnreachableBootstrapServer() flaky test by stanislavkozlovski · Pull Request #5942 · apache/kafka

stanislavkozlovski · 2018-11-22T19:59:55Z

This test easily fails locally around once every 20-30 runs.

java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.

at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:262)
at org.apache.kafka.clients.admin.KafkaAdminClientTest.testUnreachableBootstrapServer(KafkaAdminClientTest.java:277)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)

After spending considerate time debugging, I found out that the weakly-consistent iterator of futureResponses does not iterate through the two responses at all in the failing runs.

No elements are ever removed from futureResponses except in the MockClient#reset() method which I verified is never called. I think the cause is the weakly-consistent iterator of Java which states:

they are guaranteed to traverse elements as they existed upon construction exactly once, and may (but are not guaranteed to) reflect any modifications subsequent to construction.

I'm still unsure how to solve this in the cleanest way. I'm opening this PR as a chance to discuss it with other people

ijuma · 2018-11-22T20:12:42Z

You mean that we add a method to the collection while it's being iterated and clear the whole collection even though we never processed some of the elements?

rajinisivaram · 2018-11-23T09:48:12Z

@stanislavkozlovski There is a timing window in the way metadata response is prepared because send of the metadata request can occur on the admin client network thread before the response is added on the main test thread. But metadata requests should be retried and we would expect this to work on retry. So perhaps the issue in the failing case is that retry is not occurring, so may be worth checking if it is retrying and if not, why it is not retrying.

rajinisivaram · 2018-11-23T13:15:13Z

MockClient is not thread-safe. So we can't safely call MockClient#setUnreachable after the admin client is started and it starts sending metadata requests. I think in the failing scenario, setUnreachable is disconnecting after a metadata request is sent by the admin client, but we are not retrying because the disconnect response is lost (due to concurrent disconnect/poll on a class that is not thread-safe). I think we need to either rewrite MockClient#setUnreachable to make it thread-safe or rewrite the test so that it creates the admin client only setUnreachable is invoked on MockClient. Rewriting the test is probably the simpler option.

stanislavkozlovski · 2018-11-23T13:17:10Z

@ijuma No, we don't ever clear the collection. I only found out that while iterating through it in some runs we never find any elements inside.

@rajinisivaram I didn't know that, thanks. I can't reproduce this failure when I add more code for debugging purposes - it just doesn't fail. It's unclear to me why that is. I wonder if it is worth it to change the code such that we instantiate the AdminClient on demand

Update: @rajinisivaram just responded as I was writing this. I'll go with the approach of instantiating AdminClient

lbradstreet · 2018-11-28T00:50:52Z

This looks good to me. Would it be less error prone to make the AdminClientUnitTestEnv take a MockClient in the constructor (pre-prepared), instead of relying on the user to create the env, then prepare the MockClient, then initiate?

stanislavkozlovski · 2018-11-28T10:20:27Z

@lbradstreet I guess you meant take in a AdminClient? That would result in way more boilerplate I'm afraid.

You're right that this is a bit more error prone. My thinking is that it should be caught in the first test run and as such maybe not have as big as an impact. I don't have strong feelings about this though

lbradstreet · 2018-11-29T00:11:51Z

@stanislavkozlovski, you're probably right that it would be significantly more boilerplate. Do all prepare statements have to take place prior to the initiate call? Perhaps those could go through the admin client env, to ensure proper ordering. I don't feel too strongly about it though - what you've done seems fine overall.

ijuma · 2018-12-08T17:44:51Z

@rajinisivaram can you please review these changes?

rajinisivaram · 2019-01-09T14:35:20Z

@stanislavkozlovski Couldn't we add a new constructor with a boolean parameter to decide whether adminClient should be created here? So all tests except testUnreachableBootstrapServer don't need to remember to invoke initiate(). And testUnreachableBootstrapServer will be explicitly not creating the admin client, so it will be more obvious that it needs to create it separately. Perhaps that would be less error-prone for the future?

Sounds good. I updated the javadoc to describe that as well

rajinisivaram

@stanislavkozlovski Thanks for the updates, looks good. Left just a couple of minor comments.

rajinisivaram · 2019-01-10T13:38:59Z

nit: This could just call initiate()?

rajinisivaram · 2019-01-10T13:39:31Z

nit: Perhaps call this method createAdminClient() to be consistent with the flag in the constructor?

hachikuji · 2019-02-07T00:01:10Z

@stanislavkozlovski I found a simpler explanation for the transient failures. Basically MockClient.poll() is not safe for concurrent sends. Previously we did this inside poll():

        List<ClientResponse> copy = new ArrayList<>(responses);
        ClientResponse response;
        while ((response = this.responses.poll()) != null) {
            response.onComplete();
        }
        return copy;

The responses queue can be modified between the copy and the subsequent poll calls. I changed it to this:

        List<ClientResponse> copy = new ArrayList<>();
        ClientResponse response;
        while ((response = this.responses.poll()) != null) {
            response.onComplete();
            copy.add(response);
        }
        return copy;

After this change, I can no longer reproduce the failure. Of course it's debatable whether we should allow concurrent sends, but these tests currently depend on it.

It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning

stanislavkozlovski · 2019-02-14T02:00:33Z

@hachikuji agreed this seems like a cleaner solution. I also tried and failed to reproduce after using your suggested change. Thanks a lot!

hachikuji

LGTM. Thanks for the patch!

It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning

* ak/trunk: (45 commits) KAFKA-7487: DumpLogSegments misreports offset mismatches (apache#5756) MINOR: improve JavaDocs about auto-repartitioning in Streams DSL (apache#6269) KAFKA-7935: UNSUPPORTED_COMPRESSION_TYPE if ReplicaManager.getLogConfig returns None (apache#6274) KAFKA-7895: Fix stream-time reckoning for suppress (apache#6278) KAFKA-6569: Move OffsetIndex/TimeIndex logger to companion object (apache#4586) MINOR: add log indicating the suppression time (apache#6260) MINOR: Make info logs for KafkaConsumer a bit more verbose (apache#6279) KAFKA-7758: Reuse KGroupedStream/KGroupedTable with named repartition topics (apache#6265) KAFKA-7884; Docs for message.format.version should display valid values (apache#6209) MINOR: Save failed test output to build output directory MINOR: add test for StreamsSmokeTestDriver (apache#6231) MINOR: Fix bugs identified by compiler warnings (apache#6258) KAFKA-6474: Rewrite tests to use new public TopologyTestDriver [part 4] (apache#5433) MINOR: fix bypasses in ChangeLogging stores (apache#6266) MINOR: Make MockClient#poll() more thread-safe (apache#5942) MINOR: drop dbAccessor reference on close (apache#6254) KAFKA-7811: Avoid unnecessary lock acquire when KafkaConsumer commits offsets (apache#6119) KAFKA-7916: Unify store wrapping code for clarity (apache#6255) MINOR: Add missing Alter Operation to Topic supported operations list in AclCommand KAFKA-7921: log at error level for missing source topic (apache#6262) ...

It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning

rajinisivaram reviewed Jan 9, 2019

View reviewed changes

rajinisivaram reviewed Jan 10, 2019

View reviewed changes

hachikuji approved these changes Feb 14, 2019

View reviewed changes

hachikuji merged commit d071871 into apache:trunk Feb 14, 2019

stanislavkozlovski deleted the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30

stanislavkozlovski restored the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30

stanislavkozlovski deleted the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30

stanislavkozlovski restored the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30

Conversation

stanislavkozlovski commented Nov 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ijuma commented Nov 22, 2018

Uh oh!

rajinisivaram commented Nov 23, 2018

Uh oh!

rajinisivaram commented Nov 23, 2018

Uh oh!

stanislavkozlovski commented Nov 23, 2018

Uh oh!

lbradstreet commented Nov 28, 2018

Uh oh!

stanislavkozlovski commented Nov 28, 2018

Uh oh!

lbradstreet commented Nov 29, 2018

Uh oh!

ijuma commented Dec 8, 2018

Uh oh!

rajinisivaram Jan 9, 2019

Choose a reason for hiding this comment

Uh oh!

stanislavkozlovski Jan 10, 2019

Choose a reason for hiding this comment

Uh oh!

rajinisivaram left a comment

Choose a reason for hiding this comment

Uh oh!

rajinisivaram Jan 10, 2019

Choose a reason for hiding this comment

Uh oh!

rajinisivaram Jan 10, 2019

Choose a reason for hiding this comment

Uh oh!

hachikuji commented Feb 7, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stanislavkozlovski commented Feb 14, 2019

Uh oh!

hachikuji left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

stanislavkozlovski commented Nov 22, 2018 •

edited

Loading

hachikuji commented Feb 7, 2019 •

edited

Loading