Skip to content

KAFKA-7670: Admin client testUnreachableBootstrapServer() flaky test#5942

Merged
hachikuji merged 1 commit intoapache:trunkfrom
stanislavkozlovski:KAFKA-7670-admin-client-flaky-test
Feb 14, 2019
Merged

KAFKA-7670: Admin client testUnreachableBootstrapServer() flaky test#5942
hachikuji merged 1 commit intoapache:trunkfrom
stanislavkozlovski:KAFKA-7670-admin-client-flaky-test

Conversation

@stanislavkozlovski
Copy link
Copy Markdown
Contributor

@stanislavkozlovski stanislavkozlovski commented Nov 22, 2018

This test easily fails locally around once every 20-30 runs.

java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.

at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:262)
at org.apache.kafka.clients.admin.KafkaAdminClientTest.testUnreachableBootstrapServer(KafkaAdminClientTest.java:277)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)

After spending considerate time debugging, I found out that the weakly-consistent iterator of futureResponses does not iterate through the two responses at all in the failing runs.

No elements are ever removed from futureResponses except in the MockClient#reset() method which I verified is never called. I think the cause is the weakly-consistent iterator of Java which states:

they are guaranteed to traverse elements as they existed upon construction exactly once, and may (but are not guaranteed to) reflect any modifications subsequent to construction.

I'm still unsure how to solve this in the cleanest way. I'm opening this PR as a chance to discuss it with other people

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Nov 22, 2018

You mean that we add a method to the collection while it's being iterated and clear the whole collection even though we never processed some of the elements?

@rajinisivaram
Copy link
Copy Markdown
Contributor

@stanislavkozlovski There is a timing window in the way metadata response is prepared because send of the metadata request can occur on the admin client network thread before the response is added on the main test thread. But metadata requests should be retried and we would expect this to work on retry. So perhaps the issue in the failing case is that retry is not occurring, so may be worth checking if it is retrying and if not, why it is not retrying.

@rajinisivaram
Copy link
Copy Markdown
Contributor

MockClient is not thread-safe. So we can't safely call MockClient#setUnreachable after the admin client is started and it starts sending metadata requests. I think in the failing scenario, setUnreachable is disconnecting after a metadata request is sent by the admin client, but we are not retrying because the disconnect response is lost (due to concurrent disconnect/poll on a class that is not thread-safe). I think we need to either rewrite MockClient#setUnreachable to make it thread-safe or rewrite the test so that it creates the admin client only setUnreachable is invoked on MockClient. Rewriting the test is probably the simpler option.

@stanislavkozlovski
Copy link
Copy Markdown
Contributor Author

@ijuma No, we don't ever clear the collection. I only found out that while iterating through it in some runs we never find any elements inside.

@rajinisivaram I didn't know that, thanks. I can't reproduce this failure when I add more code for debugging purposes - it just doesn't fail. It's unclear to me why that is. I wonder if it is worth it to change the code such that we instantiate the AdminClient on demand

Update: @rajinisivaram just responded as I was writing this. I'll go with the approach of instantiating AdminClient

@lbradstreet
Copy link
Copy Markdown
Contributor

This looks good to me. Would it be less error prone to make the AdminClientUnitTestEnv take a MockClient in the constructor (pre-prepared), instead of relying on the user to create the env, then prepare the MockClient, then initiate?

@stanislavkozlovski
Copy link
Copy Markdown
Contributor Author

@lbradstreet I guess you meant take in a AdminClient? That would result in way more boilerplate I'm afraid.

You're right that this is a bit more error prone. My thinking is that it should be caught in the first test run and as such maybe not have as big as an impact. I don't have strong feelings about this though

@lbradstreet
Copy link
Copy Markdown
Contributor

@stanislavkozlovski, you're probably right that it would be significantly more boilerplate. Do all prepare statements have to take place prior to the initiate call? Perhaps those could go through the admin client env, to ensure proper ordering. I don't feel too strongly about it though - what you've done seems fine overall.

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Dec 8, 2018

@rajinisivaram can you please review these changes?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stanislavkozlovski Couldn't we add a new constructor with a boolean parameter to decide whether adminClient should be created here? So all tests except testUnreachableBootstrapServer don't need to remember to invoke initiate(). And testUnreachableBootstrapServer will be explicitly not creating the admin client, so it will be more obvious that it needs to create it separately. Perhaps that would be less error-prone for the future?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I updated the javadoc to describe that as well

Copy link
Copy Markdown
Contributor

@rajinisivaram rajinisivaram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stanislavkozlovski Thanks for the updates, looks good. Left just a couple of minor comments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This could just call initiate()?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Perhaps call this method createAdminClient() to be consistent with the flag in the constructor?

@hachikuji
Copy link
Copy Markdown
Contributor

hachikuji commented Feb 7, 2019

@stanislavkozlovski I found a simpler explanation for the transient failures. Basically MockClient.poll() is not safe for concurrent sends. Previously we did this inside poll():

        List<ClientResponse> copy = new ArrayList<>(responses);
        ClientResponse response;
        while ((response = this.responses.poll()) != null) {
            response.onComplete();
        }
        return copy;

The responses queue can be modified between the copy and the subsequent poll calls. I changed it to this:

        List<ClientResponse> copy = new ArrayList<>();
        ClientResponse response;
        while ((response = this.responses.poll()) != null) {
            response.onComplete();
            copy.add(response);
        }
        return copy;

After this change, I can no longer reproduce the failure. Of course it's debatable whether we should allow concurrent sends, but these tests currently depend on it.

It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning
@stanislavkozlovski
Copy link
Copy Markdown
Contributor Author

@hachikuji agreed this seems like a cleaner solution. I also tried and failed to reproduce after using your suggested change. Thanks a lot!

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the patch!

@hachikuji hachikuji merged commit d071871 into apache:trunk Feb 14, 2019
hachikuji pushed a commit that referenced this pull request Feb 14, 2019
It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning
hachikuji pushed a commit that referenced this pull request Feb 14, 2019
It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning
hachikuji pushed a commit that referenced this pull request Feb 14, 2019
It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning
hachikuji pushed a commit that referenced this pull request Feb 14, 2019
It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning
@stanislavkozlovski stanislavkozlovski deleted the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30
@stanislavkozlovski stanislavkozlovski restored the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30
@stanislavkozlovski stanislavkozlovski deleted the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30
@stanislavkozlovski stanislavkozlovski restored the KAFKA-7670-admin-client-flaky-test branch February 18, 2019 11:30
jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* ak/trunk: (45 commits)
  KAFKA-7487: DumpLogSegments misreports offset mismatches (apache#5756)
  MINOR: improve JavaDocs about auto-repartitioning in Streams DSL (apache#6269)
  KAFKA-7935: UNSUPPORTED_COMPRESSION_TYPE if ReplicaManager.getLogConfig returns None (apache#6274)
  KAFKA-7895: Fix stream-time reckoning for suppress (apache#6278)
  KAFKA-6569: Move OffsetIndex/TimeIndex logger to companion object  (apache#4586)
  MINOR: add log indicating the suppression time (apache#6260)
  MINOR: Make info logs for KafkaConsumer a bit more verbose (apache#6279)
  KAFKA-7758: Reuse KGroupedStream/KGroupedTable with named repartition topics (apache#6265)
  KAFKA-7884; Docs for message.format.version should display valid values (apache#6209)
  MINOR: Save failed test output to build output directory
  MINOR: add test for StreamsSmokeTestDriver (apache#6231)
  MINOR: Fix bugs identified by compiler warnings (apache#6258)
  KAFKA-6474: Rewrite tests to use new public TopologyTestDriver [part 4] (apache#5433)
  MINOR: fix bypasses in ChangeLogging stores (apache#6266)
  MINOR: Make MockClient#poll() more thread-safe (apache#5942)
  MINOR: drop dbAccessor reference on close (apache#6254)
  KAFKA-7811: Avoid unnecessary lock acquire when KafkaConsumer commits offsets (apache#6119)
  KAFKA-7916: Unify store wrapping code for clarity (apache#6255)
  MINOR: Add missing Alter Operation to Topic supported operations list in AclCommand
  KAFKA-7921: log at error level for missing source topic (apache#6262)
  ...
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
It used to preallocate an array of responses and then complete each response from the original collection sequentially. The problem was that the original collection could have been modified (another thread completing the response) while this was hapenning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants