KAFKA-18441: Fix flaky KafkaAdminClientTest#testAdminClientApisAuthenticationFailure (#18735)
Conversation
…ticationFailure Signed-off-by: PoAn Yang <payang@apache.org>
AndrewJSchofield left a comment
I think the change makes the test reliable, but really the handling of pending authentication errors in the MockClient is questionable.
Please revert the removal of @Flaky from this PR. I suggest you open an issue for the problem of pending authentication errors in the MockClient.
```java
    }
}

@Flaky("KAFKA-18441")
```
I don't think this @Flaky should be removed until we have evidence that this has properly fixed the flakiness in the CI. I would wait for 7 days of clean builds before removing it.
Thanks for the review. Added the tag back.
```java
        AdminClientConfig.RETRY_BACKOFF_MS_CONFIG, "5000"))) {
    env.kafkaClient().setNodeApiVersions(NodeApiVersions.create());
    env.kafkaClient().createPendingAuthenticationError(cluster.nodes().get(0),
            TimeUnit.DAYS.toMillis(1));
```
This is the suspicious part, I think. The intent of the code was presumably to have an authentication error which lasted for 1 day, giving plenty of time for the other tests to run. In fact, the code in MockClient doesn't appear to do that, and the value of TimeUnit.DAYS.toMillis(1) is questionable.
Agree that there is something about the MockClient we're not getting here (or it is off and we might be hiding it).
This TimeUnit.DAYS.toMillis(1) was supposed to achieve exactly what we're now having to do by setting a higher RETRY_BACKOFF_MS_CONFIG. The createPendingAuthenticationError takes the backoff param and sets it as the backoff for the node, so the MockClient shouldn't generate a new fetchMetadata for 1 day, I expect.
That being said, I agree with @AndrewJSchofield's suggestion: we could add the config to stabilize the very noisy test, and have a follow-up Jira to check why the MockClient is requiring an explicit backoff config in this case.
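For context, here is a minimal sketch of the behaviour described above: a backoff-pinned authentication error makes the mock report no usable node until the backoff expires. This is a simplified toy model, not the real org.apache.kafka.clients.MockClient API; the class and field names are illustrative.

```java
import java.util.concurrent.TimeUnit;

// Toy model (NOT the real MockClient): a pending authentication error
// "pins" the only node as unavailable until the backoff expires, so
// leastLoadedNode() finds nothing to talk to in the meantime.
class SimpleMockClient {
    private long authErrorExpiryMs = -1;
    private final String node = "node-0";

    void createPendingAuthenticationError(long nowMs, long backoffMs) {
        authErrorExpiryMs = nowMs + backoffMs;
    }

    // Returns null while the node is still backing off, mimicking what
    // the comment above describes for the 1-day backoff.
    String leastLoadedNode(long nowMs) {
        return nowMs < authErrorExpiryMs ? null : node;
    }

    public static void main(String[] args) {
        SimpleMockClient client = new SimpleMockClient();
        client.createPendingAuthenticationError(0, TimeUnit.DAYS.toMillis(1));
        System.out.println(client.leastLoadedNode(0));                          // prints "null"
        System.out.println(client.leastLoadedNode(TimeUnit.DAYS.toMillis(2)));  // prints "node-0"
    }
}
```

The point of the sketch: with only a pinned node available, the client used to simply stay blocked; the rebootstrap logic discussed below changes what happens when no node is available.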
Hi @AndrewJSchofield and @lianetm, thanks for the review and information. After checking the code again, I realized my PR description is unclear.
With the default "retry.backoff.ms", if all assertions can't finish within 100 ms, the KafkaAdminClient makes a metadata call again.
kafka/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java
Lines 1541 to 1550 in 048dfef
In the fetchMetadata request, it uses MetadataUpdateNodeIdProvider.
kafka/clients/src/main/java/org/apache/kafka/clients/admin/KafkaAdminClient.java
Lines 1686 to 1690 in 048dfef
In MetadataUpdateNodeIdProvider, it triggers AdminMetadataManager#rebootstrap.
Finally, the rebootstrap function resets fatalException to null in AdminMetadataManager#update.
This case relies on AdminMetadataManager#isReady to throw an error. If fatalException is null, the assertion can't be fulfilled.
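The sequence described above can be condensed into a toy model: a fatal error is recorded, the retry backoff elapses, the rebootstrap path clears the fatal exception, and the test's assertion no longer sees it. The class and method names below are hypothetical, not the real AdminMetadataManager API.

```java
// Toy model of the interaction described above; NOT the real
// org.apache.kafka.clients.admin internals.
class ToyMetadataManager {
    private Exception fatalException;
    private final long retryBackoffMs;
    private long lastUpdateMs = 0;

    ToyMetadataManager(long retryBackoffMs) {
        this.retryBackoffMs = retryBackoffMs;
    }

    void updateFailed(Exception e, long nowMs) {
        fatalException = e;
        lastUpdateMs = nowMs;
    }

    // With a "rebootstrap"-style recovery strategy, a metadata refresh
    // after the backoff clears the fatal exception -- which is exactly
    // what defeats the test's assertion that isReady() keeps throwing.
    void maybeRebootstrap(long nowMs) {
        if (nowMs - lastUpdateMs >= retryBackoffMs) {
            fatalException = null;
        }
    }

    boolean hasFatalError() {
        return fatalException != null;
    }

    public static void main(String[] args) {
        ToyMetadataManager m = new ToyMetadataManager(100);
        m.updateFailed(new IllegalStateException("authentication failed"), 0);
        System.out.println(m.hasFatalError());  // prints "true": error pinned
        m.maybeRebootstrap(150);                // backoff (100 ms) has elapsed
        System.out.println(m.hasFatalError());  // prints "false": cleared
    }
}
```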
Thanks for the description. I've approved the PR and will merge once the build is complete. However, my question really was whether this is desirable behaviour. I would say that the original author intended to "pin" the authentication failure to last for a day so that all operations failed and a multi-part test could be performed reliably. Something has changed over the past couple of months making this unreliable. So, I expect there's a follow-on piece of work in here.
Interesting. It seems to me the changes to rebootstrap for metadata shared above are maybe the ones behind this then (from around Nov 2024, btw).
Even though the test sets a 1-day backoff, that only makes the MockClient return a null node as leastLoaded (and this will now trigger a new metadata request that I guess didn't trigger before the rebootstrap logic).
Yes, I expect that's it.
We should probably set metadata.recovery.strategy to none to replicate the original behavior, where fatal errors are not automatically reset. Also, we should add the root cause (the AuthenticationException is reset due to rebootstrap) to the comment.
Yes, I think that's a better solution, and it seems to work in my local testing. @FrankYang0529, wdyt? How about changing the extra config to AdminClientConfig.METADATA_RECOVERY_STRATEGY_CONFIG, "none"?
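Concretely, the suggested config entry would look roughly like the sketch below. The class name TestConfigSketch and the bare string keys are illustrative; the real test would pass AdminClientConfig.RETRY_BACKOFF_MS_CONFIG and AdminClientConfig.METADATA_RECOVERY_STRATEGY_CONFIG constants into its existing mock-environment setup rather than a standalone map.

```java
import java.util.Map;

public class TestConfigSketch {
    public static void main(String[] args) {
        // String keys shown for readability; the test would use the
        // AdminClientConfig constants instead.
        Map<String, Object> config = Map.of(
                "retry.backoff.ms", "100",
                // "none" disables rebootstrap, so the pending
                // AuthenticationException is not cleared by an automatic
                // metadata refresh mid-test.
                "metadata.recovery.strategy", "none");
        System.out.println(config.get("metadata.recovery.strategy"));  // prints "none"
    }
}
```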
@chia7712 @AndrewJSchofield, thanks for the great suggestion. Updated it.
Looks good to me now. I'll merge and cherry-pick to 4.0.
Signed-off-by: PoAn Yang <payang@apache.org>
…ticationFailure (#18735) Reviewers: Lianet Magrans <lmagrans@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>
The default retry.backoff.ms is 100L and the default metadata.recovery.strategy is rebootstrap. If the case can't finish all assertions within 100 ms, AdminMetadataManager#rebootstrap will be triggered and the authentication error will be cleaned up. Since we don't set a next pendingAuthenticationErrors, there will be no further authentication error, and the case will fail. Set metadata.recovery.strategy to none to avoid the error.