
KAFKA-18441: Fix flaky KafkaAdminClientTest#testAdminClientApisAuthenticationFailure #18735

Merged
AndrewJSchofield merged 6 commits into apache:trunk from FrankYang0529:KAFKA-18441
Jan 30, 2025

Conversation

@FrankYang0529 (Member) commented Jan 29, 2025

The default retry.backoff.ms is 100 ms and the default metadata.recovery.strategy is rebootstrap. If the test can't finish all of its assertions within 100 ms, AdminMetadataManager#rebootstrap is triggered and the authentication error is cleaned up. Since we don't set a next pendingAuthenticationError, there is no further authentication error and the test ultimately fails. Set metadata.recovery.strategy to none to avoid the error.
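A sketch of the resulting test setup (the env-helper call shape is illustrative; only the added METADATA_RECOVERY_STRATEGY_CONFIG entry is the actual change):

```java
// Illustrative only: pinning metadata.recovery.strategy to "none" so a
// rebootstrap can never wipe the pending authentication error mid-test.
try (AdminClientUnitTestEnv env = new AdminClientUnitTestEnv(cluster,
        AdminClientConfig.METADATA_RECOVERY_STRATEGY_CONFIG, "none")) {
    env.kafkaClient().setNodeApiVersions(NodeApiVersions.create());
    env.kafkaClient().createPendingAuthenticationError(cluster.nodes().get(0),
            TimeUnit.DAYS.toMillis(1));
    // ... assertions on the expected AuthenticationException, as before ...
}
```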

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

…ticationFailure

Signed-off-by: PoAn Yang <payang@apache.org>
@AndrewJSchofield (Member) left a comment
I think the change makes the test reliable, but really the handling of pending authentication errors in the MockClient is questionable.

Please revert the removal of @Flaky from this PR. I suggest you open an issue for the problem of pending authentication errors in the MockClient.

}
}

@Flaky("KAFKA-18441")
Member

I don't think this @Flaky should be removed until we have evidence that this has properly fixed the flakiness in the CI. I would wait for 7 days of clean builds before removing it.

Member Author

Thanks for the review. Added the tag back.

AdminClientConfig.RETRY_BACKOFF_MS_CONFIG, "5000"))) {
env.kafkaClient().setNodeApiVersions(NodeApiVersions.create());
env.kafkaClient().createPendingAuthenticationError(cluster.nodes().get(0),
TimeUnit.DAYS.toMillis(1));
Member

This is the suspicious part, I think. The intent of the code was to have an authentication error which lasted for 1 day, giving plenty of time for the other tests to run. In fact, the code in MockClient doesn't appear to do that, and the value of TimeUnit.DAYS.toMillis(1) is questionable.

@lianetm (Member) commented Jan 29, 2025

Agree that there is something about the MockClient we're not getting here (or something is off and we might be hiding it).

This TimeUnit.DAYS.toMillis(1) was supposed to achieve exactly what we're having to do by setting a higher RETRY_BACKOFF_MS_CONFIG. The createPendingAuthenticationError takes the backoff param and sets it as the backoff for the node.

So the MockClient shouldn't generate a new fetchMetadata for 1 day I expect.

if (!connectionState(node.idString()).isBackingOff(now))

That being said, agree with @AndrewJSchofield 's suggestion, we could add the config to stabilize the very noisy test, and have a follow-up jira to check why the MockClient is requiring an explicit backoff config in this case.

Member Author

Hi @AndrewJSchofield and @lianetm, thanks for the review and information. After checking the code again, I realize my PR description was unclear.

With the default retry.backoff.ms (100 ms), if all assertions can't finish within 100 ms, the KafkaAdminClient makes another metadata call:

long metadataFetchDelayMs = metadataManager.metadataFetchDelayMs(now);
if (metadataFetchDelayMs == 0) {
    metadataManager.transitionToUpdatePending(now);
    Call metadataCall = makeMetadataCall(now);
    // Create a new metadata fetch call and add it to the end of pendingCalls.
    // Assign a node for just the new call (we handled the other pending nodes above).
    if (!maybeDrainPendingCall(metadataCall, now))
        pendingCalls.add(metadataCall);
}

The fetchMetadata request uses MetadataUpdateNodeIdProvider:

private Call makeBrokerMetadataCall(long now) {
    // We use MetadataRequest here so that we can continue to support brokers that are too
    // old to handle DescribeCluster.
    return new Call(true, "fetchMetadata", calcDeadlineMs(now, requestTimeoutMs),
            new MetadataUpdateNodeIdProvider()) {

MetadataUpdateNodeIdProvider then triggers AdminMetadataManager#rebootstrap:

private class MetadataUpdateNodeIdProvider implements NodeProvider {
    @Override
    public Node provide() {
        long now = time.milliseconds();
        LeastLoadedNode leastLoadedNode = client.leastLoadedNode(now);
        if (metadataRecoveryStrategy == MetadataRecoveryStrategy.REBOOTSTRAP
                && !leastLoadedNode.hasNodeAvailableOrConnectionReady()) {
            metadataManager.rebootstrap(now);
        }

Finally, the rebootstrap path sets fatalException back to null in AdminMetadataManager#update:

public void update(Cluster cluster, long now) {
    if (cluster.isBootstrapConfigured()) {
        log.debug("Setting bootstrap cluster metadata {}.", cluster);
        bootstrapCluster = cluster;
    } else {
        log.debug("Updating cluster metadata to {}", cluster);
        this.lastMetadataUpdateMs = now;
    }
    this.state = State.QUIESCENT;
    this.fatalException = null;

The test relies on AdminMetadataManager#isReady to throw the error. If fatalException is null, the assertion can't be satisfied:

public boolean isReady() {
    if (fatalException != null) {
        log.debug("Metadata is not usable: failed to get metadata.", fatalException);
        throw fatalException;
    }
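The interaction above can be condensed into a tiny standalone model (not Kafka's real classes; names are simplified) showing why the assertion becomes racy once a rebootstrap clears the fatal exception:

```java
// Minimal sketch: once the retry backoff (default 100 ms) elapses and the
// rebootstrap path runs, the recorded fatal exception is cleared, so a later
// readiness check no longer throws and the test's assertion misses the error.
public class RebootstrapRaceSketch {
    static class MetadataManagerModel {
        RuntimeException fatalException;

        void recordAuthError(RuntimeException e) { fatalException = e; }

        // Mirrors the effect of update() after rebootstrap: fatalException is reset.
        void rebootstrap() { fatalException = null; }

        void checkReady() {
            if (fatalException != null) throw fatalException;
        }
    }

    public static void main(String[] args) {
        MetadataManagerModel mgr = new MetadataManagerModel();
        mgr.recordAuthError(new RuntimeException("authentication failed"));

        boolean throwsBefore = false;
        try { mgr.checkReady(); } catch (RuntimeException e) { throwsBefore = true; }

        // Backoff elapses before the test's assertions run -> rebootstrap fires.
        mgr.rebootstrap();
        boolean throwsAfter = false;
        try { mgr.checkReady(); } catch (RuntimeException e) { throwsAfter = true; }

        System.out.println(throwsBefore + " " + throwsAfter); // true false
    }
}
```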

Member

Thanks for the description. I've approved the PR and will merge once the build is complete. However, my question really was whether this is desirable behaviour. I would say that the original author intended to "pin" the authentication failure to last for a day so that all operations failed and a multi-part test could be performed reliably. Something has changed over the past couple of months making this unreliable. So, I expect there's a follow-on piece of work in here.

@lianetm (Member) commented Jan 29, 2025

Interesting, it seems to me the changes to rebootstrap for metadata shared above may be the ones behind this then (from around Nov 2024, btw).

private class MetadataUpdateNodeIdProvider implements NodeProvider {
    @Override
    public Node provide() {
        long now = time.milliseconds();
        LeastLoadedNode leastLoadedNode = client.leastLoadedNode(now);
        if (metadataRecoveryStrategy == MetadataRecoveryStrategy.REBOOTSTRAP
                && !leastLoadedNode.hasNodeAvailableOrConnectionReady()) {
            metadataManager.rebootstrap(now);
        }

Even though the test is setting a 1-day backoff, that only makes the MockClient return a null node as the least-loaded node (and this now triggers a new metadata request that, I guess, didn't trigger before the rebootstrap logic).

Member

Yes, I expect that's it.

Member

We should probably set metadata.recovery.strategy to none to replicate the original behavior, where fatal errors are not automatically reset. Also, we should add the root cause (the AuthenticationException is reset due to rebootstrap) to the comment.

Member

Yes, I think that's a better solution and it seems to work in my local testing. @FrankYang0529 wdyt? How about changing the extra config to AdminClientConfig.METADATA_RECOVERY_STRATEGY_CONFIG, "none"?

Member Author

@chia7712 @AndrewJSchofield, thanks for the great suggestion. Updated it.

Member

Looks good to me now. I'll merge and cherry-pick to 4.0.

Signed-off-by: PoAn Yang <payang@apache.org>
@AndrewJSchofield AndrewJSchofield merged commit 0dfc401 into apache:trunk Jan 30, 2025
@FrankYang0529 FrankYang0529 deleted the KAFKA-18441 branch January 30, 2025 10:28
AndrewJSchofield pushed a commit that referenced this pull request Jan 30, 2025
…ticationFailure (#18735)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>
pdruley pushed a commit to pdruley/kafka that referenced this pull request Feb 12, 2025
…ticationFailure (apache#18735)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>
manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025
…ticationFailure (apache#18735)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>

Labels

clients, small (Small PRs), tests (Test fixes, including flaky tests)


4 participants