
KAFKA-18441: Fix flaky KafkaAdminClientTest#testAdminClientApisAuthenticationFailure #18735

Merged
AndrewJSchofield merged 6 commits into apache:trunk from FrankYang0529:KAFKA-18441
Jan 30, 2025

Conversation

@FrankYang0529 (Member) commented Jan 29, 2025

The default retry.backoff.ms is 100 ms and the default metadata.recovery.strategy is rebootstrap. If the test can't finish all of its assertions within 100 ms, AdminMetadataManager#rebootstrap is triggered and the authentication error is cleaned up. Since we don't set a next pendingAuthenticationError, there is no further authentication error and the test ultimately fails. Set metadata.recovery.strategy to none to avoid the error.
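A sketch of the resulting test setup (the env-helper call shape is illustrative; only the added METADATA_RECOVERY_STRATEGY_CONFIG entry is the actual change):

```java
// Illustrative only: pinning metadata.recovery.strategy to "none" so a
// rebootstrap can never wipe the pending authentication error mid-test.
try (AdminClientUnitTestEnv env = new AdminClientUnitTestEnv(cluster,
        AdminClientConfig.METADATA_RECOVERY_STRATEGY_CONFIG, "none")) {
    env.kafkaClient().setNodeApiVersions(NodeApiVersions.create());
    env.kafkaClient().createPendingAuthenticationError(cluster.nodes().get(0),
            TimeUnit.DAYS.toMillis(1));
    // ... assertions on the expected AuthenticationException, as before ...
}
```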

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

…ticationFailure

Signed-off-by: PoAn Yang <payang@apache.org>
@AndrewJSchofield (Member) left a comment
I think the change makes the test reliable, but really the handling of pending authentication errors in the MockClient is questionable.

Please revert the removal of @Flaky from this PR. I suggest you open an issue for the problem of pending authentication errors in the MockClient.

}
}

@Flaky("KAFKA-18441")
Member

I don't think this @Flaky should be removed until we have evidence that this has properly fixed the flakiness in the CI. I would wait for 7 days of clean builds before removing it.

Member Author

Thanks for the review. Added the tag back.

AdminClientConfig.RETRY_BACKOFF_MS_CONFIG, "5000"))) {
env.kafkaClient().setNodeApiVersions(NodeApiVersions.create());
env.kafkaClient().createPendingAuthenticationError(cluster.nodes().get(0),
TimeUnit.DAYS.toMillis(1));
Member

This is the suspicious part, I think. The intent of the code was to have an authentication error which lasted for 1 day, giving plenty of time for the other tests to run. In fact, the code in MockClient doesn't appear to do that, and the value of TimeUnit.DAYS.toMillis(1) is questionable.

@lianetm (Member) commented Jan 29, 2025

Agree that there is something about the MockClient we're not getting here (or something is off and we might be hiding it).

This TimeUnit.DAYS.toMillis(1) was supposed to achieve exactly what we're having to do by setting a higher RETRY_BACKOFF_MS_CONFIG. The createPendingAuthenticationError takes the backoff param and sets it as the backoff for the node.

So the MockClient shouldn't generate a new fetchMetadata for 1 day I expect.

if (!connectionState(node.idString()).isBackingOff(now))

That being said, agree with @AndrewJSchofield 's suggestion, we could add the config to stabilize the very noisy test, and have a follow-up jira to check why the MockClient is requiring an explicit backoff config in this case.

Member Author

Hi @AndrewJSchofield and @lianetm, thanks for the review and information. After checking the code again, I realize my PR description was unclear.

With the default retry.backoff.ms (100 ms), if all assertions can't finish within 100 ms, the KafkaAdminClient makes another metadata call:

long metadataFetchDelayMs = metadataManager.metadataFetchDelayMs(now);
if (metadataFetchDelayMs == 0) {
    metadataManager.transitionToUpdatePending(now);
    Call metadataCall = makeMetadataCall(now);
    // Create a new metadata fetch call and add it to the end of pendingCalls.
    // Assign a node for just the new call (we handled the other pending nodes above).
    if (!maybeDrainPendingCall(metadataCall, now))
        pendingCalls.add(metadataCall);
}

The fetchMetadata request uses MetadataUpdateNodeIdProvider:

private Call makeBrokerMetadataCall(long now) {
    // We use MetadataRequest here so that we can continue to support brokers that are too
    // old to handle DescribeCluster.
    return new Call(true, "fetchMetadata", calcDeadlineMs(now, requestTimeoutMs),
            new MetadataUpdateNodeIdProvider()) {

MetadataUpdateNodeIdProvider then triggers AdminMetadataManager#rebootstrap:

private class MetadataUpdateNodeIdProvider implements NodeProvider {
    @Override
    public Node provide() {
        long now = time.milliseconds();
        LeastLoadedNode leastLoadedNode = client.leastLoadedNode(now);
        if (metadataRecoveryStrategy == MetadataRecoveryStrategy.REBOOTSTRAP
                && !leastLoadedNode.hasNodeAvailableOrConnectionReady()) {
            metadataManager.rebootstrap(now);
        }

Finally, the rebootstrap path sets fatalException back to null in AdminMetadataManager#update:

public void update(Cluster cluster, long now) {
    if (cluster.isBootstrapConfigured()) {
        log.debug("Setting bootstrap cluster metadata {}.", cluster);
        bootstrapCluster = cluster;
    } else {
        log.debug("Updating cluster metadata to {}", cluster);
        this.lastMetadataUpdateMs = now;
    }
    this.state = State.QUIESCENT;
    this.fatalException = null;

The test relies on AdminMetadataManager#isReady to throw the error. If fatalException is null, the assertion can't be satisfied:

public boolean isReady() {
    if (fatalException != null) {
        log.debug("Metadata is not usable: failed to get metadata.", fatalException);
        throw fatalException;
    }
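The interaction above can be condensed into a tiny standalone model (not Kafka's real classes; names are simplified) showing why the assertion becomes racy once a rebootstrap clears the fatal exception:

```java
// Minimal sketch: once the retry backoff (default 100 ms) elapses and the
// rebootstrap path runs, the recorded fatal exception is cleared, so a later
// readiness check no longer throws and the test's assertion misses the error.
public class RebootstrapRaceSketch {
    static class MetadataManagerModel {
        RuntimeException fatalException;

        void recordAuthError(RuntimeException e) { fatalException = e; }

        // Mirrors the effect of update() after rebootstrap: fatalException is reset.
        void rebootstrap() { fatalException = null; }

        void checkReady() {
            if (fatalException != null) throw fatalException;
        }
    }

    public static void main(String[] args) {
        MetadataManagerModel mgr = new MetadataManagerModel();
        mgr.recordAuthError(new RuntimeException("authentication failed"));

        boolean throwsBefore = false;
        try { mgr.checkReady(); } catch (RuntimeException e) { throwsBefore = true; }

        // Backoff elapses before the test's assertions run -> rebootstrap fires.
        mgr.rebootstrap();
        boolean throwsAfter = false;
        try { mgr.checkReady(); } catch (RuntimeException e) { throwsAfter = true; }

        System.out.println(throwsBefore + " " + throwsAfter); // true false
    }
}
```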

Member

Thanks for the description. I've approved the PR and will merge once the build is complete. However, my question really was whether this is desirable behaviour. I would say that the original author intended to "pin" the authentication failure to last for a day so that all operations failed and a multi-part test could be performed reliably. Something has changed over the past couple of months making this unreliable. So, I expect there's a follow-on piece of work in here.

@lianetm (Member) commented Jan 29, 2025

Interesting, it seems to me the changes to rebootstrap for metadata shared above may be the ones behind this then (from around Nov 2024, btw).

private class MetadataUpdateNodeIdProvider implements NodeProvider {
    @Override
    public Node provide() {
        long now = time.milliseconds();
        LeastLoadedNode leastLoadedNode = client.leastLoadedNode(now);
        if (metadataRecoveryStrategy == MetadataRecoveryStrategy.REBOOTSTRAP
                && !leastLoadedNode.hasNodeAvailableOrConnectionReady()) {
            metadataManager.rebootstrap(now);
        }

Even though the test is setting a 1-day backoff, that only makes the MockClient return a null node as the least-loaded node (and this now triggers a new metadata request that, I guess, didn't trigger before the rebootstrap logic).

Member

Yes, I expect that's it.

Member

We should probably set metadata.recovery.strategy to none to replicate the original behavior, where fatal errors are not automatically reset. Also, we should add the root cause (the AuthenticationException is reset due to rebootstrap) to the comment.

Member

Yes, I think that's a better solution and it seems to work in my local testing. @FrankYang0529 wdyt? How about changing the extra config to AdminClientConfig.METADATA_RECOVERY_STRATEGY_CONFIG, "none"?

Member Author

@chia7712 @AndrewJSchofield, thanks for the great suggestion. Updated it.

Member

Looks good to me now. I'll merge and cherry-pick to 4.0.

Signed-off-by: PoAn Yang <payang@apache.org>
@AndrewJSchofield AndrewJSchofield merged commit 0dfc401 into apache:trunk Jan 30, 2025
@FrankYang0529 FrankYang0529 deleted the KAFKA-18441 branch January 30, 2025 10:28
AndrewJSchofield pushed a commit that referenced this pull request Jan 30, 2025
…ticationFailure (#18735)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>
pdruley pushed a commit to pdruley/kafka that referenced this pull request Feb 12, 2025
…ticationFailure (apache#18735)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>
manoj-mathivanan pushed a commit to manoj-mathivanan/kafka that referenced this pull request Feb 19, 2025
…ticationFailure (apache#18735)

Reviewers: Lianet Magrans <lmagrans@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Andrew Schofield <aschofield@confluent.io>

Labels

clients, small (Small PRs), tests (Test fixes, including flaky tests)


4 participants