
KAFKA-7940: Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback #6330

Merged
rajinisivaram merged 3 commits into apache:trunk from stanislavkozlovski:KAFKA-7940-test-custom-quota-callback
Feb 27, 2019

Conversation

@stanislavkozlovski (Contributor) commented Feb 26, 2019

This test has been seen to fail with:

java.lang.AssertionError: Too many quotaLimit calls Map(PRODUCE -> 1, FETCH -> 1, REQUEST -> 4)
java.lang.AssertionError: Too many quotaLimit calls Map(PRODUCE -> 1, FETCH -> 1, REQUEST -> 3)
java.lang.AssertionError: Too many quotaLimit calls Map(PRODUCE -> 1, FETCH -> 1, REQUEST -> 3)

All of these failures occurred in three test jobs in which the following tests were seen to fail:
https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/28/

kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
kafka.api.CustomQuotaCallbackTest.testCustomQuotaCallback

https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/15

kafka.api.CustomQuotaCallbackTest.testCustomQuotaCallback

https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/14/

kafka.api.CustomQuotaCallbackTest.testCustomQuotaCallback
kafka.server.DynamicBrokerReconfigurationTest.testAddRemoveSaslListeners

Reproduction Attempts

I tried to reproduce this flaky test locally by running testCustomQuotaCallback over and over again.
First off, I hit an issue where GroupedUserQuotaCallback would accumulate mutations and always fail. I thought it would be better if we reset those variables on every tearDown() call in CustomQuotaCallbackTest.scala.
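The tearDown() reset described above can be sketched as follows. This is a minimal, self-contained illustration (the class and method names here are invented, not the actual GroupedUserQuotaCallback API): shared mutable counters accumulate across test cases unless they are cleared between runs.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: per-quota-type call counters, like the ones the test
// inspects, must be reset in tearDown() or state leaks into the next test case.
class CallbackCounters {
    static final ConcurrentHashMap<String, AtomicInteger> quotaLimitCalls =
            new ConcurrentHashMap<>();

    // Record one quotaLimit() invocation for the given quota type.
    static void record(String quotaType) {
        quotaLimitCalls.computeIfAbsent(quotaType, k -> new AtomicInteger())
                       .incrementAndGet();
    }

    // Invoked from tearDown() so mutations from one run do not fail the next.
    static void reset() {
        quotaLimitCalls.values().forEach(c -> c.set(0));
    }
}
```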

After changing that, I ran the test over and over again. I commented out the lines after line 107 (assertTrue(s"Too many quotaLimit calls $quotaLimitCalls", quotaLimitCalls(ClientQuotaType.REQUEST).get <= serverCount)), since any changes after that point are ruled out as having an impact.

I encountered a lot of red herrings while debugging. I think I managed to pinpoint the cause: a race condition in ZooKeeper dynamic config change notifications.

Processing override for entityPath: users/group0_ with config: Map(request_percentage -> 1000.0, consumer_byte_rate -> 2000000, producer_byte_rate -> 1000000) 1551189456810
Processing override for entityPath: users/group0_ with config: Map(request_percentage -> 1000.0, consumer_byte_rate -> 2000000, producer_byte_rate -> 1000000) 1551189456810
in UserConfigHandler#processConfigChanges 1551189456811
in UserConfigHandler#processConfigChanges 1551189456811
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
Updating metric config allMetrics.asScala.filterKeys 1551189456811
In this 1551189456811}
RESET THEM (quotaLimitCalls.values.foreach(_.set(0)))
Updating metric config allMetrics.asScala.filterKeys 1551189456811
OUT UserConfigHandler#processConfigChanges 1551189456811
OUT UserConfigHandler#processConfigChanges 1551189456811
Asserting 1551189457588

This is basically the trailing result of the user.configureAndWaitForQuota call. That test code verifies that the quotas are set on the metrics (

assertEquals(s"ClientId $consumerClientId of user $userPrincipal must have request quota", Quota.upperBound(requestQuota), overrideConsumerRequestQuota)

).
The problem is a race condition with the code that updates those metrics: it subsequently calls ClientQuotaManager#updateQuotaMetricConfigs() (

updateQuotaMetricConfigs(updatedEntity)

), which in turn calls quotaCallback.quotaLimit (

Option(quotaCallback.quotaLimit(clientQuotaType, metricTags)).foreach { quota =>

), incrementing the quotaLimitCalls variable we check in the test.
I found UserConfigHandler#processConfigChanges() to be the culprit.

Solution

We basically need a way to guarantee that UserConfigHandler#processConfigChanges() has exited. If the test code's user.configureAndWaitForQuota() method had a way to do that, it would be perfect. Unfortunately, I could not come up with a good way to address this.

I was thinking that we could correlate the expected quotaLimitCalls. I see they had three different kinds of values across tests (with Thread.sleep enabled):

resetting Map(PRODUCE -> 5, FETCH -> 5, REQUEST -> 18)
resetting Map(PRODUCE -> 6, FETCH -> 6, REQUEST -> 20)
resetting Map(PRODUCE -> 7, FETCH -> 7, REQUEST -> 22)

Some of those are populated from the test code's user.configureAndWaitForQuota (2 request quota() calls for every 1 produce and 1 fetch quota() call).
I can't tell the exact correlation, though.

I brainstormed for a bit but could not figure out a smarter way to guarantee that this test passes than adding a Thread.sleep(). We could potentially add a waitUntilTrue(quotaLimitCalls(ClientQuotaType.REQUEST) >= 18) and then reset, but that seems even more hacky and error-prone.
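The waitUntilTrue idea above amounts to polling a condition with a deadline. A minimal standalone sketch of that pattern (this is not Kafka's actual TestUtils.waitUntilTrue; the helper name and signature are illustrative) looks like this:

```java
import java.util.function.BooleanSupplier;

// Minimal polling helper, sketching the approach discussed above: spin until
// the condition holds or the deadline passes, then the caller can reset counters.
class WaitUtil {
    static boolean waitUntilTrue(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;  // timed out without the condition becoming true
            }
            try {
                Thread.sleep(10);  // back off briefly between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The fragility the comment points out is visible here: the threshold (>= 18) has to be guessed from observed runs, so the helper only masks the race rather than synchronizing with processConfigChanges().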

Otherwise, we could simply not check quotaLimitCalls(ClientQuotaType.REQUEST).get <= serverCount or allow for some leeway there.

@rajinisivaram (Contributor)
@stanislavkozlovski Thanks for the PR. I think it would be better to just remove that failing assertion, or assert that it is less than 100 (or some other large enough value) as a sanity check.

@stanislavkozlovski (Contributor, Author)

@rajinisivaram I added a value that is still somewhat reasonable - 10.
In all my runs, I couldn't get it to go over 4, so I think it is alright.
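The fix that was adopted, then, is a relaxed sanity bound rather than the original strict <= serverCount check. A sketch of that check (the method and constant names here are invented; the threshold of 10 and the observed maximum of 4 come from the discussion above):

```java
// Relaxed sanity check: instead of requiring the REQUEST quotaLimit call count
// to stay at or below serverCount (which races with config-change notifications),
// allow a generous bound of 10. Observed values in local runs never exceeded 4.
class QuotaAssertions {
    static final int MAX_QUOTA_LIMIT_CALLS = 10;

    static void assertNotTooManyQuotaLimitCalls(int requestQuotaLimitCalls) {
        if (requestQuotaLimitCalls > MAX_QUOTA_LIMIT_CALLS) {
            throw new AssertionError(
                "Too many quotaLimit calls: " + requestQuotaLimitCalls);
        }
    }
}
```

The trade-off is that the assertion no longer pins down the exact call count, but it still catches runaway callback invocation without being sensitive to the notification race.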

@rajinisivaram (Contributor) left a comment

@stanislavkozlovski Thanks for the PR, LGTM. Will merge after builds complete.

rajinisivaram merged commit f667f57 into apache:trunk Feb 27, 2019
@mjsax (Member) commented Feb 27, 2019

This PR seems to address https://issues.apache.org/jira/browse/KAFKA-7940 -- the PR title should not have been MINOR, but should use the ticket number.

If it fixed the issue, please link the PR to the ticket and update it (i.e., close the ticket). This should be cherry-picked to 2.2, too. Thanks.

@stanislavkozlovski (Contributor, Author)

Sorry about that; I had it in my fork's branch name but forgot to update the PR title.

@stanislavkozlovski changed the title from "Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback" to "KAFKA-7940: Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback" on Feb 28, 2019
@rajinisivaram (Contributor)

Merged to 2.2 as well.

@mjsax (Member) commented Feb 28, 2019

Thanks @stanislavkozlovski and @rajinisivaram!

jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* AK/trunk: (36 commits)
  KAFKA-7962: Avoid NPE for StickyAssignor (apache#6308)
  Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback (apache#6330)
  KAFKA-7918: Inline generic parameters Pt. II: RocksDB Bytes Store and Memory LRU Caches (apache#6327)
  MINOR: fix parameter naming (apache#6316)
  KAFKA-7956 In ShutdownableThread, immediately complete the shutdown if the thread has not been started (apache#6218)
  MINOR: Refactor replica log dir fetching for improved logging (apache#6313)
  [TRIVIAL] Remove unused StreamsGraphNode#repartitionRequired (apache#6227)
  MINOR: Increase produce timeout to 120 seconds (apache#6326)
  KAFKA-7918: Inline generic parameters Pt. I: in-memory key-value store (apache#6293)
  MINOR: Fix line break issue in upgrade notes (apache#6320)
  KAFKA-7972: Use automatic RPC generation in SaslHandshake
  MINOR: Enable capture of full stack trace in StreamTask#process (apache#6310)
  KAFKA-7938: Fix test flakiness in DeleteConsumerGroupsTest (apache#6312)
  KAFKA-7937: Fix Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsNotExistingGroup (apache#6311)
  MINOR: Update docs to say 2.2 (apache#6315)
  KAFKA-7672 : force write checkpoint during StreamTask #suspend (apache#6115)
  KAFKA-7961; Ignore assignment for un-subscribed partitions (apache#6304)
  KAFKA-7672: Restoring tasks need to be closed upon task suspension (apache#6113)
  KAFKA-7864; validate partitions are 0-based (apache#6246)
  KAFKA-7492 : Updated javadocs for aggregate and reduce methods returning null behavior. (apache#6285)
  ...