
KAFKA-7940: Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback #6330

Merged
rajinisivaram merged 3 commits into apache:trunk from stanislavkozlovski:KAFKA-7940-test-custom-quota-callback
Feb 27, 2019

Conversation

@stanislavkozlovski (Contributor) commented Feb 26, 2019

This test has been seen to fail with:

java.lang.AssertionError: Too many quotaLimit calls Map(PRODUCE -> 1, FETCH -> 1, REQUEST -> 4)
java.lang.AssertionError: Too many quotaLimit calls Map(PRODUCE -> 1, FETCH -> 1, REQUEST -> 3)
java.lang.AssertionError: Too many quotaLimit calls Map(PRODUCE -> 1, FETCH -> 1, REQUEST -> 3)

All of these failures occurred in three test jobs in which the following tests were seen to fail:
https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/28/

kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
kafka.api.CustomQuotaCallbackTest.testCustomQuotaCallback

https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/15

kafka.api.CustomQuotaCallbackTest.testCustomQuotaCallback

https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/14/

kafka.api.CustomQuotaCallbackTest.testCustomQuotaCallback
kafka.server.DynamicBrokerReconfigurationTest.testAddRemoveSaslListeners

Reproduction Attempts

I tried to reproduce this flaky test locally by running testCustomQuotaCallback over and over again.
First off, I hit an issue where GroupedUserQuotaCallback would accumulate mutations and always fail. I thought it would be better if we reset those variables on every tearDown() call in CustomQuotaCallbackTest.scala.
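The tearDown() reset described above can be sketched as follows. This is a minimal, self-contained illustration (the class and method names here are invented, not the actual GroupedUserQuotaCallback API): shared mutable counters accumulate across test cases unless they are cleared between runs.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: per-quota-type call counters, like the ones the test
// inspects, must be reset in tearDown() or state leaks into the next test case.
class CallbackCounters {
    static final ConcurrentHashMap<String, AtomicInteger> quotaLimitCalls =
            new ConcurrentHashMap<>();

    // Record one quotaLimit() invocation for the given quota type.
    static void record(String quotaType) {
        quotaLimitCalls.computeIfAbsent(quotaType, k -> new AtomicInteger())
                       .incrementAndGet();
    }

    // Invoked from tearDown() so mutations from one run do not fail the next.
    static void reset() {
        quotaLimitCalls.values().forEach(c -> c.set(0));
    }
}
```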

After changing that, I ran the test over and over again. I commented out the lines after line 107 (assertTrue(s"Too many quotaLimit calls $quotaLimitCalls", quotaLimitCalls(ClientQuotaType.REQUEST).get <= serverCount)), since any changes after that point are ruled out as having an impact.

I encountered a lot of red herrings while debugging. I think I managed to pinpoint the cause: a race condition in ZooKeeper dynamic config change notifications.

Processing override for entityPath: users/group0_ with config: Map(request_percentage -> 1000.0, consumer_byte_rate -> 2000000, producer_byte_rate -> 1000000) 1551189456810
Processing override for entityPath: users/group0_ with config: Map(request_percentage -> 1000.0, consumer_byte_rate -> 2000000, producer_byte_rate -> 1000000) 1551189456810
in UserConfigHandler#processConfigChanges 1551189456811
in UserConfigHandler#processConfigChanges 1551189456811
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
In updateQuotaMetricConfigs 1551189456811}
Updating metric config allMetrics.asScala.filterKeys 1551189456811
In this 1551189456811}
RESET THEM (quotaLimitCalls.values.foreach(_.set(0)))
Updating metric config allMetrics.asScala.filterKeys 1551189456811
OUT UserConfigHandler#processConfigChanges 1551189456811
OUT UserConfigHandler#processConfigChanges 1551189456811
Asserting 1551189457588

This is basically the trailing result of the user.configureAndWaitForQuota call. That test code verifies that the quotas are set on the metrics (

assertEquals(s"ClientId $consumerClientId of user $userPrincipal must have request quota", Quota.upperBound(requestQuota), overrideConsumerRequestQuota)

).
The problem is a race condition with the code that updates those metrics: it subsequently calls ClientQuotaManager#updateQuotaMetricConfigs() (

updateQuotaMetricConfigs(updatedEntity)

), which in turn calls quotaCallback.quotaLimit (

Option(quotaCallback.quotaLimit(clientQuotaType, metricTags)).foreach { quota =>

), incrementing the quotaLimitCalls variable we check in the test.
I found UserConfigHandler#processConfigChanges() to be the culprit.

Solution

We basically need a way to guarantee that UserConfigHandler#processConfigChanges() has exited. If the test code's user.configureAndWaitForQuota() method had a way to do that, it would be perfect. Unfortunately, I could not come up with a good way to address this.

I was thinking that we could correlate the expected quotaLimitCalls. I see they had three different kinds of values across tests (with Thread.sleep enabled):

resetting Map(PRODUCE -> 5, FETCH -> 5, REQUEST -> 18)
resetting Map(PRODUCE -> 6, FETCH -> 6, REQUEST -> 20)
resetting Map(PRODUCE -> 7, FETCH -> 7, REQUEST -> 22)

Some of those are populated from the test code's user.configureAndWaitForQuota (2 request quota() calls for every 1 produce and 1 fetch quota() call).
I can't tell the exact correlation, though.

I brainstormed for a bit but could not figure out a smarter way to guarantee that this test passes than adding a Thread.sleep(). We could potentially add a waitUntilTrue(quotaLimitCalls(ClientQuotaType.REQUEST) >= 18) and then reset, but that seems even more hacky and error-prone.
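The waitUntilTrue idea above amounts to polling a condition with a deadline. A minimal standalone sketch of that pattern (this is not Kafka's actual TestUtils.waitUntilTrue; the helper name and signature are illustrative) looks like this:

```java
import java.util.function.BooleanSupplier;

// Minimal polling helper, sketching the approach discussed above: spin until
// the condition holds or the deadline passes, then the caller can reset counters.
class WaitUtil {
    static boolean waitUntilTrue(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false;  // timed out without the condition becoming true
            }
            try {
                Thread.sleep(10);  // back off briefly between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The fragility the comment points out is visible here: the threshold (>= 18) has to be guessed from observed runs, so the helper only masks the race rather than synchronizing with processConfigChanges().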

Otherwise, we could simply not check quotaLimitCalls(ClientQuotaType.REQUEST).get <= serverCount or allow for some leeway there.

@rajinisivaram (Contributor)
@stanislavkozlovski Thanks for the PR. I think it would be better to just remove that failing assertion, or assert that it is less than 100 (or some other large enough value) as a sanity check.

@stanislavkozlovski (Contributor, Author)

@rajinisivaram I added a value that is still somewhat reasonable - 10.
In all my runs, I couldn't get it to go over 4, so I think it is alright.
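The fix that was adopted, then, is a relaxed sanity bound rather than the original strict <= serverCount check. A sketch of that check (the method and constant names here are invented; the threshold of 10 and the observed maximum of 4 come from the discussion above):

```java
// Relaxed sanity check: instead of requiring the REQUEST quotaLimit call count
// to stay at or below serverCount (which races with config-change notifications),
// allow a generous bound of 10. Observed values in local runs never exceeded 4.
class QuotaAssertions {
    static final int MAX_QUOTA_LIMIT_CALLS = 10;

    static void assertNotTooManyQuotaLimitCalls(int requestQuotaLimitCalls) {
        if (requestQuotaLimitCalls > MAX_QUOTA_LIMIT_CALLS) {
            throw new AssertionError(
                "Too many quotaLimit calls: " + requestQuotaLimitCalls);
        }
    }
}
```

The trade-off is that the assertion no longer pins down the exact call count, but it still catches runaway callback invocation without being sensitive to the notification race.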

@rajinisivaram (Contributor) left a comment

@stanislavkozlovski Thanks for the PR, LGTM. Will merge after builds complete.

rajinisivaram merged commit f667f57 into apache:trunk Feb 27, 2019
@mjsax (Member) commented Feb 27, 2019

This PR seems to address https://issues.apache.org/jira/browse/KAFKA-7940 -- the PR title should not have been MINOR, but should use the ticket number.

If it fixed the issue, please link the PR to the ticket and update it (i.e., close the ticket). This should be cherry-picked to 2.2, too. Thanks.

@stanislavkozlovski (Contributor, Author)

Sorry about that; I had it in my fork's branch name but forgot to update the PR title.

@stanislavkozlovski changed the title from "Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback" to "KAFKA-7940: Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback" on Feb 28, 2019
@rajinisivaram (Contributor)

Merged to 2.2 as well.

@mjsax (Member) commented Feb 28, 2019

Thanks @stanislavkozlovski and @rajinisivaram!

jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* AK/trunk: (36 commits)
  KAFKA-7962: Avoid NPE for StickyAssignor (apache#6308)
  Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback (apache#6330)
  KAFKA-7918: Inline generic parameters Pt. II: RocksDB Bytes Store and Memory LRU Caches (apache#6327)
  MINOR: fix parameter naming (apache#6316)
  KAFKA-7956 In ShutdownableThread, immediately complete the shutdown if the thread has not been started (apache#6218)
  MINOR: Refactor replica log dir fetching for improved logging (apache#6313)
  [TRIVIAL] Remove unused StreamsGraphNode#repartitionRequired (apache#6227)
  MINOR: Increase produce timeout to 120 seconds (apache#6326)
  KAFKA-7918: Inline generic parameters Pt. I: in-memory key-value store (apache#6293)
  MINOR: Fix line break issue in upgrade notes (apache#6320)
  KAFKA-7972: Use automatic RPC generation in SaslHandshake
  MINOR: Enable capture of full stack trace in StreamTask#process (apache#6310)
  KAFKA-7938: Fix test flakiness in DeleteConsumerGroupsTest (apache#6312)
  KAFKA-7937: Fix Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsNotExistingGroup (apache#6311)
  MINOR: Update docs to say 2.2 (apache#6315)
  KAFKA-7672 : force write checkpoint during StreamTask #suspend (apache#6115)
  KAFKA-7961; Ignore assignment for un-subscribed partitions (apache#6304)
  KAFKA-7672: Restoring tasks need to be closed upon task suspension (apache#6113)
  KAFKA-7864; validate partitions are 0-based (apache#6246)
  KAFKA-7492 : Updated javadocs for aggregate and reduce methods returning null behavior. (apache#6285)
  ...