
KAFKA-14345: Fix flakiness with more accurate bound in (Dynamic)ConnectionQuotaTest #12806

Closed
gharris1727 wants to merge 12 commits into apache:trunk from gharris1727:kafka-14345-flakey-quotas

Conversation

@gharris1727
Contributor

@gharris1727 gharris1727 commented Nov 1, 2022

Signed-off-by: Greg Harris greg.harris@aiven.io

The existing hard-coded error bound in the test is inaccurate, such that certain patterns
of request arrival cause this test to exceed the bound slightly. This has appeared in CI as
a flaky test run that just barely exceeds the listed bound.

Instead of a hard-coded error bound which can become inaccurate if the test's parameters change,
compute the worst-case error bound based on the knowledge of the underlying windowing algorithm.
Full derivation and explanation of the worst-case bound is provided in a comment in the util function.
The bound is also unit tested in a mocked-time environment that simulates the full
SocketServer's throttle servo loop under a connection flood with various delays.

Also fix some IDE warnings, incorrect/misleading comments, and an AdminClient leak.
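For intuition, the mocked-time servo loop can be sketched roughly as below. This is an illustrative model only, not Kafka's code or the test's actual helper; `run_flood`, its parameters, and the purge/accept rules are hypothetical simplifications of a variable-width sampled rate:

```python
# Illustrative mocked-time model of a variable-width sampled-rate limiter.
# This is NOT Kafka's implementation; all names and rules here are
# hypothetical simplifications for discussion.

def run_flood(rate_limit, num_samples, window_ms, duration_ms,
              attempt_interval_ms=1):
    """Attempt one connection per attempt_interval_ms; return how many the
    limiter accepts over duration_ms."""
    samples = []  # (window_start_ms, count) pairs, oldest first
    accepted = 0
    for now in range(0, duration_ms, attempt_interval_ms):
        # Purge windows that have aged out of the sampling period.
        oldest_allowed = now - num_samples * window_ms
        samples = [(s, c) for (s, c) in samples if s + window_ms > oldest_allowed]
        # Open the current window if needed.
        window_start = now - now % window_ms
        if not samples or samples[-1][0] != window_start:
            samples.append((window_start, 0))
        total = sum(c for _, c in samples)
        # Variable-width elapsed time: from the start of the oldest retained
        # window until now, never less than one full window.
        elapsed_s = max(now - samples[0][0], window_ms) / 1000.0
        if (total + 1) / elapsed_s <= rate_limit:
            samples[-1] = (samples[-1][0], samples[-1][1] + 1)
            accepted += 1
    return accepted
```

With a 5 conn/sec limit, two 1-second samples, and a 10-second flood, the accepted count lands near (but not exactly at) the nominal 50, which is why the test needs a derived error bound rather than a hard-coded one.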

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@gharris1727
Contributor Author

cc @apovzner @splett2 for awareness of this bound change.

@divijvaidya
Member

Hey @gharris1727
Since you are familiar with the windowing algorithm for throttling, please take a look at #12045 too; you will find similarities between it and this PR. I will also review this PR tomorrow.

@gharris1727
Contributor Author

gharris1727 commented Nov 1, 2022

@divijvaidya Thanks for pointing me to that PR with a similar flaky test. I tried applying this bound to the situation in that test and it didn't hold, and I managed to deterministically replicate the behavior in my unit test with a parameterized test. Here are the results, just for discussion's sake:

[1]  numSamples=1,  minInterval=0
[2]  numSamples=1,  minInterval=1
[3]  numSamples=1,  minInterval=2
[4]  numSamples=1,  minInterval=10
[5]  numSamples=1,  minInterval=20
[6]  numSamples=1,  minInterval=30
[7]  numSamples=1,  minInterval=39
[8]  numSamples=1,  minInterval=40
[9]  numSamples=2,  minInterval=0  PASS
[10] numSamples=2,  minInterval=1
[11] numSamples=2,  minInterval=2
[12] numSamples=2,  minInterval=10
[13] numSamples=2,  minInterval=20
[14] numSamples=2,  minInterval=30
[15] numSamples=2,  minInterval=39
[16] numSamples=2,  minInterval=40 PASS
[17] numSamples=10, minInterval=0  PASS
[18] numSamples=10, minInterval=1  PASS
[19] numSamples=10, minInterval=2
[20] numSamples=10, minInterval=10
[21] numSamples=10, minInterval=20
[22] numSamples=10, minInterval=30
[23] numSamples=10, minInterval=39
[24] numSamples=10, minInterval=40 PASS

I was testing the minInterval=0 case when I was developing this bound, but it appears that the theoretical bound doesn't apply in some cases. I'm not yet sure whether the bound is wrong or the rate limiting is wrong.
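For discussion's sake, the shape of that parameterized sweep is easy to reproduce outside the Kafka harness. In the hypothetical sketch below, `check_bound` stands in for "run the mocked-time flood and verify the computed bound"; the grid and names are illustrative, not the test's real code:

```python
import itertools

# Hypothetical parameter grid mirroring the results table above.
NUM_SAMPLES = [1, 2, 10]
MIN_INTERVALS_MS = [0, 1, 2, 10, 20, 30, 39, 40]

def sweep(check_bound):
    """Run check_bound(num_samples, min_interval_ms) for every combination
    in the grid and return the combinations that violate the bound."""
    return [(n, i)
            for n, i in itertools.product(NUM_SAMPLES, MIN_INTERVALS_MS)
            if not check_bound(n, i)]
```

A sweep like this makes it cheap to see exactly which (numSamples, minInterval) regions violate a candidate bound.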

@gharris1727
Contributor Author

My worst-case bounds test had a typo that made the results invalid. I've fixed that and this is the current behavior:

PASS numSamples = 1
PASS numSamples = 10
PASS minInterval = 0,1
PASS minInterval = 20,30,40
FAIL numSamples = 2 && 2 <= minInterval <= 19

I also experimented with the fix from #12045 and it caused all of the test cases to pass.

@gharris1727
Contributor Author

Okay, I've just loosened the bound to compensate for the variable-length windows. This means that the flaky tests should now be resolved, without changing the behavior of the windowing algorithm.
I believe that this is ready for review @divijvaidya @apovzner @splett2. Thanks!

@gharris1727 gharris1727 changed the title KAFKA-14345: Fix flakiness with more accurate bound in DynamicConnectionQuotaTest KAFKA-14345: Fix flakiness with more accurate bound in (Dynamic)ConnectionQuotaTest Nov 4, 2022
@jlprat jlprat added the tests Test fixes (including flaky tests) label Mar 29, 2023
@github-actions

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions Bot added the stale Stale PRs label Jul 15, 2023
@gharris1727
Contributor Author

@divijvaidya Are you interested in reviewing this? This is a KIP-less alternative to solve the flakiness in this test.

@github-actions github-actions Bot removed the stale Stale PRs label Aug 2, 2023
@divijvaidya
Member

Hi @gharris1727

I am afraid that fixing only the tests hides the real problem/bug in the quotas implementation. If we merge this PR, the tests will no longer be flaky, but aren't we then accepting the current behaviour of the quotas implementation as expected behaviour, i.e. that the deviation could be ± epsilon, where epsilon could exceed the threshold (max.connection.creation.rate) set for a windowing period?

Having said that, the answer might be that the current behaviour is the accepted behaviour. In that case, I would be comfortable with this change if it is accompanied by a change in the docs explaining the current expectations of the windowing algorithm, so that users at least know what to expect from the quota implementation.

Alternatively, the fix, i.e. #12045, doesn't require a KIP (it doesn't change any public interfaces). Please feel free to pick up that PR (or duplicate it). I won't have time any time soon to work on it, but I will be happy to provide a review.

What do you think?

@gharris1727
Contributor Author

Hey @divijvaidya .

aren't we then accepting the current behaviour of the quotas implementation as expected behaviour, i.e. that the deviation could be ± epsilon, where epsilon could exceed the threshold (max.connection.creation.rate) set for a windowing period?

It is not possible to eliminate the deviation visible to the outside observer, and these tests will always need to include an error term. Here's what the error bounds will be for the default window configuration and various observation times with the variable-width and fixed-width windowing algorithms:

millis  variable-width      fixed-width
11001   2.4444444444444446  2.2
20001   1.736842105263158   2.2
22001   1.736842105263158   1.5714285714285714
30001   1.5172413793103448  1.5714285714285714
33001   1.5172413793103448  1.375
40001   1.4102564102564104  1.375
44001   1.4102564102564104  1.2790697674418605
50001   1.346938775510204   1.2790697674418605
55001   1.346938775510204   1.2222222222222223
60001   1.305084745762712   1.2222222222222223
66001   1.305084745762712   1.1846153846153846
70001   1.2753623188405796  1.1846153846153846
77001   1.2753623188405796  1.1578947368421053
80001   1.2531645569620253  1.1578947368421053
88001   1.2531645569620253  1.1379310344827587
90001   1.2359550561797752  1.1379310344827587
99001   1.2359550561797752  1.1224489795918366

This PR implements the variable-width limits because the variable-width algorithm is what is currently on trunk, not because it is more accepted or correct. I'm personally ambivalent about which algorithm should be used. This PR replaces the unmotivated and incorrect hardcoded constants with computed bounds, and that is beneficial with or without the fixed-width algorithm.
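To make the two columns concrete, here is a hedged, self-contained illustration (not Kafka's implementation; the function and its parameters are hypothetical) of how each windowing style turns the same burst into a measured rate:

```python
# Illustrative comparison of the two windowing styles. NOT Kafka's code;
# names and parameters here are hypothetical.

def measured_rate(total_events, elapsed_since_oldest_ms, num_samples,
                  window_ms, fixed_width):
    if fixed_width:
        # Fixed-width: always divide by the full configured sampling period.
        divisor_ms = num_samples * window_ms
    else:
        # Variable-width: divide by the time spanned by the retained samples,
        # which can shrink to little more than a single window after old
        # samples are purged -- inflating the apparent rate.
        divisor_ms = max(elapsed_since_oldest_ms, window_ms)
    return total_events * 1000.0 / divisor_ms
```

For example, a burst of 10 events observed 1.1 s after the oldest retained sample, with two 1-second windows, measures as roughly 9.1/s under the variable-width rule but 5.0/s under the fixed-width rule, which is the gap the table above quantifies.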

Alternatively, the fix, i.e. #12045, doesn't require a KIP (it doesn't change any public interfaces). Please feel free to pick up that PR (or duplicate it). I won't have time any time soon to work on it, but I will be happy to provide a review.

If you don't have the time or interest to work on that PR and/or KIP, then I think a test-only fix is appropriate until someone is interested in picking it up. The flaky tests have already made us aware of the odd behavior of the variable-width algorithm, so further ongoing failures aren't helpful.

Having said that, the answer might be that the current behaviour is the accepted behaviour. In that case, I would be comfortable with this change if it is accompanied by a change in the docs explaining the current expectations of the windowing algorithm, so that users at least know what to expect from the quota implementation.

I don't think that what this PR addresses needs to be explained in the documentation.

  • The existing quotas documentation is very general, and focuses on the strategic usage of quotas. This extremely specific error bound would be out of place among the other documentation.
  • How users choose their rate limit depends on so many factors more important than this error term (such as network environment, hardware, etc.) that mentioning this error factor specifically would, I think, mislead users into thinking it is more important than it is.
  • Users will already be compensating for this error term when setting their limits. If someone observes that the effective rate limit (either for bursts or steady-state flows) is too high and affects their cluster health, they will lower their configured rate limit to compensate. They may be blaming the wrong thing (their hardware vs the rate limit algorithm) but the effect is still the same.

@gharris1727
Contributor Author

Hi @divijvaidya As this is still causing flaky failures ~2% of the time, I'm still interested in getting this fix merged. Thanks!

@github-actions

This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions Bot added the stale Stale PRs label Dec 27, 2024
@github-actions

This PR has been closed since it has not had any activity in 120 days. If you feel like this
was a mistake, or you would like to continue working on it, please feel free to re-open the
PR and ask for a review.

@github-actions github-actions Bot added the closed-stale PRs that were closed due to inactivity label Jan 27, 2025
@github-actions github-actions Bot closed this Jan 27, 2025