KAFKA-10079: improve thread-level stickiness #8775

Merged
vvcephei merged 23 commits into apache:trunk from ableegoldman:10079-HA-for-in-memory-stores
Jun 10, 2020

Conversation

@ableegoldman
Member

@ableegoldman ableegoldman commented Jun 2, 2020

Uses an algorithm similar to (but slightly different from) the one in KAFKA-9987 to produce a maximally sticky and perfectly balanced assignment of tasks to threads within a single client. This is important for in-memory stores, which get wiped out when transferred between threads.

Must be cherrypicked to 2.6
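The idea described above, maximal stickiness under a hard balance constraint, can be sketched roughly as follows. This is an illustrative toy, not the actual Streams assignor code; all names and the two-pass structure here are hypothetical:

```java
import java.util.*;

// Illustrative sketch only (not the actual Streams assignor): distribute tasks
// across threads so counts differ by at most one, preferring each thread's
// previously owned tasks so in-memory state is not wiped by a transfer.
public class StickyThreadAssignor {
    public static Map<String, List<String>> assign(List<String> tasks,
                                                   Map<String, Set<String>> previousOwnership) {
        List<String> threads = new ArrayList<>(previousOwnership.keySet());
        int minTasks = tasks.size() / threads.size();
        int extra = tasks.size() % threads.size();

        // Per-thread capacity: floor(n/k) tasks each, plus one for the first n%k threads.
        Map<String, Integer> quota = new HashMap<>();
        for (int i = 0; i < threads.size(); i++) {
            quota.put(threads.get(i), minTasks + (i < extra ? 1 : 0));
        }

        Map<String, List<String>> assignment = new HashMap<>();
        Set<String> unassigned = new LinkedHashSet<>(tasks);

        // Sticky pass: each thread keeps its previously owned tasks, up to its quota.
        for (String thread : threads) {
            List<String> owned = new ArrayList<>();
            for (String task : previousOwnership.get(thread)) {
                if (owned.size() < quota.get(thread) && unassigned.remove(task)) {
                    owned.add(task);
                }
            }
            assignment.put(thread, owned);
        }

        // Balancing pass: place the remaining tasks on any thread with spare capacity.
        for (String task : unassigned) {
            for (String thread : threads) {
                if (assignment.get(thread).size() < quota.get(thread)) {
                    assignment.get(thread).add(task);
                    break;
                }
            }
        }
        return assignment;
    }
}
```

With four tasks and two threads where one thread previously owned three of them, the previous owner keeps two (its quota) and the other thread picks up the rest, so the result is both balanced and as sticky as the balance constraint allows.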

@vvcephei
Contributor

vvcephei commented Jun 2, 2020

Ok to test

Contributor

@vvcephei vvcephei left a comment

Thanks @ableegoldman! It looks good to me overall.

Contributor

We have several new methods, and also this new book-kept collection (consumerToPreviousTaskIds), but no new tests for them in ClientStateTest. Can you add the missing coverage?

The new methods are more a matter of principle; I'm really concerned that we should have good coverage on the bookkeeping aspect of consumerToPreviousTaskIds, because I fear future regressions when we have to maintain two data structures in a consistent fashion.

Member Author

Definitely. I meant to write tests but then I took Luna for a walk and forgot 😄
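For illustration, the invariant such a test would pin down can be sketched like this. This is a hypothetical, heavily simplified stand-in for ClientState; only the field name consumerToPreviousTaskIds comes from the PR, everything else is made up:

```java
import java.util.*;

// Hypothetical, simplified stand-in for ClientState, showing the invariant a
// regression test would pin down: the aggregate previous-task set and the
// per-consumer map must be updated together and never diverge.
public class ClientStateSketch {
    private final Set<String> previousTasks = new HashSet<>();
    private final Map<String, Set<String>> consumerToPreviousTaskIds = new HashMap<>();

    public void addPreviousTasks(String consumer, Set<String> taskIds) {
        // Both structures are updated in one place; this is exactly the
        // bookkeeping a unit test should exercise.
        previousTasks.addAll(taskIds);
        consumerToPreviousTaskIds
                .computeIfAbsent(consumer, c -> new HashSet<>())
                .addAll(taskIds);
    }

    public Set<String> previousTasks() {
        return previousTasks;
    }

    public Set<String> previousTasksForConsumer(String consumer) {
        return consumerToPreviousTaskIds.getOrDefault(consumer, Collections.emptySet());
    }

    // Consistency check: the union of the per-consumer sets equals the aggregate set.
    public boolean isConsistent() {
        Set<String> union = new HashSet<>();
        consumerToPreviousTaskIds.values().forEach(union::addAll);
        return union.equals(previousTasks);
    }
}
```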

@vvcephei
Contributor

vvcephei commented Jun 2, 2020

Ok to test

@vvcephei
Contributor

vvcephei commented Jun 2, 2020

Test this please

3 similar comments

@ableegoldman
Member Author

Tests failed due to the broken consumer StickyAssignor test that will be fixed via #8786

@ableegoldman force-pushed the 10079-HA-for-in-memory-stores branch from 619cafe to f580e63 on June 3, 2020
Contributor

@vvcephei vvcephei left a comment

Thanks for the update, @ableegoldman; one more question...

}

@Test
public void shouldReturnPreviousStatefulTasksForConsumerInIncreasingLagOrder() {
Contributor

I missed the extra sort on my last review. It really seems like too much fanciness for the ClientState to sort the tasks in lag order. Would it be too messy to move the sort aspect out to the balancing code that needs it?

Member Author

You didn't miss it, I just snuck it in there after your review :P

Sorry, should have called out that I made some more changes. I think that was the only significant logical change though. I'll try pulling the sort out into the assignment code
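The refactor being discussed, moving the lag sort out of ClientState and into the balancing code that needs it, might look roughly like this. The names taskLags and prevTasksByLag are illustrative, not Kafka's:

```java
import java.util.*;

// Hypothetical sketch of the refactor discussed above: instead of ClientState
// handing back tasks pre-sorted by lag, the balancing code sorts them itself
// from a plain lag lookup. All names here are made up for illustration.
public class LagSortSketch {
    public static List<String> prevTasksByLag(Set<String> previousTasks,
                                              Map<String, Long> taskLags) {
        List<String> sorted = new ArrayList<>(previousTasks);
        // Tasks with no known lag sort last, after every task with a real lag.
        sorted.sort(Comparator.comparingLong(
                (String t) -> taskLags.getOrDefault(t, Long.MAX_VALUE)));
        return sorted;
    }
}
```

This keeps ClientState a dumb container and confines the "increasing lag order" policy to the one caller that cares about it.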

@vvcephei
Contributor

vvcephei commented Jun 3, 2020

Ok to test

@vvcephei
Contributor

vvcephei commented Jun 3, 2020

Test this please

Contributor

@vvcephei vvcephei left a comment

Thanks for the update, @ableegoldman, just one question...

Comment on lines +306 to +307
// If we couldn't compute the task lags due to failure to fetch offsets, just return a flat constant
totalLag = 0L;
Contributor

Is this the right constant to represent "we don't know the lag"? Or did I mistake how this is going to be used?

Member Author

The value itself doesn't matter, just that it's constant across all tasks.

But I'm guessing you meant, why not use the existing UNKNOWN_OFFSET_SUM sentinel, in which case the answer is probably just that I forgot about it. Anyway I did a slight additional refactoring beyond this, just fyi: instead of skipping the lag computation when we fail to fetch offsets, we now always initialize the lags and just pass in the UNKNOWN_OFFSET_SUM for all stateful tasks when the offset fetch fails.
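For illustration, the refactoring described here could look roughly like this. It is a hypothetical sketch: only the name UNKNOWN_OFFSET_SUM comes from the PR, while the sentinel value, method names, and shape are all made up:

```java
import java.util.*;

// Hypothetical sketch of the refactor described above: always build the lag
// map, and when the offset fetch failed (endOffsetSums == null) fill in one
// sentinel for every stateful task so all lags are equally "unknown".
public class TaskLagSketch {
    static final long UNKNOWN_OFFSET_SUM = -1L; // sentinel; Kafka's actual constant may differ

    public static Map<String, Long> computeTaskLags(Set<String> statefulTasks,
                                                    Map<String, Long> endOffsetSums,
                                                    Map<String, Long> localOffsetSums) {
        Map<String, Long> lags = new HashMap<>();
        for (String task : statefulTasks) {
            if (endOffsetSums == null) {
                // Offset fetch failed: a flat constant keeps every task comparable,
                // which is all the balancing code needs.
                lags.put(task, UNKNOWN_OFFSET_SUM);
            } else {
                long end = endOffsetSums.getOrDefault(task, 0L);
                long local = localOffsetSums.getOrDefault(task, 0L);
                lags.put(task, Math.max(end - local, 0L));
            }
        }
        return lags;
    }
}
```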

@vvcephei
Contributor

vvcephei commented Jun 4, 2020

Test this please

@vvcephei
Contributor

vvcephei commented Jun 4, 2020

Test this please

1 similar comment

@ableegoldman
Member Author

Java8 failed with
KTableSourceTopicRestartIntegrationTest.shouldRestoreAndProgressWhenTopicNotWrittenToDuringRestoration

Java14 failed with KTableSourceTopicRestartIntegrationTest.shouldRestoreAndProgressWhenTopicWrittenToDuringRestorationWithEosAlphaEnabled

I've seen both of these be flaky already (and frankly am a bit concerned about them...) but I'll see if I can reproduce this locally in case this PR is somehow making them worse

@ableegoldman
Member Author

200 runs and I can't reproduce either. But it looks like both were previously flaky, and seem unrelated to this PR. Can we kick off tests again?

@vvcephei
Contributor

vvcephei commented Jun 9, 2020

Test this please

5 similar comments

Contributor

@vvcephei vvcephei left a comment

Thanks @ableegoldman!

@vvcephei vvcephei merged commit 0f68dc7 into apache:trunk Jun 10, 2020
vvcephei pushed a commit that referenced this pull request Jun 10, 2020
Uses a similar (but slightly different) algorithm as in KAFKA-9987 to produce a maximally sticky -- and perfectly balanced -- assignment of tasks to threads within a single client. This is important for in-memory stores which get wiped out when transferred between threads.

Reviewers: John Roesler <vvcephei@apache.org>
@vvcephei
Contributor

Cherry picked to 2.6

@ableegoldman ableegoldman deleted the 10079-HA-for-in-memory-stores branch June 26, 2020 22:39
ijuma added a commit to ijuma/kafka that referenced this pull request Nov 17, 2020
…t-for-generated-requests

* apache-github/trunk: (248 commits)
  KAFKA-10049: Fixed FKJ bug where wrapped serdes are set incorrectly when using default StreamsConfig serdes (apache#8764)
  KAFKA-10027: Implement read path for feature versioning system (KIP-584) (apache#8680)
  KAFKA-10085: correctly compute lag for optimized source changelogs (apache#8787)
  KAFKA-10086: Integration test for ensuring warmups are effective (apache#8818)
  KAFKA-9374: Make connector interactions asynchronous (apache#8069)
  MINOR: reduce sizeInBytes for percentiles metrics (apache#8835)
  KAFKA-10115: Incorporate errors.tolerance with the Errant Record Reporter (apache#8829)
  KAFKA-9216: Enforce that Connect’s internal topics use `compact` cleanup policy (apache#8828)
  KAFKA-9845: Warn users about using config providers with plugin.path property (apache#8455)
  KAFKA-7833: Add missing test (apache#8847)
  KAFKA-9066: Retain metrics for failed tasks (apache#8502)
  KAFKA-9841: Revoke duplicate connectors and tasks when zombie workers return with an outdated assignment (apache#8453)
  KAFKA-9985: Sink connector may exhaust broker when writing in DLQ (apache#8663)
  KAFKA-9441: remove prepareClose() to simplify task management (apache#8833)
  KAFKA-7833: Add Global/StateStore name conflict check (apache#8825)
  KAFKA-9969: Exclude ConnectorClientConfigRequest from class loading isolation (apache#8630)
  KAFKA-9991: Fix flaky unit tests (apache#8843)
  KAFKA-10014; Always try to close all channels in Selector#close (apache#8685)
  KAFKA-10079: improve thread-level stickiness (apache#8775)
  MINOR: Print all removed dynamic members during join complete (apache#8816)
  ...