KAFKA-10079: improve thread-level stickiness #8775
Conversation
vvcephei left a comment
Thanks @ableegoldman ! It looks good to me overall.
We have several new methods, and also this new book-kept collection (consumerToPreviousTaskIds), but no new tests for them in ClientStateTest. Can you add the missing coverage?
The new methods are more a matter of principle; I'm really concerned that we should have good coverage on the bookkeeping aspect of consumerToPreviousTaskIds, because I fear future regressions when we have to maintain two data structures in a consistent fashion.
Definitely. I meant to write tests but then I took Luna for a walk and forgot 😄
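The bookkeeping concern above could be exercised by a test along these lines. This is a standalone sketch with hypothetical names (not Kafka's actual ClientState API): two structures that must be updated together, plus the invariant a unit test should assert to catch future regressions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Standalone illustration of the bookkeeping concern: a flat set of previous
// tasks and a per-consumer map that must be kept consistent with it.
class ClientStateSketch {
    private final Set<String> previousTasks = new HashSet<>();
    private final Map<String, Set<String>> consumerToPreviousTaskIds = new HashMap<>();

    void addPreviousTasks(final String consumer, final Set<String> tasks) {
        // Both structures are updated in one place to keep them in sync.
        previousTasks.addAll(tasks);
        consumerToPreviousTaskIds
            .computeIfAbsent(consumer, c -> new HashSet<>())
            .addAll(tasks);
    }

    Set<String> previousTasksForConsumer(final String consumer) {
        return consumerToPreviousTaskIds.getOrDefault(consumer, Collections.emptySet());
    }

    // Invariant worth testing: the union of the per-consumer sets
    // equals the flat set of previous tasks.
    boolean bookkeepingIsConsistent() {
        final Set<String> union = new HashSet<>();
        consumerToPreviousTaskIds.values().forEach(union::addAll);
        return union.equals(previousTasks);
    }
}
```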
Tests failed due to the broken consumer StickyAssignor test, which will be fixed via #8786
619cafe to f580e63
vvcephei left a comment
Thanks for the update, @ableegoldman; one more question...
    @Test
    public void shouldReturnPreviousStatefulTasksForConsumerInIncreasingLagOrder() {
I missed the extra sort on my last review. It really seems like too much fanciness for the ClientState to sort the tasks in lag order. Would it be too messy to move the sort aspect out to the balancing code that needs it?
You didn't miss it, I just snuck it in there after your review :P
Sorry, should have called out that I made some more changes. I think that was the only significant logical change though. I'll try pulling the sort out into the assignment code
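Pulling the sort out might look something like this sketch, where the state object just exposes per-task lags and the balancing code does the ordering itself. The names are hypothetical, not the actual Streams code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: the assignment/balancing code sorts previous stateful tasks by lag
// itself, so ClientState can stay a plain container of per-task lags.
class LagOrderSketch {
    static List<String> tasksInIncreasingLagOrder(final Set<String> prevStatefulTasks,
                                                  final Map<String, Long> taskLags) {
        final List<String> sorted = new ArrayList<>(prevStatefulTasks);
        // Tasks with no recorded lag sort last.
        sorted.sort(Comparator.comparingLong(
            (String t) -> taskLags.getOrDefault(t, Long.MAX_VALUE)));
        return sorted;
    }
}
```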
Thanks for the update, @ableegoldman, just one question...
    // If we couldn't compute the task lags due to failure to fetch offsets, just return a flat constant
    totalLag = 0L;
Is this the right constant to represent "we don't know the lag"? Or did I mistake how this is going to be used?
The value itself doesn't matter, just that it's constant across all tasks.
But I'm guessing you meant, why not use the existing UNKNOWN_OFFSET_SUM sentinel, in which case the answer is probably just that I forgot about it. Anyway I did a slight additional refactoring beyond this, just fyi: instead of skipping the lag computation when we fail to fetch offsets, we now always initialize the lags and just pass in the UNKNOWN_OFFSET_SUM for all stateful tasks when the offset fetch fails.
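The refactoring described could look roughly like the following standalone sketch. It always initializes a lag entry per stateful task and falls back to a single sentinel when the offset fetch fails; the sentinel's numeric value here is assumed, since only its uniformity across tasks matters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch: every stateful task always gets a lag entry; when offsets could
// not be fetched, all tasks share one sentinel so no task looks "closer"
// to caught-up than any other.
class LagInitSketch {
    static final long UNKNOWN_OFFSET_SUM = -2L; // assumed sentinel value

    static Map<String, Long> taskLags(final Set<String> statefulTasks,
                                      final Map<String, Long> fetchedOffsetSums, // null => fetch failed
                                      final Map<String, Long> endOffsetSums) {
        final Map<String, Long> lags = new HashMap<>();
        for (final String task : statefulTasks) {
            if (fetchedOffsetSums == null) {
                // Flat constant across all tasks when the fetch failed.
                lags.put(task, UNKNOWN_OFFSET_SUM);
            } else {
                final long end = endOffsetSums.getOrDefault(task, 0L);
                final long fetched = fetchedOffsetSums.getOrDefault(task, 0L);
                lags.put(task, Math.max(0L, end - fetched));
            }
        }
        return lags;
    }
}
```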
The Java 8 and Java 14 builds both failed. I've seen both of these failures be flaky already (and frankly am a bit concerned about them...) but I'll see if I can reproduce this locally in case this PR is somehow making them worse
200 runs and I can't reproduce either. But it looks like both were previously flaky, and seem unrelated to this PR. Can we kick off tests again?
Uses a similar (but slightly different) algorithm as in KAFKA-9987 to produce a maximally sticky -- and perfectly balanced -- assignment of tasks to threads within a single client. This is important for in-memory stores, which get wiped out when transferred between threads.

Reviewers: John Roesler <vvcephei@apache.org>
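In that spirit, a greedy two-pass assignment can be sketched as follows. This is an illustration of the general idea only, not the actual Streams implementation: threads first reclaim their previously owned tasks up to a balanced cap (stickiness), then leftover tasks go to whichever thread currently has the fewest (balance).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a maximally sticky, perfectly balanced task-to-thread assignment:
// pass 1 lets each thread keep its previous tasks up to a balanced cap,
// pass 2 hands leftovers to the currently least-loaded thread, so final
// loads differ by at most one task.
class StickyThreadAssignmentSketch {
    static Map<String, List<String>> assign(final List<String> tasks,
                                            final Map<String, Set<String>> previousOwnership,
                                            final List<String> threads) {
        final int minCap = tasks.size() / threads.size();
        int threadsAllowedOneExtra = tasks.size() % threads.size();

        final Map<String, List<String>> assignment = new HashMap<>();
        final Set<String> unassigned = new LinkedHashSet<>(tasks);

        // Sticky pass: reclaim previously owned tasks within the balanced cap.
        for (final String thread : threads) {
            final int cap = minCap + (threadsAllowedOneExtra > 0 ? 1 : 0);
            final List<String> owned = new ArrayList<>();
            for (final String task : previousOwnership.getOrDefault(thread, Collections.emptySet())) {
                if (owned.size() < cap && unassigned.remove(task)) {
                    owned.add(task);
                }
            }
            if (owned.size() > minCap) {
                threadsAllowedOneExtra--; // this thread used up an "extra" slot
            }
            assignment.put(thread, owned);
        }

        // Balancing pass: each leftover task goes to the least-loaded thread.
        for (final String task : unassigned) {
            String leastLoaded = threads.get(0);
            for (final String thread : threads) {
                if (assignment.get(thread).size() < assignment.get(leastLoaded).size()) {
                    leastLoaded = thread;
                }
            }
            assignment.get(leastLoaded).add(task);
        }
        return assignment;
    }
}
```

For example, with four tasks, three threads, and two threads each previously owning two tasks, the sticky pass lets one thread keep both of its tasks while the balancing pass spreads the rest, ending at loads of 2, 1, and 1.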
Cherry picked to 2.6