KAFKA-6145: Pt 2. Include offset sums in subscription #8246
vvcephei merged 22 commits into apache:trunk from
Conversation
test this please
vvcephei
left a comment
Thanks, @ableegoldman !
Let's add this before we try to read the checkpoint file. In case we do get an IOException, we shouldn't forget that we got the lock.
what should we do if the offset is negative?
Oh, also, should we detect overflow and pin to MAX_VALUE in that case?
Q: Would it be too strict to specify offset >= 0 as an invariant and throw an IllegalStateException?
Regarding the overflow: when computing the lag, the sum of the last committed offsets will always be larger than or equal to the sum of the offsets of a task's state stores, except for the case where the sum of the last committed offsets has already overflowed but the sum of the state offsets has not. So, if both have overflowed, the difference is not affected by the overflows. If only the sum of the last committed offsets has overflowed, we need to compute the difference differently, but we are able to recognize this case. All of this assumes that overflow is well defined in Java, i.e. that MIN_VALUE comes after MAX_VALUE.
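A tiny sketch (not from the PR) of the wrap-around argument above: Java long arithmetic is two's complement, so if both sums have wrapped past MAX_VALUE, their difference is still exact as long as the true lag itself fits in a long. The class and method names are illustrative only.

```java
public class OverflowLag {
    // Hypothetical helper: lag between the sum of the latest committed
    // offsets and the sum of a task's state offsets. Two's-complement
    // subtraction cancels the wrap-around.
    static long lag(final long endOffsetSum, final long taskOffsetSum) {
        return endOffsetSum - taskOffsetSum;
    }

    public static void main(final String[] args) {
        final long base = Long.MAX_VALUE - 10;
        final long endSum = base + 100;  // wraps past MAX_VALUE into negative territory
        final long taskSum = base + 60;  // also wraps
        System.out.println(lag(endSum, taskSum)); // prints 40 despite both overflows
    }
}
```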
Seems like the only negative offset we can get is -1, which indicates the offset is unknown in which case we should skip it. Will add a check for overflow too
Hmm. Skipping would count the "current position" on that store as 0. Should we assume "unknown offset" equates to "fully caught up", or is it safer to set it to MAX_VALUE, or something else?
-1 means that recordMetadata.offset == ProduceResponse.INVALID_OFFSET, which sounds to me like either the topic or producer isn't initialized yet or is corrupted in some way. Both of those make sense (to me) to interpret as an offset of 0. But, I'm no expert in the Producer client and various offset meanings.
w.r.t the matter of overflow, I think we should aim to keep things simple and just pin to MAX_VALUE in the event of overflow, if we think that's likely to be a rare event. Obviously we should at the least make sure we don't crash or seriously harm the operation or results of the app -- pinning to MAX_VALUE will at worst make us potentially switch over to an active task before it's completely caught up.
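One way to realize the "pin to MAX_VALUE" idea above is to detect the overflow explicitly with `Math.addExact`, which throws `ArithmeticException` on long overflow. This is a sketch with illustrative names, not the PR's actual implementation.

```java
public class PinnedSum {
    // Sum a set of changelog offsets, pinning to Long.MAX_VALUE if the
    // running sum would overflow (expected to be a rare event).
    static long sumOrPin(final long... offsets) {
        long sum = 0L;
        for (final long offset : offsets) {
            try {
                sum = Math.addExact(sum, offset);
            } catch (final ArithmeticException overflow) {
                return Long.MAX_VALUE; // pin rather than wrap to a negative value
            }
        }
        return sum;
    }
}
```

At worst, pinning makes the instance look fully caught up a little early, as noted above.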
Should we try to unlock the rest? Also, we should always include the causing exception.
This now looks a little suspicious... In the absence of a value, should we assume standbys are caught up (0L), or that they are not (MAX_VALUE)?
Hm, I guess if we do that then the assignment algorithm should hopefully reduce to the original one during an upgrade while we're pinned to the lower subscription version.
The assignment will be a little screwy on the first VP rebalance while there are mixed subscription versions, but maybe that's not worth worrying about. Probably not worth adding another sentinel value either..
Actually, I reverse my position: I think we should add a (negative) OFFSET_SUM_UNKNOWN sentinel for this case. Either way we have to do some special handling; this way we can still enforce the endOffsetSum >= taskOffsetSum invariant and also don't share this pseudo-sentinel with the overflow case.
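The sentinel scheme being proposed could look roughly like the following. Both constant values are assumptions for illustration (the PR's actual values may differ); the point is that OFFSET_SUM_UNKNOWN is negative and distinct from the overflow pin.

```java
public class TaskOffsetSum {
    static final long OFFSET_SUM_UNKNOWN = -2L; // illustrative value only

    static long sumOffsets(final long... offsets) {
        long sum = 0L;
        for (final long offset : offsets) {
            if (offset < 0L) {
                // -1 (or any other negative offset) means the store's
                // position is unknown, so the whole task sum is unknown.
                return OFFSET_SUM_UNKNOWN;
            }
            sum += offset;
            if (sum < 0L) {
                // Only non-negative offsets are added, so a negative sum
                // can only mean overflow: pin, keeping this case separate
                // from the unknown-offset sentinel.
                return Long.MAX_VALUE;
            }
        }
        return sum;
    }
}
```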
Not taking a hard stance on this spelling, just aiming for consistency across the code base
cadonna
left a comment
@ableegoldman Thank you for the PR.
Here is my feedback.
Q: Wouldn't a log message on DEBUG-level make sense here?
Actually, what about warn?
prop: replace

    long offsetSum = 0;
    for (final long offset : changelogOffsets.values()) {
        offsetSum += offset;
    }
    return offsetSum;

with

    return changelogOffsets.values().stream().reduce(0L, Long::sum);
prop: rename to sumUpChangelogOffsets
How about sumOfChangelogOffsets?
Q: This question is unrelated to your change. Why is firstException an atomic variable? It is a local variable and shutdown() is only called from StreamThread which should be single-threaded. \cc @guozhangwang
I think it's just more convenient than the conditional block to check if it's null.
Fair enough if the performance is similar.
prop: replace

    for (final TaskId task : prevTasks()) {
        taskOffsetSumsCache.put(task, Task.LATEST_OFFSET);
    }
    for (final TaskId task : standbyTasks()) {
        taskOffsetSumsCache.put(task, 0L);
    }

with

    prevTasks().forEach(taskId -> taskOffsetSumsCache.put(taskId, Task.LATEST_OFFSET));
    standbyTasks().forEach(taskId -> taskOffsetSumsCache.put(taskId, 0L));
Not a big fan of this suggestion. Is there something wrong with loops?
I'll let you two fight this one out 😄
Nothing wrong with loops as forEach() is also a loop. If I can write a loop more concisely and I still easily get what they do, I would go for it. This is a proposal, so if @ableegoldman wants to follow it fine, if not I am also fine with it (after a bit of crying).
req: Why do you verify the setup code here?
I bet she copied the idiom from all of my tests. I did it because it makes the tests easier to read... I.e., you can visually see what state everything is in. Otherwise you'd have to reason about what state it would be in, given all the mocks above.
To verify that we're actually testing what we think we're testing, i.e. if due to some bug in TaskManager 0_0 did not actually reach the RUNNING state, we should fail fast. Otherwise, when the test fails it's not clear whether that's due to an unrelated bug rather than a bug in getTaskOffsetSums.
Also what @vvcephei said 🙂 . It's kind of hard to reason about the states based only on calls to TaskManager in this test class
Fair enough given the complexity of the setup. I guess what disturbs me most is the fact that the setup is so complex.
req: In general, I think this unit test is really large. For the sake of readability and modularization, you should split it into multiple tests: one unit test per case, plus one unit test for the composite scenario with all cases and different occurrences of the different cases. If you extract and parametrize the setup, there should not be too much code duplication. Additionally, a test where stateDirectory.listTaskDirectories() returns an empty array and a test with a stateless task are missing.
To clarify, you're suggesting to add smaller tests for each case (and edge cases) but also leave the composite test in as well?
Basically yes. I would extend the composite test with multiple occurrences of each case so that we also cover the scenario where we have, for example, n active running tasks, m active non-running tasks, k standby tasks, etc.
If this makes the test too clumsy, you could cover each of the n active running tasks, m active non-running tasks, k standby tasks, etc. in its own test and make one composite test with one occurrence of each case. Choose whichever is more readable.
My point is that the current test does not cover multiple occurrences of the same case.
4e3ebfe to fe8e2e5
cadonna
left a comment
@ableegoldman I really like how you cleaned up the tests. They are much more readable now.
Here is my feedback.
req: Please add a similar unit test as the one above but where stateDirectory.listTaskDirectories() returns null instead of an empty array.
I think we should actually just make listTaskDirectories() always return an empty File[] instead of null in some cases and new File[0] in others, as we treat both cases the same. WDYT?
Yeah, I agree with you. I checked the code and it should be OK to always return an empty array. However, could you open a second PR for it that we merge before this one, to keep this PR focused?
I'm not sure it's really worth doing as a separate PR, it's about 10 lines of code and is only really motivated by the work in this KIP?
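The normalization under discussion could be sketched as follows (class and method names are illustrative, not the actual StateDirectory code): `File.listFiles()` returns null when the directory does not exist or cannot be read, so always mapping that to an empty array lets callers treat "no task directories" uniformly.

```java
import java.io.File;

public class StateDirSketch {
    // Always return an empty array instead of null so callers don't
    // need to handle the two "nothing there" cases differently.
    static File[] listTaskDirectories(final File stateDir) {
        final File[] taskDirs = stateDir.listFiles(File::isDirectory);
        return taskDirs == null ? new File[0] : taskDirs;
    }
}
```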
req: If the release of the lock throws, we log an error message in releaseTaskDirLock(id) but swallow the exception here. This seems to me a false alarm. Imagine you analyse the log files and find an error that actually isn't one. I think, we should suppress the log message in this case.
How about just a warning in releaseTaskDirLock, and then log as an error if it's actually fatal?
I would prefer to not log anything in releaseTaskDirLock() and check the return value of releaseTaskDirLock() at caller side. That would avoid double log messages due to the same event.
On a different note, do we need to return the exception from releaseTaskDirLock()? We could just throw it and catch it where we call releaseTaskDirLock(). Am I missing something?
new question: why bother with this complexity? The background cleaner can do its thing when we're not rebalancing, right?
We are making some changes to the directory cleanup (unrelated to 441) after which it seems we could end up with a lot of empty directories that have to wait for the cleanup thread to be removed. Since every thread has to go through every task directory and try to lock it, I was thinking we should try to avoid blocking the cleanup thread as much as possible.
Note that currently, the cleanup thread runs every 10 min (by default). If we also choose 10 min as the default for the probing rebalance interval, and leave empty directories locked during a rebalance, we might never delete them (until the probing rebalances end, of course).
Maybe a better approach is to let the cleanup thread run slightly more frequently to remove empty directories only, and skip anything with remaining state & valid checkpoint. WDYT? If that sounds preferable I can make a ticket to follow up later
Nevermind the above -- assuming we get this PR into 2.6 as well we can just leverage the new listNonEmptyTaskDirectories and the problem becomes moot. Removed the unlocking attempt from this method for now
prop: Rename to shouldPinOffsetToLongMaxValueInCaseOfOverflow
Wow yeah that original test name made no sense, thanks for the prop
test this please
vvcephei
left a comment
Thanks @ableegoldman , I think this is just about ready. A few final remarks...
new question: why bother with this complexity? The background cleaner can do its thing when we're not rebalancing, right?
Still unsure if this is the right logic. What if we just return an "unknown sum" sentinel here? Then, if any store's offset is unknown, then the task's offset sum would also be reported as "unknown", which would let the assignor treat it as "not caught up".
In what cases might one store's offsets being invalid mean that every other store with valid offsets should not be taken into account?
That's not rhetorical, I really am asking. It's not clear to me exactly when you'd get this invalid offset response -- but if only one partition was having issues (whatever those may be) and the others all had valid, positive offsets, would we have to wipe out the entire state? Do we even check for negative offsets elsewhere in Streams? It's not clear to me that we do (in fact several places assume they are always positive and I believe would actually crash if not).
As far as I can see from the code, an offset of -1 is returned in the error case. Additionally, offset -1 is used during initialization of the producer response. I guess in the error case we do not write any offset into the checkpoint file and the offset map, so I suppose that the offset cannot become -1 in this code. So, my proposal would be to double-check my observation. In case it is correct, we can specify an invariant that the offset must be >= 0 and throw an IllegalStateException if the invariant is not satisfied.
We skip adding offsets to the map in the error case, but I asked Jason and apparently there may be a non-error case where -1 is returned with an idempotent producer. Just cc'ed you on the thread
you could avoid computing the addition twice by checking after this line if offsetSum < 0
That's what I did originally, Bruno suggested changing it to use else -- but, maybe I misunderstood his actual proposal...I'll set it back
My original proposal was:

    if (offset < 0L) {
        if (offset == -1L) {
            log.debug("Skipping unknown offset for changelog {}", changelog);
        } else {
            log.warn("Unexpected negative offset {} for changelog {}", offset, changelog);
        }
    } else {
        offsetSum += offset;
        if (offsetSum < 0) {
            log.warn("Sum of changelog offsets for task {} overflowed, pinning to Long.MAX_VALUE", id);
            return Long.MAX_VALUE;
        }
    }

I find this easier to read.
I see, I think I find it easier to read without the else as it makes it clear that we are just adding the offset, except in these two potential edge cases (overflow and negative). But we still need to come to a consensus about how to handle the negative case anyway
The test failure seems related:

Also, can you run the system tests before we merge this?
eaa31be to 35675a7
Streams system tests: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3830/
System test failures are unrelated to this PR (
Linking the test results here, since the jenkins build will be cleaned up: http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/?prefix=2020-03-12--001.1584073791--ableegoldman--KIP-441-send-task-offset-sums--298f1e2/
test this please

test this please

test this please

test this please

Unrelated test failure:
vvcephei
left a comment
Did a final pass, and it looks good to me.
@@ -347,8 +349,14 @@ private synchronized void cleanRemovedTasks(final long cleanupDelayMs,
     * @return The list of all the existing local directories for stream tasks
     */
    File[] listTaskDirectories() {
One last thing: Could you open another PR to add unit tests that check that the array is empty for the two edge cases?
Ack, added the tests to the Pt 2.5 PR
Actually, ended up doing some additional cleanup on the side so I split it out into a small PR; please give this a quick review.
#8304
* apache-github/trunk: (39 commits)
  MINOR: cleanup and add tests to StateDirectoryTest (apache#8304)
  HOTFIX: StateDirectoryTest should use Set instead of List (apache#8305)
  MINOR: Fix build and JavaDoc warnings (apache#8291)
  MINOR: Fix kafka.server.RequestQuotaTest missing new ApiKeys. (apache#8302)
  KAFKA-9712: Catch and handle exception thrown by reflections scanner (apache#8289)
  KAFKA-9670; Reduce allocations in Metadata Response preparation (apache#8236)
  MINOR: fix Scala 2.13 build error introduced in apache#8083 (apache#8301)
  MINOR: enforce non-negative invariant for checkpointed offsets (apache#8297)
  MINOR: comment apikey types in generated switch (apache#8201)
  MINOR: Fix typo in CreateTopicsResponse.json (apache#8300)
  KIP-546: Implement describeClientQuotas and alterClientQuotas. (apache#8083)
  KAFKA-6647: Do not delete the lock file while holding the lock (apache#8267)
  KAFKA-9677: Fix consumer fetch with small consume bandwidth quotas (apache#8290)
  KAFKA-9533: Fix JavaDocs of KStream.transformValues (apache#8298)
  MINOR: reuse pseudo-topic in FKJoin (apache#8296)
  KAFKA-6145: Pt 2. Include offset sums in subscription (apache#8246)
  KAFKA-9714; Eliminate unused reference to IBP in `TransactionStateManager` (apache#8293)
  KAFKA-9718; Don't log passwords for AlterConfigs in request logs (apache#8294)
  KAFKA-8768: DeleteRecords request/response automated protocol (apache#7957)
  KAFKA-9685: Solve Set concatenation perf issue in AclAuthorizer
  ...
KIP-441 Pt. 2: Compute sum of offsets across all stores/changelogs in a task and include them in the subscription.
Previously each thread would just encode every task on disk, but we now need to read the changelog file, which is unsafe to do without a lock on the task directory. So, each thread now encodes only its assigned active and standby tasks, and ignores any already-locked tasks.
In some cases there may be unowned and unlocked tasks on disk that were reassigned to another instance and haven't been cleaned up yet by the background thread. Each StreamThread makes a weak effort to lock any such task directories it finds, and if successful is then responsible for computing and reporting that task's offset sum (based on reading the checkpoint file).
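The per-thread flow described above can be sketched roughly as follows. All names here are hypothetical stand-ins, not the PR's actual API: owned tasks report their exact sums, and for unowned, unlocked task directories the thread makes a weak (non-blocking) lock attempt and reports the checkpointed sum on success.

```java
import java.util.HashMap;
import java.util.Map;

public class SubscriptionSketch {
    // Illustrative stand-in for the state directory operations assumed here.
    interface TaskDirs {
        String[] nonEmptyTaskDirectories();          // tasks with state on disk
        boolean tryLock(String taskId);              // weak, non-blocking lock attempt
        long checkpointedOffsetSum(String taskId);   // sum read from the checkpoint file
    }

    static Map<String, Long> taskOffsetSums(final Map<String, Long> ownedTaskSums,
                                            final TaskDirs dirs) {
        // Start with the sums for assigned active and standby tasks.
        final Map<String, Long> sums = new HashMap<>(ownedTaskSums);
        for (final String taskId : dirs.nonEmptyTaskDirectories()) {
            // Skip tasks we already reported, and directories whose lock
            // another thread holds (that thread will report them).
            if (!sums.containsKey(taskId) && dirs.tryLock(taskId)) {
                sums.put(taskId, dirs.checkpointedOffsetSum(taskId));
            }
        }
        return sums;
    }
}
```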
This PR therefore also addresses two orthogonal issues: