KAFKA-7672 : force write checkpoint during StreamTask #suspend#6115

Merged
guozhangwang merged 12 commits into apache:trunk from abbccdda:bug_fix on Feb 23, 2019
Conversation

@abbccdda

This fix aims at issue #2 pointed out in https://issues.apache.org/jira/browse/KAFKA-7672
In the current setup, the offset checkpoint file write during #suspend is conditional on EOS, which introduces a potential race condition during the StateManager #closeSuspend call. To mitigate the problem, we now always write the checkpoint file in the #suspend call.
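The gist of the change can be illustrated with a minimal, self-contained sketch. All class and method names below (ToyStreamTask etc.) are illustrative stand-ins, not the actual Kafka Streams classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of the fix: suspend() now writes the checkpoint
// unconditionally, instead of making the write conditional on EOS (the old
// behavior that raced with the later closeSuspended call).
class ToyStreamTask {
    private final boolean eosEnabled;
    private final Map<String, Long> checkpointFile = new HashMap<>(); // stands in for the on-disk OffsetCheckpoint
    private final Map<String, Long> ackedOffsets = new HashMap<>();

    ToyStreamTask(final boolean eosEnabled) {
        this.eosEnabled = eosEnabled;
    }

    void process(final String partition, final long offset) {
        ackedOffsets.put(partition, offset);
    }

    // Before the fix, the write below was guarded by an EOS check; now it is
    // forced on every suspend, regardless of eosEnabled.
    void suspend() {
        checkpointFile.putAll(ackedOffsets);
    }

    Map<String, Long> readCheckpoint() {
        return new HashMap<>(checkpointFile);
    }
}
```

Even with EOS enabled, a suspend after processing leaves a readable checkpoint behind in this sketch, which is the invariant the PR establishes.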

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@abbccdda force-pushed the bug_fix branch 6 times, most recently from df7ddbe to 61824db on January 14, 2019 04:10
@mjsax mjsax added the streams label Jan 14, 2019
@abbccdda force-pushed the bug_fix branch 2 times, most recently from 85648ae to d76d00d on January 18, 2019 02:01
@abbccdda
Author

@mjsax @guozhangwang Could you take another look to see if this makes sense?

@abbccdda force-pushed the bug_fix branch 6 times, most recently from 342d6e0 to 8dbd301 on January 22, 2019 17:57
@abbccdda
Author

@guozhangwang @mjsax Maybe another look?

@abbccdda
Author

abbccdda commented Feb 2, 2019

@guozhangwang @mjsax Could you take a look when you get time?

@abbccdda
Author

abbccdda commented Feb 7, 2019

@mjsax Mind taking another look?

Member

@mjsax mjsax left a comment


The change itself LGTM.

I would recommend updating the comment a little bit.

Call for second review @guozhangwang @bbejeck @vvcephei @ableegoldman

Can we add a test that exposes the issue, i.e., forces the race condition? (Could maybe also be done as a follow-up.) This is a critical bug fix and we should get it into 2.2 (i.e., it needs to be merged by 2/15).

Member


Which check? I don't see any if clause.

Member


I was thinking the same. Did you mean that the other change in this PR eliminates the chance of double checkpoint file writes?

Author


Oh, the check is just pointing to eosEnabled. Let me update the comment.

@mjsax
Member

mjsax commented Feb 12, 2019

@abbccdda One more thing: can we also add

completedRestorers.clear();

to StoreChangelogReader#reset() as pointed out on the ticket (and add a test if possible).

@mjsax
Member

mjsax commented Feb 12, 2019

Another thought that just crossed my mind while digging into this further: at the moment, there is a race condition between writing and reading the checkpoint file. Wouldn't it be simpler to avoid this race condition by moving

        // load the checkpoint information
        checkpointableOffsets.putAll(checkpoint.read());
        if (eosEnabled) {
            // delete the checkpoint file after finish loading its stored offsets
            checkpoint.delete();
            checkpoint = null;
        }

from the ProcessorStateManager constructor to ProcessorStateManager#register()?

This would simplify the logic IMHO.
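The suggested move could look roughly like the following sketch. These are toy stand-ins with illustrative names, not the actual ProcessorStateManager:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: load -- and, under EOS, "delete" -- the checkpoint file lazily on
// the first register() call instead of in the constructor, so reads and
// writes of the file no longer race across construction and suspension.
class ToyStateManager {
    private final boolean eosEnabled;
    private final Map<String, Long> checkpointableOffsets = new HashMap<>();
    private Map<String, Long> checkpoint; // stands in for the on-disk OffsetCheckpoint
    private boolean checkpointLoaded = false;

    ToyStateManager(final boolean eosEnabled, final Map<String, Long> checkpointFile) {
        this.eosEnabled = eosEnabled;
        this.checkpoint = checkpointFile;
    }

    void register(final String storeName) {
        if (!checkpointLoaded) {
            // load the checkpoint information on first use, not at construction
            checkpointableOffsets.putAll(checkpoint);
            if (eosEnabled) {
                // drop the checkpoint after loading its stored offsets
                checkpoint = null;
            }
            checkpointLoaded = true;
        }
        // ... register the store itself ...
    }

    Long checkpointedOffsetFor(final String storeName) {
        return checkpointableOffsets.get(storeName);
    }
}
```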

@vvcephei
Contributor

Actually, @mjsax, after reviewing the ticket, I think #6113 is the actual bugfix that must go in. So I don't think that @abbccdda needs to add the clear in this PR.

If we can merge 6115 as well, which contains a significant performance improvement for restore in some cases, then it would be ideal.

@bbejeck
Member

bbejeck commented Feb 12, 2019

If we can merge 6115 as well, which contains a significant performance improvement for restore in some cases, then it would be ideal.

#6115 includes the completedRestorers.clear() call so I think we just need to make sure both PRs get in.

Contributor

@vvcephei vvcephei left a comment


Hi @abbccdda ,

Thanks for the PR.

I had one comment: it seems like we should update more of the state manager APIs to indicate that commit is no longer a guaranteed part of close. Otherwise, we could easily introduce a bug later, calling the method with acked offsets, not realizing that they will actually just be ignored.

Thanks,
-John

Contributor


If we're going to ignore the ackedOffsets argument here, it seems like we should revisit the interface.

There are multiple paths that lead to this method, and it's not analytically obvious why it's ok to just ignore the argument and skip checkpointing here.

If we want to move the checkpoint up in the lifecycle to suspend, then it seems like we should do so holistically and remove the checkpointable offsets from the parameters of close. What do you think?

Author


@vvcephei @bbejeck I checked the callers of the close(ackedOffsets) function:

  1. AbstractTask is OK, since our fix makes sure the state manager can checkpoint when suspending.
  2. GlobalStateUpdateTask is also fine, since the global state manager (GlobalStateManagerImpl) overrides this function and does the offset checkpoint itself.

So I think we are safe to ignore this in the base class, but keep the acked-offsets parameter so the subclass can checkpoint for now, which minimizes the risk of this PR. Thank you!

Member


What "subclasses" are you referring to? Both classes you mention are internal and not extended. I don't see any risk that we need to minimize?

Author


I mean that GlobalStateManagerImpl implements close(offsets) @mjsax

Member


Could we update the code in GlobalStateUpdateTask to:

    public void close() throws IOException {
        stateMgr.checkpoint(offsets);
        stateMgr.close();
    }

Author


Oh, my point is that the stateMgr in GlobalStateUpdateTask is a GlobalStateManagerImpl, whose close(offsets) already does the checkpoint operation. So I guess we don't need to call checkpoint explicitly here, right? @mjsax

Author


@mjsax Thoughts?

Contributor


@mjsax @abbccdda @bbejeck I left a comment below for this issue. Please take a look.

Member

@bbejeck bbejeck left a comment


LGTM, but I've left some comments regarding unused method parameters as a result of this change.


Member


One note: with the removal of this line, ackedOffsets is no longer used at all in this method. However, it's part of the StateManager#close interface, and it's not safe to remove since GlobalStateManager also implements the StateManager interface. To me, that brings up two questions:

  1. Do we need to apply the same approach for global state stores?
  2. Should we refactor the StateManager interface to have a no-args close() method?

Given the time constraints, I don't think this should hold up the PR, but it's something that IMHO we shouldn't let slip.

Contributor

@guozhangwang guozhangwang left a comment


Just laying out the context here (admittedly, this piece of logic is a bit hard to understand due to messy code structure, and we have a TODO task to clean up this tech debt soon):

StreamTask:

  1. With EOS turned off, today we are actually double-checkpointing unnecessarily: once in suspend and once in close. In fact, only one checkpoint is needed. This is a known issue but did not incur any correctness issue, just unnecessary overhead.

  2. With EOS turned on, we do not checkpoint on suspend, only on close.

StandbyTask:

We always write the checkpoint in flushAndCheckpointState, which is called in both commit and suspend. Note that although we pass an empty map into the checkpoint call, that is okay since it will be updated with the committed offsets internally. In fact, in commit it would be okay to just pass in null and NOT checkpoint at all, since there should be no offset change between the latest flushAndCheckpointState and the close call during normal processing (i.e., we are also duplicate-checkpointing here).

GlobalStateUpdateTask:

We pass in offsets in both stateConsumer.pollAndUpdate() and stateConsumer.close(), but again this is the same duplicate-checkpointing issue as in StreamTask case 1, because the offsets never change between the previous checkpoint call and close during a normal run. And if there is an exception in between -- e.g., you checkpointed offset 100 in the latest pollAndUpdate, and in the next pollAndUpdate call you get an exception at offset 105 and hence jump into the finally block to call close -- we should not checkpoint 105, since the data may not have been flushed to the store at all; rather, we should keep the checkpoint file at 100 to maintain at-least-once semantics.

With all this, let's do the following:

  1. In StreamTask, the principle is that we always checkpoint on flush, and never checkpoint on close anymore, regardless of EOS. Because of that:
  2. Remove final Map<TopicPartition, Long> ackedOffsets from the close call, as @bbejeck suggested. In the StreamTask caller, we can safely remove the parameter since we now never checkpoint on close.
  3. Remove the parameter in AbstractTask#closeStateManager(final boolean writeCheckpoint) as well, and also remove the condition that decides whether to write the checkpoint. Again, no matter whether we are closing cleanly or committing successfully, we would not write checkpoint files anymore. We can also remove the corresponding parameters like clean and commitSuccesfully in its ancestor call trace.
  4. Remove the parameter in stateMgr.close(offsets) in GlobalStateUpdateTask#close().

WDYT?
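The interface cleanup in points 2-4 amounts to narrowing close() to a no-args method. A hedged sketch with illustrative names (not the actual StateManager interface):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed API shape: since the checkpoint is now always
// written on flush/suspend, close() no longer takes offsets and never
// writes the checkpoint file.
interface ToyStateManagerApi {
    void checkpoint(Map<String, Long> offsets); // called on flush / suspend
    void close();                               // no-args: releases resources only
}

class ToyStateManagerImpl implements ToyStateManagerApi {
    final Map<String, Long> checkpointFile = new HashMap<>();
    boolean closed = false;

    @Override
    public void checkpoint(final Map<String, Long> offsets) {
        checkpointFile.putAll(offsets);
    }

    @Override
    public void close() {
        closed = true; // deliberately no checkpoint write here
    }
}
```

With this shape, callers can no longer accidentally pass acked offsets into close() and have them silently ignored, which was the concern raised earlier in the review.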

@guozhangwang
Copy link
Copy Markdown
Contributor

Actually, one more thing about the above: since we now write the checkpoint file at suspend as well when EOS is turned on, upon resumption we should also delete the checkpoint file, as we do at construction time:

if (eosEnabled) {
    // delete the checkpoint file after finishing loading its stored offsets
    checkpoint.delete();
    checkpoint = null;
}

so that the semantics are guaranteed: after resumption, if we get a crash, we enforce bootstrapping from the beginning.
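A toy sketch of the proposed resume-time deletion (illustrative names, not the real task classes):

```java
import java.util.HashMap;
import java.util.Map;

// Under EOS, deleting the checkpoint file on resume means a crash after
// resumption forces a full bootstrap from the changelog, rather than
// trusting a checkpoint that no longer reflects uncommitted state.
class ToyResumableTask {
    private final boolean eosEnabled;
    private Map<String, Long> checkpointFile; // null models "file deleted"

    ToyResumableTask(final boolean eosEnabled) {
        this.eosEnabled = eosEnabled;
        this.checkpointFile = new HashMap<>();
    }

    void suspend(final Map<String, Long> offsets) {
        if (checkpointFile == null) {
            checkpointFile = new HashMap<>();
        }
        checkpointFile.putAll(offsets); // checkpoint is always written on suspend now
    }

    void resume() {
        if (eosEnabled) {
            checkpointFile = null; // delete the file, mirroring the constructor-time logic
        }
    }

    boolean hasCheckpoint() {
        return checkpointFile != null;
    }
}
```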

@guozhangwang
Contributor

EosIntegrationTest.shouldNotViolateEosIfOneTaskFailsWithState
GlobalStateManagerImplTest.shouldCheckpointRestoredOffsetsToFile
GlobalStateManagerImplTest.shouldWriteCheckpointsOnClose
GlobalStateTaskTest.shouldCloseStateManagerWithOffsets
StreamTaskTest.shouldNotCheckpointOffsetsOnCommitIfEosIsEnabled

Those tests failed locally.

@abbccdda
Author

Among the tests, I think shouldCheckpointRestoredOffsetsToFile, shouldWriteCheckpointsOnClose, shouldCloseStateManagerWithOffsets, and shouldNotCheckpointOffsetsOnCommitIfEosIsEnabled could be removed, because our change basically breaks the fundamental assumption behind them.
I will keep looking at EosIntegrationTest to figure out whether we should do something to make it work.

@abbccdda
Author

Thanks @guozhangwang for the bug fix! I also rebased my test-removal changes on top.

@mjsax
Member

mjsax commented Feb 22, 2019

checkstyle error -- can you update the PR to fix it? @abbccdda

Member


nit: remove this

Member

@mjsax mjsax Feb 23, 2019


To make the semantics clearer, I am wondering if we should use two nested ifs:

if (eosEnabled && !clean) {
    try {
        if (checkpoint != null) { ... }
    } catch (...) { ... }
}

This makes it clear that it's an EOS condition (the first variable to be checked), and that for EOS we do something if !clean.

The checkpoint != null check is just a guard against an NPE and has nothing to do with the actual logic.

Member


Why do we need to include clean here? (Was this another bug?) I thought for standbys this does not matter?

Contributor


Good point, since standby tasks do not have EOS anyway today.

Member


nit: "Did not find checkpoint..."

@mjsax
Member

mjsax commented Feb 23, 2019

Thanks for the follow-ups @guozhangwang. LGTM.

@guozhangwang guozhangwang merged commit 1f9aa01 into apache:trunk Feb 23, 2019
guozhangwang pushed a commit that referenced this pull request Feb 23, 2019
This fix is aiming for #2 issue pointed out within https://issues.apache.org/jira/browse/KAFKA-7672
In the current setup, we do offset checkpoint file write when EOS is turned on during #suspend, which introduces the potential race condition during StateManager #closeSuspend call. To mitigate the problem, we attempt to always write checkpoint file in #suspend call.

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>,  John Roesler <john@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
@guozhangwang
Contributor

Cherry-picked to 2.2 as well.

@abbccdda
Author

@guozhangwang Thanks a lot for making the fix work!

jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* AK/trunk: (36 commits)
  KAFKA-7962: Avoid NPE for StickyAssignor (apache#6308)
  Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback (apache#6330)
  KAFKA-7918: Inline generic parameters Pt. II: RocksDB Bytes Store and Memory LRU Caches (apache#6327)
  MINOR: fix parameter naming (apache#6316)
  KAFKA-7956 In ShutdownableThread, immediately complete the shutdown if the thread has not been started (apache#6218)
  MINOR: Refactor replica log dir fetching for improved logging (apache#6313)
  [TRIVIAL] Remove unused StreamsGraphNode#repartitionRequired (apache#6227)
  MINOR: Increase produce timeout to 120 seconds (apache#6326)
  KAFKA-7918: Inline generic parameters Pt. I: in-memory key-value store (apache#6293)
  MINOR: Fix line break issue in upgrade notes (apache#6320)
  KAFKA-7972: Use automatic RPC generation in SaslHandshake
  MINOR: Enable capture of full stack trace in StreamTask#process (apache#6310)
  KAFKA-7938: Fix test flakiness in DeleteConsumerGroupsTest (apache#6312)
  KAFKA-7937: Fix Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsNotExistingGroup (apache#6311)
  MINOR: Update docs to say 2.2 (apache#6315)
  KAFKA-7672 : force write checkpoint during StreamTask #suspend (apache#6115)
  KAFKA-7961; Ignore assignment for un-subscribed partitions (apache#6304)
  KAFKA-7672: Restoring tasks need to be closed upon task suspension (apache#6113)
  KAFKA-7864; validate partitions are 0-based (apache#6246)
  KAFKA-7492 : Updated javadocs for aggregate and reduce methods returning null behavior. (apache#6285)
  ...
pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…he#6115)

This fix is aiming for #2 issue pointed out within https://issues.apache.org/jira/browse/KAFKA-7672
In the current setup, we do offset checkpoint file write when EOS is turned on during #suspend, which introduces the potential race condition during StateManager #closeSuspend call. To mitigate the problem, we attempt to always write checkpoint file in #suspend call.

Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>,  John Roesler <john@confluent.io>, Bill Bejeck <bbejeck@gmail.com>