KAFKA-7672 : force write checkpoint during StreamTask #suspend#6115
KAFKA-7672 : force write checkpoint during StreamTask #suspend#6115guozhangwang merged 12 commits intoapache:trunkfrom
Conversation
df7ddbe to
61824db
Compare
85648ae to
d76d00d
Compare
|
@mjsax @guozhangwang Could you take another look to see if this makes sense? |
342d6e0 to
8dbd301
Compare
|
@guozhangwang @mjsax Maybe another look? |
|
@guozhangwang @mjsax Take a look when you got time? |
|
@mjsax Mind take another look? |
mjsax
left a comment
There was a problem hiding this comment.
The change itself LGTM.
I would recommend to update the comment a little bit.
Call for second review @guozhangwang @bbejeck @vvcephei @ableegoldman
Can we add a test that exposes the issue, ie, force the race condition? (Could maybe also be done as a follow up) -- this is a critical bug fix and we should get it into 2.2 (ie, need to be merged by 2/15).
There was a problem hiding this comment.
Which check? I don't see any if clause.
There was a problem hiding this comment.
I was thinking the same. Did you mean that the other change on this PR eliminates the chance for a double checkpoint file writes?
There was a problem hiding this comment.
Oh, the check is just pointing to eosEnabled. Let me update the comment.
|
@abbccdda One more thing: can we also add to |
|
Another thought that just crossed my mind while digging into this further: Atm, there is a race condition between writing and reading the checkpoint file. Wouldn't it be simpler to avoid this race condition by moving from This would simplify the logic IMHO. |
|
Actually, @mjsax , after reviewing the ticket, I think 6113 (#6113) is the actual bugfix that must go in. So I don't think that @abbccdda needs to add the If we can merge 6115 as well, which contains a significant performance improvement for restore in some cases, then it would be ideal. |
#6115 includes the |
vvcephei
left a comment
There was a problem hiding this comment.
Hi @abbccdda ,
Thanks for the PR.
I had one comment: it seems like we should update more of the state manager APIs to indicate that commit is no longer a guaranteed part of close. Otherwise, we could easily introduce a bug later, calling the method with acked offsets, not realizing that they will actually just be ignored.
Thanks,
-John
There was a problem hiding this comment.
If we're going to ignore the ackedOffsets argument here, it seems like we should revisit the interface.
There are multiple paths that lead to this method, and it's not analytically obvious why it's ok to just ignore the argument and skip checkpointing here.
If we want to move the checkpoint up in the lifecycle to suspend, then it seems like we should do so holistically and remove the checkpointable offsets from the paramters of close. What do you think?
There was a problem hiding this comment.
@vvcephei @bbejeck I checked the callers of close(ackedOffsets) function,
- Abstract task is ok since our fix is to make sure state manager could checkpoint when suspending
- GlobalStateUpdateTask is also fine since the global state manager (using GlobalStateManagerImpl) is overriding this function who will do offset checkpoint.
So I think we should be safe to ignore this in the base class, but keep the acked offset parameter in order for subclass to checkpoint for now, which minimizing the risk of this PR. Thank you!
There was a problem hiding this comment.
What "subclasses" are you referring to? Both classes you mention are internal and not extended. I don't see any risk that we need to minimize?
There was a problem hiding this comment.
I mean GlobalStateManagerImpl implements the close(offsets) @mjsax
There was a problem hiding this comment.
Could wel update the code in GlobalStateUpdateTask to:
public void close() throws IOException {
stateMgr.checkpoint(offsets);
stateMgr.close();
}
There was a problem hiding this comment.
Oh, my point is that stateMgr in GlobalStateUpdateTask is taking in GlobalStateManagerImpl whose close(offsets) will do the checkpoint operation. So I guess we don't need to call checkpoint explicitly here right? @mjsax
bbejeck
left a comment
There was a problem hiding this comment.
LGTM, but I've left some comments regarding un-used method parameters as a result of this change.
There was a problem hiding this comment.
I was thinking the same. Did you mean that the other change on this PR eliminates the chance for a double checkpoint file writes?
There was a problem hiding this comment.
One note, with the removal of this line, the ackedOffsets is no longer used at all in this method. However, it's part of the StateManager#close interface, but it's not safe to remove as GlobalStateManager also implements the StateManager interface. Which to me, brings up two questions
- Do we need to apply the same approach for global state stores?
- Should we refactor the
StateManagerinterface to have a no-argsclose()method?
Given the time constraints, I don't think should hold up this PR but it's something that IMHO we shouldn't let slip.
guozhangwang
left a comment
There was a problem hiding this comment.
Just laying out the context here (admittedly, this piece of logic is a bit hard to understand due to messy code structures, and we have a TODO task to clean up this tech debt soon):
StreamTask:
-
With EOS turned off, today we are actually double-checkpointing necessarily, once in suspend, and once in close. In fact, only one checkpoint is needed. This is known issue but did not incur any correctness issue, just unnecessary overhead.
-
With EOS turned off, we do not checkpoint on suspend, and only on close.
StandbyTask:
We always write the checkpoint in flushAndCheckpointState, which is called in both commit and suspend. Note that although we pass empty map into the checkpoint call, it is okay since it will be updated with the committed offsets internally. However, in fact in commit it is okay to just pass in null to NOT checkpoint at all, since there should be no offset change between the latest flushAndCheckpointState and the close call during normal processing (i.e. we are also duplicate-checkpointing here).
GlobalUpdateTask:
We pass in offsets in both stateConsumer.pollAndUpdate() and stateConsumer.close() but again this is the same duplicate-checkpointing issue as in StreamsTask case 1), because the offsets would never changed between the previous checkpoint call during normal run, and if there is an exception in between -- e.g. you checkpointed offset 100 in the latest pollAndUpdate, and in the next pollAndUpdate call you get an exception at offset 105, and hence jump into the finally blocked to call close -- we should not checkpoint 105 in this case since the data may not be flushed to the store at all, but rather just keep the checkpoint file with 100 to maintain at least once semantics.
With all this, let's do the following:
- in StreamTask, the principle is that we always checkpoint on flush, and never checkpoint on close any more, regardless of EOS. And because of that:
- Remove
final Map<TopicPartition, Long> ackedOffsetsfrom the close call, as @bbejeck suggested. In the StreamTask caller, we can safely remove the parameter since we are now never checkpointing on close. - Remove the parameter in
AbstractTask#void closeStateManager(final boolean writeCheckpoint)as well and also remove the condition that writes checkpoint or not also. Since, again, no matter if we are closing cleanly or committing successfully, we would not write checkpoint files any more. And we can also remove the corresponding parameters likecleanandcommitSuccesfullyin its ancestor call trace as well. - remove the parameter in
stateMgr.close(offsets);inGlobalUpdateTask#close().
WDYT?
|
Actually, about the action 0) above: since we now write checkpoint file at suspend as well when EOS is turned on, we should delete the file upon resumption we should also delete the checkpoint file as we did at the construction time: so that the semantics is guaranteed: after resumption, if we get a crash, we should enforce bootstrapping from the beginning. |
Those tests failed locally. |
|
Among the tests, I think |
433922a to
42a24cc
Compare
433922a to
d92b49c
Compare
|
Thanks @guozhangwang for the bug fix! I also rebased my test removal changes on top |
|
chechstyle error -- can you update the pr to fix it? @abbccdda |
There was a problem hiding this comment.
To make the semantics clearer, I am wondering if we should use two nested if:
if(eosEnabled && !clean) {
try {
if (checkpoint != null) { ... }
} catch(...) {...}
}
This make it clear, that it's a EOS condition (first var to be checked) and for the EOS, do something if !clean.
The checkpoint != null if just a guard against NPE and has nothing to do with the actual logic
There was a problem hiding this comment.
Why do we need to include clean here? (Was this another bug?) I thought for standbys this does not matter?
There was a problem hiding this comment.
Good point, since standby tasks do not have eos anyways today.
There was a problem hiding this comment.
nit: "Did not find checkpoint..."
d92b49c to
161eeb3
Compare
|
Thanks for the follow ups @guozhangwang LGTM. |
This fix is aiming for #2 issue pointed out within https://issues.apache.org/jira/browse/KAFKA-7672 In the current setup, we do offset checkpoint file write when EOS is turned on during #suspend, which introduces the potential race condition during StateManager #closeSuspend call. To mitigate the problem, we attempt to always write checkpoint file in #suspend call. Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, John Roesler <john@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
|
Cherry-picked to 2.2 as well. |
|
@guozhangwang Thanks a lot for making the fix work! |
* AK/trunk: (36 commits) KAFKA-7962: Avoid NPE for StickyAssignor (apache#6308) Address flakiness of CustomQuotaCallbackTest#testCustomQuotaCallback (apache#6330) KAFKA-7918: Inline generic parameters Pt. II: RocksDB Bytes Store and Memory LRU Caches (apache#6327) MINOR: fix parameter naming (apache#6316) KAFKA-7956 In ShutdownableThread, immediately complete the shutdown if the thread has not been started (apache#6218) MINOR: Refactor replica log dir fetching for improved logging (apache#6313) [TRIVIAL] Remove unused StreamsGraphNode#repartitionRequired (apache#6227) MINOR: Increase produce timeout to 120 seconds (apache#6326) KAFKA-7918: Inline generic parameters Pt. I: in-memory key-value store (apache#6293) MINOR: Fix line break issue in upgrade notes (apache#6320) KAFKA-7972: Use automatic RPC generation in SaslHandshake MINOR: Enable capture of full stack trace in StreamTask#process (apache#6310) KAFKA-7938: Fix test flakiness in DeleteConsumerGroupsTest (apache#6312) KAFKA-7937: Fix Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsNotExistingGroup (apache#6311) MINOR: Update docs to say 2.2 (apache#6315) KAFKA-7672 : force write checkpoint during StreamTask #suspend (apache#6115) KAFKA-7961; Ignore assignment for un-subscribed partitions (apache#6304) KAFKA-7672: Restoring tasks need to be closed upon task suspension (apache#6113) KAFKA-7864; validate partitions are 0-based (apache#6246) KAFKA-7492 : Updated javadocs for aggregate and reduce methods returning null behavior. (apache#6285) ...
…he#6115) This fix is aiming for #2 issue pointed out within https://issues.apache.org/jira/browse/KAFKA-7672 In the current setup, we do offset checkpoint file write when EOS is turned on during #suspend, which introduces the potential race condition during StateManager #closeSuspend call. To mitigate the problem, we attempt to always write checkpoint file in #suspend call. Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, John Roesler <john@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
This fix is aiming for #2 issue pointed out within https://issues.apache.org/jira/browse/KAFKA-7672
In the current setup, we do offset checkpoint file write when EOS is turned on during #suspend, which introduces the potential race condition during StateManager #closeSuspend call. To mitigate the problem, we attempt to always write checkpoint file in #suspend call.
Committer Checklist (excluded from commit message)