HOTFIX: don't close or wipe out someone else's state#8478
HOTFIX: don't close or wipe out someone else's state#8478guozhangwang merged 3 commits intoapache:trunkfrom
Conversation
|
Retest this please. |
guozhangwang
left a comment
There was a problem hiding this comment.
While reviewing the PR I had a meta question: the issue would happen if the previous owner thread has not released the lock while the new owner thread is going to close the task. The new thread will try to lock the dir while it is transiting to RESTORING, but the stateDirectory.lock(id) will return false immediately if the lock is at hands of another thread within this JVM, which is totally possible if one thread is closing the task while the other is grabbing the task.
So that means in StateManagerUtil#registerStateStores it is very possible that the lock(id) function would return false, which would be interpreted as fatal error -- more specifically, with incremental protocol since we always close the task before it is reassigned, it may not happen, but with the eager protocol it is very possible this will happen. Right? Is that an overlooked issue?
There was a problem hiding this comment.
Is this change intentional?
There was a problem hiding this comment.
Yep, sorry I meant to leave a comment. I noticed this is redundant since cleanupTask (invoked by closeTaskDirty) does the same thing
@guozhangwang Can you clarify? If the task fails to lock the directory in |
61e1b8f to
dad39b7
Compare
| public void testCloseStateManagerDirtyShallSwallowException() throws IOException { | ||
| final LogCaptureAppender appender = LogCaptureAppender.createAndRegister(); | ||
|
|
||
| public void testCloseStateManagerThrowsExceptionWhenDirty() throws IOException { |
There was a problem hiding this comment.
We actually were already throwing some ProcessorStateExceptions up through close even when unclean, which I think was the cause of a bug we resolved a few weeks ago. Now we just make no assumptions about whether this will throw or not, and catch any exceptions in StreamTask / StandbyTask if it's a dirty close.
|
test this |
1 similar comment
|
test this |
|
test this! |
|
test this |
1 similar comment
|
test this |
Ah yes, that's right. |
|
Failures unrelated: |
|
test this please |
guozhangwang
left a comment
There was a problem hiding this comment.
Made another pass, after addressing this should be good to merge.
| } else { | ||
| log.warn("Closing {} task {} uncleanly and swallows an exception", taskType, id, exception); | ||
| } | ||
| throw exception; |
There was a problem hiding this comment.
Nice clean, now we catch the exception in the caller.
| replayAll(); | ||
|
|
||
| StateManagerUtil.closeStateManager( | ||
| logger, "logPrefix:", false, true, stateManager, stateDirectory, TaskType.ACTIVE); |
There was a problem hiding this comment.
nit: we can set eosEnabled to false since we set it to true in the next test?
|
test this please |
1 similar comment
|
test this please |
* 'trunk' of github.com:apache/kafka: (28 commits) MINOR: cleanup RocksDBStore tests (apache#8510) KAFKA-9818: Fix flaky test in RecordCollectorTest (apache#8507) MINOR: reduce impact of trace logging in replica hot path (apache#8468) KAFKA-6145: KIP-441: Add test scenarios to ensure rebalance convergence (apache#8475) KAFKA-9881: Convert integration test to verify measurements from RocksDB to unit test (apache#8501) MINOR: improve test coverage for dynamic LogConfig(s) (apache#7616) MINOR: Switch order of sections on tumbling and hopping windows in streams doc. Tumbling windows are defined as "special case of hopping time windows" - but hopping windows currently only explained later in the docs. (apache#8505) KAFKA-9819: Fix flaky test in StoreChangelogReaderTest (apache#8488) HOTFIX: fix active task process ratio metric recording KAFKA-9796; Ensure broker shutdown is not stuck when Acceptor is waiting on connection queue (apache#8448) MINOR: Use streaming iterator with decompression buffer when building offset map (apache#8494) Add log message in release.py (apache#8461) KAFKA-9854 Re-authenticating causes mismatched parse of response (apache#8471) KAFKA-9838; Add log concurrency test and fix minor race condition (apache#8476) KAFKA-9703; Free up compression buffer after splitting a large batch KAFKA-9779: Add Stream system test for 2.5 release (apache#8378) KAFKA-7885: TopologyDescription violates equals-hashCode contract. (apache#6210) MINOR: KafkaApis#handleOffsetDeleteRequest does not group result correctly (apache#8485) HOTFIX: don't close or wipe out someone else's state (apache#8478) MINOR: add process(Test)Messages to the README (apache#8480) ...
When it comes to actually closing a task we now treat all states exactly the same, and call
StateManagerUtil#closeStateManagerregardless of whether it's inCREATEDorRESTORINGorRUNNINGUnfortunately
StateManagerUtildoesn't actually check to make sure that we actually own the lock for this task's state. During a dirty close with eos enabled, we wipe the state -- but in some cases, this means deleting the state out from under another StreamThread who is still in the process of revoking this task