HOTFIX: don't close or wipe out someone else's state by ableegoldman · Pull Request #8478 · apache/kafka

ableegoldman · 2020-04-14T02:06:20Z

When it comes to actually closing a task we now treat all states exactly the same, and call StateManagerUtil#closeStateManager regardless of whether it's in CREATED or RESTORING or RUNNING

Unfortunately StateManagerUtil doesn't actually check to make sure that we actually own the lock for this task's state. During a dirty close with eos enabled, we wipe the state -- but in some cases, this means deleting the state out from under another StreamThread who is still in the process of revoking this task

mjsax · 2020-04-14T04:59:23Z

Retest this please.

guozhangwang

While reviewing the PR I had a meta question: the issue would happen if the previous owner thread has not released the lock while the new owner thread is going to close the task. The new thread will try to lock the dir while it is transiting to RESTORING, but the stateDirectory.lock(id) will return false immediately if the lock is at hands of another thread within this JVM, which is totally possible if one thread is closing the task while the other is grabbing the task.

So that means in StateManagerUtil#registerStateStores it is very possible that the lock(id) function would return false, which would be interpreted as fatal error -- more specifically, with incremental protocol since we always close the task before it is reassigned, it may not happen, but with the eager protocol it is very possible this will happen. Right? Is that an overlooked issue?

guozhangwang · 2020-04-14T17:35:24Z

Is this change intentional?

Yep, sorry I meant to leave a comment. I noticed this is redundant since cleanupTask (invoked by closeTaskDirty) does the same thing

Thanks! SG.

ableegoldman · 2020-04-14T22:31:29Z

in StateManagerUtil#registerStateStores it is very possible that the lock(id) function would return false, which would be interpreted as fatal error

@guozhangwang Can you clarify? If the task fails to lock the directory in registerStateStores it throws a LockException, which is ultimately caught and handled by the TaskManager in tryToCompleteRestoration. It'll just stay in CREATED and try again on the next loop right?

ableegoldman · 2020-04-14T23:14:13Z

-    public void testCloseStateManagerDirtyShallSwallowException() throws IOException {
-        final LogCaptureAppender appender = LogCaptureAppender.createAndRegister();
-
+    public void testCloseStateManagerThrowsExceptionWhenDirty() throws IOException {


We actually were already throwing some ProcessorStateExceptions up through close even when unclean, which I think was the cause of a bug we resolved a few weeks ago. Now we just make no assumptions about whether this will throw or not, and catch any exceptions in StreamTask / StandbyTask if it's a dirty close.

guozhangwang · 2020-04-15T04:15:11Z

test this

guozhangwang · 2020-04-15T04:15:41Z

test this

guozhangwang · 2020-04-15T04:16:15Z

test this!

guozhangwang · 2020-04-15T04:54:12Z

test this

guozhangwang · 2020-04-15T05:21:30Z

test this

guozhangwang · 2020-04-15T05:21:47Z

@guozhangwang Can you clarify? If the task fails to lock the directory in registerStateStores it throws a LockException, which is ultimately caught and handled by the TaskManager in tryToCompleteRestoration. It'll just stay in CREATED and try again on the next loop right?

Ah yes, that's right.

ableegoldman · 2020-04-15T17:25:08Z

Failures unrelated:
ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown[exactly_once]

guozhangwang · 2020-04-15T17:40:13Z

test this please

guozhangwang

Made another pass, after addressing this should be good to merge.

guozhangwang · 2020-04-15T17:45:45Z

-            } else {
-                log.warn("Closing {} task {} uncleanly and swallows an exception", taskType, id, exception);
-            }
+            throw exception;


Nice clean, now we catch the exception in the caller.

guozhangwang · 2020-04-15T17:49:56Z

Thanks! SG.

guozhangwang · 2020-04-15T17:59:52Z

+        replayAll();
+
+        StateManagerUtil.closeStateManager(
+            logger, "logPrefix:", false, true, stateManager, stateDirectory, TaskType.ACTIVE);


nit: we can set eosEnabled to false since we set it to true in the next test?

guozhangwang · 2020-04-15T18:40:50Z

test this please

guozhangwang · 2020-04-15T18:41:08Z

test this please

* 'trunk' of github.com:apache/kafka: (28 commits) MINOR: cleanup RocksDBStore tests (apache#8510) KAFKA-9818: Fix flaky test in RecordCollectorTest (apache#8507) MINOR: reduce impact of trace logging in replica hot path (apache#8468) KAFKA-6145: KIP-441: Add test scenarios to ensure rebalance convergence (apache#8475) KAFKA-9881: Convert integration test to verify measurements from RocksDB to unit test (apache#8501) MINOR: improve test coverage for dynamic LogConfig(s) (apache#7616) MINOR: Switch order of sections on tumbling and hopping windows in streams doc. Tumbling windows are defined as "special case of hopping time windows" - but hopping windows currently only explained later in the docs. (apache#8505) KAFKA-9819: Fix flaky test in StoreChangelogReaderTest (apache#8488) HOTFIX: fix active task process ratio metric recording KAFKA-9796; Ensure broker shutdown is not stuck when Acceptor is waiting on connection queue (apache#8448) MINOR: Use streaming iterator with decompression buffer when building offset map (apache#8494) Add log message in release.py (apache#8461) KAFKA-9854 Re-authenticating causes mismatched parse of response (apache#8471) KAFKA-9838; Add log concurrency test and fix minor race condition (apache#8476) KAFKA-9703; Free up compression buffer after splitting a large batch KAFKA-9779: Add Stream system test for 2.5 release (apache#8378) KAFKA-7885: TopologyDescription violates equals-hashCode contract. (apache#6210) MINOR: KafkaApis#handleOffsetDeleteRequest does not group result correctly (apache#8485) HOTFIX: don't close or wipe out someone else's state (apache#8478) MINOR: add process(Test)Messages to the README (apache#8480) ...

mjsax added the streams label Apr 14, 2020

guozhangwang reviewed Apr 14, 2020

View reviewed changes

ableegoldman added 2 commits April 14, 2020 15:35

need to ensure lock

682c27f

fix up tests, add new ones

dad39b7

ableegoldman force-pushed the HOTFIX-dont-close-someone-elses-state branch from 61e1b8f to dad39b7 Compare April 14, 2020 23:07

ableegoldman commented Apr 14, 2020

View reviewed changes

guozhangwang reviewed Apr 15, 2020

View reviewed changes

github review

1ad8282

guozhangwang merged commit 640be46 into apache:trunk Apr 15, 2020

ableegoldman deleted the HOTFIX-dont-close-someone-elses-state branch April 20, 2020 22:25

Conversation

ableegoldman commented Apr 14, 2020

Uh oh!

mjsax commented Apr 14, 2020

Uh oh!

guozhangwang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ableegoldman commented Apr 14, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

ableegoldman commented Apr 15, 2020

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

guozhangwang left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

guozhangwang commented Apr 15, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants