KAFKA-5998: fix checkpointableOffsets handling by vvcephei · Pull Request #7030 · apache/kafka

vvcephei · 2019-07-03T17:01:09Z

fix checkpoint file warning by filtering checkpointable offsets per task
clean up state manager hierarchy to prevent similar bugs

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

vvcephei · 2019-07-03T17:03:47Z

 * of Global State Stores. There is only ever 1 instance of this class per Application Instance.
 */
-public class GlobalStateManagerImpl extends AbstractStateManager implements GlobalStateManager {
+public class GlobalStateManagerImpl implements GlobalStateManager {


Stop sharing mutable state between a superclass and subclass. The only reason to do it was to support the re-initialization logic, but the checkpoint map can just as easily be passed in as a parameter.

vvcephei · 2019-07-03T17:06:01Z

+        eosEnabled = StreamsConfig.EXACTLY_ONCE.equals(config.getString(StreamsConfig.PROCESSING_GUARANTEE_CONFIG));
+        baseDir = stateDirectory.globalStateDir();
+        checkpointFile = new OffsetCheckpoint(new File(baseDir, CHECKPOINT_FILE_NAME));
+        checkpointFileCache = new HashMap<>();


It took me a really long time to decipher the actual purposes of "checkpoint" and "checkpointableOffsets". I've renamed them to "checkpointFile" and "checkpointFileCache" to be more self-documenting.

vvcephei · 2019-07-03T17:08:36Z

                    try {
                        entry.getValue().get().close();
-                    } catch (final Exception e) {
+                    } catch (final RuntimeException e) {


Since this PR is to clean up difficult-to-maintain code, I also included other cleanups, like dropping unnecessary this modifiers, restricting too-broad catch blocks, etc.

vvcephei · 2019-07-03T17:08:47Z

                                   .append(entry.getKey())
                                   .append(". Reason: ")
-                                   .append(e.toString())
+                                   .append(e)


unnecessary toString
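A minimal sketch of why the explicit call is redundant: `StringBuilder.append(Object)` converts its argument with `String.valueOf`, so both forms build the same text for a non-null exception, and the shorter form also tolerates `null`. The class and method names below are hypothetical and exist only for illustration.

```java
final class AppendDemo {
    // Old form: explicit toString() on the exception.
    static String viaToString(final Exception e) {
        return new StringBuilder().append("Reason: ").append(e.toString()).toString();
    }

    // New form: append(Object) calls String.valueOf(e) internally.
    static String viaAppend(final Exception e) {
        return new StringBuilder().append("Reason: ").append(e).toString();
    }

    public static void main(final String[] args) {
        final Exception e = new IllegalStateException("store closed");
        System.out.println(viaToString(e).equals(viaAppend(e)));
        // A null reference is rendered as the text "null" instead of throwing.
        System.out.println(viaAppend(null));
    }
}
```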

vvcephei · 2019-07-03T17:09:08Z


    // TODO: this map does not work with customized grouper where multiple partitions
-    // of the same topic can be assigned to the same topic.
+    // of the same topic can be assigned to the same task.


pretty sure this was a typo

vvcephei · 2019-07-03T17:19:38Z

 import static org.apache.kafka.streams.state.internals.WrappedStateStore.isTimestamped;

-abstract class AbstractStateManager implements StateManager {
+final class StateManagerUtil {


This has changed from an abstract class to a static utility class.

vvcephei · 2019-07-03T17:20:06Z

-                    throw new ProcessorStateException(String.format("%sError while deleting the checkpoint file", logPrefix), e);
-                }
+            try {
+                stateMgr.clearCheckpoints();


checking for null is now encapsulated.

vvcephei · 2019-07-03T17:22:13Z

            false,
            stateDirectory,
-            emptyMap(),
+            singletonMap(persistentStoreName, persistentStorePartition.topic()),


This test erroneously didn't include a changelog topic for the store in question. Now that we are actually verifying the checkpoints before we write them, we have to get this right.

Edit: some refactoring I did actually removed this enforcement. I'm working out how to keep it, but it's pretty complicated...

vvcephei · 2019-07-03T17:22:24Z

                false,
                stateDirectory,
-                emptyMap(),
+                singletonMap(persistentStoreName, persistentStoreTopicName),


ditto here.

vvcephei · 2019-07-03T17:23:29Z

    }

    private StreamTask createStatefulTask(final StreamsConfig config, final boolean logged) {
+        final StateStore stateStore = new MockKeyValueStore(storeName, logged);


We're no longer overlooking the fact that this store wasn't "logged" when we write the checkpoints.

vvcephei

@mjsax @cadonna @ableegoldman @abbccdda @guozhangwang @pkleindl ... did I miss anyone?

This PR should fix the long-running KAFKA-5998 bug.

The change I'm proposing is bigger in scope than the actual fix, though, because I wanted to take steps to prevent a similar bug from cropping up in the same code in the future. My theory is that the bug had an easy time hiding in this code because the handling of checkpointable offsets was so complex. I'm hoping that by reducing the mutable scope and also tightening up the invariants around the checkpointable offsets, we will have an easier time maintaining this module.

Let me know what you think!

Thanks,
-John

vvcephei · 2019-07-03T20:59:43Z



-public class ProcessorStateManager extends AbstractStateManager {
+public class ProcessorStateManager implements StateManager {


Also here, no longer sharing mutable state between super and sub classes.

vvcephei · 2019-07-03T22:04:29Z

+    private final File baseDir;
+    private OffsetCheckpoint checkpointFile;
+    private final Map<TopicPartition, Long> checkpointFileCache = new HashMap<>();
+    private final Map<TopicPartition, Long> initialLoadedCheckpoints;


Adding this collection breaks a circular dependency in this class:

the checkpoints we load from disk are potentially not valid for the current topology

we have to load the checkpoints immediately because we have to delete the checkpoint file before processing in the case of EOS

we also need to have read the checkpoint file before registering stores, since it might be needed to create a restorer

we can't know if a checkpoint from the file is valid until after registering stores

In other words, if the prior code wanted to validate the loaded checkpoints, it would have to register the stores before loading checkpoints, but it also needs to load the checkpoints before registering the stores.

We're breaking the cycle here by keeping the loaded checkpoints separate. Now we read the checkpoint file into initialLoadedCheckpoints, which is used to register the stores, and then we are able to make sure that we only ever write valid checkpoints into the checkpointFileCache, which is used to update the checkpoint file later on.

vvcephei · 2019-07-03T22:05:52Z

                recordConverters.put(topic, recordConverter);
            } else {
-                log.trace("Restoring state store {} from changelog topic {} at checkpoint {}", storeName, topic, checkpointableOffsets.get(storePartition));
+                final Long restoreCheckpoint = store.persistent() ? initialLoadedCheckpoints.get(storePartition) : null;


This is where we're using the loaded checkpoint for store registration. Note the missing condition which is now handled... if the store is not persistent, it should not use the loaded checkpoint.
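The two-map scheme described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: partitions are plain strings instead of `TopicPartition`, and the class `TwoPhaseCheckpoints` and its methods are hypothetical. It only shows the ordering: load first, register stores against the loaded offsets, then copy into the cache just the entries that turned out to be valid.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TwoPhaseCheckpoints {
    // Phase 1: everything read from the checkpoint file, valid or not. Never written back as-is.
    private final Map<String, Long> initialLoadedCheckpoints;
    // Phase 2: only entries for registered, persistent stores. This is what would be written later.
    private final Map<String, Long> checkpointFileCache = new HashMap<>();
    private final Set<String> checkpointablePartitions = new HashSet<>();

    TwoPhaseCheckpoints(final Map<String, Long> loadedFromFile) {
        this.initialLoadedCheckpoints = Collections.unmodifiableMap(new HashMap<>(loadedFromFile));
    }

    // Registers a store's changelog partition and returns the offset to restore from, or null.
    Long register(final String changelogPartition, final boolean persistent) {
        if (!persistent) {
            return null; // a non-persistent store must not use a loaded checkpoint
        }
        checkpointablePartitions.add(changelogPartition);
        return initialLoadedCheckpoints.get(changelogPartition);
    }

    // After registration, keep only the loaded entries that belong to checkpointable stores.
    Map<String, Long> validatedCache() {
        for (final Map.Entry<String, Long> entry : initialLoadedCheckpoints.entrySet()) {
            if (checkpointablePartitions.contains(entry.getKey())) {
                checkpointFileCache.put(entry.getKey(), entry.getValue());
            }
        }
        return checkpointFileCache;
    }

    public static void main(final String[] args) {
        final Map<String, Long> loaded = new HashMap<>();
        loaded.put("store-changelog-0", 10L);
        loaded.put("unrelated-changelog-3", 99L); // stale entry from a different topology
        final TwoPhaseCheckpoints manager = new TwoPhaseCheckpoints(loaded);
        System.out.println(manager.register("store-changelog-0", true));
        System.out.println(manager.validatedCache());
    }
}
```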

vvcephei · 2019-07-03T22:06:39Z

+        );
+    }
+
+    void clearCheckpoints() throws IOException {


encapsulating this operation so that outside classes don't have to directly mutate our checkpointFile field.

vvcephei · 2019-07-03T22:07:33Z

+            checkpointFile.delete();
+            checkpointFile = null;
+
+            checkpointFileCache.clear();


We didn't previously clear the cache on the blocks that this method replaces, but after reading the code, I'm pretty sure this is the right thing to do.
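A rough sketch of the encapsulation idea, assuming a holder class that owns both the file handle and the in-memory cache. `CheckpointHolder` and its helper methods are hypothetical, and `java.nio.file.Files` stands in for the `OffsetCheckpoint` type used in the diff. The point is only that the null check, the delete, and the cache reset live in one method instead of being repeated by callers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class CheckpointHolder {
    private Path checkpointFile; // null once the checkpoints have been cleared
    private final Map<String, Long> checkpointFileCache = new HashMap<>();

    CheckpointHolder(final Path checkpointFile) {
        this.checkpointFile = checkpointFile;
    }

    void remember(final String partition, final long offset) {
        checkpointFileCache.put(partition, offset);
    }

    int cachedEntries() {
        return checkpointFileCache.size();
    }

    boolean hasCheckpointFile() {
        return checkpointFile != null;
    }

    // Callers never touch the field directly; repeated calls are harmless.
    void clearCheckpoints() throws IOException {
        if (checkpointFile != null) {
            Files.deleteIfExists(checkpointFile);
            checkpointFile = null;
            checkpointFileCache.clear();
        }
    }

    public static void main(final String[] args) throws IOException {
        final Path file = Files.createTempFile("checkpoint", ".tmp");
        final CheckpointHolder holder = new CheckpointHolder(file);
        holder.remember("store-changelog-0", 10L);
        holder.clearCheckpoints();
        System.out.println(Files.exists(file) + " " + holder.cachedEntries());
    }
}
```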

vvcephei · 2019-07-03T22:09:31Z

                fileOutputStream.getFD().sync();
            }

+            LOG.trace("Swapping tmp checkpoint file {} {}", temp.toPath(), file.toPath());


Having these logs would have demystified a large part of the prior (misdirected) investigation, since we were never sure whether the tmp file existed or not, or what was going on.
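For readers unfamiliar with the pattern being logged, here is a hedged sketch of a write-to-temp-then-swap checkpoint write. It is not the `OffsetCheckpoint` implementation: the class `AtomicCheckpointWrite`, the `.tmp` suffix, and the use of `Files.move` with `ATOMIC_MOVE` are assumptions for illustration. Note that with `ATOMIC_MOVE`, whether an existing target is replaced is implementation specific.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicCheckpointWrite {
    static void write(final Path file, final String contents) throws IOException {
        // Write to a sibling temp file first so readers never see a half-written checkpoint.
        final Path temp = file.resolveSibling(file.getFileName() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(temp.toFile())) {
            out.write(contents.getBytes(StandardCharsets.UTF_8));
            out.flush();
            out.getFD().sync(); // force the bytes to disk before the rename
        }
        // Logging both paths at this point is what makes a failed swap diagnosable.
        System.out.printf("Swapping tmp checkpoint file %s %s%n", temp, file);
        Files.move(temp, file, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(final String[] args) throws IOException {
        final Path dir = Files.createTempDirectory("state");
        write(dir.resolve(".checkpoint"), "store-changelog 0 42\n");
    }
}
```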

vvcephei · 2019-07-03T22:09:53Z

+        return restoredOffsets;
+    }
+
+    void setRestoredOffsets(final Map<TopicPartition, Long> restoredOffsets) {


Added to support some of the needed test changes.

vvcephei · 2019-07-03T22:10:07Z

    public void testChangeLogOffsets() throws IOException {
        final TaskId taskId = new TaskId(0, 0);
-        final long lastCheckpointedOffset = 10L;
+        final long storeTopic1LoadedCheckpoint = 10L;


renamed for clarity.

vvcephei · 2019-07-03T22:11:18Z

    }

+    @Test
+    public void shouldIgnoreIrrlevantLoadedCheckpoints() throws IOException {


Added a bunch of new tests to cover both the bug itself and also previously untested code paths in the ProcessorStateManager.

Note: this particular test shows that we will actually repair all the corrupted checkpoint files that buggy Streams versions wrote.

vvcephei · 2019-07-03T22:12:34Z

+    }
+
+    @Test
+    public void shouldIgnoreIrrelevantRestoredCheckpoints() throws IOException {


Note: this particular test verifies that the bug is fixed.

cadonna

Thanks for the PR @vvcephei, and congratulations to you and @pkleindl for finding and fixing this bug. I left a couple of comments. I had a hard time finding the code that actually fixes the bug (and I am still not sure I found it). Could you please add some specific comments about the fix next time, since this fix is not that trivial? I am also wondering whether you could have divided this PR in two: one for the fix itself and one for the repair of old corrupted checkpoints. IMO, it would have made reviewing the PRs easier.

cadonna · 2019-07-05T12:52:28Z



-public class ProcessorStateManager extends AbstractStateManager {
+public class ProcessorStateManager implements StateManager {


Wouldn't it be more meaningful to rename this class to TaskStateManager?

Maybe, I'm not sure of the historical reason to name it this way.

cadonna · 2019-07-05T14:42:00Z

+    }
+
+    private void updateCheckpointFileCache(final Map<TopicPartition, Long> checkpointableOffsetsFromProcessing) {
+        final Map<TopicPartition, Long> restoredOffsets = validCheckpointableOffsets(changelogReader.restoredOffsets());


This is the most important line to fix the bug, right?

Yes. I'd marked it in an earlier version of this PR. I guess that comment became "outdated" at some point. Sorry about that.
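To make the fix concrete, the filtering step can be sketched as below. This is an illustration under assumptions, not the PR's implementation: partitions are plain strings, and the method body is a guess at what "filtering checkpointable offsets per task" amounts to, namely dropping any restored offset whose partition is not one of the task's own changelog partitions. It assumes the restored-offsets map may contain partitions that belong to other tasks.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CheckpointFilter {
    // Keeps only the offsets whose partition belongs to this task's own changelogs.
    static Map<String, Long> validCheckpointableOffsets(final Map<String, Long> offsets,
                                                        final Set<String> ownChangelogPartitions) {
        final Map<String, Long> result = new HashMap<>();
        for (final Map.Entry<String, Long> entry : offsets.entrySet()) {
            if (ownChangelogPartitions.contains(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        final Map<String, Long> restored = new HashMap<>();
        restored.put("app-store-changelog-0", 100L); // this task's partition
        restored.put("app-store-changelog-1", 200L); // assumed to belong to another task
        final Set<String> own = new HashSet<>();
        own.add("app-store-changelog-0");
        System.out.println(validCheckpointableOffsets(restored, own));
    }
}
```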

cadonna · 2019-07-05T14:45:55Z

                    storePartition,
                    new CompositeRestoreListener(stateRestoreCallback),
-                    checkpointableOffsets.get(storePartition),
+                    restoreCheckpoint,


If the store is not persistent or the read checkpoint file does not contain the partition, this will throw an NPE, right? If yes, you should add unit tests for these cases.

No, I guess you were thinking the Long would become unboxed at this point? It's actually a Long parameter, and the StateRestorer constructor checks for null... Not the cleanest code, I guess, but it looks like it's been this way since 2017.

My fault! I missed the parameter. I looked at the next parameter in the StateRestorer constructor which is a long.
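The distinction discussed in this thread is standard Java behavior and can be shown in isolation. In the sketch below, the class `BoxedOffsetDemo`, both method names, and the `-1` sentinel are hypothetical; only the language rule is certain: passing `null` as a `Long` is fine, and a `NullPointerException` arises only at the point where the value is unboxed to `long`.

```java
final class BoxedOffsetDemo {
    // Takes a nullable boxed Long and checks for null before unboxing.
    static long startingOffset(final Long checkpoint) {
        return checkpoint == null ? -1L : checkpoint; // -1 is an assumed "no checkpoint" sentinel
    }

    // Unboxes unconditionally, which is where a null would blow up.
    static long unboxEagerly(final Long checkpoint) {
        final long primitive = checkpoint; // throws NullPointerException when checkpoint is null
        return primitive;
    }

    public static void main(final String[] args) {
        System.out.println(startingOffset(null));
        System.out.println(startingOffset(7L));
        try {
            unboxEagerly(null);
        } catch (final NullPointerException expected) {
            System.out.println("NPE on unboxing");
        }
    }
}
```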

vvcephei · 2019-07-09T23:23:50Z

Thanks for the review, @cadonna !

I'm sorry that my earlier call-out of the actual bugfix got marked "outdated" at some point, so I guess it doesn't show up in the diff anymore. I didn't notice when that happened, or I would have re-marked it.

Regarding splitting up the PRs, I do agree with you. It would have been nice to get a smaller fix in, and then tackled the refactoring separately.

If I can make one excuse for myself, it would be that in this case, it wasn't clear to me that the fix was good enough because the scope of checkpointableOffsets was too broad. If we just did a bugfix first, it would have been equally hard for reviewers to have confidence in the fix.

In retrospect, though, I could have submitted the refactor first, and then followed up with the bugfix. It just didn't occur to me, for whatever reason.

In any case, thanks for wading through the code review! I think I addressed all your comments.

WDYT?
-John

bbejeck · 2019-07-11T15:52:06Z

retest this please

cadonna

LGTM

bbejeck · 2019-07-12T13:42:32Z

Merged #7030 into trunk

fix checkpoint file warning by filtering checkpointable offsets per task
clean up state manager hierarchy to prevent similar bugs

Reviewers: Bruno Cadonna <bruno@confluent.io>, Bill Bejeck <bbejeck@gmail.com>

bbejeck · 2019-07-12T14:25:30Z

cherry-picked to 2.3 and 2.2

Tin-Nguyen · 2019-07-16T10:26:53Z

@bbejeck does this mean that the fix is available in the 2.2.x version?

bbejeck · 2019-07-16T14:31:55Z

@Tin-Nguyen yes, if you check out the 2.2 branch and build it. Right now I don't know whether there will be another 2.2 release.

Tin-Nguyen · 2019-07-17T14:13:21Z

thanks @bbejeck

Tin-Nguyen · 2019-07-17T18:03:53Z

@bbejeck I'm wondering if there is an updated binary download that includes the fix?

guozhangwang · 2019-07-17T23:17:56Z

-        this.eosEnabled = eosEnabled;
-        this.checkpoint = new OffsetCheckpoint(new File(baseDir, CHECKPOINT_FILE_NAME));
-    }
+    private StateManagerUtil() {}


Do we need this constructor explicitly? Wouldn't Java just provide a default one?
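As general Java background for this question (not a statement of the author's reasoning): when a class declares no constructor, the compiler supplies a default one with the same access as the class, so a package-private utility class could still be instantiated from within its package. Declaring a private constructor prevents that. The sketch below checks this with reflection; all class names are hypothetical.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Modifier;

// No constructor declared: the compiler generates a package-private default constructor.
final class WithDefaultConstructor {
    static int helper() {
        return 1;
    }
}

// Explicit private constructor: the class cannot be instantiated from outside.
final class WithPrivateConstructor {
    private WithPrivateConstructor() {}

    static int helper() {
        return 1;
    }
}

final class ConstructorVisibility {
    static boolean hasPrivateConstructor(final Class<?> type) throws NoSuchMethodException {
        final Constructor<?> ctor = type.getDeclaredConstructor();
        return Modifier.isPrivate(ctor.getModifiers());
    }

    public static void main(final String[] args) throws NoSuchMethodException {
        System.out.println(hasPrivateConstructor(WithDefaultConstructor.class));
        System.out.println(hasPrivateConstructor(WithPrivateConstructor.class));
    }
}
```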

guozhangwang · 2019-07-17T23:18:30Z

-            storeToChangelogTopic,
-            partitions,
-            processorContext);
+        StateManagerUtil.reinitializeStateStoresForPartitions(log,


I liked this refactoring a lot, thanks @vvcephei !


KAFKA-5998: fix checkpointableOffsets handling · bf92d92

vvcephei mentioned this pull request Jul 3, 2019 · KAFKA-5998: fix checkpointableOffsets handling #7027 (Closed, 3 tasks)

vvcephei commented Jul 3, 2019

avoid using potentially invalid checkpoints · ea9bf41

vvcephei commented Jul 3, 2019

cadonna reviewed Jul 5, 2019

cr comments · 7b33b85

cadonna approved these changes Jul 11, 2019

bbejeck approved these changes Jul 12, 2019

bbejeck added the streams label Jul 12, 2019

bbejeck merged commit 53b4ce5 into apache:trunk Jul 12, 2019

guozhangwang reviewed Jul 17, 2019


