
KAFKA-13603: Allow the empty active segment to have missing offset index during recovery#11345

Merged
junrao merged 33 commits into apache:trunk from ccding:last on Jan 27, 2022

Conversation

@ccding (Contributor) commented Sep 20, 2021

Within a LogSegment, the TimeIndex and OffsetIndex are lazy indices that don't get created on disk until they are accessed for the first time. However, Log recovery logic expects the presence of an offset index file on disk for each segment, otherwise, the segment is considered corrupted.

This PR introduces a forceFlushActiveSegment boolean for the log.flush function to allow the shutdown process to flush the empty active segment, which makes sure the offset index file exists.
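The intended flush-range semantics can be sketched as follows (a minimal model with illustrative names, not the actual UnifiedLog/LocalLog code):

```java
// Minimal model of the flush-offset arithmetic discussed in this PR
// (illustrative names; the real logic lives in UnifiedLog/LocalLog).
// Segments with base offset in [recoveryPoint, upperBound) are flushed.
// Passing forceFlushActiveSegment=true during shutdown extends the upper
// bound past logEndOffset, so an empty active segment (whose base offset
// equals logEndOffset) is flushed too, creating its offset index file.
class FlushRangeModel {
    final long logEndOffset;
    final long recoveryPoint;

    FlushRangeModel(long logEndOffset, long recoveryPoint) {
        this.logEndOffset = logEndOffset;
        this.recoveryPoint = recoveryPoint;
    }

    long upperBound(boolean forceFlushActiveSegment) {
        return forceFlushActiveSegment ? logEndOffset + 1 : logEndOffset;
    }

    boolean segmentFlushed(long segmentBaseOffset, boolean forceFlushActiveSegment) {
        return segmentBaseOffset >= recoveryPoint
                && segmentBaseOffset < upperBound(forceFlushActiveSegment);
    }
}
```

Under this model, an empty active segment at base offset logEndOffset is skipped by a regular flush but included when forceFlushActiveSegment is set during shutdown.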

Co-Author: Kowshik Prakasam kowshik@gmail.com

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ccding (Contributor, Author) commented Sep 20, 2021

cc @kowshik

@ijuma (Member) commented Sep 22, 2021

@ccding the PR does have unit tests though, so we should update that part of the PR message.

@ccding (Contributor, Author) commented Sep 22, 2021

Updated the Test section

@kowshik (Contributor) left a comment

@ccding Thanks for the PR. LGTM. Just a minor comment: could we please update the PR description to use kowshik@gmail.com as my email address?

Within a LogSegment, the TimeIndex and OffsetIndex are lazy indices that don't get created on disk until they are accessed for the first time. However, Log recovery logic expects the presence of offset index file on disk for each segment, otherwise the segment is considered corrupted.

Author:    Kowshik Prakasam <kowshik@gmail.com>
@ccding (Contributor, Author) commented Sep 23, 2021

@kowshik Thanks for the code review. Updated your email in the PR description as well as in the commit message.

@ccding (Contributor, Author) commented Sep 24, 2021

The failed tests are unrelated to this change and passed in my local run.

 Build / JDK 8 and Scala 2.12 / org.apache.kafka.connect.mirror.integration.IdentityReplicationIntegrationTest.testOneWayReplicationWithAutoOffsetSync()	4.7 sec	1
 Build / JDK 8 and Scala 2.12 / org.apache.kafka.connect.integration.InternalTopicsIntegrationTest.testCreateInternalTopicsWithFewerReplicasThanBrokers	56 sec	1

@ccding (Contributor, Author) commented Sep 27, 2021

ping @junrao @ijuma for code review

@junrao (Contributor) left a comment

@ccding : Thanks for the PR. Left one comment below.

if (lazyOffsetIndex.file.exists) {
def sanityCheck(timeIndexFileNewlyCreated: Boolean, isActiveSegment: Boolean): Unit = {
// We allow for absence of offset index file only for an empty active segment.
if ((isActiveSegment && size == 0) || lazyOffsetIndex.file.exists) {
Contributor:

I am wondering why the active segment will be missing the offset index file during a clean shutdown. When we load the segments during broker restart, we call resizeIndexes() on the last segment. This should trigger the creation of the offset index file, which will be flushed on broker shutdown.

Contributor Author:

When we load the segments during broker restart, we call resizeIndexes() on the last segment. This should trigger the creation of the offset index file, which will be flushed on broker shutdown.

The sanityCheck is called before resizeIndexes.

It appears you are talking about the case where we start the broker and then immediately shut it down. In between, the active segment may have changed, and if the new one is empty, no index file is created.

Contributor:

I am still trying to understand whether the missing index is the result of a clean shutdown or a hard shutdown. When we roll a segment, the index on the new active segment is created lazily. However, during a clean shutdown, we force flush the active segment, which should trigger the creation of an empty index file, because the following method is used in segment flush.

def offsetIndex: OffsetIndex = lazyOffsetIndex.get
On a hard shutdown, it's possible for the offset index to be missing. However, in that case, the offset index can be missing even when the log is not empty. So, I am wondering how common the issue we are fixing is.
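The lazy-creation behavior at play here can be sketched as follows (a generic illustration of the pattern, not Kafka's actual LazyIndex implementation; names are illustrative):

```java
import java.util.function.Supplier;

// Generic sketch of the lazy-index pattern: the expensive resource (here,
// the index file) is only materialized on first get(). A flush path that
// calls get() creates the file as a side effect; a path that never touches
// the index leaves nothing on disk.
class LazyResource<T> {
    private final Supplier<T> loader; // creates the backing file as a side effect
    private T value;                  // null until first access

    LazyResource(Supplier<T> loader) {
        this.loader = loader;
    }

    boolean materialized() {
        return value != null;
    }

    T get() {
        if (value == null)
            value = loader.get(); // lazy creation happens here
        return value;
    }
}
```

This is why the flush path matters: `offsetIndex` (i.e. `lazyOffsetIndex.get`) forces the file into existence, while a code path that never accesses the index leaves it absent.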

Contributor:

@junrao: When the UnifiedLog is flushed during clean shutdown, we flush the LocalLog up to the logEndOffset. Here, an empty active segment is not included in the list of candidate segments to be flushed. The reason is that during LocalLog.flush(), the LogSegments.values(recoveryPoint, logEndOffset) call here does not select the empty active segment (doc), because the logEndOffset would match the base offset of the empty active segment and thus it gets omitted. So, if the empty active segment's offset index was never created before a clean shutdown, then the offset index will not be created during the clean shutdown either, because the empty active segment is never flushed.

The above is shown in the following passing unit test:

@Test
def testFlushEmptyActiveSegmentDoesNotCreateOffsetIndex(): Unit = {
    // Create an empty log.
    val logConfig = LogTestUtils.createLogConfig(segmentBytes = 1024 * 1024)
    val log = createLog(logDir, logConfig)
    val oneRecord = TestUtils.records(List(
      new SimpleRecord(mockTime.milliseconds, "a".getBytes, "value".getBytes)
    ))

    // Append a record and flush. Verify that there exists only 1 segment.
    log.appendAsLeader(oneRecord, leaderEpoch = 0)
    assertEquals(1, log.logEndOffset)
    log.flush()
    assertEquals(1, log.logSegments.size)
    assertTrue(UnifiedLog.logFile(logDir, 0).exists())
    assertTrue(UnifiedLog.offsetIndexFile(logDir, 0).exists())
    assertFalse(UnifiedLog.logFile(logDir, 1).exists())
    assertFalse(UnifiedLog.offsetIndexFile(logDir, 1).exists())

    // Roll the log and verify that the new active segment's offset index is missing.
    log.roll()
    assertEquals(2, log.logSegments.size)
    assertTrue(UnifiedLog.logFile(logDir, 0).exists())
    assertTrue(UnifiedLog.offsetIndexFile(logDir, 0).exists())
    assertTrue(UnifiedLog.logFile(logDir, 1).exists())
    assertFalse(UnifiedLog.offsetIndexFile(logDir, 1).exists())

    // Flush the log and once again verify that the active segment's offset index is still missing.
    log.flush()
    assertTrue(UnifiedLog.logFile(logDir, 0).exists())
    assertTrue(UnifiedLog.offsetIndexFile(logDir, 0).exists())
    assertTrue(UnifiedLog.logFile(logDir, 1).exists())
    assertFalse(UnifiedLog.offsetIndexFile(logDir, 1).exists())

    // Close the log and verify that the active segment's offset index is still missing.
    log.close()
    assertTrue(UnifiedLog.logFile(logDir, 0).exists())
    assertTrue(UnifiedLog.offsetIndexFile(logDir, 0).exists())
    assertTrue(UnifiedLog.logFile(logDir, 1).exists())
    assertFalse(UnifiedLog.offsetIndexFile(logDir, 1).exists())
}

This PR mainly fixes a logging issue in the code. For example, one situation where the issue happens more frequently is the following: imagine a topic with very low ingress traffic in some or all partitions, and a retention setting that causes all existing segments to expire and get removed. In such a case, we roll the log to create a new active segment, which ensures there is at least one segment remaining in the LocalLog when the retention loop completes. However, we don't create the offset index for the active segment until the first append operation. Now, if the Kafka cluster is rolled before the first append, we will see a spurious corruption error message during recovery.

This PR fixes the logging problem by ignoring the absence of offset index for an empty active segment during recovery.
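The relaxed recovery-time condition described above can be summarized as follows (a sketch with illustrative names; the actual change lives in LogSegment.sanityCheck):

```java
// Sketch of the relaxed recovery check: a missing offset index file is
// tolerated only for an empty active segment; any other segment without
// an offset index file is still treated as corrupted and recovered.
class RecoveryCheck {
    static boolean passes(boolean isActiveSegment, long segmentSizeBytes,
                          boolean offsetIndexFileExists) {
        return (isActiveSegment && segmentSizeBytes == 0) || offsetIndexFileExists;
    }
}
```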

Contributor:

@kowshik : Thanks for the great explanation. It makes sense to me now.

If we don't flush the only empty log segment during clean shutdown, we could lose the log segment file as well, which causes the replica to lose track of logEndOffset. I am wondering if we should force flush the empty log segment during clean shutdown too.

Contributor Author:

Will def flush(): Unit = flush(logEndOffset + 1) trigger flushing empty active segments every time we roll a segment, not only during shutdown?

Contributor:

@ccding When we roll a segment, we explicitly flush only the old segment. See this LOC.

Contributor:

@kowshik : #8346 tries to avoid opening the index during close() when the index is not opened yet. This applies to existing segments on broker restart. For active segment, we typically need to open the index anyway. So, we probably don't need to optimize the rare case when it's empty. Plus, the danger of losing logEndOffset is a bigger concern than avoiding the cost of opening one index file.

Contributor:

@junrao Sounds good to me. We can flush the active segment during clean shutdown. That's a very elegant way to handle this problem.

@ccding Would you like to update the PR with the approach proposed by @junrao?

Contributor Author:

Updated. Please let me know if I misunderstood anything.

@junrao (Contributor) left a comment

@ccding : Thanks for the updated PR. One more comment below.

* Flush all local log segments
*/
def flush(): Unit = flush(logEndOffset)
def flush(): Unit = flush(logEndOffset + 1)
Contributor:

Could we add a comment why we need to flush to logEndOffset + 1? Also, could we have a test that verifies the index file is present after an empty segment is rolled and the broker is shut down?

Contributor Author:

Thanks for the comment. Will write it later. Was too busy yesterday and didn't have time to do so.

@kowshik (Contributor) left a comment

@ccding: Thanks for the updated PR. Just one comment below.

* Flush all local log segments
*/
def flush(): Unit = flush(logEndOffset)
def flush(): Unit = flush(logEndOffset + 1)
Contributor:

Can we add unit test coverage for this change in UnifiedLogTest.scala?

@ccding (Contributor, Author) commented Oct 4, 2021

While I am working on the test, I have a question:

The log.flush() function is also called by the periodic flush task and at

if (localLog.unflushedMessages >= config.flushInterval) flush()

These calls are not during a shutdown. Will they also flush empty active segments and hurt performance?

cc @junrao @kowshik

@kowshik (Contributor) commented Oct 5, 2021

@ccding During a call to log.flush(), we record here the offset up to which the log was flushed. So, a subsequent flush() does not cause additional disk I/O, due to this check, unless the logEndOffset has advanced. This means that empty active segments shouldn't add additional burden to the log.flush() operation during each call, unless new empty active segments are generated in between two calls, but that's quite uncommon (see relevant comment).
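The no-op behavior described above can be sketched as follows (illustrative names, not the actual UnifiedLog fields):

```java
// Sketch of the recovery-point short-circuit: a flush performs disk I/O
// only when the requested offset is past the last recorded recovery point,
// so repeated flush() calls without new appends are cheap no-ops.
class RecoveryPointGate {
    private long recoveryPoint = 0L;
    private int physicalFlushes = 0; // counts actual disk flushes, for illustration

    void flush(long offset) {
        if (offset > recoveryPoint) { // the check that makes repeat flushes no-ops
            physicalFlushes++;        // stand-in for flushing segments to disk
            recoveryPoint = offset;
        }
    }

    int physicalFlushes() { return physicalFlushes; }
    long recoveryPoint() { return recoveryPoint; }
}
```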

Furthermore, we typically don't configure the log to be flushed periodically or during appends. Even then, when we call log.flush() without passing an offset parameter, the intent was always to flush all data written to the log. It is just a corner case that "all data written to the log" needs to also include an empty active segment, for correctness reasons, since the logEndOffset is derived from it during recovery. Some related documentation details below:

(1) Periodic flush:

if (timeSinceLastFlush >= log.config.flushMs)
  log.flush()

Here, log.flush() executes only when timeSinceLastFlush >= log.config.flushMs. The flush.ms configuration is documented here and has a default value of Long.MaxValue (very high!) defined here. The doc recommends not overriding this config unless needed.

(2) Flush during appends:

if (localLog.unflushedMessages >= config.flushInterval) flush()

Here, flush() executes only when unflushedMessages >= config.flushInterval, with a similar explanation to the above. The flush.messages configuration is documented here and has a default value of Long.MaxValue (very high!) defined here. The doc recommends not overriding this config unless needed.

@kowshik (Contributor) commented Oct 21, 2021

@ccding Are you planning to add a unit test for this PR, and address the recent review comments?

@ccding (Contributor, Author) commented Oct 21, 2021

@kowshik I will find some time to work on it

@ccding (Contributor, Author) commented Oct 27, 2021

@junrao @kowshik I addressed the above comments. PTAL

@ccding (Contributor, Author) commented Oct 27, 2021

The change failed the recovery point check at

verifyRecoveredLog(log, lastOffset)

I am not sure what to do. Please advise.

@junrao (Contributor) commented Nov 1, 2021

@ccding : Thanks for reporting the test failure. This brings up a good point. It's kind of weird to ever have recovery point > log end offset. I am thinking that another potential way to fix this is for the flush() call in close() from recovery point to log end offset to be inclusive on both ends. This way, a flush of (log end offset, log end offset) will force a flush since it's needed for flushing the metadata of log end offset. All other flushes will still be exclusive on the right end since they don't need to preserve the metadata on the right end.

@junrao (Contributor) left a comment

@ccding : Thanks for the updated PR. A couple more comments.


/**
* Flush all local log segments
* We have to pass logEngOffset + 1 to the `def flush(offset: Long): Unit` function to flush empty
Contributor:

This comment is in the wrong place.

def close(): Unit = {
debug("Closing log")
lock synchronized {
flush(logEndOffset + 1)
Contributor:

This seems to have the same problem that the log recovery point could be moved to logEndOffset + 1, which is a bit weird?

Contributor Author:

Actually, this doesn't work. I am trying to figure out a proper solution.

@ccding (Contributor, Author) commented Dec 22, 2021

@kowshik @junrao @hachikuji I have addressed the review comments; please take a look. Thanks!

@ccding (Contributor, Author) commented Jan 6, 2022

ping @kowshik @junrao @hachikuji

@junrao (Contributor) left a comment

@ccding : Thanks for the updated PR. A few more comments.


override def flush(): Unit = {
log.flush()
override def flush(inclusive: Boolean): Unit = {
Contributor:

Should inclusive be renamed to forceFlushActiveSegment?


/**
* Flush local log segments for all offsets up to offset-1 if includingOffset=false; up to offset
* if includingOffset=true. The recovery point is set to offset-1.
Contributor:

The comment is inaccurate. The recovery point is always offset.

* Flush local log segments for all offsets up to offset-1 if includingOffset=false; up to offset
* if includingOffset=true. The recovery point is set to offset-1.
*
* @param offset The offset to flush up to (non-inclusive); the new recovery point
Contributor:

Should we get rid of "(non-inclusive)"?

private def flush(offset: Long, includingOffset: Boolean): Unit = {
val flushOffset = if (includingOffset) offset + 1 else offset
val newRecoveryPoint = offset
maybeHandleIOException(s"Error while flushing log for $topicPartition in dir ${dir.getParent} with offset $flushOffset and recovery point $newRecoveryPoint") {
Contributor:

Instead of $flushOffset, perhaps it's clearer to use "$offset(ex/inclusive)"? Ditto for the debug logging below.

case _: NoSuchFileException =>
error(s"${params.logIdentifier}Could not find offset index file corresponding to log file" +
s" ${segment.log.file.getAbsolutePath}, recovering segment and rebuilding index files...")
if (segment.baseOffset < params.recoveryPointCheckpoint)
Contributor:

This condition is correct if hadCleanShutdown is false.

If hadCleanShutdown is true, it seems the condition should be segment.baseOffset <= params.recoveryPointCheckpoint. Or maybe we should just always log the error if hadCleanShutdown is true.

Contributor Author:

I think if hadCleanShutdown is true, it should never throw NoSuchFileException unless there is a bug in the code.

Added the hadCleanShutdown check anyways to catch potential issues: if (params.hadCleanShutdown || segment.baseOffset < params.recoveryPointCheckpoint)

* @param inclusive Whether the flush includes the log end offset. Should be `true` during close; otherwise false.
*/
void flush();
void flush(boolean inclusive);
Contributor:

We use forceFlushActiveSegment in UnifiedLog.flush(). Should we be consistent with the name?

assertEquals(lastOffset, log.recoveryPoint, s"Unexpected recovery point")
assertEquals(numMessages, log.logEndOffset, s"Should have $numMessages messages when log is reopened w/o recovery")
assertEquals(0, log.activeSegment.timeIndex.entries, "Should have same number of time index entries as before.")
log.activeSegment.sanityCheck(true) // this should not throw
Contributor:

Could we add a comment why this check won't throw after re-instantiating the log?

assertThrows(classOf[NoSuchFileException], () => log.activeSegment.sanityCheck(true))
var lastOffset = log.logEndOffset

log = createLog(logDir, logConfig, recoveryPoint = lastOffset, lastShutdownClean = false)
Contributor:

Should we call log.closeHandlers() before assigning a new value to log? Otherwise, it seems that we are leaking file handles.

@junrao (Contributor) left a comment

@ccding : Thanks for the updated PR. Just a couple more minor comments.

+ { if (includingOffset) "inclusive" else "exclusive" }
+ s") and recovery point $newRecoveryPoint") {
if (flushOffset > localLog.recoveryPoint) {
debug(s"Flushing log up to offset $flushOffset with recovery point $newRecoveryPoint, last flushed: $lastFlushTime, current time: ${time.milliseconds()}, " +
Contributor:

Instead of $flushOffset, could we change to "$offset(ex/inclusive)"?

/**
* Flush the current log to disk.
*
* @param forceFlushActiveSegment Whether the flush includes the log end offset. Should be `true` during close; otherwise false.
Contributor:

Whether the flush includes the log end offset => Whether to force flush the active segment?

@junrao (Contributor) left a comment

@ccding : Thanks for the updated PR. LGTM

@hachikuji : Any other comments from you?

@junrao (Contributor) commented Jan 19, 2022

@ccding : The PR is kind of large now. Could you associate the PR with a jira? Thanks.

@ccding ccding changed the title Allow empty last segment to have missing offset index during recovery KAFKA-13603: Allow empty active segment to have missing offset index during recovery Jan 20, 2022
@ccding ccding changed the title KAFKA-13603: Allow empty active segment to have missing offset index during recovery KAFKA-13603: Allow the empty active segment to have missing offset index during recovery Jan 20, 2022
@kowshik (Contributor) left a comment

@ccding Thanks for the PR. LGTM. Just a small comment below.

s") and recovery point $newRecoveryPoint") {
if (flushOffset > localLog.recoveryPoint) {
debug(s"Flushing log up to offset (" +
{ if (includingOffset) "inclusive" else "exclusive" } +
Contributor:

The clause { if (includingOffset) "inclusive" else "exclusive" } is redundant. It can be extracted into a separate variable, or eliminated entirely; instead you could just print the value of includingOffset.

@junrao (Contributor) commented Jan 26, 2022

@ccding : There is a conflict now. Could you rebase?

@junrao (Contributor) left a comment

@ccding : Thanks for rebasing. One more comment below. Also, do you know why all tests are failing?

val flushOffset = if (includingOffset) offset + 1 else offset
val newRecoveryPoint = offset
maybeHandleIOException(s"Error while flushing log for $topicPartition in dir ${dir.getParent} with offset=$offset, " +
s"includingOffset=$includingOffset, newRecoveryPoint=$newRecoveryPoint") {
Contributor:

I think the logging that you had before the last commit is more intuitive.

@kowshik (Contributor) left a comment

Thanks for the updated PR. I have a small comment below.

if (flushOffset > localLog.recoveryPoint) {
debug(s"Flushing log up to offset=$offset, includingOffset=$includingOffset, " +
s"newRecoveryPoint=$newRecoveryPoint, last flushed: $lastFlushTime, current time: ${time.milliseconds()}, " +
debug(s"Flushing log up to offset ($includingOffsetStr)" +
Contributor:

I think perhaps you meant to use offset=$offset?

@junrao merged commit a21aec8 into apache:trunk on Jan 27, 2022
@junrao (Contributor) commented Jan 27, 2022

@ccding : Thanks for the latest PR. LGTM. Merged to trunk.
