
KAFKA-12520: Ensure log loading does not truncate producer state unless required #10388

Closed
dhruvilshah3 wants to merge 4 commits into apache:trunk from dhruvilshah3:producer-state

Conversation

@dhruvilshah3
Contributor

When we find a .swap file on startup, we typically want to rename and replace it as .log, .index, .timeindex, etc. as a way to complete any ongoing replace operations. These swap files are usually known to have been flushed to disk before the replace operation begins.

One flaw in the current logic is that we recover these swap files on startup and, as part of that, end up truncating the producer state and rebuilding it from scratch. This is unneeded because the replace operation does not mutate the producer state by itself; it is only meant to replace the .log file along with the corresponding indices. Because of this unneeded producer state rebuild, we have seen multi-hour startup times for clusters that have large compacted topics.

This patch fixes the issue by sanity-checking all records in the segment to be swapped and rebuilding the corresponding indices without mutating the producer state. Similarly, we also rebuild indices without truncating the producer state when we find a missing or corrupted index in the middle of the log.

The patch also adds an extra sanity check to detect invalid bytes at the end of swap segments. Before this patch, we would truncate invalid bytes from the swap segment which could leave us with holes in the log. Because this is an unexpected scenario, we now raise an exception in such cases which will fail the broker on startup.
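To make the trailing-bytes check concrete, here is a minimal, hypothetical sketch (not the actual Kafka `LogSegment` code; `SwapSegmentCheck`, `Batch`, and the byte counts are illustrative stand-ins):

```scala
// Hypothetical sketch: walk the batches of a swap segment, count the bytes
// covered by valid batches, and treat any leftover bytes at the end of the
// file as corruption that should fail startup rather than be truncated.
object SwapSegmentCheck {
  // Simplified stand-in for a record batch read from the segment file.
  final case class Batch(baseOffset: Long, sizeInBytes: Int, crcOk: Boolean)

  // Number of bytes at the end of the segment that are not part of a valid
  // batch (0 means the segment is fully valid).
  def invalidTrailingBytes(segmentSizeInBytes: Int, batches: Seq[Batch]): Int = {
    val validBytes = batches.takeWhile(_.crcOk).map(_.sizeInBytes).sum
    segmentSizeInBytes - validBytes
  }

  def validateOrThrow(segmentSizeInBytes: Int, batches: Seq[Batch]): Unit = {
    val leftover = invalidTrailingBytes(segmentSizeInBytes, batches)
    if (leftover > 0)
      // Including the byte count makes the failure actionable from the logs.
      throw new IllegalStateException(
        s"Found $leftover invalid bytes at the end of the swap segment")
  }
}
```

In this model, a segment whose size exactly equals the sum of its valid batch sizes passes, while any extra bytes raise an exception instead of being silently truncated, which is the behavior change the patch describes.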

Contributor

@junrao junrao left a comment


@dhruvilshah3 : Thanks for the PR. A couple of comments below.

Comment thread: core/src/main/scala/kafka/log/Log.scala (Outdated)
"recovering segment and rebuilding index files...")
recoverSegment(segment)
if (segment.validateSegmentAndRebuildIndices() > 0)
throw new KafkaStorageException(s"Found invalid or corrupted messages in segment ${segment.log.file}")
Contributor


Perhaps we could report the number of invalid bytes in the exception? Ditto below and in completeSwapOperations().

Contributor Author


Done.

*/
@nonthreadsafe
def recover(producerStateManager: ProducerStateManager, leaderEpochCache: Option[LeaderEpochFileCache] = None): Int = {
def validateSegmentAndRebuildIndices(batchCallbackOpt: Option[FileChannelRecordBatch => Unit] = None) : Int = {
Contributor


It seems this method needs the logic to trim the indexes at the end?

Contributor Author


Good catch, I added that.

Comment thread: core/src/main/scala/kafka/log/Log.scala (Outdated)
error(s"Could not find offset index file corresponding to log file ${segment.log.file.getAbsolutePath}, " +
"recovering segment and rebuilding index files...")
recoverSegment(segment)
if (segment.validateSegmentAndRebuildIndices() > 0)
Contributor


Another thing is that it's possible for a segment after recovery point to have no index file and also be corrupted. In that case, we want to truncate the data instead of failing with an error.

Contributor Author


Makes sense. I reworked the logic to handle unflushed files the right way.

def validateSegmentAndRebuildIndices(batchCallbackOpt: Option[FileChannelRecordBatch => Unit] = None) : Int = {
offsetIndex.reset()
timeIndex.reset()
txnIndex.reset()
Contributor Author


There is another problem here in that we are not rebuilding the transaction index. The current logic seems pretty tied up with producer state maintenance. I will try to see if there's a way to separate it out.

@dhruvilshah3
Contributor Author

Closing this PR as it's being taken forward in #10763.

@dhruvilshah3 dhruvilshah3 deleted the producer-state branch June 9, 2021 23:41
junrao pushed a commit that referenced this pull request Jun 29, 2021
…ss required (#10763)

When we find a .swap file on startup, we typically want to rename and replace it as .log, .index, .timeindex, etc. as a way to complete any ongoing replace operations. These swap files are usually known to have been flushed to disk before the replace operation begins.

One flaw in the current logic is that we recover these swap files on startup and, as part of that, end up truncating the producer state and rebuilding it from scratch. This is unneeded because the replace operation does not mutate the producer state by itself; it is only meant to replace the .log file along with the corresponding indices. Because of this unneeded producer state rebuild, we have seen multi-hour startup times for clusters that have large compacted topics.

This patch fixes the issue. With ext4 in ordered mode, file metadata operations are persisted in order, regardless of whether the shutdown was clean or unclean. As a result, we rework the recovery workflow as follows.

1. If there are any .cleaned files, we delete all .swap files with higher or equal offsets due to KAFKA-6264. We also delete the .cleaned files. If there is no .cleaned file, we do nothing for this step.
2. If there are any .log.swap files left after step 1, they, together with their index files, must have been renamed from .cleaned and are complete (renaming from .cleaned to .swap happens in reverse offset order). We rename these .log.swap files and their corresponding index files to regular files, while deleting the original files from compaction or segment split if they haven't been deleted.
3. Do log splitting for legacy log segments with offset overflow (KAFKA-6264).
4. If there are any other index swap files left, they must come from a partial renaming from .swap files to regular files. We can simply rename them to regular files.

Credit: some code is copied from @dhruvilshah3's PR #10388.
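The delete-or-rename decisions above (steps 1, 2, and 4; the offset-overflow splitting of step 3 is omitted) can be sketched as a pure planning function. This is a hypothetical simplification, assuming file names of the form `<zero-padded base offset>.<suffix>`; `SwapCleanupPlan` and its suffix handling are illustrative, not the actual `LogLoader` code:

```scala
// Hypothetical sketch of the reworked startup cleanup: decide, per file,
// whether it should be deleted, renamed to a regular file, or kept as-is.
// Assumes names like "00000000000000000100.log.swap" (offset prefix first).
object SwapCleanupPlan {
  sealed trait Action
  case object Delete extends Action
  case object RenameToRegular extends Action
  case object Keep extends Action

  private def baseOffset(name: String): Long = name.takeWhile(_.isDigit).toLong

  def plan(files: Seq[String]): Map[String, Action] = {
    // Step 1: any .cleaned file means a swap was not fully prepared, so all
    // .cleaned files and all .swap files at or above the lowest .cleaned
    // offset are deleted (the KAFKA-6264 interaction described above).
    val cleanedOffsets = files.filter(_.endsWith(".cleaned")).map(baseOffset)
    val minCleaned = cleanedOffsets.reduceOption(_ min _).getOrElse(Long.MaxValue)
    files.map { f =>
      val action =
        if (f.endsWith(".cleaned")) Delete
        else if (f.endsWith(".swap") && baseOffset(f) >= minCleaned) Delete
        // Steps 2/4: surviving swap files are complete; rename to regular.
        else if (f.endsWith(".swap")) RenameToRegular
        else Keep
      f -> action
    }.toMap
  }
}
```

For example, given a complete `.log.swap` at offset 100 and a `.cleaned`/`.log.swap` pair at offset 200, the plan renames the former and deletes the latter two, while regular `.log` files are left untouched.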

Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Jun Rao <junrao@gmail.com>
xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
…ss required (apache#10763)
