[KAFKA-8522] Streamline tombstone and transaction marker removal #7884
ConcurrencyPractitioner wants to merge 55 commits into apache:trunk from ConcurrencyPractitioner:KAFKA-8522
Conversation
@junrao This PR is ready for review. :)
Retest this please.
junrao left a comment
@ConcurrencyPractitioner : Thanks for the new patch. Overall, the logic still seems a bit over-complicated to me. Left a few more comments below.
```diff
 private def cleanFilthiestLog(): Boolean = {
   val preCleanStats = new PreCleanStats()
-  val cleaned = cleanerManager.grabFilthiestCompactedLog(time, preCleanStats) match {
+  val ltc = cleanerManager.grabFilthiestCompactedLog(time, preCleanStats)
```
ltc => logToClean? Also, do we need to use another local val since ltc is only used once?
```scala
 * @param retainDeletesAndTxnMarkers Should tombstones and markers be retained while cleaning this segment
 * @param maxLogMessageSize The maximum message size of the corresponding topic
 * @param stats Collector for cleaning statistics
 * @param tombstoneRetentionMs Defines how long a tombstone should be kept as defined by log configuration
```
We should make clear the difference between retainDeletesAndTxnMarkers and tombstoneRetentionMs. Also, it's probably better to put them as adjacent params.
```diff
 // note that we will never delete a marker until all the records from that transaction are removed.
-discardBatchRecords = shouldDiscardBatch(batch, transactionMetadata, retainTxnMarkers = retainDeletesAndTxnMarkers)
+val canDiscardBatch = shouldDiscardBatch(batch, transactionMetadata, retainTxnMarkers = retainDeletesAndTxnMarkers)
+isControlBatchEmpty = canDiscardBatch
```
Hmm, isControlBatchEmpty is a bit misleading since batch is not always a control batch.
```diff
-override def shouldRetainRecord(batch: RecordBatch, record: Record): Boolean = {
+override def checkBatchRetention(batch: RecordBatch): BatchRetention = checkBatchRetention(batch, batch.deleteHorizonMs())
+override def shouldRetainRecord(batch: RecordBatch, record: Record, newDeleteHorizonMs: Long): Boolean = {
```
It's probably better to have the logic to determine if deleteHorizonMs should be set here instead of MemoryRecords since it's log cleaner specific logic. I was thinking that we could extend checkBatchRetention() to return (Boolean, shouldSetHorizon).
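A rough sketch of this suggestion follows. The `RetentionResult` type, the flattened `checkBatchRetention` signature, and the parameter names are all hypothetical stand-ins, not Kafka's actual `RecordFilter` API:

```java
// Illustrative sketch only: a checkBatchRetention() that returns both the
// retention decision and whether the delete horizon should be stamped on the
// batch, keeping this log-cleaner-specific decision out of MemoryRecords.
class BatchRetentionSketch {
    enum BatchRetention { DELETE, RETAIN_EMPTY, DELETE_EMPTY }

    // Hypothetical pair-like result; Kafka's real RecordFilter API differs.
    static final class RetentionResult {
        final BatchRetention retention;
        final boolean shouldSetDeleteHorizon;
        RetentionResult(BatchRetention retention, boolean shouldSetDeleteHorizon) {
            this.retention = retention;
            this.shouldSetDeleteHorizon = shouldSetDeleteHorizon;
        }
    }

    // A v2+ batch that still carries tombstones or a marker, but has no delete
    // horizon stamped yet, should have one set on this cleaning pass.
    static RetentionResult checkBatchRetention(byte magic, boolean containsTombstonesOrMarker,
                                               boolean deleteHorizonSet) {
        boolean shouldSet = magic >= 2 && containsTombstonesOrMarker && !deleteHorizonSet;
        return new RetentionResult(BatchRetention.RETAIN_EMPTY, shouldSet);
    }

    public static void main(String[] args) {
        System.out.println(checkBatchRetention((byte) 2, true, false).shouldSetDeleteHorizon); // true
        System.out.println(checkBatchRetention((byte) 1, true, false).shouldSetDeleteHorizon); // false
    }
}
```

The point of the tuple-style return is that the filter makes one decision per batch, and the caller never has to re-derive the horizon logic.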
```scala
var shouldRetainDeletes = true
if (isLatestVersion)
  shouldRetainDeletes = (batch.deleteHorizonSet() && currentTime < batch.deleteHorizonMs()) ||
    (!batch.deleteHorizonSet() && currentTime < newBatchDeleteHorizonMs)
```
Hmm, if deleteHorizonSet is not set, we shouldn't be deleting the tombstone. So, not sure what newBatchDeleteHorizonMs is intended for.
Oh, this is used as a means to help the tests in LogCleanerTest.scala pass. LogCleanerTest usually wants the tombstones removed in a single pass (but that pass is usually used for setting the delete horizon ms, which means that without doing the above, we would be unable to remove tombstones). Therefore, by adding the newBatchDeleteHorizonMs argument (which is passed in by MemoryRecords), whenever LogCleaner cleans a log with the current time set to Long.MAX_VALUE, we are able to remove the tombstones / control records in one pass.
```java
}
final BatchIterationResult iterationResult = iterateOverBatch(batch, decompressionBufferSupplier, filterResult, filter,
    batchMagic, writeOriginalBatch, maxOffset, retainedRecords,
    containsTombstonesOrMarker, deleteHorizonMs);
```
Hmm, why are we passing in containsTombstonesOrMarker, which is always false?
Oh, I can remove that.
```java
long deleteHorizonMs = filter.retrieveDeleteHorizon(batch);
final BatchRetention batchRetention;
if (!batch.deleteHorizonSet())
    batchRetention = filter.checkBatchRetention(batch, deleteHorizonMs);
```
Since deleteHorizonMs can be obtained from batch, it's not clear why we need to pass that in as a param.
Oh, see the comment above. This delete horizon is used for the case where we want to remove the tombstones in a single pass. On the first iteration of the log cleaner, we are unable to remove the tombstone because no delete horizon has been set yet. Therefore, when we compute the delete horizon, we need to pass it back into checkBatchRetention so that tombstones can be removed in one iteration.
On second thought, I think we don't need to add an extra parameter to the checkBatchRetention method. Such logic would only need to be restricted to LogCleaner, i.e. we store the delete horizon in another variable in the RecordFilter we implemented in LogCleaner.
Hmm, I am still not sure why we need to remove a tombstone in one pass. If a tombstone's delete horizon is not set, it can't be removed in this round of cleaning.
Alright, acknowledged. I think that's a good point.
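The two-pass behavior agreed on here can be illustrated with a minimal sketch. All names and the simplified horizon arithmetic below are assumptions for illustration, not Kafka's actual LogCleaner code:

```java
// Illustrative sketch: the first cleaning pass over a segment only stamps
// deleteHorizonMs on batches that contain tombstones or markers; removal
// happens on a later pass, once the current time has moved past the horizon.
class TwoPassTombstoneSketch {
    static final long NO_TIMESTAMP = -1L;

    // Pass 1: compute the horizon to stamp on the batch.
    static long stampDeleteHorizon(long currentTime, long tombstoneRetentionMs) {
        return currentTime + tombstoneRetentionMs;
    }

    // Pass 2: a tombstone is removable only if a horizon was set and has elapsed.
    static boolean canRemoveTombstone(long deleteHorizonMs, long currentTime) {
        return deleteHorizonMs != NO_TIMESTAMP && currentTime >= deleteHorizonMs;
    }

    public static void main(String[] args) {
        long horizon = stampDeleteHorizon(1_000L, 500L);              // first pass stamps 1500
        System.out.println(canRemoveTombstone(NO_TIMESTAMP, 1_000L)); // false: not stamped yet
        System.out.println(canRemoveTombstone(horizon, 1_200L));      // false: horizon not reached
        System.out.println(canRemoveTombstone(horizon, 1_600L));      // true: later pass may remove
    }
}
```

This is why a tombstone whose horizon is unset can never be dropped in the same round that sets it.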
```scala
if (cleanableLogs.isEmpty) {
  None
  // in this case, we are probably in a low throughput situation
  // therefore, we should take advantage of this fact and remove tombstones if we can
```
I am not sure about this. A round of cleaning can be expensive since we need to read in all existing cleaned segments. That's why by default, we only trigger a round of cleaning if the dirty portion of the log is as large as the cleaned portion. Not sure if it's worth doing cleaning more aggressively just to remove the tombstone. So, perhaps we can leave it outside of this PR for now.
@junrao I did some thinking about this. The integration test I added does not pass without this part, because in logs with tombstones there is the possibility that, without further throughput, the cleanable logs will always be empty. Therefore, as I mentioned in the comment, since we are in a low throughput situation, LogCleaner's workload is relatively light anyway. In that case, we can clean tombstones since we don't have much else to do.
There is a way to figure out whether the log cleaner has a heavy workload or not. If cleanable logs have remained empty for a long period of time (past a set threshold), then we can safely say that the log cleaner thread isn't busy, since there are no logs to clean. After that threshold has passed, we can start processing logs with tombstones and removing them.
This should help us know exactly when we can go back and remove tombstones.
Perhaps, we can keep track of the largest deleteHorizonMs in the cleaned portion. We can then trigger a round of cleaning when the current time has passed the largest deleteHorizonMs.
Yeah, I found that this approach probably is a lot better.
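A minimal sketch of this trigger condition, assuming a hypothetical latestDeleteHorizonMs tracked for the cleaned portion of each log (names are illustrative, not Kafka's API):

```java
// Illustrative sketch of the alternative proposed above: rather than cleaning
// aggressively whenever cleanableLogs is empty, track the largest
// deleteHorizonMs seen in the cleaned portion, and make the log eligible for
// another round of cleaning once the current time passes it.
class DeleteHorizonTriggerSketch {
    static final long NO_TIMESTAMP = -1L;

    static boolean shouldTriggerClean(long latestDeleteHorizonMs, long currentTime) {
        return latestDeleteHorizonMs != NO_TIMESTAMP && currentTime > latestDeleteHorizonMs;
    }

    public static void main(String[] args) {
        System.out.println(shouldTriggerClean(NO_TIMESTAMP, 2_000L)); // false: no stamped tombstones
        System.out.println(shouldTriggerClean(1_500L, 1_000L));       // false: horizon not yet passed
        System.out.println(shouldTriggerClean(1_500L, 2_000L));       // true: tombstones now removable
    }
}
```

The appeal of this design is that a round of cleaning is triggered only when it is guaranteed to be able to remove something, instead of polling on low throughput.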
Hi @junrao Thanks for the comments you left! Overall, I managed to simplify the code somewhat and removed a couple of methods that were probably not necessary. Notably, there are not as many calls involving
junrao left a comment
@ConcurrencyPractitioner : Thanks for addressing the comments. A few more comments below.
```scala
 * @param tombstoneRetentionMs How long we should retain the tombstones whose version is greater than or equal to 2
 * @param maxLogMessageSize The maximum message size of the corresponding topic
 * @param stats Collector for cleaning statistics
 * @param tombstoneRetentionMs Defines how long a tombstone should be kept as defined by log configuration
```
tombstoneRetentionMs is duplicated in the javadoc.
```diff
 lastRecordsOfActiveProducers: Map[Long, LastRecord],
-stats: CleanerStats): Unit = {
+stats: CleanerStats,
+currentTime: Long = RecordBatch.NO_TIMESTAMP): Boolean = {
```
Could we add currentTime to the javadoc?
```java
/**
 * Checks if the control batch (if it is one) can be removed (making sure that it is empty)
 */
protected boolean isControlBatchEmpty(RecordBatch recordBatch) {
```
Yeah, will get rid of that.
```scala
if (cleanableLogs.isEmpty) {
  None
  // in this case, we are probably in a low throughput situation
  // therefore, we should take advantage of this fact and remove tombstones if we can
```
Perhaps, we can keep track of the largest deleteHorizonMs in the cleaned portion. We can then trigger a round of cleaning when the current time has passed the largest deleteHorizonMs.
```java
long deleteHorizonMs = filter.retrieveDeleteHorizon(batch);
final BatchRetention batchRetention;
if (!batch.deleteHorizonSet())
    batchRetention = filter.checkBatchRetention(batch, deleteHorizonMs);
```
Hmm, I am still not sure why we need to remove a tombstone in one pass. If a tombstone's delete horizon is not set, it can't be removed in this round of cleaning.
@junrao I've mostly resolved your comments. I'm working on how we could trigger a call for a clean when the latest delete horizon has been passed. Other than that, feel free to add anything else. :)
@junrao Do you want to take another look?
@junrao pinging.
@ConcurrencyPractitioner : Thanks for the updated PR. Will take another look.
junrao left a comment
@ConcurrencyPractitioner : Thanks for the updated PR. Made another pass of the non-testing files. A few comments below.
```scala
 * @param retainDeletesAndTxnMarkers Should tombstones (lower than version 2) and markers be retained while cleaning this segment
 * @param maxLogMessageSize The maximum message size of the corresponding topic
 * @param stats Collector for cleaning statistics
 * @param tombstoneRetentionMs Defines how long a tombstone should be kept as defined by log configuration
```
Could we move this up to below retainDeletesAndTxnMarkers?
```java
        writeOriginalBatch = false;
    }
}
return new BatchIterationResult(writeOriginalBatch, containsTombstonesOrMarker, maxOffset);
```
It's probably better to rename writeOriginalBatch here to something like recordsFiltered, since we combine other information to determine writeOriginalBatch later on.
```java
// if the batch does not contain tombstones, then we don't need to overwrite batch
boolean canControlBatchBeRemoved = batch.isControlBatch() && deleteHorizonMs > RecordBatch.NO_TIMESTAMP;
if (writeOriginalBatch && (deleteHorizonMs == RecordBatch.NO_TIMESTAMP || deleteHorizonMs == batch.deleteHorizonMs()
        || (!containsTombstonesOrMarker && !canControlBatchBeRemoved))) {
```
It seems that the logic can be simplified a bit. It seems that we can do this branch if writeOriginalBatch is true and needToSetDeleteHorizon is false (needToSetDeleteHorizon = (batch magic >= V2 && containsTombstonesOrMarker && batch's deleteHorizon not set)).
Oh, sure, that's fine. But we also still need to account for the control batch and check whether or not it is empty yet.
For a control batch, it's only removed at the batch level. So, if the batch can be deleted at the batch level, we won't get in here. If the batch can't be deleted at the batch level, the record within the batch will always be retained.
@junrao Is this always the case? If I remember correctly, in the KIP, control batches whose transactions contain only tombstones will be persisted in the logs for a set period of time, i.e. we need to remove the tombstones first before the control batches can be deleted. Therefore, I think it is very much possible that we need to check isControlBatchEmpty here.
@ConcurrencyPractitioner : A control batch has only a single marker record (either a commit or abort). When all records before the control batch are removed, we set the deleteHorizon for the control batch. When the time passes the deleteHorizon, the control batch is removed. A control batch never contains a tombstone.
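The simplified branch condition proposed earlier in this thread could be sketched as follows; needToSetDeleteHorizon, noRecordsFiltered, and the other names are hypothetical stand-ins for the actual variables in MemoryRecords.filterTo():

```java
// Illustrative sketch: the original batch may be written back unmodified only
// when no record was filtered out AND we do not need to stamp a delete horizon
// on the batch (i.e. it is v2+, contains tombstones or a marker, and has no
// horizon set yet).
class WriteOriginalBatchSketch {
    static boolean needToSetDeleteHorizon(byte magic, boolean containsTombstonesOrMarker,
                                          boolean deleteHorizonSet) {
        return magic >= 2 && containsTombstonesOrMarker && !deleteHorizonSet;
    }

    static boolean canWriteOriginalBatch(boolean noRecordsFiltered, byte magic,
                                         boolean containsTombstonesOrMarker, boolean deleteHorizonSet) {
        return noRecordsFiltered && !needToSetDeleteHorizon(magic, containsTombstonesOrMarker, deleteHorizonSet);
    }

    public static void main(String[] args) {
        // Batch with a tombstone but no horizon yet: must be rewritten to stamp it.
        System.out.println(canWriteOriginalBatch(true, (byte) 2, true, false)); // false
        // Horizon already stamped and nothing filtered: copy the batch as-is.
        System.out.println(canWriteOriginalBatch(true, (byte) 2, true, true));  // true
    }
}
```

Note that, per the explanation above, a control batch never reaches the record-level path: it is deleted or retained at the batch level, so no isControlBatchEmpty check is needed here.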
```scala
if (batch.deleteHorizonSet()) {
  if (batch.deleteHorizonMs() > latestDeleteHorizon) {
    latestDeleteHorizon = batch.deleteHorizonMs()
```
This may not be the best place to track latestDeleteHorizon. Perhaps we can return the largest deleteHorizon in MemoryRecords.filterTo() and keep track of latestDeleteHorizon in the while loop in line 713. If we do that, I am not sure if we need retrieveDeleteHorizon() since MemoryRecords.filterTo() can obtain whether deleteHorizon is set from the batch and calculate the new deleteHorizon if needed.
Well, I think there are multiple problems we might need to think about:

- We don't know what the current time is, since MemoryRecords doesn't have access to a Time instance.
- For control batches, retrieveDeleteHorizon serves a critical function: we call controlBatch.onTransactionRead there to determine if we can set a delete horizon for our batch.

In summation, I think that there are multiple dependencies (located in LogCleaner) which must be called from MemoryRecords#filterTo. It would be more of a hassle, I think, if we needed to figure out how to call all these methods from filterTo as well.
Good point on #2. My concern is that the batch could be filtered after retrieveDeleteHorizon() is called. Then, the latestDeleteHorizon maintained here won't be very accurate.
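A sketch of computing the latest delete horizon only over batches that are actually retained, per the concern above; the Batch type and all names here are illustrative, not Kafka's MemoryRecords API:

```java
// Illustrative sketch: if latestDeleteHorizon is tracked where
// retrieveDeleteHorizon() is called, a batch that is later filtered out can
// still inflate it. Computing the maximum only over retained batches (e.g. as
// a return value of filterTo()) keeps the value accurate.
class LatestHorizonSketch {
    static final long NO_TIMESTAMP = -1L;

    // Hypothetical minimal batch: its stamped horizon and whether it survived filtering.
    static final class Batch {
        final long deleteHorizonMs;
        final boolean retained;
        Batch(long deleteHorizonMs, boolean retained) {
            this.deleteHorizonMs = deleteHorizonMs;
            this.retained = retained;
        }
    }

    static long maxRetainedDeleteHorizon(Batch[] batches) {
        long latest = NO_TIMESTAMP;
        for (Batch b : batches)
            if (b.retained && b.deleteHorizonMs > latest)
                latest = b.deleteHorizonMs;
        return latest;
    }

    public static void main(String[] args) {
        Batch[] batches = {
            new Batch(2_000L, false),      // filtered out: must not count
            new Batch(1_500L, true),
            new Batch(NO_TIMESTAMP, true), // horizon never stamped
        };
        System.out.println(maxRetainedDeleteHorizon(batches)); // 1500
    }
}
```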
```scala
 */
val latestOffsetForKey = record.offset() >= foundOffset
val isRetainedValue = record.hasValue || retainDeletes
val isLatestVersion = batch.magic() >= RecordBatch.MAGIC_VALUE_V2
```
isLatestVersion => supportDeleteHorizon?
```scala
// therefore, we should take advantage of this fact and remove tombstones if we can
// under the condition that the log's latest delete horizon is less than the current time
// tracked
val logsContainingTombstones = logs.filter {
```
Could we put the common logic into a shared method to avoid duplicating most of the code below?
```java
        producerEpoch, baseSequence, isTransactional, isControlRecord, false, partitionLeaderEpoch, 0);
}

public static void writeEmptyHeader(ByteBuffer buffer,
```

```java
        producerEpoch, baseSequence, isTransactional, isControlRecord, isDeleteHorizonSet, partitionLeaderEpoch, 0);
}

static void writeHeader(ByteBuffer buffer,
```
@hachikuji All comments addressed. See if there is anything else that we might need to account for.
Pinging @hachikuji.
@hachikuji Pinging for review
@ConcurrencyPractitioner : We now have https://github.com/apache/kafka/blob/trunk/.asf.yaml. You can add yourself to Jenkins's whitelist by following https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-JenkinsPRWhitelisting
ok to test

ok to test
@junrao Cool. It's just that should I edit the
@ConcurrencyPractitioner : You can just submit a separate PR to add yourself in .asf.yaml.
@junrao Alright, got it done.
ok to test

test this please
@junrao I don't think the .asf.yaml worked. Tried to trigger a few test rounds, but Jenkins didn't respond.
@ConcurrencyPractitioner Could you try "retest this please"? If it still doesn't work, you can file an Apache infra jira for help.
@junrao Did try on another PR. Looks like it didn't work. I will file a JIRA.
Reported in JIRA here: https://issues.apache.org/jira/browse/INFRA-20182
@hachikuji Do you have time to review? Just give me a heads-up if there are some comments left unaddressed.
Hi @ConcurrencyPractitioner . What is the status of this PR? We are also experiencing https://issues.apache.org/jira/browse/KAFKA-8522 . Thanks!
@wushujames : This PR is mostly ready. It's just waiting for another committer more familiar with the transactional logic to take another look. @ConcurrencyPractitioner : Would you be able to rebase this PR? Thanks.
@ConcurrencyPractitioner @junrao this PR has been stale since April 2020. When will it be ready to merge? We are hitting this issue, and it causes insanely long startup times in our applications as they need to read all the tombstones that are not being removed.
@akamensky @wushujames @junrao Migrating to a new PR. You can find it here: #9915.
The objective of this PR is to prevent tombstones from persisting in logs under low throughput conditions.