KAFKA-9579: Fetch implementation for records in the remote storage through a specific purgatory (#13535)
Conversation
Force-pushed from ca16e79 to 9ba10cf
showuon left a comment:
Thanks for the PR. Left some comments.
Force-pushed from 9acff1f to fb613c8
    RecordBatch firstBatch = findFirstBatch(remoteLogInputStream, offset);

    if (firstBatch == null)
        return new FetchDataInfo(new LogOffsetMetadata(offset), MemoryRecords.EMPTY, false,

I think we need to log something in this case.
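To illustrate the suggestion, here is a hedged, simplified sketch (the `Batch` record and `read` helper are stand-ins, not Kafka's `RecordBatch`/`FetchDataInfo` API): when no batch at or after the target offset exists in the remote segment, the code returns empty records, and a log line would make that silent case observable.

```java
import java.util.Iterator;
import java.util.List;

public class RemoteReadSketch {
    // Stand-in for a record batch with its offset range.
    record Batch(long baseOffset, long lastOffset) {}

    // Mirrors the idea of findFirstBatch: skip batches entirely below the target offset.
    static Batch findFirstBatch(Iterator<Batch> batches, long offset) {
        while (batches.hasNext()) {
            Batch b = batches.next();
            if (b.lastOffset() >= offset) return b;
        }
        return null;
    }

    static String read(List<Batch> segment, long offset) {
        Batch first = findFirstBatch(segment.iterator(), offset);
        if (first == null) {
            // The reviewer's point: surface this case instead of silently
            // returning empty records.
            System.out.println("Could not find a batch at or after offset " + offset
                    + " in the remote segment; returning empty records");
            return "EMPTY";
        }
        return "batch@" + first.baseOffset();
    }

    public static void main(String[] args) {
        List<Batch> segment = List.of(new Batch(0, 9), new Batch(10, 19));
        System.out.println(read(segment, 5));   // batch@0
        System.out.println(read(segment, 25));  // EMPTY (with a log line)
    }
}
```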
Force-pushed from fb613c8 to c2873f5
    // The 1st topic-partition that has to be read from remote storage
    var remoteFetchInfo: Optional[RemoteStorageFetchInfo] = Optional.empty()

I understand a new PR will come to overcome this, but could we provide further context (in the source code or the PR) about the implications of using the first topic-partition only?

Agreed - there are consumption patterns which diverge from the local case with this approach (that is, uneven progress across the partitions consumed from a topic, even when said partitions are of the same nature w.r.t. record batch size and overall size). It may be preferable not to diverge from the local approach and to read from all the remote partitions found in the fetchInfos. Then, a different read pattern which provides greater performance for a specific operational environment and workload could be enforced via a configuration property.

As I already called out in the PR description, this will be followed up with another PR. We will describe the config options with their respective scenarios there. The default will be to fetch from multiple partitions, as is done with local log segments.
    // may arrive and hence make this operation completable.
    delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)

    if (remoteFetchInfo.isPresent) {

In line 1082, we should further test !remoteFetchInfo.isPresent, right?

I am not sure line 1082 still points at what you meant, since the file could have been updated. Please clarify.

In the following code, we should go into that branch only if remoteFetchInfo is empty, right? Otherwise, we could get into a situation where a remote partition is never served because the fetch request is always satisfied with new local data on other partitions.

    if (params.maxWaitMs <= 0 || fetchInfos.isEmpty || bytesReadable >= params.minBytes || errorReadingData ||
        hasDivergingEpoch || hasPreferredReadReplica) {

Do you mean to say that we should not return immediately if remoteFetchInfo exists, because that fetch should be served; otherwise remote fetches may starve as long as there is enough local data immediately available to be sent? So the condition becomes:

    if (!remoteFetchInfo.isPresent && (params.maxWaitMs <= 0 || fetchInfos.isEmpty
        || bytesReadable >= params.minBytes || errorReadingData || hasDivergingEpoch
        || hasPreferredReadReplica))

Sure, that check was missed while pulling in the changes. Good catch. Updated it with the latest commit.
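The starvation concern above can be captured in a small, hedged sketch (the method and parameter names are illustrative stand-ins, not ReplicaManager's actual signature): a request with a pending remote fetch must not be completed immediately from local data, even when enough local bytes are readable.

```java
public class FetchCompletionSketch {
    // Simplified form of the completion check discussed in the thread: the
    // immediate-response conditions only apply when no remote fetch is pending.
    static boolean completeImmediately(boolean remoteFetchPresent,
                                       long maxWaitMs,
                                       boolean fetchInfosEmpty,
                                       int bytesReadable,
                                       int minBytes,
                                       boolean errorReadingData,
                                       boolean hasDivergingEpoch,
                                       boolean hasPreferredReadReplica) {
        return !remoteFetchPresent
                && (maxWaitMs <= 0 || fetchInfosEmpty || bytesReadable >= minBytes
                    || errorReadingData || hasDivergingEpoch || hasPreferredReadReplica);
    }

    public static void main(String[] args) {
        // Enough local bytes, but a remote fetch is pending: do NOT respond yet,
        // or the remote partition could be starved forever by busy local partitions.
        System.out.println(completeImmediately(true, 500, false, 2048, 1, false, false, false));  // false
        // No remote fetch and enough local bytes: respond immediately.
        System.out.println(completeImmediately(false, 500, false, 2048, 1, false, false, false)); // true
    }
}
```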
    InputStream remoteSegInputStream = null;
    try {
        // Search forward for the position of the last offset that is greater than or equal to the target offset
        remoteSegInputStream = remoteLogStorageManager.fetchLogSegment(rlsMetadata.get(), startPos);

Would it be possible to send the endOffset as well? Without it, the input stream will potentially contain the whole log and not be consumed till the end. In the case of S3, when the input stream is not consumed till the end, the HTTP connection is aborted.

We will look into it in a followup PR.
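A hedged sketch of the idea behind the suggestion (the helper and its parameters are hypothetical, not the RemoteStorageManager API): when the caller knows how many bytes it needs, it can bound how much of the stream is consumed instead of pulling, or abandoning, the rest of a whole-segment stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class BoundedSegmentRead {
    // Read at most `length` bytes from the stream, handling short reads.
    static byte[] readRange(InputStream in, int length) {
        byte[] buf = new byte[length];
        int read = 0;
        try {
            while (read < length) {
                int n = in.read(buf, read, length - read);
                if (n < 0) break; // segment ended before the requested range
                read += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(buf, read);
    }

    public static void main(String[] args) {
        // Stand-in for a remote segment stream that could span the whole log.
        byte[] segment = new byte[1024];
        InputStream in = new ByteArrayInputStream(segment);
        // Only the first 16 bytes are needed; the remaining bytes of the
        // stream are never pulled.
        byte[] range = readRange(in, 16);
        System.out.println(range.length); // 16
    }
}
```

Passing an endOffset down to fetchLogSegment, as suggested, would let the storage plugin request only that byte range in the first place (e.g. an HTTP Range request), avoiding the aborted-connection cost entirely.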
Force-pushed from b9c6ef8 to 666fd8d
    }

    if (searchInLocalLog) {
        txnIndexOpt = (localLogSegments.hasNext()) ? Optional.of(localLogSegments.next().txnIndex()) : Optional.empty();

Right, it can have duplicates. But the consumer already handles duplicate aborted transactions. Updated the code to remove duplicates in case any consumer implementation cannot handle duplicate aborted transactions.
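The de-duplication described can be as simple as an order-preserving set, sketched here with a hypothetical stand-in for the aborted-transaction entry (not Kafka's actual AbortedTxn class): entries collected from overlapping local and remote txn index files may repeat, so collapse them before building the response.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class AbortedTxnDedup {
    // Stand-in for an aborted-transaction index entry; records get
    // equals/hashCode for free, which the set relies on.
    record AbortedTxn(long producerId, long firstOffset) {}

    // LinkedHashSet drops duplicates while preserving collection order.
    static List<AbortedTxn> dedup(List<AbortedTxn> txns) {
        return new ArrayList<>(new LinkedHashSet<>(txns));
    }

    public static void main(String[] args) {
        List<AbortedTxn> collected = List.of(
                new AbortedTxn(1L, 100L),
                new AbortedTxn(1L, 100L), // same entry seen in two index files
                new AbortedTxn(2L, 250L));
        System.out.println(dedup(collected).size()); // 2
    }
}
```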
Force-pushed from 4a8f67f to b8a3c83
@Hangleton @junrao @jeqo, any other comments on this PR? We hope we can merge it in the early stage of a release, so that we have enough time to test its stability and make further improvements. Thanks.
divijvaidya left a comment:
Overall this looks good to me. One major comment about correctly shutting down the delayed fetch thread pool; otherwise it looks good.
Is this RejectedExecutionException propagated to the consumer fetch? If yes, is this a change in the existing interface with the consumer? (Please correct me if I am wrong, but I am not aware of the consumer handling or expecting RejectedExecutionException today.)

This error is propagated as an unexpected error (UnknownServerException) to the consumer client, and it is already handled.

Thank you. That answers my question.

Should we add a log if the nature of the error is not propagated?
    // The 1st topic-partition that has to be read from remote storage
    var remoteFetchInfo: Optional[RemoteStorageFetchInfo] = Optional.empty()

nit: instead of NOT_AVAILABLE, maybe the message could report that the log start offset is strictly greater than the fetch offset?
divijvaidya left a comment:
Thank you for addressing the previous comments, Satish. I have some additional ones about how we are handling shutdown.
How did we decide on 2 min. here? I don't think we should block shutdown of the broker on this, because there are other limits associated with clean vs unclean shutdown. If we do plan to block, we should tie it to the overall shutdown timeout. As an example, clean shutdown is expected to complete in 5 min.; see lifecycleManager.controlledShutdownFuture.get(5L, TimeUnit.MINUTES) in BrokerServer.scala.

It does not require that to be completed in 5 mins. lifecycleManager.controlledShutdownFuture is more about processing the controlled shutdown event on the controller for that broker. It will wait for 5 mins before proceeding with the other sequence of actions, but that will not be affected by the code introduced here.
The logging subsystem handles unclean shutdown for log segments, and it will have already finished before RemoteLogManager is closed, so it will not be affected by this timeout either. But we can use a short duration here, like 10 secs, and revisit introducing a config if it is really needed for closing the remote log subsystem.
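As a hedged sketch of the shutdown pattern being discussed (the pool name and the 10-second timeout are illustrative, not the actual RemoteLogManager values): bound how long broker shutdown waits for in-flight remote-fetch tasks, then interrupt stragglers instead of blocking indefinitely.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RemoteFetchPoolShutdown {
    static boolean closeQuietly(ExecutorService pool, long timeoutSec) {
        pool.shutdown(); // stop accepting new fetch tasks
        try {
            if (!pool.awaitTermination(timeoutSec, TimeUnit.SECONDS)) {
                // Timeout elapsed: interrupt remaining tasks so broker
                // shutdown is not blocked indefinitely.
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return pool.isShutdown();
    }

    public static void main(String[] args) {
        ExecutorService remoteFetchPool = Executors.newFixedThreadPool(2);
        remoteFetchPool.submit(() -> { /* in-flight remote read */ });
        System.out.println(closeQuietly(remoteFetchPool, 10)); // true
    }
}
```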
divijvaidya left a comment:
This change looks good to me! (assuming tests will be merged in a separate PR)
@junrao We are not sure whether those failures are related to this change. They do not fail on the laptop or other hosts. We are looking into those failures.
…ate the test failure in Jenkins.
It is not because of the changes in this PR. #10389 attempted to stabilize this test, but it can still fail if the machine is slow.
junrao left a comment:
@kamalcph: Thanks for the investigation. The PR LGTM. Just a couple of minor comments. Also, should we reopen https://issues.apache.org/jira/browse/KAFKA-12384 since it's still flaky?
Thanks @junrao for the updated review. Addressed your latest minor review comments.

Thanks @junrao for the latest comments; addressed them with the latest commit.

@satishd

@dajac It passed locally on my laptop.

@satishd Weird... It fails all the time on my laptop.
    quota: ReplicaQuota,
    responseCallback: Seq[(TopicIdPartition, FetchPartitionData)] => Unit
    ): Unit = {
    def fetchMessages(params: FetchParams,

A small comment: we should avoid completely changing the code style without reason. The format of the method was not a mistake; it is the format that we mainly use in this class nowadays.
    private def handleOffsetOutOfRangeError(tp: TopicIdPartition, params: FetchParams, fetchInfo: PartitionData,
                                            adjustedMaxBytes: Int, minOneMessage: Boolean, log: UnifiedLog,
                                            fetchTimeMs: Long, exception: OffsetOutOfRangeException): LogReadResult = {

We usually don't format methods like this. Could we put one argument per line?
    val fetchDataInfo =
        new FetchDataInfo(new LogOffsetMetadata(offset), MemoryRecords.EMPTY, false, Optional.empty(),

nit: new FetchDataInfo should be on the previous line or indented.
This PR includes:
We have an extended version of remote fetch that can fetch from multiple remote partitions in parallel, which we will raise as a followup PR.
A few tests for the newly introduced changes are added in this PR. There are some tests available for these scenarios in 2.8.x; we will refactor them against the trunk changes and add them in followup PRs.
Other contributors:
kamal.chandraprakash@gmail.com - Further improvements and adding a few tests
showuon@gmail.com - Added a few test cases for these changes.
PS: This functionality is pulled out of internal branches along with other functionality related to the feature in 2.8.x. The reason for not pulling in all the changes is that it would make the PR huge and burdensome to review, and it also needs other metrics, minor enhancements (including perf), and minor changes done for tests. So we will try to have followup PRs to cover all of those.
Committer Checklist (excluded from commit message)