KAFKA-9039: Optimize ReplicaFetcher fetch path #7443
guozhangwang merged 12 commits into apache:trunk
Conversation
We already have the partition fetch data in the fetch session, so a copy is not required.
Avoid the expensive isReplicaInSync check by keeping the lag alongside PartitionFetchState. Reduce the cost of updating the lag map by not unnecessarily wrapping keys in ClientIdTopicPartition in FetcherStats.
According to profiling, a majority of the rest of the time goes to looking up the logStartOffset for each partition on each pass (https://github.com/apache/kafka/pull/7443/files#diff-a8437f241a5ae585d5805ee769313080R252), and updating the incremental fetch session (https://github.com/apache/kafka/pull/7443/files#diff-a8437f241a5ae585d5805ee769313080R253). If we can drop those further, we may be able to save another 50-75%.
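To make that hot path concrete, here is a minimal, hypothetical sketch (stand-in types, not the actual AbstractFetcherThread code) of the per-partition work that remains on every doWork() pass: one logStartOffset lookup and one incremental-session update per partition.

```scala
// Hypothetical sketch of the remaining per-partition cost in the fetch build
// path; TopicPartition/PartitionFetchState/PartitionData here are stand-ins.
case class TopicPartition(topic: String, partition: Int)
case class PartitionFetchState(fetchOffset: Long)
case class PartitionData(fetchOffset: Long, logStartOffset: Long, maxBytes: Int)

def buildFetch(states: Map[TopicPartition, PartitionFetchState],
               logStartOffset: TopicPartition => Long,
               addToSession: (TopicPartition, PartitionData) => Unit): Unit =
  states.foreach { case (tp, state) =>
    val startOffset = logStartOffset(tp)  // cost 1: per-partition offset lookup on every pass
    addToSession(tp, PartitionData(state.fetchOffset, startOffset, 1048576)) // cost 2: session update
  }
```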
  fetcherLagMetrics.lag <= 0
  else
  false
+ stats.getAndMaybePut(topicPartition)
Wrapping each put in a ClientIdTopicPartition served no purpose here, as metricId is fixed for every instantiation of FetcherLagStats. This adds additional object creation/GC overhead, as well as additional hashing overhead.
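As an illustration only (the Pool and Yammer-metrics plumbing of the real FetcherLagStats is simplified away), keying the lag map directly by TopicPartition removes a per-update allocation and hash of a wrapper key:

```scala
import java.util.concurrent.ConcurrentHashMap

// Stand-in types for this sketch; the real code uses
// org.apache.kafka.common.TopicPartition and gauge-backed lag metrics.
case class TopicPartition(topic: String, partition: Int)
class FetcherLagMetrics { @volatile var lag: Long = 0L }

class FetcherLagStatsSketch {
  private val stats = new ConcurrentHashMap[TopicPartition, FetcherLagMetrics]()

  // Before (roughly): stats.getAndMaybePut(new ClientIdTopicPartition(metricId, tp))
  // After: the map is keyed by TopicPartition, so no wrapper is allocated or hashed.
  def getAndMaybePut(tp: TopicPartition): FetcherLagMetrics =
    stats.computeIfAbsent(tp, _ => new FetcherLagMetrics)

  def isReplicaInSync(tp: TopicPartition): Boolean = {
    val metrics = stats.get(tp)
    metrics != null && metrics.lag <= 0
  }
}
```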
- val fetchState = fetchStates(topicPartition)
- if (fetchState.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
+ val fetchPartitionData = sessionPartitions.get(topicPartition)
+ if (fetchPartitionData != null && fetchPartitionData.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
This could be cleaned up a little.
  val partitionsWithError = mutable.Set[TopicPartition]()
- val builder = fetchSessionHandler.newBuilder()
+ val builder = fetchSessionHandler.newBuilder(partitionMap.size, false)
Use the new builder which presizes the PartitionData hashmap, and which does not make another copy of sessionPartitions. We will not be modifying sessionPartitions, or using it again after we have processed the fetch request, so we do not need to make an unnecessary copy.
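A hedged sketch of what the two-argument builder buys (the real FetchSessionHandler.Builder is Java and its build/merge logic is omitted here): the next map never rehashes while partitions are added, and callers that will only read sessionPartitions afterwards can skip the defensive copy.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}

// Illustrative sketch only; K/V stand in for TopicPartition/PartitionData.
class FetchSessionSketch[K, V] {
  private val sessionPartitions = new JLinkedHashMap[K, V]()

  class Builder(expectedPartitions: Int, copySessionPartitions: Boolean) {
    // Presized, so adding one entry per partition on every pass never rehashes.
    private val next = new JLinkedHashMap[K, V](expectedPartitions)
    def add(k: K, v: V): Unit = next.put(k, v)

    // The replica fetcher passes copySessionPartitions = false: it only reads
    // the session map after the fetch, so sharing the reference is enough.
    def partitionsToUse: JLinkedHashMap[K, V] =
      if (copySessionPartitions) new JLinkedHashMap[K, V](sessionPartitions)
      else sessionPartitions
  }

  // Old call sites used a no-argument newBuilder(); the fetcher now passes
  // partitionMap.size and false, as in the diff above.
  def newBuilder(expectedPartitions: Int, copySessionPartitions: Boolean): Builder =
    new Builder(expectedPartitions, copySessionPartitions)
}
```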
  .toForget(fetchData.toForget)
  .metadata(fetchData.metadata)
- Some(requestBuilder)
+ Some(ReplicaFetch(fetchData.sessionPartitions(), requestBuilder))
By passing through sessionPartitions we can avoid making another copy of the PartitionMap in https://github.com/apache/kafka/pull/7443/files#diff-2d03a5d4349a23e56c09b4e773055249L119. sessionPartitions contains all of the PartitionData that we need to process the fetch response.
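Sketched roughly (stand-in types; the real PartitionData is FetchRequest.PartitionData), the change makes buildFetch hand back the session's partition map alongside the request builder, so the response-processing step can read per-partition fetch offsets from it without another copy:

```scala
import java.util.{Map => JMap}

// Stand-in types for this sketch.
case class TopicPartition(topic: String, partition: Int)
case class PartitionData(fetchOffset: Long, logStartOffset: Long)
class FetchRequestBuilder

// Before (roughly): buildFetch returned Some(requestBuilder) only.
// After: the session partitions ride along, so processFetchRequest can do
// partitionData.get(tp) instead of consulting a freshly copied map.
case class ReplicaFetch(partitionData: JMap[TopicPartition, PartitionData],
                        fetchRequest: FetchRequestBuilder)
```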
- val isReplicaInSync = fetcherLagStats.isReplicaInSync(topicPartition)
- !isReplicaInSync && quota.isThrottled(topicPartition) && quota.isQuotaExceeded
+ private def shouldFollowerThrottle(quota: ReplicaQuota, fetchState: PartitionFetchState, topicPartition: TopicPartition): Boolean = {
+ !fetchState.isReplicaInSync && quota.isThrottled(topicPartition) && quota.isQuotaExceeded
We avoid a HashMap lookup here by including the lag in the PartitionFetchState directly.
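A minimal sketch of the idea, assuming a lag field of type Option[Long] on the fetch state (illustrative; the real PartitionFetchState carries more fields and the real check takes a ReplicaQuota):

```scala
// Illustrative sketch: the throttle check reads a field it already has in hand
// instead of doing a per-partition lookup through FetcherLagStats.
case class PartitionFetchState(fetchOffset: Long, lag: Option[Long]) {
  // Lag is unknown until a fetch response updates it; treat unknown as not in sync.
  def isReplicaInSync: Boolean = lag.exists(_ <= 0)
}

def shouldFollowerThrottle(isThrottled: Boolean, isQuotaExceeded: Boolean,
                           fetchState: PartitionFetchState): Boolean =
  !fetchState.isReplicaInSync && isThrottled && isQuotaExceeded
```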
  def isTruncating: Boolean = state == Truncating && !isDelayed
- def isDelayed: Boolean = delay.getDelay(TimeUnit.MILLISECONDS) > 0
+ def isDelayed: Boolean = delay.exists(_.getDelay(TimeUnit.MILLISECONDS) > 0)
Avoid an unnecessary gettimeofday when partitions are in Fetching state (the common case) by using an Option[DelayedItem].
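A small sketch of the effect (this DelayedItem is a simplified stand-in for Kafka's): with an Option, the common Fetching-state case is a None check rather than a clock read.

```scala
import java.util.concurrent.TimeUnit

// Simplified stand-in for Kafka's DelayedItem: asking for the remaining delay
// has to read the clock.
class DelayedItem(delayMs: Long) {
  private val dueMs = System.currentTimeMillis() + delayMs
  def getDelay(unit: TimeUnit): Long =
    unit.convert(dueMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS)
}

case class PartitionFetchState(fetchOffset: Long, delay: Option[DelayedItem]) {
  // Before (roughly): delay.getDelay(TimeUnit.MILLISECONDS) > 0 -- a clock read per call.
  // After: partitions with no delay (the common Fetching case) short-circuit on None.
  def isDelayed: Boolean = delay.exists(_.getDelay(TimeUnit.MILLISECONDS) > 0)
}
```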
Thanks for the combo-patch @lbradstreet! I think this is very helpful. A couple of thoughts off the top of my head:
partitionMap.size in Scala is O(n), although the passed-in map is a LinkedHashMap whose size call is O(1). Honestly I do not know which one would be triggered, but worth reducing it to be safe.
For 1) I've just updated the description to include the relative ratio of each optimization. It seems that 29fdd60 makes it less scalable with num.partitions, since it's only useful with small num.partitions.
Thanks! I'm assuming that you're OK with it still, because it still helps across the board? It just helps less for the high partition count case?
> new partitions are left in next map.
It's a good question, and I'm not sure. The idea was that by making next smaller we could make the lookups progressively quicker as we process sessionPartitions. It did seem to improve performance for our benchmark. I forgot to include the second optimization that it enables: f0bf23a. This helps reduce containsKey checks, and is worth keeping if we leave the change to keep the second map in. If not, we will need to remove it again.
Yeah, that's okay with me still. Just trying to point out that with a large number of partitions there seem to be other dominating factors.
We will get better performance improvements by redesigning how incremental fetch sessions work in the Replica Fetcher. Minimizing the changes to the fetch session code will minimize the risk of this changeset.
@guozhangwang I removed (32e30ac) most of the risky FetchSessionHandler changes that only showed up in the FetchSessionHandlerBenchmark, as I think we should redesign how the Replica Fetcher incremental fetch session is maintained rather than try to optimize it further. I re-ran the benchmark on the r5x.large and it shows very little difference vs the previous commit.
guozhangwang left a comment
LGTM. I do not have further comments except the fix for scala 2.13 above. I will try to run the benchmark included in your PR on my local MP5 as well to confirm its effects before merging.
  * incremental fetch requests (see below).
  */
- private LinkedHashMap<TopicPartition, PartitionData> next = new LinkedHashMap<>();
+ private LinkedHashMap<TopicPartition, PartitionData> next;
I'd suggest we go one step further and make sessionPartitions a local variable only: in the later handling logic after the response comes back, we only need sessionPartitions.keySet / size, which we could maintain separately as a member field populated after the build() call. WDYT?
NVM, I realize now that you also need sessionPartitions to save one fetch state copy as well.
I've run the benchmark from this PR on my laptop and here are the results:

Baseline (trunk): for 100/500/1000/5000 partitions, the cost scale factor (i.e. score per partition) was 1 / 1.32 / 1.46 / 2.42.

This PR: the latency reduction ranges between 65% and 50%. Interestingly, for 100/500/1000/5000 partitions, the cost scale factor (i.e. score per partition) was 1 / 1.42 / 1.64 / 2.72, which is higher than the baseline. It means the optimizations we've done in this PR benefit the scenario with fewer partitions more than larger num.partitions.

Overall, I still think we can merge the PR as-is and continue the next optimizations on top of it.
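For reference, a small sketch of how the cost scale factor can be computed, applied here to the vanilla r5x.large numbers from the description (the laptop scores behind the 1/1.32/1.46/2.42 figures above are not listed, so the factors differ):

```scala
// Cost scale factor = (score / partitions) normalized to the 100-partition case.
object CostScaleFactor extends App {
  val vanillaNsPerOp = Seq(100 -> 26491.825, 500 -> 153941.952,
                           1000 -> 339868.602, 5000 -> 2588878.448)
  val base = vanillaNsPerOp.head._2 / vanillaNsPerOp.head._1
  vanillaNsPerOp.foreach { case (partitions, score) =>
    println(f"$partitions%5d partitions -> scale factor ${(score / partitions) / base}%.2f")
  }
  // Prints roughly 1.00, 1.16, 1.28, 1.95: per-partition cost grows with partition count.
}
```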
Resolved by 32e30ac
Build failures are due to broken trunk (#7521)
retest this please |
  fetchRequest: FetchRequest.Builder): Unit = {
  val partitionsWithError = mutable.Set[TopicPartition]()
- var responseData: Seq[(TopicPartition, FetchData)] = Seq.empty
+ var responseData: Map[TopicPartition, FetchData] = Map.empty
junrao left a comment
@lbradstreet : Thanks for the PR. Looks good overall. A few comments below.
  val partitionsWithError = mutable.Set[TopicPartition]()
- val builder = fetchSessionHandler.newBuilder()
+ val builder = fetchSessionHandler.newBuilder(partitionMap.size, false)
Hmm, here, we choose not to copy sessionPartitions in the builder. However, in line 265 and line 273, we still read sessionPartitions. Is that correct?
I believe this is still correct. We do not copy sessionPartitions, but sessionPartitions will also not be modified until the next fetch/doWork iteration, and all use of it is from the same thread. I believe it is thus safe to use it without making a copy.
Thanks for the explanation. That makes sense.
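A hedged sketch of the ordering that makes this safe, with names only loosely modeled on AbstractFetcherThread: the fetcher thread builds the request and processes it within one doWork() iteration, and sessionPartitions is not mutated again until the next iteration's build step.

```scala
// Illustrative only: each fetcher thread runs this loop by itself, so the
// session partition map produced in step 1 is only read in step 2 of the same
// iteration, on the same thread, before it can be modified again.
trait FetcherLoopSketch {
  type SessionPartitions // produced by the fetch session handler, owned by this thread
  type RequestBuilder

  def buildFetch(): Option[(SessionPartitions, RequestBuilder)]
  def processFetchRequest(sessionPartitions: SessionPartitions, request: RequestBuilder): Unit

  final def doWork(): Unit =
    buildFetch().foreach { case (sessionPartitions, request) => // step 1: build (may mutate session)
      processFetchRequest(sessionPartitions, request)           // step 2: read-only use, same thread
    }
}
```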
junrao left a comment
@lbradstreet : Thanks for the latest PR. LGTM
LGTM. I will merge after the Jenkins job completes.
Failures in 2.13:
Improves the performance of the replica fetcher for high partition count fetch requests, where a majority of the partitions did not update between fetch requests. All benchmarks were run on an r5x.large.
Vanilla
Benchmark (partitionCount) Mode Cnt Score Error Units
ReplicaFetcherThreadBenchmark.testFetcher 100 avgt 15 26491.825 ± 438.463 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 500 avgt 15 153941.952 ± 4337.073 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 1000 avgt 15 339868.602 ± 4201.462 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 5000 avgt 15 2588878.448 ± 22172.482 ns/op
From 100 to 5000 partitions the latency increase is 2588878.448 / 26491.825 = 98X
Avoid gettimeofday calls in steady state fetch states
8545888
Benchmark (partitionCount) Mode Cnt Score Error Units
ReplicaFetcherThreadBenchmark.testFetcher 100 avgt 15 22685.381 ± 267.727 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 500 avgt 15 113622.521 ± 1854.254 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 1000 avgt 15 273698.740 ± 9269.554 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 5000 avgt 15 2189223.207 ± 1706.945 ns/op
From 100 to 5000 partitions the latency increase is 2189223.207 / 22685.381 = 97X
Avoid copying partition states to maintain fetch offsets
29fdd60
Benchmark (partitionCount) Mode Cnt Score Error Units
ReplicaFetcherThreadBenchmark.testFetcher 100 avgt 15 17039.989 ± 609.355 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 500 avgt 15 99371.086 ± 1833.256 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 1000 avgt 15 216071.333 ± 3714.147 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 5000 avgt 15 2035678.223 ± 5195.232 ns/op
From 100 to 5000 partitions the latency increase is 2035678.223 / 17039.989 = 119X
Keep lag alongside PartitionFetchState to avoid expensive isReplicaInSync check
0e57e3e
Benchmark (partitionCount) Mode Cnt Score Error Units
ReplicaFetcherThreadBenchmark.testFetcher 100 avgt 15 15131.684 ± 382.088 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 500 avgt 15 86813.843 ± 3346.385 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 1000 avgt 15 193050.381 ± 3281.833 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 5000 avgt 15 1801488.513 ± 2756.355 ns/op
From 100 to 5000 partitions the latency increase is 1801488.513 / 15131.684 = 119X
Fetch session optimizations (mostly presizing the next hashmap, and avoiding making a copy of sessionPartitions, as a deep copy is not required for the ReplicaFetcher)
2614b24
Benchmark (partitionCount) Mode Cnt Score Error Units
ReplicaFetcherThreadBenchmark.testFetcher 100 avgt 15 11386.203 ± 416.701 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 500 avgt 15 60820.292 ± 3163.001 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 1000 avgt 15 146242.158 ± 1937.254 ns/op
ReplicaFetcherThreadBenchmark.testFetcher 5000 avgt 15 1366768.926 ± 3305.712 ns/op
From 100 to 5000 partitions the latency increase is 1366768.926 / 11386.203 = 120X