KAFKA-10271: Performance regression while fetching a key from a single partition #9020
guozhangwang merged 20 commits into apache:trunk from
… return all state stores
abbccdda left a comment
Thanks for the PR, could we add some unit test coverage?
Hi @dima5rr, thanks for the PR!
I was looking into the context of this ticket and noticed that we're essentially just duplicating here the logic in …
Anecdotally, I think that most people would actually just query immediately and then discard their store reference, for example like …
In fact, the only value it could provide is if you do plan to save the store reference and re-use it for multiple queries. But in that case, there could be a rebalance at any time, so checking up front probably doesn't help much no matter what the use case is.
What do you think about removing the loop …
Of course, the short-circuit you're providing here is valuable in any case. Thanks!
Alternatively, as you observed, if the parameters contain a single partition, then there should just be one specific store that matches. Instead of returning a …
Hi @vvcephei, thank you for the input. After profiling under load, it looks like the problem is the excessive looping over calls to internalTopologyBuilder.topicGroups(). WDYT?
Indeed, StreamThreadStateStoreProvider does not need InternalTopologyBuilder in order to find the stream task; it has all the required data in StreamThread.
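The point of the two comments above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `Task`, `Topology`, `findViaTopology`, and `findDirectly` are hypothetical stand-ins, and the only things taken from the thread are that `topicGroups()` is synchronized and that the lookup can be answered from data the thread already holds. The sketch counts how many times the synchronized method is hit when it is called inside the per-task loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class TaskLookupSketch {

    /** Hypothetical stand-in for a stream task: one partition plus the stores it owns. */
    static final class Task {
        final int partition;
        final List<String> storeNames;

        Task(final int partition, final List<String> storeNames) {
            this.partition = partition;
            this.storeNames = storeNames;
        }
    }

    /** Hypothetical stand-in for the topology builder; counts synchronized calls. */
    static final class Topology {
        final AtomicInteger calls = new AtomicInteger();

        synchronized List<String> topicGroups() {
            calls.incrementAndGet();
            return Arrays.asList("group-0");
        }
    }

    /** Pattern the profile pointed at: every loop iteration takes the topology lock. */
    static Optional<Task> findViaTopology(final Topology topology, final List<Task> tasks,
                                          final String storeName, final int partition) {
        for (final Task task : tasks) {
            topology.topicGroups(); // synchronized call repeated once per task
            if (task.partition == partition && task.storeNames.contains(storeName)) {
                return Optional.of(task);
            }
        }
        return Optional.empty();
    }

    /** Alternative: match on data the tasks already carry, with no shared lock. */
    static Optional<Task> findDirectly(final List<Task> tasks, final String storeName,
                                       final int partition) {
        return tasks.stream()
            .filter(task -> task.partition == partition && task.storeNames.contains(storeName))
            .findFirst();
    }

    public static void main(final String[] args) {
        final List<Task> tasks = new ArrayList<>();
        for (int p = 0; p < 8; p++) {
            tasks.add(new Task(p, Arrays.asList("counts")));
        }
        final Topology topology = new Topology();
        findViaTopology(topology, tasks, "counts", 7);
        System.out.println("synchronized calls via topology: " + topology.calls.get());
        System.out.println("direct lookup found: " + findDirectly(tasks, "counts", 7).isPresent());
    }
}
```

In this toy model the number of lock acquisitions grows with the number of tasks, which matches the observation that the regression is worst for stores with many partitions; the direct lookup never touches the shared lock.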
… before returning the WrappingStoreProvider
vvcephei left a comment
Thanks for the update, and for adding the tests, and especially for profiling it!
I just had a couple of minor comments. Do you mind running the profiler again, just to make sure it lines up with your expectations?
Thanks,
-John
    if (allStores.isEmpty()) {
        if (storeQueryParameters.partition() != null) {
            throw new InvalidStateStoreException(
                String.format("The specified partition %d for store %s does not exist.",
Is this really a different condition than the one on L65? It seems like the failure is still probably that the store "migrated" instead of "doesn't exist", right?
L65 catches the rebalancing case, while L60 is parameter validation for the case of an incorrect partition.
Could you elaborate a bit more on this? If allStores is empty, it is always possible that the specified store-partition, or just store-"null", does not exist in this client. Why are they different failure cases?
Hey @dima5rr , I think Guozhang's question was hidden because the conversation was already "resolved". Do you mind answering this concern?
Hey @guozhangwang, you're right, this check is ambiguous. It is closer to a parameter sanity check for when the user explicitly specifies a single partition.
Got it. In that case, how about we just encode the partition in the thrown exception's message, so that people can still check whether the partition is null or not when debugging?
Otherwise, this PR all LGTM :)
Hey @guozhangwang, my only concern is that, for the case where the partition is null, the error message is referenced in the official FAQ.
That's a fair point, let's just merge it as is then.
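To make the outcome of this thread concrete, here is a minimal sketch of the two failure branches as discussed. It is an illustration under assumptions, not the merged code: `requireStores` is a hypothetical helper, `InvalidStateStoreException` is a local stand-in named after the class in the diff, the `String.format` arguments are guessed because the snippet above is cut off, and the wording of the "migrated" message is assumed (the thread only says that the message for the null-partition case is referenced in the FAQ and therefore stays unchanged).

```java
import java.util.ArrayList;
import java.util.List;

public class StoreLookupSketch {

    /** Local stand-in named after the exception in the diff; not the Kafka class. */
    static final class InvalidStateStoreException extends RuntimeException {
        InvalidStateStoreException(final String message) {
            super(message);
        }
    }

    /**
     * Returns the stores that were found, or throws if there are none.
     * A null partition means the caller did not ask for a specific partition.
     */
    static <T> List<T> requireStores(final List<T> allStores, final String storeName,
                                     final Integer partition) {
        if (allStores.isEmpty()) {
            if (partition != null) {
                // Explicit single partition: treated as parameter validation.
                // The format arguments are an assumption; the diff snippet is truncated.
                throw new InvalidStateStoreException(
                    String.format("The specified partition %d for store %s does not exist.",
                        partition, storeName));
            }
            // Assumed wording for the unchanged message that the FAQ refers to.
            throw new InvalidStateStoreException(
                "The state store, " + storeName + ", may have migrated to another instance.");
        }
        return allStores;
    }

    public static void main(final String[] args) {
        try {
            requireStores(new ArrayList<String>(), "counts", 2);
        } catch (final InvalidStateStoreException e) {
            System.out.println(e.getMessage());
        }
        try {
            requireStores(new ArrayList<String>(), "counts", null);
        } catch (final InvalidStateStoreException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the null-partition message untouched is the design choice the reviewers settled on: anyone searching for the existing text still finds the FAQ entry, while the partition-specific message only appears when a partition was explicitly requested.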
    }

    @Test
    public void shouldNotAccessJoinStoresWhenGivingName() throws InterruptedException {
A good coverage improvement! Thanks.
    }
    throw new InvalidStateStoreException("Cannot get state store " + storeName + " because the requested partition " +
        partition + " is not available on this instance");
    private Optional<Task> findStreamTask(final Collection<Task> tasks, final String storeName, final int partition) {
This is a great find, thanks!
    if (!globalStore.isEmpty()) {
        return queryableStoreType.create(globalStoreProvider, storeName);
    }
    final List<T> allStores = new ArrayList<>();
|
test this please |
@dima5rr I tried to compile your branch but got a few compilation errors like the following:
apache#9108)
The main goal is to remove usage of the embedded broker (EmbeddedKafkaCluster) in AbstractJoinIntegrationTest and its subclasses. This is because the tests under this class are no longer using the embedded broker, except for two. testShouldAutoShutdownOnIncompleteMetadata is one such test. Furthermore, this test does not actually perform a stream-table join; it is testing an edge case of joining with a non-existent topic, so it should be in a separate test.
Testing strategy: run existing unit and integration tests
Reviewers: Boyang Chen <boyang@confluent.io>, Bill Bejeck <bbejeck@apache.org>
Hi @guozhangwang, can you trigger a new build? It looks like the failures are flaky tests.
|
test this please |
Test passed, merged to trunk. Thanks @dima5rr for your great contribution!
…e partition (#9020)
StreamThreadStateStoreProvider excessive loop over calling internalTopologyBuilder.topicGroups(), which is synchronized, thus causing significant performance degradation to the caller, especially when store has many partitions.
Reviewers: John Roesler <vvcephei@apache.org>, Guozhang Wang <wangguoz@gmail.com>
Cherry-picked to 2.6 as well.

StreamThreadStateStoreProvider loops excessively over calls to internalTopologyBuilder.topicGroups(), which is synchronized, causing significant performance degradation for the caller, especially when the store has many partitions.
https://issues.apache.org/jira/browse/KAFKA-10271