KAFKA-12648: Pt. 3 - addNamedTopology API by ableegoldman · Pull Request #10788 · apache/kafka

ableegoldman · 2021-05-29T01:53:24Z

Pt. 1: #10609
Pt. 2: #10683
Pt. 3: #10788

In Pt. 3 we implement the addNamedTopology API. This can be used to update the processing topology of a running Kafka Streams application without resetting the app, or even pausing/restarting the process. It's up to the user to call this API on every instance of an application so that all clients are able to run the newly added NamedTopology. This should not be too much of a burden, as it only requires that each client eventually be updated by the user -- under the covers, Streams will take care of keeping the internal state consistent while various clients wait to converge on the latest view of the full topology.

Internally, when a new NamedTopology is added, a rebalance will be triggered to distribute the tasks that correspond to it. To minimize disruption and wasted work, the assignor just computes the desired eventual assignment of these new tasks to clients, regardless of whether the target client has been issued the addNamedTopology request yet. If a client receives tasks for a NamedTopology it does not yet recognize, it stashes them away and continues to process its other topologies. Once the client is told to add this NamedTopology, those tasks will be created and begin processing without triggering a new rebalance. If the newly added NamedTopology does not match any of the unknown tasks the client has stashed, then the client must trigger a fresh rebalance for it.
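The stash-then-activate behavior described above can be sketched in a few lines of Java. This is only an illustration under assumptions: `StashingClient`, its methods, and the string task ids are hypothetical names invented here, not classes or signatures from Kafka Streams, and the real logic lives in the task manager and assignor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from the Kafka Streams codebase.
class StashingClient {
    private final Set<String> knownTopologies = new HashSet<>();
    // Tasks that were assigned for topologies this client has not been told about yet.
    private final Map<String, List<String>> stashedTasks = new HashMap<>();
    private final List<String> activeTasks = new ArrayList<>();
    private int rebalancesRequested = 0;

    // Called when a rebalance hands this client a task belonging to the given topology.
    void onTaskAssigned(String topology, String taskId) {
        if (knownTopologies.contains(topology)) {
            activeTasks.add(taskId);
        } else {
            // Unknown topology: stash the task and keep processing everything else.
            stashedTasks.computeIfAbsent(topology, t -> new ArrayList<>()).add(taskId);
        }
    }

    // Called when the user adds a topology on this client.
    void addNamedTopology(String topology) {
        knownTopologies.add(topology);
        List<String> pending = stashedTasks.remove(topology);
        if (pending != null) {
            activeTasks.addAll(pending); // already assigned: start without a new rebalance
        } else {
            rebalancesRequested++;       // nothing stashed: a fresh rebalance is needed
        }
    }

    List<String> activeTasks() { return activeTasks; }

    int rebalancesRequested() { return rebalancesRequested; }
}
```

In this sketch, a client that was assigned a task before learning about its topology activates it on `addNamedTopology` with no rebalance, while adding a topology with no stashed tasks bumps the rebalance counter.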

…ogyBuilders of named topologies (#10683)

Pt. 1: #10609
Pt. 2: #10683
Pt. 3: #10788

The TopologyMetadata is next up after Pt. 1 #10609. This PR sets up the basic architecture for running an app with multiple NamedTopologies, though the APIs to add/remove them dynamically are not implemented until Pt. 3.

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>

ableegoldman · 2021-07-29T22:43:02Z

We changed the behavior to return an empty Optional rather than throw, as users may want to use this API to determine whether the given named topology is known or not.
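As a rough illustration of that design choice, a lookup that returns an empty Optional lets callers probe for a topology without a try/catch. The `TopologyRegistry` class and its `lookup` method below are hypothetical stand-ins, and a plain String stands in for the topology object; this is not the actual Streams API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a String description stands in for a real topology object.
class TopologyRegistry {
    private final Map<String, String> topologies = new ConcurrentHashMap<>();

    void add(String name, String description) {
        topologies.put(name, description);
    }

    // Returns an empty Optional for an unknown name instead of throwing,
    // so the same call doubles as an "is this topology known?" check.
    Optional<String> lookup(String name) {
        return Optional.ofNullable(topologies.get(name));
    }
}
```

A caller can then write `registry.lookup("orders").isPresent()` to test for existence, or chain `map`/`orElse` on the result.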


guozhangwang

Made a first pass on the PR.

guozhangwang · 2021-08-03T23:38:36Z

 public class NamedTopologyIntegrationTest {
    public static final EmbeddedKafkaCluster CLUSTER = new EmbeddedKafkaCluster(1);

+    // TODO KAFKA-12648:


This is a meta question: do we have coverage on scenarios where the leader's and members' bookkept named-topology sets differ? I.e. 1) the leader would not try to create any tasks that its own topology metadata is not aware of, even if other subscriptions contain more topics; 2) vice versa, the other members would not try to create tasks from an assignment that their topology metadata does not recognize, and only later, when the topologies get added, would the tasks get created?

I'm still filling out the integration test suite, especially the multi-node testing, but I'll make sure this scenario has coverage. This will probably have to be in the followup Pt. 4 which expands add/removeNamedTopology to return a Future, since being able to block on this helps a lot with the testing.
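To show why a Future-returning call helps with testing, here is a minimal sketch in which adding a topology hands back a CompletableFuture that completes once the topology is reported as started. `FutureTopologyAdder` and `onTopologyStarted` are hypothetical names, and this is not the result type that the Pt. 4 followup introduces.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: not the API introduced by the Pt. 4 followup.
class FutureTopologyAdder {
    private final Map<String, CompletableFuture<Void>> pending = new ConcurrentHashMap<>();

    // Registers the topology and returns a future that a test can block on.
    CompletableFuture<Void> addNamedTopology(String name) {
        return pending.computeIfAbsent(name, n -> new CompletableFuture<>());
    }

    // Invoked (in this sketch) once the topology's tasks have been created and started.
    void onTopologyStarted(String name) {
        CompletableFuture<Void> future = pending.get(name);
        if (future != null) {
            future.complete(null);
        }
    }
}
```

A multi-node test could call `addNamedTopology` on each instance and wait on the returned futures before asserting on output, rather than polling for state.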


guozhangwang

Made a pass on the new commits, but I'm a bit confused about some of the logic around unknown task removal and task freezing. Maybe we can chat again so I can get your thoughts?


ableegoldman · 2021-08-05T03:00:09Z

@guozhangwang I think I've addressed all your feedback and significantly cleaned up the streamthread event loop + topology locking, let me know if there's anything else

ableegoldman · 2021-08-05T09:40:48Z

Java 8 & 11 tests passed, Java 16 build failure was unrelated: Execution failed for task ':storage:unitTest'.

wcarlson5

@ableegoldman I still think this looks good!

guozhangwang

Made another pass.

guozhangwang · 2021-08-05T22:22:11Z

-    private final SortedMap<String, InternalTopologyBuilder> builders; // Keep sorted by topology name for readability
+    private final TopologyVersion version;
+
+    private final ConcurrentNavigableMap<String, InternalTopologyBuilder> builders; // Keep sorted by topology name for readability


SGTM.

What about the other comment, i.e. moving the Map<String, InternalTopologyBuilder> builders into the TopologyVersion itself? Besides the constructors, the only modifiers to builders seem to be register/deregister, in which we would always try to getAndIncrement version. So what about consolidating the modification of builders along with version bump, and hence we would not need to use a ConcurrentNavigableMap?
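One way to read this suggestion is sketched below: if every change to the builders map happens under the same lock that bumps the version, a plain TreeMap is enough and no concurrent map is needed. `VersionedBuilders` and its methods are hypothetical names for illustration; the actual TopologyVersion class in the PR may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: the map and the version counter are guarded by one lock.
class VersionedBuilders<B> {
    private final ReentrantLock lock = new ReentrantLock();
    private final TreeMap<String, B> builders = new TreeMap<>(); // sorted by topology name
    private long version = 0L;

    // Adds a builder and bumps the version in one critical section.
    long register(String name, B builder) {
        lock.lock();
        try {
            builders.put(name, builder);
            return ++version;
        } finally {
            lock.unlock();
        }
    }

    // Removes a builder and bumps the version in one critical section.
    long deregister(String name) {
        lock.lock();
        try {
            builders.remove(name);
            return ++version;
        } finally {
            lock.unlock();
        }
    }

    // Returns a snapshot copy so callers never iterate the live map.
    List<String> topologyNames() {
        lock.lock();
        try {
            return new ArrayList<>(builders.keySet());
        } finally {
            lock.unlock();
        }
    }
}
```

The tradeoff in this sketch is that readers also take the lock, which is acceptable when topology changes and lookups are rare compared to record processing.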

guozhangwang · 2021-08-05T22:26:49Z


        final long pollLatency = pollPhase();

+        topologyMetadata.maybeWaitForNonEmptyTopology(() -> state);


How about moving this ahead of pollPhase()? We are likely to be kicked out of the group while blocked waiting here, so it's better to be aware of that and re-join the group immediately, rather than still doing the restore etc., which may all be wasted work.

Ack (although note that there's no wasted work on the restore phase since there's by definition nothing for the thread to do yet as it won't have been assigned any new tasks until it polls again).

I don't think it really matters much where we put this for that reason, except for the case in which we start up with no topology -- then it's a waste to join the group in the first place, so we may as well wait until we receive something to work on. So yes, I'll move it back ahead of poll
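The idea of blocking until there is something to work on, while still reacting to shutdown, can be sketched with a lock and a condition. `TopologyGate`, `awaitNonEmpty`, and `wakeup` are hypothetical names; the real maybeWaitForNonEmptyTopology takes a supplier of the thread state, and its internals are not shown in this thread.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of waiting for a non-empty topology before polling.
class TopologyGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private final Set<String> topologies = new HashSet<>();

    void add(String name) {
        lock.lock();
        try {
            topologies.add(name);
            changed.signalAll(); // wake any thread waiting for work
        } finally {
            lock.unlock();
        }
    }

    // Wakes waiters so they can re-check the running flag, e.g. on shutdown.
    void wakeup() {
        lock.lock();
        try {
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Blocks while running with no topology; returns true if a topology is available.
    boolean awaitNonEmpty(BooleanSupplier isRunning) {
        lock.lock();
        try {
            while (topologies.isEmpty() && isRunning.getAsBoolean()) {
                try {
                    changed.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return !topologies.isEmpty();
        } finally {
            lock.unlock();
        }
    }
}
```

In an event loop shaped like the one discussed here, the thread would call `awaitNonEmpty` before its poll step, so an instance started with no topology does not join the group until it has work.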


ableegoldman · 2021-08-06T02:25:16Z

    void shutdown(final boolean clean) {
        final AtomicReference<RuntimeException> firstException = new AtomicReference<>(null);

-        final Set<Task> tasksToCloseDirty = new HashSet<>();


No actual changes here, just pulled the cleanup of tasks out into a separate new #closeAndCleanUpTasks method so we can call that on tasks from removed topologies

ableegoldman · 2021-08-06T02:26:52Z


        if (!remainingRevokedPartitions.isEmpty()) {
-            log.warn("The following partitions {} are missing from the task partitions. It could potentially " +
+            log.debug("The following revoked partitions {} are missing from the current task partitions. It could potentially " +


Making this debug since warn seems too intense, and I'm not sure it's even worthy of info -- also, with named topologies you would expect to see this almost every time a topology is removed since the thread will try to close those tasks as soon as it notices the topology's removal

guozhangwang · 2021-08-06T04:31:49Z

LGTM! Please feel free to merge after green builds.

ableegoldman · 2021-08-06T07:16:44Z

Just one unrelated test failure (reopened KAFKA-13128): StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

ableegoldman · 2021-08-06T07:19:26Z

Merged to trunk -- thanks all for keeping up with the reviews so far 😄


Pt. 1: apache#10609
Pt. 2: apache#10683
Pt. 3: apache#10788

In Pt. 3 we implement the addNamedTopology API. This can be used to update the processing topology of a running Kafka Streams application without resetting the app, or even pausing/restarting the process. It's up to the user to ensure that this API is called on every instance of an application to ensure all clients are able to run the newly added NamedTopology.

Reviewers: Guozhang Wang <guozhang@confluent.io>, Walker Carlson <wcarlson@confluent.io>

This was referenced May 29, 2021

KAFKA-12648: Pt. 2 - Introduce TopologyMetadata to wrap InternalTopologyBuilders of named topologies #10683

Merged

KAFKA-12648: Pt. 1 - Add NamedTopology to protocol and state directory structure #10609

Merged

ableegoldman marked this pull request as draft May 29, 2021 01:55

ableegoldman force-pushed the 12648-Pt3-addNamedTopology-API branch from 678d8f8 to fb44c25 Compare June 10, 2021 02:31

ableegoldman force-pushed the 12648-Pt3-addNamedTopology-API branch 6 times, most recently from 75a9961 to bacd140 Compare July 15, 2021 21:42

ableegoldman mentioned this pull request Jul 24, 2021

MINOR: factor state checks into descriptive methods and clarify javadocs #11123

Merged

ableegoldman marked this pull request as ready for review July 29, 2021 18:54

ableegoldman force-pushed the 12648-Pt3-addNamedTopology-API branch from ddd50d0 to c876cb7 Compare July 29, 2021 22:37


ableegoldman mentioned this pull request Jul 30, 2021

KAFKA-12648: minor followup from Pt. 2 and some new tests #11146

Merged

ableegoldman requested a review from guozhangwang July 30, 2021 02:39

Pt. 3 -- addNamedTopology()

33655d2

ableegoldman force-pushed the 12648-Pt3-addNamedTopology-API branch from 32e8f55 to 33655d2 Compare August 3, 2021 02:59

ableegoldman marked this pull request as draft August 3, 2021 03:59

ableegoldman added 2 commits August 2, 2021 21:08

WIP: fixing up and begin adding tests for removal of named topology with cleanup

5ab3be8

remove inter-node version

c91c241

ableegoldman marked this pull request as ready for review August 3, 2021 05:42

ableegoldman force-pushed the 12648-Pt3-addNamedTopology-API branch from 2727910 to a1e2374 Compare August 3, 2021 06:00

Remove supportedNamedTopologies field from SubscriptionInfo until actually needed

e93056c

ableegoldman force-pushed the 12648-Pt3-addNamedTopology-API branch from a1e2374 to e93056c Compare August 3, 2021 06:04

guozhangwang reviewed Aug 3, 2021


ableegoldman added 3 commits August 3, 2021 18:57

remove assignmentNamedTopologies from AssignmentInfo until (and if) it is needed

a1528b4

first set of updates to review feedback

aa9d71d

implement task freezing for removed tasks

a18493f

checkstyle

7636591

guozhangwang reviewed Aug 4, 2021


ableegoldman added 3 commits August 4, 2021 12:56

checkstyle

4c46a30

fix stupid late-night bugs and clean up thread event loop/topology locking

156d74f

fix up condition given semantics of #subscription

6720585

ableegoldman added 4 commits August 4, 2021 20:05

temporarily ignore tests that need blocking behavior

e23c145

minor cosmetics from Pt. 4

a0778e9

fixing up some tests

e8056ff

final test fix

895e6a3

wcarlson5 approved these changes Aug 5, 2021


guozhangwang reviewed Aug 5, 2021


more review feedback: close tasks after removed topology, wait for non-empty topology before calling poll

68e314e


ableegoldman added 2 commits August 5, 2021 19:34

fix test due to log change

e478a70

move wait

623e7d3

ableegoldman merged commit 6854eb8 into apache:trunk Aug 6, 2021

ableegoldman mentioned this pull request Oct 21, 2021

KAFKA-12648: Pt. 4 - return Add/RemoveNamedTopologyResult so callers can wait on topology changes #11421

Closed


