KAFKA-12648: fix #add/removeNamedTopology blocking behavior when app is in CREATED by ableegoldman · Pull Request #11813 · apache/kafka

ableegoldman · 2022-02-26T05:22:37Z

Currently the #add/removeNamedTopology APIs behave a little wonky when the application is still in CREATED. Since adding and removing topologies runs some validation steps, there is a valid reason to want to add or remove a topology on a dummy app that you don't plan to start, or on a real app that you haven't started yet. But to actually check the results of the validation you need to call get() on the future, so we need to make sure that get() won't block forever when there is no failure, which is what currently happens.

ableegoldman · 2022-02-26T05:26:18Z

     * Start up Streams with a collection of initial NamedTopologies (may be empty)
     */
-    public void start(final Collection<NamedTopology> initialTopologies) {
+    public synchronized void start(final Collection<NamedTopology> initialTopologies) {


super.start() is already synchronized but we should just go ahead and synchronize at the first layer

I took some time to understand why we want to synchronize here, as at first sight it looked a bit unclear to me:

/* -> means caller -> callee */
inherited.start: synchronized, public ->
addNamedTopology: unsynchronized, public, register topology metadata ->
completedFutureForUnstartedApp: synchronized, private, check state

removeNamedTopology: unsynchronized, public, unregister topology metadata ->
completedFutureForUnstartedApp: synchronized, private, check state

Registering/unregistering topology metadata is synchronized, and parent.start would modify the state.

I think I understand now that it's because addNamedTopology is not synchronized, plus when we have multiple named topologies we want to keep the state unchanged while adding them one by one. Is that the case? If yes, maybe it's better to add that reasoning to the javadoc above.

ack, will do
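
To make the reasoning above concrete, here is a minimal sketch of why holding the lock across the whole start() keeps the state fixed while the initial topologies are registered one by one. The class, the State enum, and the String topology names are hypothetical stand-ins for illustration, not the actual wrapper code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for the wrapper class; names and types are assumptions.
class SynchronizedStartSketch {
    enum State { CREATED, RUNNING }

    private State state = State.CREATED;
    private final List<String> registered = new ArrayList<>();

    // The monitor is held for the whole method, so no other synchronized
    // method can observe or change the state until every initial topology
    // has been registered and the state has moved to RUNNING.
    synchronized void start(final Collection<String> initialTopologies) {
        for (final String name : initialTopologies) {
            registered.add(name);
        }
        state = State.RUNNING;
    }

    synchronized State state() {
        return state;
    }

    // Returns a copy so callers cannot mutate the internal list.
    synchronized List<String> registered() {
        return new ArrayList<>(registered);
    }
}
```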

ableegoldman · 2022-02-26T05:26:21Z

     * @return the NamedTopology for the specific name, or Optional.empty() if the application has no NamedTopology of that name
     */
-    public Optional<NamedTopology> getTopologyByName(final String name) {
+    public synchronized Optional<NamedTopology> getTopologyByName(final String name) {


Should make sure this is thread safe since it's how we check to make sure a name isn't already used when trying to add a new topology
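
As a rough illustration of the concern, the sketch below shows a name lookup and a check-then-add that share one monitor, so two threads cannot both see a name as unused. The registry class, its map, and the addIfAbsent helper are hypothetical and only stand in for the real metadata handling.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical registry; String values stand in for NamedTopology instances.
class TopologyRegistrySketch {
    private final Map<String, String> topologies = new HashMap<>();

    // Synchronized so a lookup never interleaves with a concurrent add.
    synchronized Optional<String> getTopologyByName(final String name) {
        return Optional.ofNullable(topologies.get(name));
    }

    // The check and the put run under the same monitor (synchronized is
    // reentrant, so the nested lookup does not deadlock).
    synchronized boolean addIfAbsent(final String name, final String topology) {
        if (getTopologyByName(name).isPresent()) {
            return false;
        }
        topologies.put(name, topology);
        return true;
    }
}
```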

ableegoldman · 2022-02-26T05:26:54Z

                     removeTopologyFuture.isCompletedExceptionally() ? "unsuccessfully" : "successfully",
                     topologyToRemove, partitionsToReset
            );
-            if (!partitionsToReset.isEmpty()) {


The offset reset code is pretty long so I pulled it out into its own method to clean things up a bit

ableegoldman · 2022-02-26T05:27:05Z

+    /**
+     * @return  true iff the application is still in CREATED and the future was completed
+     */
+    private synchronized boolean completedFutureForUnstartedApp(final KafkaFutureImpl<Void> updateTopologyFuture,


This is the main fix
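
For readers skimming the thread, a minimal sketch of the idea follows. It uses java.util.concurrent.CompletableFuture in place of KafkaFutureImpl, and the class, State enum, and method names are assumptions made for illustration rather than the code in this PR.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the wrapper class; not the Kafka Streams implementation.
class UnstartedAppSketch {
    enum State { CREATED, RUNNING }

    private State state = State.CREATED;

    synchronized void start() {
        state = State.RUNNING;
    }

    // Returns true iff the app is still in CREATED and the future was completed here.
    private synchronized boolean completeIfUnstarted(final CompletableFuture<Void> future) {
        if (state == State.CREATED) {
            future.complete(null);
            return true;
        }
        return false;
    }

    CompletableFuture<Void> addTopology(final String name) {
        final CompletableFuture<Void> future = new CompletableFuture<>();
        // Validation would run here and could complete the future exceptionally.
        // With no running threads to ack the update, complete it right away so
        // that get() on the returned future cannot block forever.
        completeIfUnstarted(future);
        // Once the app is running, a stream thread would complete the future later.
        return future;
    }
}
```

With this shape, calling get() on the returned future before start() returns immediately, while after start() it stays pending until something else completes it.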

wcarlson5

I have some concerns about this behavior. I understand why we want to complete the future before we have started the Streams application. However, I am not sure that is the correct decision. If we take the future to mean that the topology is processing, then it actually makes sense for the future not to return until Streams has been started, and the user should not call get() on it before then.

Maybe we can have a third part of the Add/Remove topology result: one future for "is done adding", one for "is processing", and, for removing a topology, a future for resetting the offsets. I think this is a reasonable compromise, as it might not be possible to perform at registration all the checks that actually running the topology would. However, getting the future of a topology that was added before the Streams client was started, and waiting for that to be processed, is a reasonable path too IMO.

Anyway, I could be convinced, but I think it's something we should think about.

guozhangwang · 2022-02-27T01:01:50Z

+        }
+    }
+
+    private RemoveNamedTopologyResult resetOffsets(final KafkaFutureImpl<Void> removeTopologyFuture,


I'm assuming this is just extracting the inlined function and hence skipped and did not compare line by line :)

Yep -- just direct copy/paste

guozhangwang · 2022-02-28T02:11:54Z

guozhangwang · 2022-02-28T02:13:22Z

        topologyMetadata.unregisterTopology(removeTopologyFuture, topologyToRemove);

-        if (resetOffsets) {
+        if (!completedFutureForUnstartedApp(removeTopologyFuture, "removing topology") && resetOffsets) {


nit: how about putting resetOffsets as the first condition, so that if it's false we skip the synchronized function (not sure if JIT would really be able to optimize this way)?

We kind of assume that an application will not be doing heavy/frequent #removeNamedTopology calls. If it turns out that users want to be able to add and remove many topologies at a high rate, then we can come back and try to optimize this. It just doesn't seem to make much sense for an application to have such high turnover of topologies: this feature is, generally speaking, more targeted at giving a relatively stable application the ability to update its topology as needed, not at high volumes of transient topologies.

Oh mm, actually this change is what's causing the tests to hang, as it breaks the fix -- we actually need to ensure that we check the CREATED state and complete the future if so. But while I still stand by my comments above, i.e. that trying to avoid entering a synchronized block when adding or removing a topology is probably premature optimization, I actually did look over the class and believe we can make this work without synchronizing this particular method.

(However, we do still need to synchronize on start, for several reasons.)

Thanks! I overlooked its side effects.
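
The side effect discussed above can be reproduced in isolation. The sketch below is a hypothetical reduction (CompletableFuture instead of KafkaFutureImpl, a boolean instead of the real state check) that shows how the operand order of && decides whether the future gets completed when resetOffsets is false.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of the operand-order issue; not the Kafka Streams code.
class ShortCircuitSketch {
    // Side-effecting check: completes the future when the app is unstarted.
    static boolean completeIfUnstarted(final CompletableFuture<Void> future, final boolean unstarted) {
        if (unstarted) {
            future.complete(null);
        }
        return unstarted;
    }

    // Check first: the side effect always runs, whatever resetOffsets is.
    static CompletableFuture<Void> checkFirst(final boolean unstarted, final boolean resetOffsets) {
        final CompletableFuture<Void> future = new CompletableFuture<>();
        if (!completeIfUnstarted(future, unstarted) && resetOffsets) {
            // the offset reset would happen here
        }
        return future;
    }

    // Flag first: when resetOffsets is false, && short-circuits, the check is
    // skipped, and the future is never completed for an unstarted app.
    static CompletableFuture<Void> flagFirst(final boolean unstarted, final boolean resetOffsets) {
        final CompletableFuture<Void> future = new CompletableFuture<>();
        if (resetOffsets && !completeIfUnstarted(future, unstarted)) {
            // the offset reset would happen here
        }
        return future;
    }
}
```

In this reduction, flagFirst(true, false) returns a future that is still pending, which matches the hanging tests described above.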

wcarlson5 · 2022-02-28T20:49:48Z


    @Test
-    public void shouldThrowTopologyExceptionWhenAddingNamedTopologyReadingFromSameInputTopic() {
+    public void shouldThrowTopologyExceptionWhenAddingNamedTopologyReadingFromSameInputTopicAfterStart() {


Nit: After start?

Can you elaborate? I did add an "AfterStart" suffix here -- is that what you meant?

guozhangwang · 2022-03-01T17:26:21Z

@ableegoldman I took another look at the latest commit, and LGTM.

The Jenkins build timeout seems consistent though, and may be related; after you have investigated its cause, please feel free to merge.

guozhangwang · 2022-03-03T06:14:24Z

Reviewed the latest commit, LGTM.

guozhangwang · 2022-03-03T06:14:33Z

Re-triggering jenkins.

ableegoldman added 2 commits February 25, 2022 21:17

fix blocking behavior, add tests

af2a57c

checkstyle

2a6127f

ableegoldman requested review from guozhangwang and vvcephei February 26, 2022 05:22

wcarlson5 reviewed Feb 26, 2022

Comment thread streams/src/test/java/org/apache/kafka/streams/processor/internals/NamedTopologyTest.java

Comment thread ...apache/kafka/streams/processor/internals/namedtopology/KafkaStreamsNamedTopologyWrapper.java Outdated

guozhangwang reviewed Feb 28, 2022

add unit test

36805cf

wcarlson5 reviewed Feb 28, 2022

review feedback

c91ad03

remove synchronization and extract CREATED check

d7d6716

ableegoldman merged commit 6f54fae into apache:trunk Mar 4, 2022

3 participants