KAFKA-6687: rewrite topology to allow reading the same topic multiple times in the DSL #9582
76fe912 to e9b0362
There was a problem hiding this comment.
Saw this and at first I thought it was broken because it only considers pattern-subscribed topics that happened to explicitly configure an offset reset policy. Unless I'm missing something here, that makes no sense and we should consider all source patterns and whether they overlap.
But then I started thinking, why does it matter if they overlap? Just because one pattern is a substring of another does not mean that they'll match the same topics. So I think that we should actually just remove this restriction altogether. Am I missing anything here?
There was a problem hiding this comment.
I agree on the first part.
Regarding the second part, I had similar thoughts when I wrote my comment in mergeDuplicateSourceNodes().
But I might also be missing something here.
There was a problem hiding this comment.
I might also be missing something, but what's the scenario where one pattern is a substring of another and they don't match the same topics? If you take Bruno's example from earlier of topic* and topi*, topi* would be considered a substring of topic* and they would both match topic A, right? I guess the other scenario is if we have a topic topia A, which would match topi* and not topic*. So it seems like it isn't always true that they'll overlap, but we would want to check if they do, right?
There was a problem hiding this comment.
Pattern topic* is contained in pattern topic*A. However, topic*A matches only a subset of topic*. So, they do not match exactly the same topics. But matching exactly the same topics is a pre-requisite for merging the source nodes.
There was a problem hiding this comment.
I think in this case we were matching whether the pattern's string was a literal substring of another pattern's string, not whether the regexes themselves are substrings. So topi* would not be a substring of topic* because topi* is not contained literally within the string topic*. It's not doing any smart regex matching, just a dumb literal string comparison.
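To make the distinction concrete, here is a minimal sketch using only `java.util.regex`. It is not code from this PR: the class and method names are hypothetical, and the thread's glob-style patterns `topi*`, `topic*` and `topic*A` are written as the Java regexes `topi.*`, `topic.*` and `topic.*A`, which is an assumption about how they were meant.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Kafka Streams.
final class PatternContainmentDemo {

    // Literal check: is inner's source string contained in outer's source string?
    static boolean literallyContains(Pattern outer, Pattern inner) {
        return outer.pattern().contains(inner.pattern());
    }

    // Semantic check for one concrete topic name: do both patterns match it?
    static boolean bothMatch(Pattern first, Pattern second, String topic) {
        return first.matcher(topic).matches() && second.matcher(topic).matches();
    }

    public static void main(String[] args) {
        Pattern topi = Pattern.compile("topi.*");
        Pattern topic = Pattern.compile("topic.*");
        Pattern topicA = Pattern.compile("topic.*A");

        // Not a literal substring, yet both patterns match "topicA".
        System.out.println(literallyContains(topic, topi));
        System.out.println(bothMatch(topi, topic, "topicA"));

        // A literal substring, yet the two patterns disagree on "topicB".
        System.out.println(literallyContains(topicA, topic));
        System.out.println(bothMatch(topic, topicA, "topicB"));
    }
}
```

The two checks disagree in both directions, which is why a literal substring comparison says little about whether two patterns subscribe to the same topics.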
cadonna
left a comment
There was a problem hiding this comment.
Thank you for the PR, @ableegoldman!
According to the ticket, similar work needs to be done for table() and globalTable(). What do you think of adding subtasks to the ticket to track what has and has not been done yet?
Here is my feedback.
There was a problem hiding this comment.
We could avoid the instanceof and the casting if we introduce a RootGraphNode with a method sourceNodes(). Since a root can only have source nodes and state stores as children, we could make the topology code in general a bit more type safe. As far as I can see that would need some additional changes outside the scope of this PR. So, feel free to not consider this comment for this PR and we can do another PR for that.
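A rough sketch of what such a typed root could look like, to illustrate the idea only. All class names below are hypothetical stand-ins and are not the existing Kafka Streams graph node classes; the assumption is simply that a root keeps its source children in a separately typed list so callers never need `instanceof` or a cast.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the Kafka Streams internals.
final class RootGraphSketch {

    static class GraphNode {
        final String name;
        GraphNode(String name) { this.name = name; }
    }

    static final class SourceGraphNode extends GraphNode {
        SourceGraphNode(String name) { super(name); }
    }

    static final class StateStoreGraphNode extends GraphNode {
        StateStoreGraphNode(String name) { super(name); }
    }

    // The root only accepts the two child kinds it can legally have.
    static final class RootGraphNode extends GraphNode {
        private final List<SourceGraphNode> sources = new ArrayList<>();
        private final List<StateStoreGraphNode> stores = new ArrayList<>();

        RootGraphNode() { super("root"); }

        void addSource(SourceGraphNode source) { sources.add(source); }

        void addStateStore(StateStoreGraphNode store) { stores.add(store); }

        // Typed accessor: no instanceof check or cast needed at the call site.
        List<SourceGraphNode> sourceNodes() {
            return Collections.unmodifiableList(sources);
        }
    }
}
```

With this shape, a merge step could iterate over `root.sourceNodes()` directly instead of filtering generic children by type.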
There was a problem hiding this comment.
Yeah I think that's a fair point but I would prefer to keep the scope of this PR as small as possible for now. Maybe @lct45 could pick this up on the side once this is merged?
There was a problem hiding this comment.
Just to be clear. This improves the situation but it is not a complete solution, right? Assume we have a topic topicA. Patterns topic* and topi* both match topicA but they are different when compared with this comparator. In that case a TopologyException would be thrown in the InternalTopologyBuilder, right?
There was a problem hiding this comment.
Yes to all of that: this PR improves some situations, but not all. Specifically, you would still get a TopologyException if (a) subscribing to overlapping but not equal collections of topics, (b) subscribing to a topic and to a pattern that matches said topic, and (c) subscribing to two (or more) patterns that match the same topic(s).
Case (c) is what you described, I just wanted to list them all here for completeness. Here's my take on what we can/should reasonably try to tackle:
(a) this case is easily detected, easily worked around, and easy for us to fix. It results in a "compile time" exception (meaning when the topology is compiled, not the program) which users can quickly detect and work around if need be by rewriting the topology themselves. Fix is relatively straightforward but very low priority, so I plan to just file a followup ticket for this for now
(b) is easily detected (you get a compile time exception) and possible to work around, but difficult to solve. I think in all cases a user could find a way around this issue by some combination of topology rewriting and Pattern manipulation or topic renaming, depending on what exactly they're trying to achieve. Of course there's no way for us to detect what an arbitrary user is trying to do in this case, so I don't see any path forward to making this case possible. No plans to file a followup ticket
(c) is difficult to detect, might be possible to work around, and probably very complicated to actually fix. Unfortunately, in this case you only get a run-time exception, since there's no way of knowing which topics will or will not be created ahead of time. And I'm thinking that determining whether two regexes will both match any possible string may be unsolvable...so, no followup ticket planned for this.
WDYT?
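For reference, the three remaining cases can be expressed as small predicates. This is only an illustrative sketch with hypothetical names, not how InternalTopologyBuilder detects them. Note that the check for (c) needs the list of topics that actually exist, which is the reason that case only surfaces at run time.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class SubscriptionConflictSketch {

    // (a) Two topic collections that share at least one topic but are not equal.
    static boolean overlappingButUnequal(Set<String> first, Set<String> second) {
        return !first.equals(second) && !Collections.disjoint(first, second);
    }

    // (b) A topic subscribed to by name that a pattern subscription also matches.
    static boolean topicMatchedByPattern(String topic, Pattern pattern) {
        return pattern.matcher(topic).matches();
    }

    // (c) Two patterns that both match at least one topic that actually exists.
    static boolean patternsCollide(Pattern first, Pattern second, List<String> existingTopics) {
        return existingTopics.stream()
            .anyMatch(t -> first.matcher(t).matches() && second.matcher(t).matches());
    }

    public static void main(String[] args) {
        Pattern topi = Pattern.compile("topi.*");   // glob-style topi* written as a regex (assumption)
        Pattern topic = Pattern.compile("topic.*"); // glob-style topic* written as a regex (assumption)

        System.out.println(overlappingButUnequal(Set.of("topicA"), Set.of("topicA", "topicB")));
        System.out.println(topicMatchedByPattern("topicA", topic));
        System.out.println(patternsCollide(topi, topic, List.of("topicA")));
    }
}
```

Cases (a) and (b) depend only on what the topology declares, while (c) depends on which topics exist when the patterns are evaluated.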
There was a problem hiding this comment.
Thank you for the list of issues. I agree on all points.
Thanks for the review @cadonna. This fix actually does work for I do think there's some possible followup work to further improve the situation (see my comment above), but I would say that it's different enough to merit creating separate followup tickets rather than subtasks of this one. Lmk what you think
There was a problem hiding this comment.
Could you please extract this part in its own method since we use it also in a couple of other tests?
Fair enough. Let's do it as you say.
f8b95c5 to cf32d69
// TODO we only merge source nodes if the subscribed topic(s) are an exact match, so it's still not
// possible to subscribe to topicA in one KStream and topicA + topicB in another. We could achieve
// this by splitting these source nodes into one topic per node and routing to the subscribed children
There was a problem hiding this comment.
Two unrelated flaky test failures:
Merged to trunk
Needed to fix this on the side in order to more easily set up some experiments, so here's the PR.
Allows a user to create multiple KStreams and/or KTables from the same topic, collection of topics, or pattern. At the moment this isn't possible since we can only consume from a topic once, and each source topic maps to a single source node in the topology. The "fix" is just to rewrite the logical plan and merge any duplicate source nodes into a single node before it gets compiled into the physical topology.
The one exception is when the stream/table are subscribed to an overlapping-but-unequal collection of topics, which I left as future work (with a TODO in the comments describing a possible solution). If the offset reset policy doesn't match we just throw a TopologyException.
edit: tables are much more complicated so I opted to restrict things to just multiple KStreams for now and consider allowing multiple KTables (or KStream+KTable) as followup work
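As a rough illustration of the merge step described above, here is a self-contained sketch in plain Java. It is not the code in this PR: `SourceNode`, `mergeDuplicates` and the string-valued reset policy are hypothetical stand-ins, and `IllegalStateException` stands in for TopologyException. It assumes that only exact topic-set matches are merged and that a mismatched offset reset policy is an error.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of merging duplicate source nodes in a logical plan.
final class SourceMergeSketch {

    // Stand-in for a logical-plan source node; not the Kafka Streams class.
    static final class SourceNode {
        final Set<String> topics;
        final String offsetReset; // e.g. "earliest", "latest", or null if unset
        final List<String> children = new ArrayList<>();

        SourceNode(Set<String> topics, String offsetReset, String child) {
            this.topics = topics;
            this.offsetReset = offsetReset;
            this.children.add(child);
        }
    }

    // Keeps the first node seen per exact topic set and re-parents the
    // children of every later duplicate onto it.
    static List<SourceNode> mergeDuplicates(List<SourceNode> sources) {
        Map<Set<String>, SourceNode> merged = new LinkedHashMap<>();
        for (SourceNode node : sources) {
            SourceNode existing = merged.putIfAbsent(node.topics, node);
            if (existing == null) {
                continue; // first node for this topic set
            }
            if (!Objects.equals(existing.offsetReset, node.offsetReset)) {
                // Stand-in for the TopologyException mentioned in the description.
                throw new IllegalStateException(
                    "Conflicting offset reset policies for " + node.topics);
            }
            existing.children.addAll(node.children);
        }
        return new ArrayList<>(merged.values());
    }
}
```

In this sketch, two nodes subscribed to `topicA` collapse into one node with both children, while a node subscribed to `topicA` plus `topicB` stays separate, which mirrors the overlapping-but-unequal limitation left as future work.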