KAFKA-7301: Fix streams Scala join ambiguous overload #5502
guozhangwang merged 1 commit into apache:trunk
|
@joan38 Can we have some tests for this? Looks OK to me. |
|
It works for me, from my experience I never used any KTable
|
|
Good catch! |
Sorry if I'm being dense, but is it ok for Materialized to be implicit?
It seems to me that this would allow us to elide the parameter as long as there is a Materialized of the correct generic type in context, right?
But the Materialized builders aren't freely interchangeable just based on their generic types. In particular, if they have a name, then it would be incorrect to give the same name to every store in the topology. At first glance this seems like more room for error than I'm comfortable with for implicits.
|
You're not being dense.
The default Materialized provided as an implicit doesn't specify any name or other configuration for the store. Therefore it would behave the same as if we didn't give one.
In fact, we already do this for aggregate and other transformations.
|
Ok, I agree that's safe. I just had another thought reviewing this, though...
It's not immediately obvious that this would be the case, but join results are not materialized by default. Each side of the join must maintain the state of its stream, but since the final result can always be computed from the left and right stores, we don't need to materialize the join result.
So if I happen to have this implicit in scope, then all the joins in my topology will become materialized. Obviously, this will have some impact on resource usage.
I think that I actually have no way to specify that I don't want a join to be materialized in the current API.
Note that most of the other KTable operators are similar in behavior. For example: table.filter.filter.mapValues.toStream with no materializations actually only needs one state store (for the original table), but if we implicitly materialized them all, you'd wind up storing the data 4 times.
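The concern about an implicit in scope can be sketched with stand-in types. This is only an illustration: `Materialized`, `Table`, and the way stores are recorded below are hypothetical simplifications, not the Kafka Streams classes, and the `join` signature is an assumed shape of the implicit variant discussed in this thread.

```scala
// Hypothetical stand-ins for illustration; these are not the Kafka Streams classes.
final case class Materialized(storeName: Option[String])

final class Table(val stores: List[String]) {
  // Assumed shape: join takes its Materialized from implicit scope.
  def join(other: Table)(joiner: (Int, Int) => Int)(implicit m: Materialized): Table =
    new Table(stores ++ other.stores ++ m.storeName.toList)
}

object ImplicitScopeDemo {
  def build(): Table = {
    // One named Materialized in scope is silently passed to every join below.
    implicit val named: Materialized = Materialized(Some("my-store"))
    val a = new Table(Nil)
    val b = new Table(Nil)
    val c = new Table(Nil)
    a.join(b)(_ + _).join(c)(_ + _)
  }
}
```

In this toy model both joins pick up the same named store without the caller writing anything at the call site, which is the kind of accidental materialization (and name reuse) raised above.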
|
Now, arguably this is a bug in the Java code, since AFAICT it's pointless to materialize the join if there's no queryable name.
Looking in KTableImpl, the join does this:
// only materialize if specified in Materialized
if (materializedInternal != null) {
    kTableJoinNodeBuilder.withMaterializedInternal(materializedInternal);
}
But the filter, for example, does this:
// only materialize if the state store is queryable
final boolean shouldMaterialize = materializedInternal != null && materializedInternal.isQueryable();
Maybe we should go ahead and fix this logic, and then the implicit would be fine.
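A minimal model of the two checks quoted from KTableImpl may help show where they diverge. Everything here is a hypothetical simplification: `MaterializedInternal` is a stand-in case class, and treating "queryable" as "has a name" is an assumption made for the sketch.

```scala
// Hypothetical model of the two quoted checks; not the Kafka Streams classes.
final case class MaterializedInternal(queryableName: Option[String]) {
  // Assumption for this sketch: a store is queryable exactly when it has a name.
  def isQueryable: Boolean = queryableName.isDefined
}

object MaterializeChecks {
  // Mirrors the quoted join check: any non-null Materialized triggers materialization.
  def joinMaterializes(m: MaterializedInternal): Boolean = m != null

  // Mirrors the quoted filter check: the store must also be queryable.
  def filterMaterializes(m: MaterializedInternal): Boolean = m != null && m.isQueryable

  val anonymous: MaterializedInternal = MaterializedInternal(None)
  val named: MaterializedInternal = MaterializedInternal(Some("my-store"))
}
```

With an anonymous instance, `joinMaterializes` is true while `filterMaterializes` is false, which is the inconsistency that would make an implicit default costly for joins only.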
But I'm still not sure that it's great to require Materialized as an argument, effectively requiring the implicit as well, or a bunch of dummy arguments in my topology, just so they can be ignored later.
It's a bummer, because needing to support two variants means we can't do the currying approach at all, which means we can't benefit from the improved type inference.
Maybe the way out is just to have join and materializedJoin (and innerJoin and materializedInnerJoin, etc.). I don't know...
|
@guozhangwang WDYT? Is there some benefit to always requiring (possibly anonymous) Materialized on joins?
I could see it being advantageous to make the serdes available for possible optimizations, but I don't know if this is the right mechanism for that anyway. I.e., you have to pass in something called "materialized", but it actually doesn't materialize anything; just registers serdes -- this seems confusing.
|
Indeed, if you bring in scope a Materialized that has a state store name configured, then it will be storing the data.
WDYT about a version like this:
def join[VO, VR](other: KTable[K, VO])(joiner: (V, VO) => VR)
def join[VO, VR](other: KTable[K, VO], materialized: Materialized[K, VR, ByteArrayKeyValueStore])(joiner: (V, VO) => VR)
|
Also note that the aggregators (like in KGroupedTable) take an implicit Materialized, but in those cases, materialization is actually required, even if it's anonymous.
|
@vvcephei The only places where we've expected an implicit materialized: Materialized are in count()s and in aggregate()s (where it actually makes sense to require them implicitly), so I don't see why in your example table.filter.filter.mapValues.toStream it would materialize 4 times?
|
Sorry, my comment was ambiguous. I was just saying that such would be the result if we make materialized implicit across the board in KTable.
|
And to @mowczare 's point, I think we can add some "api shape" tests. That is, we don't really need to test any of the behavioral aspects of KStreams, but we should write tests that prove the API has the right shape when it's actually used. For example, we can create a set of mock implementations for the Java API that just send you to the correct next type (like Then, we should be able to initialize the scala facades to wrap the mocks and proceed to write any topologies we want. I think there's not even any need to assert any conditions. We'd just be checking that the desired source code does indeed compile. What do you think? |
|
You are right, we can mock the Java API, but if we go with static tests (compile time only) then there is no need for a mock. |
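A compile-time-only "API shape" check could look roughly like the sketch below. The `Table` facade is a hypothetical stand-in rather than the real Scala wrapper; the point is only that the method body compiles if and only if `join` accepts this call shape and infers the expected result type.

```scala
// Hypothetical stand-in facade; not the Kafka Streams Scala wrapper.
final class Table[K, V] {
  def join[VO, VR](other: Table[K, VO])(joiner: (V, VO) => VR): Table[K, VR] =
    new Table[K, VR]
}

object ApiShapeCheck {
  // No behavioral assertion is needed: this compiles only if join accepts a
  // curried joiner and infers Table[String, Long] from the lambda (Int + Long).
  def joinShape(left: Table[String, Int], right: Table[String, Long]): Table[String, Long] =
    left.join(right)((l, r) => l + r)
}
```

If a later change made the overloads ambiguous again, a check of this kind would fail at compile time without running any topology.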
|
Ok! Either way; I was just throwing the idea out there. |
|
This option looks good to me too. Since |
|
@vvcephei That's always welcome 😄 I just pushed a WIP of test implementation. Tomorrow I will finish the test for |
905ba28 to b0f8bd6
|
Please review the added tests. |
|
Jenkins fails due to a missing license: |
e4920eb to c289d39
Is it good to put my name here?
|
I'm pretty sure this is binary compatible.
|
Flaky API tests. Retest this, please. |
|
From @vvcephei's great comment #5502 (comment), it sounds like with this change we may unintentionally start to persist in a state store if it turns out that we have an implicit
So I'm thinking about the following change instead:
def join[VO, VR](other: KTable[K, VO])(joiner: (V, VO) => VR)
def join[VO, VR](other: KTable[K, VO], materialized: Materialized[K, VR, ByteArrayKeyValueStore])(joiner: (V, VO) => VR)
The drawback is that it introduces a small reordering of parameters compared to the Java API, but I think it's not too disturbing and at least it's semantically correct. WDYT? |
|
@joan38 In the original implementation we did not have the implicit (https://github.com/lightbend/kafka-streams-scala/blob/develop/src/main/scala/com/lightbend/kafka/scala/streams/KTableS.scala#L47-L53). The API was something like what you have here. |
|
@debasishg Indeed, I don't think the implicit is a good idea now, given @vvcephei's comments.
def join[VO, VR](other: KTable[K, VO])(joiner: (V, VO) => VR)
def join[VO, VR](other: KTable[K, VO], materialized: Materialized[K, VR, ByteArrayKeyValueStore])(joiner: (V, VO) => VR)
No implicit, just overloading, but with the trade-off of reordering the parameters a bit. |
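To see why moving `materialized` into the first parameter list helps, here is a small self-contained sketch of the two proposed overloads. `Table` and `Materialized` are hypothetical stand-ins (a map-backed table and a two-parameter `Materialized`), not the Kafka Streams types. As the PR description notes, the original overloads shared the same first curried parameter list, which made calls ambiguous; here the first lists differ in arity.

```scala
// Hypothetical stand-ins for illustration; these are not the Kafka Streams classes.
final case class Materialized[K, V](storeName: String)

final class Table[K, V](val rows: Map[K, V], val storeName: Option[String]) {
  // Overload 1: one argument in the first parameter list, no materialization.
  def join[VO, VR](other: Table[K, VO])(joiner: (V, VO) => VR): Table[K, VR] = {
    val joined: Map[K, VR] = rows.collect {
      case (k, v) if other.rows.contains(k) => k -> joiner(v, other.rows(k))
    }
    new Table[K, VR](joined, None)
  }

  // Overload 2: two arguments in the first parameter list, so the overload can
  // be selected before the compiler reaches the joiner.
  def join[VO, VR](other: Table[K, VO], materialized: Materialized[K, VR])(
      joiner: (V, VO) => VR): Table[K, VR] =
    new Table[K, VR](join[VO, VR](other)(joiner).rows, Some(materialized.storeName))
}

object JoinOverloadDemo {
  val left = new Table[String, Int](Map("a" -> 1, "b" -> 2), None)
  val right = new Table[String, Int](Map("a" -> 10), None)

  // The joiner's parameter types are inferred from the first parameter list.
  val plain = left.join(right)((l, r) => l + r)
  val stored = left.join(right, Materialized[String, Int]("joined-store"))((l, r) => l + r)
}
```

In this sketch only `stored` carries a store name, so materialization happens only when the caller asks for it explicitly, at the cost of the parameter reordering mentioned above.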
|
@joan38 I agree with the type inference thingy. Previously we did it for compatibility with the Java APIs. But I think we already had this discussion of making a curried argument for better type inference. I was just pointing to the |
|
Part of the goal with the multiple arg lists is to make the type system correctly infer the type parameters, right? If I understand this properly,
Is that right? Also, because of the structure of the Java This might be an ergonomic improvement anyway. What are your thoughts? |
|
@vvcephei the other solution of
val materialized = Materialized.as("my-name").withKeySerde(someSerde).withValueSerde(someSerde)
join(otherStream, materialized)((a, b) => a + b)
I don't really see the point of inferring the result type. |
I'm not sure of the implications of adding these copyright lines, or whether it's proper to remove the prior lines. Maybe there's an armchair OS lawyer out there who can comment.
|
Hmm, that one was not intended; I will revert it.
|
Ok so the first commit is a proposed fix for joins. Now I'm thinking that the implicit |
Do you guys understand why this is 3 instead of 2?
The window should ditch the first event no?
|
It turns out that advanceWallClockTime is not enough; I need to change the time of the events.
6e18c3c to fc3d0f1
|
Doing more testing I discovered that |
|
Hey @joan38, I hope this isn't too discouraging, but since your prior PR is already released, we need to do a fresh KIP to update the API again. We're skirting that for the original scope of this PR, since it's actually impossible to use those methods in their current form, so we know that no one is. However, if I understand correctly, Actually, now that I'm thinking about it, it might be nice to also just create a new PR for I do very much appreciate these tests and improvements; I'm just trying to smooth the way for them to get merged. Thanks! |
|
Makes perfect sense. |
|
Thanks! I'll take another look tomorrow. |
vvcephei
left a comment
@guozhangwang This LGTM now. I guess we also need to cherry pick it to 2.0.
Join in the Scala streams API is currently unusable in 2.0.0, as reported by @mowczare. This is due to an overload of it with the same signature in the first curried parameter. See the compiler issue that didn't catch it: https://issues.scala-lang.org/browse/SI-2628
Reviewers: Debasish Ghosh <dghosh@acm.org>, Guozhang Wang <guozhang@confluent.io>, John Roesler <john@confluent.io>
|
Cherry-picked to 2.0 as well, with some minor fixes to resolve conflicts. |
|
@vvcephei @guozhangwang Could you guys give me KIP creation access? |
|
@joan38 I've granted you the access. |
|
Thanks @guozhangwang |
join in the Scala streams API is currently unusable in 2.0.0, as reported by @mowczare: #5019 (comment)
This is due to an overload of it with the same signature in the first curried parameter.
See the compiler issue that didn't catch it: https://issues.scala-lang.org/browse/SI-2628
I don't see many options here:
join with the Materialized as implicit, like we did with aggregate... (what this PR currently implements)
join to something like joinMat.
join and losing the type inference.
I will add all the needed tests once we agree on an option.
The current workarounds are:
@debasishg
@ijuma
@guozhangwang
@mowczare
Thanks