KAFKA-4217: Add KStream.flatTransform#5273
Conversation
- Adds flatTransform method in KStream - Adds processor supplier and processor for flatTransform
cadonna
left a comment
Should I remove all references to context#forward in the API documentation of transform or flatTransform? What about the references to context#forward() in Transformer#init() and Transformer#close()? Are there use cases for those calls to forward()?
| public <K1, V1> KStream<K1, V1> transform(final TransformerSupplier<? super K, ? super V, KeyValue<K1, V1>> transformerSupplier, | ||
| final String... stateStoreNames) { | ||
| Objects.requireNonNull(transformerSupplier, "transformerSupplier can't be null"); | ||
| private <K1, V1> KStream<K1, V1> doTransform(final ProcessorSupplier processorSupplierForGraph, |
Do we really need two distinct objects for the processor supplier? This is an intermediate solution, isn't it?
\cc @bbejeck Can you help out here? I am not sure atm. This was changed recently.
@guozhangwang This code and comment is outdated. The method doTransform() takes one supplier and is not modified by this PR any longer.
mjsax
left a comment
Thanks for the PR. A couple of comments and questions.
| * @see #flatMapValues(ValueMapper) | ||
| * @see #flatMapValues(ValueMapperWithKey) | ||
| * @see #transform(TransformerSupplier, String...) | ||
| * @see #flatTransform(TransformerSupplier, String...) |
Why do you remove transform? We only add a new flatTransform but transform is not removed.
I didn't want to make the list of referenced methods too long, so I thought map could reference transform and flatMap could reference flatTransform. But I do not have hard feelings about it.
Fair point. Not sure either :) So far, we list a lot of methods. Do you think we should do a general cleanup to reduce the listed methods? In the past, we only added but never removed.
I'd be in favor of either just adding the new method to the list (without removing another one) or deleting this whole list.
Personally, I feel the list is a little redundant with this interface itself.
I am fine with both options, but I am in favor of deleting the list. Out of curiosity, what is the advantage of just adding new methods to the list without removing others?
Unless @mjsax objects, I vote to just delete these lists. At this point, it almost looks like it's directing you to all the other methods in this interface, which seems redundant.
I'm not sure I follow your last question. The list exists to direct readers to other relevant methods. I'm not sure why adding flatTransform renders transform irrelevant...
I think, it's overall useful to use @see to guide users -- however, I agree that it's getting a very long list. This seems to be a general issue though. Maybe, we should do a follow up PR, and go over all JavaDocs and clean them up? We should define a "strategy/guideline" how we cross reference. For example, does it make sense to link to both flatMapValues -- I guess not.
WDYT @vvcephei ?
Yeah, I'm fine with defining and applying a coherent strategy.
In lieu of that, I guess the default thing to do here would be to just add the new method to the list without removing any other items.
I re-added the links. @vvcephei My question was about just adding links vs. reorganizing links. I think my question was answered by the decision taken.
| * Furthermore, via {@link org.apache.kafka.streams.processor.Punctuator#punctuate(long)} the processing progress | ||
| * can be observed and additional periodic actions can be performed. | ||
| * Although calling {@link ProcessorContext#forward(K, V) forward()} in {@link Transformer#transform} | ||
| * allows to emit multiple records for each input record, it is discouraged to do so since output records |
emit multiple records -- this seems to contradict the change from L491. Can we be more consistent?
Reviewing this, I am also realizing that context.forward() is still the only way to emit data with a punctuation callback. Thus, the question is, what should we describe in the JavaDocs and what not?
It might be better, to not mention context.forward() at all, but only describe it in Punctuator instead?
| * {@link #flatTransform(TransformerSupplier, String...) flatTransform()} is safer because it throws a compile error | ||
| * if the types of the output records are not compatible with the output stream. | ||
| * Calling {@link ProcessorContext#forward(K, V)} in {@link Transformer#transform} may be disallowed in future | ||
| * releases. |
I don't think we can do this. Also, I would only mention it, when we start to deprecate an API.
| * @see #transformValues(ValueTransformerWithKeySupplier, String...) | ||
| * @see #process(ProcessorSupplier, String...) | ||
| */ | ||
| <K1, V1> KStream<K1, V1> flatTransform(final TransformerSupplier<? super K, ? super V, ? extends Iterable<? extends KeyValue<? extends K1, ? extends V1>>> transformerSupplier, |
Why ? extends Iterable<? extends KeyValue<? extends K1, ? extends V1>> and not just Iterable<KeyValue<K1, V1>> ?
transform also has return type KeyValue<K1, V1> but not ? extends KeyValue<? extends K1, ? extends V1>
I have to admit that I copied this from a comment in the corresponding Jira issue. I am not (yet) a Java Generics expert, but I think I -- at least partly -- understand the rationale behind it.
With Iterable<KeyValue<K1,V1>> the API forces the Transformer to declare Iterable as the return type of its transform method. With ? extends Iterable<KeyValue<K1,V1>>, the Transformer may declare any type that implements Iterable (e.g. List) as the return type of its transform method. So it enlarges the set of valid Transformer declarations. Is it better to have a straight Iterable or allow more return types? I do not know. Maybe some of the more expert Kafka Streams developers and users can answer this question.
Regarding <K1, V1> vs. <? extends K1, ? extends V1>, I guess <K1, V1> makes more sense, since K1 and V1 are inferred by the compiler from the actual arguments which would always be the key type and the value type in the return types of the transform method of the given Transformer. Please correct me, if I misunderstood Java Generics here.
For the case KeyValue<...> vs. ? extends KeyValue<...>, I am too much of a newbie, to know what benefits it may have to extend KeyValue. Please help me.
Btw, flatMap is declared as follows:
<KR, VR> KStream<KR, VR> flatMap(final KeyValueMapper<? super K, ? super V, ? extends Iterable<? extends KeyValue<? extends KR, ? extends VR>>> mapper);
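To make the variance discussion above concrete, here is a small self-contained sketch. The Transformer interface below is a hypothetical stand-in (not Kafka's real one); it only illustrates that the `? extends Iterable<...>` bound is what allows a transformer to declare a more specific return type such as `List`:

```java
import java.util.Arrays;
import java.util.List;

public class VarianceDemo {
    // Hypothetical stand-in for Kafka's Transformer, reduced to one method.
    interface Transformer<K, V, R> {
        R transform(K key, V value);
    }

    // The "? extends Iterable<...>" bound accepts any transformer whose
    // declared return type is a subtype of Iterable<String>, e.g. List<String>.
    static <K, V> int countResults(final Transformer<K, V, ? extends Iterable<String>> t,
                                   final K key, final V value) {
        int count = 0;
        for (final String ignored : t.transform(key, value)) {
            count++;
        }
        return count;
    }

    public static int demo() {
        // This transformer declares List<String>, not Iterable<String>, as its
        // return type; it is only accepted because of the wildcard bound above.
        final Transformer<Integer, Integer, List<String>> listReturning =
            (k, v) -> Arrays.asList(k + ":" + v, v + ":" + k);
        return countResults(listReturning, 1, 2);
    }
}
```

Without the `? extends` on the third type argument of `countResults`, the `listReturning` transformer would not be an acceptable argument, because `Transformer<..., List<String>>` is not a `Transformer<..., Iterable<String>>`.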
Generics confuse me regularly, too... I was just comparing to transform -- seems our API is not consistent. I guess your argument makes sense.
Regarding <K1, V1> vs. <? extends K1, ? extends V1>, I changed my mind. I guess also here <? extends K1, ? extends V1> makes sense because a user could also give K1 and V1 explicitly.
I hope you don't mind if I jump in...
It's probably worth doing some experiments to get the bounds just right. Here are the basic properties we need:
Let's say we have a simple class hierarchy:
interface Animal
interface Cat extends Animal
interface Dog extends Animal
And let's say we subclass KeyValue: public class KV extends KeyValue.
Given a KStream<String, Cat>, we want to be able to call flatTransform with a TransformerSupplier<String, Animal, List<KeyValue<String, Animal>>> like this one:
(String s, Animal a) -> asList(new KV(s, new Cat()), new KeyValue(s, new Dog())).
The ? super K and ? super V ensure that any transformer that can handle a K and V is permitted. You can handle an instance if you can assign it to your parameter types (so your parameter type is a supertype of the instance type). You need to specify this as a bound because the compiler doesn't generally let me assign a supertype to a subtype (I can't write Cat c = (Animal) a;).
I don't think you actually need the ? extends Whatever parts to get the right outcome. If anything, you might need it on the stuff inside the iterable to make sure that you can just pass in heterogeneous members, and they'll get "rounded" to their lowest common superclass.
I think this works without extends because you generally can assign a subtype to a supertype.
So maybe try an example like this without the extends stuff and see if the compiler chokes on it or not.
IIRC, Java doesn't fully implement variance, but this is the basic concept (in scala) of what's going on with those extends return types: https://docs.scala-lang.org/tour/variances.html
Ok, I stand corrected. I did the experiment I was suggesting, and I got the desired results only with the API you have submitted:
<K1, V1> KStreamImpl<K1, V1> flatTransform(final TransformerSupplier<? super K, ? super V, ? extends Iterable<? extends KeyValue<? extends K1, ? extends V1>>> transformerSupplier,
                                           final String... storeNames) {
    return null;
}

public static class KV<K, V> extends KeyValue<K, V> {
    public KV(final K key, final V value) {
        super(key, value);
    }
}

public static void main(String[] args) {
    final KStreamImpl<Integer, Long> stream = new KStreamImpl<>(null, null, null, false, null);

    // exact transformer
    final KStreamImpl<Integer, Long> stream2 = stream.flatTransform(
        new TransformerSupplier<Integer, Long, Iterable<KeyValue<Integer, Long>>>() {
            @Override
            public Transformer<Integer, Long, Iterable<KeyValue<Integer, Long>>> get() {
                return new Transformer<Integer, Long, Iterable<KeyValue<Integer, Long>>>() {
                    @Override
                    public void init(final ProcessorContext context) {}

                    @Override
                    public Iterable<KeyValue<Integer, Long>> transform(final Integer key, final Long value) {
                        return Arrays.asList(new KV<>(key, value), new KeyValue<>(key, value));
                    }

                    @Override
                    public void close() {}
                };
            }
        }
    );

    // transformer that takes superclass k/v and returns exact results
    final KStreamImpl<Integer, Long> stream3 = stream.flatTransform(
        new TransformerSupplier<Number, Number, Iterable<KeyValue<Integer, Long>>>() {
            @Override
            public Transformer<Number, Number, Iterable<KeyValue<Integer, Long>>> get() {
                return new Transformer<Number, Number, Iterable<KeyValue<Integer, Long>>>() {
                    @Override
                    public void init(final ProcessorContext context) {}

                    @Override
                    public Iterable<KeyValue<Integer, Long>> transform(final Number key, final Number value) {
                        return Arrays.asList(new KV<>(key.intValue(), value.longValue()), new KeyValue<>(1, 3L));
                    }

                    @Override
                    public void close() {}
                };
            }
        }
    );

    // transformer that takes exact parameters and returns subclass results
    final KStreamImpl<Number, Number> stream4 = stream.flatTransform(
        new TransformerSupplier<Integer, Long, Iterable<KeyValue<Integer, Long>>>() {
            @Override
            public Transformer<Integer, Long, Iterable<KeyValue<Integer, Long>>> get() {
                return new Transformer<Integer, Long, Iterable<KeyValue<Integer, Long>>>() {
                    @Override
                    public void init(final ProcessorContext context) {}

                    @Override
                    public Iterable<KeyValue<Integer, Long>> transform(final Integer key, final Long value) {
                        return Arrays.asList(new KV<>(key, value), new KeyValue<>(1, 3L));
                    }

                    @Override
                    public void close() {}
                };
            }
        }
    );

    // transformer that takes superclass parameters and returns subclass results
    final KStreamImpl<Number, Number> stream5 = stream.flatTransform(
        new TransformerSupplier<Number, Number, Iterable<KeyValue<Integer, Long>>>() {
            @Override
            public Transformer<Number, Number, Iterable<KeyValue<Integer, Long>>> get() {
                return new Transformer<Number, Number, Iterable<KeyValue<Integer, Long>>>() {
                    @Override
                    public void init(final ProcessorContext context) {}

                    @Override
                    public Iterable<KeyValue<Integer, Long>> transform(final Number key, final Number value) {
                        return Arrays.asList(new KV<>(key.intValue(), value.longValue()), new KeyValue<>(1, 3L));
                    }

                    @Override
                    public void close() {}
                };
            }
        }
    );
}

If you take out any of the ? extends, it won't compile.
| private final TransformerSupplier<? super K, ? super V, ? extends Iterable<? extends KeyValue<? extends K1, ? extends V1>>> transformerSupplier; | ||
| public KStreamFlatTransform(TransformerSupplier<? super K, ? super V, ? extends Iterable<? extends KeyValue<? extends K1, ? extends V1>>> transformerSupplier) { |
nit: add final
In Kafka Streams, we apply final to all variables/parameters etc. whenever possible. This applies to multiple places. Please add it everywhere.
| inputKey = 1; | ||
| inputValue = 10; | ||
| Transformer<Number, Number, Iterable<KeyValue<Integer, Integer>>> tempTransformer = mock(Transformer.class); | ||
| transformer = tempTransformer; |
tempTransformer can be removed? assign to transformer directly?
| @Test | ||
| public void shouldInitialiseFlatTransformProcessor() { | ||
| transformer.init(context); | ||
| replayAll(); |
I am always unsure: don't we need to call "expectLastCall()" or similar?
No, it suffices to write it as I did in this case. You would use expectLastCall(), amongst other use cases, to check whether a method is called multiple times without repeating the same line. See this example in the EasyMock documentation.
| EasyMock.reset(transformer); | ||
| EasyMock.expect(transformer.transform(inputKey, inputValue)) | ||
| .andReturn(Collections.<KeyValue<Integer, Integer>>emptyList()); |
nit: fix indentation (should be +4 spaces only)
| verifyAll(); | ||
| } | ||
| @Test(expected = NullPointerException.class) |
Is this safe? The test would pass if the NPE is thrown from somewhere unexpected... I would prefer to keep the scope where the NPE is expected as tight as possible. We often use:
try {
    doSomethingAndExpectNPE();
    fail("Should have thrown");
} catch (final NullPointerException expected) { }
Not sure if we can rewrite it this way. Doesn't EasyMock also allow to expect an exception (as another alternative)?
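For illustration, here is a self-contained sketch of the tight-scope pattern suggested above. The flatTransform method below is a hypothetical stand-in that only performs the null check from the PR, and the helper returns a boolean instead of using a test framework's fail():

```java
import java.util.Objects;

public class NpeScopeDemo {
    // Hypothetical stand-in for the method under test: rejects a null supplier,
    // mirroring the Objects.requireNonNull check in the PR.
    static void flatTransform(final Object transformerSupplier) {
        Objects.requireNonNull(transformerSupplier, "transformerSupplier can't be null");
    }

    // Returns true only if the call itself throws the NPE. Unlike
    // @Test(expected = NullPointerException.class), an NPE thrown from
    // anywhere else in the test body would not be silently accepted.
    static boolean throwsNpeOnNullSupplier() {
        try {
            flatTransform(null);
            return false; // in a real test: fail("Should have thrown");
        } catch (final NullPointerException expected) {
            return true;
        }
    }
}
```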
@cadonna What is the status of this PR? Can you address the open comment, and also rebase this PR to resolve the conflicts? Thx.
- added final to method parameter - unified Java Generics in transform and flatTransform for TransformerSupplier - removed references to context.forward in API documentation - added integration tests for transform and flatTransform - improved unit test to check NullPointerException for transform and flatTransform
vvcephei
left a comment
Hey @cadonna ,
Thanks for the PR! This looks good overall; I just had a few remarks.
Do you think it would be worthwhile to go ahead and replace all the other operations that can be expressed in terms of FlatTransform? For example, do we need the implementation of KStreamTransform at all, or can we just convert the supplied Transformer to one that returns a singleton of the value?
Not sure about whether that would actually save some complexity or not.
| @Override | ||
| public void close() {} | ||
| }); | ||
| .transform(new TransformerSupplier<Windowed<String>, Long, KeyValue<Object, Object>>() { |
Do we need to drop the SAM-converted function and go back to Supplier here, or is this just an artifact of testing? (If the latter, can you revert it?)
(also applies elsewhere)
Unfortunately, the code does not compile with the SAM-converted function. I have not yet understood the reason. The compiler error is:
method transform in interface KStream<K,V> cannot be applied to given types;
.transform(() -> new Transformer<Windowed<String>, Long, KeyValue<Object, Object>>() {
^
required: TransformerSupplier<? super Windowed<String>,? super Long,? extends KeyValue<? extends K1,? extends V1>>,String[]
found: ()->new Tr[...]{ } }
reason: cannot infer type-variable(s) K1,V1
(argument mismatch; bad return type in lambda expression
<anonymous Transformer<Windowed<String>,Long,KeyValue<Object,Object>>> cannot be converted to Transformer<Windowed<String>,Long,KeyValue<? extends K1,? extends V1>>)
where K1,V1,K,V are type-variables:
K1 extends Object declared in method <K1,V1>transform(TransformerSupplier<? super K,? super V,? extends KeyValue<? extends K1,? extends V1>>,String...)
V1 extends Object declared in method <K1,V1>transform(TransformerSupplier<? super K,? super V,? extends KeyValue<? extends K1,? extends V1>>,String...)
K extends Object declared in interface KStream
V extends Object declared in interface KStream
When I use the old generics signature for transform it compiles. I will take a deeper look at it.
Interesting.. is it because of the signature in KStream#transform for generics as <? super K, ? super V, ..> @vvcephei ?
Specifically, the culprit is ? extends K1 and ? extends V1. The type system won't go so far as to assume that if I return, say, Integer-valued results, then V1 must be Integer. Because I didn't say I would return a V1, just that whatever I return will extend V1, and I never said what V1 would actually be.
Unfortunately, it gets even less sensible from here... To solve that problem, it should be sufficient to fix V1 and K1, like this: .<Integer, Integer>transform(...). But that fails with a message that Transformer<Windowed<String>, Long, KeyValue<Integer, Integer>> can't be converted to Transformer<Windowed<String>, Long, KeyValue<? extends Integer, ? extends Integer>>.
I thought maybe it doesn't consider Integer to extend Integer, but .<Number, Number>transform(...) also doesn't work. I think what we're dealing with is just the insufficiency of Java's type system. I have heard several times that they deliberately stopped short of fully implementing the type system in exchange for performance, so there are some true type statements that nevertheless don't compile. Maybe this is one of them.
I think that we'd rather keep the ability to use lambda substitution for the supplier and give up the "maximum flexibility" on the return values for K1 and V1. Maybe you can just replace ? extends K1 with K1 (and the same for V1) and re-attempt the experiments I went through before so we know what we're giving up?
That sounds reasonable to me, especially given that the only caller of the constructor is always passing in a
TransformerSupplier<? super K, ? super V, KeyValue<K1, V1>> transformerSupplier
anyways.
I did the experiments and the most flexible declaration that worked with the lambda is the following:
<K1, V1> KStream<K1, V1> transform(final TransformerSupplier<? super K, ? super V, ? extends KeyValue<K1, V1>> transformerSupplier,
final String... stateStoreNames);
With this declaration, the Transformer cannot return keys and values that are subclasses of the key type and the value type of the result stream. Thus, the examples with stream4 and stream5 in @vvcephei 's experiments above do not compile with this declaration.
If nobody objects, I will proceed with the declaration above.
Removing only the ? extends in front of KeyValue, instead of removing ? extends from K1 and V1, also does not compile when using lambdas.
FYI, I created a minimal example of this rather strange compiler behavior and posted a question on stackoverflow [1].
[1] https://stackoverflow.com/questions/52445222/java-using-lambda-impedes-inference-of-type-variable
That sounds good to me. Thanks for checking, @cadonna !
Unfortunately, I had to use the following signatures in the end:
<K1, V1> KStream<K1, V1> transform(final TransformerSupplier<? super K, ? super V, KeyValue<K1, V1>> transformerSupplier,
final String... stateStoreNames);
<K1, V1> KStream<K1, V1> flatTransform(final TransformerSupplier<? super K, ? super V, Iterable<KeyValue<K1, V1>>> transformerSupplier,
final String... stateStoreNames);
The reason is that I got similar compile errors as before with the following signature for flatTransform:
<K1, V1> KStream<K1, V1> flatTransform(final TransformerSupplier<? super K, ? super V, ? extends Iterable<? extends KeyValue<K1, V1>>> transformerSupplier,
final String... stateStoreNames);
I tried a couple of alternatives, but the only one that allowed me to define a TransformerSupplier with an anonymous class and a lambda expression was the one without any wildcards in the generics for the output of the operator. To be consistent between transform and flatTransform, I also removed the wildcards in the generics for the output of transform. Now, transform has again the same signature as on the current trunk.
Lessons learned:
- Lambdas and anonymous classes have different representations in Java byte code. See this answer to my stackoverflow question. I guess this is the reason for the differences in type variable inference.
- There would be ways to define a lambda for the more flexible signatures for transform and flatTransform, but they consist of casts and obscure generics declarations in the lambda (see my stackoverflow question). Additionally, it is a trial-and-error process to get the lambda right. Thus, it is something you do not want to expose your users to.
- In Java 8, there exists the java.util.function.Supplier functional interface which is really similar to Kafka's TransformerSupplier. With java.util.function.Supplier, the more flexible signatures for transform and flatTransform seem to compile with lambdas (I have not thoroughly tested it, though). The drawback here is that defining the transformer supplier by means of an anonymous class does not seem to work anymore with the more flexible signature (also here, I have not invested too much time to try it out).
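As a sketch of why the settled, wildcard-free output signature plays well with lambdas, here is a self-contained toy version. The KeyValue, Transformer, and TransformerSupplier below are minimal stand-ins, not Kafka's real classes, and flatTransform is reduced to a plain static method:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LambdaInferenceDemo {
    // Minimal stand-ins for Kafka's types, only for this sketch.
    static class KeyValue<K, V> {
        final K key;
        final V value;
        KeyValue(final K key, final V value) { this.key = key; this.value = value; }
    }

    interface Transformer<K, V, R> { R transform(K key, V value); }
    interface TransformerSupplier<K, V, R> { Transformer<K, V, R> get(); }

    // Mirrors the settled signature: no wildcards on the output type, so the
    // compiler can bind K1 and V1 when a lambda supplies the transformer.
    static <K, V, K1, V1> List<KeyValue<K1, V1>> flatTransform(
            final K key, final V value,
            final TransformerSupplier<? super K, ? super V, Iterable<KeyValue<K1, V1>>> supplier) {
        final List<KeyValue<K1, V1>> out = new ArrayList<>();
        for (final KeyValue<K1, V1> kv : supplier.get().transform(key, value)) {
            out.add(kv);
        }
        return out;
    }

    public static int demo() {
        // The supplier is a lambda returning a lambda; this compiles because
        // the output type carries no "? extends" wildcards.
        final TransformerSupplier<Integer, Long, Iterable<KeyValue<String, Long>>> supplier =
            () -> (k, v) -> Arrays.asList(new KeyValue<>("k" + k, v), new KeyValue<>("copy", v));
        return flatTransform(1, 2L, supplier).size();
    }
}
```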
Ah, well, I buy your reasoning, so I'm +1 for the interface you've settled on.
I doubt anyone really needs to subclass KeyValue anyhow.
| import org.apache.kafka.streams.processor.ProcessorSupplier; | ||
| public class KStreamTransform<K, V, K1, V1> implements ProcessorSupplier<K, V> { | ||
| public class KStreamTransform<Ki, Vi, Ko, Vo> implements ProcessorSupplier<Ki, Vi> { |
See my other comment about the actual name choice here. But also, I'm not sure if the increase of clarity here is worth the extra review time that including it in the diff incurs.
I think the increase in clarity is worth the (small) extra review time, especially for newbies like me that are not used to the common patterns in the code base you described in another comment.
| import org.apache.kafka.streams.processor.ProcessorContext; | ||
| import org.apache.kafka.streams.processor.ProcessorSupplier; | ||
| public class KStreamFlatTransform<Ki, Vi, Ko, Vo> implements ProcessorSupplier<Ki, Vi> { |
Ah, I guess this is "K-in" and "K-out"? I think a more common pattern in the code base is <K,V,KR,VR> (r=result) or <K,V,K1,V1> (not sure what the rationale there is ;) ).
That said, there's no reason generics should get to be alphabet soup when the project style prohibits alphabet soup variable names. Feel free to use (short) descriptive names for these types if you feel the "standard" ones are too opaque.
Even KIn and KOut would be better than replacing one arcane pattern with another.
I am going to use KIn and KOut. Thank you for the proposal.
| stream = builder.stream(streamOneInput, Consumed.with(Serdes.Integer(), Serdes.Integer())); | ||
| stream | ||
| .flatTransform(new TransformerSupplier<Integer, Integer, List<KeyValue<Integer, Integer>>>() { |
Nit: We typically use SAM conversion whenever possible just to keep the testing code compact.
See my other comment about SAM conversion.
- declared Java Generics in transform and flatTransform for
TransformerSupplier such that the TransformerSupplier can be
defined as lambda and anonymous class
- implemented transform as a special case of flatTransform
- used more readable names for type variables of inputs and
outputs in KStreamFlatTransform
- defined TransformerSupplier as lambda expressions in the tests
- reverted use of anonymous class instead of lambda expression in test
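The "implemented transform as a special case of flatTransform" item above can be sketched as follows. The types here are minimal stand-ins, not Kafka's real interfaces; the point is only that a single-record transformer can be adapted into one returning a singleton Iterable so that one processor implementation serves both operations:

```java
import java.util.Collections;

public class TransformAsFlatTransform {
    // Minimal stand-ins for Kafka's types, only for this sketch.
    static class KeyValue<K, V> {
        final K key;
        final V value;
        KeyValue(final K key, final V value) { this.key = key; this.value = value; }
    }

    interface Transformer<K, V, R> { R transform(K key, V value); }

    // Adapts a transformer that returns one KeyValue into one that returns a
    // singleton Iterable, so it can run on the flatTransform-style processor.
    static <K, V, K1, V1> Transformer<K, V, Iterable<KeyValue<K1, V1>>> asFlat(
            final Transformer<K, V, KeyValue<K1, V1>> single) {
        return (k, v) -> Collections.singletonList(single.transform(k, v));
    }

    public static int demo() {
        final Transformer<Integer, Integer, KeyValue<Integer, Integer>> swap =
            (k, v) -> new KeyValue<>(v, k);
        int count = 0;
        // The adapted transformer yields exactly one record per input record.
        for (final KeyValue<Integer, Integer> ignored : asFlat(swap).transform(1, 2)) {
            count++;
        }
        return count;
    }
}
```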
vvcephei
left a comment
Oh! it looks like I wrote a few comments and never submitted them...
| * returns zero or one output record. | ||
| * Thus, an input record {@code <K,V>} can be transformed into an output record {@code <K':V'>}. | ||
| * This is a stateful record-by-record operation (cf. {@link #map(KeyValueMapper)}). | ||
| * Furthermore, via {@link org.apache.kafka.streams.processor.Punctuator#punctuate(long)} the processing progress |
nit: needs a comma after the {@link ...}
@cadonna Can you rebase this PR to resolve the conflicts?
Recently, I started work on
Hi @cadonna , I think it would be better to have a new PR. Anything over a few hundred lines of code becomes a significant burden on reviewers, and reviewing a PR over 1k LOC is a major undertaking. The more you can break up code into isolated features for review, the better.
- adds missing comma in javadoc - re-adds links to transform in javadoc
| /** | ||
| * Transform each record of the input stream into zero or more records in the output stream (both key and value type | ||
| * Transform each record of the input stream into zero or one record in the output stream (both key and value type |
Technically, it's still possible to emit multiple key-value pairs via context.forward(), right? I guess this change is to highlight the difference to the new flatTransformValue() only?
I am wondering if we should still mention this (but point out that flatTransformValue is recommended for this case, because of type safety)? I think this might be worth adding, because context.forward may be used within punctuations.
Or is the intent to discourage the usage of context.forward completely and thus remove it from the JavaDocs? This seems to limit users in implementing custom operators... Note that process() does not return an output stream, and thus transform is quite critical for users to embed custom operators within the DSL.
You probably mean flatTransform and not flatTransformValue in your comment. I mention that to avoid misunderstandings in the following. How does transform differ from transformValues with respect to punctuations? I ask because in transformValues the transformer is not allowed to call context.forward. With this code the context in the transformer is wrapped with a context that throws an exception when context.forward is called. Does this mean a punctuation in transformValues is not allowed to emit any pair?
You probably mean flatTransform and not flatTransformValue in your comment.
Yes.
In transformValues it's not allowed to use context.forward() (neither in transform nor in a punctuation) because it would allow to change the key, violating the transformValues contract of not changing the key.
In transform it's ok to call context.forward() though as there is no such restriction.
| * this.context = context; | ||
| * this.state = context.getStateStore("myTransformState"); | ||
| * // punctuate each 1000ms; can access this.state | ||
| * // can emit as many new KeyValue pairs as required via this.context#forward() |
bump... should this comment be restored?
I added a note in the javadocs about the recommended way to emit multiple output records.
| * Within the {@link Transformer}, the state is obtained via the {@link ProcessorContext}. | ||
| * To trigger periodic actions via {@link org.apache.kafka.streams.processor.Punctuator#punctuate(long) punctuate()}, | ||
| * a schedule must be registered. | ||
| * |
nit: missing <p> for new paragraph
Looking at the JavaDocs of transform, I guess there is one blank line too many rather than a missing <p>. I deleted the blank lines around this line.
| * this.context = context; | ||
| * this.state = context.getStateStore("myTransformState"); | ||
| * // punctuate each 1000ms; can access this.state | ||
| * context.schedule(1000, PunctuationType.WALL_CLOCK_TIME, new Punctuator(..)); |
We updated this API in the 2.1 release to take Duration instead of long -- compare https://cwiki.apache.org/confluence/display/KAFKA/KIP-358%3A+Migrate+Streams+API+to+Duration+instead+of+long+ms+times
This raises the question of whether we forgot to update some JavaDocs... Would you mind double-checking? If yes, it would be cool if you could do a separate PR to update the JavaDocs (this allows us to cherry-pick it into the 2.1 branch).
| if (pair != null) | ||
| public void process(final KIn key, final VIn value) { | ||
| final Iterable<KeyValue<KOut, VOut>> pairs = transformer.transform(key, value); | ||
| Objects.requireNonNull(pairs, "result of transform can't be null"); |
I am wondering, if we need to fail here? We could actually allow null, too and just skip the for loop for this case?
I personally do not like to use null to represent an empty result. With KStream#transform(...), we do currently use null to represent empty results although with Java 8, we could probably use java.util.Optional. With KStream#flatTransform(...), we use an empty list to represent an empty result. However, for consistency with KStream#transform(...), I now also added your proposal and skip the loop, if the transformer returns null.
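A minimal sketch of the null-tolerant process() behavior described above, with forwarded records collected into a list instead of calling the real context.forward (all names here are illustrative stand-ins, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class NullTolerantProcessDemo {
    // Stand-in sink for records that would go through context.forward().
    static final List<String> FORWARDED = new ArrayList<>();

    // A null result from the transformer is treated like an empty result,
    // consistent with how transform handles null on trunk; a non-null result
    // is forwarded record by record.
    static void process(final Iterable<String> pairs) {
        if (pairs != null) {
            for (final String pair : pairs) {
                FORWARDED.add(pair); // stands in for context.forward(...)
            }
        }
    }
}
```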
FWIW, I share your dissatisfaction with all the nullability in our API. It would be nice to migrate away from it over time.
Not important for this PR, but just to share the thought: as opposed to an optional return type, we could have coalesced transform and flatTransform into one API by making the return type void and forcing implementers to use context.forward. Maybe this would have been better, since it would remove any ambiguity about how to use the API.
I think for that to work really well, we'd need to have a tighter control on the forwarded types (so we can still tell whether or not the key is going to change, and so we can properly type the down-stream). It's something I've been wanting to do anyway, to reduce casting. So it might be something to consider in the future.
There was a problem hiding this comment.
> making the return type `void` and forcing implementers to use `context.forward`
I don't think that is a good idea. Note that context.forward is not type safe by design. Adding return types allows us to offer a type safe way to return output records to Kafka Streams runtime.
There was a problem hiding this comment.
Yep, that's what I was referring to in the last paragraph. I think we could offer a different context API that offers the type safety we need, but it's speculative. I haven't actually tried. My comment wasn't for this PR at all, just thinking about where we could go in the future within the context of this change.
Note that what you point out about the type-danger of context.forward is applicable to transform today. We encourage users to use forward within their transformers, but doing so completely circumvents the type system.
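To make the type-safety point above concrete, here is a minimal plain-Java sketch (the `UntypedContext` class is a hypothetical stand-in, not the actual Kafka Streams `ProcessorContext`): a `forward()` that accepts raw `Object`s compiles no matter what types the transformer emits, whereas a typed return value would be checked by the compiler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in to illustrate why context.forward() circumvents
// the type system while a typed return value does not.
public class ForwardTypeSafetyDemo {

    // Like ProcessorContext#forward, this accepts raw Objects: the
    // compiler cannot check what the transformer actually emits.
    static class UntypedContext {
        final List<Object[]> forwarded = new ArrayList<>();

        void forward(final Object key, final Object value) {
            forwarded.add(new Object[]{key, value});
        }
    }

    public static void main(final String[] args) {
        final UntypedContext context = new UntypedContext();
        // Compiles fine even if downstream expects <String, Integer>:
        context.forward("key", "value-of-wrong-type");
        context.forward(42, 3.14);
        System.out.println(context.forwarded.size()); // 2 records, types unchecked
    }
}
```

A context API with tighter control on forwarded types, as mentioned above, would turn such mismatches into compile errors instead of runtime `ClassCastException`s downstream.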
|
|
||
| @ClassRule | ||
| public static final EmbeddedKafkaCluster CLUSTER = | ||
| new EmbeddedKafkaCluster(NUM_BROKERS); |
There was a problem hiding this comment.
I am wondering if we could do this test using TopologyTestDriver instead? I.e., what is the exact purpose of the test?
There was a problem hiding this comment.
The purpose of this test is to test the integration of all components that are involved in the calls to flatTransform and transform.
There was a problem hiding this comment.
Sure, but why would TopologyTestDriver not provide this (i.e., what do you consider "all components" needed to verify the correct behavior)? I don't think we need a broker for this, just a "Task" that the test driver provides. Thoughts?
There was a problem hiding this comment.
I agree that the test can also be written using TopologyTestDriver instead of a broker. I re-wrote the test accordingly.
| private WindowStore<String, String> store; | ||
| private int numRecordsProcessed; | ||
|
|
||
| .transform(new TransformerSupplier<String, String, KeyValue<String, String>>() { |
There was a problem hiding this comment.
Why this rewrite instead of using lambda expression?
There was a problem hiding this comment.
This change should not be in the PR. I thought I had reverted all of these changes after our discussion about generics and lambdas. Thank you!
| final String name = builder.newProcessorName(TRANSFORM_NAME); | ||
| return flatTransform(() -> new Transformer<K, V, Iterable<KeyValue<KR, VR>>>() { | ||
|
|
||
| private Transformer<? super K, ? super V, KeyValue<KR, VR>> transformer = transformerSupplier.get(); |
There was a problem hiding this comment.
This is a critical piece -- can we add a test to make sure we create new instances correctly? We had a bug previously with similar wrapping that was done incorrectly. Compare #4435
There was a problem hiding this comment.
Test shouldAlwaysGetNewTransformerFromTransformerSupplier added in KStreamImplTest.
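The bug class referenced here (cf. #4435) can be sketched in plain Java without the Kafka Streams interfaces; the `Transformer` type and both wrapper methods below are hypothetical stand-ins. A correct wrapping supplier must delegate to the inner supplier's `get()` on every call, so each task gets its own transformer instance; a buggy wrapper captures one instance eagerly and shares it.

```java
import java.util.function.Supplier;

// Simplified stand-ins, not the actual Kafka Streams test or interfaces.
public class FreshInstanceDemo {

    static final class Transformer { }

    // Correct: delegates get() to the inner supplier each time,
    // so every call yields a new instance.
    static Supplier<Transformer> correctWrapper(final Supplier<Transformer> inner) {
        return () -> inner.get();
    }

    // Buggy: captures one instance eagerly -- all tasks would share it.
    static Supplier<Transformer> buggyWrapper(final Supplier<Transformer> inner) {
        final Transformer shared = inner.get();
        return () -> shared;
    }

    public static void main(final String[] args) {
        final Supplier<Transformer> good = correctWrapper(Transformer::new);
        System.out.println(good.get() != good.get()); // true: distinct instances

        final Supplier<Transformer> bad = buggyWrapper(Transformer::new);
        System.out.println(bad.get() != bad.get());   // false: same shared instance
    }
}
```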
| } | ||
|
|
||
| @Test | ||
| public void shouldCallInitAndCloseOfSinglePairTransformer() { |
There was a problem hiding this comment.
Not sure what the purpose of this test is? Can you elaborate? It seems to test that KStreamImplMock is calling transformer.init(context); and transformer.close();? But this would test test code -- I guess I am missing something.
There was a problem hiding this comment.
The test verifies that if init(context) and close() are called on the transformer instantiated in KStreamImplMock#flatTransform, then the wrapper around the transformer class in KStreamImpl#transform calls init(context) and close() on the transformer instantiated within the wrapper (i.e. final Transformer<String, String, KeyValue<Integer, Integer>> transformer = mock(Transformer.class) in the test). The code lines under test are this and this.
There was a problem hiding this comment.
Ack.
Same comment as above. (If we split the wrapper code out, it might be good to split this test into two -- one for init() and one for close().)
| } | ||
|
|
||
| @Test | ||
| public void shouldReturnSingletonListOrEmptyListIfSinglePairTransformerReturnsNotNullOrNull() { |
There was a problem hiding this comment.
Can you elaborate on this test? I don't understand what it is supposed to verify?
There was a problem hiding this comment.
This test verifies this method in KStreamImpl#transform that calls KStreamImpl#flatTransform. The code under test is part of the wrapper around the transformer class for KStreamImpl#transform that is needed to call KStreamImpl#flatTransform.
There was a problem hiding this comment.
Ack. Makes sense.
I am wondering, if we should extract the part that builds the wrapper into its own method (package private so we can test it) and simplify the test accordingly?
(Or at least adding a comment might help -- not sure if we can encode the purpose in the test name -- it's too complex IMHO.)
Thoughts?
There was a problem hiding this comment.
I extracted the wrapper code into an adapter class (TransformerSupplierAdapter). It seemed to me the cleaner solution, instead of loosening access to members for unit tests.
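The core idea of such an adapter can be sketched in plain Java; the `Pair` type and `adapt` method below are hypothetical stand-ins for the real Kafka Streams classes. A transformer that returns a single (possibly null) key-value pair is adapted to the flat-transform shape by mapping null to an empty list and a non-null pair to a singleton list.

```java
import java.util.Collections;
import java.util.List;

// Simplified sketch of the idea behind TransformerSupplierAdapter;
// Pair and adapt are stand-ins, not Kafka Streams classes.
public class AdapterSketch {

    static final class Pair<K, V> {
        final K key;
        final V value;

        Pair(final K key, final V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Adapts a single-pair result to a list: null becomes an empty list,
    // a non-null pair becomes a singleton list.
    static <K, V> List<Pair<K, V>> adapt(final Pair<K, V> pair) {
        if (pair != null) {
            return Collections.singletonList(pair);
        }
        return Collections.emptyList();
    }

    public static void main(final String[] args) {
        System.out.println(adapt(new Pair<>("a", 1)).size()); // 1
        System.out.println(adapt(null).size());               // 0
    }
}
```

With this shape, the single-pair transform path can reuse the flat-transform processor unchanged, which is the consolidation discussed in this thread.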
|
@cadonna are you still working on this PR? |
|
@guozhangwang Yes, I do. I just have not had spare time to work on the PR for the last couple of weeks. I will try to reply to @mjsax comments during the weekend. |
- adds test to verify whether the wrapper for the transformer supplier in transform creates a new instance for each call to its get method
- allows null as return value for the transformer used in flatTransform
- minor refactorings in the Java docs
- removes unintended change
- extracts code that adapts a TransformerSupplier for a transform operation to a TransformerSupplier for a flatTransform operation into an adapter class.
- re-writes integration tests for flatTransform and transform to use TopologyTestDriver instead of a broker.
mjsax
left a comment
There was a problem hiding this comment.
Just some nits. Overall LGTM.
Call for a second review: any of @guozhangwang @bbejeck @vvcephei
| context().forward(pair.key, pair.value); | ||
| public void process(final KIn key, final VIn value) { | ||
| final Iterable<KeyValue<KOut, VOut>> pairs = transformer.transform(key, value); | ||
| if (pairs != null) { |
There was a problem hiding this comment.
The JavaDocs say the return value must not be null. With this check, this is actually not true and null will be interpreted as empty/zero output records.
I am wondering if we should keep the code as-is and update the JavaDoc, or if we would want to be strict and remove this check and let the for-loop throw an NPE?
\cc @guozhangwang @bbejeck @vvcephei WDYT?
There was a problem hiding this comment.
You mean the javadoc of Transformer? It says:
@return new {@link KeyValue} pair—if {@code null} no key-value pair will
* be forwarded to down stream
So I think this logic here is fine?
There was a problem hiding this comment.
Do you refer to the flatTransform JavaDocs? There it says
* Method {@link Transformer#transform(Object, Object) transform()} must return an {@link java.util.Iterable} (e.g.,
* any {@link java.util.Collection} type) and the return value must not be {@code null}.
Apparently, I copied this sentence from the flatMap JavaDocs. I would remove this sentence from the flatTransform JavaDocs because I guess that the only reason to not allow nulls in a map operation is to distinguish it from a filter operation. This is not needed for a transform operation. Is this correct?
There was a problem hiding this comment.
Ack.
Please update the JavaDocs accordingly.
For flatMap(), as it can return zero records (i.e., an empty collection) too, we could also allow returning null and interpret it as zero. I don't see a big difference here -- however, for a functional-style API it might be ok to be more restrictive and not allow null.
There was a problem hiding this comment.
No strong opinion, but it seems like we might want to stick close to the behavior of flatMap.
- If we're permissive with the return value of `flatTransform`, but not with `flatMap`, then it might be a confusing asymmetry with `flatMap`.
- Also, if we want to change the return behavior of `flatMap`, don't we need a KIP?
Together, it would suggest that we should not permit null returns from flatTransform.
There was a problem hiding this comment.
While I agree that the asymmetry between flatTransform and flatMap may be confusing, I would also like to note that, we have already a similar asymmetry between transform and map. The transformer for transform has to return null to represent the empty result. Consequently, we would also have an asymmetry between transform and flatTransform if we forbade null as a representation of the empty output for flatTransform.
Do we want to have an asymmetry between flatTransform and transform or between flatMap and flatTransform?
One option could be to go for the former -- i.e., forbidding null output for flatTransform -- and to create a JIRA issue to change the transformer for transform to return java.util.Optional<KeyValue<KOut, VOut>> instead of KeyValue<KOut, VOut> to restore symmetry regarding null output values. This solution would break backward compatibility, though.
There was a problem hiding this comment.
One thing to consider: map() is a 1:1 operation while flatMap() is a 1:[0,...,n] operation. Also transform() in existing API is a 1:[0,...,n] operations. Thus, the grouping and comparison of map/transform and flatMap/flatTransform cannot really be applied.
Note that map() and flatMap() are DSL operators while transform() is PAPI integration. Also note, that flatMap() has a different return type compared to transform(), that makes it possible to require non-null output for flatMap() but not for transform() in old Java7 API.
We introduce flatTransform() as new 1:[0,...,n] transformation now and also have Java8. However, as @cadonna mentioned, we cannot simply change the return type for transform() to Optional though. Also, for PAPI integration I think it's not required to apply the same rules as for DSL operators and to disallow null necessarily.
Thus the base question seems to be: should we change the "contract" of `transform()` from 1:[0,...,n] to a 1:1 operator because `flatTransform()` is the preferred operation for 1:[0,...,n] now? I think the answer is _no_, because `transform()` still allows emitting records via `context.forward()`.
Thus, I would keep map(), flatMap(), and transform() as-is, and I would allow null as a return value for flatTransform(), because it's "just" a type-safe alternative to transform() anyway.
Just my 2 cents.
There was a problem hiding this comment.
That sounds fine to me @mjsax . I think we can possibly find a better way to unify all this stuff, but it would require some structural changes in the processor context.
I'm happy with permitting null return in flatTransform.
There was a problem hiding this comment.
I removed the above sentence from the javadocs of flatTransform.
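The null-handling agreed on in this thread can be sketched in plain Java; the `process` method below is a hypothetical stand-in for the `KStreamFlatTransform` processor, with a result list standing in for `context().forward(...)`. If the transformer returns null, nothing is forwarded rather than throwing an NPE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the flat-transform processor's process() method.
public class NullSkipDemo {

    static List<String> process(final Iterable<String> pairs) {
        final List<String> forwarded = new ArrayList<>();
        if (pairs != null) {          // null is interpreted as zero output records
            for (final String pair : pairs) {
                forwarded.add(pair);  // stands in for context().forward(key, value)
            }
        }
        return forwarded;
    }

    public static void main(final String[] args) {
        System.out.println(process(Arrays.asList("a", "b")).size()); // 2
        System.out.println(process(null).size());                    // 0
    }
}
```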
| } | ||
|
|
||
| @Test | ||
| public void shouldPropagateRepartitionFlagAfterGlobalKTableJoin() { |
There was a problem hiding this comment.
I have no idea. I re-added the test.
| public void before() throws InterruptedException { | ||
| builder = new StreamsBuilder(); | ||
| final StoreBuilder<KeyValueStore<Integer, Integer>> keyValueStoreBuilder = | ||
| Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("myTransformState"), |
There was a problem hiding this comment.
nit: extract "myTransformState" into a variable that is used below to get the state from the context.
| } | ||
|
|
||
| @Test | ||
| public void shouldAllowNullAsResultOfTransform() { |
There was a problem hiding this comment.
We should add a corresponding test for flatTransform in the test above, too. (Positive or negative test, depending if we want to allow null or not -- cf. my other comment.)
There was a problem hiding this comment.
I do not understand what you are referring to. This test tests the KStreamFlatTransformProcessor. Both calls to transform and calls to flatTransform run this code. The transformer returns a list of output records for both calls. The list can be null or it can be empty. The null list is tested in shouldAllowNullAsResultOfTransform and the empty list is tested in shouldAllowEmptyListAsResultOfTransform. At this point, there is no distinction between transform and flatTransform.
There was a problem hiding this comment.
@cadonna Thanks again for the PR. I made another pass on it, and besides some very minor comments I have a meta comment about the adapter class solution (cc @vvcephei), concerning whether it may have perf penalties: for each single value returned, we are effectively calling singletonList(value) followed by a foreach -> forward, and it is on the critical path.
Plus, it makes the code less readable (this one is a subjective call of course). Any thoughts about this?
| public class KStreamFlatTransform<KIn, VIn, KOut, VOut> implements ProcessorSupplier<KIn, VIn> { | ||
|
|
||
| private final TransformerSupplier<? super K, ? super V, ? extends KeyValue<? extends K1, ? extends V1>> transformerSupplier; | ||
| private final TransformerSupplier<? super KIn, ? super VIn, Iterable<KeyValue<KOut, VOut>>> transformerSupplier; |
There was a problem hiding this comment.
There was a problem hiding this comment.
Possibly, it might be worth a Jira to create tests demonstrating the desired source code. Then, not only do we prove it works (or discover it doesn't), we also have a regression test in case we change the generic bounds in the future.
There was a problem hiding this comment.
I was able to compile the following code:

```java
stream
    .flatMap((Integer key, Integer value) -> Arrays.asList(
        KeyValue.pair(key, value),
        KeyValue.pair(key, value),
        KeyValue.pair(key, value)))
    .foreach(action);
```
I guess in the flatTransform case it is the TransformerSupplier which makes the code not compile.
I agree with @vvcephei that a Jira to create tests demonstrating the desired source code might be beneficial.
There was a problem hiding this comment.
Created ticket KAFKA-8035.
Should I label it as a newbie issue?
| context().forward(pair.key, pair.value); | ||
| public void process(final KIn key, final VIn value) { | ||
| final Iterable<KeyValue<KOut, VOut>> pairs = transformer.transform(key, value); | ||
| if (pairs != null) { |
There was a problem hiding this comment.
You mean the javadoc of Transformer? It says:
@return new {@link KeyValue} pair—if {@code null} no key-value pair will
* be forwarded to down stream
So I think this logic here is fine?
| if (pair != null) { | ||
| return Arrays.asList(pair); | ||
| } | ||
| return Arrays.asList(); |
There was a problem hiding this comment.
nit: Collections.emptyList()?
| public <K1, V1> KStream<K1, V1> transform(final TransformerSupplier<? super K, ? super V, KeyValue<K1, V1>> transformerSupplier, | ||
| final String... stateStoreNames) { | ||
| Objects.requireNonNull(transformerSupplier, "transformerSupplier can't be null"); | ||
| private <K1, V1> KStream<K1, V1> doTransform(final ProcessorSupplier processorSupplierForGraph, |
| public Iterable<KeyValue<KOut, VOut>> transform(final KIn key, final VIn value) { | ||
| final KeyValue<KOut, VOut> pair = transformer.transform(key, value); | ||
| if (pair != null) { | ||
| return Arrays.asList(pair); |
There was a problem hiding this comment.
nit: Collections.singletonList.
There was a problem hiding this comment.
@guozhangwang , you had a performance concern about this class.
It definitely pads the execution, but my gut says that it won't be too bad, and the complexity savings of consolidating the code paths probably make up for it.
Collections.singletonList doesn't create a full-blown list, but just wraps the argument in an instance of SingletonList, so it's very light to allocate and very light to collect. Collections.emptyList just returns a reference to a static singleton instance of EmptyList, so there's no overhead to it.
Also, the iteration should be low/no cost as well in hot code, as the jit compiler should unroll the loop.
This is all speculative, but something that @cadonna could evaluate with a microbenchmark if desired.
Anyway, like I said, I'm not worried about it in this case.
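The allocation claims above are easy to demonstrate with a small plain-Java snippet: `Collections.singletonList` wraps its argument in a tiny immutable list, and `Collections.emptyList` always hands out the same shared static instance, so neither adds meaningful allocation on the hot path.

```java
import java.util.Collections;
import java.util.List;

// Demonstrates the low cost of singletonList/emptyList discussed above.
public class LightListDemo {

    public static void main(final String[] args) {
        // A minimal one-element immutable wrapper, cheap to allocate/collect:
        final List<String> one = Collections.singletonList("record");
        System.out.println(one.size()); // 1

        // emptyList() returns one shared instance -- no allocation at all:
        final Object empty1 = Collections.emptyList();
        final Object empty2 = Collections.emptyList();
        System.out.println(empty1 == empty2); // true
    }
}
```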
There was a problem hiding this comment.
oh. I do recommend taking both your suggestions to use singletonList/emptyList in this class, and actually in general as well.
There was a problem hiding this comment.
I replaced Arrays.asList with Collections.singletonList.
There was a problem hiding this comment.
Thanks! (FYI, if you use IDEA, there's a code inspection you can enable to prompt you to make this change.)
There was a problem hiding this comment.
Cool, I think I'm convinced. If others like TransformerSupplierAdapter for complexity savings as well we can merge this as-is.
There was a problem hiding this comment.
LGTM. @mjsax please make a final pass and merge if you have no further comments.
- re-added test that was accidentally removed - added note about use of context.forward to emit multiple output records - replaced Arrays.asList with Collections.singletonList and Collections.emptyList - minor javadoc refactorings - minor integration test refactoring
- removed requirement for not null return value from javadocs of flatTransform
| * This is a stateful record-by-record operation (cf. {@link #map(KeyValueMapper)}). | ||
| * Furthermore, via {@link org.apache.kafka.streams.processor.Punctuator#punctuate(long)}, the processing progress | ||
| * can be observed and additional periodic actions can be performed. | ||
| * |
There was a problem hiding this comment.
Already changed by somebody else.
| * To trigger periodic actions via {@link org.apache.kafka.streams.processor.Punctuator#punctuate(long) punctuate()}, | ||
| * a schedule must be registered. | ||
| * The {@link Transformer} must return a {@link KeyValue} type in {@link Transformer#transform(Object, Object) | ||
| * transform()} and {@link org.apache.kafka.streams.processor.Punctuator#punctuate(long) punctuate()}. |
There was a problem hiding this comment.
I just realized that this seems to be wrong. punctuate() does not return anything. This might be a problem throughout all JavaDocs -- could you verify, @cadonna? I would suggest doing a separate PR to address this, though, so we can merge this PR without additional delay.
| * Transforming records might result in an internal data redistribution if a key based operator (like an aggregation | ||
| * or join) is applied to the result {@code KStream}. | ||
| * (cf. {@link #transformValues(ValueTransformerSupplier, String...)}) | ||
| * </p> |
There was a problem hiding this comment.
Nit: JavaDoc markup is not HTML... Thus, there is no closing tag. Remove the `</p>` and the next empty line.
There was a problem hiding this comment.
Already changed by somebody else.
| * | ||
| * <p> | ||
| * Note that it is possible to emit multiple records for each input record by using | ||
| * {@link ProcessorContext#forward(Object, Object) context#forward()} in {@link Transformer#transform(K, V)}. |
There was a problem hiding this comment.
nit: "in transform() and punctuate()"
| * To ensure type-safety at compile-time, it is recommended to use | ||
| * {@link #flatTransform(TransformerSupplier, String...)} if multiple records need to be emitted for each input | ||
| * record. | ||
| * </p> |
There was a problem hiding this comment.
Already changed by somebody else.
| * transformation). | ||
| * Furthermore, via {@link org.apache.kafka.streams.processor.Punctuator#punctuate(long)} the processing progress | ||
| * can be observed and additional periodic actions can be performed. | ||
| * |
There was a problem hiding this comment.
In transform() we have a sentence:
The {@link Transformer} must return a {@link KeyValue} type in {@link Transformer#transform(Object, Object) transform()}
Might be good to add this here, too, and point out that it might be an Iterable, plus an example. Compare flatMap():
* The provided {@link KeyValueMapper} must return an {@link Iterable} (e.g., any {@link java.util.Collection} type)
* and the return value must not be {@code null}.
There was a problem hiding this comment.
Empty line was already removed by somebody else.
| * void init(ProcessorContext context) { | ||
| * this.context = context; | ||
| * this.state = context.getStateStore("myTransformState"); | ||
| * // punctuate each 1000ms; can access this.state |
There was a problem hiding this comment.
nit: punctuate each second (might be a c&p error and affect other places, too) -- maybe better to address in a separate PR (the void punctuate() issue and this can be one PR IMHO).
| * Transforming records might result in an internal data redistribution if a key based operator (like an aggregation | ||
| * or join) is applied to the result {@code KStream}. | ||
| * (cf. {@link #transformValues(ValueTransformerSupplier, String...)}) | ||
| * |
There was a problem hiding this comment.
Should we add a similar paragraph about using context.forward() as in transform() ?
| public class KStreamFlatTransform<KIn, VIn, KOut, VOut> implements ProcessorSupplier<KIn, VIn> { | ||
|
|
||
| private final TransformerSupplier<? super K, ? super V, ? extends KeyValue<? extends K1, ? extends V1>> transformerSupplier; | ||
| private final TransformerSupplier<? super KIn, ? super VIn, Iterable<KeyValue<KOut, VOut>>> transformerSupplier; |
|
Retest this please. |
|
Create follow up Jira to track second part of KIP-313: adding |
|
Thanks @cadonna for this PR! |
This commit is a leftover of pull request apache#5273.
This commit is a follow-up of pull request #5273 Reviewers: Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>
* ak/trunk: MINOR: Update usage of deprecated API (apache#6146) KAFKA-4217: Add KStream.flatTransform (apache#5273) MINOR: Update Gradle to 5.1.1 (apache#6160) KAFKA-3522: Generalize Segments (apache#6170) Added quotes around the class path (apache#4469) KAFKA-7837: Ensure offline partitions are picked up as soon as possible when shrinking ISR (apache#6202) MINOR: In the MetadataResponse schema, ignorable should be a boolean KAFKA-7838: Log leader and follower end offsets when shrinking ISR (apache#6168) KAFKA-5692: Change PreferredReplicaLeaderElectionCommand to use Admin… (apache#3848) MINOR: clarify why suppress can sometimes drop tombstones (apache#6195) MINOR: Upgrade ducktape to 0.7.5 (apache#6197) MINOR: Improve IntegrationTestUtils documentation (apache#5664) MINOR: upgrade to jdk8 8u202 KAFKA-7693; Fix SequenceNumber overflow in producer (apache#5989) KAFKA-7692; Fix ProducerStateManager SequenceNumber overflow (apache#5990) MINOR: update copyright year in the NOTICE file. (apache#6196) KAFKA-7793: Improve the Trogdor command line. (apache#6133)
Reviewers: Matthias J. Sax <matthias@confluent.io>, John Roesler <john@confluent.io>, Guozhang Wang <guozhang@confluent.io>
This commit is a follow-up of pull request apache#5273 Reviewers: Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>
This contribution is my original work and I license the work to the project under the project's open source license.