[BEAM-2852] Nexmark Kafka source sink #3937
Conversation
still experiment sourceFromKafka
Thanks for your help @vectorijk! I'll take a look as soon as I'm done with my current PR.
iemejia
left a comment
Awesome job, thanks a lot for working on it, I commented on some small fixes. Can you also please add some documentation on how to use Nexmark with Kafka in the Nexmark README. (We are moving this eventually to the web site but we will do this later).
sdks/java/nexmark/pom.xml
Outdated
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-kafka</artifactId>
<exclusions>
  <exclusion>
I think this exclusion is not needed. Can you remove it if so?
removed it
}));
}

static final ParDo.SingleOutput<Event, byte[]> EVENT_TO_BYTEARRAY =
Make it private; also, maybe make the type DoFn instead of ParDo and apply the ParDo at the use site, to make it more Beam style (and yes, I know Nexmark does not do this in other parts, shame on us :P).
+1
done.
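For readers following along, a minimal sketch of the refactor suggested above: expose a DoFn and apply ParDo.of(...) at the use site. The class name is illustrative, not the PR's actual code, and Event.CODER is assumed to exist as it does elsewhere in Nexmark.

```java
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.CoderUtils;

// Hypothetical DoFn-style version of EVENT_TO_BYTEARRAY: the transform is a
// plain DoFn, and the ParDo wrapping happens where it is applied.
class EventToByteArrayFn extends DoFn<Event, byte[]> {
  @ProcessElement
  public void processElement(ProcessContext c) throws CoderException {
    // Encode each event with its coder (Event.CODER is assumed here).
    c.output(CoderUtils.encodeToByteArray(Event.CODER, c.element()));
  }
}

// Use site (sketch):
// PCollection<byte[]> bytes = events.apply(ParDo.of(new EventToByteArrayFn()));
```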
}

static final ParDo.SingleOutput<KV<Long, byte[]>, Event> BYTEARRAY_TO_EVENT =
Same as with the other ParDo (private + Fn).
done
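As background on the EVENT_TO_BYTEARRAY / BYTEARRAY_TO_EVENT pair discussed above: the requirement is a lossless round trip (Event to byte[] and back). The PR presumably uses Beam coders for this; the sketch below uses plain Java serialization only as a self-contained analogue, with a String standing in for a Nexmark Event.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java analogue of the encode/decode round trip; illustrative only.
public class RoundTripSketch {
  // Serialize an object to a byte array.
  static byte[] toBytes(Serializable obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(obj);
    }
    return bos.toByteArray();
  }

  // Deserialize an object back from a byte array.
  static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return ois.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    String event = "fake-event-payload";        // stand-in for a Nexmark Event
    byte[] encoded = toBytes(event);
    Object decoded = fromBytes(encoded);
    System.out.println(event.equals(decoded));  // true: lossless round trip
  }
}
```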
 * Send {@code events} to Kafka.
 */
private void sinkEventsToKafka(PCollection<Event> events) {
  if (Strings.isNullOrEmpty(options.getBootstrapServers())) {
Change this to Preconditions.checkArgument.
It is actually unneeded because this check is already done in KafkaIO.Write.validate. You can remove it.
removed it.
/**
 * Send {@code formattedResults} to Kafka.
 */
private void sinkResultsToKafka(PCollection<String> formattedResults) {
Change this to Preconditions.checkArgument.
addressed
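For readers unfamiliar with the suggestion: Guava's Preconditions.checkArgument throws IllegalArgumentException when its condition is false, replacing the manual if/throw block. A minimal, self-contained stand-in (illustrative only, not the PR's code, and the option name is a stand-in for the Nexmark option):

```java
// Minimal stand-in for Guava's Preconditions.checkArgument, for illustration.
public class CheckArgumentSketch {
  // Throws IllegalArgumentException with the given message when the condition is false.
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  public static void main(String[] args) {
    String bootstrapServers = null; // hypothetical missing pipeline option
    try {
      checkArgument(bootstrapServers != null && !bootstrapServers.isEmpty(),
          "Missing --bootstrapServers");
      System.out.println("option ok");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage()); // prints "rejected: Missing --bootstrapServers"
    }
  }
}
```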
/**
 * Send {@code events} to Kafka.
 */
private void sinkEventsToKafka(PCollection<Event> events) {
This is currently unused; can you add something similar to Pubsub's PUBLISH_ONLY, so users can generate and record the events in Kafka?
IMHO, it might also be useful to inject events into the benchmark using Kafka, similar to what is done in Pub/Sub COMBINED mode.
retest this please
echauchot
left a comment
@vectorijk, thanks for your work!
Besides, I have tested your PR by launching query 1 in batch mode on the direct runner, with a Kafka sink to a local Kafka instance, through this command:

mvn -Pdirect-runner exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main "-Dexec.args=--sinkType=KAFKA --bootstrapServers=127.0.0.1:9092 --runner=DirectRunner --query=1 --streaming=false --manageResources=false --monitorJobs=true --enforceEncodability=true --enforceImmutability=true"
And I got this error:
GRAVE: Uncaught error in kafka producer I/O thread:
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'topic_metadata': Error reading array of size 1702391137, only 37 bytes available
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:73)
at org.apache.kafka.clients.NetworkClient.parseResponse(NetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:449)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:236)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:135)
at java.lang.Thread.run(Thread.java:745)
Maybe related to the fact that you use a PCollection<String> formattedResults and apply to it a KafkaIO.<Long, String> write with a keySerializer, whereas there is no key in the PCollection.
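If the key/serializer mismatch is indeed the cause, one way KafkaIO handles a keyless PCollection is the Write.values() variant, which writes only record values. A sketch under assumed broker and topic names, not the PR's actual code:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

class SinkResultsSketch {
  // Hypothetical variant of sinkResultsToKafka: values() writes only the
  // collection's elements as record values, so no key is ever serialized.
  static void sinkResultsToKafka(PCollection<String> formattedResults) {
    formattedResults.apply(
        KafkaIO.<Long, String>write()
            .withBootstrapServers("127.0.0.1:9092") // assumed local broker
            .withTopic("nexmark_results")           // hypothetical topic name
            .withKeySerializer(LongSerializer.class)
            .withValueSerializer(StringSerializer.class)
            .values()); // write values only; keys are ignored
  }
}
```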
vectorijk
left a comment
Thanks @echauchot @iemejia for the review! I will address comments asap.
@vectorijk Do you need any help?

ping?
I'll update this today and tomorrow.
@echauchot I tried it; it works for me locally and I can't reproduce the error. @iemejia could you test this on your side?
@vectorijk thanks for the fixes! I still have 2 comments:
echauchot
left a comment
Waiting for the status of the error, to see if it is related to my setup, and also on the COMBINED mode.
@echauchot I will take a look at COMBINED mode and update it soon.

@vectorijk any updates?

How is this going? This seems nice to have. It is probably a good time to rebase and restart review, since Nexmark has changed a little and the codebase is Java 8.

@kennknowles I will change and rebase this.

@vectorijk I'm waiting for the status of the test on @iemejia's setup, so that we could see if the

I checked out this PR and rebased it against master (there were some minor conflicts, easy to resolve). The output result:

@aromanenko-dev thanks for the feedback! So the issue that I reported was due to my setup then. I'll rebase and merge.

There was a point in my review about the COMBINED mode, but we could support it later in another PR.

@vectorijk seems to be unavailable, so I propose to do another PR for minor fixes: squash (with obviously the same author), rebase on master, and wire up the unused
Add support for Kafka as source/sink on Nexmark