
Conversation

@rangadi commented Feb 19, 2016

Current status (3/18/16): well tested and feature complete. Going through reviews.

This is a preliminary PR. It is not tested yet. I am working on an example application and unit tests. Some TODOs:

  • unit tests
  • testing in the cluster with larger volumes of data and a larger number of partitions
  • add more stats (I need to look into stats support in the SDK)

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.

@rangadi (Author) commented Feb 19, 2016

I signed the CLA.

@googlebot

CLAs look good, thanks!

@dhalperi changed the title from "Kafka custom source" to "[BEAM-52] Kafka custom source" on Feb 25, 2016
@rangadi (Author) commented Feb 25, 2016

The functionality is well tested. Will update the Javadoc today.

@dhalperi mentioned this pull request on Mar 23, 2016
This certainly looks much better.
@rangadi (Author) commented Mar 31, 2016

@dhalperi Updated the Read interface as we discussed. This needed changes to the tests as well.
TODO: good Javadoc for KafkaIO as you suggested. Will update later today.

```java
 * <h3>Reading from Cloud Bigtable</h3>
 *
 * <p>The Bigtable source returns a set of rows from a single table, returning a
 * {@code PCollection&lt;Row&gt;}.
```
Contributor:

please revert

Author:

Done. Eclipse didn't render this properly.

@dhalperi (Contributor) commented Apr 5, 2016

R: @dpmills @mshields822

Daniel already took a look at this a month back; can you re-skim it?

Mark has been implementing PubSub. Mark, can you take a look at the Kafka source and see what lessons you can impart?

```java
import static com.google.common.base.Preconditions.checkNotNull;
import static com.google.common.base.Preconditions.checkState;

import com.google.api.client.repackaged.com.google.common.annotations.VisibleForTesting;
```
Contributor:

No need for repackaged.

Author:

Oops! Thanks for catching.
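For reference, a sketch of the fix: use Guava's own annotation rather than the copy repackaged inside google-api-client.

```java
// Import Guava's VisibleForTesting directly instead of the repackaged copy.
import com.google.common.annotations.VisibleForTesting;
```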

@mshields822 (Contributor)

LGTM
No overlap with the Pub/Sub implementation:

  • Kafka: due to strict partitioning, we cannot use additional source splits to naively hide fetch latency, so a background thread is needed. Pub/Sub: no background threads needed.
  • Kafka: since it is seekable (modulo caveats), the backlog can be estimated (sketched below). Pub/Sub: not seekable, and Pub/Sub doesn't publish any backlog estimate (at least publicly), so the backlog is just what has been received but not yet read by advance in the reader.
  • Kafka: imposes a key/value structure on records. Pub/Sub: uninterpreted bytes.
  • Kafka: order-preserving, so assuming (implied) timestamps are monotonic, the watermark is just the current record's timestamp. Pub/Sub: not order-preserving and no assumption of timestamp monotonicity, so the watermark must be estimated.
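A minimal sketch of the backlog estimate mentioned above (not this PR's code; the per-partition offset arrays are illustrative): for a seekable source like Kafka, the backlog is the latest available offset minus the last consumed offset, summed over partitions.

```java
// Illustrative sketch: estimate backlog for a seekable, partitioned source.
// latestOffsets/consumedOffsets are hypothetical per-partition snapshots; -1 means unknown.
class BacklogSketch {
  static long estimateBacklog(long[] latestOffsets, long[] consumedOffsets) {
    long backlog = 0;
    for (int i = 0; i < latestOffsets.length; i++) {
      if (latestOffsets[i] >= 0 && consumedOffsets[i] >= 0) {
        backlog += Math.max(0, latestOffsets[i] - consumedOffsets[i]);
      }
    }
    return backlog;
  }
}
```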

@rangadi (Author) commented Apr 5, 2016

Thanks @mshields822.
One thing I would like to expand on is that Kafka wouldn't be much different w.r.t. timestamps and watermarks. The order is only preserved within a partition, and we don't guarantee synchronous progress across multiple partitions.
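A minimal sketch of one way to handle this (not this PR's code; names are illustrative): treat each partition's last record timestamp as that partition's watermark, and take the minimum across partitions, since partitions progress independently.

```java
import java.util.List;

import org.joda.time.Instant;

// Illustrative sketch: the reader-wide watermark is the minimum of the
// per-partition watermarks (each partition's last record timestamp).
class KafkaWatermarkSketch {
  static Instant watermarkOf(List<Instant> lastRecordTimestampPerPartition) {
    Instant min = null;
    for (Instant ts : lastRecordTimestampPerPartition) {
      if (min == null || ts.isBefore(min)) {
        min = ts;
      }
    }
    // Before any record has been read, report the epoch as a safe lower bound.
    return (min == null) ? new Instant(0L) : min;
  }
}
```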

@dhalperi (Contributor) commented Apr 5, 2016

Looks like you need to push? I'm going to stop commenting on outstanding feedback :)

@rangadi (Author) commented Apr 5, 2016

Sorry, forgot to push before heading to lunch. Addressed all the comments except the last one. I will double-check any remaining comments.

```java
}

ConsumerRecord<byte[], byte[]> rawRecord = pState.recordIter.next();
long consumed = pState.consumedOffset;
```
Contributor:

Is there a bound on the number of records that we'll have to skip? We can only log a few thousand per second, so if it's millions then logging each record will induce large delays in starting up.

An alternative would be to set a flag and only log once.

Author:

It would be fairly small (tens to hundreds). The upper limit is the number of Kafka records compressed together. The 0.10.x KafkaConsumer already fixes this issue with compressed messages and skips them itself, so it is not expected at all in the near future.

One case where it can cause millions of records is the one I mention above, where Kafka is restarted from scratch (which resets the offsets) while the Dataflow app is running. Not sure whether we want to handle that case. It is better for the user to notice this problem (it should be very rare) and take appropriate action (maybe just restarting the Dataflow app).

That said, I can certainly add a flag to limit it to one log message (sketched below).
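A minimal sketch of such a flag (class and method names are illustrative, not taken from this PR):

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch: warn only once when records are skipped, so that
// skipping millions of records does not stall reader startup on logging.
class SkippedRecordLogger {
  private static final Logger LOG = LoggerFactory.getLogger(SkippedRecordLogger.class);
  private final AtomicBoolean warned = new AtomicBoolean(false);

  void onSkippedRecord(long skippedOffset, long expectedOffset) {
    if (warned.compareAndSet(false, true)) {
      LOG.warn("Skipping records before expected offset {}; first skipped offset was {}.",
          expectedOffset, skippedOffset);
    }
  }
}
```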

@dhalperi mentioned this pull request on Apr 7, 2016
@dhalperi (Contributor) commented Apr 7, 2016

Closing in favor of apache/beam#142

@dhalperi closed this on Apr 7, 2016