
Conversation

@rangadi commented Feb 19, 2016

Current status (3/18/16): well tested and feature complete. Going through reviews.

This is a preliminary PR. It is not tested yet. I am working on an example application and unit tests. Some TODOs:

  • unit tests
  • testing in the cluster with larger volumes of data and a larger number of partitions
  • add more stats (I need to look into stats support in the SDK)

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.

@rangadi (Author) commented Feb 19, 2016

I signed the CLA.

@googlebot

CLAs look good, thanks!

@dhalperi changed the title from "Kafka custom source" to "[BEAM-52] Kafka custom source" on Feb 25, 2016
@rangadi (Author) commented Feb 25, 2016

The functionality is well tested. Will update the Javadoc today.

@dhalperi mentioned this pull request on Mar 23, 2016
This certainly looks much better.
@rangadi (Author) commented Mar 31, 2016

@dhalperi Updated the Read interface as we discussed. This needed changes to the tests as well.
TODO: good Javadoc for KafkaIO as you suggested. Will update later today.

```java
 * <h3>Reading from Cloud Bigtable</h3>
 *
 * <p>The Bigtable source returns a set of rows from a single table, returning a
 * {@code PCollection&lt;Row&gt;}.
```
Contributor:

please revert

Author:

Done. Eclipse didn't render this properly.

@dhalperi (Contributor) commented Apr 5, 2016

R: @dpmills @mshields822

Daniel already took a look at this a month back; can you re-skim it?

Mark has been implementing PubSub. Mark, can you take a look at the Kafka source and see what lessons you can impart?

```java
import static com.google.common.base.Preconditions.checkNotNull;
import static com.google.common.base.Preconditions.checkState;

import com.google.api.client.repackaged.com.google.common.annotations.VisibleForTesting;
```
Contributor:

No need for repackaged.

Author:

Oops! Thanks for catching.
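For reference, a sketch of the fix: use Guava's own annotation rather than the copy repackaged inside google-api-client.

```java
// Import Guava's VisibleForTesting directly instead of the repackaged copy.
import com.google.common.annotations.VisibleForTesting;
```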

@mshields822 (Contributor)

LGTM
No overlap with the Pub/Sub implementation:

  • Kafka: due to strict partitioning, we cannot use additional source splits to naively hide fetch latency, so a background thread is needed. Pub/Sub: no background threads needed.
  • Kafka: since it is seekable (modulo caveats), the backlog can be estimated (sketched below). Pub/Sub: not seekable, and Pub/Sub doesn't publish any backlog estimate (at least publicly), so the backlog is just what has been received but not yet read by advance in the reader.
  • Kafka: imposes a key/value structure on records. Pub/Sub: uninterpreted bytes.
  • Kafka: order-preserving, so assuming (implied) timestamps are monotonic, the watermark is just the current record's timestamp. Pub/Sub: not order-preserving and no assumption of timestamp monotonicity, so the watermark must be estimated.
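A minimal sketch of the backlog estimate mentioned above (not this PR's code; the per-partition offset arrays are illustrative): for a seekable source like Kafka, the backlog is the latest available offset minus the last consumed offset, summed over partitions.

```java
// Illustrative sketch: estimate backlog for a seekable, partitioned source.
// latestOffsets/consumedOffsets are hypothetical per-partition snapshots; -1 means unknown.
class BacklogSketch {
  static long estimateBacklog(long[] latestOffsets, long[] consumedOffsets) {
    long backlog = 0;
    for (int i = 0; i < latestOffsets.length; i++) {
      if (latestOffsets[i] >= 0 && consumedOffsets[i] >= 0) {
        backlog += Math.max(0, latestOffsets[i] - consumedOffsets[i]);
      }
    }
    return backlog;
  }
}
```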

@rangadi (Author) commented Apr 5, 2016

Thanks @mshields822.
One thing I would like to expand on is that Kafka wouldn't be much different w.r.t. timestamps and watermarks. The order is only preserved within a partition, and we don't guarantee synchronous progress across multiple partitions.
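A minimal sketch of one way to handle this (not this PR's code; names are illustrative): treat each partition's last record timestamp as that partition's watermark, and take the minimum across partitions, since partitions progress independently.

```java
import java.util.List;

import org.joda.time.Instant;

// Illustrative sketch: the reader-wide watermark is the minimum of the
// per-partition watermarks (each partition's last record timestamp).
class KafkaWatermarkSketch {
  static Instant watermarkOf(List<Instant> lastRecordTimestampPerPartition) {
    Instant min = null;
    for (Instant ts : lastRecordTimestampPerPartition) {
      if (min == null || ts.isBefore(min)) {
        min = ts;
      }
    }
    // Before any record has been read, report the epoch as a safe lower bound.
    return (min == null) ? new Instant(0L) : min;
  }
}
```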

@dhalperi (Contributor) commented Apr 5, 2016

Looks like you need to push? I'm going to stop commenting on outstanding feedback :)

@rangadi (Author) commented Apr 5, 2016

Sorry, forgot to push before heading to lunch. Addressed all the comments except the last one. I will double-check any remaining comments.

```java
}

ConsumerRecord<byte[], byte[]> rawRecord = pState.recordIter.next();
long consumed = pState.consumedOffset;
```
Contributor:

Is there a bound on the number of records that we'll have to skip? We can only log a few thousand per second, so if it's millions then logging each record will induce large delays in starting up.

An alternative would be to set a flag and only log once.

Author:

It would be fairly small (tens to hundreds). The upper limit is the number of Kafka records compressed together. The 0.10.x KafkaConsumer already fixes this issue with compressed messages and skips them itself, so it is not expected at all in the near future.

One case where it can cause millions of records is the one I mention above, where Kafka is restarted from scratch (which resets the offsets) while the Dataflow app is running. Not sure whether we want to handle that case. It is better for the user to notice this problem (it should be very rare) and take appropriate action (maybe just restarting the Dataflow app).

That said, I can certainly add a flag to limit it to one log message (sketched below).
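A minimal sketch of such a flag (class and method names are illustrative, not taken from this PR):

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch: warn only once when records are skipped, so that
// skipping millions of records does not stall reader startup on logging.
class SkippedRecordLogger {
  private static final Logger LOG = LoggerFactory.getLogger(SkippedRecordLogger.class);
  private final AtomicBoolean warned = new AtomicBoolean(false);

  void onSkippedRecord(long skippedOffset, long expectedOffset) {
    if (warned.compareAndSet(false, true)) {
      LOG.warn("Skipping records before expected offset {}; first skipped offset was {}.",
          expectedOffset, skippedOffset);
    }
  }
}
```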

@dhalperi mentioned this pull request on Apr 7, 2016
@dhalperi (Contributor) commented Apr 7, 2016

Closing in favor of apache/beam#142

@dhalperi closed this on Apr 7, 2016