Skip to content

Conversation

@mshields822
Copy link
Contributor

First step towards supporting pub/sub i/o in any Java runner.

Disclaimers:

  • No integration tests yet.
  • No unit tests for the source and sink. Propose we support mocking PubsubGrpcClient.
  • Depends on grpc-pubsub-v1 which is about to be renamed.
  • Only supports 'application default' credentials, and ignores any GcsOptions flags.
  • Not yet wired into the 'deault' PubsubIO implementations.
  • Watermark tracking is heuristic and may introduce late data.

But other than that we're ready to go.

@mshields822
Copy link
Contributor Author

R=Ken
@kennknowles

@mshields822
Copy link
Contributor Author

Note this has been extensively stress tested by running within a custom Google Dataflow runner.

@dhalperi
Copy link
Contributor

R= @kennknowles

(Copying from #85 (comment), so my scripts work)

@mshields822
Copy link
Contributor Author

Made PubsubGrpcClient public and refactored a bit so that it API is free of gRPC and protoc dependencies. This means it can be reused, and also will make mocking easier.

@mshields822
Copy link
Contributor Author

Looking at the ci failure the grpc-pubsub.jar (containing all the pubsub proto-derived classes) is jdk8 only. Advice?

@mshields822
Copy link
Contributor Author

R: @davorbonaci
for help with ci failure above

@mshields822
Copy link
Contributor Author

The two pubsub/grpc jars this depends on have been compiled with jdk8 and thus break our jdk7 support requirement. I'm taking this up with the Google team responsible. We can review this but we can't merge until that is sorted out.


/**
* Keep track of the minimum/maximum/sum of a set of timestamped long values.
* For efficiency, bucket values by their timestamp.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be super clear - the min/max/sum are meant as examples here, right? It seems this is a generic binning/bucketing map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are the three functions I needed for watermark tracking and implemented in SimpleFunction. I'm trying to keep the scope as small as possible.

@kennknowles
Copy link
Member

This is a very large PR. I am going to need to take another pass to continue to grok.

@kennknowles
Copy link
Member

One thing you could do that would lighten the cognitive load would be to separate the source and sink into two PRs.

@mshields822
Copy link
Contributor Author

PTAL (really need to squash history now)

  • Avoid SimpleFunction, move the bucking/moving functions stuff into util
  • Split off GrpcClient iface as you suggest.

So now we could proceed as
GrpcClient and friends
PubsubUnboundedSink
PubsubUnboundedSource and friends

@mshields822
Copy link
Contributor Author

Ok, I'll save this one for just the source, and will send in pub/sub client and sink as sep prs.

@kennknowles
Copy link
Member

Noting that the first PR peeled off from this is #120.

iemejia referenced this pull request in iemejia/beam Jan 12, 2018
iemejia referenced this pull request in iemejia/beam Jan 12, 2018
mareksimunek pushed a commit to mareksimunek/beam that referenced this pull request May 9, 2018
mareksimunek pushed a commit to mareksimunek/beam that referenced this pull request May 9, 2018
apache#85 Move findbugs plugin execution to the process-classes phase
dmvk pushed a commit to dmvk/beam that referenced this pull request May 15, 2018
mareksimunek pushed a commit to seznam/beam that referenced this pull request Jul 9, 2018
dmvk pushed a commit to seznam/beam that referenced this pull request Aug 17, 2018
dmvk pushed a commit to seznam/beam that referenced this pull request Oct 5, 2018
hengfengli pushed a commit to hengfengli/beam that referenced this pull request Mar 21, 2022
alnzng pushed a commit to alnzng/beam that referenced this pull request Aug 25, 2023
Improve the CombinedByKey translator in the DataStream-based Flink batch runner.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants